Thursday, April 7, 2016

Fraud Detection at User Signup

For a social website, how can we predict which users are fraudulent at signup? And how can we decide, from historical data, which attributes are most strongly related to fraud?

The demo below sets up a lightweight fraud detection solution with Spark ML.

The sample uses the following signup attributes for each user:
age, ethnic, vids, ips, emails, caption_len, bodytype, profile_initially_seeking, is_fraud
[notes]: vids = number of times the signup vid was repeated, ips = number of times the signup IP was repeated, emails = number of times the signup email was repeated, caption_len = length of the profile caption.
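The post does not show how the repeat counts are derived. As a minimal sketch, assuming each count is simply the number of signups in the data set that share the same value, they could be computed like this in plain Scala (the `Signup` record, its field names, and `RepeatCounts` are hypothetical and only for illustration):

```scala
// Hypothetical signup record; the field names are illustrative only.
case class Signup(userId: Int, vid: String, ip: String, email: String)

object RepeatCounts {
  // For each signup, return (userId, vids, ips, emails), where each count is
  // the number of signups in the batch sharing that vid, IP, or email.
  def repeatFeatures(signups: Seq[Signup]): Seq[(Int, Int, Int, Int)] = {
    val byVid   = signups.groupBy(_.vid).map { case (k, v) => k -> v.size }
    val byIp    = signups.groupBy(_.ip).map { case (k, v) => k -> v.size }
    val byEmail = signups.groupBy(_.email).map { case (k, v) => k -> v.size }
    signups.map(s => (s.userId, byVid(s.vid), byIp(s.ip), byEmail(s.email)))
  }
}
```

On a large data set the same grouping would more likely be done with a DataFrame `groupBy` and a join, but the idea is the same.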

We want to exclude attributes that are highly correlated with each other, as well as attributes that are only weakly related to is_fraud. In Spark, the correlation matrix can be computed as follows:
import org.apache.spark.SparkContext
import org.apache.spark.mllib.linalg._
import org.apache.spark.mllib.stat.Statistics
import org.apache.spark.sql.Row

// df => RDD[Vector]
val rdd = df.map{case Row(...) => Vectors.dense(...)}
// Pearson correlation between every pair of columns
val correlMatrix: Matrix = Statistics.corr(rdd, "pearson")

The result:

[Figure: correlation matrix of the signup attributes and is_fraud]

From the matrix above, ethnic is only weakly related to is_fraud, so we exclude it from the model.
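The selection rule described above (drop attributes weakly related to is_fraud, and flag pairs of attributes that are highly correlated with each other) can be sketched in plain Scala. This is only an illustration under assumptions: the thresholds `minLabelCorr` and `maxPairCorr`, the object name, and the helper names are hypothetical, and the post does not state which cutoffs were actually used.

```scala
object CorrelationFilter {
  // Pearson correlation coefficient of two equally long samples.
  def pearson(x: Seq[Double], y: Seq[Double]): Double = {
    require(x.length == y.length && x.nonEmpty, "samples must be non-empty and equally long")
    val n  = x.length
    val mx = x.sum / n
    val my = y.sum / n
    val cov = x.zip(y).map { case (a, b) => (a - mx) * (b - my) }.sum
    val sx  = math.sqrt(x.map(a => (a - mx) * (a - mx)).sum)
    val sy  = math.sqrt(y.map(b => (b - my) * (b - my)).sum)
    cov / (sx * sy)
  }

  // Names of features whose absolute correlation with the label is at least
  // minLabelCorr (an assumed, illustrative threshold).
  def relevantFeatures(features: Map[String, Seq[Double]],
                       label: Seq[Double],
                       minLabelCorr: Double): Set[String] =
    features.collect {
      case (name, xs) if math.abs(pearson(xs, label)) >= minLabelCorr => name
    }.toSet

  // Pairs of features whose absolute mutual correlation exceeds maxPairCorr
  // (also an assumed threshold); one feature of each pair is a candidate to drop.
  def redundantPairs(features: Map[String, Seq[Double]],
                     maxPairCorr: Double): Set[(String, String)] = {
    val names = features.keys.toSeq.sorted
    (for {
      i <- names.indices
      j <- (i + 1) until names.length
      if math.abs(pearson(features(names(i)), features(names(j)))) > maxPairCorr
    } yield (names(i), names(j))).toSet
  }
}
```

With Spark, the same decisions would be read off the `Matrix` returned by `Statistics.corr` rather than recomputed by hand; the sketch only makes the rule explicit.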

Then we use the Random Forest algorithm to train a model on the training set and predict on the test set.
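// Spark 1.x-style imports (ml pipeline API with mllib vectors)
import org.apache.spark.ml.Pipeline
import org.apache.spark.ml.classification.RandomForestClassifier
import org.apache.spark.ml.feature.{IndexToString, StringIndexer, VectorIndexer}
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.regression.LabeledPoint
import org.apache.spark.sql.Row
// Needed for .toDF(); assumes a SQLContext named sqlContext is in scope, as in spark-shell
import sqlContext.implicits._

// Build (label, features) rows; ethnic is left out of the feature vector
val data = df.map{
  case Row(pnum: Int, age: Int, ethnic: Int, vids: Int, ips: Int, emails:Int,  caption_len: Int, bodytype:Int, profile_initially_seeking:Int, is_fraud: Int) =>
    LabeledPoint(is_fraud.toDouble, Vectors.dense(age.toDouble,  vids.toDouble, ips.toDouble, emails.toDouble, caption_len.toDouble, bodytype.toDouble, profile_initially_seeking.toDouble))
}.toDF()

// Index the label column
val labelIndexer = new StringIndexer()
  .setInputCol("label")
  .setOutputCol("indexedLabel")
  .fit(data)

// Treat features with at most 32 distinct values as categorical
val featureIndexer = new VectorIndexer()
  .setInputCol("features")
  .setOutputCol("indexedFeatures")
  .setMaxCategories(32)
  .fit(data)

// 80% training, 20% testing
val Array(trainingData, testData) = data.randomSplit(Array(0.8, 0.2))

val rf = new RandomForestClassifier()
  .setLabelCol("indexedLabel")
  .setFeaturesCol("indexedFeatures")
  .setNumTrees(10)

// Convert indexed predictions back to the original labels
val labelConverter = new IndexToString()
  .setInputCol("prediction")
  .setOutputCol("predictedLabel")
  .setLabels(labelIndexer.labels)

val pipeline = new Pipeline().setStages(Array(labelIndexer, featureIndexer, rf, labelConverter))
val model = pipeline.fit(trainingData)
val predictions = model.transform(testData)

predictions.select("predictedLabel", "label", "features").show(5)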

After predicting on the test set, we compared the predicted label with the actual label:
true positive rate: 0.88
true negative rate: 0.766
accuracy: 0.834

After investigating the signup information more deeply, we added three more critical variables, which made the prediction usable in production:
true positive rate: 1
true negative rate: 0.9968
accuracy: 0.9987
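
The post does not show how these figures were computed. A minimal sketch, assuming "true positive" means the fraction of fraud rows predicted as fraud and "true negative" the fraction of non-fraud rows predicted as non-fraud, could look like this in plain Scala (the `BinaryMetrics` object and `Rates` class are hypothetical names):

```scala
object BinaryMetrics {
  case class Rates(truePositiveRate: Double, trueNegativeRate: Double, accuracy: Double)

  // pairs holds (predicted, actual) labels, with 1.0 = fraud and 0.0 = not fraud.
  def rates(pairs: Seq[(Double, Double)]): Rates = {
    val tp = pairs.count { case (p, a) => p == 1.0 && a == 1.0 }
    val fn = pairs.count { case (p, a) => p == 0.0 && a == 1.0 }
    val tn = pairs.count { case (p, a) => p == 0.0 && a == 0.0 }
    val fp = pairs.count { case (p, a) => p == 1.0 && a == 0.0 }
    Rates(
      truePositiveRate = tp.toDouble / (tp + fn), // NaN if there are no fraud rows
      trueNegativeRate = tn.toDouble / (tn + fp), // NaN if there are no non-fraud rows
      accuracy         = (tp + tn).toDouble / pairs.size
    )
  }
}
```

Reporting the two rates separately matters here because fraud is usually the rare class, so accuracy alone can look good even when most fraud is missed.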
