The demo below sets up a lightweight fraud detection solution with Spark ML.
The sample contains the following demographic and profile fields for each user:
age, ethnic, vids, ips, emails, caption_len, bodytype, profile_initially_seeking, is_fraud
We want to exclude features that are highly correlated with one another, as well as features that are only weakly related to is_fraud. In Spark, the correlation matrix can be computed as follows:
    import org.apache.spark.mllib.linalg.{Matrix, Vectors}
    import org.apache.spark.mllib.stat.Statistics
    import org.apache.spark.sql.Row

    // DataFrame => RDD[Vector]: one dense vector per row
    val rdd = df.rdd.map { case Row(...) => Vectors.dense(...) }

    // Pairwise Pearson correlation between all columns
    val correlMatrix: Matrix = Statistics.corr(rdd, "pearson")
The result: [correlation matrix figure]
The correlation matrix shows that ethnic is only weakly related to is_fraud, so we exclude ethnic from the features used for modeling.
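The selection step can be sketched in plain Scala, without Spark, to make the rule explicit. This is only an illustration: the object name `CorrelationFilter`, the helper `weakFeatures`, and the `minAbsCorr` threshold are assumptions, not part of the original demo, which read the values off the matrix by eye.

```scala
object CorrelationFilter {
  /** Pearson correlation coefficient of two equally long series. */
  def pearson(xs: Seq[Double], ys: Seq[Double]): Double = {
    require(xs.nonEmpty && xs.length == ys.length, "series must be non-empty and equally long")
    val n = xs.length
    val meanX = xs.sum / n
    val meanY = ys.sum / n
    val cov = xs.zip(ys).map { case (x, y) => (x - meanX) * (y - meanY) }.sum
    val varX = xs.map(x => math.pow(x - meanX, 2)).sum
    val varY = ys.map(y => math.pow(y - meanY, 2)).sum
    // A constant series has zero variance, so the result is NaN in that case.
    cov / math.sqrt(varX * varY)
  }

  /**
   * Names of the features whose absolute correlation with the label is
   * below minAbsCorr (a hypothetical cut-off chosen by the caller).
   * NaN correlations never compare as "below", so constant columns are not flagged.
   */
  def weakFeatures(
      features: Map[String, Seq[Double]],
      label: Seq[Double],
      minAbsCorr: Double): Set[String] =
    features.filter { case (_, values) =>
      math.abs(pearson(values, label)) < minAbsCorr
    }.keySet
}
```

Calling `CorrelationFilter.weakFeatures(columns, isFraud, 0.1)` would return the column names to drop, where `columns` maps each feature name to its values and `isFraud` holds the label. The same idea applies to the other half of the rule: a pair of features with a `pearson` value close to 1 or -1 carries redundant information, so one of the two can be dropped.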
Next, we train a Random Forest classifier on the training set and use it to predict the test set.
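    import org.apache.spark.ml.Pipeline
    import org.apache.spark.ml.classification.RandomForestClassifier
    import org.apache.spark.ml.feature.{IndexToString, StringIndexer, VectorIndexer}
    // Spark 1.x imports; on Spark 2.x and later use
    // org.apache.spark.ml.linalg.Vectors and org.apache.spark.ml.feature.LabeledPoint
    import org.apache.spark.mllib.linalg.Vectors
    import org.apache.spark.mllib.regression.LabeledPoint
    import org.apache.spark.sql.Row

    // Build (label, features) rows; ethnic is left out of the feature vector.
    // toDF() needs the SQL implicits in scope (spark-shell imports them automatically).
    val data = df.rdd.map {
      case Row(pnum: Int, age: Int, ethnic: Int, vids: Int, ips: Int, emails: Int,
               caption_len: Int, bodytype: Int, profile_initially_seeking: Int, is_fraud: Int) =>
        LabeledPoint(is_fraud.toDouble,
          Vectors.dense(age.toDouble, vids.toDouble, ips.toDouble, emails.toDouble,
            caption_len.toDouble, bodytype.toDouble, profile_initially_seeking.toDouble))
    }.toDF()

    // Index the labels, fitting on the whole dataset so every label is covered
    val labelIndexer = new StringIndexer()
      .setInputCol("label")
      .setOutputCol("indexedLabel")
      .fit(data)

    // Treat features with at most 32 distinct values as categorical
    val featureIndexer = new VectorIndexer()
      .setInputCol("features")
      .setOutputCol("indexedFeatures")
      .setMaxCategories(32)
      .fit(data)

    // 80% of the rows for training, 20% for testing
    val Array(trainingData, testData) = data.randomSplit(Array(0.8, 0.2))

    val rf = new RandomForestClassifier()
      .setLabelCol("indexedLabel")
      .setFeaturesCol("indexedFeatures")
      .setNumTrees(10)

    // Convert indexed predictions back to the original labels
    val labelConverter = new IndexToString()
      .setInputCol("prediction")
      .setOutputCol("predictedLabel")
      .setLabels(labelIndexer.labels)

    val pipeline = new Pipeline()
      .setStages(Array(labelIndexer, featureIndexer, rf, labelConverter))

    // Train on the training set, then predict the test set
    val model = pipeline.fit(trainingData)
    val predictions = model.transform(testData)
    predictions.select("predictedLabel", "label", "features").show(5)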
After predicting on the test set, we compared the predicted labels with the actual labels:
true positive rate: 0.88
true negative rate: 0.766
accuracy: 0.834
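For reference, the three figures can be derived from (predicted, actual) label pairs as in the minimal sketch below. The post does not show how the numbers were computed, so the object name `BinaryMetrics`, the `Rates` case class, and the convention that 1.0 means fraud and 0.0 means not fraud are assumptions made for illustration.

```scala
object BinaryMetrics {
  final case class Rates(truePositiveRate: Double, trueNegativeRate: Double, accuracy: Double)

  /**
   * pairs holds (predicted, actual) labels, with 1.0 = fraud and 0.0 = not fraud.
   * If the input has no actual positives or no actual negatives,
   * the corresponding rate is NaN.
   */
  def rates(pairs: Seq[(Double, Double)]): Rates = {
    val truePositives = pairs.count { case (predicted, actual) => predicted == 1.0 && actual == 1.0 }
    val trueNegatives = pairs.count { case (predicted, actual) => predicted == 0.0 && actual == 0.0 }
    val actualPositives = pairs.count { case (_, actual) => actual == 1.0 }
    val actualNegatives = pairs.count { case (_, actual) => actual == 0.0 }
    Rates(
      truePositiveRate = truePositives.toDouble / actualPositives,
      trueNegativeRate = trueNegatives.toDouble / actualNegatives,
      accuracy = (truePositives + trueNegatives).toDouble / pairs.length
    )
  }
}
```

With Spark, the pairs would be collected from the prediction output, for example from the predictedLabel and label columns after converting both to Double. Comparing the raw prediction column with label directly is risky, because StringIndexer assigns index 0.0 to the most frequent label rather than to the label "0".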
After investigating the signup information more closely, we added three more critical variables, which made the predictions usable in production:
true positive rate: 1
true negative rate: 0.9968
accuracy: 0.9987
