Wednesday, March 30, 2016

Predict User First Transaction Using Spark ML

For a website-based business, can we predict whether a user will make a first transaction, and if so, how?

This article experiments with a data pipeline and one of the popular ML classification algorithms (Random Forest) to solve the problem.

1. Suppose that on a shopping site, users sign up with their date of birth, gender, and city, and then perform a series of actions, such as viewing merchandise, adding favourites, and searching, before their first purchase.
* user's demographics: age, gender, city
* user's behaviour: searches, views, favourites

After ETL, a user's data looks like:
id  age  gender  city  searches  views  favourites
1234536116251983
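Spark ML classifiers expect the predictors in a single vector column, so the ETL output has to be assembled first. The following is a minimal sketch of that step, not the code used for this post. It assumes Spark 2.x or later (for `SparkSession`), that gender and city are already encoded as numbers, and a label column named `label` (1.0 = paid, 0.0 = unpaid); the two rows are made-up values for illustration only.

```scala
import org.apache.spark.ml.feature.VectorAssembler
import org.apache.spark.sql.SparkSession

// Assumption: a local Spark 2.x+ session; on Spark 1.6 an SQLContext would play this role.
val spark = SparkSession.builder().master("local[*]").appName("first-transaction-features").getOrCreate()
import spark.implicits._

// Hypothetical rows in the post-ETL layout; the values are invented for illustration.
// "label" is an assumed column: 1.0 = user made a first purchase, 0.0 = did not.
val users = Seq(
  (1L, 30, 1, 5, 12, 80, 4, 1.0),
  (2L, 24, 0, 7, 3, 40, 0, 0.0)
).toDF("id", "age", "gender", "city", "searches", "views", "favourites", "label")

// Combine the demographic and behaviour columns into one "features" vector column.
val assembler = new VectorAssembler()
  .setInputCols(Array("age", "gender", "city", "searches", "views", "favourites"))
  .setOutputCol("features")

val data = assembler.transform(users).select("id", "label", "features")
data.show(false)
```

The resulting `label` and `features` columns are the names the pipeline in step 3 reads from.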


2. Sample users by whether they have paid or not (the label for binary classification). Sampling:
Total  Paid  Unpaid
20k    8k    12k
* Note: the sample keeps the same paid/unpaid proportion as the full dataset.
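One way to draw a sample that keeps the class proportion is to sample each class with the same fraction. The sketch below uses `DataFrame.stat.sampleBy` for that; it is an illustration under assumptions, not the sampling code used for this post. The dataset, the 40% paid share, the fraction, and the seed are all hypothetical, and `sampleBy` returns approximate rather than exact counts per class.

```scala
import org.apache.spark.sql.SparkSession

// Assumption: a local Spark 2.x+ session.
val spark = SparkSession.builder().master("local[*]").appName("first-transaction-sampling").getOrCreate()
import spark.implicits._

// Hypothetical full dataset: 40% paid (label 1.0) and 60% unpaid (label 0.0).
val full = (1 to 10000)
  .map(i => (i.toLong, if (i % 5 < 2) 1.0 else 0.0))
  .toDF("id", "label")

// Using the same fraction for both classes keeps the paid/unpaid ratio roughly unchanged.
val fraction = 0.2
val sample = full.stat.sampleBy("label", Map(0.0 -> fraction, 1.0 -> fraction), 42L)

// Share of paid users in the sample, for comparison with the full dataset.
val paidShare = sample.filter($"label" === 1.0).count().toDouble / sample.count()
println(f"paid share in sample: $paidShare%.2f")
```

Passing different fractions per class would instead rebalance the classes, which is not what the note above describes.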

val Array(trainingData, testData) = data.randomSplit(Array(0.8, 0.2))


3. Load the dataset as a DataFrame in Spark and split it into training and test sets. Train on the training set, then evaluate on the test set. The data pipeline is as follows:














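import org.apache.spark.ml.Pipeline
import org.apache.spark.ml.classification.RandomForestClassifier
import org.apache.spark.ml.feature.{IndexToString, StringIndexer, VectorIndexer}

// Assumed definitions (not shown in the original post): index the label and feature columns.
val labelIndexer = new StringIndexer().setInputCol("label").setOutputCol("indexedLabel").fit(data)
val featureIndexer = new VectorIndexer().setInputCol("features").setOutputCol("indexedFeatures").fit(data)

// Random forest with 10 trees, trained on the indexed columns.
val rf = new RandomForestClassifier()
  .setLabelCol("indexedLabel")
  .setFeaturesCol("indexedFeatures")
  .setNumTrees(10)

// Map predicted indices back to the original label values.
val labelConverter = new IndexToString()
  .setInputCol("prediction")
  .setOutputCol("predictedLabel")
  .setLabels(labelIndexer.labels)

// Chain the stages, fit on the training set, and predict on the test set.
val pipeline = new Pipeline().setStages(Array(labelIndexer, featureIndexer, rf, labelConverter))
val model = pipeline.fit(trainingData)
val predictions = model.transform(testData)

predictions.select("predictedLabel", "label", "features").show(5)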


4. On the test dataset, about 90% of the predictions were correct.
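Correctness here can be read as accuracy: the share of test rows whose predicted class matches the true class. A minimal sketch of that calculation follows. The four-row `predictions` DataFrame is a hypothetical stand-in for the output of `model.transform(testData)`, so the number it prints says nothing about the result reported above.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.col

// Assumption: a local Spark 2.x+ session.
val spark = SparkSession.builder().master("local[*]").appName("first-transaction-accuracy").getOrCreate()
import spark.implicits._

// Hypothetical stand-in for the pipeline output: (true class index, predicted class index).
val predictions = Seq(
  (1.0, 1.0),
  (0.0, 0.0),
  (1.0, 0.0),
  (0.0, 0.0)
).toDF("indexedLabel", "prediction")

// Accuracy = rows where the prediction equals the label, divided by all rows.
val correct = predictions.filter(col("indexedLabel") === col("prediction")).count()
val accuracy = correct.toDouble / predictions.count()
println(s"accuracy = $accuracy")
```

Spark's `MulticlassClassificationEvaluator` offers comparable metrics, though the available metric names differ between Spark versions.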


reference: interpret random forest, spark random forest
