Saturday, April 16, 2016

Plotting simple charts with GNUPlot and Scala

Does Scala have a frontend library for plotting? Yes: ScalaPlot.

This article demonstrates how to plot a simple chart.

1. Dependencies.

1.1 Install gnuplot.
>brew install pdflib-lite gnuplot

1.2 Download the ScalaPlot jar (scalaplot-0.1.jar) from here.

2. A simple scatter chart.

2.1 sbt config.
>libraryDependencies += "org.sameersingh.scalaplot" % "scalaplot" % "0.1"

2.2 SimpleApp code.
import org.sameersingh.scalaplot.Implicits._

object SimpleApp extends App {
  val x = 0.0 until 2.0 * math.Pi by 0.1
  // Plot sin(x) and cos(x) and write the chart to /tmp/test.png.
  output(PNG("/tmp/", "test"), xyChart(x ->(math.sin(_), math.cos(_))))
}

2.3 run.
scala -classpath "target/scala-2.11/simple-project_2.11-1.0.jar:scalaplot-0.1.jar" SimpleApp

Then you will get the resulting test.png:

Thursday, April 7, 2016

Fraud Detection at User Signup

For a social website, how can we predict fraudulent users when they sign up? And how can we decide, based on historical data, which features are most critical for fraud detection?

The demo below sets up a lightweight fraud-detection solution with Spark ML.

In the sample, we use the user's demographic features:
age, ethnic, vids, ips, emails, caption_len, bodytype, profile_initially_seeking, is_fraud
[notes]: vids = number of times the signup VID has been seen, ips = number of times the signup IP has been seen, emails = number of times the signup email has been seen, caption_len = length of the profile caption.

We want to exclude highly correlated features, as well as features weakly related to is_fraud. In Spark, the code is as follows:
import org.apache.spark.SparkContext
import org.apache.spark.sql.Row
import org.apache.spark.mllib.linalg._
import org.apache.spark.mllib.stat.Statistics

// df => RDD[Vector]
val rdd = df.map{case Row(...) => Vectors.dense(...)}
val correlMatrix: Matrix = Statistics.corr(rdd, "pearson")

The result:

[correlation matrix image]
From the matrix above, ethnic is only weakly related to is_fraud, so we will exclude the ethnic feature when modeling.

Then we use the Random Forest algorithm to train on the training set and predict on the testing set.
import org.apache.spark.ml.Pipeline
import org.apache.spark.ml.classification.RandomForestClassifier
import org.apache.spark.ml.feature.{IndexToString, StringIndexer, VectorIndexer}
import org.apache.spark.mllib.regression.LabeledPoint

// Build a (label, features) DataFrame; ethnic is matched but left out of the feature vector.
val data = df.map{
      case Row(pnum: Int, age: Int, ethnic: Int, vids: Int, ips: Int, emails:Int,  caption_len: Int, bodytype:Int, profile_initially_seeking:Int, is_fraud: Int) =>
        LabeledPoint(is_fraud.toDouble, Vectors.dense(age.toDouble,  vids.toDouble, ips.toDouble, emails.toDouble, caption_len.toDouble, bodytype.toDouble, profile_initially_seeking.toDouble))
    }.toDF()

    val labelIndexer = new StringIndexer()
      .setInputCol("label")
      .setOutputCol("indexedLabel")
      .fit(data)

    val featureIndexer = new VectorIndexer()
      .setInputCol("features")
      .setOutputCol("indexedFeatures")
      .setMaxCategories(32)
      .fit(data)

    val Array(trainingData, testData) = data.randomSplit(Array(0.8, 0.2))

    val rf = new RandomForestClassifier()
      .setLabelCol("indexedLabel")
      .setFeaturesCol("indexedFeatures")
      .setNumTrees(10)

    val labelConverter = new IndexToString()
      .setInputCol("prediction")
      .setOutputCol("predictedLabel")
      .setLabels(labelIndexer.labels)

    val pipeline = new Pipeline().setStages(Array(labelIndexer, featureIndexer, rf, labelConverter))
    val model = pipeline.fit(trainingData)
    val predictions = model.transform(testData)

    predictions.select("predictedLabel", "label", "features").show(5)

After prediction on the testing set, we compared the predicted label against the actual label:
true positive rate: 0.88
true negative rate: 0.766
accuracy: 0.834
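
For reference, here is a minimal sketch (not from the original post) of how such rates could be computed from the predictions DataFrame; it assumes is_fraud = 1.0 marks a fraudulent user:

// Compute true positive rate, true negative rate and accuracy from the predictions.
import org.apache.spark.sql.functions.col

val scored = predictions.select(col("label"), col("predictedLabel").cast("double").as("pred"))
val positives = scored.filter(col("label") === 1.0)
val negatives = scored.filter(col("label") === 0.0)

val tpr = positives.filter(col("pred") === 1.0).count().toDouble / positives.count()
val tnr = negatives.filter(col("pred") === 0.0).count().toDouble / negatives.count()
val accuracy = scored.filter(col("label") === col("pred")).count().toDouble / scored.count()

println(s"true positive rate: $tpr, true negative rate: $tnr, accuracy: $accuracy")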

After investigating the signup info more deeply, we added three more critical variables, which made the prediction usable in production:
true positive rate: 1
true negative rate: 0.9968
accuracy: 0.9987

Monday, April 4, 2016

Topic analysis of news with LDA (Latent Dirichlet Allocation) in Spark ML

LDA is one of the most popular text-analysis algorithms in machine learning; this demo employs it to do text mining.

1. First, load the news articles from MySQL.
val df = sqlContext.read.format("jdbc").
      option("url", url).
      option("driver", driver).
      option("dbtable", "news").
      option("user", user).
      option("password", pwd).
      load()

2. Tokenize and count terms.
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.Row

val corpus: RDD[String] = df.map { case Row(document: String) => document }

val tokenized: RDD[Seq[String]] =  corpus.map(_.toLowerCase.split("\\s")).map(_.filter(_.length > 2).filter(_.forall(java.lang.Character.isLetter)))

val termCounts: Array[(String, Long)] = tokenized.flatMap(_.map(_ -> 1L)).reduceByKey(_ + _).collect().sortBy(-_._2)

// Drop the top 30% most frequent terms as stopwords (0.3 is an arbitrary threshold).
val numStopwords = (termCounts.length * 0.3).toInt
val vocabArray: Array[String] = termCounts.takeRight(termCounts.length - numStopwords).map(_._1)

val vocab: Map[String, Int] = vocabArray.zipWithIndex.toMap

3. Convert documents into term-count vectors, and fit the model.
import scala.collection.mutable
import org.apache.spark.mllib.clustering.LDA
import org.apache.spark.mllib.linalg.{Vector, Vectors}

val documents: RDD[(Long, Vector)] =
      tokenized.zipWithIndex().map { case (tokens, id) =>
        val counts = new mutable.HashMap[Int, Double]()
        tokens.foreach { term =>
          if (vocab.contains(term)) {
            val idx = vocab(term)
            counts(idx) = counts.getOrElse(idx, 0.0) + 1.0
          }
        }
        (id, Vectors.sparse(vocab.size, counts.toSeq))
      }

    // Set LDA parameters
    val lda = new LDA().setK(10).setMaxIterations(10)

    val ldaModel = lda.run(documents)

    // Print topics, showing top-weighted 3 terms for each topic.
    val topicIndices = ldaModel.describeTopics(maxTermsPerTopic = 3)
    topicIndices.foreach { case (terms, termWeights) =>
      println("TOPIC:")
      terms.zip(termWeights).foreach { case (term, weight) =>
        println(s"${vocabArray(term.toInt)}\t$weight")
      }
      println()
    }

4. Finally, it prints the 3 top-weighted terms for each topic; one example topic:


TOPIC:
offshore 0.009812238949505817
media 0.009100067491152785
soccer 0.007214104190610413

Documents are associated with multiple topics, which are discovered by the approach above. This clustering can help organize or summarize a document collection.

Also, the LDA output can be used as features for other ML algorithms, as sketched below.
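
As a rough sketch of that idea (assuming the default EM optimizer is used, so the trained model can be cast to DistributedLDAModel), the per-document topic distributions can be extracted and reused as feature vectors:

import org.apache.spark.mllib.clustering.DistributedLDAModel

// With the default EM optimizer, lda.run returns a DistributedLDAModel.
val distModel = ldaModel.asInstanceOf[DistributedLDAModel]

// One topic-mixture vector per document: (document id, topic weights).
val docTopics: RDD[(Long, Vector)] = distModel.topicDistributions

docTopics.take(3).foreach { case (docId, topics) =>
  println(s"doc $docId: ${topics.toArray.mkString(", ")}")
}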

Friday, April 1, 2016

Sampling Data - Slovin's Formula

Slovin's formula is used to calculate an appropriate sample size from a population.

In some scenarios we can't go through the whole population, or we don't even know the population size, so sampling is useful for getting the expected result at a certain confidence level.

How do we do the sampling, and what is the confidence level?

Slovin's formula: n = N / (1 + N * e^2), where N is the population size, e is the margin of error (1 - confidence level), and n is the sample size.
For example:

Let's sample a population of 1000 with a 95% confidence level. After calculation, the sample size is 286.

population = 1000;
e = 0.05;
n = 1000 / (1 + 1000 * 0.05 * 0.05) = 285.7 ~= 286
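
The same calculation as a small Scala sketch (the function name is just illustrative):

// Slovin's formula: n = N / (1 + N * e^2).
def slovinSampleSize(population: Long, e: Double): Long =
  math.ceil(population / (1 + population * e * e)).toLong

// Example from above: N = 1000, e = 0.05 (95% confidence) => 286
println(slovinSampleSize(1000, 0.05))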

Wednesday, March 30, 2016

Spark SQL Data Sources API - CSV to Dataframe

Sometimes we need to join data from different sources, such as MySQL and CSV, but we don't want to wait a long time to import a big CSV file into MySQL.

Thanks to Spark, we can read a MySQL table and a CSV file as dataframes and then join them together conveniently.

The following steps explain how to read a CSV file as a dataframe in Spark:

1. Download spark-csv_2.10-1.3.0.jar and commons-csv-1.2.jar from here.

2. Add them to --jars when starting spark-shell:

/opt/bigdata/spark/bin/spark-shell --master spark://master:7077 --jars "/opt/bigdata/spark_extra_libs/spark-csv_2.10-1.3.0.jar,/opt/bigdata/spark_extra_libs/commons-csv-1.2.jar" --driver-memory 2G --executor-memory 6G

3. Code reference:

val df = sqlContext.read
    .format("com.databricks.spark.csv")
    .option("header", "true") // Use first line of all files as header
    .option("inferSchema", "true") // Automatically infer data types
    .load("/tmp/cars.csv")

Predict User First Transaction Using Spark ML

As a website-based business, can we predict whether a user will make their first transaction, and how?