Wednesday, March 30, 2016

Spark SQL Data Sources API - CSV to Dataframe

Sometimes we need to join data from different sources, such as MySQL and CSV files, but we don't want to wait a long time to import a big CSV file into MySQL.

Thanks to Spark, we can now read a MySQL table and a CSV file as DataFrames, then join them together conveniently.

The following steps explain how to read a CSV file as a DataFrame in Spark:

1. Download spark-csv_2.10-1.3.0.jar and commons-csv-1.2.jar from here.

2. Add them to --jars when starting spark-shell:

/opt/bigdata/spark/bin/spark-shell --master spark://master:7077 --jars "/opt/bigdata/spark_extra_libs/spark-csv_2.10-1.3.0.jar,/opt/bigdata/spark_extra_libs/commons-csv-1.2.jar" --driver-memory 2G --executor-memory 6G

3. Code reference:

val df = sqlContext.read
    .format("com.databricks.spark.csv")
    .option("header", "true") // Use first line of all files as header
    .option("inferSchema", "true") // Automatically infer data types
    .load("/tmp/cars.csv")
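
Once the CSV is loaded as a DataFrame, it can be joined with a MySQL table read over JDBC, as mentioned above. A minimal sketch, where the JDBC URL, credentials, table name, and the shared "id" join column are all assumptions for illustration (the MySQL Connector/J jar would also need to be added to --jars):

```scala
// Read a MySQL table as a DataFrame via the JDBC data source.
// Host, database, table, and credentials below are hypothetical.
val mysqlDf = sqlContext.read
    .format("jdbc")
    .option("url", "jdbc:mysql://master:3306/mydb") // hypothetical host/database
    .option("dbtable", "cars_meta")                 // hypothetical table name
    .option("user", "spark")
    .option("password", "secret")
    .load()

// Join the CSV DataFrame with the MySQL DataFrame,
// assuming both sides have an "id" column.
val joined = df.join(mysqlDf, df("id") === mysqlDf("id"))
joined.show()
```

Both sides are ordinary DataFrames at this point, so any join type (inner, left outer, etc.) and further SQL operations work the same way regardless of the original source.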
