This repository was archived by the owner on Dec 27, 2021. It is now read-only.

Add simple introduction
yu-iskw committed Feb 26, 2015
1 parent bda5726 commit 2ab56b3
Showing 6 changed files with 285 additions and 0 deletions.
26 changes: 26 additions & 0 deletions README.md
@@ -0,0 +1,26 @@
# Spark DataFrames Introduction

This is an introduction to Apache Spark DataFrames.
Apache Spark DataFrames are a very useful data analytics tool for data scientists.
They allow us to analyze data more easily and more quickly.

Reynold Xin, Michael Armbrust, and Davies Liu wrote a nice
[blog article](https://databricks.com/blog/2015/02/17/introducing-dataframes-in-spark-for-large-scale-data-science.html).
Reynold Xin also gave a nice presentation,
[Spark DataFrames for Large-Scale Data Science](https://www.youtube.com/watch?v=Hvke1f10dL0),
at the Bay Area Spark Meetup on Feb 17, 2015.

## Set up Environments for This Introduction

- [Set up Local Machine](./doc/setup-local.md)
- [Set up Spark Cluster on EC2](./doc/setup-cluster.md)

## Spark DataFrames Introduction

[Spark DataFrame Introduction](./doc/dataframe-introduction.md)

## TODO

This introduction currently covers Spark DataFrames only with Scala.
Spark DataFrames can also be used from Python and R (via SparkR).
135 changes: 135 additions & 0 deletions doc/dataframe-introduction.md
@@ -0,0 +1,135 @@
# DataFrames Introduction

## Create DataFrames

You should prepare the downloaded JSON data before using Spark DataFrames (see the setup documents).
You can also create DataFrames from Hive tables, Parquet files, and RDDs.

As you know, you can start a Spark shell by executing `$SPARK_HOME/bin/spark-shell`.

To load JSON files, use `SQLContext.load`.
I also recommend giving each DataFrame an alias at the same time.

### On Local Machine

```
// Create a SQLContext (sc is an existing SparkContext)
val context = new org.apache.spark.sql.SQLContext(sc)
// Create a DataFrame for Github events
var path = "file:///tmp/github-archive-data/*.json.gz"
val event = context.load(path, "json").as('event)
// Create a DataFrame for Github users
path = "file:///tmp/github-archive-data/github-users.json"
val user = context.load(path, "json").as('user)
```

You can show a schema by `printSchema`.

```
event.printSchema
user.printSchema
```
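
Hive tables, Parquet files, and RDDs, mentioned above, can be used as sources too.
The following is only a minimal sketch, assuming Spark 1.3's `load` with the `parquet` source and `toDF()` via `context.implicits._`; the Parquet path and the `Language` case class are hypothetical examples.

```
// Create a DataFrame from a Parquet file (hypothetical path)
val parquetDF = context.load("file:///tmp/github-archive-data/some-table.parquet", "parquet")
// Create a DataFrame from an RDD of case classes via toDF()
import context.implicits._
case class Language(name: String, year: Int)
val languages = sc.parallelize(Seq(Language("Scala", 2004), Language("Python", 1991))).toDF()
languages.printSchema
```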

### On Spark Cluster

```
// Create a SQLContext (sc is an existing SparkContext)
val context = new org.apache.spark.sql.SQLContext(sc)
// Create a DataFrame for Github events
var path = "file:///mnt/github-archive-data/*.json.gz"
val event = context.load(path, "json").as('event)
// Create a DataFrame for Github users
path = "file:///mnt/github-archive-data/github-users.json"
val user = context.load(path, "json").as('user)
```

## Project (i.e. selecting some fields of a DataFrame)

You can select a column with `dataframe("key")` or `dataframe.select("key")`.
If you want to select multiple columns, use `dataframe.select("key1", "key2")`.

```
// Select a column
event("public").limit(2).show()
// Use an alias for a column
event.select('public as 'PUBLIC).limit(2).show()
// Select multiple columns with aliases
event.select('public as 'PUBLIC, 'id as 'ID).limit(2).show()
// Select nested columns with aliases
event.select($"payload.size" as 'size, $"actor.id" as 'actor_id).limit(10).show()
```

## Filter

You can filter the data with `filter()`.

```
// Filter by a condition
user.filter("name is null").select('id, 'name).limit(5).show()
user.filter("name is not null").select('id, 'name).limit(5).show()
// Filter by a combination of two conditions
// These two expressions are equivalent
event.filter("public = true and type = 'ForkEvent'").count
event.filter("public = true").filter("type = 'ForkEvent'").count
```
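
The same filters can also be written with column expressions instead of SQL strings.
This is only a minimal sketch, assuming Spark 1.3's `Column` operators (`isNotNull`, `===`, `&&`):

```
// Filter with column expressions instead of SQL strings
user.filter('name.isNotNull).select('id, 'name).limit(5).show()
event.filter(('public === true) && ('type === "ForkEvent")).count
```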

## Aggregation

```
val countByType = event.groupBy("type").count()
countByType.limit(5).show()
val countByIdAndType = event.groupBy("id", "type").count()
countByIdAndType.limit(10).foreach(println)
```
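
You can also compute several aggregates at once and sort the result.
A minimal sketch, assuming Spark 1.3's `org.apache.spark.sql.functions` (`count`, `max`) and `orderBy`:

```
// Compute several aggregates per event type
import org.apache.spark.sql.functions.{count, max}
event.groupBy("type").agg(count($"id") as 'events, max($"created_at") as 'latest).limit(5).show()
// Order event types by frequency, most frequent first
countByType.orderBy($"count".desc).limit(5).show()
```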

## Join

You can join two data sets with `join`.
The `where` clause allows you to specify the join conditions.

```
// Self join
val x = user.as('x)
val y = user.as('y)
val join = x.join(y).where($"x.id" === $"y.id")
join.select($"x.id", $"y.id").limit(10).show
// Join Pull Request event data with user data
val pr = event.filter('type === "PullRequestEvent").as('pr)
val join = pr.join(user).where($"pr.payload.pull_request.user.id" === $"user.id")
join.select($"pr.type", $"user.name", $"pr.created_at").limit(5).show
```
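
`join` also accepts a join type as a third argument (for example `"left_outer"`).
A minimal sketch, assuming that Spark 1.3 overload, to keep pull request events even when no matching user is found:

```
// Left outer join: keep PullRequestEvents that have no matching user
val leftJoin = pr.join(user, $"pr.payload.pull_request.user.id" === $"user.id", "left_outer")
leftJoin.select($"pr.type", $"user.name").limit(5).show
```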

## UDFs

You can define UDFs (User Defined Functions) with `udf()`.

```
// Define User Defined Functions
val toDate = udf((createdAt: String) => createdAt.substring(0, 10))
val toTime = udf((createdAt: String) => createdAt.substring(11, 19))
// Use the UDFs in select()
event.select(toDate('created_at) as 'date, toTime('created_at) as 'time).limit(5).show
```
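
UDFs can also be registered by name so that they can be called from SQL queries (the table itself is registered in the next section).
A minimal sketch, assuming Spark 1.3's `udf.register` on the `SQLContext`:

```
// Register a UDF by name for use in Spark SQL
context.udf.register("toDate", (createdAt: String) => createdAt.substring(0, 10))
// After event.registerTempTable("event") in the next section, it could be used like:
// context.sql("SELECT toDate(created_at) AS date FROM event").limit(5).show()
```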

## Execute Spark SQL

You can manipulate data not only with the DataFrame API but also with Spark SQL.

```
// Register the DataFrame as a temporary table
event.registerTempTable("event")
// Execute a Spark SQL query
context.sql("SELECT created_at, repo.name AS `repo.name`, actor.id, type FROM event WHERE type = 'PullRequestEvent'").limit(5).show()
```
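
SQL and the DataFrame API can be mixed freely; for instance, the count-by-type aggregation from earlier can also be expressed in SQL.
A minimal sketch against the `event` temporary table registered above:

```
// The same count-by-type aggregation expressed in SQL
context.sql("SELECT type, COUNT(*) AS cnt FROM event GROUP BY type").limit(5).show()
```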
43 changes: 43 additions & 0 deletions doc/setup-cluster.md
@@ -0,0 +1,43 @@
# Set Up Spark Cluster on EC2 and Data Sets

## Check Out Apache Spark from Github

First, clone Spark from GitHub on your local machine.

```
git clone https://github.com/apache/spark.git && cd spark
git checkout -b v1.3.0-rc1 v1.3.0-rc1
```

## Launch a Spark 1.3 Cluster on EC2

The `$SPARK_HOME/ec2/spark-ec2` command allows you to launch Spark clusters on Amazon EC2.
Here, the checked-out directory is referred to as `$SPARK_HOME`.

The commands below launch a Spark cluster in the Tokyo region on Amazon EC2.
You should change the region and zone to match your location.

```
cd $SPARK_HOME
REGION='ap-northeast-1'
ZONE='ap-northeast-1b'
VERSION='1.3.0'
MASTER_INSTANCE_TYPE='r3.large'
SLAVE_INSTANCE_TYPE='r3.8xlarge'
NUM_SLAVES=5
SPOT_PRICE=1.0
SPARK_CLUSTER_NAME="spark-cluster-v${VERSION}-${SLAVE_INSTANCE_TYPE}x${NUM_SLAVES}"
./ec2/spark-ec2 -k ${YOUR_KEY_NAME} -i ${YOUR_KEY} -s $NUM_SLAVES --master-instance-type="$MASTER_INSTANCE_TYPE" --instance-type="$SLAVE_INSTANCE_TYPE" --region="$REGION" --zone="$ZONE" --spot-price=$SPOT_PRICE --spark-version="${VERSION}" --hadoop-major-version=2 launch "$SPARK_CLUSTER_NAME"
```

## Download Data Sets on EC2

Log in via SSH to the master instance created by `$SPARK_HOME/ec2/spark-ec2`,
and then execute the shell script to download the data sets for this introduction.
Of course, you need to check out this introduction from GitHub on the instance first.

```
bash ./src/bash/download-for-cluster.md
```
37 changes: 37 additions & 0 deletions doc/setup-local.md
@@ -0,0 +1,37 @@
# Set up Spark and Data Sets on Your Local Machine

## Download the Data Sets

You can prepare the data sets with the bash script below.
In order to execute the script, you need to install MongoDB so that `bsondump`, which converts BSON format into JSON format, is available.
After the script runs, the data sets are downloaded to `/tmp/github-archive-data`.

```
bash src/bash/download-for-local.md
cd /tmp/github-archive-data
ls -l
2015-01-01-0.json.gz 2015-01-01-17.json.gz 2015-01-01-4.json.gz
2015-01-01-1.json.gz 2015-01-01-18.json.gz 2015-01-01-5.json.gz
2015-01-01-10.json.gz 2015-01-01-19.json.gz 2015-01-01-6.json.gz
2015-01-01-11.json.gz 2015-01-01-2.json.gz 2015-01-01-7.json.gz
2015-01-01-12.json.gz 2015-01-01-20.json.gz 2015-01-01-8.json.gz
2015-01-01-13.json.gz 2015-01-01-21.json.gz 2015-01-01-9.json.gz
2015-01-01-14.json.gz 2015-01-01-22.json.gz dump
2015-01-01-15.json.gz 2015-01-01-23.json.gz github-users.json
2015-01-01-16.json.gz 2015-01-01-3.json.gz
```

## Build Spark on Your Local Machine

You should build Apache Spark on your local machine in order to use it.
Compiling Spark with `sbt` takes a long time the first time,
so I recommend taking a break while it builds.

This documentation is based on Spark 1.3.0 RC1, because Spark 1.3 had not been released yet as of Feb 25, 2015.

```
git clone https://github.com/apache/spark.git && cd spark
git checkout -b v1.3.0-rc1 v1.3.0-rc1
./sbt/sbt clean assembly
```
22 changes: 22 additions & 0 deletions src/bash/download-for-cluster.md
@@ -0,0 +1,22 @@
#!/bin/bash

# Check whether the bsondump command is available
if [ -z "$(which bsondump)" ]; then
  echo "WARN: You don't have the bsondump command. You should install MongoDB."
  exit 1
fi

# Make a directory to store downloaded data
mkdir -p /mnt/github-archive-data/
cd /mnt/github-archive-data/

# Download GitHub Archive data for 2015-01-01 through 2015-01-30
# https://www.githubarchive.org/
wget http://data.githubarchive.org/2015-01-{01..30}-{0..23}.json.gz

# Download GitHub user data dumped at 2015-01-29 (from GHTorrent)
# and arrange the data as 'github-users.json'
wget http://ghtorrent.org/downloads/users-dump.2015-01-29.tar.gz
tar zxvf users-dump.2015-01-29.tar.gz
# Replace ObjectId(...) with null; ObjectId is MongoDB-specific and not valid JSON.
bsondump dump/github/users.bson | sed -e "s/ObjectId([^)]*)/null/" > github-users.json
22 changes: 22 additions & 0 deletions src/bash/download-for-local.md
@@ -0,0 +1,22 @@
#!/bin/bash

# Check whether the bsondump command is available
if [ -z "$(which bsondump)" ]; then
  echo "WARN: You don't have the bsondump command. You should install MongoDB."
  exit 1
fi

# Make a directory to store downloaded data
mkdir -p /tmp/github-archive-data/
cd /tmp/github-archive-data/

# Download GitHub Archive data for 2015-01-01
# https://www.githubarchive.org/
wget http://data.githubarchive.org/2015-01-01-{0..23}.json.gz

# Download GitHub user data dumped at 2015-01-29 (from GHTorrent)
# and arrange the data as 'github-users.json'
wget http://ghtorrent.org/downloads/users-dump.2015-01-29.tar.gz
tar zxvf users-dump.2015-01-29.tar.gz
# Replace ObjectId(...) with null; ObjectId is MongoDB-specific and not valid JSON.
bsondump dump/github/users.bson | sed -e "s/ObjectId([^)]*)/null/" > github-users.json
