Showing 6 changed files with 285 additions and 0 deletions.

@@ -0,0 +1,26 @@
# Spark DataFrames Introduction

This is an introduction to Apache Spark DataFrames.
Apache Spark DataFrames are a very useful data analytics tool for data scientists.
They let us analyze data more easily and more quickly.

Reynold Xin, Michael Armbrust and Davies Liu wrote
[a great blog article](https://databricks.com/blog/2015/02/17/introducing-dataframes-in-spark-for-large-scale-data-science.html).
And Reynold Xin gave a nice presentation,
[Spark DataFrames for Large-Scale Data Science](https://www.youtube.com/watch?v=Hvke1f10dL0),
at the Bay Area Spark Meetup on February 17, 2015.

## Set up Environments for This Introduction

- [Set up Local Machine](./doc/setup-local.md)
- [Set up Spark Cluster on EC2](./doc/setup-cluster.md)

## Spark DataFrames Introduction

[Spark DataFrame Introduction](./doc/dataframe-introduction.md)

## TODO

This introduction covers Spark DataFrames with Scala only.
We can also use Spark DataFrames from Python and R (via SparkR).

@@ -0,0 +1,135 @@
# DataFrames Introduction

## Create DataFrames

You should prepare the downloaded JSON data before using Spark DataFrames.
You can also create DataFrames from Hive tables, Parquet files and RDDs; a hedged sketch of the latter two follows the local-machine example below.

As you know, you can start a Spark shell by executing `$SPARK_HOME/bin/spark-shell`.

To work with JSON files, use `SQLContext.load`.
I also recommend giving each DataFrame an alias at the same time.

### On Local Machine

```
// Create a SQLContext (sc is an existing SparkContext)
val context = new org.apache.spark.sql.SQLContext(sc)
// Create a DataFrame for Github events
var path = "file:///tmp/github-archive-data/*.json.gz"
val event = context.load(path, "json").as('event)
// Create a DataFrame for Github users
path = "file:///tmp/github-archive-data/github-users.json"
val user = context.load(path, "json").as('user)
```

You can print a DataFrame's schema with `printSchema`.

```
event.printSchema
user.printSchema
```

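As mentioned above, you can also create DataFrames from Parquet files or from existing RDDs. The following is a minimal sketch, not taken from this repository: the Parquet path and the `Repo` case class are made up for illustration, and `context` and `sc` are the same objects as in the snippet above.

```
// Load a Parquet file as a DataFrame (hypothetical path, for illustration only)
val parquetEvents = context.load("file:///tmp/github-archive-data/events.parquet", "parquet")
parquetEvents.printSchema

// Turn an RDD of case classes into a DataFrame via the SQLContext implicits
import context.implicits._
case class Repo(id: Long, name: String)
val repos = sc.parallelize(Seq(Repo(1L, "apache/spark"), Repo(2L, "apache/hadoop"))).toDF()
repos.printSchema
```
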
### On Spark Cluster

```
// Create a SQLContext (sc is an existing SparkContext)
val context = new org.apache.spark.sql.SQLContext(sc)
// Create a DataFrame for Github events
var path = "file:///mnt/github-archive-data/*.json.gz"
val event = context.load(path, "json").as('event)
// Create a DataFrame for Github users
path = "file:///mnt/github-archive-data/github-users.json"
val user = context.load(path, "json").as('user)
```

## Project (i.e. selecting some fields of a DataFrame)

You can refer to a column with `dataframe("key")` and select columns with `dataframe.select("key")`.
If you want to select multiple columns, use `dataframe.select("key1", "key2")`.

```
// Select a column
event.select("public").limit(2).show()
// Use an alias for a column
event.select('public as 'PUBLIC).limit(2).show()
// Select multiple columns with aliases
event.select('public as 'PUBLIC, 'id as 'ID).limit(2).show()
// Select nested columns with aliases
event.select($"payload.size" as 'size, $"actor.id" as 'actor_id).limit(10).show()
```

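If you prefer SQL-style expressions, `selectExpr` accepts them as strings. A hedged sketch, not part of the original examples, equivalent to the nested-column selection above:

```
// selectExpr takes SQL expression strings instead of Column objects
event.selectExpr("payload.size as size", "actor.id as actor_id").limit(10).show()
```
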
## Filter

You can filter the data with `filter()`.

```
// Filter by a condition
user.filter("name is null").select('id, 'name).limit(5).show()
user.filter("name is not null").select('id, 'name).limit(5).show()
// Filter by a combination of two conditions
// These two expressions are equivalent
event.filter("public = true and type = 'ForkEvent'").count
event.filter("public = true").filter("type = 'ForkEvent'").count
```

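Besides SQL-style condition strings, `filter()` also accepts column expressions. A hedged sketch of the same ForkEvent count, assuming the `$` and `===` syntax is in scope as in the earlier examples:

```
// The same count, written with column expressions instead of a condition string
event.filter($"public" === true && $"type" === "ForkEvent").count
```
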
## Aggregation

You can aggregate rows with `groupBy` followed by `count()`.

```
val countByType = event.groupBy("type").count()
countByType.limit(5).show()
val countByIdAndType = event.groupBy("id", "type").count()
countByIdAndType.limit(10).foreach(println)
```

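`groupBy` is not limited to `count()`; `agg()` lets you compute several aggregates at once. A minimal sketch, not from the original examples, assuming `org.apache.spark.sql.functions._` is imported for `count` and `max`:

```
import org.apache.spark.sql.functions._
// Compute several aggregates per event type
event.groupBy("type").agg(count($"id") as 'events, max($"created_at") as 'latest).limit(5).show()
```
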
## Join

You can join two data sets with `join`.
And `where` allows you to specify the condition to join them on.

```
// Self join
val x = user.as('x)
val y = user.as('y)
val join = x.join(y).where($"x.id" === $"y.id")
join.select($"x.id", $"y.id").limit(10).show
// Join Pull Request event data with user data
val pr = event.filter('type === "PullRequestEvent").as('pr)
val prJoin = pr.join(user).where($"pr.payload.pull_request.user.id" === $"user.id")
prJoin.select($"pr.type", $"user.name", $"pr.created_at").limit(5).show
```

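`join` can also take the join condition directly, which is a bit more compact than a separate `where`. A hedged sketch, reusing the `x` and `y` DataFrames from the self-join example above:

```
// Equivalent self join, passing the condition straight to join()
val joined = x.join(y, $"x.id" === $"y.id")
joined.select($"x.id", $"y.id").limit(10).show
```
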
## UDFs

You can define UDFs (user-defined functions) with `udf()`.

```
// udf() lives in org.apache.spark.sql.functions
import org.apache.spark.sql.functions._
// Define user-defined functions
val toDate = udf((createdAt: String) => createdAt.substring(0, 10))
val toTime = udf((createdAt: String) => createdAt.substring(11, 19))
// Use the UDFs in select()
event.select(toDate('created_at) as 'date, toTime('created_at) as 'time).limit(5).show
```

## Execute Spark SQL

You can manipulate data not only with the DataFrame API but also with Spark SQL.

```
// Register a temporary table for the schema
event.registerTempTable("event")
// Execute a Spark SQL query
context.sql("SELECT created_at, repo.name AS `repo.name`, actor.id, type FROM event WHERE type = 'PullRequestEvent'").limit(5).show()
```
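
You can also make a UDF available inside SQL statements by registering it on the context. A hedged sketch, reusing the date logic from the UDF section; the registered name `to_date_str` is made up for illustration:

```
// Register a UDF for use in SQL statements
context.udf.register("to_date_str", (createdAt: String) => createdAt.substring(0, 10))
context.sql("SELECT to_date_str(created_at) AS created_date, type FROM event LIMIT 5").show()
```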

@@ -0,0 +1,43 @@
# Set Up Spark Cluster on EC2 and Data Sets

## Check Out Apache Spark from Github

First, check out Spark from Github on your local machine.

```
git clone https://github.com/apache/spark.git && cd spark
# v1.3.0-rc1 is a tag; create a local branch from it
git checkout -b v1.3.0-rc1 v1.3.0-rc1
```

## Launch a Spark Cluster ver. 1.3 on EC2

The `$SPARK_HOME/ec2/spark-ec2` command allows you to launch Spark clusters on Amazon EC2.
Here, the checked-out directory is referred to as `$SPARK_HOME`.

The commands below launch a Spark cluster in the Tokyo region on Amazon EC2.
You should change the region to match your location.

```
cd $SPARK_HOME
REGION='ap-northeast-1'
ZONE='ap-northeast-1b'
VERSION='1.3.0'
MASTER_INSTANCE_TYPE='r3.large'
SLAVE_INSTANCE_TYPE='r3.8xlarge'
NUM_SLAVES=5
SPOT_PRICE=1.0
SPARK_CLUSTER_NAME="spark-cluster-v${VERSION}-${SLAVE_INSTANCE_TYPE}x${NUM_SLAVES}"
./ec2/spark-ec2 -k ${YOUR_KEY_NAME} -i ${YOUR_KEY} -s $NUM_SLAVES --master-instance-type="$MASTER_INSTANCE_TYPE" --instance-type="$SLAVE_INSTANCE_TYPE" --region="$REGION" --zone="$ZONE" --spot-price=$SPOT_PRICE --spark-version="${VERSION}" --hadoop-major-version=2 launch "$SPARK_CLUSTER_NAME"
```

## Download Data Sets on EC2

Log in via ssh to the master instance created by `$SPARK_HOME/ec2/spark-ec2`.
Then execute the shell script to download the data sets for this introduction.
Of course, you should first check out this introduction from Github on the instance.

```
bash ./src/bash/download-for-cluster.sh
```

@@ -0,0 +1,37 @@
# Set up Spark and Data Sets on Your Local Machine

## Download the Data Sets

You can prepare the data sets with the bash script below.
To run the script, you need to install MongoDB, whose `bsondump` command converts BSON data to JSON.
After the script finishes, the data sets are available in `/tmp/github-archive-data`.

```
bash src/bash/prepare-data-local.sh
cd /tmp/github-archive-data
ls -l
2015-01-01-0.json.gz 2015-01-01-17.json.gz 2015-01-01-4.json.gz
2015-01-01-1.json.gz 2015-01-01-18.json.gz 2015-01-01-5.json.gz
2015-01-01-10.json.gz 2015-01-01-19.json.gz 2015-01-01-6.json.gz
2015-01-01-11.json.gz 2015-01-01-2.json.gz 2015-01-01-7.json.gz
2015-01-01-12.json.gz 2015-01-01-20.json.gz 2015-01-01-8.json.gz
2015-01-01-13.json.gz 2015-01-01-21.json.gz 2015-01-01-9.json.gz
2015-01-01-14.json.gz 2015-01-01-22.json.gz dump
2015-01-01-15.json.gz 2015-01-01-23.json.gz github-users.json
2015-01-01-16.json.gz 2015-01-01-3.json.gz
```

## Build Spark on Your Local Machine

You should build Apache Spark on your local machine to use it.
Compiling Spark with `sbt` takes a long time the first time, so I recommend taking a break while it builds.

This documentation is based on Spark 1.3 RC1, because Spark 1.3 had not been released yet as of February 25, 2015.

```
git clone https://github.com/apache/spark.git && cd spark
# v1.3.0-rc1 is a tag; create a local branch from it
git checkout -b v1.3.0-rc1 v1.3.0-rc1
./sbt/sbt clean assembly
```

@@ -0,0 +1,22 @@
#!/bin/bash

# Check whether the bsondump command is available
if [ -z "$(which bsondump)" ]; then
    echo "WARN: You don't have the bsondump command. You should install MongoDB."
    exit 1
fi

# Make a directory to store the downloaded data
mkdir -p /mnt/github-archive-data/
cd /mnt/github-archive-data/

# Download Github Archive data for 2015-01-01 through 2015-01-30
# https://www.githubarchive.org/
wget http://data.githubarchive.org/2015-01-{01..30}-{0..23}.json.gz

# Download Github user data dumped at 2015-01-29
# and arrange the data as 'github-users.json'
wget http://ghtorrent.org/downloads/users-dump.2015-01-29.tar.gz
tar zxvf users-dump.2015-01-29.tar.gz
# Replace ObjectId with null. ObjectId is used by mongoDB and is not valid JSON.
bsondump dump/github/users.bson | sed -e "s/ObjectId([^)]*)/null/" > github-users.json

@@ -0,0 +1,22 @@
#!/bin/bash

# Check whether the bsondump command is available
if [ -z "$(which bsondump)" ]; then
    echo "WARN: You don't have the bsondump command. You should install MongoDB."
    exit 1
fi

# Make a directory to store the downloaded data
mkdir -p /tmp/github-archive-data/
cd /tmp/github-archive-data/

# Download Github Archive data for 2015-01-01
# https://www.githubarchive.org/
wget http://data.githubarchive.org/2015-01-01-{0..23}.json.gz

# Download Github user data dumped at 2015-01-29
# and arrange the data as 'github-users.json'
wget http://ghtorrent.org/downloads/users-dump.2015-01-29.tar.gz
tar zxvf users-dump.2015-01-29.tar.gz
# Replace ObjectId with null. ObjectId is used by mongoDB and is not valid JSON.
bsondump dump/github/users.bson | sed -e "s/ObjectId([^)]*)/null/" > github-users.json