This repository was archived by the owner on Dec 27, 2021. It is now read-only.

Add simple introduction
yu-iskw committed Feb 26, 2015
1 parent bda5726 commit 2ab56b3
Showing 6 changed files with 285 additions and 0 deletions.
26 changes: 26 additions & 0 deletions README.md
@@ -0,0 +1,26 @@
# Spark DataFrames Introduction

This is an introduction to Apache Spark DataFrames.
Apache Spark DataFrames are a very useful data analytics tool for data scientists.
They allow us to analyze data more easily and more quickly.

Reynold Xin, Michael Armbrust, and Davies Liu wrote a nice
[blog article](https://databricks.com/blog/2015/02/17/introducing-dataframes-in-spark-for-large-scale-data-science.html).
Reynold Xin also gave a nice presentation,
[Spark DataFrames for Large-Scale Data Science](https://www.youtube.com/watch?v=Hvke1f10dL0),
at the Bay Area Spark Meetup on Feb 17, 2015.

## Set up Environments for This Introduction

- [Set up Local Machine](./doc/setup-local.md)
- [Set up Spark Cluster on EC2](./doc/setup-cluster.md)

## Spark DataFrames Introduction

[Spark DataFrame Introduction](./doc/dataframe-introduction.md)

## TODO

This introduction currently covers Spark DataFrames only with Scala.
Spark DataFrames can also be used from Python and R (via SparkR).
135 changes: 135 additions & 0 deletions doc/dataframe-introduction.md
@@ -0,0 +1,135 @@
# DataFrames Introduction

## Create DataFrames

You should prepare the downloaded JSON data before using Spark DataFrames (see the setup documents).
You can also create DataFrames from Hive tables, Parquet files, and RDDs.

As you know, you can start a Spark shell by executing `$SPARK_HOME/bin/spark-shell`.

To load JSON files, use `SQLContext.load`.
I also recommend giving each DataFrame an alias at the same time.

### On Local Machine

```
// Create a SQLContext (sc is an existing SparkContext)
val context = new org.apache.spark.sql.SQLContext(sc)
// Create a DataFrame for Github events
var path = "file:///tmp/github-archive-data/*.json.gz"
val event = context.load(path, "json").as('event)
// Create a DataFrame for Github users
path = "file:///tmp/github-archive-data/github-users.json"
val user = context.load(path, "json").as('user)
```

You can show a schema by `printSchema`.

```
event.printSchema
user.printSchema
```
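
Hive tables, Parquet files, and RDDs, mentioned above, can be used as sources too.
The following is only a minimal sketch, assuming Spark 1.3's `load` with the `parquet` source and `toDF()` via `context.implicits._`; the Parquet path and the `Language` case class are hypothetical examples.

```
// Create a DataFrame from a Parquet file (hypothetical path)
val parquetDF = context.load("file:///tmp/github-archive-data/some-table.parquet", "parquet")
// Create a DataFrame from an RDD of case classes via toDF()
import context.implicits._
case class Language(name: String, year: Int)
val languages = sc.parallelize(Seq(Language("Scala", 2004), Language("Python", 1991))).toDF()
languages.printSchema
```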

### On Spark Cluster

```
// Create a SQLContext (sc is an existing SparkContext)
val context = new org.apache.spark.sql.SQLContext(sc)
// Create a DataFrame for Github events
var path = "file:///mnt/github-archive-data/*.json.gz"
val event = context.load(path, "json").as('event)
// Create a DataFrame for Github users
path = "file:///mnt/github-archive-data/github-users.json"
val user = context.load(path, "json").as('user)
```

## Project (i.e. selecting some fields of a DataFrame)

You can select a column with `dataframe("key")` or `dataframe.select("key")`.
If you want to select multiple columns, use `dataframe.select("key1", "key2")`.

```
// Select a column
event("public").limit(2).show()
// Use an alias for a column
event.select('public as 'PUBLIC).limit(2).show()
// Select multiple columns with aliases
event.select('public as 'PUBLIC, 'id as 'ID).limit(2).show()
// Select nested columns with aliases
event.select($"payload.size" as 'size, $"actor.id" as 'actor_id).limit(10).show()
```

## Filter

You can filter the data with `filter()`.

```
// Filter by a condition
user.filter("name is null").select('id, 'name).limit(5).show()
user.filter("name is not null").select('id, 'name).limit(5).show()
// Filter by a combination of two conditions
// These two expressions are equivalent
event.filter("public = true and type = 'ForkEvent'").count
event.filter("public = true").filter("type = 'ForkEvent'").count
```
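
The same filters can also be written with column expressions instead of SQL strings.
This is only a minimal sketch, assuming Spark 1.3's `Column` operators (`isNotNull`, `===`, `&&`):

```
// Filter with column expressions instead of SQL strings
user.filter('name.isNotNull).select('id, 'name).limit(5).show()
event.filter(('public === true) && ('type === "ForkEvent")).count
```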

## Aggregation

```
val countByType = event.groupBy("type").count()
countByType.limit(5).show()
val countByIdAndType = event.groupBy("id", "type").count()
countByIdAndType.limit(10).foreach(println)
```
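
You can also compute several aggregates at once and sort the result.
A minimal sketch, assuming Spark 1.3's `org.apache.spark.sql.functions` (`count`, `max`) and `orderBy`:

```
// Compute several aggregates per event type
import org.apache.spark.sql.functions.{count, max}
event.groupBy("type").agg(count($"id") as 'events, max($"created_at") as 'latest).limit(5).show()
// Order event types by frequency, most frequent first
countByType.orderBy($"count".desc).limit(5).show()
```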

## Join

You can join two data sets with `join`.
The `where` clause allows you to specify the join conditions.

```
// Self join
val x = user.as('x)
val y = user.as('y)
val join = x.join(y).where($"x.id" === $"y.id")
join.select($"x.id", $"y.id").limit(10).show
// Join Pull Request event data with user data
val pr = event.filter('type === "PullRequestEvent").as('pr)
val join = pr.join(user).where($"pr.payload.pull_request.user.id" === $"user.id")
join.select($"pr.type", $"user.name", $"pr.created_at").limit(5).show
```
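
`join` also accepts a join type as a third argument (for example `"left_outer"`).
A minimal sketch, assuming that Spark 1.3 overload, to keep pull request events even when no matching user is found:

```
// Left outer join: keep PullRequestEvents that have no matching user
val leftJoin = pr.join(user, $"pr.payload.pull_request.user.id" === $"user.id", "left_outer")
leftJoin.select($"pr.type", $"user.name").limit(5).show
```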

## UDFs

You can define UDFs (User Defined Functions) with `udf()`.

```
// Define User Defined Functions
val toDate = udf((createdAt: String) => createdAt.substring(0, 10))
val toTime = udf((createdAt: String) => createdAt.substring(11, 19))
// Use the UDFs in select()
event.select(toDate('created_at) as 'date, toTime('created_at) as 'time).limit(5).show
```
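
UDFs can also be registered by name so that they can be called from SQL queries (the table itself is registered in the next section).
A minimal sketch, assuming Spark 1.3's `udf.register` on the `SQLContext`:

```
// Register a UDF by name for use in Spark SQL
context.udf.register("toDate", (createdAt: String) => createdAt.substring(0, 10))
// After event.registerTempTable("event") in the next section, it could be used like:
// context.sql("SELECT toDate(created_at) AS date FROM event").limit(5).show()
```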

## Execute Spark SQL

You can manipulate data not only with the DataFrame API but also with Spark SQL.

```
// Register the DataFrame as a temporary table
event.registerTempTable("event")
// Execute a Spark SQL query
context.sql("SELECT created_at, repo.name AS `repo.name`, actor.id, type FROM event WHERE type = 'PullRequestEvent'").limit(5).show()
```
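
SQL and the DataFrame API can be mixed freely; for instance, the count-by-type aggregation from earlier can also be expressed in SQL.
A minimal sketch against the `event` temporary table registered above:

```
// The same count-by-type aggregation expressed in SQL
context.sql("SELECT type, COUNT(*) AS cnt FROM event GROUP BY type").limit(5).show()
```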
43 changes: 43 additions & 0 deletions doc/setup-cluster.md
@@ -0,0 +1,43 @@
# Set Up Spark Cluster on EC2 and Data Sets

## Check Out Apache Spark from Github

First, clone Spark from GitHub on your local machine.

```
git clone https://github.com/apache/spark.git && cd spark
git checkout -b v1.3.0-rc1 v1.3.0-rc1
```

## Launch a Spark 1.3 Cluster on EC2

The `$SPARK_HOME/ec2/spark-ec2` command allows you to launch Spark clusters on Amazon EC2.
Here, the checked-out directory is referred to as `$SPARK_HOME`.

The commands below launch a Spark cluster in the Tokyo region on Amazon EC2.
You should change the region and zone to match your location.

```
cd $SPARK_HOME
REGION='ap-northeast-1'
ZONE='ap-northeast-1b'
VERSION='1.3.0'
MASTER_INSTANCE_TYPE='r3.large'
SLAVE_INSTANCE_TYPE='r3.8xlarge'
NUM_SLAVES=5
SPOT_PRICE=1.0
SPARK_CLUSTER_NAME="spark-cluster-v${VERSION}-${SLAVE_INSTANCE_TYPE}x${NUM_SLAVES}"
./ec2/spark-ec2 -k ${YOUR_KEY_NAME} -i ${YOUR_KEY} -s $NUM_SLAVES --master-instance-type="$MASTER_INSTANCE_TYPE" --instance-type="$SLAVE_INSTANCE_TYPE" --region="$REGION" --zone="$ZONE" --spot-price=$SPOT_PRICE --spark-version="${VERSION}" --hadoop-major-version=2 launch "$SPARK_CLUSTER_NAME"
```

## Download Data Sets on EC2

Log in via SSH to the master instance created by `$SPARK_HOME/ec2/spark-ec2`,
and then execute the shell script to download the data sets for this introduction.
Of course, you need to check out this introduction from GitHub on the instance first.

```
bash ./src/bash/download-for-cluster.md
```
37 changes: 37 additions & 0 deletions doc/setup-local.md
@@ -0,0 +1,37 @@
# Set up Spark and Data Sets on Your Local Machine

## Download the Data Sets

You can prepare the data sets with the bash script below.
In order to execute the script, you need to install MongoDB so that `bsondump`, which converts BSON format into JSON format, is available.
After the script runs, the data sets are downloaded to `/tmp/github-archive-data`.

```
bash src/bash/download-for-local.md
cd /tmp/github-archive-data
ls -l
2015-01-01-0.json.gz 2015-01-01-17.json.gz 2015-01-01-4.json.gz
2015-01-01-1.json.gz 2015-01-01-18.json.gz 2015-01-01-5.json.gz
2015-01-01-10.json.gz 2015-01-01-19.json.gz 2015-01-01-6.json.gz
2015-01-01-11.json.gz 2015-01-01-2.json.gz 2015-01-01-7.json.gz
2015-01-01-12.json.gz 2015-01-01-20.json.gz 2015-01-01-8.json.gz
2015-01-01-13.json.gz 2015-01-01-21.json.gz 2015-01-01-9.json.gz
2015-01-01-14.json.gz 2015-01-01-22.json.gz dump
2015-01-01-15.json.gz 2015-01-01-23.json.gz github-users.json
2015-01-01-16.json.gz 2015-01-01-3.json.gz
```

## Build Spark on Your Local Machine

You should build Apache Spark on your local machine in order to use it.
Compiling Spark with `sbt` takes a long time the first time,
so I recommend taking a break while it builds.

This documentation is based on Spark 1.3.0 RC1, because Spark 1.3 had not been released yet as of Feb 25, 2015.

```
git clone https://github.com/apache/spark.git && cd spark
git checkout -b v1.3.0-rc1 v1.3.0-rc1
./sbt/sbt clean assembly
```
22 changes: 22 additions & 0 deletions src/bash/download-for-cluster.md
@@ -0,0 +1,22 @@
#!/bin/bash

# Check whether the bsondump command is available
if [ -z "$(which bsondump)" ]; then
  echo "WARN: You don't have the bsondump command. You should install MongoDB."
  exit 1
fi

# Make a directory to store downloaded data
mkdir -p /mnt/github-archive-data/
cd /mnt/github-archive-data/

# Download GitHub Archive data for 2015-01-01 through 2015-01-30
# https://www.githubarchive.org/
wget http://data.githubarchive.org/2015-01-{01..30}-{0..23}.json.gz

# Download GitHub user data dumped at 2015-01-29 (from GHTorrent)
# and arrange the data as 'github-users.json'
wget http://ghtorrent.org/downloads/users-dump.2015-01-29.tar.gz
tar zxvf users-dump.2015-01-29.tar.gz
# Replace ObjectId(...) with null; ObjectId is MongoDB-specific and not valid JSON.
bsondump dump/github/users.bson | sed -e "s/ObjectId([^)]*)/null/" > github-users.json
22 changes: 22 additions & 0 deletions src/bash/download-for-local.md
@@ -0,0 +1,22 @@
#!/bin/bash

# Check whether the bsondump command is available
if [ -z "$(which bsondump)" ]; then
  echo "WARN: You don't have the bsondump command. You should install MongoDB."
  exit 1
fi

# Make a directory to store downloaded data
mkdir -p /tmp/github-archive-data/
cd /tmp/github-archive-data/

# Download GitHub Archive data for 2015-01-01
# https://www.githubarchive.org/
wget http://data.githubarchive.org/2015-01-01-{0..23}.json.gz

# Download GitHub user data dumped at 2015-01-29 (from GHTorrent)
# and arrange the data as 'github-users.json'
wget http://ghtorrent.org/downloads/users-dump.2015-01-29.tar.gz
tar zxvf users-dump.2015-01-29.tar.gz
# Replace ObjectId(...) with null; ObjectId is MongoDB-specific and not valid JSON.
bsondump dump/github/users.bson | sed -e "s/ObjectId([^)]*)/null/" > github-users.json
