Table of Contents generated with DocToc
See also running in cluster mode, running YARN in client mode and running on Mesos.
Create EMR cluster using AWS EMR console or aws cli. The script below creates ten r3.2xlarge slaves cluster. Replace the following parameters with correct values:
- subnet-xxxxxxx
- us-west-1
- KeyName=spark
aws emr create-cluster --name jobserver_test --release-label emr-4.2.0 \
--instance-groups InstanceCount=1,Name=sparkMaster,InstanceGroupType=MASTER,InstanceType=m3.xlarge \
InstanceCount=10,BidPrice=2.99,Name=sparkSlave,InstanceGroupType=CORE,InstanceType=r3.2xlarge \
--applications Name=Hadoop Name=Spark \
--ec2-attributes KeyName=spark,SubnetId=subnet-xxxxxxx --region us-west-1 \
--use-default-roles
- Ssh to master box using cluster Key file
ssh -i <key>.pem hadoop@<master_ip>
- Create required folders in /mnt
mkdir /mnt/work
mkdir -p /mnt/lib/.ivy2
ln -s /mnt/lib/.ivy2 ~/.ivy2
- Create spark-jobserver log dirs
mkdir /mnt/var/log/spark-jobserver
- Install sbt 0.13.9 as described here http://www.scala-sbt.org/0.13/tutorial/Manual-Installation.html
- Install jdk 1.7.0 and git
sudo yum install java-1.7.0-openjdk-devel git
- Clone spark-jobserver and checkout release v0.6.1
cd /mnt/work
git clone https://github.com/spark-jobserver/spark-jobserver.git
cd spark-jobserver
git checkout v0.6.1
- Build spark-jobserver
sbt clean update package assembly
- Create config/emr.sh
APP_USER=hadoop
APP_GROUP=hadoop
INSTALL_DIR=/mnt/lib/spark-jobserver
LOG_DIR=/mnt/var/log/spark-jobserver
PIDFILE=spark-jobserver.pid
JOBSERVER_MEMORY=1G
SPARK_VERSION=1.6.0
SPARK_HOME=/usr/lib/spark
SPARK_CONF_DIR=/etc/spark/conf
HADOOP_CONF_DIR=/etc/hadoop/conf
YARN_CONF_DIR=/etc/hadoop/conf
SCALA_VERSION=2.10.5
MANAGER_JAR_FILE="$appdir/spark-job-server.jar"
MANAGER_CONF_FILE="$(basename $conffile)"
MANAGER_EXTRA_JAVA_OPTIONS=
MANAGER_EXTRA_SPARK_CONFS="spark.yarn.submit.waitAppCompletion=false|spark.files=$appdir/log4jcluster.properties,$conffile"
MANAGER_LOGGING_OPTS="-Dlog4j.configuration=log4j-cluster.properties"
- Create config/emr.conf
spark {
# spark.master will be passed to each job's JobContext
master = "yarn-client"
jobserver {
port = 8090
# Note: JobFileDAO is deprecated from v0.7.0 because of issues in
# production and will be removed in future, now defaults to H2 file.
jobdao = spark.jobserver.io.JobSqlDAO
filedao {
rootdir = /mnt/tmp/spark-jobserver/filedao/data
}
sqldao {
# Slick database driver, full classpath
slick-driver = slick.driver.H2Driver
# JDBC driver, full classpath
jdbc-driver = org.h2.Driver
# Directory where default H2 driver stores its data. Only needed for H2.
rootdir = /tmp/spark-jobserver/sqldao/data
# Full JDBC URL / init string, along with username and password. Sorry, needs to match above.
# Substitutions may be used to launch job-server, but leave it out here in the default or tests won't pass
jdbc {
url = "jdbc:h2:file:/tmp/spark-jobserver/sqldao/data/h2-db"
user = ""
password = ""
}
# DB connection pool settings
dbcp {
enabled = false
maxactive = 20
maxidle = 10
initialsize = 10
}
}
}
# predefined Spark contexts
contexts {
# test {
# num-cpu-cores = 1 # Number of cores to allocate. Required.
# memory-per-node = 1g # Executor memory per node, -Xmx style eg 512m, 1G, etc.
# spark.executor.instances = 1
# }
# define additional contexts here
}
# universal context configuration. These settings can be overridden, see README.md
context-settings {
num-cpu-cores = 4 # Number of cores to allocate. Required.
memory-per-node = 8g # Executor memory per node, -Xmx style eg 512m, #1G, etc.
spark.executor.instances = 2
# If you wish to pass any settings directly to the sparkConf as-is, add them here in passthrough,
# such as hadoop connection settings that don't use the "spark." prefix
passthrough {
#es.nodes = "192.1.1.1"
}
}
# This needs to match SPARK_HOME for cluster SparkContexts to be created successfully
home = "/usr/lib/spark"
}
- Build tar.gz package. The package location will be /tmp/job-server/job-server.tar.gz
bin/server_package.sh emr
-
Ssh to master box using cluster Key file
-
Create directory for spark-jobserver and extract job-server.tar.gz into it
mkdir /mnt/lib/spark-jobserver
cd /mnt/lib/spark-jobserver
tar zxf /tmp/job-server/job-server.tar.gz
-
Check emr.conf and settings.sh
-
Start jobserver
./server_start.sh
- Check logs
tail -300f /mnt/var/log/spark-jobserver/spark-job-server.log
- To stop jobserver run
./server_stop.sh
- Check current status
# check current jars
curl localhost:8090/jars
# check current contexts
curl localhost:8090/contexts
- Upload jobserver example jar to testapp
curl --data-binary @/mnt/work/spark-jobserver/job-server-tests/target/scala-2.10/job-server-tests_2.10-0.6.1.jar \
localhost:8090/jars/testapp
# check current jars. should return testapp
curl localhost:8090/jars
- Create test context
# create test context
curl -d "" "localhost:8090/contexts/test?num-cpu-cores=1&memory-per-node=512m&spark.executor.instances=1"
# check current contexts. should return test
curl localhost:8090/contexts
- Run WordCount example
# run WordCount example (should be done in 1-2 sec)
curl -d "input.string = a b c a b see" \
"localhost:8090/jobs?appName=testapp&classPath=spark.jobserver.WordCountExample&context=test&sync=true"
- Check jobs
# check jobs
curl localhost:8090/jobs
- Check YARN website on port 8088. It should be Spark application RUNNING for test context.
- If for some reason spark-jobserver can not start Spark Application on Yarn cluster you can try to open Spark Shell and check if it can start Spark Application on Yarn cluster
spark-shell \
--num-executors 1 \
--driver-memory 1g \
--executor-memory 1g \
--executor-cores 1
-
Check yarn website on port 8088. Spark-shell application should be have RUNNING status
-
Run the following test command in Spark Shell to test spark-shell yarn application
sc.parallelize(1 to 10, 1).count
- If shell works but spark-jobserver still does not work check logs
- on jobserver /var/log/spark-jobserver/spark-job-server.log
- on Cluster: Yarn App logs (website on port 8088)
- on Cluster: Containers stderr, stdout (website on port 8088)
- Make sure spark-jobserver uses EMR default spark installed in /usr/lib/spark