This document describes how to download and set up Spark on your machine without requiring a cluster.
⚠️ This is intended for demo and learning purposes only; please refer to the official deployment guide for further information on how to properly deploy a Spark cluster.
This repository also includes other alternative options to run Spark locally:
- Spark on Docker
- Spark on Kubernetes
- Spark on Vagrant
- Spark on Google Cloud Dataproc
- PySpark Jupyter Notebook
This setup assumes you have a Linux machine with Java 8 and Python 3 installed. Assuming a Debian Stretch distribution, you can install the required dependencies with the following commands:
sudo apt-get update
sudo apt-get install -y openjdk-8-jdk-headless python3-software-properties python3-numpy curl
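Before continuing, you may want to confirm the expected versions are in place (a quick optional check; the exact version strings vary by distribution):

java -version        # should report something like openjdk version "1.8.0_..."
python3 --version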
We recommend installing Spark in /opt/spark. To download and extract the Spark package, you can use the following commands:
mkdir /opt/spark
curl http://apache.rediris.es/spark/spark-2.4.0/spark-2.4.0-bin-hadoop2.7.tgz | tar -xz -C /opt/spark --strip-components=1
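Note that Apache mirrors usually drop older releases; if the URL above stops working, the same package is normally available under https://archive.apache.org/dist/spark/. To verify the extraction worked, you can ask Spark for its version (an optional check):

/opt/spark/bin/spark-submit --version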
To make the Spark binaries accessible, add /opt/spark/bin to the PATH by appending the following lines to your .bashrc file:
export PYSPARK_PYTHON=python3
export PATH=$PATH:/opt/spark/bin
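If you prefer not to edit .bashrc by hand, a snippet like this should append both lines (a minimal sketch, assuming your shell reads ~/.bashrc on startup):

cat >> ~/.bashrc <<'EOF'
export PYSPARK_PYTHON=python3
export PATH=$PATH:/opt/spark/bin
EOF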
After that, restart the current shell to make sure the PATH changes are applied.
Simply run the following command; you should get a value like res0: Long = 100 in the console:
echo 'sc.parallelize(1 to 100).count()' | spark-shell
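Since PYSPARK_PYTHON points to python3, an equivalent smoke test for the Python API could be the following; it should print 100 (a sketch, piping a single statement into the pyspark shell):

echo 'print(sc.parallelize(range(100)).count())' | pyspark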
By default Spark is quite verbose and outputs a lot of information to the terminal. Optionally, you can reduce the log level by doing the following:
- Rename the file /opt/spark/conf/log4j.properties.template to log4j.properties, in the same directory.
- Edit the file and set the rootCategory property to ERROR instead of INFO.
You can use the following command to do both steps automatically:
sed 's/rootCategory=INFO/rootCategory=ERROR/g' < /opt/spark/conf/log4j.properties.template > /opt/spark/conf/log4j.properties
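To confirm the change took effect, inspect the generated file (an optional check; the remaining template settings are copied unchanged):

grep rootCategory /opt/spark/conf/log4j.properties   # should show log4j.rootCategory=ERROR, console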
This whole procedure can be accomplished by a simple script included in the classroom repository. Just clone the repository and run local_setup.sh:
git clone https://github.com/luisbelloch/data_processing_course.git
cd data_processing_course
./local_setup.sh
Spark will be installed in data_processing_course/.spark. Do not forget to add its bin folder to the $PATH.
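For instance, from the repository root, something along these lines would make the binaries available in the current session (a sketch; adapt the path to your setup):

export PATH=$PATH:$(pwd)/.spark/bin
spark-submit --version   # quick check that the binaries resolve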