Setting up single-node Spark

This document describes how to download and set up Spark on your machine without requiring a cluster setup.

⚠️ This is only intended for demo and learning purposes. Please refer to the official deployment guide for further information on how to properly deploy a Spark cluster.

In this repository you will also find other alternative options to run Spark locally.

Requirements

This setup assumes you have a Linux machine with Java 8 and Python 3 installed. Assuming a Debian Stretch distribution, you can install the required dependencies with the following commands:

sudo apt-get update
sudo apt-get install -y openjdk-8-jdk-headless python3-software-properties python3-numpy curl
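
Optionally, you can verify the dependencies are in place by checking the installed versions:

java -version
python3 --version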

Downloading and unpacking Spark

We recommend installing Spark in /opt/spark. To download and unpack the Spark package, you can use the following commands:

mkdir /opt/spark
curl http://apache.rediris.es/spark/spark-2.4.0/spark-2.4.0-bin-hadoop2.7.tgz | tar -xz -C /opt/spark --strip-components=1
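
If everything went well, the Spark binaries should now be available under /opt/spark/bin:

ls /opt/spark/bin
# should list pyspark, spark-shell and spark-submit, among others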

To make the Spark binaries accessible, add /opt/spark/bin to your PATH by appending the following lines to your .bashrc file:

export PYSPARK_PYTHON=python3
export PATH=$PATH:/opt/spark/bin

After that, restart your current shell (or run source ~/.bashrc) to make sure the PATH changes are applied.
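
If you want a quick sanity check that the binaries are now visible on the PATH:

command -v spark-shell
# should print /opt/spark/bin/spark-shell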

Testing the installation

Simply run the following command; you should get a value like res0: Long = 100 in the console:

echo 'sc.parallelize(1 to 100).count()' | spark-shell
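
Since PYSPARK_PYTHON was set above, you can optionally run an equivalent check through pyspark; it should print 100:

echo 'print(sc.parallelize(range(100)).count())' | pyspark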

Reducing log level

By default Spark is quite verbose and outputs a lot of information to the terminal. Optionally, you can reduce the log level by doing the following:

  1. Copy the file /opt/spark/conf/log4j.properties.template to log4j.properties, in the same directory.
  2. Edit the file and set the rootCategory property to ERROR instead of INFO.

Use this command to do that automatically:

sed 's/rootCategory=INFO/rootCategory=ERROR/g' < /opt/spark/conf/log4j.properties.template > /opt/spark/conf/log4j.properties
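
After running it, the relevant line in /opt/spark/conf/log4j.properties should read something like:

log4j.rootCategory=ERROR, console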

TL;DR: Using the helper script

This whole procedure can be accomplished with a simple script included in the classroom repository. Just clone the repository and run local_setup.sh:

git clone https://github.com/luisbelloch/data_processing_course.git
cd data_processing_course
./local_setup.sh

Spark will be installed in data_processing_course/.spark. Do not forget to add its bin folder to your $PATH.
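
For instance, assuming the repository was cloned into your home directory, you could append a line like the following to your .bashrc (adjust the path to wherever you cloned it):

export PATH=$PATH:$HOME/data_processing_course/.spark/bin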