We have created a Vagrant setup using Ansible that will download and unpack Spark inside the generated machine.
⚠️ This setup is intended for demo and learning purposes only; please refer to the official deployment guide for information on how to properly deploy a Spark cluster.
To bootstrap the machine, do:
```shell
git clone https://github.com/luisbelloch/data_processing_course.git
cd data_processing_course
vagrant up
```
Once the process completes, you can access the machine with:

```shell
vagrant ssh
```
Remember that you can access the host machine files through the `/vagrant` folder from inside the VM.
Make sure the machine is up and running with `vagrant up`; once it is, you can access the virtual machine with `vagrant ssh`.
To test the setup, run the following command; you should get a value like `res0: Long = 100` in the console:

```shell
echo 'sc.parallelize(1 to 100).count()' | spark-shell
```
The samples we discussed in class are available in the folder `/vagrant/spark` inside the virtual machine:

```shell
vagrant@buster:~$ cd /vagrant/spark/
vagrant@buster:/vagrant/spark$ spark-submit compras_con_mas_de_un_descuento.py
```
You may want to start the `pyspark` REPL as well:

```shell
vagrant@buster:~$ cd /vagrant/spark/
vagrant@buster:/vagrant/spark$ pyspark
```