Clone this project to create a 4-node Apache Hadoop cluster with the Cascading SDK pre-installed.
The Cascading 2.7 SDK includes Cascading and many of its sub-projects:
- Lingual - ANSI SQL Command Shell and JDBC Driver
- Pattern - Machine Learning
- Cascalog - Clojure DSL over Cascading
- Scalding - Scala DSL over Cascading
- Multitool - Command line tool for managing large files
- Load - Command line tool for load testing Hadoop
To make getting started as easy as possible, this setup also includes build tools used by parts of the SDK:
- gradle - build tool used by Cascading and its related projects
- leiningen 2 - a popular build tool in the Clojure community, which is used in the Cascalog tutorial included in the SDK
- sbt - a popular build tool in the Scala community, which is used in the Scalding tutorial included in the SDK
This work is based on: http://cscarioni.blogspot.co.uk/2012/09/setting-up-hadoop-virtual-cluster-with.html
First install either Virtual Box (free) or VMware Fusion/Workstation (paid) and Vagrant for your platform. If using VMware Fusion/Workstation, you will also need a VMware + Vagrant license.
Then simply clone this repository, change into the directory and bring the cluster up.
$ vagrant up
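To check that all four VMs came up, you can list them and their current state with Vagrant's built-in status command:

$ vagrant status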
This will set up 4 machines - `master`, `hadoop1`, `hadoop2` and `hadoop3`. Each of them will have two CPUs and 0.5GB of RAM. If this is too much for your machine, adjust the `Vagrantfile`.
The machines will be provisioned using Puppet. All of them will have Hadoop (apache-hadoop-2.6.0) installed, ssh will be configured and local name resolution works. Hadoop is installed in `/opt/hadoop-2.6.0` and all tools are in the `PATH`.
The `master` machine acts as the namenode and the yarn resource manager; the 3 others are datanodes and run node managers.
The cluster uses zeroconf (a.k.a. bonjour) for name resolution. This means that you never have to remember any IP nor will you have to fiddle with your `/etc/hosts` file. Name resolution works from the host to all VMs and between all VMs as well. If you are using Linux, make sure you have `avahi-daemon` installed and running. On a Mac everything should just work (TM) without doing anything. Windows users have to install Bonjour for Windows before starting the cluster.
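As a quick sanity check of the name resolution, you can ping one of the VMs from the host; on Linux you can also verify that `avahi-daemon` is running (the second command assumes a systemd-based distribution):

$ ping -c 1 master.local
$ systemctl status avahi-daemon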
The network used is 192.168.7.0/24. If that causes any problems, change the `Vagrantfile` and `modules/avahi/file/hosts` files to something that works for you. Since everything else is name based, no other change is required.
This cluster uses the ssh-into-all-the-boxes-and-start-things-up approach, which is fine for testing.
Once all machines are up and provisioned, the cluster can be started. Log into the master, format hdfs and start the cluster.
$ vagrant ssh master
$ (master) sudo prepare-cluster.sh
$ (master) sudo start-all.sh
After a little while, all daemons will be running and you have a fully working Hadoop cluster. Note that the `prepare-cluster.sh` step is a one-time action.
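To verify that the daemons came up, you can list the running Java processes on the master with `jps` (shipped with the JDK); the namenode and resource manager should show up, among others:

$ (master) jps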
If you want to shut down your cluster, but want to keep it around for later use, shut down all the services and tell vagrant to stop the machines like this:
$ vagrant ssh master
$ (master) sudo stop-all.sh
$ exit or Ctrl-D
$ vagrant halt
When you want to use your cluster again, simply do this:
$ vagrant up
$ vagrant ssh master
$ (master) sudo start-all.sh
If you don't need the cluster anymore and want to get your disk-space back do this:
$ vagrant destroy -f
This will only delete the VMs; all local files in the directory stay untouched and can be used again if you decide to start up a new cluster.
You can access all services of the cluster with your web-browser.
- namenode: http://master.local:50070/dfshealth.jsp
- application master: http://master.local:8088/cluster
- job history server: http://master.local:19888/jobhistory
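If you prefer the command line, a quick reachability check from the host could look like this (a sketch; it assumes `curl` is installed on the host):

$ curl -s -o /dev/null -w "%{http_code}\n" http://master.local:50070/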
To interact with the cluster on the command line, log into the master and use the hadoop command.
$ vagrant ssh master
$ (master) hadoop fs -ls /
$ ...
You can access the host file system from the `/vagrant` directory, which means that you can drop your Hadoop job in there and run it on your own fully distributed Hadoop cluster.
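For example, a job jar copied into the project directory on the host shows up under `/vagrant` on the master and can be submitted from there (a sketch; the jar name, main class and paths are placeholders):

$ (master) hadoop jar /vagrant/myjob.jar com.example.MyJob /input /output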
Since this is a fully virtualized environment running on your computer, it will not be super-fast. This is not the goal of this setup. The goal is to have a fully distributed cluster for testing and troubleshooting.
To avoid overloading the host machine, each tasktracker has a hard limit of one map task and one reduce task at a time.
Puppet will download the latest Cascading SDK 2.7-wip build and put all SDK tools in the `PATH`. The SDK itself can be found in `/opt/CascadingSDK`.
The SDK allows you to install the [Driven plugin for Cascading](http://cascading.io/driven) by simply running `install-driven-plugin`. This will install the plugin for the vagrant user in `/home/vagrant/.cascading/.driven-plugin`.
Installing the plugin will cause every Cascading based application to send telemetry to https://driven.cascading.io. If you no longer want this to happen, simply delete the installation directory of the plugin mentioned above.
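For example:

$ (master) rm -rf /home/vagrant/.cascading/.driven-plugin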
For more information about Driven, please read the Driven documentation.
This version of the cluster also contains Apache HBase. The layout on disk is similar to Hadoop's. The distribution is in `/opt/hbase-<version>`. You can start the HBase cluster like so:
$ (master) sudo start-hbase.sh
The Hadoop cluster must be running before you issue this command, since HBase requires HDFS to be up and running.
The cluster is shut down like so:
$ (master) sudo stop-hbase.sh
The setup is fully distributed. `hadoop1`, `hadoop2` and `hadoop3` each run a zookeeper instance and a region server. The HBase master runs on the `master` VM.
The web interface of the HBase master is at http://master.local:60010.
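Once HBase is up, you can also talk to it from the command line via the HBase shell (assuming the `hbase` script from `/opt/hbase-<version>/bin` is on the `PATH`); inside the shell, `status` shows the servers that joined the cluster:

$ (master) hbase shell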
If something is not working right, join the Cascading mailing list and post your problem there.
If your computer is not capable of running 4 VMs at a time, you can still benefit from this setup. The `single-node` directory contains an alternative `Vagrantfile`, which only starts the `master` and deploys everything on it.
The interaction and the start and stop sequences work the same way as in the multi-VM cluster, except that it isn't fully distributed. This slimmed-down version of the setup also does not include HBase.
To run the single-node setup, run `vagrant up` in the `single-node` directory instead of the root directory. Everything else stays the same.
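In other words, starting from the project's root directory:

$ cd single-node
$ vagrant up
$ vagrant ssh master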
Vagrant makes it easy to share files between the VMs of the cluster and your host machine. The project directory is mounted under `/vagrant`, which enables you to get files from or to your host by simply copying them into that directory.
The namenode stores the `fsimage` in `/srv/hadoop/namenode`. The datanodes store all data in `/srv/hadoop/datanode`.
Sometimes, when experimenting too much, your cluster might not start anymore. If that is the case, you can easily reset it like so.
$ for host in master hadoop1 hadoop2 hadoop3; do vagrant ssh $host --command 'sudo rm -rf /srv/hadoop' ; done
$ vagrant provision
After those two commands your cluster is in the same state as when you started it for the first time. You can now reformat the namenode and restart all services.
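That is, the same commands as for the very first start:

$ vagrant ssh master
$ (master) sudo prepare-cluster.sh
$ (master) sudo start-all.sh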
If you change any of the puppet modules, you can simply apply the changes with Vagrant's built-in provisioner.
$ vagrant provision
In order to save bandwidth and time, we download Hadoop only once and store it in the `/vagrant` directory, so that the other VMs can reuse it. If the download fails for some reason, delete the tarball and rerun `vagrant provision`.
We also download a file containing checksums for the tarball, which are verified before the cluster is started. If something went wrong during the download, you will see the `verify_tarball` part of puppet fail. If that is the case, delete the tarball and the checksum file (`<tarball>.mds`) and rerun `vagrant provision`.
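As a sketch (the exact tarball name in `/vagrant` depends on the Hadoop version; for the apache-hadoop-2.6.0 used here it should be hadoop-2.6.0.tar.gz):

$ rm /vagrant/hadoop-2.6.0.tar.gz /vagrant/hadoop-2.6.0.tar.gz.mds
$ vagrant provision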
- Have a way to configure the names/IPs in only one file