Physical Cluster
The purpose of setting up a physical cluster is to use Hadoop's services to answer the question at hand. While cloud computing is an option, some of Revature's clients may have their own physical cluster, and as responsible contractors we need to be ready to set up, maintain, and execute jobs on these clusters.
Our goal is to process the data on the physical cluster and obtain an answer.
The only real difficulty we experienced was the time constraint. We had access to many resources, which guided our installation smoothly. We only came across a few small errors, which are documented at the bottom of this guide.
NOTES:
- developer is the username of the host machine throughout these instructions.
- Java is already installed on all machines.
- The versions mentioned here were the most recent at the time of writing. Be sure to get the most recent versions available.
- Grayed out text represents commands in the CLI.
- Bold text represents text in files.
- Find IP address of machine
ifconfig
- Disable firewall restrictions
sudo ufw disable
Other helpful commands
sudo ufw status
sudo ufw enable
- Add this IP address to hosts file
sudo vim /etc/hosts
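For example (the addresses below are placeholders; use the ones found with ifconfig on each machine), the hosts file on every node should map each node's IP address to its hostname:
192.168.1.10 master
192.168.1.11 slave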
- Restart SSH
sudo service ssh restart
Other helpful commands
sudo service ssh status
- Create SSH key
ssh-keygen -t rsa -P ""
Hit Enter when asked "Enter file in which to save the key".
- Copy SSH key to master node's authorized keys
cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys
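If SSH later still prompts for a password, the permissions on the .ssh directory and authorized_keys file may be too open; they can be tightened with:
chmod 700 ~/.ssh
chmod 600 ~/.ssh/authorized_keys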
- Copy master node’s SSH key to slave’s authorized keys
ssh-copy-id -i ~/.ssh/id_rsa.pub <username>@<slave node IP address>
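To confirm passwordless SSH is working, connect from the master to the slave; it should no longer ask for a password:
ssh <username>@<slave node IP address>
exit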
- Download Hadoop
wget https://archive.apache.org/dist/hadoop/core/hadoop-2.6.0/hadoop-2.6.0.tar.gz
- Extract Hadoop tar file on all nodes
tar -xf hadoop-2.6.0.tar.gz
- Move extracted hadoop directory to user's home
mv hadoop-2.6.0 ~/
- Add Hadoop paths in the bash file (.bashrc) on all nodes.
sudo vim ~/.bashrc
Add the following paths in the .bashrc file
export HADOOP_HOME=$HOME/hadoop-2.6.0
export HADOOP_CONF_DIR=$HOME/hadoop-2.6.0/etc/hadoop
export HADOOP_MAPRED_HOME=$HOME/hadoop-2.6.0
export HADOOP_COMMON_HOME=$HOME/hadoop-2.6.0
export HADOOP_HDFS_HOME=$HOME/hadoop-2.6.0
export YARN_HOME=$HOME/hadoop-2.6.0
export PATH=$PATH:$HOME/hadoop-2.6.0/bin
Save and exit the .bashrc file.
- Load these changes into your current shell
source ~/.bashrc
- Check for the Hadoop version to ensure Hadoop has been installed correctly.
hadoop version
- Edit the master file in the master node
sudo vim ~/hadoop-2.6.0/etc/hadoop/master
Add the following to the master file
master
Save and exit the master file.
- Edit the slaves file in the master node
sudo vim ~/hadoop-2.6.0/etc/hadoop/slaves
Add the following to the slave file
master
slave
Master must be on the first line and slave must be on the second.
Save and exit the slaves file.
- Edit the slaves file on the slave nodes
sudo vim ~/hadoop-2.6.0/etc/hadoop/slaves
Add the following to the first line
slave
Save and exit the slaves file.
- Edit the core-site.xml file on both master and slave nodes
sudo vim ~/hadoop-2.6.0/etc/hadoop/core-site.xml
Add the following property between the configuration tags
<configuration>
<property>
<name>fs.default.name</name>
<value>hdfs://master:9000</value>
</property>
</configuration>
Save and exit the core-site.xml file.
- Edit the hdfs-site.xml file on the master node
sudo vim ~/hadoop-2.6.0/etc/hadoop/hdfs-site.xml
Add the following properties between the configuration tags
<configuration>
<property>
<name>dfs.replication</name>
<value>2</value>
</property>
<property>
<name>dfs.permissions</name>
<value>false</value>
</property>
<property>
<name>dfs.namenode.name.dir</name>
<value>/home/developer/hadoop-2.6.0/namenode</value>
</property>
<property>
<name>dfs.datanode.data.dir</name>
<value>/home/developer/hadoop-2.6.0/datanode</value>
</property>
</configuration>
Save and exit the hdfs-site.xml file.
- Edit the hdfs-site.xml file on the slave nodes
sudo vim ~/hadoop-2.6.0/etc/hadoop/hdfs-site.xml
Add the following properties between the configuration tags
<configuration>
<property>
<name>dfs.replication</name>
<value>2</value>
</property>
<property>
<name>dfs.permissions</name>
<value>false</value>
</property>
<property>
<name>dfs.datanode.data.dir</name>
<value>/home/developer/hadoop-2.6.0/datanode</value>
</property>
</configuration>
Save and exit the hdfs-site.xml file.
- Copy the template file mapred-site.xml.template to mapred-site.xml in the same directory, on both master and slave nodes
cp ~/hadoop-2.6.0/etc/hadoop/mapred-site.xml.template ~/hadoop-2.6.0/etc/hadoop/mapred-site.xml
- Edit this mapred-site.xml file on both master and slave nodes
sudo vim ~/hadoop-2.6.0/etc/hadoop/mapred-site.xml
Add the following properties between the configuration tags
<configuration>
<property>
<name>mapreduce.framework.name</name>
<value>yarn</value>
</property>
</configuration>
Save and exit the mapred-site.xml file.
- Edit the yarn-site.xml file on both master and slave nodes
sudo vim ~/hadoop-2.6.0/etc/hadoop/yarn-site.xml
Add the following properties between the configuration tags
<configuration>
<property>
<name>yarn.nodemanager.aux-services</name>
<value>mapreduce_shuffle</value>
</property>
<property>
<name>yarn.nodemanager.aux-services.mapreduce_shuffle.class</name>
<value>org.apache.hadoop.mapred.ShuffleHandler</value>
</property>
</configuration>
Save and exit the yarn-site.xml file.
- Format the namenode
hadoop namenode -format
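Note that hadoop namenode -format is deprecated in Hadoop 2.x; it still works, but the equivalent newer command is:
hdfs namenode -format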
- Start the Hadoop services
~/hadoop-2.6.0/sbin/start-all.sh
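start-all.sh is also deprecated in Hadoop 2.x; it still works when run on the master node, but the same daemons can be started in two steps if preferred:
~/hadoop-2.6.0/sbin/start-dfs.sh
~/hadoop-2.6.0/sbin/start-yarn.sh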
- Check services are running
jps
Should have the following running:
NameNode
SecondaryNameNode
DataNode
ResourceManager
NodeManager
- These services can also be viewed through the browser at master:50070/dfshealth.html.
- A file system check can also be performed with
hdfs fsck /
- There will be no directories in HDFS at this point.
To add directories
hdfs dfs -mkdir /user/
hdfs dfs -mkdir /user/developer/
These are the recommended names.
The developer directory matches the username of the host machine but is not required.
Once these two directories are made, hdfs dfs -ls with no path argument will list the contents of /user/developer by default.
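As a quick test (data.csv below is just a placeholder name for any local file), a file can be uploaded and listed with:
hdfs dfs -put data.csv /user/developer/
hdfs dfs -ls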
If the DataNode is not running, it is possible that it has a different cluster ID than the NameNode.
To check this, first stop the services
~/hadoop-2.6.0/sbin/stop-all.sh
Check the cluster ID in these two files
sudo vim ~/hadoop-2.6.0/datanode/current/VERSION
sudo vim ~/hadoop-2.6.0/namenode/current/VERSION
If these are different, edit the clusterID in the DataNode's VERSION file to match the NameNode's.
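The value to compare is the clusterID line in each VERSION file; assuming the directory layout above, it can also be checked directly with:
grep clusterID ~/hadoop-2.6.0/namenode/current/VERSION
grep clusterID ~/hadoop-2.6.0/datanode/current/VERSION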
If the DataNodes are not processing, you may need to add the following to their yarn-site.xml file:
<property>
<name>yarn.resourcemanager.hostname</name>
<value>master</value>
</property>
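After adding this property on the slave nodes, restart the Hadoop services so the change takes effect:
~/hadoop-2.6.0/sbin/stop-all.sh
~/hadoop-2.6.0/sbin/start-all.sh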