
Physical Cluster

Terrance Mount edited this page Jan 18, 2019 · 29 revisions


Purpose

The purpose of setting up a physical cluster is to run Hadoop services on our own hardware to answer the question at hand. While cloud computing is an option, some of Revature's clients may have their own physical clusters, and as responsible contractors we need to be ready to set up, maintain, and execute jobs on them.

Goals

Our goal is to process the data on the physical cluster and obtain an answer.

Difficulties

The only difficulty we experienced was time constraints. We had access to many resources, which guided the installation smoothly; we hit only a few small errors, documented at the bottom of this guide.

Installing Hadoop on Xubuntu

NOTES:

  • developer is the username of the host machine throughout these instructions.
  • Java is already installed on all machines.
  • The versions shown were current when this guide was written; check for newer releases and adjust the paths below accordingly.
  • Grayed out text represents commands in the CLI.
  • Bold text represents text in files.
  1. Find the IP address of the machine
    ifconfig
    (On newer systems, ip addr provides the same information.)
  2. Disable firewall restrictions
    sudo ufw disable
    Other helpful commands
    sudo ufw status
    sudo ufw enable
  3. Add the IP address and hostname of every node (master and slave) to the hosts file on each machine
    sudo vim /etc/hosts
  4. Restart SSH
    sudo service ssh restart
    Other helpful commands
    sudo service ssh status
  5. Create SSH key
    ssh-keygen -t rsa -P ""
    Hit Enter when asked "Enter file in which to save the key"
  6. Copy SSH key to master node’s authorized keys
    cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys
  7. Copy master node’s SSH key to slave’s authorized keys
    ssh-copy-id -i ~/.ssh/id_rsa.pub <username>@<slave node IP address>
  8. Download Hadoop
    wget https://archive.apache.org/dist/hadoop/core/hadoop-2.6.0/hadoop-2.6.0.tar.gz
  9. Extract Hadoop tar file on all nodes
    tar -xf hadoop-2.6.0.tar.gz
  10. Move extracted hadoop directory to user's home
    mv hadoop-2.6.0 ~/
  11. Add Hadoop paths in the bash file (.bashrc) on all nodes.
    vim ~/.bashrc
    (sudo is not needed for files in your home directory; the same applies to the Hadoop config files below.)
    Add the following paths in the .bashrc file
    export HADOOP_HOME=$HOME/hadoop-2.6.0
    export HADOOP_CONF_DIR=$HOME/hadoop-2.6.0/etc/hadoop
    export HADOOP_MAPRED_HOME=$HOME/hadoop-2.6.0
    export HADOOP_COMMON_HOME=$HOME/hadoop-2.6.0
    export HADOOP_HDFS_HOME=$HOME/hadoop-2.6.0
    export YARN_HOME=$HOME/hadoop-2.6.0
    export PATH=$PATH:$HOME/hadoop-2.6.0/bin
    Save and exit the .bashrc file.
  12. Load these changes into your current shell
    source ~/.bashrc
  13. Check for the Hadoop version to ensure Hadoop has been installed correctly.
    hadoop version
  14. Edit the master file in the master node
    sudo vim ~/hadoop-2.6.0/etc/hadoop/master
    Add the following to the master file
    master
    Save and exit the master file.
  15. Edit the slave file in the master node
    sudo vim ~/hadoop-2.6.0/etc/hadoop/slaves
    Add the following to the slave file
    master
    slave
    Each line names a host that will run a DataNode; listing master here means the master node also stores data.
    Save and exit the slaves file.
  16. Edit slaves file on the slave nodes
    sudo vim ~/hadoop-2.6.0/etc/hadoop/slaves
    Add the following to the first line
    slave
    Save and exit the slaves file.
  17. Edit core-site.xml file on both master and slave nodes
    sudo vim ~/hadoop-2.6.0/etc/hadoop/core-site.xml
    Add the following property between the configuration tags
    <configuration>
    <property>
    <name>fs.default.name</name>
    <value>hdfs://master:9000</value>
    </property>
    </configuration>

    Save and exit the core-site.xml file.
    (fs.default.name is deprecated in favor of fs.defaultFS; both work in Hadoop 2.6.)
  18. Edit hdfs-site.xml file on the master node
    sudo vim ~/hadoop-2.6.0/etc/hadoop/hdfs-site.xml
    Add the following properties between the configuration tags
    <configuration>
    <property>
    <name>dfs.replication</name>
    <value>2</value>
    </property>
    <property>
    <name>dfs.permissions</name>
    <value>false</value>
    </property>
    <property>
    <name>dfs.namenode.name.dir</name>
    <value>/home/developer/hadoop-2.6.0/namenode</value>
    </property>
    <property>
    <name>dfs.datanode.data.dir</name>
    <value>/home/developer/hadoop-2.6.0/datanode</value>
    </property>
    </configuration>

    Save and exit the hdfs-site.xml file.
  19. Edit hdfs-site.xml file on the slave nodes
    sudo vim ~/hadoop-2.6.0/etc/hadoop/hdfs-site.xml
    Add the following properties between the configuration tags
    <configuration>
    <property>
    <name>dfs.replication</name>
    <value>2</value>
    </property>
    <property>
    <name>dfs.permissions</name>
    <value>false</value>
    </property>
    <property>
    <name>dfs.datanode.data.dir</name>
    <value>/home/developer/hadoop-2.6.0/datanode</value>
    </property>
    </configuration>

    Save and exit the hdfs-site.xml file.
  20. Copy the template file mapred-site.xml.template as mapred-site.xml
    to the same directory. This will be on both master and slave nodes.
    cp ~/hadoop-2.6.0/etc/hadoop/mapred-site.xml.template ~/hadoop-2.6.0/etc/hadoop/mapred-site.xml
  21. Edit this mapred-site.xml file on both master and slave nodes
    sudo vim ~/hadoop-2.6.0/etc/hadoop/mapred-site.xml
    Add the following properties between the configuration tags
    <configuration>
    <property>
    <name>mapreduce.framework.name</name>
    <value>yarn</value>
    </property>
    </configuration>

    Save and exit the mapred-site.xml file.
  22. Edit the yarn-site.xml file on both master and slave nodes
    sudo vim ~/hadoop-2.6.0/etc/hadoop/yarn-site.xml
    Add the following properties between the configuration tags
    <configuration>
    <property>
    <name>yarn.nodemanager.aux-services</name>
    <value>mapreduce_shuffle</value>
    </property>
    <property>
    <name>yarn.nodemanager.aux-services.mapreduce_shuffle.class</name>
    <value>org.apache.hadoop.mapred.ShuffleHandler</value>
    </property>
    </configuration>

    Save and exit the yarn-site.xml file.
  23. Format the namenode
    hdfs namenode -format
    (hadoop namenode -format still works but is deprecated.)
  24. Start the Hadoop services
    ~/hadoop-2.6.0/sbin/start-all.sh
  25. Check that the services are running
    jps
    On the master, the following should be running:
    NameNode
    SecondaryNameNode
    DataNode
    ResourceManager
    NodeManager
    A slave node runs only DataNode and NodeManager.
  26. These services can also be viewed through a browser at
    http://master:50070/dfshealth.html
  27. A file system check can also be performed with
    hdfs fsck /
  28. There will be no directories in HDFS at this point.
    To add directories
    hdfs dfs -mkdir /user/
    hdfs dfs -mkdir /user/developer/
    These are the recommended names.
    The developer directory matches the username but any name can be used.
    Once /user/developer exists, hdfs dfs -ls with no path argument lists that directory by default, since it resolves to /user/<username>.
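Steps 17–22 above amount to writing a handful of small XML files. As a rough sketch of how one of them could be generated instead of edited by hand (this is not part of the original guide; the hostname master and the port come from step 17, and a local scratch directory stands in for ~/hadoop-2.6.0/etc/hadoop so the sketch is self-contained):

```shell
#!/bin/sh
# Sketch: generate the core-site.xml from step 17 with a heredoc.
# CONF would normally be ~/hadoop-2.6.0/etc/hadoop; a local
# directory is used here so the snippet runs anywhere.
CONF=./hadoop-conf-demo
mkdir -p "$CONF"

cat > "$CONF/core-site.xml" <<'EOF'
<configuration>
  <property>
    <name>fs.default.name</name>
    <value>hdfs://master:9000</value>
  </property>
</configuration>
EOF

# Show the generated file so the result can be checked.
cat "$CONF/core-site.xml"
```

The same heredoc pattern works for hdfs-site.xml, mapred-site.xml, and yarn-site.xml, which keeps the files identical across nodes when combined with scp.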

Possible errors

If the DataNode is not running, it is possible it has a different cluster ID than the namenode.
To check this, first stop the services
~/hadoop-2.6.0/sbin/stop-all.sh
Check the cluster ID on two files
sudo vim ~/hadoop-2.6.0/datanode/current/VERSION
sudo vim ~/hadoop-2.6.0/namenode/current/VERSION
If the clusterID values differ, edit the datanode's VERSION file so its clusterID matches the namenode's.
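The VERSION comparison above can also be done non-interactively with grep and sed. The snippet below is only a sketch of that idea: it fabricates two mock VERSION files in a scratch directory so it can run anywhere, whereas on a real node the files would be ~/hadoop-2.6.0/namenode/current/VERSION and ~/hadoop-2.6.0/datanode/current/VERSION (and real VERSION files contain additional fields).

```shell
#!/bin/sh
# Sketch: detect and repair a clusterID mismatch between the
# namenode and datanode VERSION files. Mock files stand in for
# ~/hadoop-2.6.0/{namenode,datanode}/current/VERSION here.
DEMO=./version-demo
mkdir -p "$DEMO"
printf 'clusterID=CID-aaaa\n' > "$DEMO/namenode.VERSION"
printf 'clusterID=CID-bbbb\n' > "$DEMO/datanode.VERSION"

nn_id=$(grep '^clusterID=' "$DEMO/namenode.VERSION" | cut -d= -f2)
dn_id=$(grep '^clusterID=' "$DEMO/datanode.VERSION" | cut -d= -f2)

if [ "$nn_id" != "$dn_id" ]; then
    echo "mismatch: datanode=$dn_id namenode=$nn_id"
    # Rewrite the datanode's clusterID to match the namenode's.
    sed -i "s/^clusterID=.*/clusterID=$nn_id/" "$DEMO/datanode.VERSION"
fi

cat "$DEMO/datanode.VERSION"
```

Only run a repair like this with the services stopped, as described above.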

If the DataNodes are not processing jobs, you may need to add the following to their yarn-site.xml file:
<property>
<name>yarn.resourcemanager.hostname</name>
<value>master</value>
</property>
