Vagrant is a great tool for deploying multiple virtual machines that have exactly the same system resources and libraries as the machines currently running in your other environments. You can find more information about Vagrant here.
The prerequisites for deploying a va-spark cluster across multiple VMs on VirtualBox are:
- VirtualBox installed and set up
- The Vagrant tool installed
- General knowledge of how VMs work
Without further ado, let’s jump right into the detailed steps.
1. Create the project directories:
mkdir -p projects/va-spark-vagrant
cd projects/va-spark-vagrant
mkdir resources
2. Create a shell script that tells Vagrant to provision the Ubuntu 18.04 machines with pre-installed tools such as the Java JDK and software-properties-common:
mkdir scripts
cd scripts
touch bootstrap.sh
Copy the following content into bootstrap.sh:
#!/bin/bash
VAGRANT_HOME="/home/vagrant"
sudo apt-get -y update
# install basic tools
sudo apt-get install -y vim htop r-base
# install jdk8 (python-software-properties is no longer available on Ubuntu 18.04, so only software-properties-common is needed)
sudo apt-get install -y software-properties-common
sudo add-apt-repository -y ppa:openjdk-r/ppa
sudo apt-get update
sudo apt-get install -y openjdk-8-jdk
cd ..
touch Vagrantfile
Copy the following content into the Vagrantfile:
# -*- mode: ruby -*-
# vi: set ft=ruby :

Vagrant.require_version ">= 1.5.0"

ipAdrPrefix = "192.168.100.1"
memTot = 30000   # total memory (MB) to split across the nodes; adjust to your host
numNodes = 4
memory = memTot / numNodes
cpuCap = 100 / numNodes

Vagrant.configure(2) do |config|
  r = numNodes..1
  (r.first).downto(r.last).each do |i|
    config.vm.define "node-#{i}" do |node|
      #node.vm.box = "hashicorp/precise64"
      node.vm.box = "hashicorp/bionic64"
      node.vm.provider "virtualbox" do |v|
        v.name = "spark-node#{i}"
        v.customize ["modifyvm", :id, "--cpuexecutioncap", cpuCap]
        v.customize ["modifyvm", :id, "--memory", memory.to_s]
        v.customize ["modifyvm", :id, "--usb", "off"]
        v.customize ["modifyvm", :id, "--usbehci", "off"]
      end
      node.vm.network "private_network", ip: "#{ipAdrPrefix}#{i}"
      node.vm.hostname = "spark-node#{i}"
      node.vm.provision "shell" do |s|
        s.path = "./scripts/bootstrap.sh"
        s.args = "#{i} #{numNodes} #{ipAdrPrefix}"
        s.privileged = false
      end
    end
  end
end
Run the following command to bring up the VM instances:
vagrant up
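With numNodes = 4 and memTot = 30000, each VM is given 7500 MB of RAM and a 25% CPU execution cap, so adjust memTot to what your host can spare. Once provisioning finishes, a quick optional sanity check with standard Vagrant and VirtualBox commands can confirm the machines came up as expected:

```bash
# List the state of every machine defined in the Vagrantfile
vagrant status

# Open a shell on one node and confirm the bootstrap script installed Java
vagrant ssh node-1 -c "java -version"

# Inspect the VirtualBox settings applied by the provider block
VBoxManage showvminfo spark-node1 | grep -iE "memory size|exec cap"
```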
This tutorial will help you install Apache Hadoop as a basic cluster. In the example below there is 1 master node (also used as a worker), cluster1, and 3 worker nodes, cluster2, cluster3, and cluster4. More worker nodes can be added as needed. All nodes run Ubuntu Server 18.04; with the Vagrant boxes above, the login user is vagrant, so the home directory is /home/vagrant/. Remember to replace /home/vagrant/ with your own home directory if yours differs.
To find the IP address of each virtual machine, run the following command:
$ ip addr
# Run this on the master and on each worker (slave);
# each machine reports a different IP address.
Below are the 4 nodes and the IP addresses I will refer to throughout this guide:
192.168.100.11 cluster1
192.168.100.12 cluster2
192.168.100.13 cluster3
192.168.100.14 cluster4
You need to SSH into every node to run the commands below (replace node-1 with each node's name in turn, e.g. node-2, node-3, node-4):
vagrant ssh node-1
sudo apt-get -y update && sudo apt-get -y install ssh
sudo nano /etc/hosts
And paste this at the end of the file:
/etc/hosts
192.168.100.11 cluster1
192.168.100.12 cluster2
192.168.100.13 cluster3
192.168.100.14 cluster4
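Editing /etc/hosts by hand on each VM gets tedious. As a purely optional shortcut, the same entries can be appended on every node from the host machine with vagrant ssh -c; this is a sketch assuming the node names and IP addresses above:

```bash
# Append the cluster host entries to /etc/hosts on every node, driven from the host machine
HOSTS="192.168.100.11 cluster1
192.168.100.12 cluster2
192.168.100.13 cluster3
192.168.100.14 cluster4"

for i in 1 2 3 4; do
  vagrant ssh "node-${i}" -c "echo '${HOSTS}' | sudo tee -a /etc/hosts"
done
```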
Now configure the OpenSSH server and client on the master node. Run the following command:
$ sudo apt-get install openssh-server openssh-client
The next step is to generate an SSH key pair. For this, run the following command:
ssh-keygen -t rsa -P ""
Run the following command to authorize the key:
$ cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys
Now copy the content of .ssh/id_rsa.pub from the master to .ssh/authorized_keys on all the workers (slaves) as well as the master. Run the following commands:
$ ssh-copy-id cluster2
$ ssh-copy-id cluster3
$ ssh-copy-id cluster4
Note: the user name and IP addresses may differ on your machines; adjust accordingly.
Now it’s time to check that everything is set up properly. Run the following commands on the master to connect to the workers (slaves):
$ ssh cluster2
$ ssh cluster3
You can exit from a worker machine by typing:
$ exit
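To confirm that passwordless SSH now works from the master to every worker, a small loop is enough; each command should print the remote hostname without asking for a password:

```bash
# Verify passwordless SSH from cluster1 to the workers
for host in cluster2 cluster3 cluster4; do
  ssh "${host}" hostname
done
```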
Install JDK 8 on all nodes:
sudo apt-get -y install openjdk-8-jdk-headless
Download Hadoop 2.7.3 on all nodes:
mkdir /tmp/hadoop/ && cd /tmp/hadoop/
wget https://archive.apache.org/dist/hadoop/core/hadoop-2.7.3/hadoop-2.7.3.tar.gz
Extract the archive:
tar -xzf hadoop-2.7.3.tar.gz
Move it to the home directory, renaming it for brevity:
mv hadoop-2.7.3 /home/vagrant/hadoop
Update hadoop-env.sh on all nodes:
nano /home/vagrant/hadoop/etc/hadoop/hadoop-env.sh
In the hadoop-env.sh file, find the line that starts with export JAVA_HOME= and replace it with the line below. If it is not found, add the line at the end of the file.
export JAVA_HOME=/usr/lib/jvm/java-8-openjdk-amd64
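The JDK install path can differ between systems, so before relying on the value above you may want to confirm where the JDK actually lives on your VMs:

```bash
# Resolve the real path of the java binary; JAVA_HOME is this path minus the trailing /jre/bin/java (or /bin/java)
readlink -f "$(which java)"

# List the installed JVMs
ls /usr/lib/jvm/
```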
Update core-site.xml on all nodes:
nano /home/vagrant/hadoop/etc/hadoop/core-site.xml
The full content of core-site.xml on all nodes:
<?xml version="1.0" encoding="UTF-8"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<configuration>
  <property>
    <name>fs.defaultFS</name>
    <value>hdfs://cluster1:9000</value>
  </property>
</configuration>
Update hdfs-site.xml on all nodes (the content on the master node is different):
nano /home/vagrant/hadoop/etc/hadoop/hdfs-site.xml
The full content of hdfs-site.xml on cluster1:
<?xml version="1.0" encoding="UTF-8"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<configuration>
  <property>
    <name>dfs.replication</name>
    <value>2</value>
  </property>
  <property>
    <name>dfs.namenode.name.dir</name>
    <value>file:/home/vagrant/hadoop/hdfs/namenode</value>
  </property>
  <property>
    <name>dfs.datanode.data.dir</name>
    <value>file:/home/vagrant/hadoop/hdfs/datanode</value>
  </property>
</configuration>
The full content of hdfs-site.xml on cluster2, cluster3, cluster4:
<?xml version="1.0" encoding="UTF-8"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<configuration>
  <property>
    <name>dfs.replication</name>
    <value>2</value>
  </property>
  <property>
    <name>dfs.datanode.data.dir</name>
    <value>file:/home/vagrant/hadoop/hdfs/datanode</value>
  </property>
</configuration>
Update yarn-site.xml on all nodes:
nano /home/vagrant/hadoop/etc/hadoop/yarn-site.xml
The full content of yarn-site.xml on all nodes:
<?xml version="1.0"?>
<configuration>
  <property>
    <name>yarn.resourcemanager.resource-tracker.address</name>
    <value>cluster1:8025</value>
  </property>
  <property>
    <name>yarn.resourcemanager.scheduler.address</name>
    <value>cluster1:8030</value>
  </property>
  <property>
    <name>yarn.resourcemanager.address</name>
    <value>cluster1:8050</value>
  </property>
  <property>
    <name>yarn.nodemanager.aux-services</name>
    <value>mapreduce_shuffle</value>
  </property>
  <property>
    <name>yarn.nodemanager.aux-services.mapreduce.shuffle.class</name>
    <value>org.apache.hadoop.mapred.ShuffleHandler</value>
  </property>
  <property>
    <name>yarn.acl.enable</name>
    <value>0</value>
  </property>
  <property>
    <name>yarn.resourcemanager.hostname</name>
    <value>cluster1</value>
  </property>
  <property>
    <name>yarn.nodemanager.vmem-check-enabled</name>
    <value>false</value>
  </property>
  <property>
    <name>yarn.scheduler.maximum-allocation-mb</name>
    <value>16384</value>
  </property>
</configuration>
Update mapred-site.xml on all nodes (if the file does not exist, create it):
nano /home/vagrant/hadoop/etc/hadoop/mapred-site.xml
The full content of mapred-site.xml on all nodes:
<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<configuration>
  <property>
    <name>mapreduce.framework.name</name>
    <value>yarn</value>
  </property>
  <property>
    <name>mapreduce.jobhistory.address</name>
    <value>cluster1:10020</value>
  </property>
  <property>
    <name>mapreduce.jobhistory.webapp.address</name>
    <value>cluster1:19888</value>
  </property>
  <property>
    <name>yarn.app.mapreduce.am.env</name>
    <value>HADOOP_MAPRED_HOME=$HADOOP_HOME</value>
  </property>
  <property>
    <name>mapreduce.map.env</name>
    <value>HADOOP_MAPRED_HOME=$HADOOP_HOME</value>
  </property>
  <property>
    <name>mapreduce.reduce.env</name>
    <value>HADOOP_MAPRED_HOME=$HADOOP_HOME</value>
  </property>
</configuration>
Edit the slaves (workers) file on all nodes:
nano /home/vagrant/hadoop/etc/hadoop/slaves
And add the following lines (delete localhost if such a line exists):
[slaves file]
cluster1
cluster2
cluster3
cluster4
Edit the masters file on all nodes:
nano /home/vagrant/hadoop/etc/hadoop/masters
And add the following line (delete localhost if such a line exists):
[masters file]
cluster1
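Rather than editing every configuration file by hand on each node, you can edit everything once on cluster1 and push the files to the workers over SSH. This is only a convenience sketch; remember that hdfs-site.xml is different on the workers, so re-edit it there after copying:

```bash
# Push the Hadoop configuration from cluster1 to the workers
# (hdfs-site.xml on the workers must not contain dfs.namenode.name.dir, so adjust it after copying)
for host in cluster2 cluster3 cluster4; do
  scp /home/vagrant/hadoop/etc/hadoop/{hadoop-env.sh,core-site.xml,hdfs-site.xml,yarn-site.xml,mapred-site.xml,slaves,masters} \
      "${host}:/home/vagrant/hadoop/etc/hadoop/"
done
```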
Edit .bashrc on all nodes:
nano /home/vagrant/.bashrc
And add these lines to the end of the file:
export HADOOP_HOME="/home/vagrant/hadoop"
export PATH=$PATH:$HADOOP_HOME/bin
export PATH=$PATH:$HADOOP_HOME/sbin
export HADOOP_MAPRED_HOME=${HADOOP_HOME}
export HADOOP_COMMON_HOME=${HADOOP_HOME}
export HADOOP_HDFS_HOME=${HADOOP_HOME}
export YARN_HOME=${HADOOP_HOME}
Now load the environment variables into the current session:
source /home/vagrant/.bashrc
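As an optional check, verify that the variables were picked up and that Hadoop sees the configuration you expect:

```bash
echo "$HADOOP_HOME"                    # should print /home/vagrant/hadoop
hadoop version                         # should report Hadoop 2.7.3
hdfs getconf -confKey fs.defaultFS     # should print hdfs://cluster1:9000
```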
On cluster1, set up the NameNode and DataNode directories:
sudo rm -rf /home/vagrant/hadoop/hdfs/namenode/
sudo rm -rf /home/vagrant/hadoop/hdfs/datanode/
sudo mkdir -p /home/vagrant/hadoop/hdfs/namenode/
sudo mkdir -p /home/vagrant/hadoop/hdfs/datanode/
sudo chown vagrant:vagrant /home/vagrant/hadoop/hdfs/namenode/
sudo chown vagrant:vagrant /home/vagrant/hadoop/hdfs/datanode/
sudo chmod 777 /home/vagrant/hadoop/hdfs/namenode/
sudo chmod 777 /home/vagrant/hadoop/hdfs/datanode/
On cluster2, cluster3, cluster4, set up the DataNode directory:
sudo rm -rf /home/vagrant/hadoop/hdfs/datanode/
sudo mkdir -p /home/vagrant/hadoop/hdfs/datanode/
sudo chown vagrant:vagrant /home/vagrant/hadoop/hdfs/datanode/
sudo chmod 777 /home/vagrant/hadoop/hdfs/datanode/
I set the permissions to 777 for easy access; you can tighten them if you want.
Format the NameNode (run this on cluster1 only):
hdfs namenode -format
Start HDFS and YARN from cluster1:
start-dfs.sh && start-yarn.sh
Run jps on cluster1 and on cluster2, 3, 4 to check that the daemons are running: the master should show NameNode, SecondaryNameNode, ResourceManager, DataNode, and NodeManager, while the workers should show DataNode and NodeManager. Then update the firewall rules so that ports 50070 and 8088 are accessible:
sudo ufw allow 50070
sudo ufw allow 8088
Then, if you are running on AWS EC2, set up inbound security-group rules so that ports 50070 and 8088 can be accessed from the internet (in this case, only from my IP address).
By accessing http://${cluster1}:50070 you should see the HDFS web UI, and by accessing http://${cluster1}:8088 you should see the YARN web UI (where ${cluster1} is the node's IP address, e.g. 192.168.100.11, or the public IP from the AWS console if you are on EC2).
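If you prefer the command line over the web UIs, the same health information is also available directly from HDFS and YARN (optional):

```bash
# HDFS summary: live DataNodes, capacity, remaining space
hdfs dfsadmin -report

# NodeManagers currently registered with the ResourceManager
yarn node -list
```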
Note: if the DataNode does not come up on cluster1, you need to remove the following directories on cluster1:
sudo rm -rf /home/vagrant/hadoop/hdfs/namenode/
sudo rm -rf /home/vagrant/hadoop/hdfs/datanode/
Then recreate the directories and set write permissions on them:
sudo mkdir -p /home/vagrant/hadoop/hdfs/namenode/
sudo mkdir -p /home/vagrant/hadoop/hdfs/datanode/
sudo chown vagrant:vagrant /home/vagrant/hadoop/hdfs/namenode/
sudo chown vagrant:vagrant /home/vagrant/hadoop/hdfs/datanode/
sudo chmod 777 /home/vagrant/hadoop/hdfs/namenode/
sudo chmod 777 /home/vagrant/hadoop/hdfs/datanode/
Then format the NameNode again and restart HDFS and YARN.
### 10. Setting up the Spark libs by uploading files to HDFS
Writing to and reading from HDFS is done with the hdfs dfs command. First, manually create your home directory; all other commands will use paths relative to this default home directory. (Note that vagrant is my logged-in user; if you log in with a different user, use your user name instead of vagrant.)
```bash
# Note: this requires $SPARK_HOME, which is set in the Spark installation step below.
jar cv0f spark-libs.jar -C $SPARK_HOME/jars/ .
hdfs dfs -mkdir -p /user/vagrant/
hdfs dfs -put spark-libs.jar /user/vagrant/
hdfs dfs -ls /user/vagrant/
```
Get a book file (example):
wget -O alice.txt https://www.gutenberg.org/files/11/11-0.txt
Upload the downloaded file to HDFS using -put:
hdfs dfs -mkdir books
hdfs dfs -put alice.txt books/alice.txt
List the files on HDFS:
hdfs dfs -ls books/
You can also check the files in the HDFS web UI.
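You can also read the file back to confirm the upload went through (optional):

```bash
# Print the first lines of the uploaded file straight from HDFS
hdfs dfs -cat books/alice.txt | head -n 5
```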
To stop YARN and HDFS when you need to:
stop-yarn.sh && stop-dfs.sh
Download Spark 2.4.0 and extract it into the home directory:
cd /home/vagrant/
wget https://archive.apache.org/dist/spark/spark-2.4.0/spark-2.4.0-bin-hadoop2.7.tgz
tar -xzf spark-2.4.0-bin-hadoop2.7.tgz
mv spark-2.4.0-bin-hadoop2.7 spark
Open the .bashrc file in an editor:
nano /home/vagrant/.bashrc
Add the below variables to the end of the file:
export HADOOP_CONF_DIR=$HADOOP_HOME/etc/hadoop
export SPARK_HOME=/home/vagrant/spark
export PATH=$PATH:$SPARK_HOME/bin
export LD_LIBRARY_PATH=$HADOOP_HOME/lib/native:$LD_LIBRARY_PATH
Now load the environment variables into the current session by running the command below:
source /home/vagrant/.bashrc
If you added them to the .profile file instead, restart your session by logging out and logging in again.
Test the Spark installation by running the SparkPi example locally:
spark-submit --master "local[*]" --deploy-mode client --class org.apache.spark.examples.SparkPi $SPARK_HOME/examples/jars/spark-examples_2.11-2.4.0.jar 10
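Once the local run works, it is worth submitting the same example to YARN, reusing the spark-libs.jar uploaded to HDFS earlier. This is only a sketch with modest resource settings; adjust them to what your VMs can handle:

```bash
spark-submit \
  --master yarn \
  --deploy-mode cluster \
  --conf spark.yarn.archive=hdfs:///user/vagrant/spark-libs.jar \
  --num-executors 4 \
  --executor-memory 2g \
  --class org.apache.spark.examples.SparkPi \
  $SPARK_HOME/examples/jars/spark-examples_2.11-2.4.0.jar 10
```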
Note: when running with VEP, you need to run hdfs dfsadmin -safemode leave to disable safe mode.
Now install the Ensembl VEP dependencies:
sudo apt-get update
sudo apt-get install libdbi-perl gcc libdbd-mysql-perl perl-base=5.26.1-6ubuntu0.5 gcc=4:7.4.0-1ubuntu2.3 g++=4:7.4.0-1ubuntu2.3 make=4.1-9.1ubuntu1 libbz2-dev=1.0.6-8.1ubuntu0.2 liblzma-dev=5.2.2-1.3 libpng-dev=1.6.34-1ubuntu0.18.04.2 uuid-dev=2.31.1-0.4ubuntu3.7 cpanminus libmysqlclient-dev mysql-server unzip=6.0-21ubuntu1.1 git make unzip libpng-dev uuid-dev bcftools
sudo cpanm Archive::Zip
sudo cpanm Archive::Extract
sudo cpanm DBD::mysql
sudo cpanm Set::IntervalTree
sudo cpanm JSON
sudo cpanm PerlIO::gzip
Build the UCSC Kent source libraries, which the Bio::DB::BigFile Perl module needs:
wget http://hgdownload.cse.ucsc.edu/admin/jksrc.zip
unzip jksrc.zip
cd kent/src/lib
export MACHTYPE=i686
make
cd ..
export KENT_SRC=`pwd`
sudo cpanm Bio::DB::BigFile
Now install Ensembl VEP itself:
git clone https://github.com/Ensembl/ensembl-vep.git
cd ensembl-vep
git checkout release/108
perl INSTALL.pl
Notes:
- Cache version: 108
- GRCh38
- Fasta: Homo sapiens
- Plugins: All
- All other configs: default
Copy the data file from your local machine to the AWS EC2 instance (optional, AWS EC2 only):
scp -i ${pem_file} ${path_to_data_file_local} ${user_name_ec2_machine}@${ec2_ip}:${path_to_folder_ec2}
Run VEP standalone to verify the installation:
./vep --format vcf --no_stats --force_overwrite --dir_cache /home/vagrant/.vep --offline --vcf --vcf_info_field ANN --buffer_size 60000 --phased --hgvsg --hgvs --symbol --variant_class --biotype --gene_phenotype --regulatory --ccds --transcript_version --tsl --appris --canonical --protein --uniprot --domains --sift b --polyphen b --check_existing --af --max_af --af_1kg --af_gnomad --minimal --allele_number --pubmed --fasta /home/vagrant/data --input_file ../1KGP/cyp3a7.vcf.gz --output_file f1_b60000_test.vcf
Install Scala and sbt, then build va-spark:
sudo apt-get update
sudo apt-get install scala=2.11.12-4~18.04 zip=3.0-11build1
curl -s "https://get.sdkman.io" | bash
source "$HOME/.sdkman/bin/sdkman-init.sh"
sdk install sbt 1.3.8
git clone https://github.com/variant-annotation/va-spark.git
cd va-spark/
git checkout fix/snpeff
sbt assembly
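After the build finishes, check that the assembly jar landed where the spark-submit command below expects it (the path and version suffix come from that command and may differ in your build):

```bash
ls -lh /home/vagrant/va-spark/target/scala-2.11/vaspark-0.1.jar
```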
Finally, submit va-spark to YARN and record the runtime:
(time spark-submit --master yarn --deploy-mode cluster --conf spark.yarn.archive=hdfs:///user/vagrant/spark-libs.jar --conf spark.driver.memoryOverhead=2048 --conf spark.executor.memoryOverhead=2048 --executor-memory 4g --num-executors 4 --executor-cores 2 /home/vagrant/va-spark/target/scala-2.11/vaspark-0.1.jar --vep_dir /home/vagrant/ensembl-vep/vep --format vcf --no_stats --force_overwrite --dir_cache /home/vagrant/.vep --offline --vcf --vcf_info_field ANN --buffer_size 60000 --phased --hgvsg --hgvs --symbol --variant_class --biotype --gene_phenotype --regulatory --ccds --transcript_version --tsl --appris --canonical --protein --uniprot --domains --sift b --polyphen b --check_existing --af --max_af --af_1kg --af_gnomad --minimal --allele_number --pubmed --fasta /home/vagrant/data --input_file ../1KGP/cyp3a7.vcf.gz --output_file f1_b60000_test.vcf) &> time_vs_10gb_nop34_r8_non4_442.txt