This cloudformation tool (Mac and Linux compatible) creates an EMR 5.23.0 cluster with Spark 2.4.0 using spot instances, a cost-effective option (based on a bid price) for deploying clusters. Once your cluster is up and running, it will have the latest Hail 0.2 version and Jupyter Lab installed. See the sample file in the notebook folder, pre-loaded in Jupyter Lab, for you to use as a starting point.
This tool requires the following programs to be installed on your computer beforehand (see details in section Before getting started):
- Python3, pip and some additional Python libraries
- Amazon's Command Line Interface (CLI)
To install the required software, open a terminal and execute the commands for your operating system.
Mac:
# Installs homebrew
ruby -e "$(curl -fsSL"
# Installs python3
brew install python3
# Upgrades pip
pip3 install --upgrade pip
# Installs additional libraries
sudo -H pip3 install boto3 pandas botocore paramiko pyyaml nose tornado
# If the previous command does not work, try the following
sudo -H python3 -m pip install boto3 pandas botocore paramiko pyyaml nose tornado
# Installs AWS CLI
brew install awscli
Debian/Ubuntu Linux:
# Installs Linuxbrew
sudo apt-get -y install build-essential curl file git
echo 'export PATH="/home/linuxbrew/.linuxbrew/bin:$PATH"' >> ~/.bashrc
source ~/.bashrc
# Installs python3
brew install python3
# Upgrades pip
pip3 install --upgrade pip
# Installs additional libraries
sudo -H pip3 install boto3 pandas botocore paramiko pyyaml nose tornado
# If the previous command does not work, try the following
sudo -H python3 -m pip install boto3 pandas botocore paramiko pyyaml nose tornado
# Installs AWS CLI
brew install awscli
This tool is executed from the command line using Amazon's CLI utility. Before spinning gears, make sure you have:
a) A configured CLI account. From the terminal, execute aws configure (click here for additional information). If your CLI account has been previously configured, the tool will use that configuration by default. If you want to use a specific account or a different user, execute aws configure again and re-configure your account.
b) A valid EC2 key pair. Click here to learn more about how to create and use your key. Safety remark: once you have your key, make sure to set the proper permissions for it: chmod 400 my-key.pem
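The permission check behind the safety remark can be sketched in Python. This is only an illustration, not part of the tool: the helper name has_owner_read_only_permissions is hypothetical, and the demonstration runs on a throwaway placeholder file rather than a real key. It assumes a POSIX system (Mac or Linux).

```python
import os
import stat
import tempfile


def has_owner_read_only_permissions(key_path):
    """Return True when the file mode is exactly 0o400 (what chmod 400 sets)."""
    mode = stat.S_IMODE(os.stat(key_path).st_mode)
    return mode == 0o400


# Demonstration on a throwaway file standing in for my-key.pem.
with tempfile.TemporaryDirectory() as tmp_dir:
    demo_key = os.path.join(tmp_dir, "my-key.pem")
    with open(demo_key, "w") as handle:
        handle.write("placeholder, not a real key\n")

    os.chmod(demo_key, 0o644)  # too permissive for a private key
    loose = has_owner_read_only_permissions(demo_key)

    os.chmod(demo_key, 0o400)  # same effect as: chmod 400 my-key.pem
    strict = has_owner_read_only_permissions(demo_key)

print(loose, strict)
```

Running a check like this against your own key before launching may save a failed SSH connection later, since SSH clients typically refuse keys that other users can read.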
Open a terminal and clone this repository:
git clone
Change directories:
cd hail-on-AWS-spot-instances/src
Using the text editor of your preference (sublime, atom, vi, emacs, etc.), update the configuration file as per the instructions below. This file provides the information needed to create the cluster, and its fields must be filled in correctly for the cluster to be created (do not change the name of the file). Before heading to step 4, follow the instructions explained beneath. Give a name to your cluster (EMR_CLUSTER_NAME) and add meaningful information by properly identifying your EC2_NAME_TAG. The file in the repo defaults to region us-east-1, one m4.large master node and four r4.xlarge worker nodes. You can change all of these parameters to whatever suits your application.
config:
  EMR_CLUSTER_NAME: "my-hail-02-cluster" # Give a name to your EMR cluster
  EC2_NAME_TAG: "my-hail-EMR" # Adds a tag to the individual EC2 instances
  OWNER_TAG: "emr-owner" # EC2 owner tag
  PROJECT_TAG: "my-project" # Project tag
  REGION: "us-east-1"
  MASTER_INSTANCE_TYPE: "m4.large" # Suggested EC2 instances, change as desired
  WORKER_INSTANCE_TYPE: "r4.xlarge" # Suggested EC2 instances, change as desired
  WORKER_COUNT: "4" # Number of worker nodes
  WORKER_BID_PRICE: "0.44" # Required for spot instances
  MASTER_HD_SIZE: "50" # Size in GB - For large data sets, more HD space may be required
  WORKER_HD_SIZE: "150" # Size in GB - For large data sets, more HD space may be required (i.e. ~500GB for the 1KG Phase 3)
  SUBNET_ID: "" # This field can be either left blank or for further security you can specify your private subnet ID in the form: subnet-1a2b3c4d
  S3_BUCKET: "s3n://my-s3-bucket/" # Specify your S3 bucket for EMR log storage
  KEY_NAME: "my-key" # Input your key name ONLY! DO NOT include the .pem extension
  PATH_TO_KEY: "/full-path-to/my-key/" # Full path to the FOLDER where the .pem file resides
  WORKER_SECURITY_GROUP: "" # If empty creates a new group by default. You can also add a specific SG. See the SG link in the FAQs section
  MASTER_SECURITY_GROUP: "" # If empty creates a new group by default. You can also add a specific SG. See the SG link in the FAQs section
  HAIL_VERSION: "current" # Specify a git hash version (the first 7-12 characters will suffice) to install a specific commit/version. When left empty or "current" will install the latest version of Hail available in the repo
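Some of the rules stated in the configuration comments can be checked before launching anything. The sketch below is an assumption-laden illustration, not code from this repo: check_config is a hypothetical helper, the values are passed as a plain dictionary instead of being read from the configuration file, and "full path" is interpreted as a path starting with / (Mac and Linux).

```python
def check_config(config):
    """Return a list of human-readable problems found in the config values."""
    problems = []

    # KEY_NAME is the key name only, without the .pem extension.
    if config.get("KEY_NAME", "").endswith(".pem"):
        problems.append("KEY_NAME must not include the .pem extension")

    # Assumption: a "full path" is an absolute POSIX path.
    if not config.get("PATH_TO_KEY", "").startswith("/"):
        problems.append("PATH_TO_KEY must be a full (absolute) path")

    try:
        if int(config.get("WORKER_COUNT", "")) < 1:
            problems.append("WORKER_COUNT must be at least 1")
    except (TypeError, ValueError):
        problems.append("WORKER_COUNT must be a whole number")

    try:
        if float(config.get("WORKER_BID_PRICE", "")) <= 0:
            problems.append("WORKER_BID_PRICE must be greater than 0")
    except (TypeError, ValueError):
        problems.append("WORKER_BID_PRICE must be a number")

    # SUBNET_ID may be blank or follow the form subnet-1a2b3c4d.
    subnet = config.get("SUBNET_ID", "")
    if subnet and not subnet.startswith("subnet-"):
        problems.append("SUBNET_ID must be empty or look like subnet-1a2b3c4d")

    return problems


# Values copied from the sample configuration.
sample = {
    "KEY_NAME": "my-key",
    "PATH_TO_KEY": "/full-path-to/my-key/",
    "WORKER_COUNT": "4",
    "WORKER_BID_PRICE": "0.44",
    "SUBNET_ID": "",
}
print(check_config(sample))

# Two deliberate mistakes to show what gets reported.
bad = dict(sample, KEY_NAME="my-key.pem", WORKER_COUNT="four")
print(check_config(bad))
```

The sample values pass without problems, while the second dictionary is reported for the .pem extension and the non-numeric worker count.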
3.1. Select the EC2 instances for your cluster. It is recommended to use a small generic EC2 for the master, such as m4.large, and more powerful EC2s (compute or memory optimized) for your worker nodes, such as r4.4xlarge. Visit this link to see the different types of EC2 instances.
Suggested EC2s (WORKER_INSTANCE_TYPE):
- c4.4xlarge
- r4.2xlarge
- r4.4xlarge
- m4.4xlarge
- i3.4xlarge
Since we are using spot instances, the worker nodes require a maximum bid price to be specified. The WORKER_BID_PRICE field specifies the maximum cost that we will pay for each of the worker nodes. To choose an accurate and competitive bid price for your worker nodes, log in to the EMR management console:
Click on Create cluster:
Then, click on Go to advanced options:
You will be taken to Step 1: Software and Steps, click Next:
Here, click on the instance type selection pencil (1) to find your worker node type. Within the list select your desired instance type and click on the Save button. Next, hover over the i icon (2) to show the current spot price for such instance:
Prices vary based on demand and by the Subnet with its corresponding Availability Zone (subnet-053f834c and zone us-east-1a in this example), where the latter dictates the bid price. A good practice is to identify the current prices per subnet/zone and bid slightly above them so that you are promptly provisioned with instances. Even if you specify a higher bid price, you will still pay less if a lower price is available for your zone. The example below shows a suggested bid of $0.44 for instances in zones 1a and 1c:
3.2. For your SUBNET_ID, you can either specify the subnet from the previous step (e.g. subnet-053f834c) or choose a specific one from the VPC Dashboard by clicking on Subnets in the left panel:
For instance pricing, follow the guidelines from step 3.1. The price is given by the zone where your subnet is located.
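The "bid slightly above the current price" guideline from step 3.1 can be written down as a small calculation. This is a sketch only: suggest_bids is a hypothetical helper, the 5% margin is an arbitrary assumption, and the spot prices are made-up illustration values, not real quotes from the console.

```python
import math


def suggest_bids(spot_prices_by_zone, margin=0.05):
    """Return a bid per zone: current price plus a margin, rounded up to cents."""
    bids = {}
    for zone, price in spot_prices_by_zone.items():
        # Rounding up keeps the bid strictly above the current spot price.
        bids[zone] = math.ceil(price * (1 + margin) * 100) / 100
    return bids


# Made-up prices in USD per hour, for illustration only.
example_prices = {"us-east-1a": 0.42, "us-east-1c": 0.41, "us-east-1d": 0.55}

bids = suggest_bids(example_prices)
for zone in sorted(bids):
    print(zone, example_prices[zone], bids[zone])
```

Whatever margin you pick, the value that goes into WORKER_BID_PRICE is the one for the zone of the subnet you set in SUBNET_ID.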
3.3. The S3_BUCKET field specifies a location to store all the logs of your cluster (e.g. s3n://my-s3-bucket/). If you leave it blank (""), the log folder will be created under your S3 root folder. The log folder will have the same name as your automatically assigned EMR cluster ID (e.g. j-123EMRID3210).
3.4. The KEY_NAME field must include the name of your key without the extension. If your key file is my-key.pem, only put my-key. The PATH_TO_KEY field requires the full path to the folder where the key file resides. For additional details about your key, scroll up to the Before getting started section in this repo.
3.5. To specify the WORKER_SECURITY_GROUP and MASTER_SECURITY_GROUP, go to the VPC Dashboard and, from the left panel, select Security >> Security Groups. Note: if these two fields are left empty (the default in the configuration file), the security groups are automatically assigned. IMPORTANT: to properly access Jupyter Lab from the browser, port 8192 has to be added to the inbound rules of your MASTER_SECURITY_GROUP. To achieve this, once you are on the Security Groups page, select your desired group:
Click on the Inbound Rules tab to double check that the required ports, including 8192, are on the list. To add/edit port rules, click on Edit rules and use one of the two configurations suggested below:
Click here for additional documentation on security groups.
3.6. If you want to run your analysis with a specific version of Hail, the HAIL_VERSION option accepts either the abbreviated or the full SHA-1 hash. The script will accept any hash between 7 and 40 characters long. The default is "current". If a specific hash is not given, or if it is not found, the latest available version will be installed.
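The format rule for HAIL_VERSION can be illustrated with a short check. This is a sketch under assumptions, not the script's actual logic: resolve_hail_version is a hypothetical helper, it only validates the format (7 to 40 hexadecimal characters, as SHA-1 hashes are hexadecimal), and it does not look the hash up in the Hail repository, so the "hash not found" fallback is out of its scope.

```python
import re

# A SHA-1 hash, abbreviated or full: 7 to 40 hexadecimal characters.
HASH_PATTERN = re.compile(r"[0-9a-fA-F]{7,40}")


def resolve_hail_version(value):
    """Return a commit hash to install, or "current" for the latest version."""
    candidate = (value or "").strip()
    if HASH_PATTERN.fullmatch(candidate):
        return candidate.lower()
    # Empty values, "current", and malformed hashes fall back to the latest.
    return "current"


for raw in ["", "current", "abc1234", "abc12", "not-a-hash"]:
    print(repr(raw), "->", resolve_hail_version(raw))
```

Only the third value is treated as a hash here; the others are too short, empty, or contain non-hexadecimal characters, so they resolve to "current".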
Once the configuration file is properly filled in and saved, go back to the terminal and, from the src directory, execute the tool's launch script with sh. The EMR cluster creation takes between 7 and 10 minutes (depending on EC2 availability). DO NOT terminate the script execution, as it will automatically give you the IP address to connect to Jupyter Lab, in the form 123.456.0.1:8192. Here's a sample screenshot showing what you get once the cluster is successfully created:
(Optional) The full log of the EMR provisioning can be found at: /tmp/cloudcreation_log.out
- You can check the status of the EMR creation in the EMR management console. The cluster has been created successfully once its Status shows a solid green circle to the left of the cluster Name.
After the cluster is created, allow time for automatic program installation and configuration (~5-8 minutes depending on the number of worker nodes). No additional action is required other than waiting for the installation process to complete. (Optional) In addition, the script will also provide the public DNS to connect to the master node. Click here for instructions on how to connect to the master node (NOTE: use username hadoop) to monitor cluster progress and status (the program installation log at the master node of your EMR is saved at the path: /tmp/cloudcreation_log.out).
To launch Jupyter Lab, paste the previously given IP (123.456.0.1:8192; this is the master node's IP pointing to port 8192) in a browser and hit Enter. Once you see the following screen:
use password: phenopolis to log in. If you successfully log in, you are all set!
If, after executing the launch script, you get an error message saying that "variable cluster_id_json is out of range", it means that the CLI command aws emr create-cluster --applications Name=Hadoop Name=Spark ... did not retrieve a cluster ID. This error can occur for different reasons, such as a defective AWS account configuration (aws configure) or a user who needs additional permissions such as AmazonElasticMapReduce* or AmazonEC2*.
Sometimes you may get sudden or unexpected errors. One of the reasons may be that your initial spot instances can be dropped and replaced by new instances (that's how the spot instance model works). This tool checks for this behavior every minute and will fix everything for you. A common error when an instance is replaced is:
FatalError: ClassNotFoundException: is.hail.kryo.HailKryoRegistrator
- For this and other Jupyter Lab glitches, you only need to restart the kernel from the Kernel menu, or use Restart & Run All. For additional Jupyter Lab documentation, visit their website:
If you get the error "EMR_DefaultRole is invalid", this is how to solve it: