The Open Data Platform (ODP) is an open-source data management platform that can be rapidly deployed and tailored to accelerate Big Data and Cloud-scale solution delivery. The Bootstrap repository features an Ansible Playbook that automates the deployment of a 5-server Hadoop cluster, managed by either HortonWorks Ambari or Cloudera Manager, on Amazon Web Services (AWS) Elastic Compute Cloud (EC2) instances.
- The following must be installed locally or on a Linux VM:
  - Ansible (2.4 or later)
  - Python (2.7 or later)
  - `python-boto` (Python 2.x) or `python3-boto` (Python 3.x)
- An AWS IAM user with permissions to launch EC2 instances
- AWS Security Groups set up to allow communication between EC2 instances
- Set `AWS_ACCESS_KEY_ID` and `AWS_SECRET_ACCESS_KEY` as environment variables for the user running Ansible:

```
export AWS_ACCESS_KEY_ID=aws_access_key_id
export AWS_SECRET_ACCESS_KEY=aws_secret_access_key
```
- The AWS private key from the key pair must be saved as `~/.ssh/id_rsa` for the user running Ansible (`id_rsa` should be the filename of the key, not a directory; the key file has no extension)
  - Notice: it is extremely important that the AWS SSH key for Ansible is saved as `~/.ssh/id_rsa`
  - It is best practice to set permissions of `0400` on `id_rsa` files (read-only to the file owner); see the example below
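For example, assuming the key pair was downloaded as `my-keypair.pem` (a hypothetical filename), you could install it like this:

```
cp my-keypair.pem ~/.ssh/id_rsa   # filename must be exactly id_rsa, no extension
chmod 0400 ~/.ssh/id_rsa          # read-only for the file owner
```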
- Clone the Open Data Platform repo to the Ansible host and `cd` to the repo base directory
- Use Ansible Vault to encrypt the properties file. The Ansible Vault password entered in this step will be needed to edit the properties file and to run the Ansible Playbook.

```
ansible-vault encrypt group_vars/all
```
- Use Ansible Vault to edit `group_vars/all` and configure the AWS settings `aws_user`, `aws_access_mode`, `aws_unique_identifier`, `aws_image`, `aws_region`, `aws_subnet_id`, `aws_security_group`, `aws_keypair`, `aws_device_name`, `aws_instance_type`, `aws_management_server_volume_size`, and `aws_client_server_volume_size`. Further details and example values for these properties can be found in the `group_vars/all` file comments.

```
ansible-vault edit group_vars/all
```
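For orientation, the AWS settings block might look something like the following sketch. Only a subset of the required properties is shown and every value is a placeholder; the authoritative property names and formats are the ones documented in the `group_vars/all` comments:

```
aws_user: centos                       # placeholder: OS user for the chosen AMI
aws_image: ami-6d1c2007                # placeholder: one of the verified AMIs below
aws_region: us-east-1                  # placeholder: your AWS region
aws_subnet_id: subnet-0123abcd         # placeholder: from your AWS administrator
aws_security_group: sg-0123abcd        # placeholder: from your AWS administrator
aws_keypair: my-keypair                # placeholder: name of your EC2 key pair
aws_management_server_volume_size: 50  # placeholder: GB, see recommendation below
aws_client_server_volume_size: 50      # placeholder: GB
```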
- The AMI Image ID for RHEL 7 can be found in the AWS console by clicking 'Launch Instance' under the 'Quick Start' tab
- The AMI Image ID for CentOS 7 can be found on the CentOS Wiki
- We have verified that the following AMIs work (see the verification sketch after this list):
  - RHEL-7.4_HVM_GA-20170808-x86_64-2-Hourly2-GP2 (`ami-c998b6b2`)
  - CentOS Linux 7 x86_64 HVM EBS 1602-b7ee8a69-ee97-4a49-9e68-afaee216db2e-ami-d7e1d2bd.3 (`ami-6d1c2007`)
- Region, Subnet, Security Group, and Key Pair values should be available from your AWS administrator
- We recommend allocating at least 50 GB of primary disk space
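If you want to confirm that a given AMI is visible in your region before editing `group_vars/all`, one way (assuming the AWS CLI is installed and configured) is:

```
# Describe the verified CentOS 7 AMI; an empty Images list in the
# response means it is not available in the queried region.
aws ec2 describe-images --image-ids ami-6d1c2007 --region us-east-1
```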
The HortonWorks deployment installs the following services in Ambari:
- HDFS
- YARN
- MapReduce2
- Tez
- Hive
- Oozie
- Zookeeper
- Kafka
- Spark
- Zeppelin Notebook
The specific components of each HortonWorks service are installed using the following default topology:
- Master Node (Ambari Server): HDFS NameNode, HDFS Client, HDFS DataNode, Kafka Broker, HBase RegionServer, HBase Client, Zeppelin Master, Oozie Server, Spark Client, Tez Client, History Server, Node Manager
- Client Node 1: HDFS SecondaryNameNode, HDFS Client, HDFS DataNode, Kafka Broker, HBase Master, HBase RegionServer, HBase Client, ZooKeeper Server, Spark JobHistory Server, Hive WebHCat Server, Spark Client, Tez Client
- Client Node 2: HDFS Client, HDFS DataNode, Kafka Broker, HBase RegionServer, HBase Client, MySQL Server, Hive MetaStore, Hive Server, Resource Manager, App Timeline Server, ZooKeeper Client, Spark Client, Tez Client, MapReduce2 Client
- Client Nodes 3 & 4: HDFS Client, HDFS DataNode, Kafka Broker, HBase Client, ZooKeeper Client, Spark Client, Hive Client, Oozie Client, Tez Client, YARN Client, MapReduce2 Client
- Execute the following command from the repo base directory:

```
ansible-playbook --vault-id @prompt provision_hortonworks.yml
```

Note: The `.retry` files do not work; if the run fails, simply re-run the playbook.
- When the playbook has completed execution, Ansible will print a message specifying the URL to access the Ambari console.
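Once the URL is printed, you can optionally sanity-check that the console is reachable from the Ansible host. A minimal sketch, assuming the default Ambari web port of 8080 and substituting the host from Ansible's output for the placeholder:

```
# Expect HTTP 200 (or a redirect) once Ambari is fully up.
curl -s -o /dev/null -w "%{http_code}\n" http://AMBARI_HOST:8080/
```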
The Cloudera deployment installs the following services in Cloudera Manager:
- HBase
- HDFS
- Hive
- Hue
- Spark
- Kafka
- Oozie
- YARN (MapReduce2 included)
- Zookeeper
The specific components of each Cloudera service are installed using the following default topology:
- Master Server (Cloudera Manager): Cloudera Management Service, HBase RegionServer, HDFS DataNode, Hive Metastore Server, HiveServer2, Hue Server, Oozie Server, YARN NodeManager, Spark JobHistory Server, Spark Server
- Client 1: HBase Thrift Server, HBase RegionServer, HDFS DataNode, YARN NodeManager, YARN ResourceManager, ZooKeeper Server, Spark Server
- Client 2: HBase REST Server, HBase RegionServer, HDFS DataNode, Hue Load Balancer, YARN ResourceManager, ZooKeeper Server, Spark Server
- Client 3: HBase Master, HBase RegionServer, HDFS SecondaryNameNode, HDFS DataNode, YARN JobHistory Server, YARN NodeManager, ZooKeeper Server, Spark Server
- Client 4: HBase RegionServer, HDFS NameNode, HDFS DataNode, Hive WebHCat Server, Kafka Broker, YARN NodeManager, Spark Server
- Edit the `group_vars/all` file and create PostgreSQL database passwords by setting the values for `cloudera_db_password`, `hive_metastore_db_password`, `hue_db_password`, and `oozie_db_password`, as in the sketch below.
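For illustration only, the resulting block might look like this; every password shown is a placeholder (values such as the output of `openssl rand -base64 16` would be reasonable):

```
cloudera_db_password: REPLACE_ME        # placeholder
hive_metastore_db_password: REPLACE_ME  # placeholder
hue_db_password: REPLACE_ME             # placeholder
oozie_db_password: REPLACE_ME           # placeholder
```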
- Execute the following command from the repo base directory:

```
ansible-playbook --vault-id @prompt provision_cloudera.yml
```
Note: The `.retry` files do not work. If the scripts failed while provisioning the AWS instances, you can re-run the playbook. If they failed during the Python-based Cloudera setup, you will need to delete your EC2 instances and try again.
- When the playbook has completed execution, Ansible will print a message specifying the URL to access the Cloudera Manager console.
In certain cases, such as AWS environments with very limited bandwidth, it may be necessary to set up local mirrors of the HortonWorks or Cloudera repositories. To do this, first create a new EC2 instance to host the repositories, then use the `reposync` utility to clone all necessary repositories, and finally update the `group_vars/all` properties to point to the local repositories.
- Provision a new EC2 instance using either the Red Hat or CentOS AMI. Ensure the instance is of size t2.medium or larger.
- SSH into the instance and become the `root` user.
- Execute the following commands to disable SELinux, install and start the Apache httpd web server, install the wget and createrepo utilities, and create a directory from which httpd will serve the repositories:

```
setenforce 0                           # disable SELinux for the current boot
yum -y install httpd wget createrepo
systemctl start httpd
systemctl enable httpd                 # start httpd on boot
mkdir /var/www/html/repos              # served by httpd as /repos
```
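Note that `setenforce 0` only disables SELinux until the next reboot. If the repository host may be rebooted during its lifetime, one way to make the change persistent (a sketch, adjust to your policy) is:

```
# Switch SELinux to permissive mode across reboots.
sed -i 's/^SELINUX=enforcing/SELINUX=permissive/' /etc/selinux/config
```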
- Execute the following commands to create a full mirror of the Ambari, HDP, and HDP-Utils repositories. Note that `reposync` runs in the background, so wait for it to finish (see the monitoring sketch below) before running the `createrepo` commands:

```
cd /etc/yum.repos.d
wget http://public-repo-1.hortonworks.com/ambari/centos7/2.x/updates/2.5.2.0/ambari.repo http://public-repo-1.hortonworks.com/HDP/centos7/2.x/updates/2.6.2.14/hdp.repo
cd /var/www/html/repos
nohup reposync -r ambari-2.5.2.0 -r HDP-2.6.2.14 -r HDP-UTILS-1.1.0.21 &
createrepo ambari-2.5.2.0
createrepo HDP-2.6.2.14
createrepo HDP-UTILS-1.1.0.21
```
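Because the sync runs under `nohup` in the background, its output lands in `nohup.out` in the working directory. A couple of simple ways to check on it before running `createrepo`:

```
tail -f nohup.out    # follow reposync output (Ctrl-C to stop following)
jobs                 # confirm whether the background job is still running
```

Once `createrepo` has run, each mirror should expose metadata at `repodata/repomd.xml`, which can be checked with, for example, `curl -I http://localhost/repos/ambari-2.5.2.0/repodata/repomd.xml`.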
- Execute the following commands to create a full mirror of the Cloudera Manager repository and a partial mirror of the parcels repository that only pulls the necessary artifacts. As above, wait for the backgrounded `reposync` and `wget` jobs to finish before running `createrepo`:

```
cd /etc/yum.repos.d
wget https://archive.cloudera.com/cm5/redhat/7/x86_64/cm/cloudera-manager.repo
cd /var/www/html/repos
nohup reposync -r cloudera-manager &
mkdir cloudera-parcels
cd cloudera-parcels
nohup wget http://archive.cloudera.com/cdh5/parcels/5.13.0.29/CDH-5.13.0-1.cdh5.13.0.p0.29-el7.parcel http://archive.cloudera.com/cdh5/parcels/5.13.0.29/manifest.json &
cd /var/www/html/repos
createrepo cloudera-manager
createrepo cloudera-parcels
```
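A quick sanity check that the parcel artifacts arrived intact before pointing the cluster at them:

```
cd /var/www/html/repos/cloudera-parcels
ls -lh    # expect the CDH .parcel file (roughly GB-scale) and manifest.json
```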
- Edit the `group_vars/all` file and update the following properties, inserting the private IP address of the EC2 instance hosting the repositories in place of `repo_server_private_ip`:
- For HortonWorks:

```
ambari_repo_7: http://repo_server_private_ip/repos/ambari-2.5.2.0
hdp_repo_7: http://repo_server_private_ip/repos/HDP-2.6.2.14
hdp_utils_repo_7: http://repo_server_private_ip/repos/HDP-UTILS-1.1.0.21
```

- For Cloudera:

```
cloudera_manager_repo: http://repo_server_private_ip/repos/cloudera-manager
cloudera_parcel_repo: http://repo_server_private_ip/repos/cloudera-parcels
```
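Before re-running the provisioning playbook, it can be worth confirming from another instance in the same subnet that the mirrors are reachable over HTTP. A minimal sketch, with `repo_server_private_ip` standing in for the real address:

```
# Expect HTTP 200 for the mirrored repo metadata.
curl -s -o /dev/null -w "%{http_code}\n" http://repo_server_private_ip/repos/ambari-2.5.2.0/repodata/repomd.xml
```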
- NiFi Resources
  - We are using ODP NiFi Version 0.1.0
- Elastic
  - We are using Elastic version 6.6.0
- Kibana
  - We are using Kibana version 6.6.0
- Ansible Resources
- HortonWorks Resources
- Cloudera Resources
- Amazon Web Services Resources