Cloud Cluster Team

Responsibilities

The objective of the cloud cluster team is to run the Spark job (jar) on an AWS EMR cluster, triggered remotely from a local computer.

We developed an ssh action in Oozie on a local machine to run a bash script on the EMR cluster. The bash script contains the spark-submit command that runs the job; the job's output is then uploaded to an AWS S3 bucket.

We then connected to EMR using the ssh-action from Oozie ¹. Once a connection has been established, a prepared shell script runs the spark-submit command on the EMR cluster. Make sure the Spark jar and the input files for the Spark job are present on the AWS EMR cluster before executing the job. Once the job has finished successfully, the output of the Spark job is migrated to an S3 bucket for visual analysis by the OLAP team.
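As a rough illustration of the kind of script described here, the sketch below runs spark-submit and then copies the output to S3. The jar path, main class, input and output locations, and bucket name are placeholders, not values from this project. It also assumes that the job writes its output to a local directory on the master node and that the AWS CLI is installed there; output written to HDFS would need a different copy step. By default the script only prints the commands (DRY_RUN=1) so it can be reviewed safely.

```shell
#!/usr/bin/env bash
# Hypothetical sketch of the script started by the ssh action.
# Every path, the class name, and the bucket are placeholders (assumptions).

JAR="${JAR:-/home/hadoop/spark-job.jar}"               # assumed jar location
MAIN_CLASS="${MAIN_CLASS:-com.example.SparkJob}"       # assumed main class
INPUT="${INPUT:-/home/hadoop/input}"                   # assumed input location
OUTPUT="${OUTPUT:-/home/hadoop/output}"                # assumed local output directory
BUCKET="${BUCKET:-s3://example-bucket/spark-output/}"  # assumed bucket
DRY_RUN="${DRY_RUN:-1}"                                # 1 = print only, 0 = execute

# Print a command, then execute it unless DRY_RUN is 1.
run() {
  printf '+ %s\n' "$*"
  if [ "$DRY_RUN" = "0" ]; then
    "$@"
  fi
}

main() {
  # Run the Spark job; stop here if it fails so nothing stale is uploaded.
  run spark-submit --class "$MAIN_CLASS" "$JAR" "$INPUT" "$OUTPUT" || return 1
  # Copy the finished output to S3 with the AWS CLI.
  run aws s3 cp "$OUTPUT" "$BUCKET" --recursive
}

main
```

Setting DRY_RUN=0 in the environment makes the same script execute the two commands instead of printing them.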

Keep in mind that Oozie is not aware of the status or progress of the Spark job running on EMR. From Oozie's perspective the job runs "infinitely", as there is no feedback from the cluster to notify Oozie that the job has finished. As such, this job needs to be the final job in the Oozie workflow.

Configuring the Cloud Cluster

If the EMR key is encrypted, decrypt it to allow passwordless ssh login.

Copy secure key to id_rsa.

cp emr-secure-key.pem id_rsa

Decrypt the copied key. Enter the current passphrase when prompted, then leave the new passphrase empty.

ssh-keygen -p -f id_rsa

Move id_rsa to the ~/.ssh directory

mv id_rsa ~/.ssh/id_rsa
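After the steps above, it can help to confirm that the key really is usable without a passphrase. The following is a minimal sketch, assuming OpenSSH's ssh-keygen is installed; the function names key_is_decrypted and check_key are made up for this example.

```shell
#!/usr/bin/env bash
# Sketch: check that a private key can be used without a passphrase.
KEY="${KEY:-$HOME/.ssh/id_rsa}"   # key location used in the steps above

# Succeeds only if the key can be read with an empty passphrase.
# -y prints the public key for a private key; -P "" supplies an empty
# passphrase so ssh-keygen does not stop to prompt.
key_is_decrypted() {
  ssh-keygen -y -P "" -f "$1" >/dev/null 2>&1
}

check_key() {
  if [ ! -f "$1" ]; then
    echo "missing: $1"
    return 1
  fi
  # ssh refuses private keys that other users can read.
  chmod 600 "$1"
  if key_is_decrypted "$1"; then
    echo "ok: $1 has no passphrase"
  else
    echo "encrypted: $1 still needs ssh-keygen -p"
    return 1
  fi
}

check_key "$KEY" || true
```

Once the key checks out, a command such as `ssh -o BatchMode=yes <user>@<emr-host> true` should return without prompting; BatchMode makes ssh fail rather than ask for a password.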

Create an ssh-action within an Oozie workflow.

<action name="oozie-ssh">
    <ssh xmlns="uri:oozie:ssh-action:0.1">
        <host>${emr_hostname}</host>
        <command>/home/hadoop/test.sh 2> /home/hadoop/SparkCommand.log</command>
        <capture-output/>
    </ssh>
    <ok to="end"/>
    <error to="kill"/>
</action>

Run the following command and use its output in place of localhost in name_node and job_tracker.

hostname -f

Specify the hostname and localhost in the job.properties file.

# Configuration parameters
name_node = hdfs://localhost:8020
job_tracker = localhost:8032
emr_hostname = emr_hostname@emr_ipaddress
oozie.wf.application.path = /path_to_workflow_in_hdfs
oozie.use.system.libpath = true
oozie.action.ssh.allow.user.at.host = true
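With the workflow and job.properties in place, the workflow can be uploaded to HDFS and started from the Oozie command line. The sketch below shows one way this might look. It assumes the workflow definition is saved locally as workflow.xml, that the Oozie server listens on http://localhost:11000/oozie, and that the hdfs and oozie clients are on the PATH; none of these come from this page. As a precaution it only prints the commands unless DRY_RUN=0 is set.

```shell
#!/usr/bin/env bash
# Sketch: upload the workflow definition and start it with the Oozie CLI.
OOZIE_URL="${OOZIE_URL:-http://localhost:11000/oozie}"  # assumed Oozie server URL
WF_DIR="${WF_DIR:-/path_to_workflow_in_hdfs}"           # same path as in job.properties
DRY_RUN="${DRY_RUN:-1}"                                 # 1 = print only, 0 = execute

# Print a command, then execute it unless DRY_RUN is 1.
run() {
  printf '+ %s\n' "$*"
  if [ "$DRY_RUN" = "0" ]; then
    "$@"
  fi
}

submit_workflow() {
  # Create the application directory in HDFS and copy the workflow into it.
  run hdfs dfs -mkdir -p "$WF_DIR" || return 1
  run hdfs dfs -put -f workflow.xml "$WF_DIR/" || return 1
  # Submit and start the workflow using the local job.properties file.
  run oozie job -oozie "$OOZIE_URL" -config job.properties -run
}

submit_workflow
```

When run for real, the last command prints a job ID, which can be passed to `oozie job -oozie <url> -info <job-id>` to inspect the workflow.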

Notes:

¹ Make sure to decrypt your pem key if it is password protected. Rename the decrypted pem file to id_rsa and move it to the ~/.ssh/ directory on your local machine. This will allow you to ssh into the cluster from the terminal without a password.
