Cloud Cluster Team
The objective of the cloud cluster team is to run a Spark job (jar) on an AWS EMR cluster, triggered remotely from a local computer.
We developed an ssh action in Oozie on a local machine that runs a bash script on the EMR cluster. The bash script contains the spark-submit command and is responsible for uploading the output to an AWS S3 bucket.
We then connect to EMR using the ssh action from Oozie (see note 1). Once the connection is established, a prepared shell script runs the spark-submit command on the EMR cluster. Make sure the Spark jar and the input files for the Spark job are present on the AWS EMR cluster before executing the job. Once the job has finished successfully, its output is migrated to an S3 bucket for visual analysis by the OLAP team.
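The script that the ssh action invokes could look roughly like the sketch below. It is not the team's actual script: the function name, the argument order, and the use of `aws s3 cp` for the upload are assumptions made for illustration. It also assumes the job writes its output to a local directory; output written to HDFS would need an HDFS-aware copy step instead.

```shell
#!/bin/bash
# Hypothetical sketch of the script run on the EMR cluster by the Oozie ssh action.
# Every path, class name, and bucket passed in is a placeholder chosen by the caller.

run_spark_job() {
    local jar="$1" main_class="$2" input="$3" output_dir="$4" bucket_uri="$5"

    # Fail early if the jar was not copied to the cluster beforehand.
    [ -f "$jar" ] || { echo "missing jar: $jar" >&2; return 1; }

    # Run the Spark job; stop if it fails so that no partial output is uploaded.
    spark-submit --class "$main_class" "$jar" "$input" "$output_dir" || return 1

    # Copy the job output to the S3 bucket (assumes output_dir is a local directory).
    aws s3 cp "$output_dir" "$bucket_uri" --recursive
}
```

A call would pass the real values as arguments, for example `run_spark_job /home/hadoop/job.jar com.example.Main input.csv /home/hadoop/output s3://example-bucket/output`, where every value shown is a placeholder.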
Keep in mind that Oozie is not aware of the status or progress of the Spark job running on EMR. From Oozie's perspective the action runs "infinitely", because there is no feedback from the cluster to notify Oozie that the job has finished. As such, this job needs to be the final job in the Oozie workflow.
If the EMR key is encrypted, decrypt it to allow passwordless ssh login.
Copy the secure key to id_rsa
cp emr-secure-key.pem id_rsa
Decrypt the copied key (enter the current passphrase when prompted, then leave the new passphrase empty)
ssh-keygen -p -f id_rsa
Move id_rsa to the .ssh directory
mv id_rsa ~/.ssh/id_rsa
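The three steps above can also be done non-interactively. The following is a minimal sketch, assuming OpenSSH's `ssh-keygen` is installed; the function name `install_emr_key` and its arguments are made up for illustration. Note that it overwrites any existing `id_rsa` in the target directory, and that passing the passphrase with `-P` can expose it in shell history or the process list, which the interactive prompt avoids.

```shell
#!/bin/bash
# Sketch: copy a passphrase-protected key, remove its passphrase, and install it.
# install_emr_key is a hypothetical helper, not part of any tool.

install_emr_key() {
    local src_key="$1" old_passphrase="$2" ssh_dir="${3:-$HOME/.ssh}"

    mkdir -p "$ssh_dir" && chmod 700 "$ssh_dir" || return 1
    cp "$src_key" "$ssh_dir/id_rsa" || return 1

    # ssh rejects private keys that other users can read.
    chmod 600 "$ssh_dir/id_rsa"

    # -p changes the passphrase: -P gives the old one, -N "" sets an empty new one.
    ssh-keygen -p -P "$old_passphrase" -N "" -f "$ssh_dir/id_rsa"
}
```

For example, `install_emr_key emr-secure-key.pem 'current-passphrase'` would leave a decrypted copy at ~/.ssh/id_rsa.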
Create an ssh-action within an Oozie workflow.
<action name="oozie-ssh">
    <ssh xmlns="uri:oozie:ssh-action:0.1">
        <host>${emr_hostname}</host>
        <command>/home/hadoop/test.sh 2> /home/hadoop/SparkCommand.log</command>
        <capture-output/>
    </ssh>
    <ok to="end"/>
    <error to="kill"/>
</action>
Run the following command and use its output in place of localhost in name_node and job_tracker
hostname -f
Specify the hostname and localhost in the job.properties file.
#Configuration Parameters
name_node = hdfs://localhost:8020
job_tracker = localhost:8032
emr_hostname = emr_hostname@emr_ipaddress
oozie.wf.application.path = /path_to_workflow_in_hdfs
oozie.use.system.libpath = true
oozie.action.ssh.allow.user.at.host = true
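Replacing localhost with the output of hostname -f can be scripted. The sketch below is one possible way to do it, not the team's procedure; `set_cluster_host` is a hypothetical helper name. It rewrites only the name_node and job_tracker lines and keeps a backup of the original file with a .bak suffix.

```shell
#!/bin/bash
# Sketch: substitute a host name for "localhost" in job.properties.
# set_cluster_host is a hypothetical helper, not part of Oozie.

set_cluster_host() {
    local properties_file="$1" host="$2"

    # Touch only the name_node and job_tracker entries; other lines stay unchanged.
    sed -i.bak -E "/^(name_node|job_tracker)[[:space:]]*=/ s/localhost/${host}/" "$properties_file"
}
```

A typical call would be `set_cluster_host job.properties "$(hostname -f)"`.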
Note 1: Make sure to decrypt your pem key if it is password protected. Rename the decrypted pem file to id_rsa and move it to the ~/.ssh/ directory on your local machine. This will allow you to ssh into the cluster from the terminal without a password.