
cost-optimized-spark-on-kubernetes

Running Cost Optimized Spark workloads on Kubernetes using EC2 Spot Instances and Amazon Elastic Kubernetes Service

This GitHub repository contains sample configuration files for running Apache Spark on Kubernetes using Amazon Elastic Kubernetes Service (Amazon EKS) and Amazon EC2 Spot Instances. We recommend reading this blog post for more information on this topic. The blog post also contains a detailed tutorial for the step-by-step instructions below.

What you’ll run

A word-count Spark application that counts the words in an Amazon Customer Reviews dataset and writes the output to an Amazon S3 folder.
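
For reference, below is a minimal sketch of what such a word-count application might look like; the bucket paths and the review_body column name are assumptions, and script.py in this repository is the authoritative version.

    from pyspark.sql import SparkSession
    from pyspark.sql.functions import col, explode, split

    # Placeholder locations; replace with your dataset path and the output bucket created below.
    INPUT_PATH = "s3a://<DATASET_BUCKET>/amazon-customer-reviews/"
    OUTPUT_PATH = "s3a://<S3_BUCKET>/output/"

    spark = SparkSession.builder.appName("amazon-reviews-word-count").getOrCreate()

    # Split the review text into words and count the occurrences of each word.
    reviews = spark.read.parquet(INPUT_PATH)
    word_counts = (
        reviews
        .select(explode(split(col("review_body"), r"\s+")).alias("word"))
        .groupBy("word")
        .count()
    )

    word_counts.write.mode("overwrite").csv(OUTPUT_PATH)
    spark.stop()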

Step-by-step Instructions

  • Create an S3 bucket

  • Create Amazon S3 Access Policy

    aws iam create-policy --policy-name spark-s3-policy --policy-document file://spark-s3.json
    

    In spark-s3.json, replace the output folder with the name of the bucket you created.
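
    As a rough sketch, and assuming the job only needs to read and write objects in this bucket, spark-s3.json would resemble the following IAM policy (the actual file in this repository may differ):

    {
      "Version": "2012-10-17",
      "Statement": [
        {
          "Effect": "Allow",
          "Action": [
            "s3:GetObject",
            "s3:PutObject",
            "s3:ListBucket"
          ],
          "Resource": [
            "arn:aws:s3:::<S3_BUCKET>",
            "arn:aws:s3:::<S3_BUCKET>/*"
          ]
        }
      ]
    }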

  • Create an EKS cluster using the following command

    eksctl create cluster --name=sparkonk8 --node-private-networking --without-nodegroup --asg-access --region=<AWS Region>
    
  • Create the nodegroups using the nodegroup config file. Replace the placeholder string in managedNodeGroups.yml with the policy ARN returned in the Amazon S3 access policy step.

    eksctl create nodegroup -f managedNodeGroups.yml
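
    The managedNodeGroups.yml in this repository is the authoritative version; a simplified sketch of the pattern it follows, with an On-Demand nodegroup for driver pods and a Spot nodegroup for executor pods, and some settings such as autoscaler tags omitted, might look like:

    apiVersion: eksctl.io/v1alpha5
    kind: ClusterConfig

    metadata:
      name: sparkonk8
      region: <AWS Region>

    managedNodeGroups:
      # On-Demand nodegroup intended for Spark driver pods
      - name: spark-driver-ng
        instanceTypes: ["m5.xlarge"]
        minSize: 1
        maxSize: 10
        desiredCapacity: 1
        iam:
          attachPolicyARNs:
            - arn:aws:iam::aws:policy/AmazonEKSWorkerNodePolicy
            - arn:aws:iam::aws:policy/AmazonEKS_CNI_Policy
            - arn:aws:iam::aws:policy/AmazonEC2ContainerRegistryReadOnly
            - <POLICY ARN>   # spark-s3-policy ARN from the earlier step
      # Spot nodegroup intended for Spark executor pods
      - name: spark-executor-ng
        instanceTypes: ["m5.xlarge", "m5a.xlarge", "m4.xlarge"]
        spot: true
        minSize: 0
        maxSize: 20
        desiredCapacity: 1
        iam:
          attachPolicyARNs:
            - arn:aws:iam::aws:policy/AmazonEKSWorkerNodePolicy
            - arn:aws:iam::aws:policy/AmazonEKS_CNI_Policy
            - arn:aws:iam::aws:policy/AmazonEC2ContainerRegistryReadOnly
            - <POLICY ARN>   # spark-s3-policy ARN from the earlier step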
    
  • Create a service account

    kubectl create serviceaccount spark
    kubectl create clusterrolebinding spark-role --clusterrole='edit' --serviceaccount=default:spark --namespace=default
    
  • Download and install the Cluster Autoscaler

    curl -LO https://raw.githubusercontent.com/kubernetes/autoscaler/master/cluster-autoscaler/cloudprovider/aws/examples/cluster-autoscaler-autodiscover.yaml
    

    Edit the file and replace the cluster name placeholder with the name of your cluster (sparkonk8 in this example).
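
    In the upstream example manifest, the cluster name goes into the --node-group-auto-discovery flag of the cluster-autoscaler container command, for example (excerpt, other flags omitted):

    command:
      - ./cluster-autoscaler
      - --v=4
      - --cloud-provider=aws
      - --expander=least-waste
      - --node-group-auto-discovery=asg:tag=k8s.io/cluster-autoscaler/enabled,k8s.io/cluster-autoscaler/sparkonk8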

    Install the Cluster Autoscaler

    kubectl apply -f cluster-autoscaler-autodiscover.yaml
    
  • Get the Kubernetes master URL, which is used as the k8s:// master endpoint when submitting the Spark job

    kubectl cluster-info
    
  • Build the Docker image using the instructions here
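
    If you build the image from an Apache Spark 3.x distribution, the bundled docker-image-tool.sh can produce a PySpark-enabled image and push it to a registry such as Amazon ECR; the registry URI and tag below are placeholders, and the commands assume the ECR repository already exists and that you have authenticated to it:

    # Run from the root of an unpacked Apache Spark distribution
    ./bin/docker-image-tool.sh \
      -r <ACCOUNT_ID>.dkr.ecr.<AWS Region>.amazonaws.com \
      -t v3.0.0 \
      -p ./kubernetes/dockerfiles/spark/bindings/python/Dockerfile \
      build

    ./bin/docker-image-tool.sh \
      -r <ACCOUNT_ID>.dkr.ecr.<AWS Region>.amazonaws.com \
      -t v3.0.0 \
      push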

  • Upload the application file (script.py) to the Amazon S3 bucket created earlier.
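
    For example:

    aws s3 cp script.py s3://<S3_BUCKET>/script.py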

  • Download the pod template files

    The pod templates schedule the driver pods on On-Demand Instances and the executor pods on Spot Instances.
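
    The template files in this repository are the authoritative versions; the underlying idea is a nodeSelector that pins drivers to On-Demand capacity and executors to Spot capacity, sketched here with the capacityType label that EKS applies to managed node groups:

    # driver_pod_template.yml (sketch)
    apiVersion: v1
    kind: Pod
    spec:
      nodeSelector:
        eks.amazonaws.com/capacityType: ON_DEMAND

    # executor_pod_template.yml (sketch)
    apiVersion: v1
    kind: Pod
    spec:
      nodeSelector:
        eks.amazonaws.com/capacityType: SPOT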

  • Submit the Spark job using the command here
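
    The exact command is in the blog post; a sketch of what it typically looks like, with the image URI, bucket, master URL, and template file names as placeholders, is:

    ./bin/spark-submit \
      --master k8s://<KUBERNETES MASTER URL> \
      --deploy-mode cluster \
      --name word-count \
      --conf spark.executor.instances=4 \
      --conf spark.kubernetes.container.image=<ACCOUNT_ID>.dkr.ecr.<AWS Region>.amazonaws.com/spark-py:v3.0.0 \
      --conf spark.kubernetes.authenticate.driver.serviceAccountName=spark \
      --conf spark.kubernetes.driver.podTemplateFile=driver_pod_template.yml \
      --conf spark.kubernetes.executor.podTemplateFile=executor_pod_template.yml \
      s3a://<S3_BUCKET>/script.py

    Running script.py directly from S3 assumes the container image includes the S3A (hadoop-aws) libraries.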

Cleanup

  • Delete the EKS cluster and the nodegroups with the following command:
    eksctl delete cluster --name sparkonk8
    
  • Delete the Amazon S3 Access Policy with the following command:
    aws iam delete-policy --policy-arn <POLICY ARN>
    
  • Delete the Amazon S3 Output Bucket with the following command:
    aws s3 rb --force s3://<S3_BUCKET>
    

Security

See CONTRIBUTING for more information.

License

This library is licensed under the MIT-0 License. See the LICENSE file.