SeQuiLa recipes, examples and other cloud-related content demonstrating how to run SeQuiLa jobs in the cloud. For most tasks we use Terraform as the main IaC (Infrastructure as Code) tool.
- Disclaimer
- Demo scenario
- Support matrix
- Modules statuses
- Init backends
- AWS
- Azure
- GCP
- Development and contribution
- Terraform doc
These are NOT production-ready examples. The Terraform modules and Docker images are scanned/linted with tools such as checkov, tflint and tfsec, but some security tweaks have been disabled for the sake of simplicity. Some cloud deployment best practices have been intentionally skipped as well. Check the code comments for details.
- The presented scenario can be deployed on one of the main cloud providers: Azure (Microsoft), AWS (Amazon) and GCP (Google).
- For each cloud two options are presented - deployment on a managed Hadoop ecosystem (Azure - HDInsight, AWS - EMR, GCP - Dataproc) or on a managed Kubernetes service (Azure - AKS, AWS - EKS and GCP - GKE).
- Scenario includes the following steps:
- setup distributed object storage
- copy test data
- setup computing environment
- run a test PySeQuiLa job with PySpark on YARN or via the spark-on-k8s-operator
- We assume that:
- on AWS: an account is created
- on GCP: a project is created and attached to a billing account
- on Azure: a subscription is created (a Google Cloud project is conceptually similar to an Azure subscription in terms of billing, quotas and limits).
Cloud | Service | Release | Spark | SeQuiLa | PySeQuiLa | Image tag* |
---|---|---|---|---|---|---|
GCP | GKE | 1.23.8-gke.1900 | 3.2.2 | 1.1.0 | 0.4.1 | docker.io/biodatageeks/spark-py:pysequila-0.4.1-gke-latest |
GCP | Dataproc | 2.0.27-ubuntu18 | 3.1.3 | 1.0.0 | 0.3.3 | - |
GCP | Dataproc Serverless | 1.0.21 | 3.2.2 | 1.1.0 | 0.4.1 | gcr.io/${TF_VAR_project_name}/spark-py:pysequila-0.4.1-dataproc-latest |
Azure | AKS | 1.23.12 | 3.2.2 | 1.1.0 | 0.4.1 | docker.io/biodatageeks/spark-py:pysequila-0.4.1-aks-latest |
Azure | HDInsight | 5.0.300.1 | 3.2.2 | 1.1.0 | 0.4.1 | - |
AWS | EKS | 1.23.9 | 3.2.2 | 1.1.0 | 0.4.1 | docker.io/biodatageeks/spark-py:pysequila-0.4.1-eks-latest |
AWS | EMR Serverless | emr-6.7.0 | 3.2.1 | 1.1.0 | 0.4.1 | - |
Based on the table above, set the software versions and Docker images accordingly, e.g.: :bulb: These environment variables need to be set prior to launching the SeQuiLa-cli container.
### All clouds
export TF_VAR_pysequila_version=0.4.1
export TF_VAR_sequila_version=1.1.0
## GCP only
export TF_VAR_pysequila_image_gke=docker.io/biodatageeks/spark-py:pysequila-${TF_VAR_pysequila_version}-gke-latest
export TF_VAR_pysequila_image_dataproc=gcr.io/${TF_VAR_project_name}/spark-py:pysequila-${TF_VAR_pysequila_version}-dataproc-latest
## Azure only
export TF_VAR_pysequila_image_aks=docker.io/biodatageeks/spark-py:pysequila-${TF_VAR_pysequila_version}-aks-latest
## AWS only
export TF_VAR_pysequila_image_eks=docker.io/biodatageeks/spark-py:pysequila-${TF_VAR_pysequila_version}-eks-latest
💡 It is strongly recommended to use the biodatageeks/sequila-cloud-cli:latest image to run all the commands.
This image contains all the tools required both to set up the infrastructure and to run the SeQuiLa demo jobs.
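Once inside the container you can quickly confirm that the main tools referenced in this README are available, for example:

```bash
# quick sanity check of the bundled tooling
terraform version
kubectl version --client
sparkctl --help
k9s version
```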
## change to your project and region/zone
export TF_VAR_project_name=tbd-tbd-devel
export TF_VAR_region=europe-west2
export TF_VAR_zone=europe-west2-b
##
docker pull biodatageeks/sequila-cloud-cli:latest
docker run --rm -it \
-e TF_VAR_project_name=${TF_VAR_project_name} \
-e TF_VAR_region=${TF_VAR_region} \
-e TF_VAR_zone=${TF_VAR_zone} \
-e TF_VAR_pysequila_version=${TF_VAR_pysequila_version} \
-e TF_VAR_sequila_version=${TF_VAR_sequila_version} \
-e TF_VAR_pysequila_image_gke=${TF_VAR_pysequila_image_gke} \
biodatageeks/sequila-cloud-cli:latest
💡 The rest of the commands in this demo should be executed in the container.
cd git && git clone https://github.com/biodatageeks/sequila-cloud-recipes.git && \
cd sequila-cloud-recipes && \
cd cloud/gcp
terraform init
export TF_VAR_region=westeurope
docker pull biodatageeks/sequila-cloud-cli:latest
docker run --rm -it \
-e TF_VAR_region=${TF_VAR_region} \
-e TF_VAR_pysequila_version=${TF_VAR_pysequila_version} \
-e TF_VAR_sequila_version=${TF_VAR_sequila_version} \
-e TF_VAR_pysequila_image_aks=${TF_VAR_pysequila_image_aks} \
biodatageeks/sequila-cloud-cli:latest
💡 The rest of the commands in this demo should be executed in the container.
cd git && git clone https://github.com/biodatageeks/sequila-cloud-recipes.git && \
cd sequila-cloud-recipes && \
cd cloud/azure
terraform init
docker pull biodatageeks/sequila-cloud-cli:latest
docker run --rm -it \
-v /var/run/docker.sock:/var/run/docker.sock \
-e TF_VAR_pysequila_version=${TF_VAR_pysequila_version} \
-e TF_VAR_sequila_version=${TF_VAR_sequila_version} \
-e TF_VAR_pysequila_image_eks=${TF_VAR_pysequila_image_eks} \
biodatageeks/sequila-cloud-cli:latest
💡 The rest of the commands in this demo should be executed in the container.
cd git && git clone https://github.com/biodatageeks/sequila-cloud-recipes.git && \
cd sequila-cloud-recipes && \
cd cloud/aws
terraform init
- AKS (Azure Kubernetes Service): ✅
- HDInsight: ✅
There are a few authentication methods available. Pick the one that is most convenient for you, e.g. set the AWS_ACCESS_KEY_ID, AWS_SECRET_ACCESS_KEY and AWS_REGION environment variables.
export AWS_ACCESS_KEY_ID="anaccesskey"
export AWS_SECRET_ACCESS_KEY="asecretkey"
export AWS_REGION="eu-west-1"
💡 The above-mentioned user/service account should have admin privileges to manage EKS/EMR and S3 resources.
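Before running Terraform, you can verify that the credentials are picked up correctly, for example:

```bash
# prints the account ID, user ID and ARN of the configured identity
aws sts get-caller-identity
```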
- Ensure you are in the right subfolder
echo $PWD | rev | cut -f1,2 -d'/' | rev
cloud/aws
- Run
terraform apply -var-file=../../env/aws.tfvars -var-file=../../env/_all.tfvars -var-file=../../env/aws-eks.tfvars
- Connect to the K8S cluster, e.g.:
## Fetch configuration
aws eks update-kubeconfig --region eu-west-1 --name sequila
## Verify
kubectl get nodes
NAME STATUS ROLES AGE VERSION
ip-10-0-1-241.eu-west-1.compute.internal Ready <none> 36m v1.23.9-eks-ba74326
- Use sparkctl (recommended, available in the sequila-cli image) or kubectl (see the sketch below) to deploy a SeQuiLa job:
sparkctl create ../../jobs/aws/eks/pysequila.yaml
After a while you will be able to check the logs:
sparkctl log -f pysequila
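If you prefer plain kubectl over sparkctl, the equivalent calls look roughly like this (a sketch assuming the SparkApplication is named pysequila, so the operator creates a pysequila-driver pod):

```bash
kubectl apply -f ../../jobs/aws/eks/pysequila.yaml   # submit the SparkApplication
kubectl logs -f pysequila-driver                     # follow the driver logs
kubectl get sparkapplication pysequila               # check the application state
```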
💡 Alternatively, you can use the k9s tool (available in the image) to inspect the Spark driver's standard output.
When you are done, delete the job and destroy the infrastructure:
sparkctl delete pysequila
terraform destroy -var-file=../../env/aws.tfvars -var-file=../../env/_all.tfvars -var-file=../../env/aws-eks.tfvars
Unlike GCP Dataproc Serverless, which supports custom Docker images for the Spark driver and executors, AWS EMR Serverless requires both preparing a tarball of a Python virtual environment (using venv-pack or conda-pack) and copying extra JAR files to an S3 bucket. Both steps are automated by the emr-serverless module. More info can be found here.
Starting from EMR release 6.7.0 it is possible to specify extra JARs with the --packages option, but this requires an additional VPC/NAT setup.
:bulb: This is why it may take some time (depending on your network bandwidth) to prepare and upload the additional dependencies to an S3 bucket - please be patient.
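For reference, the manual equivalent of what the module automates looks roughly like this (a sketch only; the bucket path is an illustrative placeholder and the module's actual implementation may differ):

```bash
# build and pack a virtualenv with pysequila, then upload it next to the extra JARs
python3 -m venv pysequila-venv && source pysequila-venv/bin/activate
pip install pysequila==${TF_VAR_pysequila_version} venv-pack
venv-pack -o pyspark_pysequila-${TF_VAR_pysequila_version}.tar.gz
aws s3 cp pyspark_pysequila-${TF_VAR_pysequila_version}.tar.gz s3://<your-bucket>/venv/pysequila/
```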
terraform apply -var-file=../../env/aws.tfvars -var-file=../../env/_all.tfvars -var-file=../../env/aws-emr.tfvars
In the output of the above command you will find a rendered command (including environment variable exports) that you can use to launch a sample job:
Apply complete! Resources: 178 added, 0 changed, 0 destroyed.
Outputs:
emr_server_exec_role_arn = "arn:aws:iam::927478350239:role/sequila-role"
emr_serverless_command = <<EOT
export APPLICATION_ID=00f5c6prgt01190p
export JOB_ROLE_ARN=arn:aws:iam::927478350239:role/sequila-role
aws emr-serverless start-job-run \
--application-id $APPLICATION_ID \
--execution-role-arn $JOB_ROLE_ARN \
--job-driver '{
"sparkSubmit": {
"entryPoint": "s3://sequilabhp8knyc/jobs/pysequila/sequila-pileup.py",
"entryPointArguments": ["pyspark_pysequila-0.4.1.tar.gz"],
"sparkSubmitParameters": "--conf spark.driver.cores=1 --conf spark.driver.memory=2g --conf spark.executor.cores=1 --conf spark.executor.memory=4g --conf spark.executor.instances=1 --archives=s3://sequilabhp8knyc/venv/pysequila/pyspark_pysequila-0.4.1.tar.gz#environment --jars s3://sequilabhp8knyc/jars/sequila/1.1.0/* --conf spark.emr-serverless.driverEnv.PYSPARK_DRIVER_PYTHON=./environment/bin/python --conf spark.emr-serverless.driverEnv.PYSPARK_PYTHON=./environment/bin/python --conf spark.emr-serverless.executorEnv.PYSPARK_PYTHON=./environment/bin/python --conf spark.files=s3://sequilabhp8knyc/data/Homo_sapiens_assembly18_chr1_chrM.small.fasta,s3://sequilabhp8knyc/data/Homo_sapiens_assembly18_chr1_chrM.small.fasta.fai"
}
}'
EOT
terraform destroy -var-file=../../env/aws.tfvars -var-file=../../env/_all.tfvars -var-file=../../env/aws-emr.tfvars
Install the Azure CLI and set the default subscription
az login
az account set --subscription "Azure subscription 1"
💡 According to the release notes, HDInsight 5.0 comes with Apache Spark 3.1.2. Unfortunately, in reality it ships with 3.0.2.
Since HDInsight is in fact a full-fledged Hadoop cluster, we were able to add support for Apache Spark 3.2.2 to the Terraform module using the script action mechanism.
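A rough sketch of what such a script action could do (the download URL, version and paths below are illustrative assumptions, not the module's actual script):

```bash
#!/usr/bin/env bash
# hypothetical script action: install a standalone Spark 3.2.2 under /opt/spark
SPARK_VERSION=3.2.2
wget -q "https://archive.apache.org/dist/spark/spark-${SPARK_VERSION}/spark-${SPARK_VERSION}-bin-hadoop3.2.tgz"
sudo mkdir -p /opt/spark
sudo tar -xzf "spark-${SPARK_VERSION}-bin-hadoop3.2.tgz" -C /opt/spark --strip-components=1
/opt/spark/bin/spark-submit --version   # should now report 3.2.2
```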
export TF_VAR_hdinsight_gateway_password=$(tr -dc A-Za-z0-9 </dev/urandom | head -c 16 ; echo '')
export TF_VAR_hdinsight_ssh_password=$(tr -dc A-Za-z0-9 </dev/urandom | head -c 16 ; echo '')
terraform apply -var-file=../../env/azure.tfvars -var-file=../../env/azure-hdinsight.tfvars -var-file=../../env/_all.tfvars
Check the Terraform output variables for the SSH connection string, credentials and the spark-submit command, e.g.:
Apply complete! Resources: 0 added, 0 changed, 0 destroyed.
Outputs:
hdinsight_gateway_password = "w8aN6oVSJobq7eu4"
hdinsight_ssh_password = "wun6RzBBPWD9z9ke"
pysequila_submit_command = <<EOT
export SPARK_HOME=/opt/spark
spark-submit \
--master yarn \
--packages org.biodatageeks:sequila_2.12:1.1.0 \
--conf spark.pyspark.python=/usr/bin/miniforge/envs/py38/bin/python3 \
--conf spark.driver.cores=1 \
--conf spark.driver.memory=1g \
--conf spark.executor.cores=1 \
--conf spark.executor.memory=3g \
--conf spark.executor.instances=1 \
--conf spark.files=wasb://[email protected]/data/Homo_sapiens_assembly18_chr1_chrM.small.fasta,wasb://[email protected]/data/Homo_sapiens_assembly18_chr1_chrM.small.fasta.fai \
wasb://[email protected]/jobs/pysequila/sequila-pileup.py
EOT
ssh_command = "ssh [email protected]"
- Use `ssh_command` and `hdinsight_ssh_password` to connect to the head node.
- Run the `pysequila_submit_command` command.
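If you lose these rendered values, they can be read back from the Terraform state at any time, for example:

```bash
terraform output ssh_command
terraform output hdinsight_ssh_password
terraform output -raw pysequila_submit_command
```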
terraform destroy -var-file=../../env/azure.tfvars -var-file=../../env/azure-hdinsight.tfvars -var-file=../../env/_all.tfvars
- Ensure you are in the right subfolder
echo $PWD | rev | cut -f1,2 -d'/' | rev
cloud/azure
- Run
terraform apply -var-file=../../env/azure.tfvars -var-file=../../env/azure-aks.tfvars -var-file=../../env/_all.tfvars
- Connect to the K8S cluster, e.g.:
## Fetch configuration
az aks get-credentials --resource-group sequila-resources --name sequila-aks1
# check connectivity
kubectl get nodes
NAME STATUS ROLES AGE VERSION
aks-default-37875945-vmss000002 Ready agent 59m v1.20.9
aks-default-37875945-vmss000003 Ready agent 59m v1.20.9
- Use sparkctl (recommended, available in the sequila-cli image) or kubectl to deploy a SeQuiLa job:
sparkctl create ../../jobs/azure/aks/pysequila.yaml
After a while you will be able to check the logs:
sparkctl log -f pysequila
💡 Alternatively, you can use the k9s tool (available in the image) to inspect the Spark driver's standard output.
When you are done, delete the job and destroy the infrastructure:
sparkctl delete pysequila
terraform destroy -var-file=../../env/azure.tfvars -var-file=../../env/azure-aks.tfvars -var-file=../../env/_all.tfvars
- Install Cloud SDK
- Authenticate
gcloud auth application-default login
# set default project
gcloud config set project $TF_VAR_project_name
- Set the GCP project-related environment variables, e.g.: :bulb: If you use our image, all these environment variables are already set.
export TF_VAR_project_name=tbd-tbd-devel
export TF_VAR_region=europe-west2
export TF_VAR_zone=europe-west2-b
The above variables are necessary for both the Dataproc and GKE setups.
2. Ensure you are in the right subfolder
echo $PWD | rev | cut -f1,2 -d'/' | rev
cloud/gcp
terraform apply -var-file=../../env/gcp.tfvars -var-file=../../env/gcp-dataproc.tfvars -var-file=../../env/_all.tfvars
gcloud dataproc workflow-templates instantiate pysequila-workflow --region ${TF_VAR_region}
Waiting on operation [projects/tbd-tbd-devel/regions/europe-west2/operations/36cbc4dc-783c-336c-affd-147d24fa014c].
WorkflowTemplate [pysequila-workflow] RUNNING
Creating cluster: Operation ID [projects/tbd-tbd-devel/regions/europe-west2/operations/ef2869b4-d1eb-49d8-ba56-301c666d385b].
Created cluster: tbd-tbd-devel-cluster-s2ullo6gjaexa.
Job ID tbd-tbd-devel-job-s2ullo6gjaexa RUNNING
Job ID tbd-tbd-devel-job-s2ullo6gjaexa COMPLETED
Deleting cluster: Operation ID [projects/tbd-tbd-devel/regions/europe-west2/operations/0bff879e-1204-4971-ae9e-ccbf9c642847].
WorkflowTemplate [pysequila-workflow] DONE
Deleted cluster: tbd-tbd-devel-cluster-s2ullo6gjaexa.
terraform destroy -var-file=../../env/gcp.tfvars -var-file=../../env/gcp-dataproc.tfvars -var-file=../../env/_all.tfvars
- Prepare the infrastructure, including a Container Registry (see point 2)
terraform apply -var-file=../../env/gcp.tfvars -var-file=../../env/gcp-dataproc.tfvars -var-file=../../env/_all.tfvars
- According to the documentation, Dataproc Serverless cannot fetch containers from registries other than GCP ones (in particular not from docker.io). This is why you need to pull the required image from docker.io and push it to your project's GCR (Google Container Registry), e.g.:
gcloud auth configure-docker
docker pull biodatageeks/spark-py:pysequila-0.4.1-dataproc-b3c836e
docker tag biodatageeks/spark-py:pysequila-0.4.1-dataproc-b3c836e $TF_VAR_pysequila_image_dataproc
docker push $TF_VAR_pysequila_image_dataproc
gcloud dataproc batches submit pyspark gs://${TF_VAR_project_name}-staging/jobs/pysequila/sequila-pileup.py \
--batch=pysequila \
--region=${TF_VAR_region} \
--container-image=${TF_VAR_pysequila_image_dataproc} \
--version=1.0.21 \
--files gs://${TF_VAR_project_name}-staging/data/Homo_sapiens_assembly18_chr1_chrM.small.fasta,gs://${TF_VAR_project_name}-staging/data/Homo_sapiens_assembly18_chr1_chrM.small.fasta.fai
Batch [pysequila] submitted.
Pulling image gcr.io/bigdata-datascience/spark-py:pysequila-0.3.4-dataproc-b3c836e
Image is up to date for sha256:30b836594e0a768211ab209ad02ad3ad0fb1c40c0578b3503f08c4fadbab7c81
Waiting for container log creation
PYSPARK_PYTHON=/usr/bin/python3.9
JAVA_HOME=/usr/lib/jvm/temurin-11-jdk-amd64
SPARK_EXTRA_CLASSPATH=/opt/spark/.ivy2/jars/*
SLF4J: Class path contains multiple SLF4J bindings.
SLF4J: Found binding in [jar:file:/usr/lib/spark/jars/slf4j-reload4j-1.7.36.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in [jar:file:/opt/spark/.ivy2/jars/org.slf4j_slf4j-log4j12-1.7.25.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.
SLF4J: Actual binding is of type [org.slf4j.impl.Reload4jLoggerFactory]
:: loading settings :: file = /etc/spark/conf/ivysettings.xml
+------+---------+-------+---------+--------+--------+-----------+----+-----+
|contig|pos_start|pos_end| ref|coverage|countRef|countNonRef|alts|quals|
+------+---------+-------+---------+--------+--------+-----------+----+-----+
| 1| 34| 34| C| 1| 1| 0|null| null|
| 1| 35| 35| C| 2| 2| 0|null| null|
| 1| 36| 37| CT| 3| 3| 0|null| null|
| 1| 38| 40| AAC| 4| 4| 0|null| null|
| 1| 41| 49|CCTAACCCT| 5| 5| 0|null| null|
+------+---------+-------+---------+--------+--------+-----------+----+-----+
only showing top 5 rows
Batch [pysequila] finished.
metadata:
'@type': type.googleapis.com/google.cloud.dataproc.v1.BatchOperationMetadata
batch: projects/bigdata-datascience/locations/europe-west2/batches/pysequila
batchUuid: c798a09f-c690-4bc8-9dc8-6be5d1e565e0
createTime: '2022-11-04T08:37:17.627022Z'
description: Batch
operationType: BATCH
name: projects/bigdata-datascience/regions/europe-west2/operations/a746a63b-61ed-3cca-816b-9f2a4ccae2f8
- Remove Dataproc serverless batch
gcloud dataproc batches delete pysequila --region=${TF_VAR_region}
- Destroy infrastructure
terraform destroy -var-file=../../env/gcp.tfvars -var-file=../../env/gcp-dataproc.tfvars -var-file=../../env/_all.tfvars
terraform apply -var-file=../../env/gcp.tfvars -var-file=../../env/gcp-gke.tfvars -var-file=../../env/_all.tfvars
- Connect to the K8S cluster, e.g.:
## Fetch configuration
gcloud container clusters get-credentials ${TF_VAR_project_name}-cluster --zone ${TF_VAR_zone} --project ${TF_VAR_project_name}
# check connectivity
kubectl get nodes
NAME STATUS ROLES AGE VERSION
gke-tbd-tbd-devel-cl-tbd-tbd-devel-la-cb515767-8wqh Ready <none> 25m v1.21.5-gke.1302
gke-tbd-tbd-devel-cl-tbd-tbd-devel-la-cb515767-dlr1 Ready <none> 25m v1.21.5-gke.1302
gke-tbd-tbd-devel-cl-tbd-tbd-devel-la-cb515767-r5l3 Ready <none> 25m v1.21.5-gke.1302
- Use sparkctl (recommended, available in the sequila-cli image) or kubectl to deploy a SeQuiLa job:
sparkctl create ../../jobs/gcp/gke/pysequila.yaml
After a while you will be able to check the logs:
sparkctl log -f pysequila
💡 Alternatively, you can use the k9s tool (available in the image) to inspect the Spark driver's standard output.
When you are done, delete the job and destroy the infrastructure:
sparkctl delete pysequila
terraform destroy -var-file=../../env/gcp.tfvars -var-file=../../env/gcp-gke.tfvars -var-file=../../env/_all.tfvars
- Activate pre-commit integration
pre-commit install
- Install the pre-commit hook dependencies
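For example, the hook environments can be pre-installed ahead of the first commit (a hedged sketch; adjust to the hooks actually configured in .pre-commit-config.yaml):

```bash
# pre-install all hook environments defined in .pre-commit-config.yaml
pre-commit install --install-hooks
# install any linters a local hook expects on the PATH, e.g. checkov
pip install checkov
```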