Aws emr (#17)
* Typo fix

* EMR Serverless

* Storage refactoring

* EMR deps
mwiewior authored Nov 6, 2022
1 parent 16e458a commit b714092
Showing 28 changed files with 409 additions and 108 deletions.
9 changes: 0 additions & 9 deletions .github/workflows/default.yml
@@ -93,12 +93,3 @@ jobs:
download_external_modules: true # optional: download external terraform modules from public git repositories and terraform registry
log_level: DEBUG # optional: set log level. Default WARNING
container_user: 1000 # optional: Define what UID and / or what GID to run the container under to prevent permission issues
list-images: # Job that lists subdirectories
runs-on: self-hosted
outputs:
dir: ${{ steps.set-dirs.outputs.dir }} # generate output name dir by using inner step output
steps:
- uses: actions/checkout@v2
- id: set-dirs # Give the step an id so its outputs can be referenced in the outputs key above
run: echo "::set-output name=dir::['sequila-cloud-cli','spark-py/gke']"
# Define step output named dir based on ls command transformed to JSON thanks to jq
1 change: 1 addition & 0 deletions .gitignore
@@ -32,3 +32,4 @@ override.tf.json
.idea
venv
docker/spark-py/**/*.jar
modules/aws/emr-serverless/resources/dependencies
56 changes: 53 additions & 3 deletions README.md
@@ -21,6 +21,7 @@ Table of Contents
* [AWS](#aws)
* [Login](#login)
* [EKS](#eks)
* [EMR Serverless](#emr-serverless)
* [Azure](#azure-1)
* [Login](#login)
* [AKS](#aks)
@@ -64,8 +65,8 @@ as well. Check code comments for details.
| GCP | Dataproc |2.0.27-ubuntu18| 3.1.3 | 1.0.0 | 0.3.3 | -|
| GCP | Dataproc Serverless|1.0.21| 3.2.2 | 1.1.0 | 0.4.1 | gcr.io/${TF_VAR_project_name}/spark-py:pysequila-0.3.4-dataproc-latest |
| Azure | AKS |1.23.12|3.2.2|1.1.0|0.4.1| docker.io/biodatageeks/spark-py:pysequila-0.4.1-aks-latest|
| AWS | EKS|1.23.9 | 3.2.2 | 1.1.0 | 0.4.1 | docker.io/biodatageeks/spark-py:pysequila-0.4.1-aks-latest|
| AWS | EMR Serverless|xxx | 3.2.2 | 1.1.0 | 0.4.1 | |
| AWS | EKS|1.23.9 | 3.2.2 | 1.1.0 | 0.4.1 | docker.io/biodatageeks/spark-py:pysequila-0.4.1-eks-latest|
| AWS | EMR Serverless|emr-6.7.0 | 3.2.1 | 1.1.0 | 0.4.1 |- |

Based on the above table set software versions and Docker images accordingly, e.g.:
```bash
@@ -83,6 +84,7 @@ export TF_VAR_project_name=tbd-tbd-devel
export TF_VAR_region=europe-west2
export TF_VAR_zone=europe-west2-b
docker run --rm -it \
-v /var/run/docker.sock:/var/run/docker.sock \
-e TF_VAR_project_name=${TF_VAR_project_name} \
-e TF_VAR_region=${TF_VAR_region} \
-e TF_VAR_zone=${TF_VAR_zone} \
@@ -138,7 +140,7 @@ terraform init
* [AKS (Azure Kubernetes Service)](#AKS): :white_check_mark:

## AWS
* EMR Serverless: :soon:
* [EMR Serverless](#emr-serverless): :white_check_mark:
* [EKS(Elastic Kubernetes Service)](#EKS): :white_check_mark:

# AWS
@@ -189,6 +191,54 @@ sparkctl delete pysequila
terraform destroy -var-file=../../env/aws.tfvars -var-file=../../env/_all.tfvars -var-file=../../env/aws-eks.tfvars
```

## EMR Serverless
### Deploy
Unlike GCP Dataproc Serverless, which supports custom Docker images for the Spark driver and executors, AWS EMR Serverless
requires two preparation steps: packing a Python virtual environment into a tarball (using `venv-pack` or `conda-pack`) and copying the extra jar files
to an S3 bucket. Both steps are automated by the [emr-serverless](modules/aws/emr-serverless/README.md) module.
More information can be found [here](https://github.com/aws-samples/emr-serverless-samples/blob/main/examples/pyspark/dependencies/README.md).
Starting from EMR release `6.7.0`, extra jars can also be specified with the `--packages` option, but this requires an additional VPC NAT setup.

```bash
terraform apply -var-file=../../env/aws.tfvars -var-file=../../env/_all.tfvars -var-file=../../env/aws-emr.tfvars
```

### Run
The output of the above command includes a rendered command (together with the required environment variable exports) that you can use to launch a sample job:
```bash
Apply complete! Resources: 178 added, 0 changed, 0 destroyed.

Outputs:

emr_server_exec_role_arn = "arn:aws:iam::927478350239:role/sequila-role"
emr_serverless_command = <<EOT
export APPLICATION_ID=00f5c6prgt01190p
export JOB_ROLE_ARN=arn:aws:iam::927478350239:role/sequila-role
aws emr-serverless start-job-run \
--application-id $APPLICATION_ID \
--execution-role-arn $JOB_ROLE_ARN \
--job-driver '{
"sparkSubmit": {
"entryPoint": "s3://sequilabhp8knyc/jobs/pysequila/sequila-pileup.py",
"entryPointArguments": ["pyspark_pysequila-0.4.1.tar.gz"],
"sparkSubmitParameters": "--conf spark.driver.cores=1 --conf spark.driver.memory=2g --conf spark.executor.cores=1 --conf spark.executor.memory=4g --conf spark.executor.instances=1 --archives=s3://sequilabhp8knyc/venv/pysequila/pyspark_pysequila-0.4.1.tar.gz#environment --jars s3://sequilabhp8knyc/jars/sequila/1.1.0/* --conf spark.emr-serverless.driverEnv.PYSPARK_DRIVER_PYTHON=./environment/bin/python --conf spark.emr-serverless.driverEnv.PYSPARK_PYTHON=./environment/bin/python --conf spark.emr-serverless.executorEnv.PYSPARK_PYTHON=./environment/bin/python --conf spark.files=s3://sequilabhp8knyc/data/Homo_sapiens_assembly18_chr1_chrM.small.fasta,s3://sequilabhp8knyc/data/Homo_sapiens_assembly18_chr1_chrM.small.fasta.fai"
}
}'
EOT

```
![](doc/images/emr-serverless-job-1.png)
![](doc/images/emr-serverless-job-2.png)
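
The rendered command can also be assembled programmatically. The sketch below rebuilds the same `--job-driver` payload in Python and, as an assumption about how one might submit it without the AWS CLI, passes it to boto3's `start_job_run`. The helper names `build_job_driver` and `start_job` are hypothetical; the Spark settings and S3 paths mirror the sample output above.

```python
from typing import Dict, List


def build_job_driver(bucket: str, pysequila_version: str, sequila_version: str,
                     data_files: List[str]) -> Dict:
    """Build the jobDriver structure shown in the rendered command."""
    archive = f"pyspark_pysequila-{pysequila_version}.tar.gz"
    python = "./environment/bin/python"  # interpreter inside the unpacked archive
    params = [
        "--conf spark.driver.cores=1",
        "--conf spark.driver.memory=2g",
        "--conf spark.executor.cores=1",
        "--conf spark.executor.memory=4g",
        "--conf spark.executor.instances=1",
        f"--archives=s3://{bucket}/venv/pysequila/{archive}#environment",
        f"--jars s3://{bucket}/jars/sequila/{sequila_version}/*",
        f"--conf spark.emr-serverless.driverEnv.PYSPARK_DRIVER_PYTHON={python}",
        f"--conf spark.emr-serverless.driverEnv.PYSPARK_PYTHON={python}",
        f"--conf spark.emr-serverless.executorEnv.PYSPARK_PYTHON={python}",
        "--conf spark.files=" + ",".join(data_files),
    ]
    return {
        "sparkSubmit": {
            "entryPoint": f"s3://{bucket}/jobs/pysequila/sequila-pileup.py",
            "entryPointArguments": [archive],
            "sparkSubmitParameters": " ".join(params),
        }
    }


def start_job(application_id: str, job_role_arn: str, job_driver: Dict) -> str:
    """Submit the job with boto3 (needs AWS credentials; not called here)."""
    import boto3  # imported lazily so build_job_driver works without boto3

    client = boto3.client("emr-serverless")
    response = client.start_job_run(
        applicationId=application_id,
        executionRoleArn=job_role_arn,
        jobDriver=job_driver,
    )
    return response["jobRunId"]
```

Serializing the result of `build_job_driver` with `json.dumps` yields a string that can be passed to `--job-driver` on the command line.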

### Cleanup
```bash
terraform destroy -var-file=../../env/aws.tfvars -var-file=../../env/_all.tfvars -var-file=../../env/aws-emr.tfvars

```

# Azure
## Login
Install [Azure CLI](https://docs.microsoft.com/en-us/cli/azure/install-azure-cli) and set default subscription
9 changes: 7 additions & 2 deletions cloud/aws/README.md
@@ -20,14 +20,15 @@
|------|--------|---------|
| <a name="module_aws-job-code"></a> [aws-job-code](#module\_aws-job-code) | ../../modules/aws/jobs-code | n/a |
| <a name="module_eks"></a> [eks](#module\_eks) | terraform-aws-modules/eks/aws | v18.30.2 |
| <a name="module_emr-job"></a> [emr-job](#module\_emr-job) | ../../modules/aws/emr-serverless | n/a |
| <a name="module_spark-on-k8s-operator-eks"></a> [spark-on-k8s-operator-eks](#module\_spark-on-k8s-operator-eks) | ../../modules/kubernetes/spark-on-k8s-operator | n/a |
| <a name="module_storage"></a> [storage](#module\_storage) | ../../modules/aws/storage | n/a |
| <a name="module_vpc"></a> [vpc](#module\_vpc) | terraform-aws-modules/vpc/aws | v3.18.1 |

## Resources

| Name | Type |
|------|------|
| [aws_ecr_repository.ecr](https://registry.terraform.io/providers/hashicorp/aws/4.38.0/docs/resources/ecr_repository) | resource |
| [aws_eks_cluster.eks](https://registry.terraform.io/providers/hashicorp/aws/4.38.0/docs/data-sources/eks_cluster) | data source |
| [aws_eks_cluster_auth.eks](https://registry.terraform.io/providers/hashicorp/aws/4.38.0/docs/data-sources/eks_cluster_auth) | data source |

@@ -37,6 +38,7 @@
|------|-------------|------|---------|:--------:|
| <a name="input_aws-eks-deploy"></a> [aws-eks-deploy](#input\_aws-eks-deploy) | Deploy EKS service | `bool` | `false` | no |
| <a name="input_aws-emr-deploy"></a> [aws-emr-deploy](#input\_aws-emr-deploy) | Deploy EMR service | `bool` | `false` | no |
| <a name="input_aws-emr-release"></a> [aws-emr-release](#input\_aws-emr-release) | EMR Serverless release (needs to be >=6.6.0) | `string` | n/a | yes |
| <a name="input_data_files"></a> [data\_files](#input\_data\_files) | Data files to copy to staging bucket | `list(string)` | n/a | yes |
| <a name="input_eks_machine_type"></a> [eks\_machine\_type](#input\_eks\_machine\_type) | Machine size | `string` | `"t3.xlarge"` | no |
| <a name="input_eks_max_node_count"></a> [eks\_max\_node\_count](#input\_eks\_max\_node\_count) | Maximum number of kubernetes nodes | `number` | `2` | no |
@@ -48,5 +50,8 @@

## Outputs

No outputs.
| Name | Description |
|------|-------------|
| <a name="output_emr_server_exec_role_arn"></a> [emr\_server\_exec\_role\_arn](#output\_emr\_server\_exec\_role\_arn) | ARN of EMR Serverless execution role |
| <a name="output_emr_serverless_command"></a> [emr\_serverless\_command](#output\_emr\_serverless\_command) | EMR Serverless command to run a sample SeQuiLa job |
<!-- END OF PRE-COMMIT-TERRAFORM DOCS HOOK -->
49 changes: 28 additions & 21 deletions cloud/aws/main.tf
@@ -1,23 +1,22 @@
module "storage" {
source = "../../modules/aws/storage"
}


module "aws-job-code" {
bucket = module.storage.bucket
source = "../../modules/aws/jobs-code"
data_files = var.data_files
pysequila_version = var.pysequila_version
sequila_version = var.sequila_version
pysequila_image_eks = var.pysequila_image_eks
}

resource "aws_ecr_repository" "ecr" {
count = (var.aws-emr-deploy || var.aws-eks-deploy) ? 1 : 0
name = "ecr"
image_tag_mutability = "MUTABLE"
image_scanning_configuration {
scan_on_push = false
}
}



module "vpc" {
count = var.aws-eks-deploy ? 1 : 0
count = (var.aws-eks-deploy || var.aws-emr-deploy) ? 1 : 0
source = "terraform-aws-modules/vpc/aws"
version = "v3.18.1"

@@ -37,6 +36,19 @@ module "vpc" {
}
}

module "emr-job" {
source = "../../modules/aws/emr-serverless"
aws-emr-release = var.aws-emr-release
bucket = module.storage.bucket
pysequila_version = var.pysequila_version
sequila_version = var.sequila_version
data_files = [for f in var.data_files : "s3://${module.storage.bucket}/data/${f}" if length(regexall("fasta", f)) > 0]
subnet_ids = module.vpc[0].private_subnets
vpc_id = module.vpc[0].vpc_id
security_group_ids = [module.vpc[0].default_security_group_id]
}
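
The `data_files` expression in the `emr-job` module block keeps only file names that contain `fasta` and turns them into `s3://` URIs. A rough Python equivalent, under the assumption that `regexall("fasta", f)` behaves like an unanchored regex search, is shown below; the function name `fasta_s3_uris` is hypothetical.

```python
import re
from typing import List


def fasta_s3_uris(bucket: str, data_files: List[str]) -> List[str]:
    """Keep names containing "fasta" and map them to s3:// URIs under data/."""
    return [
        f"s3://{bucket}/data/{name}"
        for name in data_files
        if re.search("fasta", name)  # unanchored, like regexall("fasta", f)
    ]
```

Because the match is unanchored, index files such as `*.fasta.fai` pass the filter as well, which is consistent with the `spark.files` list in the sample job command in the README.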


module "eks" {
count = var.aws-eks-deploy ? 1 : 0
depends_on = [module.vpc]
@@ -62,29 +74,24 @@ module "eks" {
}

data "aws_eks_cluster_auth" "eks" {
name = module.eks[0].cluster_id
count = var.aws-eks-deploy ? 1 : 0
name = module.eks[0].cluster_id
}

data "aws_eks_cluster" "eks" {
name = module.eks[0].cluster_id
count = var.aws-eks-deploy ? 1 : 0
name = module.eks[0].cluster_id
}

provider "helm" {
alias = "eks"
kubernetes {
host = try(data.aws_eks_cluster.eks.endpoint, "")
token = data.aws_eks_cluster_auth.eks.token
cluster_ca_certificate = try(base64decode(data.aws_eks_cluster.eks.certificate_authority[0].data), "")
host = try(data.aws_eks_cluster.eks[0].endpoint, "")
token = try(data.aws_eks_cluster_auth.eks[0].token, "")
cluster_ca_certificate = try(base64decode(data.aws_eks_cluster.eks[0].certificate_authority[0].data), "")
}
}

#provider "kubernetes" {
# alias = "gke"
# host = try("https://${module.gke[0].endpoint}", "")
# token = data.google_client_config.default.access_token
# cluster_ca_certificate = try(module.gke[0].cluster_ca_certificate, "")
#}

module "spark-on-k8s-operator-eks" {
depends_on = [module.eks]
source = "../../modules/kubernetes/spark-on-k8s-operator"
9 changes: 9 additions & 0 deletions cloud/aws/output.tf
@@ -0,0 +1,9 @@
output "emr_server_exec_role_arn" {
value = try(module.emr-job.emr_server_exec_role_arn, "")
description = "ARN of EMR Serverless execution role"
}

output "emr_serverless_command" {
value = try(module.emr-job.emr_serverless_command, "")
description = "EMR Serverless command to run a sample SeQuiLa job"
}
5 changes: 5 additions & 0 deletions cloud/aws/variables.tf
@@ -27,6 +27,11 @@ variable "aws-emr-deploy" {
description = "Deploy EMR service"
}

variable "aws-emr-release" {
type = string
description = "EMR Serverless release (needs to be >=6.6.0)"
}
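
The variable description states that the release must be at least 6.6.0, but the variable itself carries no validation. One way to check a label before running `terraform apply` is sketched below. It assumes labels always have the form `emr-X.Y.Z`, as in `emr-6.7.0` from `env/aws-emr.tfvars`; the function names are hypothetical.

```python
import re
from typing import Tuple

# Minimum release named in the variable description.
MIN_RELEASE = (6, 6, 0)


def parse_emr_release(label: str) -> Tuple[int, int, int]:
    """Parse a label such as "emr-6.7.0" into a comparable tuple of ints."""
    match = re.fullmatch(r"emr-(\d+)\.(\d+)\.(\d+)", label)
    if match is None:
        raise ValueError(f"unexpected EMR release label: {label!r}")
    major, minor, patch = match.groups()
    return (int(major), int(minor), int(patch))


def is_supported_release(label: str) -> bool:
    """True if the label is at or above the minimum release."""
    return parse_emr_release(label) >= MIN_RELEASE
```

Comparing integer tuples avoids the pitfall of comparing version strings lexically, where `"6.10.0"` would sort before `"6.6.0"`.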

variable "aws-eks-deploy" {
type = bool
default = false
Binary file added doc/images/emr-serverless-job-1.png
Binary file added doc/images/emr-serverless-job-2.png
2 changes: 1 addition & 1 deletion docker/sequila-cloud-cli/Dockerfile
@@ -18,7 +18,7 @@ ENV TZ=Europe/Warsaw
RUN ln -snf /usr/share/zoneinfo/$TZ /etc/localtime && echo $TZ > /etc/timezone && \
apt -y update && \
apt -y upgrade && \
apt install -y curl gnupg lsb-release software-properties-common apt-transport-https ca-certificates git pwgen unzip zip
apt install -y curl gnupg lsb-release software-properties-common apt-transport-https ca-certificates git pwgen unzip zip docker.io

RUN curl -fsSL https://apt.releases.hashicorp.com/gpg | apt-key add - && \
apt-add-repository "deb [arch=amd64] https://apt.releases.hashicorp.com $(lsb_release -cs) main" && \
3 changes: 2 additions & 1 deletion env/aws-emr.tfvars
@@ -1 +1,2 @@
aws-emr-deploy = true
aws-emr-deploy = true
aws-emr-release = "emr-6.7.0"
50 changes: 50 additions & 0 deletions modules/aws/emr-serverless/README.md
@@ -0,0 +1,50 @@
# emr

<!-- BEGINNING OF PRE-COMMIT-TERRAFORM DOCS HOOK -->
## Requirements

No requirements.

## Providers

| Name | Version |
|------|---------|
| <a name="provider_aws"></a> [aws](#provider\_aws) | n/a |
| <a name="provider_external"></a> [external](#provider\_external) | n/a |
| <a name="provider_null"></a> [null](#provider\_null) | n/a |

## Modules

No modules.

## Resources

| Name | Type |
|------|------|
| [aws_emrserverless_application.emr-serverless](https://registry.terraform.io/providers/hashicorp/aws/latest/docs/resources/emrserverless_application) | resource |
| [aws_iam_role.EMRServerlessS3RuntimeRole](https://registry.terraform.io/providers/hashicorp/aws/latest/docs/resources/iam_role) | resource |
| [aws_s3_object.pysequila-venv-pack](https://registry.terraform.io/providers/hashicorp/aws/latest/docs/resources/s3_object) | resource |
| [aws_s3_object.sequila-deps](https://registry.terraform.io/providers/hashicorp/aws/latest/docs/resources/s3_object) | resource |
| [null_resource.pysequila-venv-pack](https://registry.terraform.io/providers/hashicorp/null/latest/docs/resources/resource) | resource |
| [external_external.dependencies-extract](https://registry.terraform.io/providers/hashicorp/external/latest/docs/data-sources/external) | data source |

## Inputs

| Name | Description | Type | Default | Required |
|------|-------------|------|---------|:--------:|
| <a name="input_aws-emr-release"></a> [aws-emr-release](#input\_aws-emr-release) | EMR Serverless release (needs to be >=6.6.0) | `string` | n/a | yes |
| <a name="input_bucket"></a> [bucket](#input\_bucket) | Bucket name for code, dependencies, etc. | `string` | n/a | yes |
| <a name="input_data_files"></a> [data\_files](#input\_data\_files) | Data files to copy to staging bucket | `list(string)` | n/a | yes |
| <a name="input_pysequila_version"></a> [pysequila\_version](#input\_pysequila\_version) | PySeQuiLa version | `string` | n/a | yes |
| <a name="input_security_group_ids"></a> [security\_group\_ids](#input\_security\_group\_ids) | Default security group ids | `list(string)` | n/a | yes |
| <a name="input_sequila_version"></a> [sequila\_version](#input\_sequila\_version) | SeQuiLa version | `string` | n/a | yes |
| <a name="input_subnet_ids"></a> [subnet\_ids](#input\_subnet\_ids) | List of subnets | `list(string)` | n/a | yes |
| <a name="input_vpc_id"></a> [vpc\_id](#input\_vpc\_id) | VPC | `string` | n/a | yes |

## Outputs

| Name | Description |
|------|-------------|
| <a name="output_emr_server_exec_role_arn"></a> [emr\_server\_exec\_role\_arn](#output\_emr\_server\_exec\_role\_arn) | ARN of EMR Serverless execution role |
| <a name="output_emr_serverless_command"></a> [emr\_serverless\_command](#output\_emr\_serverless\_command) | EMR Serverless command to run a sample SeQuiLa job |
<!-- END OF PRE-COMMIT-TERRAFORM DOCS HOOK -->
72 changes: 72 additions & 0 deletions modules/aws/emr-serverless/module.tf
@@ -0,0 +1,72 @@
locals {
resources_dir = "${path.module}/resources"
jars_dir = "${local.resources_dir}/dependencies"
}

resource "aws_emrserverless_application" "emr-serverless" {
name = "sequila"
release_label = var.aws-emr-release
type = "spark"
auto_stop_configuration {
enabled = true
idle_timeout_minutes = 5
}
network_configuration {
subnet_ids = var.subnet_ids
security_group_ids = var.security_group_ids
}

}

resource "aws_iam_role" "EMRServerlessS3RuntimeRole" {
name = "sequila-role"

tags = {
Name = "sequila-role"
}

assume_role_policy = jsonencode(
{
Version = "2012-10-17",
Statement = [
{
Action = "sts:AssumeRole"
Principal = {
Service = "emr-serverless.amazonaws.com"
}
Effect = "Allow"
}
]
}
)
}



data "external" "dependencies-extract" {
program = ["${local.resources_dir}/jar_extractor.sh", var.pysequila_version, local.resources_dir, local.jars_dir]


}

resource "aws_s3_object" "sequila-deps" {
for_each = toset(split(",", data.external.dependencies-extract.result.jars))
key = "jars/sequila/${var.sequila_version}/${each.key}"
source = "${local.jars_dir}/${each.key}"
bucket = var.bucket
acl = "public-read"
}
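
The `external` data source expects its program to print a single JSON object whose values are all strings, which is why the jar list arrives as one comma-separated `jars` string and is split again in `for_each`. The contents of `jar_extractor.sh` are not part of this diff, so the following Python stand-in is purely hypothetical and only illustrates the output contract.

```python
import json
import os
import sys
from typing import List


def list_jars(jars_dir: str) -> List[str]:
    """Return the sorted *.jar file names found directly in jars_dir."""
    return sorted(name for name in os.listdir(jars_dir) if name.endswith(".jar"))


def external_result(jar_names: List[str]) -> str:
    """Encode the result as a JSON object with string values only."""
    return json.dumps({"jars": ",".join(jar_names)})


if __name__ == "__main__" and len(sys.argv) > 1:
    # Hypothetical usage: python list_jars.py <jars_dir>
    print(external_result(list_jars(sys.argv[1])))
```

One caveat of this encoding: an empty directory produces `"jars": ""`, and Terraform's `split(",", "")` returns a list containing one empty string rather than an empty list, so a consumer may want to filter out empty names.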

resource "null_resource" "pysequila-venv-pack" {
provisioner "local-exec" {
command = "${local.resources_dir}/venv_packer.sh ${var.pysequila_version} ${local.resources_dir}"
}
}

resource "aws_s3_object" "pysequila-venv-pack" {
depends_on = [null_resource.pysequila-venv-pack]
key = "venv/pysequila/pyspark_pysequila-${var.pysequila_version}.tar.gz"
source = "${local.resources_dir}/venv/pyspark_pysequila-${var.pysequila_version}.tar.gz"
bucket = var.bucket
acl = "public-read"
}