diff --git a/.github/workflows/default.yml b/.github/workflows/default.yml index 3c56f1c..9f4da5a 100644 --- a/.github/workflows/default.yml +++ b/.github/workflows/default.yml @@ -93,12 +93,3 @@ jobs: download_external_modules: true # optional: download external terraform modules from public git repositories and terraform registry log_level: DEBUG # optional: set log level. Default WARNING container_user: 1000 # optional: Define what UID and / or what GID to run the container under to prevent permission issues - list-images: # Job that list subdirectories - runs-on: self-hosted - outputs: - dir: ${{ steps.set-dirs.outputs.dir }} # generate output name dir by using inner step output - steps: - - uses: actions/checkout@v2 - - id: set-dirs # Give it an id to handle to get step outputs in the outputs key above - run: echo "::set-output name=dir::['sequila-cloud-cli','spark-py/gke']" - # Define step output named dir base on ls command transformed to JSON thanks to jq diff --git a/.gitignore b/.gitignore index 06a9867..307f9a2 100644 --- a/.gitignore +++ b/.gitignore @@ -32,3 +32,4 @@ override.tf.json .idea venv docker/spark-py/**/*.jar +modules/aws/emr-serverless/resources/dependencies \ No newline at end of file diff --git a/README.md b/README.md index 1932e59..55ce291 100644 --- a/README.md +++ b/README.md @@ -21,6 +21,7 @@ Table of Contents * [AWS](#aws) * [Login](#login) * [EKS](#eks) + * [EMR Serverless](#emr-serverless) * [Azure](#azure-1) * [Login](#login) * [AKS](#aks) @@ -64,8 +65,8 @@ as well. Check code comments for details. 
| GCP | Dataproc |2.0.27-ubuntu18| 3.1.3 | 1.0.0 | 0.3.3 | -| | GCP | Dataproc Serverless|1.0.21| 3.2.2 | 1.1.0 | 0.4.1 | gcr.io/${TF_VAR_project_name}/spark-py:pysequila-0.3.4-dataproc-latest | | Azure | AKS |1.23.12|3.2.2|1.1.0|0.4.1| docker.io/biodatageeks/spark-py:pysequila-0.4.1-aks-latest| -| AWS | EKS|1.23.9 | 3.2.2 | 1.1.0 | 0.4.1 | docker.io/biodatageeks/spark-py:pysequila-0.4.1-aks-latest| -| AWS | EMR Serverless|xxx | 3.2.2 | 1.1.0 | 0.4.1 | | +| AWS | EKS|1.23.9 | 3.2.2 | 1.1.0 | 0.4.1 | docker.io/biodatageeks/spark-py:pysequila-0.4.1-eks-latest| +| AWS | EMR Serverless|emr-6.7.0 | 3.2.1 | 1.1.0 | 0.4.1 |- | Based on the above table set software versions and Docker images accordingly, e.g.: ```bash @@ -83,6 +84,7 @@ export TF_VAR_project_name=tbd-tbd-devel export TF_VAR_region=europe-west2 export TF_VAR_zone=europe-west2-b docker run --rm -it \ + -v /var/run/docker.sock:/var/run/docker.sock \ -e TF_VAR_project_name=${TF_VAR_project_name} \ -e TF_VAR_region=${TF_VAR_region} \ -e TF_VAR_zone=${TF_VAR_zone} \ @@ -138,7 +140,7 @@ terraform init * [AKS (Azure Kubernetes Service)](#AKS): :white_check_mark: ## AWS -* EMR Serverless: :soon: +* [EMR Serverless](#emr-serverless): :white_check_mark: * [EKS(Elastic Kubernetes Service)](#EKS): :white_check_mark: # AWS @@ -189,6 +191,54 @@ sparkctl delete pysequila terraform destroy -var-file=../../env/aws.tfvars -var-file=../../env/_all.tfvars -var-file=../../env/aws-eks.tfvars ``` +## EMR Serverless +### Deploy +Unlike GCP Dataproc Serverless, which supports custom Docker images for the Spark driver and executors, AWS EMR Serverless +requires two preparation steps: packing a Python virtual environment into a tarball (using `venv-pack` or `conda-pack`) and copying extra jar files +to an S3 bucket. Both steps are automated by the [emr-serverless](modules/aws/emr-serverless/README.md) module.
+More info can be found [here](https://github.com/aws-samples/emr-serverless-samples/blob/main/examples/pyspark/dependencies/README.md). +Starting with EMR release `6.7.0` it is possible to specify extra jars with the `--packages` option, but this requires an additional VPC NAT setup. + +```bash +terraform apply -var-file=../../env/aws.tfvars -var-file=../../env/_all.tfvars -var-file=../../env/aws-emr.tfvars +``` + +### Run +The output of the above command includes a rendered command (with the required environment variable exports) that you can use to launch a sample job: +```bash +Apply complete! Resources: 178 added, 0 changed, 0 destroyed. + +Outputs: + +emr_server_exec_role_arn = "arn:aws:iam::927478350239:role/sequila-role" +emr_serverless_command = < [aws-job-code](#module\_aws-job-code) | ../../modules/aws/jobs-code | n/a | | [eks](#module\_eks) | terraform-aws-modules/eks/aws | v18.30.2 | +| [emr-job](#module\_emr-job) | ../../modules/aws/emr-serverless | n/a | | [spark-on-k8s-operator-eks](#module\_spark-on-k8s-operator-eks) | ../../modules/kubernetes/spark-on-k8s-operator | n/a | +| [storage](#module\_storage) | ../../modules/aws/storage | n/a | | [vpc](#module\_vpc) | terraform-aws-modules/vpc/aws | v3.18.1 | ## Resources | Name | Type | |------|------| -| [aws_ecr_repository.ecr](https://registry.terraform.io/providers/hashicorp/aws/4.38.0/docs/resources/ecr_repository) | resource | | [aws_eks_cluster.eks](https://registry.terraform.io/providers/hashicorp/aws/4.38.0/docs/data-sources/eks_cluster) | data source | | [aws_eks_cluster_auth.eks](https://registry.terraform.io/providers/hashicorp/aws/4.38.0/docs/data-sources/eks_cluster_auth) | data source | @@ -37,6 +38,7 @@ |------|-------------|------|---------|:--------:| | [aws-eks-deploy](#input\_aws-eks-deploy) | Deploy EKS service | `bool` | `false` | no | | [aws-emr-deploy](#input\_aws-emr-deploy) | Deploy EMR service | `bool` | `false` | no | +| [aws-emr-release](#input\_aws-emr-release) | EMR Serverless
release (needs to be >=6.6.0) | `string` | n/a | yes | | [data\_files](#input\_data\_files) | Data files to copy to staging bucket | `list(string)` | n/a | yes | | [eks\_machine\_type](#input\_eks\_machine\_type) | Machine size | `string` | `"t3.xlarge"` | no | | [eks\_max\_node\_count](#input\_eks\_max\_node\_count) | Maximum number of kubernetes nodes | `number` | `2` | no | @@ -48,5 +50,8 @@ ## Outputs -No outputs. +| Name | Description | +|------|-------------| +| [emr\_server\_exec\_role\_arn](#output\_emr\_server\_exec\_role\_arn) | ARN of EMR Serverless execution role | +| [emr\_serverless\_command](#output\_emr\_serverless\_command) | EMR Serverless command to run a sample SeQuiLa job | diff --git a/cloud/aws/main.tf b/cloud/aws/main.tf index 6347e65..67de18c 100644 --- a/cloud/aws/main.tf +++ b/cloud/aws/main.tf @@ -1,4 +1,10 @@ +module "storage" { + source = "../../modules/aws/storage" +} + + module "aws-job-code" { + bucket = module.storage.bucket source = "../../modules/aws/jobs-code" data_files = var.data_files pysequila_version = var.pysequila_version @@ -6,18 +12,11 @@ module "aws-job-code" { pysequila_image_eks = var.pysequila_image_eks } -resource "aws_ecr_repository" "ecr" { - count = (var.aws-emr-deploy || var.aws-eks-deploy) ? 1 : 0 - name = "ecr" - image_tag_mutability = "MUTABLE" - image_scanning_configuration { - scan_on_push = false - } -} + module "vpc" { - count = var.aws-eks-deploy ? 1 : 0 + count = (var.aws-eks-deploy || var.aws-emr-deploy) ? 
1 : 0 source = "terraform-aws-modules/vpc/aws" version = "v3.18.1" @@ -37,6 +36,19 @@ module "vpc" { } } +module "emr-job" { + source = "../../modules/aws/emr-serverless" + aws-emr-release = var.aws-emr-release + bucket = module.storage.bucket + pysequila_version = var.pysequila_version + sequila_version = var.sequila_version + data_files = [for f in var.data_files : "s3://${module.storage.bucket}/data/${f}" if length(regexall("fasta", f)) > 0] + subnet_ids = module.vpc[0].private_subnets + vpc_id = module.vpc[0].vpc_id + security_group_ids = [module.vpc[0].default_security_group_id] +} + + module "eks" { count = var.aws-eks-deploy ? 1 : 0 depends_on = [module.vpc] @@ -62,29 +74,24 @@ module "eks" { } data "aws_eks_cluster_auth" "eks" { - name = module.eks[0].cluster_id + count = var.aws-eks-deploy ? 1 : 0 + name = module.eks[0].cluster_id } data "aws_eks_cluster" "eks" { - name = module.eks[0].cluster_id + count = var.aws-eks-deploy ? 1 : 0 + name = module.eks[0].cluster_id } provider "helm" { alias = "eks" kubernetes { - host = try(data.aws_eks_cluster.eks.endpoint, "") - token = data.aws_eks_cluster_auth.eks.token - cluster_ca_certificate = try(base64decode(data.aws_eks_cluster.eks.certificate_authority[0].data), "") + host = try(data.aws_eks_cluster.eks[0].endpoint, "") + token = try(data.aws_eks_cluster_auth.eks[0].token, "") + cluster_ca_certificate = try(base64decode(data.aws_eks_cluster.eks[0].certificate_authority[0].data), "") } } -#provider "kubernetes" { -# alias = "gke" -# host = try("https://${module.gke[0].endpoint}", "") -# token = data.google_client_config.default.access_token -# cluster_ca_certificate = try(module.gke[0].cluster_ca_certificate, "") -#} - module "spark-on-k8s-operator-eks" { depends_on = [module.eks] source = "../../modules/kubernetes/spark-on-k8s-operator" diff --git a/cloud/aws/output.tf b/cloud/aws/output.tf new file mode 100644 index 0000000..194538b --- /dev/null +++ b/cloud/aws/output.tf @@ -0,0 +1,9 @@ +output 
"emr_server_exec_role_arn" { + value = try(module.emr-job.emr_server_exec_role_arn, "") + description = "ARN of EMR Serverless execution role" +} + +output "emr_serverless_command" { + value = try(module.emr-job.emr_serverless_command, "") + description = "EMR Serverless command to run a sample SeQuiLa job" +} \ No newline at end of file diff --git a/cloud/aws/variables.tf b/cloud/aws/variables.tf index a705fc2..41056ee 100644 --- a/cloud/aws/variables.tf +++ b/cloud/aws/variables.tf @@ -27,6 +27,11 @@ variable "aws-emr-deploy" { description = "Deploy EMR service" } +variable "aws-emr-release" { + type = string + description = "EMR Serverless release (needs to be >=6.6.0)" +} + variable "aws-eks-deploy" { type = bool default = false diff --git a/doc/images/emr-serverless-job-1.png b/doc/images/emr-serverless-job-1.png new file mode 100644 index 0000000..04181be Binary files /dev/null and b/doc/images/emr-serverless-job-1.png differ diff --git a/doc/images/emr-serverless-job-2.png b/doc/images/emr-serverless-job-2.png new file mode 100644 index 0000000..fb02abc Binary files /dev/null and b/doc/images/emr-serverless-job-2.png differ diff --git a/docker/sequila-cloud-cli/Dockerfile b/docker/sequila-cloud-cli/Dockerfile index 15841cb..f4ac52b 100644 --- a/docker/sequila-cloud-cli/Dockerfile +++ b/docker/sequila-cloud-cli/Dockerfile @@ -18,7 +18,7 @@ ENV TZ=Europe/Warsaw RUN ln -snf /usr/share/zoneinfo/$TZ /etc/localtime && echo $TZ > /etc/timezone && \ apt -y update && \ apt -y upgrade && \ - apt install -y curl gnupg lsb-release software-properties-common apt-transport-https ca-certificates git pwgen unzip zip + apt install -y curl gnupg lsb-release software-properties-common apt-transport-https ca-certificates git pwgen unzip zip docker.io RUN curl -fsSL https://apt.releases.hashicorp.com/gpg | apt-key add - && \ apt-add-repository "deb [arch=amd64] https://apt.releases.hashicorp.com $(lsb_release -cs) main" && \ diff --git a/env/aws-emr.tfvars b/env/aws-emr.tfvars 
index 622faea..3f6b936 100644 --- a/env/aws-emr.tfvars +++ b/env/aws-emr.tfvars @@ -1 +1,2 @@ -aws-emr-deploy = true \ No newline at end of file +aws-emr-deploy = true +aws-emr-release = "emr-6.7.0" \ No newline at end of file diff --git a/modules/aws/emr-serverless/README.md b/modules/aws/emr-serverless/README.md new file mode 100644 index 0000000..f475c15 --- /dev/null +++ b/modules/aws/emr-serverless/README.md @@ -0,0 +1,50 @@ +# emr + + +## Requirements + +No requirements. + +## Providers + +| Name | Version | +|------|---------| +| [aws](#provider\_aws) | n/a | +| [external](#provider\_external) | n/a | +| [null](#provider\_null) | n/a | + +## Modules + +No modules. + +## Resources + +| Name | Type | +|------|------| +| [aws_emrserverless_application.emr-serverless](https://registry.terraform.io/providers/hashicorp/aws/latest/docs/resources/emrserverless_application) | resource | +| [aws_iam_role.EMRServerlessS3RuntimeRole](https://registry.terraform.io/providers/hashicorp/aws/latest/docs/resources/iam_role) | resource | +| [aws_s3_object.pysequila-venv-pack](https://registry.terraform.io/providers/hashicorp/aws/latest/docs/resources/s3_object) | resource | +| [aws_s3_object.sequila-deps](https://registry.terraform.io/providers/hashicorp/aws/latest/docs/resources/s3_object) | resource | +| [null_resource.pysequila-venv-pack](https://registry.terraform.io/providers/hashicorp/null/latest/docs/resources/resource) | resource | +| [external_external.dependencies-extract](https://registry.terraform.io/providers/hashicorp/external/latest/docs/data-sources/external) | data source | + +## Inputs + +| Name | Description | Type | Default | Required | +|------|-------------|------|---------|:--------:| +| [aws-emr-release](#input\_aws-emr-release) | EMR Serverless release (needs to be >=6.6.0) | `string` | n/a | yes | +| [bucket](#input\_bucket) | Bucket name for code, dependencies, etc. 
| `string` | n/a | yes | +| [data\_files](#input\_data\_files) | Data files to copy to staging bucket | `list(string)` | n/a | yes | +| [pysequila\_version](#input\_pysequila\_version) | PySeQuiLa version | `string` | n/a | yes | +| [security\_group\_ids](#input\_security\_group\_ids) | Default security group ids | `list(string)` | n/a | yes | +| [sequila\_version](#input\_sequila\_version) | SeQuiLa version | `string` | n/a | yes | +| [subnet\_ids](#input\_subnet\_ids) | List of subnets | `list(string)` | n/a | yes | +| [vpc\_id](#input\_vpc\_id) | VPC | `string` | n/a | yes | + +## Outputs + +| Name | Description | +|------|-------------| +| [emr\_server\_exec\_role\_arn](#output\_emr\_server\_exec\_role\_arn) | ARN of EMR Serverless execution role | +| [emr\_serverless\_command](#output\_emr\_serverless\_command) | EMR Serverless command to run a sample SeQuiLa job | + diff --git a/modules/aws/emr-serverless/module.tf b/modules/aws/emr-serverless/module.tf new file mode 100644 index 0000000..979ec01 --- /dev/null +++ b/modules/aws/emr-serverless/module.tf @@ -0,0 +1,72 @@ +locals { + resources_dir = "${path.module}/resources" + jars_dir = "${local.resources_dir}/dependencies" +} + +resource "aws_emrserverless_application" "emr-serverless" { + name = "sequila" + release_label = var.aws-emr-release + type = "spark" + auto_stop_configuration { + enabled = true + idle_timeout_minutes = 5 + } + network_configuration { + subnet_ids = var.subnet_ids + security_group_ids = var.security_group_ids + } + +} + +resource "aws_iam_role" "EMRServerlessS3RuntimeRole" { + name = "sequila-role" + + tags = { + Name = "sequila-role" + } + + assume_role_policy = jsonencode( + { + Version = "2012-10-17", + Statement = [ + { + Action = "sts:AssumeRole" + Principal = { + Service = "emr-serverless.amazonaws.com" + } + Effect = "Allow" + } + ] + } + ) +} + + + +data "external" "dependencies-extract" { + program = ["${local.resources_dir}/jar_extractor.sh", var.pysequila_version, 
local.resources_dir, local.jars_dir] + + +} + +resource "aws_s3_object" "sequila-deps" { + for_each = toset(split(",", data.external.dependencies-extract.result.jars)) + key = "jars/sequila/${var.sequila_version}/${each.key}" + source = "${local.jars_dir}/${each.key}" + bucket = var.bucket + acl = "public-read" +} + +resource "null_resource" "pysequila-venv-pack" { + provisioner "local-exec" { + command = "${local.resources_dir}/venv_packer.sh ${var.pysequila_version} ${local.resources_dir}" + } +} + +resource "aws_s3_object" "pysequila-venv-pack" { + depends_on = [null_resource.pysequila-venv-pack] + key = "venv/pysequila/pyspark_pysequila-${var.pysequila_version}.tar.gz" + source = "${local.resources_dir}/venv/pyspark_pysequila-${var.pysequila_version}.tar.gz" + bucket = var.bucket + acl = "public-read" +} diff --git a/modules/aws/emr-serverless/output.tf b/modules/aws/emr-serverless/output.tf new file mode 100644 index 0000000..5a54b27 --- /dev/null +++ b/modules/aws/emr-serverless/output.tf @@ -0,0 +1,23 @@ +output "emr_server_exec_role_arn" { + description = "ARN of EMR Serverless execution role" + value = aws_iam_role.EMRServerlessS3RuntimeRole.arn +} + +output "emr_serverless_command" { + description = "EMR Serverless command to run a sample SeQuiLa job" + value = <<-EOT + export APPLICATION_ID=${aws_emrserverless_application.emr-serverless.id} + export JOB_ROLE_ARN=${aws_iam_role.EMRServerlessS3RuntimeRole.arn} + + aws emr-serverless start-job-run \ + --application-id $APPLICATION_ID \ + --execution-role-arn $JOB_ROLE_ARN \ + --job-driver '{ + "sparkSubmit": { + "entryPoint": "s3://${var.bucket}/jobs/pysequila/sequila-pileup.py", + "entryPointArguments": ["pyspark_pysequila-${var.pysequila_version}.tar.gz"], + "sparkSubmitParameters": "--conf spark.dynamicAllocation.enabled=false --conf spark.driver.cores=1 --conf spark.driver.memory=2g --conf spark.executor.cores=1 --conf spark.executor.memory=4g --conf spark.executor.instances=1 
--archives=s3://${var.bucket}/venv/pysequila/pyspark_pysequila-${var.pysequila_version}.tar.gz#environment --jars s3://${var.bucket}/jars/sequila/${var.sequila_version}/* --conf spark.emr-serverless.driverEnv.PYSPARK_DRIVER_PYTHON=./environment/bin/python --conf spark.emr-serverless.driverEnv.PYSPARK_PYTHON=./environment/bin/python --conf spark.emr-serverless.executorEnv.PYSPARK_PYTHON=./environment/bin/python --conf spark.files=${join(",", var.data_files)}" + } + }' + EOT +} \ No newline at end of file diff --git a/modules/aws/emr-serverless/resources/Dockerfile.jars b/modules/aws/emr-serverless/resources/Dockerfile.jars new file mode 100644 index 0000000..f0a6691 --- /dev/null +++ b/modules/aws/emr-serverless/resources/Dockerfile.jars @@ -0,0 +1,6 @@ +ARG BASE_IMAGE +FROM $BASE_IMAGE as base + +FROM scratch AS export +ARG IVY_DIR_JARS=/opt/spark/.ivy2/jars +COPY --from=base $IVY_DIR_JARS/*.jar / \ No newline at end of file diff --git a/modules/aws/emr-serverless/resources/Dockerfile.venv b/modules/aws/emr-serverless/resources/Dockerfile.venv new file mode 100644 index 0000000..b5fca6e --- /dev/null +++ b/modules/aws/emr-serverless/resources/Dockerfile.venv @@ -0,0 +1,18 @@ +FROM --platform=linux/amd64 amazonlinux:2 AS base +ARG PYSEQUILA_VERSION +RUN yum install -y python3 + +ENV VIRTUAL_ENV=/opt/venv +RUN python3 -m venv $VIRTUAL_ENV +ENV PATH="$VIRTUAL_ENV/bin:$PATH" + +RUN python3 -m pip install --upgrade pip && \ + python3 -m pip install \ + pysequila==$PYSEQUILA_VERSION \ + venv-pack==0.2.0 + +RUN mkdir /output && venv-pack -o /output/pyspark_pysequila-${PYSEQUILA_VERSION}.tar.gz + +FROM scratch AS export +ARG PYSEQUILA_VERSION +COPY --from=base /output/pyspark_pysequila-${PYSEQUILA_VERSION}.tar.gz / \ No newline at end of file diff --git a/modules/aws/emr-serverless/resources/jar_extractor.sh b/modules/aws/emr-serverless/resources/jar_extractor.sh new file mode 100755 index 0000000..b442ae0 --- /dev/null +++ 
b/modules/aws/emr-serverless/resources/jar_extractor.sh @@ -0,0 +1,11 @@ +#!/usr/bin/env bash + +pysequila_version=$1 +resources_dir=$2 +jars_dir=$3 + +DOCKER_BUILDKIT=1 docker build --build-arg BASE_IMAGE=biodatageeks/spark-py:pysequila-${pysequila_version}-base-latest \ + -f ${resources_dir}/Dockerfile.jars \ + --output ${jars_dir} . &>/dev/null +jars=$(ls -1 $jars_dir/) +jq -n --arg inarr "${jars}" '{ jars: $inarr | split("\n") | join(",") }' \ No newline at end of file diff --git a/modules/aws/emr-serverless/resources/venv_packer.sh b/modules/aws/emr-serverless/resources/venv_packer.sh new file mode 100755 index 0000000..184a96c --- /dev/null +++ b/modules/aws/emr-serverless/resources/venv_packer.sh @@ -0,0 +1,9 @@ +#!/usr/bin/env bash + +pysequila_version=$1 +resources_dir=$2 + +DOCKER_BUILDKIT=1 docker build \ + -f ${resources_dir}/Dockerfile.venv \ + --build-arg PYSEQUILA_VERSION=$pysequila_version \ + --output ${resources_dir}/venv . \ No newline at end of file diff --git a/modules/aws/emr-serverless/variables.tf b/modules/aws/emr-serverless/variables.tf new file mode 100644 index 0000000..c81844d --- /dev/null +++ b/modules/aws/emr-serverless/variables.tf @@ -0,0 +1,39 @@ +variable "bucket" { + type = string + description = "Bucket name for code, dependencies, etc." 
+} + +variable "pysequila_version" { + type = string + description = "PySeQuiLa version" +} + +variable "sequila_version" { + type = string + description = "SeQuiLa version" +} + +variable "data_files" { + type = list(string) + description = "Data files to copy to staging bucket" +} + +variable "aws-emr-release" { + type = string + description = "EMR Serverless release (needs to be >=6.6.0)" +} + +variable "subnet_ids" { + type = list(string) + description = "List of subnets" +} + +variable "vpc_id" { + type = string + description = "VPC" +} + +variable "security_group_ids" { + type = list(string) + description = "Default security group ids" +} \ No newline at end of file diff --git a/modules/aws/emr/README.md b/modules/aws/emr/README.md deleted file mode 100644 index afb017a..0000000 --- a/modules/aws/emr/README.md +++ /dev/null @@ -1,27 +0,0 @@ -# emr - - -## Requirements - -No requirements. - -## Providers - -No providers. - -## Modules - -No modules. - -## Resources - -No resources. - -## Inputs - -No inputs. - -## Outputs - -No outputs. - diff --git a/modules/aws/emr/module.tf b/modules/aws/emr/module.tf deleted file mode 100644 index e69de29..0000000 diff --git a/modules/aws/emr/variables.tf b/modules/aws/emr/variables.tf deleted file mode 100644 index e69de29..0000000 diff --git a/modules/aws/jobs-code/README.md b/modules/aws/jobs-code/README.md index 216c1b5..28c6929 100644 --- a/modules/aws/jobs-code/README.md +++ b/modules/aws/jobs-code/README.md @@ -11,7 +11,6 @@ No requirements. |------|---------| | [aws](#provider\_aws) | n/a | | [local](#provider\_local) | n/a | -| [random](#provider\_random) | n/a | ## Modules @@ -21,18 +20,15 @@ No modules. 
| Name | Type | |------|------| -| [aws_s3_bucket.bucket](https://registry.terraform.io/providers/hashicorp/aws/latest/docs/resources/s3_bucket) | resource | | [aws_s3_object.sequila-data](https://registry.terraform.io/providers/hashicorp/aws/latest/docs/resources/s3_object) | resource | | [aws_s3_object.sequila-pileup](https://registry.terraform.io/providers/hashicorp/aws/latest/docs/resources/s3_object) | resource | | [local_file.deployment_file](https://registry.terraform.io/providers/hashicorp/local/latest/docs/resources/file) | resource | -| [local_file.py_file](https://registry.terraform.io/providers/hashicorp/local/latest/docs/resources/file) | resource | -| [random_string.storage_id](https://registry.terraform.io/providers/hashicorp/random/latest/docs/resources/string) | resource | -| [aws_caller_identity.current](https://registry.terraform.io/providers/hashicorp/aws/latest/docs/data-sources/caller_identity) | data source | ## Inputs | Name | Description | Type | Default | Required | |------|-------------|------|---------|:--------:| +| [bucket](#input\_bucket) | Bucket name for code, dependencies, etc. 
| `string` | n/a | yes | | [data\_files](#input\_data\_files) | Data files to copy to staging bucket | `list(string)` | n/a | yes | | [pysequila\_image\_eks](#input\_pysequila\_image\_eks) | EKS PySeQuiLa image | `string` | n/a | yes | | [pysequila\_version](#input\_pysequila\_version) | PySeQuiLa version | `string` | n/a | yes | diff --git a/modules/aws/jobs-code/module.tf b/modules/aws/jobs-code/module.tf index 398a0a2..a3b4e65 100644 --- a/modules/aws/jobs-code/module.tf +++ b/modules/aws/jobs-code/module.tf @@ -1,28 +1,6 @@ -data "aws_caller_identity" "current" {} - -resource "random_string" "storage_id" { - keepers = { - sub_id = data.aws_caller_identity.current.account_id - } - length = 8 - special = false - lower = true -} - -#tfsec:ignore:aws-s3-enable-bucket-encryption -#tfsec:ignore:aws-s3-enable-bucket-logging -#tfsec:ignore:aws-s3-enable-versioning -#tfsec:ignore:aws-s3-no-public-access-with-acl -#tfsec:ignore:aws-s3-specify-public-access-block -resource "aws_s3_bucket" "bucket" { - bucket = "sequila${lower(random_string.storage_id.id)}" - acl = "public-read" - -} - resource "aws_s3_object" "sequila-data" { for_each = toset(var.data_files) - bucket = aws_s3_bucket.bucket.bucket + bucket = var.bucket key = "data/${each.value}" source = "../../data/${each.value}" acl = "public-read" @@ -37,10 +15,10 @@ locals { sequila = SequilaSession.builder \ .appName("SeQuiLa") \ .getOrCreate() - + sequila.sparkContext.setLogLevel("INFO") sequila.sql("SET spark.biodatageeks.readAligment.method=disq") sequila\ - .pileup(f"s3a://${aws_s3_bucket.bucket.bucket}/data/NA12878.multichrom.md.bam", + .pileup(f"s3a://${var.bucket}/data/NA12878.multichrom.md.bam", f"Homo_sapiens_assembly18_chr1_chrM.small.fasta", False) \ .show(5) EOT @@ -76,12 +54,12 @@ locals { mode: cluster image: "${var.pysequila_image_eks}" imagePullPolicy: Always - mainApplicationFile: s3a://${aws_s3_bucket.bucket.bucket}/jobs/pysequila/sequila-pileup.py + mainApplicationFile: 
s3a://${var.bucket}/jobs/pysequila/sequila-pileup.py sparkVersion: "3.2.2" deps: files: - - s3a://${aws_s3_bucket.bucket.bucket}/data/Homo_sapiens_assembly18_chr1_chrM.small.fasta - - s3a://${aws_s3_bucket.bucket.bucket}/data/Homo_sapiens_assembly18_chr1_chrM.small.fasta.fai + - s3a://${var.bucket}/data/Homo_sapiens_assembly18_chr1_chrM.small.fasta + - s3a://${var.bucket}/data/Homo_sapiens_assembly18_chr1_chrM.small.fasta.fai filesDownloadDir: "/opt/spark/work-dir" hadoopConf: fs.s3a.aws.credentials.provider: org.apache.hadoop.fs.s3a.AnonymousAWSCredentialsProvider @@ -118,21 +96,14 @@ locals { } -resource "local_file" "py_file" { - content = local.py_file - filename = "../../jobs/aws/sequila-pileup.py" -} - resource "local_file" "deployment_file" { content = local.spark_k8s_deployment filename = "../../jobs/aws/eks/pysequila.yaml" } resource "aws_s3_object" "sequila-pileup" { - - key = "jobs/pysequila/sequila-pileup.py" - source = "../../jobs/aws/sequila-pileup.py" - bucket = aws_s3_bucket.bucket.bucket - acl = "public-read" - etag = filemd5("../../jobs/aws/sequila-pileup.py") + key = "jobs/pysequila/sequila-pileup.py" + content = local.py_file + bucket = var.bucket + acl = "public-read" } \ No newline at end of file diff --git a/modules/aws/jobs-code/variables.tf b/modules/aws/jobs-code/variables.tf index ee71ddc..1dca725 100644 --- a/modules/aws/jobs-code/variables.tf +++ b/modules/aws/jobs-code/variables.tf @@ -1,3 +1,7 @@ +variable "bucket" { + type = string + description = "Bucket name for code, dependencies, etc." +} variable "data_files" { type = list(string) description = "Data files to copy to staging bucket" diff --git a/modules/aws/storage/README.md b/modules/aws/storage/README.md new file mode 100644 index 0000000..aaf8a52 --- /dev/null +++ b/modules/aws/storage/README.md @@ -0,0 +1,36 @@ +# storage + + +## Requirements + +No requirements. 
+ +## Providers + +| Name | Version | +|------|---------| +| [aws](#provider\_aws) | n/a | +| [random](#provider\_random) | n/a | + +## Modules + +No modules. + +## Resources + +| Name | Type | +|------|------| +| [aws_s3_bucket.bucket](https://registry.terraform.io/providers/hashicorp/aws/latest/docs/resources/s3_bucket) | resource | +| [random_string.storage_id](https://registry.terraform.io/providers/hashicorp/random/latest/docs/resources/string) | resource | +| [aws_caller_identity.current](https://registry.terraform.io/providers/hashicorp/aws/latest/docs/data-sources/caller_identity) | data source | + +## Inputs + +No inputs. + +## Outputs + +| Name | Description | +|------|-------------| +| [bucket](#output\_bucket) | n/a | + diff --git a/modules/aws/storage/module.tf b/modules/aws/storage/module.tf new file mode 100644 index 0000000..9d9bc6a --- /dev/null +++ b/modules/aws/storage/module.tf @@ -0,0 +1,21 @@ +data "aws_caller_identity" "current" {} + +resource "random_string" "storage_id" { + keepers = { + sub_id = data.aws_caller_identity.current.account_id + } + length = 8 + special = false + lower = true +} + +#tfsec:ignore:aws-s3-enable-bucket-encryption +#tfsec:ignore:aws-s3-enable-bucket-logging +#tfsec:ignore:aws-s3-enable-versioning +#tfsec:ignore:aws-s3-no-public-access-with-acl +#tfsec:ignore:aws-s3-specify-public-access-block +resource "aws_s3_bucket" "bucket" { + bucket = "sequila${lower(random_string.storage_id.id)}" + acl = "public-read" + +} \ No newline at end of file diff --git a/modules/aws/storage/output.tf b/modules/aws/storage/output.tf new file mode 100644 index 0000000..9dca8c0 --- /dev/null +++ b/modules/aws/storage/output.tf @@ -0,0 +1,3 @@ +output "bucket" { + value = aws_s3_bucket.bucket.bucket +} \ No newline at end of file