diff --git a/README.md b/README.md
index 57d2b0a..1932e59 100644
--- a/README.md
+++ b/README.md
@@ -18,6 +18,9 @@ Table of Contents
   * [Azure](#azure)
   * [AWS](#aws)
 * [Init backends](#init)
+* [AWS](#aws-1)
+  * [Login](#login)
+  * [EKS](#eks)
 * [Azure](#azure-1)
   * [Login](#login-1)
   * [AKS](#aks)
@@ -25,22 +28,13 @@ Table of Contents
   * [Login](#login-2)
   * [General GCP setup](#general-gcp-setup)
   * [Dataproc](#dataproc)
-    * [Deploy](#deploy)
-    * [Run](#run)
-    * [Cleanup](#cleanup)
   * [Dataproc Serverless](#dataproc-serverless)
-    * [Deploy](#deploy)
-    * [Run](#run)
-    * [Cleanup](#cleanup)
   * [GKE](#gke)
-    * [Deploy](#deploy-1)
-    * [Run](#run-1)
-    * [Cleanup](#cleanup-1)
 * [Development and contribution](#development-and-contribution)
   * [Setup pre-commit checks](#setup-pre-commit-checks)
   * [Terraform doc](#terraform-doc)
- 
+
 
 # Disclaimer
 These are NOT production-ready examples. Terraform modules and Docker images are scanned/linted with tools such as
 [checkov](https://www.checkov.io/), [tflint](https://github.com/terraform-linters/tflint) and [tfsec](https://github.com/aquasecurity/tfsec)
@@ -50,17 +44,17 @@ as well. Check code comments for details.
 # Demo scenario
 1. The presented scenario can be deployed on one of the main cloud providers: Azure(Microsoft), AWS(Amazon) and GCP(Google).
 2. For each cloud two options are presented - deployment on managed Hadoop ecosystem (Azure - HDInsight, AWS - EMR, GCP - Dataproc) or
-or using managed Kubernetes service (Azure - AKS, AWS - EKS and GCP - GKE).
+   using managed Kubernetes service (Azure - AKS, AWS - EKS and GCP - GKE).
 3. The scenario includes the following steps:
-   1. setup distributed object storage
-   2. copy test data
-   3. setup computing environment
-   4. run a test PySeQuiLa job using PySpark using YARN or [spark-on-k8s-operator](https://github.com/GoogleCloudPlatform/spark-on-k8s-operator)
-   5. We assume that:
-      * on GCP: a project is created and attached to billing account
-      * on Azure: a subscription is created (A Google Cloud project is conceptually similar to the Azure subscription, in terms of billing, quotas, and limits).
+    1. setup distributed object storage
+    2. copy test data
+    3. setup computing environment
+    4. run a test PySeQuiLa job with PySpark using YARN or [spark-on-k8s-operator](https://github.com/GoogleCloudPlatform/spark-on-k8s-operator)
+    5. We assume that:
+       * on GCP: a project is created and attached to a billing account
+       * on Azure: a subscription is created (a Google Cloud project is conceptually similar to an Azure subscription in terms of billing, quotas, and limits).
 
 # Set SeQuiLa and PySeQuiLa versions
- 
+
 
 ## Support matrix
@@ -72,9 +66,8 @@
 | Azure | AKS |1.23.12|3.2.2|1.1.0|0.4.1| docker.io/biodatageeks/spark-py:pysequila-0.4.1-aks-latest|
 | AWS | EKS|1.23.9 | 3.2.2 | 1.1.0 | 0.4.1 | docker.io/biodatageeks/spark-py:pysequila-0.4.1-aks-latest|
 | AWS | EMR Serverless|xxx | 3.2.2 | 1.1.0 | 0.4.1 | |
-| AWS | EMR |emr-6.6.0 | 3.2.2 | 1.1.0 | 0.4.1 | |
 
-Based on the above table set software versions and Docker images accordingly, e.g.: 
+Based on the above table, set software versions and Docker images accordingly, e.g.:
 ```bash
 export TF_VAR_pysequila_version=0.4.1
 export TF_VAR_sequila_version=1.1.0
@@ -137,7 +130,7 @@ terraform init
 
 # Modules statuses
 ## GCP
-* [Dataproc](#Dataproc) :white_check_mark: 
+* [Dataproc](#Dataproc) :white_check_mark:
 * [Dataproc serverless](#dataproc-serverless) :white_check_mark:
 * [GKE (Google Kubernetes Engine)](#GKE) :white_check_mark:
 
@@ -149,7 +142,6 @@ terraform init
 * [EKS(Elastic Kubernetes Service)](#EKS): :white_check_mark:
 
 # AWS
-
 ## Login
 There are [a few](https://registry.terraform.io/providers/hashicorp/aws/latest/docs#authentication-and-configuration) authentication methods available. Pick the one that is most convenient
 for you - e.g. set `AWS_ACCESS_KEY_ID`, `AWS_SECRET_ACCESS_KEY`
@@ -228,7 +220,7 @@ aks-default-37875945-vmss000002 Ready agent 59m v1.20.9
 aks-default-37875945-vmss000003 Ready agent 59m v1.20.9
 ```
 2. Use `sparkctl` or `kubectl`: \
-and
+   and
 ```bash
 sparkctl create ../../jobs/azure/aks/pysequila.yaml
 ```
@@ -257,7 +249,7 @@ gcloud config set project $TF_VAR_project_name
 
 ## General GCP setup
 1. Set GCP project-related env variables, e.g.:
-:bulb: If you use our image all the env variables are already set.
+   :bulb: If you use our image, all the env variables are already set.
 
 ```bash
 export TF_VAR_project_name=tbd-tbd-devel
@@ -308,8 +300,8 @@ terraform destroy -var-file=../../env/gcp.tfvars -var-file=../../env/gcp-datapro
 terraform apply -var-file=../../env/gcp.tfvars -var-file=../../env/gcp-dataproc.tfvars -var-file=../../env/_all.tfvars
 ```
 2. According to the [documentation](https://cloud.google.com/dataproc-serverless/docs/guides/custom-containers), Dataproc Serverless
-services cannot fetch containers from other registries than GCP ones (in particular from `docker.io`). This is why you need to pull
-a required image from `docker.io` and push it to your project GCR(Google Container Registry), e.g.:
+   services cannot fetch containers from registries other than GCP ones (in particular from `docker.io`). This is why you need to pull
+   a required image from `docker.io` and push it to your project's GCR (Google Container Registry), e.g.:
 ```bash
 gcloud auth configure-docker
 docker tag biodatageeks/spark-py:pysequila-0.4.1-dataproc-b3c836e $TF_VAR_pysequila_image_dataproc
@@ -391,7 +383,7 @@ gke-tbd-tbd-devel-cl-tbd-tbd-devel-la-cb515767-dlr1 Ready 25m v1
 gke-tbd-tbd-devel-cl-tbd-tbd-devel-la-cb515767-r5l3 Ready 25m v1.21.5-gke.1302
 ```
 2. Install [sparkctl](https://github.com/GoogleCloudPlatform/spark-on-k8s-operator/tree/master/sparkctl) (recommended) or use `kubectl`: \
-:bulb: If you use our image all the tools are already installed.
+   :bulb: If you use our image, all the tools are already installed.
 ```bash
 sparkctl create ../../jobs/gcp/gke/pysequila.yaml