Fixing ToC (#16)
mwiewior authored Nov 5, 2022
1 parent e0a8007 commit 16e458a
Showing 1 changed file with 20 additions and 28 deletions.
README.md: 20 additions, 28 deletions
@@ -18,29 +18,23 @@ Table of Contents
  * [Azure](#azure)
  * [AWS](#aws)
* [Init backends](#init)
* [AWS](#aws-1)
  * [Login](#login)
  * [EKS](#eks)
* [Azure](#azure-1)
  * [Login](#login-1)
  * [AKS](#aks)
* [GCP](#gcp-1)
  * [Login](#login-2)
  * [General GCP setup](#general-gcp-setup)
  * [Dataproc](#dataproc)
    * [Deploy](#deploy)
    * [Run](#run)
    * [Cleanup](#cleanup)
  * [Dataproc Serverless](#dataproc-serverless)
    * [Deploy](#deploy-1)
    * [Run](#run-1)
    * [Cleanup](#cleanup-1)
  * [GKE](#gke)
    * [Deploy](#deploy-2)
    * [Run](#run-2)
    * [Cleanup](#cleanup-2)
* [Development and contribution](#development-and-contribution)
  * [Setup pre-commit checks](#setup-pre-commit-checks)
  * [Terraform doc](#terraform-doc)


# Disclaimer
These are NOT production-ready examples. Terraform modules and Docker images are scanned/linted with tools such
as [checkov](https://www.checkov.io/), [tflint](https://github.com/terraform-linters/tflint) and [tfsec](https://github.com/aquasecurity/tfsec)
@@ -50,17 +44,17 @@ as well. Check code comments for details.
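If you want to run the same scanners locally before deploying anything, a minimal sketch (assuming the three tools are installed on your machine; exact flags may differ between versions):

```bash
# static analysis of the Terraform modules in the current directory
checkov -d .
tflint
tfsec .
```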
# Demo scenario
1. The presented scenario can be deployed on one of the main cloud providers: Azure (Microsoft), AWS (Amazon) or GCP (Google).
2. For each cloud two options are presented - deployment on a managed Hadoop ecosystem (Azure - HDInsight, AWS - EMR, GCP - Dataproc)
   or using a managed Kubernetes service (Azure - AKS, AWS - EKS and GCP - GKE).
3. The scenario includes the following steps (see the sketch after this list):
   1. set up distributed object storage
   2. copy test data
   3. set up the computing environment
   4. run a test PySeQuiLa job with PySpark on YARN or with [spark-on-k8s-operator](https://github.com/GoogleCloudPlatform/spark-on-k8s-operator)
4. We assume that:
   * on GCP: a project is created and attached to a billing account
   * on Azure: a subscription is created (a Google Cloud project is conceptually similar to an Azure subscription in terms of billing, quotas and limits).
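As a rough illustration only, the GKE variant of this flow boils down to commands like the ones below (taken from the sections further down; the `gcp-gke.tfvars` file name is an assumption by analogy with `gcp-dataproc.tfvars`):

```bash
# steps 1-3: provision object storage, test data and the compute environment with Terraform
terraform init
terraform apply -var-file=../../env/gcp.tfvars -var-file=../../env/gcp-gke.tfvars -var-file=../../env/_all.tfvars
# step 4: submit the test PySeQuiLa job through the spark-on-k8s-operator
sparkctl create ../../jobs/gcp/gke/pysequila.yaml
```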
# Set SeQuiLa and PySeQuiLa versions

## Support matrix


@@ -72,9 +66,8 @@
| Azure | AKS | 1.23.12 | 3.2.2 | 1.1.0 | 0.4.1 | docker.io/biodatageeks/spark-py:pysequila-0.4.1-aks-latest |
| AWS | EKS | 1.23.9 | 3.2.2 | 1.1.0 | 0.4.1 | docker.io/biodatageeks/spark-py:pysequila-0.4.1-aks-latest |
| AWS | EMR Serverless | xxx | 3.2.2 | 1.1.0 | 0.4.1 | |
| AWS | EMR | emr-6.6.0 | 3.2.2 | 1.1.0 | 0.4.1 | |

Based on the above table, set software versions and Docker images accordingly, e.g.:
```bash
export TF_VAR_pysequila_version=0.4.1
export TF_VAR_sequila_version=1.1.0
```
@@ -137,7 +130,7 @@ terraform init
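For container-based backends the matching image has to be exported as well; a minimal sketch for the Dataproc case, reusing the image tag that appears in the Dataproc Serverless section below (variable names for the other backends are analogous but not shown in this excerpt):

```bash
export TF_VAR_pysequila_image_dataproc=docker.io/biodatageeks/spark-py:pysequila-0.4.1-dataproc-b3c836e
```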
# Modules statuses
## GCP

* [Dataproc](#dataproc) :white_check_mark:
* [Dataproc serverless](#dataproc-serverless) :white_check_mark:
* [GKE (Google Kubernetes Engine)](#gke) :white_check_mark:

@@ -149,7 +142,6 @@
* [EKS (Elastic Kubernetes Service)](#eks) :white_check_mark:

# AWS

## Login
There are [a few](https://registry.terraform.io/providers/hashicorp/aws/latest/docs#authentication-and-configuration)
authentication methods available. Pick the one that is most convenient for you - e.g. set the `AWS_ACCESS_KEY_ID` and `AWS_SECRET_ACCESS_KEY` environment variables.
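A minimal sketch of the environment-variable approach (placeholder values; the region variable is an assumption, not taken from this repo):

```bash
export AWS_ACCESS_KEY_ID="<your-access-key-id>"
export AWS_SECRET_ACCESS_KEY="<your-secret-access-key>"
export AWS_REGION="eu-west-1"   # example region, adjust to your setup
```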
@@ -228,7 +220,7 @@
```
aks-default-37875945-vmss000002 Ready agent 59m v1.20.9
aks-default-37875945-vmss000003 Ready agent 59m v1.20.9
```
2. Use `sparkctl` or `kubectl`: \
and
```bash
sparkctl create ../../jobs/azure/aks/pysequila.yaml
```
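With plain `kubectl` the same manifest can be applied directly (a sketch, assuming the spark-on-k8s-operator CRDs are already installed in the cluster):

```bash
kubectl apply -f ../../jobs/azure/aks/pysequila.yaml
```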
@@ -257,7 +249,7 @@ gcloud config set project $TF_VAR_project_name

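The GCP Login steps themselves are collapsed in this view; a minimal sketch of the usual flow, assuming the `gcloud` CLI is installed:

```bash
gcloud auth login
gcloud auth application-default login   # credentials picked up by Terraform
gcloud config set project $TF_VAR_project_name
```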
## General GCP setup
1. Set GCP project-related env variables, e.g.:
:bulb: If you use our image, all the env variables are already set.

```bash
export TF_VAR_project_name=tbd-tbd-devel
```
@@ -308,8 +300,8 @@ terraform destroy -var-file=../../env/gcp.tfvars -var-file=../../env/gcp-datapro
```bash
terraform apply -var-file=../../env/gcp.tfvars -var-file=../../env/gcp-dataproc.tfvars -var-file=../../env/_all.tfvars
```
2. According to the [documentation](https://cloud.google.com/dataproc-serverless/docs/guides/custom-containers), Dataproc Serverless
   cannot fetch containers from registries other than Google's own (in particular, not from `docker.io`). This is why you need to pull
   the required image from `docker.io` and push it to your project's GCR (Google Container Registry), e.g.:
```bash
gcloud auth configure-docker
docker tag biodatageeks/spark-py:pysequila-0.4.1-dataproc-b3c836e $TF_VAR_pysequila_image_dataproc
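# Assumed follow-up (not shown in this excerpt): push the re-tagged image to your project registry
docker push $TF_VAR_pysequila_image_dataproc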
```
@@ -391,7 +383,7 @@
```
gke-tbd-tbd-devel-cl-tbd-tbd-devel-la-cb515767-dlr1 Ready <none> 25m v1
gke-tbd-tbd-devel-cl-tbd-tbd-devel-la-cb515767-r5l3 Ready <none> 25m v1.21.5-gke.1302
```
2. Install [sparkctl](https://github.com/GoogleCloudPlatform/spark-on-k8s-operator/tree/master/sparkctl) (recommended) or use `kubectl`: \
:bulb: If you use our image, all the tools are already installed.

```bash
sparkctl create ../../jobs/gcp/gke/pysequila.yaml
```
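As with AKS above, plain `kubectl` can submit the same manifest (a sketch, assuming the spark-on-k8s-operator CRDs are installed):

```bash
kubectl apply -f ../../jobs/gcp/gke/pysequila.yaml
```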
