Merge pull request #9 from wednesday-solutions/feat/glue-bash
Feat/glue-bash: Added automation script for Glue job deployment
vighnesh-wednesday authored Jan 4, 2024
2 parents 2aaa4ab + 328d619 commit 70bf2d3
Showing 12 changed files with 202 additions and 34 deletions.
15 changes: 9 additions & 6 deletions .github/workflows/cd.yml
@@ -14,6 +14,7 @@ jobs:
S3_BUCKET_NAME: ${{ secrets.S3_BUCKET_NAME }}
S3_SCRIPTS_PATH: ${{ secrets.S3_SCRIPTS_PATH }}
AWS_REGION: ${{ secrets.AWS_REGION }}
AWS_GLUE_ROLE: ${{ secrets.AWS_GLUE_ROLE }}
steps:
- uses: actions/checkout@v2

@@ -22,12 +23,12 @@ jobs:
with:
python-version: 3.9

- run: |
- name: Build App Wheel
run: |
pip install setuptools wheel
python3 setup.py bdist_wheel
# Step 1: Copy script to S3 bucket
- name: Copy script to S3 bucket
- name: Setup AWS cli & upload App Wheel to S3
uses: jakejarvis/[email protected]
with:
args: --follow-symlinks
@@ -36,7 +37,9 @@
DEST_DIR: $S3_SCRIPTS_PATH
AWS_S3_BUCKET: $S3_BUCKET_NAME

- name: Upload Scripts to S3
run: aws s3 cp jobs "s3://$S3_BUCKET_NAME/$S3_SCRIPTS_PATH/" --recursive --region ap-south-1

- name: Upload Script file to S3
run: aws s3 cp ./main.py "s3://$S3_BUCKET_NAME/$S3_SCRIPTS_PATH/" --region ap-south-1

- name: Deploy Jobs on Glue
run: |
automation/deploy_glue_job.sh $S3_BUCKET_NAME $AWS_GLUE_ROLE $KAGGLE_TOKEN $KAGGLE_USERNAME
7 changes: 3 additions & 4 deletions .github/workflows/ci.yml
@@ -9,7 +9,7 @@ on:
jobs:
run-ci:
runs-on: ubuntu-latest
container: dipanshuwed/glue4.0:latest
container: vighneshwed/glue4:latest

steps:
- name: Checkout repository
@@ -19,15 +19,14 @@ jobs:
run: |
python3 -m pip install --upgrade pip
pip3 install -r requirements.txt
yum install -y jq
- name: Type check
run: mypy ./ --ignore-missing-imports

- name: Lint
run: |
pylint app tests main.py setup.py
pylint app tests main.py setup.py --output pylint-report.txt
pylint app tests jobs setup.py
pylint app tests jobs setup.py --output pylint-report.txt
- name: Testing
run: |
22 changes: 11 additions & 11 deletions README.md
@@ -8,7 +8,7 @@ To run the same ETL code in multiple cloud services based on your preference, th

- Azure Databricks can't be configured locally; you can only connect your local IDE to a running cluster in Databricks. Push your code to a GitHub repo, then make a workflow in Databricks with the URL of the repo & file.
- For AWS Glue, I'm setting up a local environment using the Docker image, then deploying it to AWS Glue using GitHub Actions.
- The "tasks.txt" file contents the details of transformations done in the main file3.
- The "tasks.txt" file contents the details of transformations done in the main file.

## Requirements for Azure Databricks (for local connect only)
- [Unity Catalog](https://learn.microsoft.com/en-us/azure/databricks/data-governance/unity-catalog/enable-workspaces) enabled workspace.
@@ -26,22 +26,22 @@ To run the same ETL code in multiple cloud services based on your preference, th

2. Give your S3, ADLS & Kaggle (optional) paths in the ```app/.custom_env``` file.

3. Just run a Glue 4 docker conatiner & write your transformations in ```main.py``` file. Install dependancies using ```pip install -r requirements.txt```
3. Just run a Glue 4 Docker container & write your transformations in the ```jobs``` folder; refer to the ```demo.py``` file. Install dependencies using ```pip install -r requirements.txt```

4. Run your scirpts in the docker container locally using ```spark-sumbit main.py```
4. Run your scripts locally in the Docker container using ```spark-submit jobs/main.py```; see the sketch below.
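
A minimal sketch of that local loop, assuming the ```vighneshwed/glue4``` image from ```ci.yml``` ships ```spark-submit``` on its PATH and accepts a mounted workspace (the mount point and working directory are placeholders; adjust to your setup):
```
# Sketch only: the image name comes from ci.yml; mount point and workdir are assumptions.
docker run -it \
  -v "$PWD":/home/glue_user/workspace \
  -w /home/glue_user/workspace \
  vighneshwed/glue4:latest bash

# inside the container
pip install -r requirements.txt
spark-submit jobs/main.py
```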

## Deployment

1. In your AWS Glue job pass these parameters with thier correct values:
1. In your GitHub Actions Secrets, set up the following keys with their values:
```
JOB_NAME
KAGGLE_USERNAME
KAGGLE_KEY
GLUE_READ_PATH
GLUE_WRITE_PATH
KAGGLE_PATH (keep blank if not extracting)
AWS_ACCESS_KEY_ID
AWS_SECRET_ACCESS_KEY
S3_BUCKET_NAME
S3_SCRIPTS_PATH
AWS_REGION
AWS_GLUE_ROLE
```
Rest everything is taken care of in ```cd.yml``` file.
The remaining key-value pairs in the ```app/.custom_env``` file are passed via the AWS CLI in the ```cd.yml``` workflow, so there is no need to pass them manually to the job.
2. For Azure Databricks, make a workflow with the link of your repo & main file. Pass the following parameters with their correct values:
9 changes: 0 additions & 9 deletions app/.custom-env

This file was deleted.

9 changes: 9 additions & 0 deletions app/.custom_env
@@ -0,0 +1,9 @@
# This is my custom file for read & write paths based on the environment

GLUE_READ_PATH="s3://glue-bucket-vighnesh/rawdata/"
GLUE_WRITE_PATH="s3://glue-bucket-vighnesh/transformed/"

DATABRICKS_READ_PATH="/mnt/rawdata/"
DATABRICKS_WRITE_PATH="/mnt/transformed/"

KAGGLE_PATH="mastmustu/insurance-claims-fraud-data"
32 changes: 32 additions & 0 deletions automation/create_glue_job.json
@@ -0,0 +1,32 @@
{
"Name": "samplename",
"Description": "",
"LogUri": "",
"Role": "samplerole",
"ExecutionProperty": {
"MaxConcurrentRuns": 1
},
"Command": {
"Name": "glueetl",
"ScriptLocation": "sample-location",
"PythonVersion": "3"
},
"DefaultArguments": {
"--enable-glue-datacatalog": "true",
"--job-bookmark-option": "job-bookmark-disable",
"--TempDir": "sample-bucket/Logs/temp/",
"--enable-metrics": "true",
"--extra-py-files": "sample-bucket/scripts/sample-wheel",
"--spark-event-logs-path": "sample-bucket/Logs/UILogs/",
"--enable-job-insights": "false",
"--additional-python-modules": "python-dotenv,kaggle",
"--enable-observability-metrics": "true",
"--enable-continuous-cloudwatch-log": "true",
"--job-language": "python"
},
"MaxRetries": 0,
"Timeout": 10,
"WorkerType": "G.1X",
"NumberOfWorkers": 2,
"GlueVersion": "4.0"
}
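
The ```samplename```, ```samplerole```, and ```sample-location``` fields are placeholders that the deploy script below overwrites with jq before submission. As a hedged sketch, a filled-in copy of this template is submitted like so (the file name follows the script's ```automation/output_<job>.json``` convention and is a placeholder here):
```
# Placeholder file name; deploy_glue_job.sh writes automation/output_<job>.json per job file.
aws glue create-job --cli-input-json file://automation/output_demo.json
```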
73 changes: 73 additions & 0 deletions automation/deploy_glue_job.sh
@@ -0,0 +1,73 @@
#!/bin/bash
s3_bucket="$1"
role="$2"
kaggle_key="$3"
kaggle_username="$4"

source ./app/.custom_env

job_names=$(aws glue get-jobs | jq -r '.Jobs | map(.Name)[]')

for file in jobs/*.py; do
filename=$(basename "$file" .py)

if [ "$filename" != "__init__" ]; then

if [[ $job_names != *"$filename"* ]]; then

jq --arg NAME "$filename" \
--arg SCRIPT_LOCATION "s3://$s3_bucket/scripts/$filename.py" \
--arg ROLE "$role" \
--arg TEMP_DIR "s3://$s3_bucket/Logs/temp/" \
--arg EVENT_LOG "s3://$s3_bucket/Logs/UILogs/" \
--arg WHEEL "s3://$s3_bucket/scripts/app-0.9-py3-none-any.whl" \
--arg KAGGLE_KEY "$kaggle_key" \
--arg KAGGLE_USERNAME "$kaggle_username" \
--arg GLUE_READ_PATH "$GLUE_READ_PATH" \
--arg GLUE_WRITE_PATH "$GLUE_WRITE_PATH" \
--arg KAGGLE_PATH "$KAGGLE_PATH" \
'.Name=$NAME |
.Command.ScriptLocation=$SCRIPT_LOCATION |
.Role=$ROLE |
.DefaultArguments["--TempDir"]=$TEMP_DIR |
.DefaultArguments["--spark-event-logs-path"]=$EVENT_LOG |
.DefaultArguments["--extra-py-files"]=$WHEEL |
.DefaultArguments["--KAGGLE_KEY"]=$KAGGLE_KEY |
.DefaultArguments["--KAGGLE_USERNAME"]=$KAGGLE_USERNAME |
.DefaultArguments["--GLUE_READ_PATH"] = $GLUE_READ_PATH |
.DefaultArguments["--GLUE_WRITE_PATH"] = $GLUE_WRITE_PATH |
.DefaultArguments["--KAGGLE_PATH"] = $KAGGLE_PATH' \
automation/create_glue_job.json > "automation/output_$filename.json"

aws glue create-job --cli-input-json file://"automation/output_$filename.json"

else

jq --arg NAME "$filename" \
--arg SCRIPT_LOCATION "s3://$s3_bucket/scripts/$filename.py" \
--arg ROLE "$role" \
--arg TEMP_DIR "s3://$s3_bucket/Logs/temp/" \
--arg EVENT_LOG "s3://$s3_bucket/Logs/UILogs/" \
--arg WHEEL "s3://$s3_bucket/scripts/app-0.9-py3-none-any.whl" \
--arg KAGGLE_KEY "$kaggle_key" \
--arg KAGGLE_USERNAME "$kaggle_username" \
--arg GLUE_READ_PATH "$GLUE_READ_PATH" \
--arg GLUE_WRITE_PATH "$GLUE_WRITE_PATH" \
--arg KAGGLE_PATH "$KAGGLE_PATH" \
'.JobName=$NAME |
.JobUpdate.Command.ScriptLocation=$SCRIPT_LOCATION |
.JobUpdate.Role=$ROLE |
.JobUpdate.DefaultArguments["--TempDir"]=$TEMP_DIR |
.JobUpdate.DefaultArguments["--spark-event-logs-path"]=$EVENT_LOG |
.JobUpdate.DefaultArguments["--extra-py-files"]=$WHEEL |
.JobUpdate.DefaultArguments["--KAGGLE_KEY"]=$KAGGLE_KEY |
.JobUpdate.DefaultArguments["--KAGGLE_USERNAME"]=$KAGGLE_USERNAME |
.JobUpdate.DefaultArguments["--GLUE_READ_PATH"] = $GLUE_READ_PATH |
.JobUpdate.DefaultArguments["--GLUE_WRITE_PATH"] = $GLUE_WRITE_PATH |
.JobUpdate.DefaultArguments["--KAGGLE_PATH"] = $KAGGLE_PATH' \
automation/update_glue_job.json > "automation/output_$filename.json"

aws glue update-job --cli-input-json file://"automation/output_$filename.json"
fi
fi
done
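
A hedged usage sketch: ```cd.yml``` passes the S3 bucket, Glue role, Kaggle key, and Kaggle username as positional arguments in that order. The values below are placeholders, and the script assumes the AWS CLI is configured, ```jq``` is installed, and ```app/.custom_env``` is present:
```
# Placeholders only: substitute your own bucket, IAM role, and Kaggle credentials.
bash automation/deploy_glue_job.sh "my-glue-bucket" "my-glue-service-role" "$KAGGLE_TOKEN" "$KAGGLE_USERNAME"

# Spot-check a deployed job; job names mirror the file names under jobs/ (e.g. demo).
aws glue get-job --job-name demo
```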
31 changes: 31 additions & 0 deletions automation/update_glue_job.json
@@ -0,0 +1,31 @@
{
"JobName": "sample-name",
"JobUpdate": {
"Description": "",
"Role": "sample-role",
"ExecutionProperty": {
"MaxConcurrentRuns": 1
},
"Command": {
"Name": "glueetl",
"ScriptLocation": "sample-location",
"PythonVersion": "3"
},
"DefaultArguments": {
"--enable-glue-datacatalog": "true",
"--job-bookmark-option": "job-bookmark-enable",
"--TempDir": "s3://sample-bucket/scripts/temp/",
"--enable-metrics": "true",
"--enable-spark-ui": "true",
"--spark-event-logs-path": "s3://sample-bucket/Logs/UILogs/",
"--enable-job-insights": "true",
"--enable-continuous-cloudwatch-log": "true",
"--job-language": "python"
},
"MaxRetries": 0,
"Timeout": 10,
"WorkerType": "G.1X",
"NumberOfWorkers": 2,
"GlueVersion": "4.0"
}
}
Empty file added jobs/__init__.py
Empty file.
30 changes: 30 additions & 0 deletions jobs/demo.py
@@ -0,0 +1,30 @@
# This is a demo file for writing your transformations
from dotenv import load_dotenv
import app.environment as env

load_dotenv("app/.custom-env")

# COMMAND ----------

if "dbutils" in locals():
databricks = True
else:
spark = None
dbutils = None
databricks = False

# COMMAND ----------
# This example is specific to the "mastmustu/insurance-claims-fraud-data" dataset; different frames will be returned based on your data
# fmt: off

# Keep this flag True if you want to extract data from Kaggle, else False
kaggle_extraction = True

[employee, insurance, vendor] = env.get_data(databricks, kaggle_extraction, dbutils, spark) #pylint: disable=unbalanced-tuple-unpacking

write_path = env.get_write_path(databricks)

# fmt: on
# COMMAND ----------

# Write all your transformations below:
2 changes: 1 addition & 1 deletion main.py → jobs/main.py
@@ -9,7 +9,7 @@
import app.environment as env
import app.spark_wrapper as sw

load_dotenv("app/.custom-env")
load_dotenv("app/.custom_env")

# COMMAND ----------

6 changes: 3 additions & 3 deletions sonar-project.properties
@@ -7,9 +7,9 @@ sonar.python.pylint.reportPath=pylint-report.txt
sonar.python.coverage.reportPaths=*coverage.xml
sonar.python.pylint_config=.pylintrc
sonar.python.pylint=/usr/local/bin/pylint
sonar.inclusions=**/app/**,**/main.py
sonar.inclusions=**/app/**,**/jobs/**
sonar.exclusions=**/tests/test_*.py,**/__init__.py,**/tests/mock/*.*
sonar.test.exclusions=**/tests/test_*.py,**/__init__.py,**/tests/mock/*.*,**/main.py
sonar.coverage.exclusions=**/tests/test_*.py,**/__init__.py,**/tests/mock/*.*,**/main.py
sonar.test.exclusions=**/tests/test_*.py,**/__init__.py,**/tests/mock/*.*,**/jobs/**
sonar.coverage.exclusions=**/tests/test_*.py,**/__init__.py,**/tests/mock/*.*,**/jobs/**
sonar.text.excluded.file.suffixes=csv
sonar.python.version=3.7
