Merge pull request #9 from wednesday-solutions/feat/glue-bash
Feat/glue-bash: Added automation script for Glue job deployment
vighnesh-wednesday authored Jan 4, 2024
2 parents 2aaa4ab + 328d619 commit 70bf2d3
Showing 12 changed files with 202 additions and 34 deletions.
15 changes: 9 additions & 6 deletions .github/workflows/cd.yml
@@ -14,6 +14,7 @@ jobs:
S3_BUCKET_NAME: ${{ secrets.S3_BUCKET_NAME }}
S3_SCRIPTS_PATH: ${{ secrets.S3_SCRIPTS_PATH }}
AWS_REGION: ${{ secrets.AWS_REGION }}
AWS_GLUE_ROLE: ${{ secrets.AWS_GLUE_ROLE }}
steps:
- uses: actions/checkout@v2

@@ -22,12 +23,12 @@ jobs:
with:
python-version: 3.9

- run: |
- name: Build App Wheel
run: |
pip install setuptools wheel
python3 setup.py bdist_wheel
# Step 1: Copy script to S3 bucket
- name: Copy script to S3 bucket
- name: Setup AWS cli & upload App Wheel to S3
uses: jakejarvis/[email protected]
with:
args: --follow-symlinks
@@ -36,7 +37,9 @@
DEST_DIR: $S3_SCRIPTS_PATH
AWS_S3_BUCKET: $S3_BUCKET_NAME

- name: Upload Scripts to S3
run: aws s3 cp jobs "s3://$S3_BUCKET_NAME/$S3_SCRIPTS_PATH/" --recursive --region ap-south-1

- name: Upload Script file to S3
run: aws s3 cp ./main.py "s3://$S3_BUCKET_NAME/$S3_SCRIPTS_PATH/" --region ap-south-1

- name: Deploy Jobs on Glue
run: |
automation/deploy_glue_job.sh $S3_BUCKET_NAME $AWS_GLUE_ROLE $KAGGLE_TOKEN $KAGGLE_USERNAME
7 changes: 3 additions & 4 deletions .github/workflows/ci.yml
@@ -9,7 +9,7 @@ on:
jobs:
run-ci:
runs-on: ubuntu-latest
container: dipanshuwed/glue4.0:latest
container: vighneshwed/glue4:latest

steps:
- name: Checkout repository
@@ -19,15 +19,14 @@ jobs:
run: |
python3 -m pip install --upgrade pip
pip3 install -r requirements.txt
yum install -y jq
- name: Type check
run: mypy ./ --ignore-missing-imports

- name: Lint
run: |
pylint app tests main.py setup.py
pylint app tests main.py setup.py --output pylint-report.txt
pylint app tests jobs setup.py
pylint app tests jobs setup.py --output pylint-report.txt
- name: Testing
run: |
22 changes: 11 additions & 11 deletions README.md
@@ -8,7 +8,7 @@ To run the same ETL code in multiple cloud services based on your preference, th

- Azure Databricks can't be configured locally; you can only connect your local IDE to a running cluster in Databricks. Push your code to a GitHub repo, then make a workflow in Databricks with the URL of the repo & file.
- For AWS Glue, I'm setting up a local environment using the Docker image, then deploying it to AWS Glue using GitHub Actions.
- The "tasks.txt" file contents the details of transformations done in the main file3.
- The "tasks.txt" file contents the details of transformations done in the main file.

## Requirements for Azure Databricks (for local connect only)
- [Unity Catalog](https://learn.microsoft.com/en-us/azure/databricks/data-governance/unity-catalog/enable-workspaces) enabled workspace.
@@ -26,22 +26,22 @@ To run the same ETL code in multiple cloud services based on your preference, th

2. Give your S3, ADLS & Kaggle (optional) paths in the ```app/.custom_env``` file.

3. Just run a Glue 4 docker conatiner & write your transformations in ```main.py``` file. Install dependancies using ```pip install -r requirements.txt```
3. Just run a Glue 4 Docker container & write your transformations in the ```jobs``` folder; refer to the ```demo.py``` file. Install dependencies using ```pip install -r requirements.txt```

4. Run your scirpts in the docker container locally using ```spark-sumbit main.py```
4. Run your scripts locally in the Docker container using ```spark-submit jobs/main.py```; see the sketch below.
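
A minimal sketch of that local loop, assuming the ```vighneshwed/glue4``` image from ```ci.yml``` ships ```spark-submit``` on its PATH and accepts a mounted workspace (the mount point and working directory are placeholders; adjust to your setup):
```
# Sketch only: the image name comes from ci.yml; mount point and workdir are assumptions.
docker run -it \
  -v "$PWD":/home/glue_user/workspace \
  -w /home/glue_user/workspace \
  vighneshwed/glue4:latest bash

# inside the container
pip install -r requirements.txt
spark-submit jobs/main.py
```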

## Deployment

1. In your AWS Glue job pass these parameters with thier correct values:
1. In your GitHub Actions Secrets, set up the following keys with their values:
```
JOB_NAME
KAGGLE_USERNAME
KAGGLE_KEY
GLUE_READ_PATH
GLUE_WRITE_PATH
KAGGLE_PATH (keep blank if not extracting)
AWS_ACCESS_KEY_ID
AWS_SECRET_ACCESS_KEY
S3_BUCKET_NAME
S3_SCRIPTS_PATH
AWS_REGION
AWS_GLUE_ROLE
```
Rest everything is taken care of in ```cd.yml``` file.
The remaining key-value pairs in the ```app/.custom_env``` file are passed via the AWS CLI in the ```cd.yml``` workflow, so there is no need to pass them manually to the job.
2. For Azure Databricks, make a workflow with the link of your repo & main file. Pass the following parameters with their correct values:
9 changes: 0 additions & 9 deletions app/.custom-env

This file was deleted.

9 changes: 9 additions & 0 deletions app/.custom_env
@@ -0,0 +1,9 @@
# This is my custom file for read & write paths based on the environment

GLUE_READ_PATH="s3://glue-bucket-vighnesh/rawdata/"
GLUE_WRITE_PATH="s3://glue-bucket-vighnesh/transformed/"

DATABRICKS_READ_PATH="/mnt/rawdata/"
DATABRICKS_WRITE_PATH="/mnt/transformed/"

KAGGLE_PATH="mastmustu/insurance-claims-fraud-data"
32 changes: 32 additions & 0 deletions automation/create_glue_job.json
@@ -0,0 +1,32 @@
{
"Name": "samplename",
"Description": "",
"LogUri": "",
"Role": "samplerole",
"ExecutionProperty": {
"MaxConcurrentRuns": 1
},
"Command": {
"Name": "glueetl",
"ScriptLocation": "sample-location",
"PythonVersion": "3"
},
"DefaultArguments": {
"--enable-glue-datacatalog": "true",
"--job-bookmark-option": "job-bookmark-disable",
"--TempDir": "sample-bucket/Logs/temp/",
"--enable-metrics": "true",
"--extra-py-files": "sample-bucket/scripts/sample-wheel",
"--spark-event-logs-path": "sample-bucket/Logs/UILogs/",
"--enable-job-insights": "false",
"--additional-python-modules": "python-dotenv,kaggle",
"--enable-observability-metrics": "true",
"--enable-continuous-cloudwatch-log": "true",
"--job-language": "python"
},
"MaxRetries": 0,
"Timeout": 10,
"WorkerType": "G.1X",
"NumberOfWorkers": 2,
"GlueVersion": "4.0"
}
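
The ```samplename```, ```samplerole```, and ```sample-location``` fields are placeholders that the deploy script below overwrites with jq before submission. As a hedged sketch, a filled-in copy of this template is submitted like so (the file name follows the script's ```automation/output_<job>.json``` convention and is a placeholder here):
```
# Placeholder file name; deploy_glue_job.sh writes automation/output_<job>.json per job file.
aws glue create-job --cli-input-json file://automation/output_demo.json
```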
73 changes: 73 additions & 0 deletions automation/deploy_glue_job.sh
@@ -0,0 +1,73 @@
#!/bin/bash
s3_bucket="$1"
role="$2"
kaggle_key="$3"
kaggle_username="$4"

source ./app/.custom_env

job_names=$(aws glue get-jobs | jq -r '.Jobs | map(.Name)[]')

for file in jobs/*.py; do
filename=$(basename "$file" .py)

if [ "$filename" != "__init__" ]; then

if [[ $job_names != *"$filename"* ]]; then

jq --arg NAME "$filename" \
--arg SCRIPT_LOCATION "s3://$s3_bucket/scripts/$filename.py" \
--arg ROLE "$role" \
--arg TEMP_DIR "s3://$s3_bucket/Logs/temp/" \
--arg EVENT_LOG "s3://$s3_bucket/Logs/UILogs/" \
--arg WHEEL "s3://$s3_bucket/scripts/app-0.9-py3-none-any.whl" \
--arg KAGGLE_KEY "$kaggle_key" \
--arg KAGGLE_USERNAME "$kaggle_username" \
--arg GLUE_READ_PATH "$GLUE_READ_PATH" \
--arg GLUE_WRITE_PATH "$GLUE_WRITE_PATH" \
--arg KAGGLE_PATH "$KAGGLE_PATH" \
'.Name=$NAME |
.Command.ScriptLocation=$SCRIPT_LOCATION |
.Role=$ROLE |
.DefaultArguments["--TempDir"]=$TEMP_DIR |
.DefaultArguments["--spark-event-logs-path"]=$EVENT_LOG |
.DefaultArguments["--extra-py-files"]=$WHEEL |
.DefaultArguments["--KAGGLE_KEY"]=$KAGGLE_KEY |
.DefaultArguments["--KAGGLE_USERNAME"]=$KAGGLE_USERNAME |
.DefaultArguments["--GLUE_READ_PATH"] = $GLUE_READ_PATH |
.DefaultArguments["--GLUE_WRITE_PATH"] = $GLUE_WRITE_PATH |
.DefaultArguments["--KAGGLE_PATH"] = $KAGGLE_PATH' \
automation/create_glue_job.json > "automation/output_$filename.json"

aws glue create-job --cli-input-json file://"automation/output_$filename.json"

else

jq --arg NAME "$filename" \
--arg SCRIPT_LOCATION "s3://$s3_bucket/scripts/$filename.py" \
--arg ROLE "$role" \
--arg TEMP_DIR "s3://$s3_bucket/Logs/temp/" \
--arg EVENT_LOG "s3://$s3_bucket/Logs/UILogs/" \
--arg WHEEL "s3://$s3_bucket/scripts/app-0.9-py3-none-any.whl" \
--arg KAGGLE_KEY "$kaggle_key" \
--arg KAGGLE_USERNAME "$kaggle_username" \
--arg GLUE_READ_PATH "$GLUE_READ_PATH" \
--arg GLUE_WRITE_PATH "$GLUE_WRITE_PATH" \
--arg KAGGLE_PATH "$KAGGLE_PATH" \
'.JobName=$NAME |
.JobUpdate.Command.ScriptLocation=$SCRIPT_LOCATION |
.JobUpdate.Role=$ROLE |
.JobUpdate.DefaultArguments["--TempDir"]=$TEMP_DIR |
.JobUpdate.DefaultArguments["--spark-event-logs-path"]=$EVENT_LOG |
.JobUpdate.DefaultArguments["--extra-py-files"]=$WHEEL |
.JobUpdate.DefaultArguments["--KAGGLE_KEY"]=$KAGGLE_KEY |
.JobUpdate.DefaultArguments["--KAGGLE_USERNAME"]=$KAGGLE_USERNAME |
.JobUpdate.DefaultArguments["--GLUE_READ_PATH"] = $GLUE_READ_PATH |
.JobUpdate.DefaultArguments["--GLUE_WRITE_PATH"] = $GLUE_WRITE_PATH |
.JobUpdate.DefaultArguments["--KAGGLE_PATH"] = $KAGGLE_PATH' \
automation/update_glue_job.json > "automation/output_$filename.json"

aws glue update-job --cli-input-json file://"automation/output_$filename.json"
fi
fi
done
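
A hedged usage sketch: ```cd.yml``` passes the S3 bucket, Glue role, Kaggle key, and Kaggle username as positional arguments in that order. The values below are placeholders, and the script assumes the AWS CLI is configured, ```jq``` is installed, and ```app/.custom_env``` is present:
```
# Placeholders only: substitute your own bucket, IAM role, and Kaggle credentials.
bash automation/deploy_glue_job.sh "my-glue-bucket" "my-glue-service-role" "$KAGGLE_TOKEN" "$KAGGLE_USERNAME"

# Spot-check a deployed job; job names mirror the file names under jobs/ (e.g. demo).
aws glue get-job --job-name demo
```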
31 changes: 31 additions & 0 deletions automation/update_glue_job.json
@@ -0,0 +1,31 @@
{
"JobName": "sample-name",
"JobUpdate": {
"Description": "",
"Role": "sample-role",
"ExecutionProperty": {
"MaxConcurrentRuns": 1
},
"Command": {
"Name": "glueetl",
"ScriptLocation": "sample-location",
"PythonVersion": "3"
},
"DefaultArguments": {
"--enable-glue-datacatalog": "true",
"--job-bookmark-option": "job-bookmark-enable",
"--TempDir": "s3://sample-bucket/scripts/temp/",
"--enable-metrics": "true",
"--enable-spark-ui": "true",
"--spark-event-logs-path": "s3://sample-bucket/Logs/UILogs/",
"--enable-job-insights": "true",
"--enable-continuous-cloudwatch-log": "true",
"--job-language": "python"
},
"MaxRetries": 0,
"Timeout": 10,
"WorkerType": "G.1X",
"NumberOfWorkers": 2,
"GlueVersion": "4.0"
}
}
Empty file added jobs/__init__.py
Empty file.
30 changes: 30 additions & 0 deletions jobs/demo.py
@@ -0,0 +1,30 @@
# This is a demo file for writing your transformations
from dotenv import load_dotenv
import app.environment as env

load_dotenv("app/.custom-env")

# COMMAND ----------

if "dbutils" in locals():
databricks = True
else:
spark = None
dbutils = None
databricks = False

# COMMAND ----------
# This example is specific to the "mastmustu/insurance-claims-fraud-data" dataset; different frames will be returned based on your data
# fmt: off

# Keep this flag True if you want to extract data from Kaggle, else False
kaggle_extraction = True

[employee, insurance, vendor] = env.get_data(databricks, kaggle_extraction, dbutils, spark) #pylint: disable=unbalanced-tuple-unpacking

write_path = env.get_write_path(databricks)

# fmt: on
# COMMAND ----------

# Write all your transformations below:
2 changes: 1 addition & 1 deletion main.py → jobs/main.py
@@ -9,7 +9,7 @@
import app.environment as env
import app.spark_wrapper as sw

load_dotenv("app/.custom-env")
load_dotenv("app/.custom_env")

# COMMAND ----------

6 changes: 3 additions & 3 deletions sonar-project.properties
@@ -7,9 +7,9 @@ sonar.python.pylint.reportPath=pylint-report.txt
sonar.python.coverage.reportPaths=*coverage.xml
sonar.python.pylint_config=.pylintrc
sonar.python.pylint=/usr/local/bin/pylint
sonar.inclusions=**/app/**,**/main.py
sonar.inclusions=**/app/**,**/jobs/**
sonar.exclusions=**/tests/test_*.py,**/__init__.py,**/tests/mock/*.*
sonar.test.exclusions=**/tests/test_*.py,**/__init__.py,**/tests/mock/*.*,**/main.py
sonar.coverage.exclusions=**/tests/test_*.py,**/__init__.py,**/tests/mock/*.*,**/main.py
sonar.test.exclusions=**/tests/test_*.py,**/__init__.py,**/tests/mock/*.*,**/jobs/**
sonar.coverage.exclusions=**/tests/test_*.py,**/__init__.py,**/tests/mock/*.*,**/jobs/**
sonar.text.excluded.file.suffixes=csv
sonar.python.version=3.7
