
Commit

Merge pull request #10 from wednesday-solutions/mod/docker-example
Feat: Docker init example file
idipanshu authored Feb 9, 2024
2 parents 70bf2d3 + 84b505e commit 1fa130f
Showing 6 changed files with 30 additions and 15 deletions.
1 change: 1 addition & 0 deletions .gitignore
@@ -3,3 +3,4 @@
*__pycache__
temp
htmlcov
.vscode
23 changes: 13 additions & 10 deletions README.md
@@ -12,25 +12,28 @@ To run the same ETL code in multiple cloud services based on your preference, th

## Requirements for Azure Databricks (for local connect only)
- [Unity Catalog](https://learn.microsoft.com/en-us/azure/databricks/data-governance/unity-catalog/enable-workspaces) enabled workspace.
- [Databricks Connect](https://learn.microsoft.com/en-us/azure/databricks/dev-tools/databricks-connect/python/install) configured on local machine. Runing cluster.
- [Databricks Connect](https://learn.microsoft.com/en-us/azure/databricks/dev-tools/databricks-connect/python/install) configured on your local machine, with a running cluster.

## Requirements for AWS Glue (local setup)

- For Unix-based systems you can refer: [Data Enginnering Onboarding Starter Setup](https://github.com/wednesday-solutions/Data-Engineering-Onboarding-Starter#setup)
- For Unix-based systems you can refer: [Data Engineering Onboarding Starter Setup](https://github.com/wednesday-solutions/Data-Engineering-Onboarding-Starter#setup)

- For Windows-based systems you can refer: [AWS Glue Developing using a Docker image](https://docs.aws.amazon.com/glue/latest/dg/aws-glue-programming-etl-libraries.html#develop-local-docker-image)

## Steps

1. Clone this repo in your own repo.
1. Clone this repo into your own repository. For Windows, we recommend using WSL.

2. Give your s3, adlas & kaggle (optional) paths in the ```app/.custom-env``` file.
2. Provide your S3, ADLS & Kaggle (optional) paths in the ```app/.custom_env``` file for Databricks. Create a ```.env``` file in the root folder for the local Docker Glue setup to use (a sketch of these files is shown after this list).
Make sure to pass KAGGLE_KEY & KAGGLE_USERNAME values if you are going to use Kaggle; otherwise, set the kaggle_extraction flag to False.

3. Just run a Glue 4 docker conatiner & write your transformations in ```jobs``` folder. Refer ```demo.py``` file. Install dependancies using ```pip install -r requirements.txt```
3. Run ```automation/init_docker.sh```, passing your AWS credentials location & project root location. If you are using Windows PowerShell or Command Prompt, run the commands manually by copy-pasting them.

4. Run your scirpts in the docker container locally using ```spark-sumbit jobs/main.py```
4. Write your jobs in the ```jobs``` folder. Refer to the ```demo.py``` file; ```jobs/main.py``` is one example.

## Deployemnt
5. Check that your setup is correct by running scripts in the Docker container locally using ```spark-submit jobs/demo.py```. Make sure you see the "Execution Complete" statement printed.
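
For reference, here is a hedged sketch of what these environment files might contain. Apart from KAGGLE_USERNAME and KAGGLE_KEY, every variable name below is a hypothetical placeholder; use whatever names your jobs actually read:
```
# Kaggle credentials (only needed when kaggle_extraction is True)
KAGGLE_USERNAME=your_kaggle_username
KAGGLE_KEY=your_kaggle_key

# Hypothetical storage paths -- substitute the variable names your jobs expect
S3_READ_PATH=s3://your-bucket/raw/
ADLS_WRITE_PATH=wasbs://transformed@yourstorageaccount.blob.core.windows.net/
```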

## Deployment

1. In your GitHub Actions Secrets, set up the following keys with their values:
```
@@ -41,7 +44,7 @@ To run the same ETL code in multiple cloud services based on your preference, th
AWS_REGION
AWS_GLUE_ROLE
```
Rest all the key-value pairs in the ```app/.custom-env``` file are passed using aws cli using ```cd.yml``` file, so no need to pass them manually in the job.
For the rest of the key-value pairs that you wrote in your .env file, make sure you pass them using the ```automation/deploy_glue_jobs.sh``` file.
2. For Azure Databricks, create a workflow with the link to your repo & main file. Pass the following parameters with their correct values:
@@ -54,7 +57,7 @@ To run the same ETL code in multiple cloud services based on your preference, th
## Documentation
[Multi-cloud Pipeline Documnentation](https://docs.google.com/document/d/1npCpT_FIpw7ZuxAzQrEH3IsPKCDt7behmF-6VjrSFoQ/edit?usp=sharing)
[Multi-cloud Pipeline Documentation](https://docs.google.com/document/d/1npCpT_FIpw7ZuxAzQrEH3IsPKCDt7behmF-6VjrSFoQ/edit?usp=sharing)
## References
@@ -67,4 +70,4 @@ To run tests in the root of the directory use:
coverage run --source=app -m unittest discover -s tests
coverage report
Note that awsglue libraries are not availabe to download, so use AWS Glue 4 Docker container.
Note that AWS Glue libraries are not available for download, so use the AWS Glue 4 Docker container.
2 changes: 1 addition & 1 deletion app/connect_databricks.py
@@ -14,7 +14,7 @@ def create_mount(dbutils, container_name, mount_path):
f"fs.azure.account.key.{storage_name}.blob.core.windows.net": storage_key
},
)
print(f"{mount_path} Mount Successfull")
print(f"{mount_path} Mount Successful")
else:
dbutils.fs.refreshMounts()
print(f"{mount_path} Already mounted")
8 changes: 8 additions & 0 deletions automation/init_docker.sh
@@ -0,0 +1,8 @@
# Usage: init_docker.sh <aws_credentials_location> <project_root_location>
aws_credentials="$1"
project_root_location="$2"

# Start an AWS Glue 4.0 PySpark container with the AWS credentials and project root mounted into it
docker run -it -v "$aws_credentials":/home/glue_user/.aws -v "$project_root_location":/home/glue_user/workspace/ -e AWS_PROFILE=default -e DISABLE_SSL=true --rm -p 4040:4040 -p 18080:18080 --name glue_pyspark amazon/aws-glue-libs:glue_libs_4.0.0_image_01

# Make the workspace importable and install the project dependencies
export PYTHONPATH=$PYTHONPATH:/home/glue_user/workspace

pip3 install -r requirements.txt
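
For reference, one possible way to invoke this script from the repository root; the credentials directory shown is just the conventional `~/.aws` default and may differ on your machine:
```
bash automation/init_docker.sh ~/.aws "$(pwd)"
```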
7 changes: 5 additions & 2 deletions jobs/demo.py
@@ -2,7 +2,7 @@
from dotenv import load_dotenv
import app.environment as env

load_dotenv("app/.custom-env")
load_dotenv("../app/.custom-env")

# COMMAND ----------

@@ -18,7 +18,7 @@
# fmt: off

# Keep this flag True if you want to extract data from kaggle, else False
kaggle_extraction = True
kaggle_extraction = False

[employee, insurance, vendor] = env.get_data(databricks, kaggle_extraction, dbutils, spark) #pylint: disable=unbalanced-tuple-unpacking

@@ -28,3 +28,6 @@
# COMMAND ----------

# Write all your transformations below:


print("\nExecution Complete\n")
4 changes: 2 additions & 2 deletions jobs/main.py
@@ -9,7 +9,7 @@
import app.environment as env
import app.spark_wrapper as sw

load_dotenv("app/.custom_env")
load_dotenv("../app/.custom_env")

# COMMAND ----------

@@ -196,7 +196,7 @@ def get_cond(type1, type2):

# COMMAND ----------

# finally writting the data in transformed container
# finally, write the data to the transformed container
df.coalesce(1).write.csv(write_path + "final_data.csv", header=True, mode="overwrite")

print("Execution Complete")
