
Commit

Merge pull request #10 from wednesday-solutions/mod/docker-example
Feat: Docker init example file
idipanshu authored Feb 9, 2024
2 parents 70bf2d3 + 84b505e commit 1fa130f
Showing 6 changed files with 30 additions and 15 deletions.
1 change: 1 addition & 0 deletions .gitignore
@@ -3,3 +3,4 @@
*__pycache__
temp
htmlcov
.vscode
23 changes: 13 additions & 10 deletions README.md
@@ -12,25 +12,28 @@ To run the same ETL code in multiple cloud services based on your preference, th

## Requirements for Azure Databricks (for local connect only)
- [Unity Catalog](https://learn.microsoft.com/en-us/azure/databricks/data-governance/unity-catalog/enable-workspaces) enabled workspace.
- [Databricks Connect](https://learn.microsoft.com/en-us/azure/databricks/dev-tools/databricks-connect/python/install) configured on local machine. Runing cluster.
- [Databricks Connect](https://learn.microsoft.com/en-us/azure/databricks/dev-tools/databricks-connect/python/install) configured on your local machine, with a running cluster.

## Requirements for AWS Glue (local setup)

- For Unix-based systems you can refer: [Data Enginnering Onboarding Starter Setup](https://github.com/wednesday-solutions/Data-Engineering-Onboarding-Starter#setup)
- For Unix-based systems you can refer: [Data Engineering Onboarding Starter Setup](https://github.com/wednesday-solutions/Data-Engineering-Onboarding-Starter#setup)

- For Windows-based systems you can refer: [AWS Glue Developing using a Docker image](https://docs.aws.amazon.com/glue/latest/dg/aws-glue-programming-etl-libraries.html#develop-local-docker-image)

## Steps

1. Clone this repo in your own repo.
1. Clone this repo into your own repository. For Windows, we recommend using WSL.

2. Give your s3, adlas & kaggle (optional) paths in the ```app/.custom-env``` file.
2. Provide your S3, ADLS & Kaggle (optional) paths in the ```app/.custom_env``` file for Databricks. Create a ```.env``` file in the root folder for the local Docker Glue setup to use (a sketch of these files is shown after this list).
Make sure to pass KAGGLE_KEY & KAGGLE_USERNAME values if you are going to use Kaggle; otherwise, set the kaggle_extraction flag to False.

3. Just run a Glue 4 docker conatiner & write your transformations in ```jobs``` folder. Refer ```demo.py``` file. Install dependancies using ```pip install -r requirements.txt```
3. Run ```automation/init_docker.sh```, passing your AWS credentials location & project root location. If you are using Windows PowerShell or Command Prompt, run the commands manually by copy-pasting them.

4. Run your scirpts in the docker container locally using ```spark-sumbit jobs/main.py```
4. Write your jobs in the ```jobs``` folder. Refer to the ```demo.py``` file; ```jobs/main.py``` is one example.

## Deployemnt
5. Check that your setup is correct by running scripts in the Docker container locally using ```spark-submit jobs/demo.py```. Make sure you see the "Execution Complete" statement printed.
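
For reference, here is a hedged sketch of what these environment files might contain. Apart from KAGGLE_USERNAME and KAGGLE_KEY, every variable name below is a hypothetical placeholder; use whatever names your jobs actually read:
```
# Kaggle credentials (only needed when kaggle_extraction is True)
KAGGLE_USERNAME=your_kaggle_username
KAGGLE_KEY=your_kaggle_key

# Hypothetical storage paths -- substitute the variable names your jobs expect
S3_READ_PATH=s3://your-bucket/raw/
ADLS_WRITE_PATH=wasbs://transformed@yourstorageaccount.blob.core.windows.net/
```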

## Deployment

1. In your GitHub Actions Secrets, set up the following keys with their values:
```
@@ -41,7 +44,7 @@ To run the same ETL code in multiple cloud services based on your preference, th
AWS_REGION
AWS_GLUE_ROLE
```
Rest all the key-value pairs in the ```app/.custom-env``` file are passed using aws cli using ```cd.yml``` file, so no need to pass them manually in the job.
For the rest of the key-value pairs that you wrote in your .env file, make sure you pass them using the ```automation/deploy_glue_jobs.sh``` file.
2. For Azure Databricks, create a workflow with the link to your repo & main file. Pass the following parameters with their correct values:
@@ -54,7 +57,7 @@ To run the same ETL code in multiple cloud services based on your preference, th
## Documentation
[Multi-cloud Pipeline Documnentation](https://docs.google.com/document/d/1npCpT_FIpw7ZuxAzQrEH3IsPKCDt7behmF-6VjrSFoQ/edit?usp=sharing)
[Multi-cloud Pipeline Documentation](https://docs.google.com/document/d/1npCpT_FIpw7ZuxAzQrEH3IsPKCDt7behmF-6VjrSFoQ/edit?usp=sharing)
## References
@@ -67,4 +70,4 @@ To run tests in the root of the directory use:
coverage run --source=app -m unittest discover -s tests
coverage report
Note that awsglue libraries are not availabe to download, so use AWS Glue 4 Docker container.
Note that AWS Glue libraries are not available for download, so use the AWS Glue 4 Docker container.
2 changes: 1 addition & 1 deletion app/connect_databricks.py
@@ -14,7 +14,7 @@ def create_mount(dbutils, container_name, mount_path):
f"fs.azure.account.key.{storage_name}.blob.core.windows.net": storage_key
},
)
print(f"{mount_path} Mount Successfull")
print(f"{mount_path} Mount Successful")
else:
dbutils.fs.refreshMounts()
print(f"{mount_path} Already mounted")
8 changes: 8 additions & 0 deletions automation/init_docker.sh
@@ -0,0 +1,8 @@
# Usage: init_docker.sh <aws_credentials_location> <project_root_location>
aws_credentials="$1"
project_root_location="$2"

# Start an AWS Glue 4.0 PySpark container with the AWS credentials and project root mounted into it
docker run -it -v "$aws_credentials":/home/glue_user/.aws -v "$project_root_location":/home/glue_user/workspace/ -e AWS_PROFILE=default -e DISABLE_SSL=true --rm -p 4040:4040 -p 18080:18080 --name glue_pyspark amazon/aws-glue-libs:glue_libs_4.0.0_image_01

# Make the workspace importable and install the project dependencies
export PYTHONPATH=$PYTHONPATH:/home/glue_user/workspace

pip3 install -r requirements.txt
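
For reference, one possible way to invoke this script from the repository root; the credentials directory shown is just the conventional `~/.aws` default and may differ on your machine:
```
bash automation/init_docker.sh ~/.aws "$(pwd)"
```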
7 changes: 5 additions & 2 deletions jobs/demo.py
@@ -2,7 +2,7 @@
from dotenv import load_dotenv
import app.environment as env

load_dotenv("app/.custom-env")
load_dotenv("../app/.custom-env")

# COMMAND ----------

@@ -18,7 +18,7 @@
# fmt: off

# Keep this flag True if you want to extract data from kaggle, else False
kaggle_extraction = True
kaggle_extraction = False

[employee, insurance, vendor] = env.get_data(databricks, kaggle_extraction, dbutils, spark) #pylint: disable=unbalanced-tuple-unpacking

@@ -28,3 +28,6 @@
# COMMAND ----------

# Write all your transformations below:


print("\nExecution Complete\n")
4 changes: 2 additions & 2 deletions jobs/main.py
@@ -9,7 +9,7 @@
import app.environment as env
import app.spark_wrapper as sw

load_dotenv("app/.custom_env")
load_dotenv("../app/.custom_env")

# COMMAND ----------

@@ -196,7 +196,7 @@ def get_cond(type1, type2):

# COMMAND ----------

# finally writting the data in transformed container
# finally, write the data to the transformed container
df.coalesce(1).write.csv(write_path + "final_data.csv", header=True, mode="overwrite")

print("Execution Complete")
