Bidnamic's Data Engineering Coding Challenge
- Ingested data from the raw URL files (campaigns, adgroups, search_terms) into AWS S3, which is treated as the data lake, with the following Python code.
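A minimal sketch of this ingestion step (the bucket name, config keys, and source URLs below are assumptions, not values from this repo):

```python
import json

import boto3
import requests

# Load credentials from the shared config file (described further below).
with open("aws_config.json") as f:
    config = json.load(f)

s3 = boto3.client(
    "s3",
    aws_access_key_id=config["aws_access_key_id"],
    aws_secret_access_key=config["aws_secret_access_key"],
)

# Raw source files; the URLs here are placeholders.
raw_files = {
    "campaigns": "https://example.com/raw/campaigns.csv",
    "adgroups": "https://example.com/raw/adgroups.csv",
    "search_terms": "https://example.com/raw/search_terms.csv",
}

for name, url in raw_files.items():
    response = requests.get(url, timeout=60)
    response.raise_for_status()
    # Land each file under a raw/ prefix in the S3 data lake bucket.
    s3.put_object(Bucket="bidnamic-data-lake", Key=f"raw/{name}.csv", Body=response.content)
```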
- Extracted the ingested data objects from the AWS S3 bucket using boto3 with the following Python script in a Jupyter notebook, so that the transformations and data processing can be performed according to the business requirements.
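A minimal sketch of the extraction step, assuming the bucket and key names used above, reading one object into a pandas data frame:

```python
import io
import json

import boto3
import pandas as pd

with open("aws_config.json") as f:
    config = json.load(f)

s3 = boto3.client(
    "s3",
    aws_access_key_id=config["aws_access_key_id"],
    aws_secret_access_key=config["aws_secret_access_key"],
)

# Pull one ingested object back out of the data lake for processing.
obj = s3.get_object(Bucket="bidnamic-data-lake", Key="raw/campaigns.csv")
campaigns = pd.read_csv(io.BytesIO(obj["Body"].read()))
```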
- As good practice in the data preprocessing step, before storing the extracted data in PostgreSQL we created data frames in a PySpark Jupyter notebook and added the columns "created_timestamp", "created_by" and "modified_timestamp"; these columns make it possible to track the data flow from the source, as shown in the sketch below.
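A minimal sketch of adding the audit columns in PySpark; the input path and the "created_by" value are assumptions:

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import current_timestamp, lit

spark = SparkSession.builder.appName("bidnamic-preprocessing").getOrCreate()

# Read the extracted file (path is a placeholder for the data pulled from S3 above).
campaigns_df = spark.read.csv("data/campaigns.csv", header=True, inferSchema=True)

# Audit columns used to track data flow from the source system.
campaigns_df = (
    campaigns_df
    .withColumn("created_timestamp", current_timestamp())
    .withColumn("created_by", lit("data_engineering_pipeline"))
    .withColumn("modified_timestamp", current_timestamp())
)
```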
- During data preprocessing, the data is stored in the following three schema levels:
- 1. Stagedata schema
  - Raw data is loaded into this schema and data types are assigned to match the data.
- 2. Storedata schema
  - Clean data is loaded at this level by filtering out duplicates and creating a distinct-value data frame, with transformations applied as per the business requirements (see the sketch after this list).
- 3. Reportdata schema
  - This schema contains the final tables required by the business.
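A minimal sketch of the Storedata deduplication step, reusing the campaigns_df data frame from the preprocessing sketch above (the key column name is an assumption):

```python
# Keep one row per campaign_id before loading into the storedata schema;
# further business transformations would be applied on this distinct data frame.
distinct_campaigns_df = campaigns_df.dropDuplicates(["campaign_id"])
```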
- By segregating the schemas in the PostgreSQL database we can track data utilisation against the business requirements.
- These schemas are created with the psycopg2 library, which connects to PostgreSQL (administered through pgAdmin) using the configuration and executes the SQL statements, as sketched below.
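A minimal sketch of creating the three schemas with psycopg2; the connection details are assumptions and would come from aws_config.json in practice:

```python
import psycopg2

# Connection parameters are placeholders; in the project they are read from aws_config.json.
conn = psycopg2.connect(
    host="localhost",
    port=5432,
    dbname="bidnamic",
    user="postgres",
    password="postgres",
)
conn.autocommit = True

with conn.cursor() as cur:
    # One schema per data level described above.
    for schema in ("stagedata", "storedata", "reportdata"):
        cur.execute(f"CREATE SCHEMA IF NOT EXISTS {schema};")

conn.close()
```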
- Data loading into PostgreSQL is done by configuring the PostgreSQL JDBC driver for Spark, as in the sketch below.
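A minimal sketch of writing a cleaned data frame to PostgreSQL over JDBC; the JDBC URL, table name, config keys, and driver version are assumptions:

```python
import json

from pyspark.sql import SparkSession

with open("aws_config.json") as f:
    config = json.load(f)

spark = (
    SparkSession.builder
    .appName("bidnamic-load")
    .config("spark.jars.packages", "org.postgresql:postgresql:42.2.18")  # assumed driver version
    .getOrCreate()
)

# Placeholder input; in practice this is the distinct, transformed data frame.
clean_df = spark.read.csv("data/campaigns_clean.csv", header=True, inferSchema=True)

(
    clean_df.write
    .format("jdbc")
    .option("url", config["postgres_jdbc_url"])   # e.g. jdbc:postgresql://localhost:5432/bidnamic
    .option("dbtable", "storedata.campaigns")     # assumed schema.table
    .option("user", config["postgres_user"])
    .option("password", config["postgres_password"])
    .option("driver", "org.postgresql.Driver")
    .mode("append")
    .save()
)
```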
- The entire configuration, such as AWS access keys, the PostgreSQL connection, and other input paths, is loaded from an aws_config.json file; keeping these values out of the code is considered good practice.
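A sketch of how the shared configuration is loaded; the key names below are assumptions about the layout of aws_config.json, not the actual file contents:

```python
import json

# Assumed layout of aws_config.json:
# {
#     "aws_access_key_id": "<key id>",
#     "aws_secret_access_key": "<secret>",
#     "postgres_jdbc_url": "jdbc:postgresql://localhost:5432/bidnamic",
#     "postgres_user": "<user>",
#     "postgres_password": "<password>",
#     "input_path": "raw/"
# }
with open("aws_config.json") as f:
    config = json.load(f)

aws_access_key_id = config["aws_access_key_id"]
postgres_jdbc_url = config["postgres_jdbc_url"]
input_path = config["input_path"]
```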
- hadoop-aws-3.3.1
- aws-java-sdk-bundle-1.11.901
- Python
- PostgreSQL