This project provides a comprehensive data pipeline solution to ETL Reddit data into an Amazon Redshift data warehouse. The pipeline leverages a combination of tools and services including Apache Airflow, Celery, PostgreSQL, Amazon S3, AWS Glue, Amazon Athena, and Amazon Redshift.
- AWS Account with appropriate permissions for S3, Glue, Athena, and Redshift.
- Reddit API credentials: you need your own reddit_secret_key and reddit_client_id (see the credential check sketch after this list).
- Docker Installation
- Python 3.9 (lower or higher versions might not work).
- Apache Airflow Fernet Key (specific secret key)
- PostgreSQL
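To sanity-check the Reddit API credentials listed above before wiring them into the pipeline, you can run a short script like the one below. This is a minimal sketch assuming the PRAW client (`pip install praw`); the project's own extraction code may use a different client, and the placeholder values are assumptions you should replace with your reddit_client_id and reddit_secret_key.

```python
# Minimal sketch: verify Reddit API credentials with PRAW (an assumption --
# the project's extraction code may use a different client library).
import praw

reddit = praw.Reddit(
    client_id="YOUR_REDDIT_CLIENT_ID",         # placeholder
    client_secret="YOUR_REDDIT_SECRET_KEY",    # placeholder
    user_agent="reddit_etl_credential_check",  # any descriptive string
)

# Read-only access is enough to confirm the credentials work.
for submission in reddit.subreddit("dataengineering").hot(limit=3):
    print(submission.title)
```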
Open a Python interpreter and run the code below.
- Referenced from https://cryptography.io/en/latest/fernet/
- Running this code prints the Fernet key; keep it in a secure place!
from cryptography.fernet import Fernet
fernet_key = Fernet.generate_key()
print(fernet_key.decode())
- Go to docker-compose.yml and find the lines below.
airflow-init:
  <<: *airflow-common
  command: >
    bash -c "pip install -r /opt/airflow/requirements.txt && airflow db init && airflow db upgrade && airflow users create --username admin --firstname admin --lastname admin --role Admin --email [email protected] --password admin"
  restart: "no"
- You can change the username and password; both are initially set to admin.
- Clone the repository.
git clone https://github.com/kmpark70/Reddit_Project.git
- Create a virtual environment.
python3 -m venv venv
- Activate the virtual environment.
On Windows:
venv\Scripts\activate
On MacOS/Linux:
source venv/bin/activate
- Install the dependencies.
pip install -r requirements.txt
- Create a folder named config and place the config.conf.example file inside that folder.
- If a Broken DAG error occurs, verify that the config file exists at:
.../config/config.conf.example
- Start the containers.
docker-compose up -d --build
- Launch the Airflow web UI.
open http://localhost:8080
- If you prefer to use the terminal, install the AWS CLI with the command below (the link that follows covers other platforms).
brew install awscli
https://docs.aws.amazon.com/cli/latest/userguide/getting-started-install.html
- Create an S3 bucket in AWS, then open config.conf and, under [aws], set aws_bucket_name to your bucket, e.g. cs4440-reddit-data-engineering. (A hedged boto3 sketch for creating the bucket follows below.)
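If you prefer to create the bucket from code rather than the AWS console, a minimal boto3 sketch is below. The bucket name and region are assumptions; substitute your own values.

```python
# Hedged sketch: create the S3 bucket with boto3 instead of the console.
import boto3

bucket_name = "cs4440-reddit-data-engineering"
region = "us-east-1"  # assumption -- change to your region

s3 = boto3.client("s3", region_name=region)
if region == "us-east-1":
    # us-east-1 does not accept an explicit LocationConstraint
    s3.create_bucket(Bucket=bucket_name)
else:
    s3.create_bucket(
        Bucket=bucket_name,
        CreateBucketConfiguration={"LocationConstraint": region},
    )
print("Created bucket:", bucket_name)
```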
- Next, go to Apache Airflow and click Admin -> XComs.
- The "return_value" / "reddit_extraction" entries there come from aws_S3_pipeline.py (a hedged XCom sketch follows below).
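For reference, the sketch below shows one way a downstream upload task can pull the reddit_extraction task's return value via XCom. The DAG id, task ids, and upload logic are assumptions for illustration; the real implementation lives in aws_S3_pipeline.py and the project's DAG file.

```python
# Hedged sketch of pulling the "return_value" XCom of the reddit_extraction task.
# DAG/task ids are assumptions; see aws_S3_pipeline.py for the actual code.
from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator


def upload_to_s3(ti):
    # "return_value" is the default XCom key for whatever a task returns.
    file_path = ti.xcom_pull(task_ids="reddit_extraction", key="return_value")
    print(f"Would upload {file_path} to S3 here")


with DAG(
    dag_id="xcom_example",
    start_date=datetime(2024, 1, 1),
    schedule=None,  # use schedule_interval=None on Airflow < 2.4
) as dag:
    upload_task = PythonOperator(task_id="s3_upload", python_callable=upload_to_s3)
```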
- Check that the CSV file has been successfully moved to the S3 bucket (a hedged boto3 check is sketched below).
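To check from code rather than the console, a hedged boto3 listing is below; the bucket name and key prefix are assumptions.

```python
# Hedged sketch: confirm the extracted CSV landed in the bucket.
import boto3

s3 = boto3.client("s3")
response = s3.list_objects_v2(
    Bucket="cs4440-reddit-data-engineering",  # assumption -- your bucket name
    Prefix="raw/",                            # assumption -- prefix of the uploaded CSVs
)
for obj in response.get("Contents", []):
    print(obj["Key"], obj["Size"])
```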
- Create a folder named Transformed_Data in the location where the raw files are stored.
- Using AWS Glue, set this location as the final destination for the transformed files.
- Click the orange Visual ETL button.
- Click on "Sources" and select "Amazon S3".
- Click on "Targets" and select "Amazon S3".
- The two boxes (source and target nodes) will then be connected.
- Click on the upper box, press "Browse", and choose the CSV file.
- Click on "Infer schema".
- The CSV file should now appear in the upper box.
- In the lower box (the destination schema), set the desired data format.
- Click on "Job details" and change the Name to "reddit_glue_job".
- Click on "Advanced properties" to ensure the script name is set to "reddit_glue_job.py".
- Click on "Script", press "Edit", confirm, and then add the transformation code between the AmazonS3 source node and the AmazonS3 target node (a hedged sketch follows below).
- Click on "Save", then press "Run". Next, open the "Runs" section.
- Once the run is marked "Succeeded", go to the S3 bucket and check that the output is in the Transformed_Data folder. Confirm that it was written properly.
- Go to AWS Glue Crawlers and create a new crawler named "reddit_crawler".
- Next, click "Add a data source" -> browse S3 -> choose the run output within the Transformed_Data folder.
- Click "Create new IAM role" and select AWSGlueServiceRole-reddit-glue-role.
- Click "Add database" -> reddit_db -> then click "Refresh" and select reddit_db.
- Click "Run crawler". It takes approximately 2 minutes; wait until the status shows "Completed". (A hedged boto3 alternative to the console steps is sketched below.)
- After that, go to AWS Glue -> Databases -> reddit_db -> check the table data.
- The Athena screen will open. From there, go to Amazon Athena -> Query editor -> Manage settings -> Browse S3 -> add another folder named "athena_scripts" to the S3 bucket -> then choose it and save.
- Create a workgroup named reddit-workgroup.
- Associated IAM roles: click "Manage IAM roles" -> choose S3.
- Once this is completed, click "Query data" in the top right corner (a hedged Python query sketch follows below).
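For reference, the same query can be issued from Python with boto3; the table name and output location below are assumptions (use the table your crawler created and the athena_scripts folder set up above).

```python
# Hedged sketch: run an Athena query against reddit_db from Python.
# Table name and output location are assumptions -- adjust to your setup.
import boto3

athena = boto3.client("athena")
response = athena.start_query_execution(
    QueryString="SELECT * FROM reddit_db.transformed_data LIMIT 10;",
    QueryExecutionContext={"Database": "reddit_db"},
    ResultConfiguration={"OutputLocation": "s3://cs4440-reddit-data-engineering/athena_scripts/"},
    WorkGroup="reddit-workgroup",
)
print("Query execution id:", response["QueryExecutionId"])
```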
- You can view the CSV and Parquet files in AWS Redshift in table format (see the sketch below for querying the table from Python).
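If you want to query the Redshift table from Python as well, a hedged psycopg2 sketch is below (Redshift speaks the PostgreSQL wire protocol). The host, credentials, and table name are assumptions; take them from your own cluster.

```python
# Hedged sketch: query the table in Redshift with psycopg2.
# Host, credentials, and table name are assumptions -- use your cluster's values.
import psycopg2

conn = psycopg2.connect(
    host="your-cluster.abc123.us-east-1.redshift.amazonaws.com",  # placeholder
    port=5439,  # Redshift's default port (plain PostgreSQL uses 5432)
    dbname="dev",
    user="awsuser",
    password="your_password",
)
with conn, conn.cursor() as cur:
    cur.execute("SELECT * FROM reddit_data LIMIT 10;")  # table name is an assumption
    for row in cur.fetchall():
        print(row)
conn.close()
```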
- Specify the database system(s) and version(s) used, along with installation instructions (a link to official documentation will be enough).
- PostgreSQL; make sure the port number is 5432.
- AWS S3 and Redshift; S3 access is handled via s3fs (https://pypi.org/project/s3fs/).
2, 3. Describe how to acquire project data. Include a small dataset sample (< 5 MB) or provide scripts to download/scrape/process the data. How do we load this data into the database system?
- When you open the data folder, you'll find input and output sub-folders. After running the first ETL, the input folder contains the raw log files, while the output folder contains the result of the ETL process, which is a CSV file.
- You can download those files, or read them directly from S3 as sketched below.
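Because S3 access goes through s3fs (linked above), the output CSV can also be read straight from the bucket without downloading it first. The bucket name and object key below are placeholders.

```python
# Hedged sketch: read the ETL output CSV directly from S3.
# pandas delegates s3:// paths to s3fs; the bucket and key are placeholders.
import pandas as pd

df = pd.read_csv("s3://cs4440-reddit-data-engineering/raw/reddit_output.csv")
print(df.head())
```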
- Do you have some scripts to do that? If so, how do we execute them?
- Please follow the installation guide above.
- Did you use some tools for loading? If so, provide appropriate details and links.
- We don't use any tools for loading, but the Parquet files can be downloaded from within the AWS ecosystem.
- If you are benchmarking different database systems, detail any configuration modifications made.
- You have to keep the port number and PostgreSQL settings from our initial configuration.
- If generating your own data, include a sample of the synthetic dataset/database.
- We don't use any data of our own; the data comes from the real-time Reddit API.