This project provides a comprehensive data pipeline solution to ETL Reddit data into an Amazon Redshift data warehouse. The pipeline leverages a combination of tools and services including Apache Airflow, Celery, PostgreSQL, Amazon S3, AWS Glue, Amazon Athena, and Amazon Redshift.
- AWS Account with appropriate permissions for S3, Glue, Athena, and Redshift.
- Reddit API credentials: you need your own reddit_secret_key and reddit_client_id (see the credential check sketch after this list).
- Docker Installation
- Python 3.9 (lower or higher versions might not work).
- Apache Airflow Fernet Key (specific secret key)
- PostgreSQL
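To sanity-check the Reddit API credentials listed above before wiring them into the pipeline, you can run a short script like the one below. This is a minimal sketch assuming the PRAW client (`pip install praw`); the project's own extraction code may use a different client, and the placeholder values are assumptions you should replace with your reddit_client_id and reddit_secret_key.

```python
# Minimal sketch: verify Reddit API credentials with PRAW (an assumption --
# the project's extraction code may use a different client library).
import praw

reddit = praw.Reddit(
    client_id="YOUR_REDDIT_CLIENT_ID",         # placeholder
    client_secret="YOUR_REDDIT_SECRET_KEY",    # placeholder
    user_agent="reddit_etl_credential_check",  # any descriptive string
)

# Read-only access is enough to confirm the credentials work.
for submission in reddit.subreddit("dataengineering").hot(limit=3):
    print(submission.title)
```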
Open a Python interpreter and run the code below.
- Referenced from https://cryptography.io/en/latest/fernet/
- Running this code prints the Fernet key; keep it in a secure place!
from cryptography.fernet import Fernet
fernet_key = Fernet.generate_key()
print(fernet_key.decode())
- Go to docker-compose.yml and find the lines below.
airflow-init:
  <<: *airflow-common
  command: >
    bash -c "pip install -r /opt/airflow/requirements.txt && airflow db init && airflow db upgrade && airflow users create --username admin --firstname admin --lastname admin --role Admin --email [email protected] --password admin"
  restart: "no"
- You can change the username and password; both are initially set to admin.
- Clone the repository.
git clone https://github.com/kmpark70/Reddit_Project.git
- Create a virtual environment.
python3 -m venv venv
- Activate the virtual environment.
On Windows:
venv\Scripts\activate
On MacOS/Linux:
source venv/bin/activate
- Install the dependencies.
pip install -r requirements.txt
- Create a folder named config and place the config.conf.example file inside that folder.
- If a Broken DAG error occurs, verify that the config file exists at:
.../config/config.conf.example
- Start the containers.
docker-compose up -d --build
- Launch the Airflow web UI.
open http://localhost:8080
- If you prefer to use the terminal, install the AWS CLI with the command below (the link that follows covers other platforms).
brew install awscli
https://docs.aws.amazon.com/cli/latest/userguide/getting-started-install.html
- Create an S3 bucket in AWS, then open config.conf and, under [aws], set aws_bucket_name to your bucket, e.g. cs4440-reddit-data-engineering. (A hedged boto3 sketch for creating the bucket follows below.)
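If you prefer to create the bucket from code rather than the AWS console, a minimal boto3 sketch is below. The bucket name and region are assumptions; substitute your own values.

```python
# Hedged sketch: create the S3 bucket with boto3 instead of the console.
import boto3

bucket_name = "cs4440-reddit-data-engineering"
region = "us-east-1"  # assumption -- change to your region

s3 = boto3.client("s3", region_name=region)
if region == "us-east-1":
    # us-east-1 does not accept an explicit LocationConstraint
    s3.create_bucket(Bucket=bucket_name)
else:
    s3.create_bucket(
        Bucket=bucket_name,
        CreateBucketConfiguration={"LocationConstraint": region},
    )
print("Created bucket:", bucket_name)
```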
- Next, go to Apache Airflow and click Admin -> XComs.
- The "return_value" / "reddit_extraction" entries there come from aws_S3_pipeline.py (a hedged XCom sketch follows below).
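For reference, the sketch below shows one way a downstream upload task can pull the reddit_extraction task's return value via XCom. The DAG id, task ids, and upload logic are assumptions for illustration; the real implementation lives in aws_S3_pipeline.py and the project's DAG file.

```python
# Hedged sketch of pulling the "return_value" XCom of the reddit_extraction task.
# DAG/task ids are assumptions; see aws_S3_pipeline.py for the actual code.
from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator


def upload_to_s3(ti):
    # "return_value" is the default XCom key for whatever a task returns.
    file_path = ti.xcom_pull(task_ids="reddit_extraction", key="return_value")
    print(f"Would upload {file_path} to S3 here")


with DAG(
    dag_id="xcom_example",
    start_date=datetime(2024, 1, 1),
    schedule=None,  # use schedule_interval=None on Airflow < 2.4
) as dag:
    upload_task = PythonOperator(task_id="s3_upload", python_callable=upload_to_s3)
```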
- Check that the CSV file has been successfully moved to the S3 bucket (a hedged boto3 check is sketched below).
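To check from code rather than the console, a hedged boto3 listing is below; the bucket name and key prefix are assumptions.

```python
# Hedged sketch: confirm the extracted CSV landed in the bucket.
import boto3

s3 = boto3.client("s3")
response = s3.list_objects_v2(
    Bucket="cs4440-reddit-data-engineering",  # assumption -- your bucket name
    Prefix="raw/",                            # assumption -- prefix of the uploaded CSVs
)
for obj in response.get("Contents", []):
    print(obj["Key"], obj["Size"])
```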
- Create a folder named Transformed_Data in the location where the raw files are stored.
- Using AWS Glue, set this location as the final destination for the transformed files.
- Click the orange Visual ETL button.
- Click on "Sources" and select "Amazon S3".
- Click on "Targets" and select "Amazon S3".
- The two boxes (source and target nodes) will then be connected.
- Click on the upper box, press "Browse", and choose the CSV file.
- Click on "Infer schema".
- The CSV file should now appear in the upper box.
- In the lower box (the destination schema), set the desired data format.
- Click on "Job details" and change the Name to "reddit_glue_job".
- Click on "Advanced properties" to ensure the script name is set to "reddit_glue_job.py".
- Click on "Script", press "Edit", confirm, and then add the transformation code between the AmazonS3 source node and the AmazonS3 target node (a hedged sketch follows below).
- Click on "Save", then press "Run". Next, open the "Runs" section.
- Once the run is marked "Succeeded", go to the S3 bucket and check that the output is in the Transformed_Data folder. Confirm that it was written properly.
- Go to AWS Glue Crawlers and create a new crawler named "reddit_crawler".
- Next, click "Add a data source" -> browse S3 -> choose the run output within the Transformed_Data folder.
- Click "Create new IAM role" and select AWSGlueServiceRole-reddit-glue-role.
- Click "Add database" -> reddit_db -> then click "Refresh" and select reddit_db.
- Click "Run crawler". It takes approximately 2 minutes; wait until the status shows "Completed". (A hedged boto3 alternative to the console steps is sketched below.)
- After that, go to AWS Glue -> Databases -> reddit_db -> check the table data.
- The Athena screen will open. From there, go to Amazon Athena -> Query editor -> Manage settings -> Browse S3 -> add another folder named "athena_scripts" to the S3 bucket -> then choose it and save.
- Create a workgroup named reddit-workgroup.
- Associated IAM roles: click "Manage IAM roles" -> choose S3.
- Once this is completed, click "Query data" in the top right corner (a hedged Python query sketch follows below).
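For reference, the same query can be issued from Python with boto3; the table name and output location below are assumptions (use the table your crawler created and the athena_scripts folder set up above).

```python
# Hedged sketch: run an Athena query against reddit_db from Python.
# Table name and output location are assumptions -- adjust to your setup.
import boto3

athena = boto3.client("athena")
response = athena.start_query_execution(
    QueryString="SELECT * FROM reddit_db.transformed_data LIMIT 10;",
    QueryExecutionContext={"Database": "reddit_db"},
    ResultConfiguration={"OutputLocation": "s3://cs4440-reddit-data-engineering/athena_scripts/"},
    WorkGroup="reddit-workgroup",
)
print("Query execution id:", response["QueryExecutionId"])
```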
- You can view the CSV and Parquet files in AWS Redshift in table format (see the sketch below for querying the table from Python).
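If you want to query the Redshift table from Python as well, a hedged psycopg2 sketch is below (Redshift speaks the PostgreSQL wire protocol). The host, credentials, and table name are assumptions; take them from your own cluster.

```python
# Hedged sketch: query the table in Redshift with psycopg2.
# Host, credentials, and table name are assumptions -- use your cluster's values.
import psycopg2

conn = psycopg2.connect(
    host="your-cluster.abc123.us-east-1.redshift.amazonaws.com",  # placeholder
    port=5439,  # Redshift's default port (plain PostgreSQL uses 5432)
    dbname="dev",
    user="awsuser",
    password="your_password",
)
with conn, conn.cursor() as cur:
    cur.execute("SELECT * FROM reddit_data LIMIT 10;")  # table name is an assumption
    for row in cur.fetchall():
        print(row)
conn.close()
```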
- Specify the database system(s) and version(s) used, along with installation instructions (a link to official documentation will be enough).
- PostgreSQL; make sure the port number is 5432.
- AWS S3 and Redshift; S3 access is handled via s3fs (https://pypi.org/project/s3fs/).
2, 3. Describe how to acquire project data. Include a small dataset sample (< 5 MB) or provide scripts to download/scrape/process the data. How do we load this data into the database system?
- When you open the data folder, you'll find input and output sub-folders. After running the first ETL, the input folder contains the raw log files, while the output folder contains the result of the ETL process, which is a CSV file.
- You can download those files, or read them directly from S3 as sketched below.
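Because S3 access goes through s3fs (linked above), the output CSV can also be read straight from the bucket without downloading it first. The bucket name and object key below are placeholders.

```python
# Hedged sketch: read the ETL output CSV directly from S3.
# pandas delegates s3:// paths to s3fs; the bucket and key are placeholders.
import pandas as pd

df = pd.read_csv("s3://cs4440-reddit-data-engineering/raw/reddit_output.csv")
print(df.head())
```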
- Do you have some scripts to do that? If so, how do we execute them?
- Please follow the installation guide above.
- Did you use some tools for loading? If so, provide appropriate details and links.
- We don't use any tools for loading, but the Parquet files can be downloaded from within the AWS ecosystem.
- If you are benchmarking different database systems, detail any configuration modifications made.
- You have to keep the port number and PostgreSQL settings from our initial configuration.
- If generating your own data, include a sample of the synthetic dataset/database.
- We don't use any data of our own; the data comes from the real-time Reddit API.