This project demonstrates an automated ETL (Extract, Transform, Load) pipeline built on Apache Airflow, which uses Directed Acyclic Graphs (DAGs) to orchestrate the workflow. The pipeline collects fashion-related data from multiple sources, integrates it, and analyzes it to predict seasonal trends and generate reports. Trend data is collected for keywords such as *fur jacket*, *cardigan*, *coat*, and so on.
📁 Directory Layout (where to look)
```
ETL-Fashion-Tren-Analysis/
│
├── Analysis/         # Data visualization
├── DAGs/             # Airflow DAG definitions
├── Privacy Policy/   # Privacy policy app for authorizing the Pinterest access token
└── Script ETL/       # Functions run by the automation
```
- **Extract**: Data is collected from two sources:
  - Web scraping from X (formerly Twitter)
  - API calls to Pinterest
- **Transform**:
  - X data: CSV data from X is cleaned and converted into a time-series format of engagement metrics
  - Pinterest data: JSON data from Pinterest is flattened and structured to extract growth-related insights
- **Load**: The transformed and integrated data is stored in a PostgreSQL database (the data warehouse)
- **Analyze and Report**: The data warehouse serves as the foundation for generating trend insights and visualizations to predict upcoming fashion trends
Apache Airflow orchestrates each step of the ETL process through DAGs.
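A DAG for this pipeline could be wired up as sketched below. The `dag_id`, task names, and placeholder callables are illustrative assumptions, not the project's actual code; the `schedule` argument shown requires Airflow 2.4+ (older versions use `schedule_interval`).

```python
# Hypothetical DAG sketch -- task names and callables are assumptions.
from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator


def extract():
    pass  # scrape X, call the Pinterest API


def transform():
    pass  # clean CSV, flatten JSON, merge into one DataFrame


def load():
    pass  # write the merged DataFrame to PostgreSQL


with DAG(
    dag_id="fashion_trend_etl",
    start_date=datetime(2024, 1, 1),
    schedule="@weekly",
    catchup=False,
) as dag:
    t_extract = PythonOperator(task_id="extract", python_callable=extract)
    t_transform = PythonOperator(task_id="transform", python_callable=transform)
    t_load = PythonOperator(task_id="load", python_callable=load)

    # Enforce the E -> T -> L ordering.
    t_extract >> t_transform >> t_load
```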
The pipeline pulls data from two primary sources:
- X (formerly Twitter): Data is extracted via web scraping and exported as a CSV file containing user engagement metrics and other relevant attributes.
- Pinterest: Data is fetched via an API, returning JSON files with growth trends and user-interaction data.

These sources provide complementary datasets for fashion trend analysis.
- Web Scraping (X): Engagement metrics such as likes, retweets, and comments are collected and saved as a CSV file.
- API Call (Pinterest): JSON data is retrieved, including attributes describing growth trends and trend behavior over time (yearly, monthly, weekly)
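A Pinterest-style trend payload might be parsed into flat rows as below. The field names (`keyword`, `weekly`, `date`, `growth`) are illustrative assumptions, not Pinterest's actual response schema.

```python
def parse_trend(payload: dict) -> list[dict]:
    """Flatten one keyword's trend entry into (keyword, week, growth) rows.

    Field names are assumed for illustration only.
    """
    keyword = payload["keyword"]
    return [
        {"keyword": keyword, "week": point["date"], "growth": point["growth"]}
        for point in payload["weekly"]
    ]


# A made-up sample response for one keyword.
sample = {
    "keyword": "fur jacket",
    "weekly": [
        {"date": "2024-01-01", "growth": 12},
        {"date": "2024-01-08", "growth": 15},
    ],
}
rows = parse_trend(sample)
```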
- Cleaning and reformatting
- CSV Data: Engagement data is cleaned and converted into a time-series format for weekly analysis
- JSON Data: Nested structures in JSON are flattened, and key-value pairs are extracted for time-series growth metrics
- Data integration
- The cleaned datasets are merged into a unified pandas DataFrame with consistent time intervals
- Columns such as `time_weekly`, `data_tweet`, and `data_pinterest` are created to combine metrics from both sources
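The integration step might look like the merge below, producing the `time_weekly`, `data_tweet`, and `data_pinterest` columns named above. The sample values are invented for illustration.

```python
import pandas as pd

# Weekly engagement from X (assumed sample values).
x_weekly = pd.DataFrame({
    "time_weekly": pd.to_datetime(["2024-01-07", "2024-01-14"]),
    "data_tweet": [15, 8],
})

# Weekly growth from Pinterest (assumed sample values).
pin_weekly = pd.DataFrame({
    "time_weekly": pd.to_datetime(["2024-01-07", "2024-01-14"]),
    "data_pinterest": [12, 15],
})

# Outer merge keeps weeks present in either source.
combined = x_weekly.merge(pin_weekly, on="time_weekly", how="outer")
```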
- The integrated DataFrame is uploaded into a PostgreSQL database serving as the data warehouse.
- The database provides a centralized storage solution for the processed data, enabling efficient querying and further analysis.
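The load step can be sketched with `DataFrame.to_sql`. In production this would target the PostgreSQL warehouse via a SQLAlchemy engine; an in-memory SQLite connection stands in here so the sketch runs without a database server, and the table name is an assumption.

```python
import sqlite3

import pandas as pd

combined = pd.DataFrame({
    "time_weekly": ["2024-01-07", "2024-01-14"],
    "data_tweet": [15, 8],
    "data_pinterest": [12, 15],
})

# Production would use something like:
#   sqlalchemy.create_engine("postgresql+psycopg2://user:pass@host/dbname")
# SQLite is a stand-in so this sketch is self-contained.
conn = sqlite3.connect(":memory:")
combined.to_sql("fashion_trends", conn, if_exists="replace", index=False)

# Read back to confirm the load.
back = pd.read_sql("SELECT * FROM fashion_trends", conn)
```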
- YouTube: Watch on YouTube
- Article Post: Read the Article Post