A mini end-to-end ETL project.
This project implements a manually triggered (i.e. no scheduling) ETL pipeline for an online retail store, which:

- Extracts and transforms a source `.xlsx` file from the UCI Machine Learning Repository.
- Loads the transformed data into tables of a data warehouse.
- Includes a simple dashboard as a report.
The data warehouse schema is designed following Kimball's dimensional modelling technique, based on the data available in the source `.xlsx` file.
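For a rough idea of how such a pipeline can be structured, here is a minimal sketch using Airflow's TaskFlow API and pandas. Everything named below (DAG ID, file paths, connection string, column and table names) is an illustrative assumption rather than what this repo necessarily uses, and it assumes a recent Airflow 2.x:

```python
import pandas as pd
from airflow.decorators import dag, task
from pendulum import datetime
from sqlalchemy import create_engine


# schedule=None means the DAG never runs on a schedule -- it only runs
# when triggered manually, matching this project's design.
@dag(schedule=None, start_date=datetime(2024, 1, 1), catchup=False)
def etl_online_retail():
    @task
    def extract() -> str:
        # Read the source .xlsx file (placeholder path; needs openpyxl)
        # and stage it for downstream tasks.
        df = pd.read_excel("/data/online_retail.xlsx")
        df.to_parquet("/tmp/staging.parquet")
        return "/tmp/staging.parquet"

    @task
    def transform(staging_path: str) -> str:
        # Example cleanup: drop rows that have no customer ID.
        df = pd.read_parquet(staging_path)
        df = df.dropna(subset=["CustomerID"])
        df.to_parquet("/tmp/transformed.parquet")
        return "/tmp/transformed.parquet"

    @task
    def load(transformed_path: str) -> None:
        # Load into hypothetical Kimball-style tables: a dimension table
        # per entity plus a central fact table.
        engine = create_engine("postgresql://user:password@warehouse:5432/dwh")
        df = pd.read_parquet(transformed_path)
        dim_customer = df[["CustomerID", "Country"]].drop_duplicates()
        dim_customer.to_sql("dim_customer", engine, if_exists="replace", index=False)
        df.to_sql("fct_sales", engine, if_exists="append", index=False)

    load(transform(extract()))


etl_online_retail()
```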
- Apache Airflow for data orchestration.
- pandas as the main data processing library.
- Apache Superset for data visualization.
- Docker and Docker Compose for the containerized approach.
- Clone this repo and navigate to the project directory:

```
git clone https://github.com/minkminkk/etl-online-retail
cd etl-online-retail
```

- Build images and start containers:

```
docker compose up
```

If you already have the containers set up and do not want to recreate them, run:

```
docker compose start
```
The webservers are available at:

- Airflow webserver: `localhost:8080`.
- Superset webserver: `localhost:8088`.
Note: Both webservers use the same login credentials:

- Username: `admin`.
- Password: `admin`.
After you are done, if you want to simply stop the containers, run:

```
docker compose stop
```

Then next time, you can run `docker compose start` to restart your containers.

If you want to remove all containers, run:

```
docker compose down
```
After executing part 3.1, upon accessing the Airflow webserver you will be prompted to log in. Refer to part 3.2 for the username and password.

After logging in, you will be taken to the main Airflow UI.

To run the DAG, just click the "play" button in the Actions column.
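If you prefer triggering runs from code instead of the UI, Airflow also exposes a stable REST API. A minimal sketch, assuming the hypothetical `dag_id` below (check the DAG list in your UI for the real one) and that this deployment has the basic-auth API backend enabled:

```python
import requests

# Trigger a manual DAG run; credentials are the admin/admin pair from part 3.2.
response = requests.post(
    "http://localhost:8080/api/v1/dags/etl_online_retail/dagRuns",
    auth=("admin", "admin"),
    json={"conf": {}},  # no extra run-time configuration
)
response.raise_for_status()
print(response.json()["dag_run_id"])  # ID of the newly created run
```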
Once the DAG run has finished, access the Superset webserver (`localhost:8088`) and log in using the credentials in part 3.2. Then open the available dashboard to see the visualization.

You can also create your own dashboards here!