A data pipeline with Kafka, Spark Streaming, dbt, Docker, Airflow, GCP and more.
The project streams data from a simulation that sells event tickets and builds a data pipeline that consumes the real-time data. The data is processed in real time and stored in the data lake every two minutes. An hourly batch job then consumes this data, applies transformations, and creates tables in the data warehouse for analytics and reporting. We analyze basic attributes of the data such as total users, average waiting time, etc.
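As a rough illustration of the streaming half of this pipeline, the sketch below shows a minimal PySpark Structured Streaming job that reads ticket events from Kafka and lands them in the data lake on a two-minute trigger. The topic name, bucket paths, and event schema are assumptions for illustration, not the project's actual configuration.

```python
# Minimal sketch of the streaming job: Kafka -> data lake every two minutes.
# Topic name, bucket paths and schema are illustrative assumptions.
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, from_json
from pyspark.sql.types import StructType, StructField, StringType, DoubleType, TimestampType

spark = SparkSession.builder.appName("ticket-events-stream").getOrCreate()

# Hypothetical schema for an event produced by the simulator.
event_schema = StructType([
    StructField("user_id", StringType()),
    StructField("event_type", StringType()),      # e.g. "wait" or "buy"
    StructField("wait_seconds", DoubleType()),
    StructField("ts", TimestampType()),
])

raw = (spark.readStream
       .format("kafka")
       .option("kafka.bootstrap.servers", "KAFKA_VM_IP:9092")   # placeholder address
       .option("subscribe", "ticket_events")                    # assumed topic name
       .load())

events = (raw
          .select(from_json(col("value").cast("string"), event_schema).alias("e"))
          .select("e.*"))

# Micro-batches flushed to the data lake every two minutes, as described above.
query = (events.writeStream
         .format("parquet")
         .option("path", "gs://your-data-lake-bucket/ticket_events/")                 # placeholder bucket
         .option("checkpointLocation", "gs://your-data-lake-bucket/checkpoints/ticket_events/")
         .trigger(processingTime="2 minutes")
         .start())

query.awaitTermination()
```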
Ticketsim is inspired by an article by Kevin Brown. Using simpy, the program generates wait and buy-ticket events.
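A stripped-down version of what such a simulator does with simpy might look like the sketch below; the arrival rate, service time, event fields, and the print-as-producer stand-in are illustrative assumptions, not Ticketsim's real code.

```python
# Illustrative simpy sketch: each customer queues for a ticket seller and a
# "buy" event is emitted with the measured wait. Rates and fields are assumptions.
import json
import random
import simpy

def customer(env, name, sellers):
    arrived = env.now
    with sellers.request() as slot:
        yield slot                                  # wait for a free ticket seller
        wait = env.now - arrived
        yield env.timeout(random.uniform(0.5, 2))   # time to complete the purchase
        event = {"user_id": name, "event_type": "buy", "wait_seconds": wait, "ts": env.now}
        print(json.dumps(event))                    # real code would send this to Kafka

def arrivals(env, sellers):
    i = 0
    while True:
        yield env.timeout(random.expovariate(1.0))  # roughly one customer per time unit
        i += 1
        env.process(customer(env, f"user-{i}", sellers))

env = simpy.Environment()
sellers = simpy.Resource(env, capacity=2)           # two ticket sellers
env.process(arrivals(env, sellers))
env.run(until=60)                                   # simulate 60 time units
```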
- Cloud - Google Cloud Platform
- Infrastructure as Code software - Terraform
- Containerization - Docker, Docker Compose
- Stream Processing - Kafka, Spark Structured Streaming
- Orchestration - Airflow
- Transformation - dbt
- Data Lake - Google Cloud Storage
- Data Warehouse - BigQuery
- Data Visualization - Data Studio
- Language - Python
You can view the dashboard here.
In this project, I used the $300 free credit offered when creating a new GCP account. The project consists of 3 VM instances: one Ubuntu instance running Ticketsim and the Kafka stack, one Dataproc cluster running Spark jobs, and one Ubuntu instance running Airflow to orchestrate the periodic jobs on the data lake and data warehouse. The VM names are listed in the picture.
- Google Cloud Platform
- Terraform
- Setup GCP - Setup
- Setup infrastructure using Terraform - Setup
- Setup Kafka Compute Instance and start sending messages from Ticketsim - Setup (a minimal producer sketch follows this list)
- Setup Spark Cluster for stream processing - Setup
- Setup Airflow on Compute Instance to trigger the hourly data pipeline - Setup (see the DAG sketch after this list)
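For the Kafka step above, the simulator's events can be pushed to a topic with a small producer along these lines; the broker address, topic name, and event shape are assumptions rather than the project's actual configuration.

```python
# Minimal kafka-python producer sketch for publishing simulator events.
# Broker address, topic name and the event dict are illustrative assumptions.
import json
from kafka import KafkaProducer

producer = KafkaProducer(
    bootstrap_servers="KAFKA_VM_IP:9092",                        # placeholder address
    value_serializer=lambda v: json.dumps(v).encode("utf-8"),
)

event = {"user_id": "user-1", "event_type": "buy", "wait_seconds": 3.2}
producer.send("ticket_events", value=event)                      # assumed topic name
producer.flush()
```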
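And for the hourly orchestration step, a bare-bones Airflow DAG could load the latest data-lake files into BigQuery and then run the dbt models; the bucket, dataset, table, and dbt project path below are placeholder assumptions, not the project's real configuration.

```python
# Bare-bones hourly DAG sketch: load new data-lake files into BigQuery, then run dbt.
# Bucket, dataset, table and dbt project path are placeholder assumptions.
from datetime import datetime
from airflow import DAG
from airflow.operators.bash import BashOperator
from airflow.providers.google.cloud.transfers.gcs_to_bigquery import GCSToBigQueryOperator

with DAG(
    dag_id="hourly_ticket_pipeline",
    start_date=datetime(2022, 1, 1),
    schedule_interval="@hourly",
    catchup=False,
) as dag:

    load_to_bq = GCSToBigQueryOperator(
        task_id="load_events_to_bq",
        bucket="your-data-lake-bucket",                            # placeholder bucket
        source_objects=["ticket_events/*.parquet"],
        destination_project_dataset_table="your_dataset.staging_ticket_events",
        source_format="PARQUET",
        write_disposition="WRITE_APPEND",
    )

    run_dbt = BashOperator(
        task_id="run_dbt_models",
        bash_command="cd /opt/dbt/ticket_project && dbt run",      # placeholder dbt project path
    )

    load_to_bq >> run_dbt
```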
I'd like to thank DataTalks.Club for offering this Data Engineering course completely free. The course material helped me kickstart this project. I also want to thank Ankur for his Streamify project; I borrowed a lot of his code and ideas to study concepts in the Data Engineering field.