This project demonstrates a real-time data processing pipeline for ride events built with Kafka, Apache Flink, ClickHouse, and Grafana. It simulates ride event ingestion, processes the stream in real time, and visualizes key metrics.
The pipeline consists of the following components:
- Data Ingestor (Python): Simulates ride events and publishes them to Kafka.
- Event Storage (Kafka): Acts as a message broker to store ride event streams.
- Real-Time Processor (Apache Flink): Consumes ride events from Kafka, processes them in real time, and writes the results back to Kafka.
- Kafka to ClickHouse Integration: ClickHouse's Kafka table engine ingests the processed events from Kafka into ClickHouse.
- Real-Time Data Warehouse (ClickHouse): Stores processed ride event data for fast querying.
- Visualization (Grafana): Connects to ClickHouse to visualize ride metrics in real time.
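The ingestor component can be sketched in a few lines of Python. This is an illustrative sketch, not the project's actual code: the topic name `ride-events`, the event fields, the broker address, and the use of the `kafka-python` package are all assumptions.

```python
import json
import random
import time
import uuid


def generate_ride_event():
    """Build one simulated ride event (fields are illustrative)."""
    return {
        "event_id": str(uuid.uuid4()),
        "ride_id": random.randint(1, 10_000),
        "event_type": random.choice(["requested", "started", "completed", "cancelled"]),
        "fare": round(random.uniform(3.0, 60.0), 2),
        "timestamp": time.time(),
    }


def publish_events(bootstrap_servers="localhost:9092", topic="ride-events"):
    """Continuously publish simulated events to Kafka.

    Assumes a reachable broker and the kafka-python package; both are
    assumptions about this project's setup.
    """
    from kafka import KafkaProducer

    producer = KafkaProducer(
        bootstrap_servers=bootstrap_servers,
        value_serializer=lambda v: json.dumps(v).encode("utf-8"),
    )
    while True:
        producer.send(topic, generate_ride_event())
        time.sleep(1.0)
```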
The data flows through the pipeline as follows:
- The Python ingestor generates ride event data and publishes it to a Kafka topic.
- Flink reads from Kafka, processes the events (e.g., aggregations, filtering), and writes the results back to Kafka.
- ClickHouse consumes processed ride event data from Kafka using the Kafka table engine.
- Grafana queries ClickHouse to visualize real-time ride analytics.
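The kind of processing the Flink job performs can be illustrated with a pure-Python sketch of a tumbling-window aggregation. This is only a model of the logic: the window size and per-event-type counts are assumptions, and a real Flink job would express this through its DataStream or Table API.

```python
from collections import Counter, defaultdict


def tumbling_window_counts(events, window_seconds=60):
    """Group events into fixed (tumbling) time windows by timestamp and
    count events per type in each window -- the shape of aggregate a
    streaming job might write back to Kafka."""
    windows = defaultdict(Counter)
    for event in events:
        window_start = int(event["timestamp"] // window_seconds) * window_seconds
        windows[window_start][event["event_type"]] += 1
    return {start: dict(counts) for start, counts in sorted(windows.items())}


# Example: three events falling into two one-minute windows.
events = [
    {"timestamp": 0, "event_type": "requested"},
    {"timestamp": 30, "event_type": "started"},
    {"timestamp": 75, "event_type": "completed"},
]
```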
Requirements:
- Docker & Docker Compose
- Python >= 3.9
- Apache Kafka
- Apache Flink
- ClickHouse
- Grafana
Usage:
- Start services: `make up`
- Stop and clean services: `make down`
- Full cleanup (remove all Docker images): `make cleanup`
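Once the services are up, ClickHouse typically consumes the processed topic through a Kafka-engine table paired with a materialized view that copies rows into a MergeTree table for querying. The DDL below is a sketch of that pattern only; the table, topic, and column names are assumptions and may differ from this project's actual setup.

```sql
-- Kafka-engine table: a live view over the processed topic (names assumed).
CREATE TABLE ride_events_queue (
    window_start DateTime,
    event_type String,
    event_count UInt64
) ENGINE = Kafka
SETTINGS kafka_broker_list = 'kafka:9092',
         kafka_topic_list = 'processed-ride-events',
         kafka_group_name = 'clickhouse-consumer',
         kafka_format = 'JSONEachRow';

-- Durable storage for fast queries from Grafana.
CREATE TABLE ride_events (
    window_start DateTime,
    event_type String,
    event_count UInt64
) ENGINE = MergeTree
ORDER BY (window_start, event_type);

-- Materialized view that continuously moves rows from Kafka into storage.
CREATE MATERIALIZED VIEW ride_events_mv TO ride_events AS
SELECT window_start, event_type, event_count
FROM ride_events_queue;
```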
Feel free to open issues or submit pull requests to improve the project!