Data Engineering Systems Design
Source Database (PostgreSQL):
- Store the schema and initial data in a PostgreSQL database.

Message Broker (Kafka):
- Capture changes from PostgreSQL and publish them to Kafka topics.

Data Warehouse (ClickHouse):
- Consume messages from Kafka and load them into ClickHouse for OLAP queries.

Dashboard:
- Use a BI tool (e.g., Grafana) to visualize the data in ClickHouse in real time.
Set Up PostgreSQL with Docker: Create a Docker container for PostgreSQL and load the provided schema.
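A minimal docker-compose service for this step might look like the sketch below. The service name, credentials, and database name are placeholders rather than the values used in this repository; the build context assumes the docker/Dockerfile.postgres shown in the project tree. Note that Debezium-based CDC requires logical replication to be enabled.

```yaml
# Sketch only: names and credentials are placeholders.
services:
  postgres:
    build:
      context: ./docker
      dockerfile: Dockerfile.postgres
    environment:
      POSTGRES_USER: app          # placeholder credentials
      POSTGRES_PASSWORD: app
      POSTGRES_DB: rembo          # placeholder database name
    # Logical replication is required for Debezium-based CDC.
    command: ["postgres", "-c", "wal_level=logical"]
    ports:
      - "5432:5432"
```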
Set Up Kafka with Docker: Create a Docker container for Kafka to handle data streaming.
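A single-broker Kafka service can be sketched as below; the Confluent images and listener settings are illustrative assumptions and may not match what this repository's docker-compose.yml actually uses.

```yaml
# Sketch only: a single-broker, ZooKeeper-based Kafka setup for local development.
services:
  zookeeper:
    image: confluentinc/cp-zookeeper:7.5.0
    environment:
      ZOOKEEPER_CLIENT_PORT: 2181
  kafka:
    image: confluentinc/cp-kafka:7.5.0
    depends_on:
      - zookeeper
    environment:
      KAFKA_BROKER_ID: 1
      KAFKA_ZOOKEEPER_CONNECT: zookeeper:2181
      # Advertise the in-network hostname so other containers can reach the broker.
      KAFKA_ADVERTISED_LISTENERS: PLAINTEXT://kafka:9092
      # A single broker needs a replication factor of 1 for internal topics.
      KAFKA_OFFSETS_TOPIC_REPLICATION_FACTOR: 1
    ports:
      - "9092:9092"
```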
Set Up ClickHouse with Docker: Create a Docker container for ClickHouse to act as the OLAP database.
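A corresponding ClickHouse service sketch, assuming the docker/Dockerfile.clickhouse from the project tree and the default ClickHouse HTTP (8123) and native (9000) ports; the log volume mapping mirrors the clickhouse-server-logs/ directory in the tree.

```yaml
# Sketch only: ClickHouse as the OLAP store, with server logs mounted to the host.
services:
  clickhouse:
    build:
      context: ./docker
      dockerfile: Dockerfile.clickhouse
    ports:
      - "8123:8123"   # HTTP interface
      - "9000:9000"   # native protocol
    volumes:
      - ./clickhouse-server-logs:/var/log/clickhouse-server
```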
Data Pipeline:
- Use Kafka Connect to capture changes from PostgreSQL (source connector sketched below).
- Set up a Kafka Connect ClickHouse sink to load data into ClickHouse (sink connector sketched below).
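As a sketch of the source side (assuming a Kafka Connect worker with the Debezium PostgreSQL plugin installed), the hostnames, credentials, database name, and topic prefix below are placeholders rather than values from this repository:

```json
{
  "name": "postgres-source",
  "config": {
    "connector.class": "io.debezium.connector.postgresql.PostgresConnector",
    "database.hostname": "postgres",
    "database.port": "5432",
    "database.user": "app",
    "database.password": "app",
    "database.dbname": "rembo",
    "plugin.name": "pgoutput",
    "topic.prefix": "rembo",
    "table.include.list": "public.*"
  }
}
```

The sink side could use the ClickHouse Kafka Connect sink connector; property names differ between sink connectors and versions, so treat the keys below as assumptions to be checked against the connector's documentation. The topic name follows Debezium's prefix.schema.table convention and is purely illustrative:

```json
{
  "name": "clickhouse-sink",
  "config": {
    "connector.class": "com.clickhouse.kafka.connect.ClickHouseSinkConnector",
    "hostname": "clickhouse",
    "port": "8123",
    "database": "default",
    "username": "default",
    "password": "",
    "topics": "rembo.public.orders"
  }
}
```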
Dashboard:
- Set up Grafana to visualize the data from ClickHouse.
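A Grafana service could be added alongside the others; the plugin ID below is the official ClickHouse data source plugin, while the image tag, admin password, and port mapping are placeholders.

```yaml
# Sketch only: Grafana with the ClickHouse data source plugin pre-installed.
services:
  grafana:
    image: grafana/grafana:latest
    environment:
      GF_INSTALL_PLUGINS: grafana-clickhouse-datasource
      GF_SECURITY_ADMIN_PASSWORD: admin   # placeholder
    ports:
      - "3000:3000"
    depends_on:
      - clickhouse
```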
Project Structure:
data_enginer_design/
├── .github/
│ └── workflows/
│ └── ci.yml
├── clickhouse-server-logs/
│ ├── clickhouse-server.err.log
│ └── clickhouse-server.log
├── app/
│ ├── Dockerfile
│ ├── clickhouse_publisher.py
│ ├── generate_and_insert_records.py
│ ├── requirements.txt
│ └── tests/
│ ├── datapublisher.py
│ └── test_data_generate.py
├── docker/
│ ├── Dockerfile.clickhouse
│ └── Dockerfile.postgres
├── docker-compose.yml
└── README.md
Data Engineering Architectural Workflows
Create a Kafka Connect configuration to capture changes from PostgreSQL and stream them to ClickHouse. This involves configuring Kafka Connect with the PostgreSQL source connector and ClickHouse sink connector.
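Once the Kafka Connect worker is running, connectors are typically registered through its REST API (port 8083 by default). The file names below refer to the connector sketches above and are placeholders:

```sh
# Register the source and sink connectors with the Kafka Connect worker.
curl -X POST -H "Content-Type: application/json" \
  --data @postgres-source.json http://localhost:8083/connectors

curl -X POST -H "Content-Type: application/json" \
  --data @clickhouse-sink.json http://localhost:8083/connectors

# Check connector status.
curl http://localhost:8083/connectors/postgres-source/status
```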
Configure Grafana to connect to ClickHouse and create dashboards for real-time reporting based on the data stored in ClickHouse.
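Grafana can be pointed at ClickHouse either through the UI or with a provisioned data source file. A minimal sketch, assuming the official ClickHouse plugin, is below; the exact connection keys in jsonData/secureJsonData depend on the plugin version, so consult the plugin documentation.

```yaml
# provisioning/datasources/clickhouse.yml (sketch only)
apiVersion: 1
datasources:
  - name: ClickHouse
    type: grafana-clickhouse-datasource
    access: proxy
    # Connection details (host, port, credentials) go in jsonData /
    # secureJsonData; the key names depend on the plugin version.
```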
This setup enables real-time data streaming from PostgreSQL to ClickHouse using Kafka and visualization of the data in Grafana. This approach ensures low latency and real-time insights for Rembo Company’s business needs. The next steps involve implementing each component and ensuring they are correctly integrated to achieve the desired real-time reporting functionality.
Notes:
- The solution is not yet complete.
- The last GitHub Actions run fails on starting the PostgreSQL service; this is the main error that still needs to be debugged.
- I did not set up Grafana/Apache Superset for visualization because of point No. 2.
- I picked a containerized service for Kafka because:
  - It is easy to integrate with ClickHouse drivers.
  - CDC can easily be introduced to Kafka using Debezium, with table primary keys used as the keys on the topic.
- I understand that using an Airflow container is feasible, but it does not cover future needs such as CDC and scalability. (NB: I haven't explored using Kafka with Airflow, but that would be another feasible solution.)
- Check the GitHub Actions runs to follow my thought process.