Data Engineering Systems Design
Source Database (PostgreSQL):
- Store the schema and initial data in a PostgreSQL database.

Message Broker (Kafka):
- Capture changes from PostgreSQL and publish them to Kafka topics.

Data Warehouse (ClickHouse):
- Consume messages from Kafka and load them into ClickHouse for OLAP queries.

Dashboard:
- Use a BI tool (e.g., Grafana) to visualize the data in ClickHouse in real time.
Set Up PostgreSQL with Docker: Create a Docker container for PostgreSQL and load the provided schema.
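A minimal docker-compose service for this step might look like the sketch below. The service name, credentials, and database name are placeholders rather than the values used in this repository; the build context assumes the docker/Dockerfile.postgres shown in the project tree. Note that Debezium-based CDC requires logical replication to be enabled.

```yaml
# Sketch only: names and credentials are placeholders.
services:
  postgres:
    build:
      context: ./docker
      dockerfile: Dockerfile.postgres
    environment:
      POSTGRES_USER: app          # placeholder credentials
      POSTGRES_PASSWORD: app
      POSTGRES_DB: rembo          # placeholder database name
    # Logical replication is required for Debezium-based CDC.
    command: ["postgres", "-c", "wal_level=logical"]
    ports:
      - "5432:5432"
```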
Set Up Kafka with Docker: Create a Docker container for Kafka to handle data streaming.
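A single-broker Kafka service can be sketched as below; the Confluent images and listener settings are illustrative assumptions and may not match what this repository's docker-compose.yml actually uses.

```yaml
# Sketch only: a single-broker, ZooKeeper-based Kafka setup for local development.
services:
  zookeeper:
    image: confluentinc/cp-zookeeper:7.5.0
    environment:
      ZOOKEEPER_CLIENT_PORT: 2181
  kafka:
    image: confluentinc/cp-kafka:7.5.0
    depends_on:
      - zookeeper
    environment:
      KAFKA_BROKER_ID: 1
      KAFKA_ZOOKEEPER_CONNECT: zookeeper:2181
      # Advertise the in-network hostname so other containers can reach the broker.
      KAFKA_ADVERTISED_LISTENERS: PLAINTEXT://kafka:9092
      # A single broker needs a replication factor of 1 for internal topics.
      KAFKA_OFFSETS_TOPIC_REPLICATION_FACTOR: 1
    ports:
      - "9092:9092"
```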
Set Up ClickHouse with Docker: Create a Docker container for ClickHouse to act as the OLAP database.
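A corresponding ClickHouse service sketch, assuming the docker/Dockerfile.clickhouse from the project tree and the default ClickHouse HTTP (8123) and native (9000) ports; the log volume mapping mirrors the clickhouse-server-logs/ directory in the tree.

```yaml
# Sketch only: ClickHouse as the OLAP store, with server logs mounted to the host.
services:
  clickhouse:
    build:
      context: ./docker
      dockerfile: Dockerfile.clickhouse
    ports:
      - "8123:8123"   # HTTP interface
      - "9000:9000"   # native protocol
    volumes:
      - ./clickhouse-server-logs:/var/log/clickhouse-server
```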
Data Pipeline:
- Use Kafka Connect to capture changes from PostgreSQL (source connector sketched below).
- Set up a Kafka Connect ClickHouse sink to load data into ClickHouse (sink connector sketched below).
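As a sketch of the source side (assuming a Kafka Connect worker with the Debezium PostgreSQL plugin installed), the hostnames, credentials, database name, and topic prefix below are placeholders rather than values from this repository:

```json
{
  "name": "postgres-source",
  "config": {
    "connector.class": "io.debezium.connector.postgresql.PostgresConnector",
    "database.hostname": "postgres",
    "database.port": "5432",
    "database.user": "app",
    "database.password": "app",
    "database.dbname": "rembo",
    "plugin.name": "pgoutput",
    "topic.prefix": "rembo",
    "table.include.list": "public.*"
  }
}
```

The sink side could use the ClickHouse Kafka Connect sink connector; property names differ between sink connectors and versions, so treat the keys below as assumptions to be checked against the connector's documentation. The topic name follows Debezium's prefix.schema.table convention and is purely illustrative:

```json
{
  "name": "clickhouse-sink",
  "config": {
    "connector.class": "com.clickhouse.kafka.connect.ClickHouseSinkConnector",
    "hostname": "clickhouse",
    "port": "8123",
    "database": "default",
    "username": "default",
    "password": "",
    "topics": "rembo.public.orders"
  }
}
```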
Dashboard:
- Set up Grafana to visualize the data from ClickHouse.
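A Grafana service could be added alongside the others; the plugin ID below is the official ClickHouse data source plugin, while the image tag, admin password, and port mapping are placeholders.

```yaml
# Sketch only: Grafana with the ClickHouse data source plugin pre-installed.
services:
  grafana:
    image: grafana/grafana:latest
    environment:
      GF_INSTALL_PLUGINS: grafana-clickhouse-datasource
      GF_SECURITY_ADMIN_PASSWORD: admin   # placeholder
    ports:
      - "3000:3000"
    depends_on:
      - clickhouse
```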
Project Structure:
data_enginer_design/
├── .github/
│ └── workflows/
│ └── ci.yml
├── clickhouse-server-logs/
│ ├── clickhouse-server.err.log
│ └── clickhouse-server.log
├── app/
│ ├── Dockerfile
│ ├── clickhouse_publisher.py
│ ├── generate_and_insert_records.py
│ ├── requirements.txt
│ └── tests/
│ ├── datapublisher.py
│ └── test_data_generate.py
├── docker/
│ ├── Dockerfile.clickhouse
│ └── Dockerfile.postgres
├── docker-compose.yml
└── README.md
Data Engineering Architectural Workflows
Create a Kafka Connect configuration to capture changes from PostgreSQL and stream them to ClickHouse. This involves configuring Kafka Connect with the PostgreSQL source connector and ClickHouse sink connector.
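Once the Kafka Connect worker is running, connectors are typically registered through its REST API (port 8083 by default). The file names below refer to the connector sketches above and are placeholders:

```sh
# Register the source and sink connectors with the Kafka Connect worker.
curl -X POST -H "Content-Type: application/json" \
  --data @postgres-source.json http://localhost:8083/connectors

curl -X POST -H "Content-Type: application/json" \
  --data @clickhouse-sink.json http://localhost:8083/connectors

# Check connector status.
curl http://localhost:8083/connectors/postgres-source/status
```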
Configure Grafana to connect to ClickHouse and create dashboards for real-time reporting based on the data stored in ClickHouse.
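Grafana can be pointed at ClickHouse either through the UI or with a provisioned data source file. A minimal sketch, assuming the official ClickHouse plugin, is below; the exact connection keys in jsonData/secureJsonData depend on the plugin version, so consult the plugin documentation.

```yaml
# provisioning/datasources/clickhouse.yml (sketch only)
apiVersion: 1
datasources:
  - name: ClickHouse
    type: grafana-clickhouse-datasource
    access: proxy
    # Connection details (host, port, credentials) go in jsonData /
    # secureJsonData; the key names depend on the plugin version.
```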
This setup enables real-time data streaming from PostgreSQL to ClickHouse using Kafka and visualization of the data in Grafana. This approach ensures low latency and real-time insights for Rembo Company’s business needs. The next steps involve implementing each component and ensuring they are correctly integrated to achieve the desired real-time reporting functionality.
Notes:
- The solution is not yet complete.
- The last GitHub Actions run fails on starting the PostgreSQL service; this is the main error that still needs to be debugged.
- I did not set up Grafana/Apache Superset for visualization because of point No. 2.
- I picked a containerized service for Kafka because:
  - It is easy to integrate with ClickHouse drivers.
  - CDC can easily be introduced to Kafka using Debezium, with table primary keys used as the keys on the topic.
- I understand that using an Airflow container is feasible, but it does not cover future needs such as CDC and scalability. (NB: I haven't explored using Kafka with Airflow, but that would be another feasible solution.)
- Check the GitHub Actions runs to follow my thought process.