
Sales Data Pipeline with Docker, ClickHouse, and Grafana

This project sets up a scalable, containerized data pipeline for ingesting, processing, storing, and visualizing sales data. It uses MinIO (S3-compatible object storage) for raw data, ClickHouse for analytics, and Grafana for visualization.


Architecture Overview

Architecture diagram: see diagram.jpg in the project root.

  1. Sales Raw Data: Collected from various sources and stored in S3.
  2. Preprocessing and Metric Extraction: Data is cleaned, processed, and transformed into metrics.
  3. ClickHouse: Stores processed metrics and provides fast analytical queries.
  4. Grafana: Visualizes the aggregated data using bar charts and dashboards.
  5. API Interaction: A Flask API exposes endpoints for ingesting data and triggering processing.

Technologies Used

  • Docker: Containerization for the entire pipeline.
  • Docker Compose: Orchestration for multiple containers.
  • Python: For preprocessing and interacting with S3.
  • ClickHouse: Analytical database for storing and querying metrics.
  • Grafana: Dashboard tool for data visualization.
  • MinIO: S3-compatible object storage for raw data.

Setup Instructions

1. Prerequisites

Ensure you have the following installed:

  • Docker
  • Docker Compose

2. Project Directory Structure

project-root/
├── docker-compose.yml
├── backend.Dockerfile # Custom image for creating the API server
├── run_app.py # Flask API for interacting with S3 and ClickHouse
├── requirements.txt
├── file.csv   # Raw sales data (optional)
├── sales_dashboard.json # Grafana dashboard
└── diagram.jpg       # Architecture diagram

3. Build and Run the Containers

  1. Build the containers and start the services:

    docker-compose up -d
  2. Verify the services (a quick reachability check is sketched after this list):

    • MinIO S3: Accessible at http://localhost:9001 (credentials in docker-compose.yml).
    • ClickHouse: Accessible at http://localhost:8123.
    • Grafana: Accessible at http://localhost:3000 (default credentials: admin/admin).
    • API documentation: Accessible at http://localhost:5000/swagger-ui/#/
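
You can also script a quick reachability check. The sketch below (Python, using requests) simply probes the URLs listed above; /ping and /api/health are ClickHouse's and Grafana's built-in health endpoints, while the MinIO console and Swagger UI URLs are just checked for any HTTP response.

    # Quick reachability check for the local services; URLs match the list above.
    # /ping and /api/health are ClickHouse's and Grafana's built-in health endpoints.
    import requests

    CHECKS = {
        "MinIO console": "http://localhost:9001",
        "ClickHouse": "http://localhost:8123/ping",      # replies "Ok." when healthy
        "Grafana": "http://localhost:3000/api/health",
        "API (Swagger UI)": "http://localhost:5000/swagger-ui/",
    }

    for name, url in CHECKS.items():
        try:
            status = requests.get(url, timeout=5).status_code
            print(f"{name:<18} {url:<45} -> HTTP {status}")
        except requests.RequestException as exc:
            print(f"{name:<18} {url:<45} -> not reachable ({exc.__class__.__name__})")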

4. Usage

Data Ingestion and Preprocessing:

  • Create a bucket to start uploading files:

    curl -X POST -H "Content-Type: application/json" -d '{"bucket_name": "bucket-name"}' http://localhost:5000/bucket
  • Upload raw sales data to the S3 bucket using the API:

     curl -X POST -F "file=@file.csv" 'http://localhost:5000/ingest/sales?bucket_name=bucket-name'
  • You can verify the upload with:

     curl -X GET 'http://localhost:5000/buckets'
     
     curl -X GET 'http://localhost:5000/objects?bucket_name=bucket-name'
  • To start preprocessing, call the following endpoint; it reads the raw data from S3, processes it, and inserts the resulting metrics into ClickHouse (a Python sketch of the full flow follows this list).

     curl -X POST 'http://localhost:5000/transform/sales?bucket_name=bucket-name&file_name=file.csv'
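
The same flow can be scripted in Python with requests, as sketched below. The endpoints and query parameters come from the curl commands above; the multipart field name file is an assumption about what run_app.py expects, so adjust it if the upload is rejected.

    # Sketch of the ingestion flow against the local API, mirroring the curl
    # commands above. The multipart field name "file" is an assumption about
    # what run_app.py expects.
    import requests

    API = "http://localhost:5000"
    BUCKET = "bucket-name"

    # 1. Create the bucket
    r = requests.post(f"{API}/bucket", json={"bucket_name": BUCKET})
    print("create bucket:", r.status_code)

    # 2. Upload the raw sales CSV
    with open("file.csv", "rb") as fh:
        r = requests.post(
            f"{API}/ingest/sales",
            params={"bucket_name": BUCKET},
            files={"file": fh},  # field name assumed
        )
    print("upload:", r.status_code)

    # 3. Verify the bucket and the uploaded object
    print(requests.get(f"{API}/buckets").text)
    print(requests.get(f"{API}/objects", params={"bucket_name": BUCKET}).text)

    # 4. Trigger preprocessing: raw S3 data -> metrics -> ClickHouse
    r = requests.post(
        f"{API}/transform/sales",
        params={"bucket_name": BUCKET, "file_name": "file.csv"},
    )
    print("transform:", r.status_code, r.text)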

Querying ClickHouse:

  • Use ClickHouse SQL to analyze the data. Example query (a Python sketch for running it over the HTTP interface follows):

    SELECT product_id, SUM(total_sales_sum) AS sum_sales
    FROM sales
    GROUP BY product_id
    ORDER BY sum_sales DESC;
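
The same query can be run programmatically through ClickHouse's HTTP interface on port 8123. The sketch below assumes the default user with no password and that the sales table lives in the default database.

    # Run the aggregation above through ClickHouse's HTTP interface (port 8123).
    # Assumes the default user with no password and a "sales" table in the
    # default database, as in the query above.
    import requests

    QUERY = """
    SELECT product_id, SUM(total_sales_sum) AS sum_sales
    FROM sales
    GROUP BY product_id
    ORDER BY sum_sales DESC
    FORMAT TabSeparatedWithNames
    """

    resp = requests.post("http://localhost:8123/", data=QUERY, timeout=10)
    resp.raise_for_status()
    print(resp.text)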

Visualizing Data in Grafana:

  1. Log in to Grafana at http://localhost:3000.
  2. Import sales_dashboard.json, either through the UI or via Grafana's HTTP API as sketched below.
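
If you prefer to skip the UI, a rough sketch of importing the dashboard through Grafana's HTTP API is shown below. It assumes the default admin/admin credentials and that sales_dashboard.json contains a plain dashboard model; an export that carries "__inputs" would need the /api/dashboards/import endpoint instead.

    # Import the dashboard via Grafana's HTTP API instead of the UI.
    # Assumes the default admin/admin credentials and that sales_dashboard.json
    # holds a plain dashboard model.
    import json
    import requests

    with open("sales_dashboard.json") as fh:
        dashboard = json.load(fh)
    dashboard["id"] = None  # let Grafana assign a new id on import

    resp = requests.post(
        "http://localhost:3000/api/dashboards/db",
        auth=("admin", "admin"),
        json={"dashboard": dashboard, "overwrite": True},
        timeout=10,
    )
    print(resp.status_code, resp.text)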

5. Troubleshooting

  • Check Logs:
    docker-compose logs <service_name>
  • Grafana Not Connecting: Verify ClickHouse is accessible and configured as a data source in Grafana.
  • Data Not Processing: Ensure the S3 endpoint and ClickHouse host settings are correct.
  • Permission Issues: Verify file permissions and ensure all containers have access to the required volumes.

6. Customization

  • ClickHouse Queries: Modify or add custom queries in Grafana to fit your analysis needs.
  • Grafana Dashboards: Customize dashboards by editing JSON or using the Grafana UI.
  • Preprocessing: Adjust the preprocessing endpoint to include additional transformations or validation rules (an example is sketched after this list).
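
As an illustration of the kind of validation and metric-extraction logic that could be added, the sketch below filters out incomplete rows and aggregates sales per product. The raw column names product_id and amount are assumptions about file.csv; only product_id and total_sales_sum appear in the ClickHouse query earlier.

    # Hypothetical extra preprocessing step: validate rows and aggregate sales
    # per product before insertion. The raw column names "product_id" and
    # "amount" are assumptions about file.csv's layout; total_sales_sum matches
    # the column queried from ClickHouse above.
    import pandas as pd

    def build_sales_metrics(raw_csv_path: str) -> pd.DataFrame:
        df = pd.read_csv(raw_csv_path)

        # Validation: drop incomplete rows and non-positive amounts
        df = df.dropna(subset=["product_id", "amount"])
        df = df[df["amount"] > 0]

        # Metric extraction: one row per product
        return (
            df.groupby("product_id", as_index=False)["amount"]
            .sum()
            .rename(columns={"amount": "total_sales_sum"})
        )

    if __name__ == "__main__":
        print(build_sales_metrics("file.csv").head())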

7. Future Enhancements

  • S3 Storage: Configure S3 credentials in the .env file or pass them as environment variables.
  • CI/CD Integration: Add CI/CD pipelines for automated deployments and testing.
  • Security Enhancements: Implement role-based access controls (RBAC) and secure API endpoints.
  • Scalability Improvements: Run the services on a Kubernetes cluster to distribute the workload and handle larger datasets.
  • Monitoring and Alerts: Integrate monitoring tools to track the performance of the pipeline and set up alerts for failures.

This README provides the necessary setup and configuration details to deploy and run your end-to-end sales data pipeline using Docker Compose. 🚀
