This project is a comprehensive data engineering solution developed as part of the ITI Data Engineering Track graduation project. It demonstrates real-time data ingestion, processing, storage, and live analytics using a multi-cloud architecture. The project utilizes a combination of Apache Kafka, Apache Airflow, PySpark, Hadoop, Hive, SQL Server, Azure SQL Database, Docker, and Power BI.
- Architecture Overview
- Technologies Used
- Pipeline Workflow
- Setup and Installation
- Usage
- Dockerization
- Contributors
- Demo and Presentation
## Architecture Overview

The project consists of a scalable, real-time data pipeline that ingests, processes, and stores data using different services and technologies across multiple cloud platforms.
- Data Ingestion: Streaming data using Apache Kafka (a producer sketch follows this overview).
- Data Orchestration: Managed by Apache Airflow running on AWS EC2.
- Data Processing (PySpark):
- First branch writes raw streaming data to Azure SQL Database for backup.
- Second branch runs a machine learning classification model and stores the results in SQL Server.
- Third branch writes data to Hadoop and queries it with Hive SQL for big data analysis.
- Data Visualization: Processed data is displayed on Power BI for live analytics and reporting.
- Multi-cloud Deployment: Services run across AWS EC2 and Azure (Azure SQL Database).
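For the Data Ingestion step above, a minimal producer sketch could look like the following. The topic name `sensor-events`, the broker address, and the record shape are illustrative placeholders rather than the project's actual values.

```python
import json
import random
import time

from kafka import KafkaProducer  # kafka-python

# Placeholder broker address and topic name -- replace with the project's values.
producer = KafkaProducer(
    bootstrap_servers="EC2_PUBLIC_DNS:9092",
    value_serializer=lambda record: json.dumps(record).encode("utf-8"),
)

while True:
    # Simulate one streaming record; the real producer may replay rows from a dataset instead.
    event = {"event_time": time.time(), "value": random.random()}
    producer.send("sensor-events", event)
    producer.flush()
    time.sleep(1)
```

Airflow then orchestrates the downstream consumption of this stream, as described in the Pipeline Workflow section.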
## Technologies Used

- Kafka: Real-time data streaming.
- Airflow: Workflow orchestration and scheduling.
- PySpark: Data processing and machine learning.
- Azure SQL Database: Stores raw data for backup.
- SQL Server: Stores processed data and results of ML classification.
- Hadoop: Distributed data storage for big data processing.
- Hive: Query engine for large datasets stored in Hadoop.
- Power BI: Real-time dashboards for visualizing processed data.
- Docker: Containerization of services for portability.
- AWS EC2: Hosting for Kafka, Airflow, PySpark, and Hadoop services.
## Pipeline Workflow

- Apache Kafka simulates data streaming, allowing continuous data ingestion.
- Apache Airflow orchestrates the data ingestion process, scheduling tasks and managing dependencies.
- Branch 1 (Backup): Consumes the data stream from Kafka and writes raw data to Azure SQL Database for backup (the three branches are sketched in the example after this list).
- Branch 2 (ML Model & ODS):
- Runs a PySpark ML classification model.
- Stores the results in SQL Server as:
- Operational Data Store (ODS): A table that records all operations and results in real time.
- Distinct Data Table: A table that stores distinct values after filtering, used for live analytics.
- Branch 3 (Big Data Storage): Writes streaming data to Hadoop and uses Hive SQL for querying the big data (a Hive query example also follows this list).
- The cleaned and distinct data is connected to Power BI for real-time visualization and dashboard creation.
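A minimal Structured Streaming sketch of the three branches is shown below. The topic, schema, JDBC URLs, table names, model path, and HDFS location are all illustrative assumptions; the sketch also assumes the Kafka connector (`spark-sql-kafka-0-10`) and Microsoft's `mssql-jdbc` driver are on the Spark classpath.

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, from_json
from pyspark.sql.types import DoubleType, StructField, StructType
from pyspark.ml import PipelineModel

spark = SparkSession.builder.appName("streaming-branches").getOrCreate()

# Placeholder event schema -- adjust to the real payload.
schema = StructType([
    StructField("event_time", DoubleType()),
    StructField("value", DoubleType()),
])

events = (
    spark.readStream.format("kafka")
    .option("kafka.bootstrap.servers", "EC2_PUBLIC_DNS:9092")
    .option("subscribe", "sensor-events")
    .load()
    .select(from_json(col("value").cast("string"), schema).alias("e"))
    .select("e.*")
)

# Placeholder JDBC settings (see Setup and Installation for the connection format).
azure_url = "jdbc:sqlserver://<azure-server>.database.windows.net:1433;databaseName=<backup_db>"
mssql_url = "jdbc:sqlserver://<sqlserver-host>:1433;databaseName=<ods_db>"
jdbc_props = {"user": "<user>", "password": "<password>",
              "driver": "com.microsoft.sqlserver.jdbc.SQLServerDriver"}

# Hypothetical pre-trained Spark ML classification pipeline saved to disk.
model = PipelineModel.load("/models/classifier")

def process_batch(batch_df, batch_id):
    # Branch 1: back up the raw micro-batch to Azure SQL Database.
    batch_df.write.jdbc(azure_url, "raw_events", mode="append", properties=jdbc_props)

    # Branch 2: classify the batch, then write the ODS and distinct tables to SQL Server.
    scored = model.transform(batch_df).select("event_time", "value", "prediction")
    scored.write.jdbc(mssql_url, "ods_results", mode="append", properties=jdbc_props)
    scored.dropDuplicates(["value", "prediction"]).write.jdbc(
        mssql_url, "distinct_results", mode="append", properties=jdbc_props)

    # Branch 3: append the batch to HDFS for Hive querying.
    batch_df.write.mode("append").parquet("hdfs://namenode:8020/data/events")

query = (
    events.writeStream
    .foreachBatch(process_batch)
    .option("checkpointLocation", "/tmp/checkpoints/branches")
    .start()
)
query.awaitTermination()
```

Here the ODS table receives every scored record, while the de-duplicated table backs the live Power BI dashboards, mirroring the two SQL Server tables described above.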
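For Branch 3, one way to query the HDFS data with Hive SQL is to register an external table. The sketch below runs the DDL through `spark.sql` with Hive support enabled; the table name, columns, and location are again placeholders.

```python
from pyspark.sql import SparkSession

# Hive support lets Spark create and query tables in the Hive metastore.
spark = SparkSession.builder.appName("hive-queries").enableHiveSupport().getOrCreate()

# External table over the Parquet files written by Branch 3 (placeholder location).
spark.sql("""
    CREATE EXTERNAL TABLE IF NOT EXISTS events (
        event_time DOUBLE,
        value DOUBLE
    )
    STORED AS PARQUET
    LOCATION 'hdfs://namenode:8020/data/events'
""")

# Example aggregate query over the big data store.
spark.sql("SELECT COUNT(*) AS total_events FROM events").show()
```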
## Setup and Installation

- AWS EC2: Set up an EC2 instance to run Kafka, Airflow, PySpark, and Hadoop services.
- Azure SQL Database: Create a database in Azure to store raw data.
- SQL Server: Ensure SQL Server is accessible for storing processed ML data (example connection settings follow this list).
- Docker: Install Docker for containerized service management.
- Power BI: Set up Power BI for visualizing the data.
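Both Azure SQL Database and SQL Server are reached from PySpark over JDBC. A sketch of the connection settings is below; the server names, database names, and credentials are placeholders, and Microsoft's `mssql-jdbc` driver must be available to Spark (for example via `spark-submit --jars`).

```python
# Placeholder connection settings for the two SQL targets used by the pipeline.
AZURE_SQL_URL = (
    "jdbc:sqlserver://<azure-server>.database.windows.net:1433;"
    "databaseName=<backup_db>;encrypt=true"
)
SQL_SERVER_URL = "jdbc:sqlserver://<sqlserver-host>:1433;databaseName=<ods_db>"

JDBC_PROPERTIES = {
    "user": "<user>",
    "password": "<password>",
    "driver": "com.microsoft.sqlserver.jdbc.SQLServerDriver",
}
```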
## Usage

- Start Kafka: Ensure Kafka is streaming data.
- Trigger Airflow DAGs: Use Airflow to manage and trigger the PySpark jobs (a minimal DAG sketch follows this list).
- Monitor PySpark Jobs: PySpark processes the data in the three branches, writing raw data to Azure SQL Database, ML results to SQL Server, and streaming data to Hadoop.
- Visualize Data in Power BI: Open Power BI and ensure real-time data is visualized from SQL Server.
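A minimal DAG sketch for triggering the PySpark job is shown below. The DAG id, schedule, script path, and package versions are illustrative, and the project's actual DAG may use the Spark provider's `SparkSubmitOperator` instead of `BashOperator`.

```python
from datetime import datetime

from airflow import DAG
from airflow.operators.bash import BashOperator

with DAG(
    dag_id="streaming_pipeline",          # hypothetical DAG id
    start_date=datetime(2024, 1, 1),
    schedule_interval="@daily",           # use `schedule=` on newer Airflow releases
    catchup=False,
) as dag:
    # Submits the three-branch PySpark job; the script path and dependencies are placeholders.
    run_spark_branches = BashOperator(
        task_id="run_spark_branches",
        bash_command=(
            "spark-submit "
            "--packages org.apache.spark:spark-sql-kafka-0-10_2.12:3.3.0 "
            "--jars /opt/jars/mssql-jdbc.jar "
            "/opt/jobs/streaming_branches.py"
        ),
    )
```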
## Dockerization

The following services have been Dockerized for ease of deployment and scaling:
- Kafka and Airflow are Dockerized on EC2.
- They serve as shared resources for managing data streaming (Kafka) and workflow orchestration (Airflow) in the pipeline.
- PySpark, Hadoop, and Hive are Dockerized locally.
- Networking between the local containers and the EC2-based services must be configured so that the local PySpark, Hadoop, and Hive containers can reach Kafka and Airflow on EC2 (for example, by opening the required ports in the EC2 security group).
To start the Dockerized services locally (PySpark, Hadoop, and Hive):

```bash
docker-compose up -d
```
## Demo and Presentation

- Demo Video: Watch the demo
- Presentation Slides: View Presentation