Kafka Topic -> Spark Streaming (`window()`) -> Aggregated Data -> Cassandra -> Backend (WebSocket) -> Dashboard UI
This repository contains the source code and configuration for a real-time data processing pipeline that aggregates data from a Kafka topic, performs real-time analytics using Spark Streaming, stores the aggregated data in Cassandra, and updates a dashboard UI in real-time through a WebSocket connection.
- Data is ingested into the pipeline through a Kafka topic.
- Kafka is a distributed event streaming platform, providing a scalable and fault-tolerant mechanism for data ingestion.
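As a minimal sketch of the ingestion side, events can be serialized to JSON bytes before being published to the topic. The helper below is hypothetical (the actual producer code, event fields, and topic name are not part of this README); a real producer such as kafka-python's `KafkaProducer` would send these bytes to Kafka.

```python
import json
import time

def encode_event(event: dict) -> bytes:
    """Serialize an event as UTF-8 JSON bytes, stamping it with an
    ingestion timestamp so downstream windowing can bucket it."""
    stamped = {**event, "ingested_at": time.time()}
    return json.dumps(stamped).encode("utf-8")

# A real producer (e.g. kafka-python's KafkaProducer) would publish
# these bytes to the Kafka topic; here we only show the payload format.
payload = encode_event({"sensor": "s1", "value": 3.2})  # illustrative event
```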
- Spark Streaming is used for real-time data processing.
- The `window()` function is applied for windowed operations to aggregate data over specific time intervals.
- Data is aggregated within the Spark Streaming step using various aggregation functions.
- Common operations include summing, averaging, counting, etc., depending on the specific use case.
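Conceptually, the windowed aggregation works like the plain-Python sketch below: events are bucketed into fixed-size tumbling windows and averaged per window. This is an illustration of the idea behind Spark's `window()` plus an `avg()` aggregation, not the repository's actual Spark code; the 10-second window and the sample events are made up.

```python
from collections import defaultdict

def tumbling_window_avg(events, window_seconds):
    """Group (timestamp, value) pairs into fixed-size tumbling windows
    and compute the average value per window -- conceptually what a
    Spark Streaming window() + avg() aggregation produces."""
    buckets = defaultdict(list)
    for ts, value in events:
        window_start = (ts // window_seconds) * window_seconds
        buckets[window_start].append(value)
    return {start: sum(vals) / len(vals) for start, vals in sorted(buckets.items())}

# Illustrative events: (epoch seconds, reading)
events = [(0, 1.0), (4, 3.0), (12, 10.0), (19, 20.0)]
result = tumbling_window_avg(events, 10)  # windows [0,10) and [10,20)
print(result)  # -> {0: 2.0, 10: 15.0}
```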
- Aggregated data is stored in Cassandra, a highly scalable NoSQL database.
- Cassandra is chosen for its ability to handle large volumes of data across multiple nodes.
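A Cassandra table for such windowed results might look like the CQL below. The keyspace, table, and column names are illustrative assumptions, not taken from this repository; the partition key keeps all windows of one metric together while clustering orders them by time.

```sql
-- Hypothetical schema sketch; names are not from this repo.
CREATE TABLE IF NOT EXISTS pipeline.windowed_metrics (
    metric_name  text,        -- what was aggregated
    window_start timestamp,   -- start of the aggregation window
    value        double,      -- aggregated value (sum, avg, count, ...)
    PRIMARY KEY (metric_name, window_start)
) WITH CLUSTERING ORDER BY (window_start DESC);
```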
- The backend of the application communicates with the front end through a WebSocket connection.
- WebSocket enables bidirectional communication, allowing real-time updates to be sent from the server to the client.
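Each update pushed to clients can be a small JSON text frame. The message shape below is an assumption for illustration (the actual format is defined in the repo's backend source); a server built on a WebSocket library would broadcast this string to every connected dashboard client.

```python
import json

def dashboard_update(metric: str, window_start: int, value: float) -> str:
    """Build the JSON text frame the backend would push to dashboard
    clients over the WebSocket. Field names are illustrative."""
    return json.dumps({
        "type": "metric_update",
        "metric": metric,
        "window_start": window_start,
        "value": value,
    })

message = dashboard_update("avg_temperature", 1700000000, 21.5)
```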
- The Dashboard UI provides a user interface for visualizing and interacting with real-time aggregated data.
- Updates are received in real-time through the WebSocket connection, ensuring the dashboard reflects the latest information.
To set up and run the real-time data processing pipeline, follow the steps outlined in the Installation Guide and Configuration Documentation.
- Docker
- Docker Compose
Clone the repository:
```shell
git clone https://github.com/anthoai97/simple-end-to-end-data-streaming
cd simple-end-to-end-data-streaming
docker-compose up
```
An Thoai
This project is licensed under the terms of the MIT license.