In a world where data is considered the new gold, it is imperative for organizations to be able to process and analyze data in real-time to make informed decisions. This program is designed for data professionals looking to acquire practical skills in implementing real-time data pipelines.
Install the required libraries and dependencies for Spark, PySpark, Kafka, Cassandra, and MongoDB:
To begin, we will walk through the installation process for Kafka and how to use it. Let's get started! To install Kafka, follow the link; in my case, I'm using Kafka version 3.6.0 (link to binary).
After completing the installation, extract the files and place them in the C:\kafka directory (you can name it as you like). Next, open the Windows environment variables by typing env in the Windows search bar and add the following path: C:\kafka\bin\windows. Ensure that you select the windows directory inside the 'bin' folder.
Now, let's run Kafka and ZooKeeper. First, open the C:\kafka\bin\windows\ directory in your command prompt and type the following commands. Make sure to open a new command prompt for each process:
zookeeper-server-start.bat ..\..\config\zookeeper.properties
kafka-server-start.bat ..\..\config\server.properties
To ensure everything is ready to go, execute the following command to create a topic. In my case, I named it 'first_topic', but you can choose any name you prefer. Voilà, your topic is now created!
kafka-topics.bat --bootstrap-server 127.0.0.1:9092 --topic first_topic --create --partitions 3 --replication-factor 1
Note: I'm using the IP address 127.0.0.1 (localhost) on the default port 9092.
If you want to check or list your topics, enter the following command:
kafka-topics.bat --list --bootstrap-server localhost:9092
To describe the topic, and to test it with a console producer and consumer, run:
kafka-topics.bat --bootstrap-server 127.0.0.1:9092 --topic first_topic --describe
kafka-console-producer.bat --broker-list localhost:9092 --topic first_topic
kafka-console-consumer.bat --bootstrap-server localhost:9092 --topic first_topic --from-beginning
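If you'd rather exercise the topic from Python instead of the console tools, here is a minimal sketch using the kafka-python package (my choice of client library, not something this setup requires; install it with pip install kafka-python). It assumes the broker above is running on localhost:9092 and 'first_topic' already exists:
from kafka import KafkaConsumer, KafkaProducer

# Send one test message to the topic.
producer = KafkaProducer(bootstrap_servers="localhost:9092")
producer.send("first_topic", b"hello from python")
producer.flush()  # make sure the message is actually delivered before exiting

# Read messages back from the beginning, like --from-beginning above.
consumer = KafkaConsumer(
    "first_topic",
    bootstrap_servers="localhost:9092",
    auto_offset_reset="earliest",
    consumer_timeout_ms=5000,  # stop iterating after 5s with no new messages
)
for message in consumer:
    print(message.value)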
To install Spark, follow the link. In my case, I'm using Spark version spark-3.2.4-bin-hadoop2.7.
Follow the same process as we did with Kafka, but this time add %SPARK_HOME%\bin to the system's 'Path' and set the SPARK_HOME system variable to C:\Spark\spark. Ensure that you select the 'bin' directory during this process.
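If you prefer the command line over the environment-variables dialog, the same variables can be set with setx (a sketch; adjust the path to wherever you extracted Spark, and note that setx truncates values longer than 1024 characters, so the dialog is safer for a long Path):
setx SPARK_HOME "C:\Spark\spark"
setx PATH "%PATH%;C:\Spark\spark\bin"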
To ensure that everything has been completed successfully, please open a new Command Prompt (cmd) window and type the following command:
spark-shell
If any problems occur, please check if Java is installed on your computer.
java --version
To install Java, please follow the same instructions and ensure that you have installed Java JDK 8 or 11, which is compatible with this version of Spark.
After completing all the necessary steps, it's time to install PySpark to enable Python usage. PySpark serves as the Python library for Apache Spark, facilitating seamless integration and utilization of Spark functionalities within Python applications.
Run the command below to proceed, or visit the link for more information:
pip install pyspark
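To verify the installation, here is a minimal sketch that starts a local SparkSession and prints the Spark version (the app name is arbitrary):
from pyspark.sql import SparkSession

# Create a local session; local[*] uses all available cores.
spark = SparkSession.builder \
    .appName("install-check") \
    .master("local[*]") \
    .getOrCreate()

print(spark.version)  # should print 3.2.4 for this setup
spark.stop()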
If any problems occur, please check that Python is installed on your computer, and make sure you read the docs to guarantee compatibility with the correct Python version.
We are going to install Cassandra on Docker. Please follow this link for instructions.
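If you just want something running quickly, a typical way to start Cassandra in Docker (the container name and image tag are my own choices; 9042 is Cassandra's default CQL port) is:
docker run --name cassandra -p 9042:9042 -d cassandra:latest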
The process for MongoDB is the same and quite straightforward. Please ensure everything is good to go.
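Similarly, a typical MongoDB container (again, the name and tag are my choices; 27017 is MongoDB's default port):
docker run --name mongodb -p 27017:27017 -d mongo:latest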
The pipeline performs the following steps:
- Fetch random user data from the API (randomuser) and stream it to a Kafka topic, creating a data pipeline for further processing or analysis.
- Utilize the provided code to extract data from the Kafka topic "jane__essadi__topic" (a minimal sketch of this stage follows the list).
- Implement data transformations, which encompass parsing, validation, and data enrichment.
- Insert the transformed data into Cassandra.
- Execute data aggregations to derive insights, such as the count of users by nationality and the average user age, and store these results in MongoDB through the save_to_mongodb_collection function.
- Configure debugging and monitoring mechanisms to track the pipeline's performance and identify potential issues.
- Develop data visualization dashboards with Python Dash to present aggregated data, such as user counts by nationality and average user age.
- Verify the accurate insertion of data into the Cassandra table and MongoDB collections.
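To make the extraction and load steps concrete, here is a minimal sketch of the streaming stage, not the project's full code. The schema fields, keyspace and table names, and checkpoint path are my own illustrative assumptions, and the Kafka and Cassandra connector packages must match your Spark version and be passed via spark-submit --packages:
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, from_json
from pyspark.sql.types import StringType, StructField, StructType

spark = (
    SparkSession.builder
    .appName("realtime-user-pipeline")
    .config("spark.cassandra.connection.host", "localhost")
    .getOrCreate()
)

# Simplified schema for the random-user JSON messages (an assumption).
schema = StructType([
    StructField("first_name", StringType()),
    StructField("last_name", StringType()),
    StructField("nationality", StringType()),
    StructField("age", StringType()),
])

# Read the raw stream from the Kafka topic fed by the producer.
raw = (
    spark.readStream
    .format("kafka")
    .option("kafka.bootstrap.servers", "localhost:9092")
    .option("subscribe", "jane__essadi__topic")
    .option("startingOffsets", "earliest")
    .load()
)

# Parse the JSON payload into columns (the transformation step).
users = (
    raw.selectExpr("CAST(value AS STRING) AS json")
    .select(from_json(col("json"), schema).alias("user"))
    .select("user.*")
)

# Write each micro-batch to Cassandra (keyspace/table names are assumptions).
query = (
    users.writeStream
    .foreachBatch(lambda batch_df, _: batch_df.write
                  .format("org.apache.spark.sql.cassandra")
                  .options(keyspace="pipeline", table="users")
                  .mode("append")
                  .save())
    .option("checkpointLocation", "C:/tmp/spark-checkpoints/users")
    .start()
)
query.awaitTermination()
The aggregation and MongoDB steps are omitted here; in the real project they are handled by the save_to_mongodb_collection function mentioned above.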
In our professional context, our primary focus is on safeguarding our clients from data-related risks. We meticulously adhere to GDPR regulations, including refraining from collecting data on individuals below the age of 18. Additionally, we prioritize securing sensitive client information through encryption or deletion, ultimately ensuring the safety of our users.
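As a concrete illustration (a sketch, not our actual production code), such a rule can be applied in PySpark before any data reaches storage; the users DataFrame is carried over from the earlier sketch, and the email column is an illustrative assumption:
from pyspark.sql.functions import col, sha2

# GDPR rule from above: drop records for anyone under 18
# (age is a string in the sketch schema, so cast it first).
adults = users.filter(col("age").cast("int") >= 18)

# Pseudonymize a sensitive field by hashing it instead of storing
# it in clear text (the 'email' column is an illustrative assumption).
safe = adults.withColumn("email", sha2(col("email"), 256))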