This repository contains the code for the data engineering assignment. The assignment is to create a data pipeline that extracts data from a source and meets the following requirements:
- Ingest clickstream data from Kafka (see the producer sketch after this list).
- Store the ingested data in a datastore, with the schema:
  - RowKey
  - ColumnFamily click_data: Userid, Timestamp, URL
  - ColumnFamily geo_data: Country, City
- Index the processed data in Elasticsearch.
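
A minimal sketch of the ingestion step, assuming the kafka-python client, a broker at localhost:9092, and a topic named clickstream (the broker address, topic, and file name are placeholders, not fixed by this repository):

```python
import csv
import json

from kafka import KafkaProducer  # assumes the kafka-python package is installed

# Placeholder broker address; adjust to match the Docker setup.
producer = KafkaProducer(
    bootstrap_servers="localhost:9092",
    value_serializer=lambda v: json.dumps(v).encode("utf-8"),
)

# Read the clickstream CSV (placeholder file name) and publish each row as a JSON event.
with open("clickstream.csv", newline="") as f:
    for row in csv.DictReader(f):
        producer.send("clickstream", row)

producer.flush()
```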
The architecture of the data pipeline is as follows:
Data source -> Kafka -> Spark Streaming -> Elasticsearch
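
A rough sketch of the Spark Streaming stage, assuming Structured Streaming with the elasticsearch-hadoop connector on the classpath; the topic name, field names, hosts, and index name are placeholders chosen to mirror the schema above:

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, from_json
from pyspark.sql.types import StringType, StructField, StructType

spark = SparkSession.builder.appName("clickstream-pipeline").getOrCreate()

# Assumed JSON layout of the events published to Kafka.
event_schema = StructType([
    StructField("userid", StringType()),
    StructField("timestamp", StringType()),
    StructField("url", StringType()),
    StructField("country", StringType()),
    StructField("city", StringType()),
])

# Read raw events from the (placeholder) "clickstream" topic.
events = (
    spark.readStream.format("kafka")
    .option("kafka.bootstrap.servers", "localhost:9092")
    .option("subscribe", "clickstream")
    .load()
    .select(from_json(col("value").cast("string"), event_schema).alias("e"))
    .select("e.*")
)

# Index the parsed events in Elasticsearch through the es-hadoop connector.
query = (
    events.writeStream.format("org.elasticsearch.spark.sql")
    .option("checkpointLocation", "/tmp/clickstream-checkpoint")
    .option("es.nodes", "localhost")
    .option("es.port", "9200")
    .start("clickstream")  # placeholder index name
)
query.awaitTermination()
```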
The dataset comes from the Kaggle repository: https://www.kaggle.com/datasets/tunguz/clickstream-data-for-online-shopping
Check out the notebook where the processing is done. The processing functions can be split out so the pipeline can be orchestrated with Airflow (in progress).
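
One possible shape for that orchestration, assuming the notebook logic has been split into importable functions (produce_to_kafka, run_spark_job, and the pipeline module are hypothetical names):

```python
from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator

# Hypothetical functions extracted from the notebook.
from pipeline import produce_to_kafka, run_spark_job

with DAG(
    dag_id="clickstream_pipeline",
    start_date=datetime(2023, 1, 1),
    schedule_interval="@hourly",  # assumed cadence
    catchup=False,
) as dag:
    ingest = PythonOperator(task_id="ingest_to_kafka", python_callable=produce_to_kafka)
    index = PythonOperator(task_id="kafka_to_elasticsearch", python_callable=run_spark_job)

    ingest >> index
```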
A Docker file containing the services for Kafka, Spark, and Elasticsearch is provided.
- Airflow is being integrated into the pipeline; check it out at airflow_branch.
- The schema above is not followed exactly, as the dataset does not contain most of the fields specified.
- The dataset is not complete, so for now the data is just processed directly from Kafka to Elasticsearch.
- Include the use of HBase or Cassandra for storing the data (see the write sketch after this list).
- Schedule jobs to run in a timely manner using Airflow.
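
If HBase is added, a single write following the schema above could look like this sketch, which assumes the happybase client, a running HBase Thrift server, and a clickstream table with click_data and geo_data column families (all names and values are placeholders):

```python
import happybase  # assumes the happybase package and an HBase Thrift server

connection = happybase.Connection("localhost")  # placeholder host
table = connection.table("clickstream")         # placeholder table name

# The RowKey design is not fixed by the assignment; a userid/timestamp composite is one option.
row_key = b"user123_2008-04-01T10:15:00"

table.put(row_key, {
    b"click_data:userid": b"user123",
    b"click_data:timestamp": b"2008-04-01T10:15:00",
    b"click_data:url": b"http://example.com/product/42",
    b"geo_data:country": b"Poland",
    b"geo_data:city": b"Krakow",
})
```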