This repository contains the code for the data engineering assignment. The assignment is to create a data pipeline that extracts data from a source and meets the following requirements:
- Ingest clickstream data from Kafka (see the producer sketch after this list).
- Store the ingested data in a datastore, with the schema:
  - RowKey
  - ColumnFamily click_data: Userid, Timestamp, URL
  - ColumnFamily geo_data: Country, City
- Index the processed data in Elasticsearch.
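
A minimal sketch of the ingestion step, assuming the kafka-python client, a broker at localhost:9092, and a topic named clickstream (the broker address, topic, and file name are placeholders, not fixed by this repository):

```python
import csv
import json

from kafka import KafkaProducer  # assumes the kafka-python package is installed

# Placeholder broker address; adjust to match the Docker setup.
producer = KafkaProducer(
    bootstrap_servers="localhost:9092",
    value_serializer=lambda v: json.dumps(v).encode("utf-8"),
)

# Read the clickstream CSV (placeholder file name) and publish each row as a JSON event.
with open("clickstream.csv", newline="") as f:
    for row in csv.DictReader(f):
        producer.send("clickstream", row)

producer.flush()
```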
The architecture of the data pipeline is as follows:
Data source -> Kafka -> Spark Streaming -> Elasticsearch
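
A rough sketch of the Spark Streaming stage, assuming Structured Streaming with the elasticsearch-hadoop connector on the classpath; the topic name, field names, hosts, and index name are placeholders chosen to mirror the schema above:

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, from_json
from pyspark.sql.types import StringType, StructField, StructType

spark = SparkSession.builder.appName("clickstream-pipeline").getOrCreate()

# Assumed JSON layout of the events published to Kafka.
event_schema = StructType([
    StructField("userid", StringType()),
    StructField("timestamp", StringType()),
    StructField("url", StringType()),
    StructField("country", StringType()),
    StructField("city", StringType()),
])

# Read raw events from the (placeholder) "clickstream" topic.
events = (
    spark.readStream.format("kafka")
    .option("kafka.bootstrap.servers", "localhost:9092")
    .option("subscribe", "clickstream")
    .load()
    .select(from_json(col("value").cast("string"), event_schema).alias("e"))
    .select("e.*")
)

# Index the parsed events in Elasticsearch through the es-hadoop connector.
query = (
    events.writeStream.format("org.elasticsearch.spark.sql")
    .option("checkpointLocation", "/tmp/clickstream-checkpoint")
    .option("es.nodes", "localhost")
    .option("es.port", "9200")
    .start("clickstream")  # placeholder index name
)
query.awaitTermination()
```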
The dataset comes from the Kaggle repository: https://www.kaggle.com/datasets/tunguz/clickstream-data-for-online-shopping
Check out the notebook where the processing is done. The processing functions can be split out so the pipeline can be orchestrated with Airflow (in progress).
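
One possible shape for that orchestration, assuming the notebook logic has been split into importable functions (produce_to_kafka, run_spark_job, and the pipeline module are hypothetical names):

```python
from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator

# Hypothetical functions extracted from the notebook.
from pipeline import produce_to_kafka, run_spark_job

with DAG(
    dag_id="clickstream_pipeline",
    start_date=datetime(2023, 1, 1),
    schedule_interval="@hourly",  # assumed cadence
    catchup=False,
) as dag:
    ingest = PythonOperator(task_id="ingest_to_kafka", python_callable=produce_to_kafka)
    index = PythonOperator(task_id="kafka_to_elasticsearch", python_callable=run_spark_job)

    ingest >> index
```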
A Docker file containing the services for Kafka, Spark, and Elasticsearch is provided.
- Airflow is being integrated into the pipeline; check it out at airflow_branch.
- The schema above is not followed exactly, as the dataset does not contain most of the fields specified.
- The dataset is not complete, so for now the data is just processed directly from Kafka to Elasticsearch.
- Include the use of HBase or Cassandra for storing the data (see the write sketch after this list).
- Schedule jobs to run in a timely manner using Airflow.
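
If HBase is added, a single write following the schema above could look like this sketch, which assumes the happybase client, a running HBase Thrift server, and a clickstream table with click_data and geo_data column families (all names and values are placeholders):

```python
import happybase  # assumes the happybase package and an HBase Thrift server

connection = happybase.Connection("localhost")  # placeholder host
table = connection.table("clickstream")         # placeholder table name

# The RowKey design is not fixed by the assignment; a userid/timestamp composite is one option.
row_key = b"user123_2008-04-01T10:15:00"

table.put(row_key, {
    b"click_data:userid": b"user123",
    b"click_data:timestamp": b"2008-04-01T10:15:00",
    b"click_data:url": b"http://example.com/product/42",
    b"geo_data:country": b"Poland",
    b"geo_data:city": b"Krakow",
})
```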