Spark ETL to ingest Web Events

Spark data pipeline that ingests and transforms Airbyte events data.

Data Architecture

The project implements a Data Lakehouse architecture with the following layers:

  • Raw: Contains raw data files ingested directly from an event stream, e.g. Kafka. This layer is generally not meant to be accessed directly by consumers.
  • Standardised: Contains standardised data (catalogued tables) based on the raw data, with basic cleaning transformations applied: PII masking, flattening of nested columns, and basic data quality checks.
  • Curated: Contains transformed data (catalogued tables) according to business and data quality rules.

Delta is used as the table format.
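
To make the hand-off between layers concrete, below is a minimal sketch (not the repository's actual code) of writing a raw dataset into the Standardised layer as a Delta table; the standardised path, app name, and JSON input format are assumptions.

```python
from pyspark.sql import SparkSession

# Placeholder paths -- the real locations would come from app_config.yaml.
RAW_PATH = "data-lake-raw-dev/airbyte/2023/04/12"
STANDARDISED_PATH = "data-lake-standardised-dev/web_events"  # assumed name

spark = (
    SparkSession.builder.appName("spark-web-events")
    # Delta support assumes the delta-spark package is on the classpath.
    .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension")
    .config("spark.sql.catalog.spark_catalog",
            "org.apache.spark.sql.delta.catalog.DeltaCatalog")
    .getOrCreate()
)

# Read raw event files (JSON format assumed) and persist them as a Delta table.
raw_df = spark.read.json(RAW_PATH)
raw_df.write.format("delta").mode("overwrite").save(STANDARDISED_PATH)
```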

Data pipeline design

The data pipeline consists of the following tasks:

  • Standardise task: Flattens the data and applies basic data quality checks, producing an output that many downstream teams can reuse (a sketch follows below).
  • Curate task: Consumes the dataset from Standardised, applies transformations and business logic, and persists the result into Curated.

The datasets are initially partitioned by execution date (with the option to add more partitioning columns).
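
As an illustration of the Standardise step, here is a hedged PySpark sketch of flattening nested struct columns and adding the execution-date partition column; the flatten_structs helper and column names are illustrative, not taken from the repository.

```python
from pyspark.sql import DataFrame
from pyspark.sql import functions as F
from pyspark.sql.types import StructType


def flatten_structs(df: DataFrame) -> DataFrame:
    """Promote nested struct fields to top-level columns (one level deep)."""
    cols = []
    for field in df.schema.fields:
        if isinstance(field.dataType, StructType):
            for nested in field.dataType.fields:
                cols.append(
                    F.col(f"{field.name}.{nested.name}")
                    .alias(f"{field.name}_{nested.name}")
                )
        else:
            cols.append(F.col(field.name))
    return df.select(cols)


def standardise(df: DataFrame, execution_date: str) -> DataFrame:
    # Add the partition column shared by all layers (column name assumed).
    return flatten_structs(df).withColumn("execution_date", F.lit(execution_date))


# Example write, partitioned by execution date:
# standardise(raw_df, "2023-04-12").write.format("delta") \
#     .partitionBy("execution_date").mode("overwrite").save(STANDARDISED_PATH)
```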

Each task runs data quality checks on the output dataset immediately after writing it; the checks are defined using Soda.
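
For reference, the snippet below sketches one way to run such a Soda scan against a Spark DataFrame with the soda-core-spark-df package; the view name, scan definition name, and checks file layout are assumptions rather than the repository's actual setup.

```python
from pyspark.sql import DataFrame, SparkSession
from soda.scan import Scan


def run_soda_checks(spark: SparkSession, output_df: DataFrame, checks_file: str) -> None:
    """Run SodaCL checks against a freshly written dataset (sketch only)."""
    # Expose the DataFrame to Soda through a temp view (view name assumed).
    output_df.createOrReplaceTempView("standardised_web_events")

    scan = Scan()
    scan.set_scan_definition_name("standardise_output_checks")  # name assumed
    scan.set_data_source_name("spark_df")
    scan.add_spark_session(spark, data_source_name="spark_df")
    # SodaCL checks (e.g. row_count > 0) live in a YAML file passed by the caller.
    scan.add_sodacl_yaml_file(checks_file)
    scan.execute()

    # Raise if any data quality check failed, so the task stops here.
    scan.assert_no_checks_fail()
```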

Configuration management

Configuration is defined in app_config.yaml and managed by the ConfigManager class, which is a wrapper around Dynaconf.
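
The actual ConfigManager implementation is not reproduced here; the sketch below shows one plausible way such a wrapper around Dynaconf and app_config.yaml could look, and the settings key in the usage comment is hypothetical.

```python
from dynaconf import Dynaconf


class ConfigManager:
    """Thin wrapper around Dynaconf exposing pipeline settings (illustrative only)."""

    def __init__(self, settings_file: str = "app_config.yaml") -> None:
        self._settings = Dynaconf(settings_files=[settings_file])

    def get(self, key: str, default=None):
        # Dynaconf supports dotted keys such as "paths.standardised".
        return self._settings.get(key, default)


# Example usage (the key name is hypothetical):
# config = ConfigManager()
# standardised_path = config.get("paths.standardised")
```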

Packaging and dependency management

Poetry is used for Python packaging and dependency management.

Pre-requisites

Execution instructions

The repo includes a Makefile. Please run make help to see usage.

```
make setup
make build
```

Raw data is currently stored under data-lake-raw-dev/airbyte/2023/04/12 (feel free to choose a different date). To obtain the results, run:

```
make run-local task=standardise execution-date=2023-04-12
make run-local task=curate execution-date=2023-04-12
```
