Spark ETL to ingest Web Events

Spark data pipeline that ingests and transforms Airbyte events data.

Data Architecture

The project implements a Data Lakehouse architecture with the following layers:

  • Raw: Contains raw data files ingested directly from an event stream, e.g. Kafka. This layer is generally not meant to be accessed directly by consumers.
  • Standardised: Contains standardised data (catalogued tables) based on the raw data, with basic cleaning transformations applied: PII masking, flattening of nested columns, and basic data quality checks.
  • Curated: Contains transformed data (catalogued tables) according to business and data quality rules.

Delta is used as the table format.
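
To make the hand-off between layers concrete, below is a minimal sketch (not the repository's actual code) of writing a raw dataset into the Standardised layer as a Delta table; the standardised path, app name, and JSON input format are assumptions.

```python
from pyspark.sql import SparkSession

# Placeholder paths -- the real locations would come from app_config.yaml.
RAW_PATH = "data-lake-raw-dev/airbyte/2023/04/12"
STANDARDISED_PATH = "data-lake-standardised-dev/web_events"  # assumed name

spark = (
    SparkSession.builder.appName("spark-web-events")
    # Delta support assumes the delta-spark package is on the classpath.
    .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension")
    .config("spark.sql.catalog.spark_catalog",
            "org.apache.spark.sql.delta.catalog.DeltaCatalog")
    .getOrCreate()
)

# Read raw event files (JSON format assumed) and persist them as a Delta table.
raw_df = spark.read.json(RAW_PATH)
raw_df.write.format("delta").mode("overwrite").save(STANDARDISED_PATH)
```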

Data pipeline design

The data pipeline consists of the following tasks:

  • Standardise task: Flattens the data and applies basic data quality checks, producing an output that many downstream teams can reuse (a sketch follows below).
  • Curate task: Consumes the dataset from Standardised, applies transformations and business logic, and persists the result into Curated.

The datasets are initially partitioned by execution date (with the option to add more partitioning columns).
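
As an illustration of the Standardise step, here is a hedged PySpark sketch of flattening nested struct columns and adding the execution-date partition column; the flatten_structs helper and column names are illustrative, not taken from the repository.

```python
from pyspark.sql import DataFrame
from pyspark.sql import functions as F
from pyspark.sql.types import StructType


def flatten_structs(df: DataFrame) -> DataFrame:
    """Promote nested struct fields to top-level columns (one level deep)."""
    cols = []
    for field in df.schema.fields:
        if isinstance(field.dataType, StructType):
            for nested in field.dataType.fields:
                cols.append(
                    F.col(f"{field.name}.{nested.name}")
                    .alias(f"{field.name}_{nested.name}")
                )
        else:
            cols.append(F.col(field.name))
    return df.select(cols)


def standardise(df: DataFrame, execution_date: str) -> DataFrame:
    # Add the partition column shared by all layers (column name assumed).
    return flatten_structs(df).withColumn("execution_date", F.lit(execution_date))


# Example write, partitioned by execution date:
# standardise(raw_df, "2023-04-12").write.format("delta") \
#     .partitionBy("execution_date").mode("overwrite").save(STANDARDISED_PATH)
```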

Each task runs data quality checks on the output dataset immediately after writing it; the checks are defined using Soda.
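
For reference, the snippet below sketches one way to run such a Soda scan against a Spark DataFrame with the soda-core-spark-df package; the view name, scan definition name, and checks file layout are assumptions rather than the repository's actual setup.

```python
from pyspark.sql import DataFrame, SparkSession
from soda.scan import Scan


def run_soda_checks(spark: SparkSession, output_df: DataFrame, checks_file: str) -> None:
    """Run SodaCL checks against a freshly written dataset (sketch only)."""
    # Expose the DataFrame to Soda through a temp view (view name assumed).
    output_df.createOrReplaceTempView("standardised_web_events")

    scan = Scan()
    scan.set_scan_definition_name("standardise_output_checks")  # name assumed
    scan.set_data_source_name("spark_df")
    scan.add_spark_session(spark, data_source_name="spark_df")
    # SodaCL checks (e.g. row_count > 0) live in a YAML file passed by the caller.
    scan.add_sodacl_yaml_file(checks_file)
    scan.execute()

    # Raise if any data quality check failed, so the task stops here.
    scan.assert_no_checks_fail()
```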

Configuration management

Configuration is defined in app_config.yaml and managed by the ConfigManager class, which is a wrapper around Dynaconf.
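
The actual ConfigManager implementation is not reproduced here; the sketch below shows one plausible way such a wrapper around Dynaconf and app_config.yaml could look, and the settings key in the usage comment is hypothetical.

```python
from dynaconf import Dynaconf


class ConfigManager:
    """Thin wrapper around Dynaconf exposing pipeline settings (illustrative only)."""

    def __init__(self, settings_file: str = "app_config.yaml") -> None:
        self._settings = Dynaconf(settings_files=[settings_file])

    def get(self, key: str, default=None):
        # Dynaconf supports dotted keys such as "paths.standardised".
        return self._settings.get(key, default)


# Example usage (the key name is hypothetical):
# config = ConfigManager()
# standardised_path = config.get("paths.standardised")
```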

Packaging and dependency management

Poetry is used for Python packaging and dependency management.

Pre-requisites

Execution instructions

The repo includes a Makefile. Please run make help to see usage.

```
make setup
make build
```

Raw data is currently stored under data-lake-raw-dev/airbyte/2023/04/12 (feel free to choose a different date). To obtain the results, run:

```
make run-local task=standardise execution-date=2023-04-12
make run-local task=curate execution-date=2023-04-12
```
