A reference implementation of the write-audit-publish pattern with Bauplan and DBOS
A common need in S3-backed analytics systems (e.g. a data lakehouse) is safely ingesting new data into tables that are available to downstream consumers.
Data engineering best practices suggest the Write-Audit-Publish (WAP) pattern, which consists of three main logical steps (sketched in code below):
- Write: ingest data into a "staging" / "temporary" section of the lakehouse (a data branch) - the data is not yet visible to downstream consumers;
- Audit: run quality checks on the data to verify its integrity and quality (avoiding the "garbage in, garbage out" problem);
- Publish: if the quality checks succeed, publish the data to the production branch - the data is now visible to downstream consumers; otherwise, raise an error and perform clean-up operations.
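In plain Python terms, the control flow is roughly as follows. This is a conceptual sketch only: `write`, `audit`, `publish`, and `cleanup` are hypothetical placeholders for the real operations described later in this README.

```python
# Conceptual WAP skeleton -- write, audit, publish, and cleanup are
# hypothetical placeholders, not actual APIs from this project.
def write_audit_publish() -> None:
    branch = write()          # ingest into an isolated staging branch
    try:
        if audit(branch):     # run quality checks on the staged data
            publish(branch)   # merge the staging branch into production
        else:
            raise ValueError("Audit failed: data not published")
    finally:
        cleanup(branch)       # always remove the staging branch
```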
This repository showcases how DBOS and Bauplan can be used to implement WAP in ~150 lines of no-nonsense pure Python code: no knowledge of the JVM, SQL or Iceberg is required.
If you are impatient and want to see the project in action, this is us running the code from our laptop.
While the workflow looks and feels like a simple, no-nonsense Python script, a lot of magic happens behind the scenes in the cloud, over object storage. In particular, the WAP logic maps exactly to Bauplan operations over the data lake (see the sketch after this list):
- create a data branch, a zero-copy sandbox of the entire data lake in which to perform the ingestion safely;
- create an Iceberg table inside this ingestion branch, loading the files in S3 into it;
- retrieve a selected column from the Iceberg table to make sure there are no nulls (quality check);
- merge the data branch into the production branch (on success), and clean up the data branch before exiting.
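A minimal sketch of these operations with the Bauplan Python SDK might look like the following. Method names reflect SDK version 0.0.3a292 and may change as the platform evolves; the table, branch, column, and S3 values are illustrative placeholders, and namespace handling is omitted for brevity:

```python
import bauplan

client = bauplan.Client()  # uses your Bauplan API key

# 1. Create a zero-copy ingestion branch off main
client.create_branch("mybauplanuser.dbos_ingestion", from_ref="main")

# 2. Create an Iceberg table in the branch and load the S3 files into it
client.create_table(
    table="yellow_trips",
    search_uri="s3://mybucket/yellow_tripdata_2024-01.parquet",
    branch="mybauplanuser.dbos_ingestion",
)
client.import_data(
    table="yellow_trips",
    search_uri="s3://mybucket/yellow_tripdata_2024-01.parquet",
    branch="mybauplanuser.dbos_ingestion",
)

# 3. Audit: query the branch and check a column for NULLs
rows = client.query(
    "SELECT COUNT(*) AS nulls FROM yellow_trips WHERE tpep_pickup_datetime IS NULL",
    ref="mybauplanuser.dbos_ingestion",
)
assert rows["nulls"][0].as_py() == 0, "Audit failed: NULLs found"

# 4. Publish: merge the branch into main, then clean up
client.merge_branch(source_ref="mybauplanuser.dbos_ingestion", into_branch="main")
client.delete_branch("mybauplanuser.dbos_ingestion")
```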
What looks to the developer like a simple function call (wrapped by DBOS for durable execution) is actually a complex sequence of infrastructure and cloud operations that Bauplan performs for you: you do not need to know anything about Iceberg specs, data branches, or columnar querying, and can just focus on the business logic.
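To make the durable-execution side concrete, here is a rough sketch of how the WAP steps could be wrapped with the DBOS Python decorators. The function bodies are placeholders, not this project's actual code:

```python
from dbos import DBOS

@DBOS.step()
def write() -> str:
    ...  # create the ingestion branch and load the S3 files via Bauplan

@DBOS.step()
def audit(branch: str) -> bool:
    ...  # run the NULL check on the branch

@DBOS.step()
def publish(branch: str) -> None:
    ...  # merge the branch into main

@DBOS.step()
def cleanup(branch: str) -> None:
    ...  # delete the ingestion branch

@DBOS.workflow()
def wap_workflow() -> None:
    # If the process crashes mid-run, DBOS resumes the workflow
    # from the last completed step instead of restarting it.
    branch = write()
    if audit(branch):
        publish(branch)
    cleanup(branch)
```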
Bauplan is the programmable lakehouse: you can load, transform, and query data, all from your code (CLI or Python). You can start by reading our docs, dive deep into the underlying architecture, or explore how the API simplifies advanced use cases.
To use Bauplan, you need an API key for our demo environment: you can request one here. Run the 3-minute quick start to get familiar with the platform first.
Note: the current SDK version is 0.0.3a292, but it is subject to change as the platform evolves - ping us if you need help with any of the APIs used in this project.
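If you are unsure which version you have installed, a quick check with the Python standard library (assuming the SDK is installed as the `bauplan` package):

```python
from importlib.metadata import version

# Print the installed Bauplan SDK version, e.g. 0.0.3a292
print(version("bauplan"))
```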
To run a Write-Audit-Publish flow you need some files to write first!
When using the Bauplan demo environment, any Parquet or CSV file in a publicly readable bucket will do: just load your (non-sensitive!) file(s) into an S3 bucket and set the appropriate permissions.
Note: our example video demo below is based on the Yellow Trip Dataset - adjust the quality check function accordingly if you use a different dataset.
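As an illustration, the null check could be parameterized along these lines, making it easy to point at a different table or column. This is a sketch, not the project's actual function; `tpep_pickup_datetime` is a column in the Yellow Trip Dataset:

```python
import bauplan

def column_has_no_nulls(client: bauplan.Client, table: str, column: str, ref: str) -> bool:
    # Count NULLs in the given column on the given branch
    sql = f"SELECT COUNT(*) AS n FROM {table} WHERE {column} IS NULL"
    result = client.query(sql, ref=ref)  # returns a pyarrow Table
    return result["n"][0].as_py() == 0

# e.g. for the Yellow Trip Dataset on the ingestion branch:
# column_has_no_nulls(bauplan.Client(), "yellow_trips", "tpep_pickup_datetime", "mybauplanuser.dbos_ingestion")
```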
Install the required dependencies in a virtual environment:
python3 -m venv venv
source venv/bin/activate
pip install -r requirements.txt
Run the local DBOS setup to get started - i.e. install the CLI tool and set up the database with one of the recommended methods. For example, if you have Docker installed, you can use the following commands to start a containerized Postgres database (customize the variables at your discretion):
docker pull postgres
docker run --name some-postgres -e POSTGRES_USER=postgres -e POSTGRES_PASSWORD=password -p 5432:5432 -d postgres
Once the database is running, make sure your dbos-config.yaml has both the database and the env sections properly set up. For example:
env:
  TABLE_NAME: 'yellow_trips'
  BRANCH_NAME: 'mybauplanuser.dbos_ingestion'
  S3_PATH: 's3://mybucket/yellow_tripdata_2024-01.parquet'
  NAMESPACE: 'dbos'
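These values are then available to the workflow as ordinary environment variables (a sketch assuming DBOS exports the env section to the process environment, as the DBOS configuration docs describe):

```python
import os

# Values injected from the env section of dbos-config.yaml
TABLE_NAME = os.environ["TABLE_NAME"]
BRANCH_NAME = os.environ["BRANCH_NAME"]
S3_PATH = os.environ["S3_PATH"]
NAMESPACE = os.environ["NAMESPACE"]
```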
Remember to run the migration on the database when you first set up the DBOS project in your Postgres:
dbos migrate
You can run the workflow with DBOS through the CLI:
dbos start
If you want to see the end result, you can watch this video demonstration of the flow in action, both in the case of a successful audit and in the case of failure.
The code in this project is licensed under the MIT License (DBOS and Bauplan belong to their respective owners and come with their own licenses).