This repository has been archived by the owner on Aug 25, 2023. It is now read-only.

Table branching support #139

Open
mrmasterplan opened this issue Aug 26, 2022 · 4 comments
Labels
enhancement New feature or request

Comments

@mrmasterplan
Contributor

mrmasterplan commented Aug 26, 2022

Use-case

The life-cycle of a table is such that, once its logic has been developed, the table is produced in a long-running computation. After that, it is maintained with incremental updates at periodic intervals. Sometimes the transformations that produce a set of tables change. Changes to the transformation logic can be controlled with python library versions. The subject of this github issue is library support for versioning the underlying data product.

Data products (tables) are sometimes used by users in other departments, who may be disconnected from the discussions about possible wipe-and-rebuild plans. Even worse, some users may not be able to accept data product down-times. If such cases apply to data products that take a long time to rebuild, data product versioning may be a solution.

This story describes how table versioning may solve the issue:

  • A table is produced by transformation v1; the table is called version alpha.
  • After the full load, transformation v1 is deployed in a daily job to incrementally update version alpha.
  • Developer Amy needs to change the transformation to a v2 in such a way that a rebuild of the table will be necessary. Rebuilding takes three days.
  • The table is in use in version alpha, and the business depends on access to the table and on the table being up-to-date every day.
  • The daily job is tagged with v1, like the transformation that it executes. Job v1 is not stopped and keeps updating table alpha.
  • Meanwhile, Amy deploys job v2 side-by-side with job v1. Job v2 runs a full load of table version beta.
  • After three days job v2 is done and now runs daily to maintain the state of table version beta.
  • Amy verifies that table version beta meets the quality requirements.
  • Amy now directs the users of the table to use version beta as the new standard version.
  • Amy now stops and deletes job v1.
  • After a few days the users of the data product are fully satisfied that table version beta meets their needs and that the daily update succeeds. Amy may now choose to delete table version alpha, which is no longer needed.
@mrmasterplan
Contributor Author

mrmasterplan commented Aug 26, 2022

Proposed solution:

In the TableConfigurator, an additional property could be set to mark the version of a certain table.

GoldDb:
  name: gold{gold_version}{ID}
  path: /mnt/mystorage/gold{gold_version}/

CustomerTable:
  name: "{SparkTestDb}.customers"
  path: "{SparkTestDb_path}/customers"

TableConfigurator().set_extra(gold_version="v42")
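
For illustration, a minimal sketch of how such a version tag could flow into the resolved names and paths. The import path and the table_name/table_path accessors are assumptions about the configurator API; only set_extra is taken from the example above.

from atc.config_master import TableConfigurator  # import path is an assumption

tc = TableConfigurator()
tc.set_extra(gold_version="v42")

# With the yaml above, every name and path that references {gold_version}
# would resolve with the tag substituted:
#   GoldDb name -> goldv42{ID},  GoldDb path -> /mnt/mystorage/goldv42/
# The accessors below are hypothetical; the exact method names may differ.
print(tc.table_name("GoldDb"))
print(tc.table_path("GoldDb"))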

The gold_version also needs to be a part of:

  • the job name to allow deployment of multiple job versions at the same time

The workflow would be that Amy sets up a branch where the gold_version has been updated. She then triggers a deployment of this branch manually. Since the gold_version is part of the job name, a new set of jobs will be deployed without touching the old jobs.
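
As a sketch of the naming idea only (the job name pattern is made up for illustration), the version tag keeps the two deployments from colliding:

# Hypothetical job-naming sketch: because the version tag is part of the
# job name, deploying a branch with a new gold_version creates new jobs
# instead of overwriting the running ones.
def job_name(base: str, gold_version: str) -> str:
    return f"{base}-{gold_version}"

print(job_name("gold-etl-daily", "v42"))  # existing jobs keep running
print(job_name("gold-etl-daily", "v43"))  # branch deployment lands beside them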

Here are some problems for which I don't have a good solution:

  • Data users may want to access the "stable" version of a table under the handle gold.customers, without version tags. If the official tables are deployed as part of the standard deployment pipeline, how does the pipeline decide which version the official handles should point to? Maybe this can be handled with a separate deployment pipeline that does nothing other than set up the official release handles. The price here is a new pipeline.
  • When Amy deploys a full set of jobs where only the gold version has changed, but not the 7 other database versions that are part of her ETL framework, should she then accept that all of her jobs are deployed as copies? Most of her tables point at the same tables as those of the primary transformations. Since the job setup is identical in the branch deployment, race conditions may arise when the copy-jobs are running. Does Amy need to mark up every single transformation step with information on whether it publishes the main version or a future version?
  • If a table changes its version, all downstream tables also need to update their version. How can this be supported or enforced?

@mrmasterplan
Contributor Author

mrmasterplan commented Aug 29, 2022

I now have what may be a complete solution:

  • The production library will contain two json version files that contain the version tags used in the ETL processes:
    • branch-local versions. This file contains the version of each tag to be used when this branch is deployed.
    • main versions. This file contains the versions that the untagged tables (to be used by external users) are supposed to use. (A sketch of both files follows after the SQL example below.)
  • The situation where the two files have exactly the same contents is defined as a 'stable' deployment. If the two files differ in any way, we call this a 'side' deployment (the two terms were simply chosen to simplify the descriptions below).
  • The PR pipeline contains a test to verify the following
    • if the target branch of the PR is the main branch, then a side deployment fails the test pipeline.
  • The library contains sql setup files as follows:
-- Official, untagged handles for external users; they point at the main version location.
DROP TABLE IF EXISTS mydata.mytable;
DROP DATABASE IF EXISTS mydata;

CREATE DATABASE mydata LOCATION "/mnt/storage/{gold_version_main}";
CREATE TABLE mydata.mytable LOCATION "/mnt/storage/{gold_version_main}/mytable";

-- Version-tagged handles used internally by the ETL processes.
CREATE DATABASE mydata{gold_version} LOCATION "/mnt/storage/{gold_version}";
CREATE TABLE mydata{gold_version}.mytable
(
  a integer,
  s string
)
LOCATION "/mnt/storage/{gold_version}/mytable";

(The whole (recommended) complexity of controlling the names and locations in yaml files has been removed here for clarity.)
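
For concreteness, the two version files might look as follows. The file names and the silver_version tag are made up for this sketch; only gold_version appears in the examples above. When the two files are identical, the deployment is 'stable'; in the 'side' deployment shown here, the branch is building a new gold version:

// local_versions.json - versions used by this branch when it is deployed
{
  "gold_version": "v43",
  "silver_version": "v7"
}

// main_versions.json - versions the untagged, external-facing tables point to
{
  "gold_version": "v42",
  "silver_version": "v7"
}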

  • External users should use the table mydata.mytable for all their work.
  • The internal library should refer to the name mydata{gold_version}.mytable for all ETL processes.
  • In a 'stable' deployment, the two names refer to the same data, and everything works as it would without this added complexity.
  • In a 'side' deployment, the following happens:
    • All job schedules are deployed as "PAUSED"
    • All job titles are appended with a short hash of the local version file (for uniqueness of the job deployment)
    • All version tags from the local version file are added to the job as job tags. This allows easy reference to what versions are used.
    • Since both the local and main versions are accessible in the library, it is possible to write "primer" scripts that prime the new version with a copy of the main version, as sketched below.
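
A minimal sketch of such a primer, assuming Delta tables on Databricks and the table names from the SQL example above; the version value would come from the branch-local version file:

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

gold_version = "v43"  # read from the branch-local version file in practice

# Prime the new branch table with a full copy of the main table, so that
# the incremental jobs of the side deployment can start from current data.
(
    spark.read.table("mydata.mytable")
    .write.format("delta")
    .mode("overwrite")
    .saveAsTable(f"mydata{gold_version}.mytable")
)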

To answer the questions in my previous post:

  • access to stable versions has been addressed
  • race conditions are avoided because 'side' deployments are always deployed paused; only the relevant parts are then executed manually.
  • table lineage is not addressed here. It may deserve its own post.

EDIT:

The user experience for the developers will be that, unless they want to work on a major evolution, they don't need to change anything in their workflow.

If they want to change a table, they start by updating the local version file. They can then run a deployment of the branch. This will be deployed side-by-side with the main jobs. The developer can then run any ETL processes they want, even multi-day processes. If main updates, the developer can merge main into their feature branch and deploy the feature branch again. This deployment will override the previous feature deployment, because the deployment depends on the state of the local version file.
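
A sketch of how the short hash mentioned above could make the side deployment deterministic: the same local version file always produces the same job titles, so redeploying the branch overrides the previous side deployment (the file name and title pattern are assumptions):

import hashlib
import json

# The branch-local version file drives the deployment identity.
with open("local_versions.json", "rb") as f:  # file name is an assumption
    contents = f.read()

short_hash = hashlib.sha1(contents).hexdigest()[:8]

# Unique, reproducible job title: unchanged file -> same title -> override.
job_title = f"gold-etl-daily [{short_hash}]"

# The version tags themselves can be attached to the job as job tags.
job_tags = json.loads(contents)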

@LauJohansson LauJohansson added the enhancement New feature or request label Sep 1, 2022
@mrmasterplan mrmasterplan changed the title Table version support Table branching support Sep 8, 2022
@mrmasterplan
Contributor Author

I have now worked extensively with the ideas presented above. From now on the subject shall be called table branching, to distinguish it from the snapshots or delta versioning of tables that already exist today. We need a way to work with more than one table branch in parallel, updating both for a time.

@mrmasterplan
Contributor Author

The implementation is too big to be handled in a single PR. The plan is as follows:

  1. Refactor and streamline the TableConfigurator to prepare it for the new functionality.
  2. Add the new branch support functionality.
