This repository has been archived by the owner on Aug 25, 2023. It is now read-only.

Table branching support #139

Open
mrmasterplan opened this issue Aug 26, 2022 · 4 comments
Labels
enhancement New feature or request

Comments

@mrmasterplan
Contributor

mrmasterplan commented Aug 26, 2022

Use-case

The life-cycle of a table is such that, once its logic has been developed, the table is produced in a long-running computation. After that, it is maintained with incremental updates at periodic intervals. Sometimes the transformations that produce a set of tables change. Changes to the transformation logic can be controlled with python library versions. The subject of this github issue is library support for versioning the underlying data product.

Data products (tables) are sometimes used by users in other departments, who may be disconnected from the discussions about possible wipe-and-rebuild plans. Even worse, some users may not be able to accept data product down-times. If such cases apply to data products that take a long time to rebuild, data product versioning may be a solution.

This story describes how table versioning may solve the issue:

  • A table is produced by transformation v1; the table is called version alpha.
  • After the full load, transformation v1 is deployed in a daily job to incrementally update version alpha.
  • Developer Amy needs to change the transformation to a v2 in such a way that a rebuild of the table will be necessary. Rebuilding takes three days.
  • The table is in use in version alpha, and the business depends on access to the table and on the table being up-to-date every day.
  • The daily job is tagged with v1, like the transformation that it executes. Job v1 is not stopped and keeps updating table alpha.
  • Meanwhile, Amy deploys job v2 side-by-side with job v1. Job v2 runs a full load of table version beta.
  • After three days job v2 is done and now runs daily to maintain the state of table version beta.
  • Amy verifies that table version beta meets the quality requirements.
  • Amy now directs the users of the table to use version beta as the new standard version.
  • Amy now stops and deletes job v1.
  • After a few days the users of the data product are fully satisfied that table version beta meets their needs and that the daily update succeeds. Amy may now choose to delete table version alpha, which is no longer needed.
@mrmasterplan
Contributor Author

mrmasterplan commented Aug 26, 2022

Proposed solution:

In the TableConfigurator, an additional property could be set to mark the version of a certain table.

GoldDb:
  name: gold{gold_version}{ID}
  path: /mnt/mystorage/gold{gold_version}/

CustomerTable:
  name: "{SparkTestDb}.customers"
  path: "{SparkTestDb_path}/customers"

TableConfigurator().set_extra(gold_version="v42")
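
For illustration, a minimal sketch of how such a version tag could flow into the resolved names and paths. The import path and the table_name/table_path accessors are assumptions about the configurator API; only set_extra is taken from the example above.

from atc.config_master import TableConfigurator  # import path is an assumption

tc = TableConfigurator()
tc.set_extra(gold_version="v42")

# With the yaml above, every name and path that references {gold_version}
# would resolve with the tag substituted:
#   GoldDb name -> goldv42{ID},  GoldDb path -> /mnt/mystorage/goldv42/
# The accessors below are hypothetical; the exact method names may differ.
print(tc.table_name("GoldDb"))
print(tc.table_path("GoldDb"))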

The gold_version also needs to be a part of:

  • the job name to allow deployment of multiple job versions at the same time

The workflow would be that Amy sets up a branch where the gold_version has been updated. She then triggers a deployment of this branch manually. Since the gold_version is part of the job name, a new set of jobs will be deployed without touching the old jobs.
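
As a sketch of the naming idea only (the job name pattern is made up for illustration), the version tag keeps the two deployments from colliding:

# Hypothetical job-naming sketch: because the version tag is part of the
# job name, deploying a branch with a new gold_version creates new jobs
# instead of overwriting the running ones.
def job_name(base: str, gold_version: str) -> str:
    return f"{base}-{gold_version}"

print(job_name("gold-etl-daily", "v42"))  # existing jobs keep running
print(job_name("gold-etl-daily", "v43"))  # branch deployment lands beside them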

Here are some problems for which I don't have a good solution:

  • Data users may want to access the "stable" version of a table under the handle gold.customers, without version tags. If the official tables are deployed as part of the standard deployment pipeline, how does the pipeline decide which version the official handles should point to? Maybe this can be handled with a separate deployment pipeline that does nothing other than set up the official release handles. The price here is a new pipeline.
  • When Amy deploys a full set of jobs where only the gold version has changed, but not the 7 other database versions that are part of her ETL framework, should she then accept that all of her jobs are deployed as copies? Most of her tables point at the same tables as those of the primary transformations. Since the job setup is identical in the branch deployment, race conditions may arise when the copy-jobs are running. Does Amy need to mark up every single transformation step with information on whether it publishes the main version or a future version?
  • If a table changes its version, all downstream tables also need to update their version. How can this be supported or enforced?

@mrmasterplan
Contributor Author

mrmasterplan commented Aug 29, 2022

I now have what may be a complete solution:

  • The production library will contain two json version files that contain the version tags used in the ETL processes:
    • branch-local versions. This file contains the version of each tag to be used when this branch is deployed.
    • main versions. This file contains the versions that the untagged tables (to be used by external users) are supposed to use. (A sketch of both files follows after the SQL example below.)
  • The situation where the two files have exactly the same contents is defined as a 'stable' deployment. If the two files differ in any way, we call this a 'side' deployment (the two terms were simply chosen to simplify the descriptions below).
  • The PR pipeline contains a test to verify the following
    • if the target branch of the PR is the main branch, then a side deployment fails the test pipeline.
  • The library contains sql setup files as follows:
-- Official, untagged handles for external users; they point at the main version location.
DROP TABLE IF EXISTS mydata.mytable;
DROP DATABASE IF EXISTS mydata;

CREATE DATABASE mydata LOCATION "/mnt/storage/{gold_version_main}";
CREATE TABLE mydata.mytable LOCATION "/mnt/storage/{gold_version_main}/mytable";

-- Version-tagged handles used internally by the ETL processes.
CREATE DATABASE mydata{gold_version} LOCATION "/mnt/storage/{gold_version}";
CREATE TABLE mydata{gold_version}.mytable
(
  a integer,
  s string
)
LOCATION "/mnt/storage/{gold_version}/mytable";

(The whole (recommended) complexity of controlling the names and locations in yaml files has been removed here for clarity.)
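
For concreteness, the two version files might look as follows. The file names and the silver_version tag are made up for this sketch; only gold_version appears in the examples above. When the two files are identical, the deployment is 'stable'; in the 'side' deployment shown here, the branch is building a new gold version:

// local_versions.json - versions used by this branch when it is deployed
{
  "gold_version": "v43",
  "silver_version": "v7"
}

// main_versions.json - versions the untagged, external-facing tables point to
{
  "gold_version": "v42",
  "silver_version": "v7"
}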

  • External users should use the table mydata.mytable for all their work.
  • The internal library should refer to the name mydata{gold_version}.mytable for all ETL processes.
  • In a 'stable' deployment, the two names refer to the same data, and everything works as it would without this added complexity.
  • In a 'side' deployment, the following happens:
    • All job schedules are deployed as "PAUSED"
    • All job titles are appended with a short hash of the local version file (for uniqueness of the job deployment)
    • All version tags from the local version file are added to the job as job tags. This allows easy reference to what versions are used.
    • Since both the local and main versions are accessible in the library, it is possible to write "primer" scripts that prime the new version with a copy of the main version, as sketched below.
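
A minimal sketch of such a primer, assuming Delta tables on Databricks and the table names from the SQL example above; the version value would come from the branch-local version file:

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

gold_version = "v43"  # read from the branch-local version file in practice

# Prime the new branch table with a full copy of the main table, so that
# the incremental jobs of the side deployment can start from current data.
(
    spark.read.table("mydata.mytable")
    .write.format("delta")
    .mode("overwrite")
    .saveAsTable(f"mydata{gold_version}.mytable")
)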

To answer the questions in my previous post:

  • access to stable versions has been addressed
  • race conditions are avoided because 'side' deployments are always deployed paused; only the relevant parts are then executed manually.
  • table lineage is not addressed here. It may deserve its own post.

EDIT:

The user experience for the developers will be that, unless they want to work on a major evolution, they don't need to change anything in their workflow.

If they want to change a table, they start by updating the local version file. They can then run a deployment of the branch. This will be deployed side-by-side with the main jobs. The developer can then run any ETL processes they want, even multi-day processes. If main updates, the developer can merge main into their feature branch and deploy the feature branch again. This deployment will override the previous feature deployment, because the deployment depends on the state of the local version file.
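
A sketch of how the short hash mentioned above could make the side deployment deterministic: the same local version file always produces the same job titles, so redeploying the branch overrides the previous side deployment (the file name and title pattern are assumptions):

import hashlib
import json

# The branch-local version file drives the deployment identity.
with open("local_versions.json", "rb") as f:  # file name is an assumption
    contents = f.read()

short_hash = hashlib.sha1(contents).hexdigest()[:8]

# Unique, reproducible job title: unchanged file -> same title -> override.
job_title = f"gold-etl-daily [{short_hash}]"

# The version tags themselves can be attached to the job as job tags.
job_tags = json.loads(contents)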

@LauJohansson LauJohansson added the enhancement New feature or request label Sep 1, 2022
@mrmasterplan mrmasterplan changed the title Table version support Table branching support Sep 8, 2022
@mrmasterplan
Contributor Author

I have now worked extensively with the ideas presented above. From now on the subject shall be called table branching, to distinguish it from the snapshots or delta versioning of tables that already exist today. We need a way to work with more than one table branch in parallel, updating both for a time.

@mrmasterplan
Contributor Author

The implementation is too big to be handled in a single PR. The plan is as follows:

  1. Refactor and streamline the TableConfigurator to prepare it for the new functionality.
  2. Add the new branch support functionality.
