Table branching support #139
Proposed solution: in the TableConfigurator, an additional property could be set to mark the version of a certain table:

```yaml
GoldDb:
  name: gold{gold_version}{ID}
  path: /mnt/mystorage/gold{gold_version}/
CustomerTable:
  name: "{SparkTestDb}.customers"
  path: "{SparkTestDb_path}/customers"
```

```python
TableConfigurator().set_extra(gold_version="v42")
```
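To make the substitution idea concrete, here is a minimal sketch of how an extra such as `gold_version` could flow into table-name resolution. This is a hypothetical illustration, not the real TableConfigurator implementation; only the `set_extra(gold_version="v42")` call and the `{gold_version}`/`{ID}` placeholders come from the proposal above.

```python
# Hypothetical sketch of extra-property substitution; the real
# TableConfigurator API may differ beyond the set_extra() call shown above.
class ConfiguratorSketch:
    def __init__(self, raw: dict):
        self._raw = raw      # raw table definitions containing {placeholders}
        self._extras = {}    # values supplied via set_extra()

    def set_extra(self, **kwargs):
        self._extras.update(kwargs)

    def name(self, table_id: str) -> str:
        # Substitute the extras (e.g. gold_version) and the table ID itself.
        template = self._raw[table_id]["name"]
        return template.format(ID=table_id, **self._extras)


cfg = ConfiguratorSketch({"GoldDb": {"name": "gold{gold_version}{ID}"}})
cfg.set_extra(gold_version="v42")
print(cfg.name("GoldDb"))  # -> goldv42GoldDb
```

With this shape, switching a deployment to another branch is a single `set_extra` call; every name and path that references `{gold_version}` follows along.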
The workflow would be that Amy sets up a branch where the … Here are some problems for which I don't have a good solution:
I now have what may be a complete solution:

```sql
DROP TABLE mydata.mytable;
DROP DATABASE mydata;

CREATE DATABASE mydata LOCATION "/mnt/storage/{gold_version_main}";
CREATE TABLE mydata.mytable LOCATION "/mnt/storage/{gold_version_main}/mytable";

CREATE DATABASE mydata{gold_version} LOCATION "/mnt/storage/{gold_version}";
CREATE TABLE mydata{gold_version}.mytable
(
    a integer,
    s string
)
LOCATION "/mnt/storage/{gold_version}/mytable";
```

(The whole (recommended) complexity of having the names and locations controlled in yaml files has been removed here, for clarity.)
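One way to see how the statements above parameterize over a branch: a small Python helper that renders the branch-specific half of the DDL from a version suffix. This is an illustrative sketch; the `branch_ddl` helper, its parameters, and the storage root default are assumptions, not part of the proposal.

```python
# Hypothetical helper rendering the branch-specific DDL shown above.
# The function name, signature, and "/mnt/storage" default are assumptions.
def branch_ddl(gold_version: str, root: str = "/mnt/storage") -> list:
    suffix = gold_version
    return [
        f'CREATE DATABASE mydata{suffix} LOCATION "{root}/{suffix}";',
        f'CREATE TABLE mydata{suffix}.mytable (a integer, s string) '
        f'LOCATION "{root}/{suffix}/mytable";',
    ]


for stmt in branch_ddl("_feature1"):
    print(stmt)
```

Because both the database name and its location carry the same suffix, a branch deployment never collides with main's `mydata` database or its storage path.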
To answer the questions in my previous post:
EDIT: The user experience for developers will be that, unless they want to work on a major evolution, they don't need to change anything in their workflow. If they want to change a table, they start by updating the local version. They can then run a deployment of the branch, which will be deployed side by side with the main jobs. The developer can then run any ETL processes they want, even multi-day processes. If main updates, the developer can merge main into their feature branch and deploy the feature branch again. This deployment will override the previous feature deployment, because the deployment is dependent on the state of the local version file.
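The "deployment depends on the local version file" step can be sketched as follows. Everything here is hypothetical: the file layout, the function name, and the suffix convention are illustrative assumptions about how such a scheme might be wired up.

```python
# Hypothetical sketch: derive a deployment suffix from the local version
# file, so re-deploying after a merge of main overrides the previous
# feature deployment (same branch + version -> same deployment name).
from pathlib import Path


def deployment_suffix(version_file: Path, branch: str) -> str:
    version = version_file.read_text().strip()
    # main deploys unversioned; feature branches get a deterministic suffix
    return "" if branch == "main" else f"_{branch}_{version}"
```

Since the suffix is a pure function of branch name and version-file contents, two deployments of the same branch state land in the same place, which is exactly the override behavior described above.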
I have now worked extensively with the ideas presented above. From now on, the subject shall be called table branching, to distinguish it from the snapshots or delta versioning of tables that already exists today. We need a way to work with more than one table branch in parallel, doing updates to both for a time.
The implementation is too big to be handled in a single PR. The plan is as follows:
Use-case
The life-cycle of a table is such that it is produced in a long-running computation once its logic has been developed. Once the table is produced, it gets maintained with incremental updates at periodic intervals. Sometimes, the transformations that produce a set of tables change. The change of the transforming logic can be controlled with Python library versions. The subject of this GitHub issue is library support for the versioning of the underlying data product.
Data products (tables) are sometimes used by users in other departments, who may be disconnected from the discussions about possible wipe-and-rebuild plans. Even worse, some users may not be able to accept data product down-times. If such cases apply to data products that take a long time to rebuild, data product versioning may be a solution.
This story describes how table versioning may solve the issue: