Smarter fake data generation #419

WoutV · 2024-08-27T15:20:30Z

Is your feature request related to a problem? Please describe.
The fake data generation is very barebones and serves little more than technical unit testing during ETL development when there is no direct access to the source data.

Describe the solution you'd like
The generated fake data should maintain the following (in order of perceived increasing complexity to implement):

Referential integrity: it honors primary and foreign keys in case of a multi-table dataset
Combinatorial integrity:

Within a row - e.g. some measurements only have some valid values
Within a table - e.g. getting one drug precludes getting another one
Across tables - e.g. some diseases are specific to women

Temporal integrity: order of certain events is maintained
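To make two of these properties concrete, here is a minimal sketch of a generator that honors referential and temporal integrity for two tables. It is only an illustration, not a proposal for the actual implementation: the table layout (`person`, `visit`), the column names, and the function `generate_fake_tables` are all hypothetical, and the value ranges are arbitrary.

```python
import random
from datetime import date, timedelta


def generate_fake_tables(n_persons=5, max_visits=3, seed=42):
    """Generate a hypothetical person table and visit table.

    Referential integrity: every visit row points at an existing person_id.
    Temporal integrity: each visit falls after the person's birth date and
    after that person's previous visit.
    """
    rng = random.Random(seed)
    persons, visits = [], []
    for person_id in range(1, n_persons + 1):
        # Arbitrary birth date range, chosen only for illustration.
        birth = date(1950, 1, 1) + timedelta(days=rng.randrange(0, 20000))
        persons.append({"person_id": person_id, "birth_date": birth})

        # Child rows are created from the parent key, so the foreign key
        # can never dangle.
        current = birth
        for _ in range(rng.randint(1, max_visits)):
            # Always move forward in time by at least one day.
            current = current + timedelta(days=rng.randint(1, 3650))
            visits.append(
                {
                    "visit_id": len(visits) + 1,
                    "person_id": person_id,
                    "visit_date": current,
                }
            )
    return persons, visits


if __name__ == "__main__":
    persons, visits = generate_fake_tables()
    print(len(persons), "persons,", len(visits), "visits")
```

Generating child rows from their parent row keeps both properties true by construction. Combinatorial integrity is harder because the valid combinations have to come from the source data or from the scan report, and it is not covered by this sketch.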

Describe alternatives you've considered
Investigated several open-source and commercial synthetic data generation tools, each with their own specific shortcomings:

inability to work on an arbitrary dataset without prior knowledge of the data model
inability to run without first labeling each variable as categorical/numerical/date/... before applying the generation algorithm
cloud-based solutions

Additional context
Already discussed and shared ideas with @schuemie on where we could start to implement this, but any additional input and/or feedback is very welcome.

howff · 2024-09-13T11:57:56Z

Could you say which ones you've tried and what shortcomings you found in them?
(just checking you've tried SynthPop and BIT-ADRUK-synthetic-data-tool)

AhmedYoussefAli · 2024-09-16T09:54:10Z

The two main open-source solutions that we have tried are:
Synthetic Data Vault (SDV): One of its main advantages for us is that it can handle relational databases in addition to the single-table format; we tried its Hierarchical Modelling Algorithm approach for multi-table data. The main issues we experienced relate to temporal integrity, both row-wise and column-wise, and to missing associations between logically correlated columns (e.g. Survival and Death-date). It is worth mentioning that the tool supports adding constraints between columns, but with limited capabilities. In conclusion, this tool, and more specifically this model (currently the only public model), does not capture the associations between columns well and has poor temporal integrity.
The other tool is PrivBayes Data Synthesiser: its strength is preserving the correlations between columns; however, it requires preprocessing the data to identify each column's data type (categorical, numerical, date-time, etc.). After experimenting with this tool, we observed that it struggles to capture correlations between string-formatted categorical variables, and it also could not provide adequate temporal integrity.

In conclusion, the tools we have tried so far hardly meet the requirements listed below, even partially.
Requirements:
1- Preserving multivariate correlations.
2- Handling multi-table databases.
3- Preserving temporal integrity vertically and horizontally.
4- Preserving privacy while handling referential integrity.
We have no experience with the tools you mentioned above; we are planning to investigate them.
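As a rough illustration of how requirement 3 and the Survival/Death-date example could be checked on generated output, here is a small validation sketch. It assumes one possible reading of the terms: "horizontal" means consistency between columns within a row, and "vertical" means the order of a date column across the rows of one person. The column names (`survived`, `death_date`, `diagnosis_date`, `person_id`) and both function names are hypothetical.

```python
from datetime import date


def row_violations(row):
    """Horizontal check: logically correlated columns within a single row."""
    problems = []
    if row["survived"] and row["death_date"] is not None:
        problems.append("survived but has a death_date")
    if not row["survived"] and row["death_date"] is None:
        problems.append("did not survive but death_date is missing")
    if row["death_date"] is not None and row["death_date"] < row["diagnosis_date"]:
        problems.append("death_date before diagnosis_date")
    return problems


def column_order_violations(rows, key="person_id", column="diagnosis_date"):
    """Vertical check: within each key, dates must not go backwards
    from one row to the next. Returns the indices of offending rows."""
    last_seen = {}
    bad = []
    for index, row in enumerate(rows):
        k = row[key]
        if k in last_seen and row[column] < last_seen[k]:
            bad.append(index)
        last_seen[k] = row[column]
    return bad


if __name__ == "__main__":
    rows = [
        {"person_id": 1, "survived": True, "death_date": None,
         "diagnosis_date": date(2020, 1, 1)},
        {"person_id": 1, "survived": True, "death_date": date(2019, 5, 1),
         "diagnosis_date": date(2019, 6, 1)},
    ]
    for i, r in enumerate(rows):
        print(i, row_violations(r))
    print(column_order_violations(rows))
```

Checks like these could be run against the output of any of the tools discussed above to compare them on the same criteria.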

WoutV added the enhancement label Aug 27, 2024
