Is your feature request related to a problem? Please describe.
The fake data generation is very bare-bones and serves little purpose beyond technical unit testing of ETL development when there is no direct access to the source data.
Describe the solution you'd like
Ensuring the generated fake data maintains the following (in order of perceived increasing complexity to implement):
Referential integrity: it honors primary and foreign keys in case of a multi-table dataset
Combinatorial integrity
Within a row - e.g. some measurements only have some valid values
Within a table - e.g. getting one drug precludes getting another one
Across tables - e.g. some diseases are specific to women
Temporal integrity: order of certain events is maintained
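As a rough illustration of what these three kinds of integrity mean in practice, the sketch below checks a generated dataset for violations. The table names, column names, and the `FEMALE_ONLY` rule are assumptions invented for this example; they are not part of the existing fake data generator.

```python
from datetime import date

# Hypothetical two-table dataset; all names and values are illustrative only.
persons = [
    {"person_id": 1, "sex": "F", "birth_date": date(1980, 5, 1)},
    {"person_id": 2, "sex": "M", "birth_date": date(1975, 3, 9)},
]
conditions = [
    {"person_id": 1, "condition": "ovarian cyst", "start_date": date(2010, 1, 4)},
    {"person_id": 2, "condition": "ovarian cyst", "start_date": date(2012, 6, 2)},
    {"person_id": 3, "condition": "asthma", "start_date": date(2001, 2, 3)},
    {"person_id": 2, "condition": "asthma", "start_date": date(1970, 1, 1)},
]

# Assumed cross-table rule: these conditions may only occur in female persons.
FEMALE_ONLY = {"ovarian cyst"}


def find_violations(persons, conditions):
    """Return (row_index, kind) for every condition row that breaks a rule."""
    by_id = {p["person_id"]: p for p in persons}
    violations = []
    for i, row in enumerate(conditions):
        person = by_id.get(row["person_id"])
        if person is None:
            # Referential integrity: foreign key has no matching primary key.
            violations.append((i, "referential"))
            continue
        if row["condition"] in FEMALE_ONLY and person["sex"] != "F":
            # Combinatorial integrity across tables.
            violations.append((i, "combinatorial"))
        if row["start_date"] < person["birth_date"]:
            # Temporal integrity: a condition cannot start before birth.
            violations.append((i, "temporal"))
    return violations


print(find_violations(persons, conditions))
```

A generator that maintains these properties would produce data for which such a check returns an empty list.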
Describe alternatives you've considered
We investigated several open-source and commercial synthetic data generation tools, each with its own specific shortcomings:
inability to work on any dataset without prior knowledge of the data model
inability to run without first labeling each variable as categorical/numerical/date/... before applying the generation algorithm
cloud-based solutions
Additional context
Already discussed and shared ideas with @schuemie on where we could start to implement this, but any additional input and/or feedback is very welcome.
The two main open-source solutions that we have tried are:
Synthetic Data Vault (SDV): One of its main advantages, and the one we needed, is its ability to handle relational databases in addition to the single-table format; we tried its Hierarchical Modelling Algorithm for multi-table data. The main issues we experienced relate to temporal integrity, whether row-wise or column-wise, and to missing associations between logically correlated columns (e.g. Survival and Death-date). The tool does support adding constraints between columns, but with limited capabilities. In conclusion, this tool, and more specifically this model (currently the only public one), fails to capture the associations between columns effectively and has poor temporal integrity.
The other tool is PrivBayes Data Synthesiser: its strength is preserving the correlation between columns; however, it requires preprocessing the data to identify the data type of each column (categorical, numerical, date-time, etc.). After experimenting with this tool, we observed that it struggles to capture correlations between string-formatted categorical variables, and it could not provide adequate temporal integrity either.
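To illustrate how the manual labeling step might be avoided, here is a minimal sketch of a heuristic that guesses a column's type from its raw string values. The function `infer_column_type`, the `max_categories` threshold, and the single ISO date format are assumptions made for this example; it is not how SDV or the PrivBayes tool works.

```python
from datetime import datetime


def infer_column_type(values, max_categories=10):
    """Guess the type of a column given as a list of strings.

    Returns one of 'numerical', 'date', 'categorical', 'text', 'unknown'.
    Assumes dates are formatted as YYYY-MM-DD (illustrative choice only).
    """
    non_empty = [v for v in values if v not in ("", None)]
    if not non_empty:
        return "unknown"

    def all_parse(parser):
        # True only if every non-empty value parses without error.
        try:
            for v in non_empty:
                parser(v)
            return True
        except ValueError:
            return False

    if all_parse(float):
        return "numerical"
    if all_parse(lambda v: datetime.strptime(v, "%Y-%m-%d")):
        return "date"
    # Few distinct values suggests a categorical variable.
    if len(set(non_empty)) <= max_categories:
        return "categorical"
    return "text"


print(infer_column_type(["1", "2.5", ""]))
print(infer_column_type(["2020-01-01", "2021-12-31"]))
print(infer_column_type(["M", "F", "F"]))
```

A heuristic like this will misclassify some columns (for example, numeric codes that are really categories), so a real implementation would likely need a way to override the guess.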
In conclusion, the tools we have tried so far hardly meet the requirements listed below, even partially.
Requirements:
1- Preserving the multivariate correlations.
2- Handling multi-table databases.
3- Preserving the temporal integrity vertically and horizontally.
4- Preserving privacy while handling the referential integrity.
We are planning to investigate the tools you mentioned above that we have no experience with.