How to Generate Synthetic Data with a GAN While Preserving Row Constraints and Outcome Patterns? #2355

Open
WilsimanEvangelista opened this issue Jan 21, 2025 · 1 comment
Labels
question General question about the software under discussion Issue is currently being discussed

Comments

@WilsimanEvangelista

How can I generate synthetic data based on an existing table with 6 columns (including the outcome column), where the outcome can be 0, 1, or 2, using a GAN, and considering the following constraints?

  • The cell values must be between 0 and 1 (inclusive), with up to two decimal places.

  • The sum of all cells in a row, except the last column, must be exactly 1.

Additionally:

  • New rows should follow the same "pattern" as the original rows based on the "outcome" (last column). For instance, if rows with "outcome" equal to 0 have a specific "pattern" in the values of the other columns, the newly generated rows for that "outcome" should preserve that pattern.

  • The process should allow generating different amounts of synthetic data for each "outcome," while maintaining the above constraints.

How can I implement this using a GAN in Python? If possible, provide examples of libraries and code to set up and train the GAN to meet these constraints and the requested pattern.

@npatki npatki added question General question about the software new Automatic label applied to new issues labels Jan 21, 2025
@npatki
Contributor

npatki commented Jan 22, 2025

Hi @WilsimanEvangelista, nice to meet you. Have you already tried using the SDV library with your data? I think it would be helpful to run through the resources below with your dataset, as SDV's synthetic data is designed to meet almost all of the points you listed above.

General SDV Resources:

Possibly the only pattern that will be difficult for a GAN to learn out-of-the-box is this one:

The sum of all cells in a row, except the last column, must be exactly 1.

To achieve this, you can apply constraints.
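
To make concrete what this requirement demands, here is a minimal, library-independent sketch. It is not SDV's constraint API; it is a plain post-processing step one could apply to generated rows. It rescales the five feature values so they sum to 1, then rounds them to two decimals with the largest-remainder method so the rounded values still sum to exactly 1. The function names `round_row_to_unit_sum` and `satisfies_constraints` are hypothetical, and the sketch assumes the raw generated values are non-negative.

```python
from decimal import Decimal
from fractions import Fraction
import math


def round_row_to_unit_sum(values, decimals=2):
    """Rescale non-negative values to sum to 1, then round to `decimals`
    places so the rounded values still sum to exactly 1.

    Uses the largest-remainder method with exact rational arithmetic,
    so binary floating-point error cannot break the sum.
    """
    exact = [Fraction(v) for v in values]
    if any(v < 0 for v in exact):
        raise ValueError("values must be non-negative")
    total = sum(exact)
    if total == 0:
        raise ValueError("at least one value must be positive")

    units = 10 ** decimals                      # e.g. 100 units of 0.01
    scaled = [v / total * units for v in exact]
    floors = [math.floor(s) for s in scaled]    # round everything down first
    shortfall = units - sum(floors)             # units still to hand out

    # Hand the leftover units to the entries with the largest fractional parts.
    by_remainder = sorted(
        range(len(scaled)),
        key=lambda i: scaled[i] - floors[i],
        reverse=True,
    )
    for i in by_remainder[:shortfall]:
        floors[i] += 1

    # Return Decimals: two-decimal values are exact in base 10, not in base 2.
    return [Decimal(f).scaleb(-decimals) for f in floors]


def satisfies_constraints(features, outcome, decimals=2):
    """Check the rules from the question: each feature lies in [0, 1] with at
    most `decimals` decimal places, the features sum to exactly 1, and the
    outcome is 0, 1, or 2."""
    values = [Decimal(str(v)) for v in features]
    in_range = all(0 <= v <= 1 for v in values)
    few_places = all(v == round(v, decimals) for v in values)
    return in_range and few_places and sum(values) == 1 and outcome in (0, 1, 2)


# Example: a raw generated row whose features do not yet satisfy the rules.
raw_features = [0.5, 0.25, 0.125, 0.0625, 0.0625]
fixed = round_row_to_unit_sum(raw_features)
print(fixed, satisfies_constraints(fixed, outcome=2))
```

Returning `Decimal` values rather than floats is a deliberate choice here: most two-decimal numbers have no exact binary representation, so a row of floats can fail an "exactly 1" check even when it looks correct on paper. Note that rescaling changes the generated values slightly, so it is worth checking afterwards that the per-outcome patterns still look like the original data.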

Constraints resources:

If you are able to run through this and have any specific questions, we request that you please share the code and any output(s) you are getting. Thanks.

@npatki npatki added under discussion Issue is currently being discussed and removed new Automatic label applied to new issues labels Jan 22, 2025