Generating out-of-distribution data #2399

Open
weirdfishs opened this issue Feb 27, 2025 · 1 comment
Labels
question (General question about the software), under discussion (Issue is currently being discussed)

Comments

@weirdfishs

My current environment is fairly irrelevant to the question at hand, so I have omitted it.

Problem description

I am currently trying to generate "novel" (out-of-distribution) data to evaluate a novel-detecting classification framework. Currently, I am comparing many different generation methods from the SDV library, but of course these generators aim to generate data that is very similar to the original. I'm just aiming to open a discussion as to whether it would be possible to generate this kind of data using SDV.

What I already tried

I have not tried this yet, but my working assumption is that specifying probability distributions that differ from those of the original set could produce data that falls outside the original distribution. Is that assumption correct, and is this how the library is meant to work?
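A minimal sketch of the "different distribution" idea, written with the Python standard library only and making no SDV calls. The helper `sample_shifted` and the toy column are hypothetical and exist purely for illustration. To my understanding, the `numerical_distributions` option on SDV's `GaussianCopulaSynthesizer` selects the distribution family for a column, while the parameters are still fitted to the real data, so it may not move samples away from the original range on its own. The sketch instead shifts the fitted mean explicitly:

```python
import random
import statistics


def sample_shifted(values, n, shift_sds=3.0, scale=1.0, seed=0):
    """Hypothetical helper: fit a normal to `values`, then sample from a
    normal whose mean is moved `shift_sds` standard deviations upward."""
    rng = random.Random(seed)
    mu = statistics.fmean(values)
    sd = statistics.stdev(values)
    return [rng.gauss(mu + shift_sds * sd, scale * sd) for _ in range(n)]


# Toy "real" column (made-up numbers, for illustration only).
_rng = random.Random(1)
real = [_rng.gauss(50.0, 5.0) for _ in range(1000)]

# "Novel" values drawn from the deliberately shifted distribution.
novel = sample_shifted(real, n=500)
```

This treats one numeric column in isolation and ignores correlations between columns, so it is only a starting point for building a shifted evaluation set.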

Secondly, I am looking at conditional sampling to guarantee that a certain number of particular samples are included, in the hope of changing the statistical properties of the new set so that it differs from the original. I expect this to be useful in any case, as it is partly what I need to do anyway.
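The effect of forcing a chosen mix of values can be sketched as below. SDV exposes conditional sampling through `Condition` and `sample_from_conditions`; this sketch does not call them and instead resamples existing rows with the standard library, so the helper `resample_with_proportions`, the column name `label`, and the toy rows are all assumptions made for illustration.

```python
import random
from collections import Counter


def resample_with_proportions(rows, column, target, n, seed=0):
    """Hypothetical helper: draw about n rows (with replacement) so that
    `column` follows the `target` shares instead of the original ones."""
    rng = random.Random(seed)
    by_value = {}
    for row in rows:
        by_value.setdefault(row[column], []).append(row)
    out = []
    for value, share in target.items():
        # Sample each group separately to hit its requested share.
        out.extend(rng.choices(by_value[value], k=round(n * share)))
    rng.shuffle(out)
    return out


# Toy data: 95% "common" rows and 5% "rare" rows.
rows = [{"label": "common", "x": i} for i in range(95)]
rows += [{"label": "rare", "x": i} for i in range(5)]

# Invert the class balance in the new set.
shifted = resample_with_proportions(
    rows, "label", {"common": 0.2, "rare": 0.8}, n=200
)
counts = Counter(r["label"] for r in shifted)
```

Note that this changes the proportions of the set as a whole, while each individual row still looks like the original data given its label. Whether that counts as "novel" depends on what the detector is expected to flag.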

In case I am missing something (perhaps a generator that could be tuned to do this), I would appreciate any further input anyone is able to provide. One suggestion I have received is to use a VAE and sample from rarely visited areas of its latent space, but as far as I am aware, this is not possible with SDV without modifying the source code.
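The latent-space suggestion can be illustrated with a small rejection sampler. This is a sketch under stated assumptions: it presumes a separately trained VAE with a standard normal prior, it is not an SDV feature, and the helper `sample_rare_latents` and the threshold `min_norm` are hypothetical. The decoder that would turn each latent vector into a data row is not shown.

```python
import math
import random


def sample_rare_latents(n, dim, min_norm, seed=0, max_tries=1_000_000):
    """Hypothetical helper: draw z ~ N(0, I) and keep only vectors whose
    Euclidean norm is at least `min_norm`, i.e. the low-density tail."""
    rng = random.Random(seed)
    kept = []
    for _ in range(max_tries):
        z = [rng.gauss(0.0, 1.0) for _ in range(dim)]
        if math.sqrt(sum(v * v for v in z)) >= min_norm:
            kept.append(z)
            if len(kept) == n:
                return kept
    raise RuntimeError("min_norm is too large for the given max_tries")


# 100 two-dimensional latent vectors lying at least 3 units from the origin.
latents = sample_rare_latents(n=100, dim=2, min_norm=3.0)
```

Each vector in `latents` would then be passed through the trained decoder. Rejection sampling becomes slow as `min_norm` grows, so a stricter threshold may call for a different sampling scheme.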

Any input is greatly appreciated, thanks for reading!

@weirdfishs added the labels new (Automatic label applied to new issues) and question (General question about the software) on Feb 27, 2025
@srinify
Contributor

srinify commented Mar 1, 2025

Hi @weirdfishs, this sounds like an interesting project. You're definitely correct that SDV synthesizers are designed to learn and mimic the patterns in the real data, not to generate data that is dissimilar.

Your best bet might be to use the SDV for the subset of samples that should be statistically similar to the real data, and to use a different approach for the "novel" samples you want, as you suggested.

Unfortunately, I can't offer much guidance on generating out-of-distribution data or outliers. I know this is probably an unsatisfying answer!

@srinify added the label under discussion (Issue is currently being discussed) and removed the label new (Automatic label applied to new issues) on Mar 1, 2025