Generating out-of-distribution data #2399

weirdfishs · 2025-02-27T14:56:50Z

My current environment is fairly irrelevant to the question at hand, so I have omitted it.

Problem description

I am currently trying to generate "novel" (out-of-distribution) data to evaluate a novelty-detecting classification framework. I am comparing several generation methods from the SDV library, but these generators are of course designed to produce data that closely resembles the original. I would like to open a discussion on whether this kind of data can be generated with SDV.

What I already tried

I have not tried this yet, but my working assumption is that specifying probability distributions that differ from those of the original set could produce results outside the original distribution. Is that assumption correct, and is this how the library is meant to work?
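To make the idea concrete, here is a minimal sketch that does not use SDV at all and only illustrates the principle on a single numeric column: fit a normal distribution to the real values, sample from a deliberately wider one, and keep only the draws that land in the tails of the fitted distribution. The function name, the `z_min` and `spread` parameters, and the toy data are all assumptions made for illustration.

```python
import random
import statistics


def sample_out_of_distribution(values, n, z_min=3.0, spread=3.0, seed=0):
    """Draw n values at least z_min standard deviations from the mean of `values`.

    Candidates come from a normal that is `spread` times wider than the one
    fitted to `values`; only candidates in the fitted distribution's tails are kept.
    """
    rng = random.Random(seed)
    mu = statistics.fmean(values)
    sigma = statistics.stdev(values)
    kept = []
    while len(kept) < n:
        x = rng.gauss(mu, sigma * spread)
        if abs(x - mu) / sigma >= z_min:  # reject anything that looks "in-distribution"
            kept.append(x)
    return kept


# Toy stand-in for a real column (hypothetical data).
data_rng = random.Random(1)
real = [data_rng.gauss(50, 5) for _ in range(1000)]

novel = sample_out_of_distribution(real, n=20)
```

This treats one column in isolation, so it ignores correlations between columns; it is a starting point for thinking about the problem rather than a replacement for a synthesizer.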

Secondly, I am looking at conditional sampling to ensure that a certain number of particular samples are included, in the hope of changing the statistical properties of the new set so that it differs from the original. I expect this to be useful in any case, as it is partly what I need to do anyway.
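The effect of fixing how many rows of each kind are included can be sketched in plain Python, without SDV: resample rows so that a chosen column follows target proportions instead of the original ones. The helper name, the `label` column, and the toy rows are hypothetical. In SDV, I believe the analogous step is passing `Condition` objects from `sdv.sampling` to a synthesizer's `sample_from_conditions` method, but that should be checked against the installed version.

```python
import random
from collections import Counter


def resample_with_proportions(rows, column, target, n, seed=0):
    """Sample n rows with replacement so that `column` follows `target` proportions."""
    rng = random.Random(seed)
    # Group the rows by the value they hold in the conditioning column.
    by_value = {}
    for row in rows:
        by_value.setdefault(row[column], []).append(row)
    sampled = []
    for value, share in target.items():
        count = round(n * share)
        sampled.extend(rng.choices(by_value[value], k=count))
    rng.shuffle(sampled)
    return sampled


# Toy table: 95% "common" rows and 5% "rare" rows (hypothetical data).
rows = [{"label": "common", "x": i} for i in range(95)]
rows += [{"label": "rare", "x": i} for i in range(5)]

shifted = resample_with_proportions(rows, "label", {"common": 0.5, "rare": 0.5}, n=200)
counts = Counter(row["label"] for row in shifted)
```

Note that this only shifts the class balance of the set as a whole. Each individual row is still drawn from the original data, so no single row is novel on its own.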

In case I am missing something (perhaps a generator that could be tuned to do this), I would appreciate any further input. One suggestion I have received is to use a VAE and sample from rarely visited areas of the latent space, but as far as I am aware this is not possible with SDV without modifying the source code.
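For reference, the latent-space idea can be sketched independently of any particular VAE implementation. Assuming a standard normal prior over a `dim`-dimensional latent space, the norms of typical latent vectors concentrate near `sqrt(dim)`, so vectors placed well beyond that shell are rarely visited during training. The function name and the `extra_radius` parameter are assumptions for illustration, and the trained decoder that would turn these vectors into rows is not shown.

```python
import math
import random


def sample_rare_latents(n, dim, extra_radius=3.0, seed=0):
    """Return n latent vectors whose norm is at least sqrt(dim) + extra_radius."""
    rng = random.Random(seed)
    r_min = math.sqrt(dim) + extra_radius
    latents = []
    for _ in range(n):
        # A normalized Gaussian vector gives a uniformly random direction.
        direction = [rng.gauss(0.0, 1.0) for _ in range(dim)]
        norm = math.sqrt(sum(v * v for v in direction))
        radius = r_min + rng.random()  # radius in [r_min, r_min + 1)
        latents.append([v / norm * radius for v in direction])
    return latents


latents = sample_rare_latents(n=5, dim=8)
```

Each vector in `latents` would then be passed through the decoder of a trained VAE. Whether the decoded rows are meaningful depends on how the decoder behaves far from its training region.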

Any input is greatly appreciated, thanks for reading!

srinify · 2025-03-01T21:21:34Z

Hi @weirdfishs, this sounds like an interesting project. You're definitely correct that SDV synthesizers are designed to learn and mimic the patterns in the real data, not to generate data that's dissimilar.

Your best bet might be to use the SDV for the subset of samples you need that are in fact statistically similar to the real data and use a different approach for the "novel" samples you want, as you suggested.

Unfortunately, I can't provide much guidance here on generating out-of-distribution data or outliers. I know that this is probably an unsatisfying answer though!

weirdfishs added the "new" (Automatic label applied to new issues) and "question" (General question about the software) labels on Feb 27, 2025

srinify added the "under discussion" (Issue is currently being discussed) label and removed the "new" (Automatic label applied to new issues) label on Mar 1, 2025
