[FEA] / [QST] Dataset to support random splitting & splitting based on value of some column #103

Open
radekosmulski opened this issue Jun 23, 2022 · 0 comments

I would like to raise this for broader consideration -- I am not sure what the answer here is, though I find myself hoping for this functionality more and more as I work with Datasets in NVTabular.

You might want to preprocess your data into a Dataset, maybe apply some NVTabular ops, and only split it further down the road. Doing so can streamline your pipeline and let you experiment faster. Or you might want to see how the model responds to training on different splits of the data.

With certain preprocessing techniques you might not want to do this, in order to avoid leakage from the train split into the validation split. But these scenarios are likely a small minority.

There are ways to work around this (splitting the data before creating a Dataset, or calling to_ddf, modifying the ddf, and creating a new Dataset from the output), but they are quite cumbersome.

Random splitting could work by passing the fraction of the data to retain in one of the splits, say 0.7 or 0.2.
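
A minimal sketch of the random-split idea, using a pandas DataFrame as a stand-in for the Dataset. The helper name `random_split`, its parameters, and the example columns are hypothetical and not part of NVTabular. The sketch also assumes the index is unique, since the remainder is found by dropping the sampled index labels.

```python
import pandas as pd


def random_split(df, frac, seed=0):
    """Hypothetical helper: return (kept, rest), where `kept` holds roughly `frac` of the rows."""
    if not 0.0 <= frac <= 1.0:
        raise ValueError("frac must be between 0 and 1")
    # Sample the requested fraction of rows without replacement.
    kept = df.sample(frac=frac, random_state=seed)
    # Everything that was not sampled goes to the other split (requires a unique index).
    rest = df.drop(kept.index)
    return kept, rest


# Made-up example data, for illustration only.
df = pd.DataFrame({"user_id": range(10), "clicks": range(10, 20)})
train, valid = random_split(df, 0.7)
```

Dask DataFrames also expose a `random_split` method, so the ddf returned by to_ddf could likely be split in a similar way before being wrapped in a new Dataset.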

Splitting based on value could work by passing in the column to use and the values to retain in one split; all remaining rows would go to the other.
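
A minimal sketch of the value-based split, again with a pandas DataFrame standing in for the Dataset. The helper name `split_by_value`, its parameters, and the example columns are hypothetical and not part of NVTabular.

```python
import pandas as pd


def split_by_value(df, column, values):
    """Hypothetical helper: return (matching, rest) based on membership of `column` in `values`."""
    # Boolean mask marking the rows whose value is in the requested set.
    mask = df[column].isin(values)
    return df[mask], df[~mask]


# Made-up example data: hold out the last day as a validation split.
df = pd.DataFrame({"day": [1, 2, 3, 4, 5, 5], "item_id": [10, 11, 12, 13, 14, 15]})
valid, train = split_by_value(df, "day", [5])
```

The same boolean-mask filtering should carry over to a dask DataFrame, which is what makes the current to_ddf workaround possible in the first place.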

Something like this
image

(The take here, with all names being very tentative, could be a nice related piece of functionality and very useful for experimenting.)
