[FEA] / [QST] Dataset to support random splitting & splitting based on value of some column #103

radekosmulski · 2022-06-23T01:32:24Z

I would like to raise this for broader consideration -- I am not sure what the answer here is, though I find myself hoping for this functionality more and more as I work with Datasets in NVTabular.

You might want to preprocess your data into a Dataset, maybe apply some NVTabular ops, and only split it later. This would streamline your pipeline and let you experiment faster. Or you might want to see how the model responds to training on different splits of the data.

With certain preprocessing techniques you would not want to split after preprocessing, because that can cause leakage between your train and validation splits. But these scenarios are likely a small minority.

There are ways to work around this (splitting the data before creating a Dataset, or calling to_ddf, modifying the ddf, and creating a new Dataset from the output), but this is quite cumbersome.

Random splitting could work by just passing the fraction of the data to retain in one of the splits, say 0.7 or 0.2.
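To make the idea concrete, here is a minimal sketch of a fraction-based random split. It uses a plain pandas DataFrame as a stand-in for the data behind a Dataset. The helper name `random_split` and its parameters are hypothetical and are not part of NVTabular.

```python
import pandas as pd


def random_split(df, frac, seed=0):
    """Hypothetical helper: return (kept, rest), where `kept` holds `frac` of the rows.

    Assumes the DataFrame index is unique, so that dropping by index
    removes exactly the sampled rows.
    """
    kept = df.sample(frac=frac, random_state=seed)  # random subset of rows
    rest = df.drop(kept.index)                      # everything not sampled
    return kept, rest


# Illustrative data only.
df = pd.DataFrame({"user_id": range(10), "clicks": [0, 1] * 5})
train, valid = random_split(df, frac=0.7)
```

Fixing the seed keeps the split reproducible between runs, which helps when comparing how a model responds to different splits.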

And splitting based on value might work by passing in the column to use and the values to retain in one of the splits vs the other.
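A value-based split could look roughly like the sketch below. Again a pandas DataFrame stands in for the data behind a Dataset. The helper name `split_by_value`, its parameters, and the column names are assumptions made for illustration, not an existing NVTabular API.

```python
import pandas as pd


def split_by_value(df, column, values):
    """Hypothetical helper: return (matching, rest).

    `matching` holds the rows whose `column` value is in `values`;
    `rest` holds all other rows.
    """
    mask = df[column].isin(values)
    return df[mask], df[~mask]


# Illustrative data only: hold out the last day for validation.
df = pd.DataFrame({"day": [1, 1, 2, 3, 3], "item_id": [10, 11, 12, 13, 14]})
valid, train = split_by_value(df, column="day", values=[3])
```

Splitting on a column such as a day or session identifier keeps related rows together in the same split, which a purely random split does not guarantee.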

Something like this

(the take here -- all names are very tentative -- might be a nice related piece of functionality, and could be very useful for experimenting).
