I would like to raise this for broader consideration -- I am not sure what the answer here is, though I find myself hoping for this functionality more and more as I work with `Dataset`s in NVTabular.
You might want to preprocess your data into a `Dataset`, maybe apply some NVTabular ops, and only split it further down the road. You might want to do this to streamline your pipeline and give yourself the ability to experiment faster. Or you might want to see how the model responds to training on different splits of the data.
With certain preprocessing techniques this is not what you want, since fitting them before splitting can leak information from the training data into your validation split. But these scenarios are likely an overwhelmingly small minority.
There are ways to work around this (splitting the data before creating a `Dataset`, or calling `to_ddf`, modifying the ddf, and creating a new `Dataset` from the output), but this is quite cumbersome.
Random splitting could work by simply passing the fraction of data to retain in one of the splits, say 0.7 or 0.2.
And splitting on a value could work by passing in the column to use and the values to retain in one split versus the other.
Something like this
(the `take` here -- all names are very tentative -- might be a nice related piece of functionality, and could be very useful for experimenting).
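The proposed interface could be sketched as below. All names (`random_split`, `split_by_value`) are hypothetical, mirroring the tentative naming above, and plain pandas stands in for an NVTabular `Dataset` to keep the sketch runnable.

```python
import pandas as pd

def random_split(df: pd.DataFrame, fraction: float, seed: int = 42):
    """Randomly assign `fraction` of the rows to the first split."""
    first = df.sample(frac=fraction, random_state=seed)
    second = df.drop(first.index)
    return first, second

def split_by_value(df: pd.DataFrame, column: str, values):
    """Rows whose `column` value is in `values` go to the first split."""
    mask = df[column].isin(values)
    return df[mask], df[~mask]

df = pd.DataFrame({"day": [1, 1, 2, 3, 3, 4], "x": range(6)})

# Random split: keep ~70% of rows for training.
train, valid = random_split(df, 0.7)

# Value-based split: e.g. a time-based holdout on the `day` column.
before, after = split_by_value(df, "day", [1, 2])
```

On a real `Dataset` the same semantics would presumably be implemented lazily on the underlying dask partitions rather than row-by-row in pandas, but the call signatures are the point here.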