❓ Questions & Help
I am using the NVIDIA Merlin TensorFlow Docker container (23.08).
I've created my training and validation datasets and saved them to Parquet following the standard NVTabular procedure with nvt.Workflow.
I am now facing some issues training a two-tower model based largely on the examples provided in the notebooks, but with many more list features (such as genres in the MovieLens dataset).
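For context, here is the rough shape of my setup, following the examples notebooks (paths, tower sizes, and batch size are placeholders, not my exact values):

```python
import merlin.models.tf as mm
from merlin.io import Dataset

# Placeholder paths: the processed Parquet output of my NVTabular workflow.
train = Dataset("train_processed/*.parquet")
valid = Dataset("valid_processed/*.parquet")

# Two-tower retrieval model, as in the Merlin examples notebooks.
model = mm.TwoTowerModel(train.schema, query_tower=mm.MLPBlock([128, 64]))
model.compile(optimizer="adam", metrics=[mm.RecallAt(10)])

# Training itself runs and the loss decreases; the error appears
# when the validation data is evaluated.
model.fit(train, validation_data=valid, batch_size=1024, epochs=1)
```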
The training starts and the loss decreases, but at the validation step I get an Unknown error that seems to originate from a missing index in the underlying cuDF DataFrame, which in turn is triggered by a StopIteration raised while the validation data is evaluated.
I then tried to run some test iterations on the validation dataset and found, to my surprise, that even mm.Loader cannot correctly iterate over it.
In other words, I've verified that I cannot consume all the batches from the dataset unless I set batch_size to 1 (the only value that divides every row count evenly).
Indeed, a simple loop like the sketch below raises StopIteration.
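A minimal sketch of what I'm running (path and batch size are placeholders for my actual setup):

```python
import merlin.models.tf as mm
from merlin.io import Dataset

valid = Dataset("valid_processed/*.parquet")  # placeholder path
loader = mm.Loader(valid, batch_size=1024, shuffle=False)

# Expected: len(loader) batches. Observed: iteration stops early with a
# StopIteration bubbling up from cuDF, unless batch_size == 1.
n_batches = sum(1 for _ in loader)
print(n_batches, len(loader))
```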
I hope this is just a mistake on my side. I didn't call the shuffle_by_keys method on the loaded dataset, nor during its creation. Could this be related?
@CarloNicolini please provide a minimal reproducible example so that we can run and reproduce the issue you are facing.
What are the dtypes of your list columns? Are you properly categorifying the list features using NVTabular, and are you transforming your validation data accordingly?
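Roughly, the expected pattern looks like this (column names and paths here are illustrative):

```python
import nvtabular as nvt
from merlin.io import Dataset

# Categorify handles list (multi-hot) columns such as "genres" natively.
features = ["user_id", "item_id", "genres"] >> nvt.ops.Categorify()
workflow = nvt.Workflow(features)

train = Dataset("train/*.parquet")
valid = Dataset("valid/*.parquet")

# Fit on the training data only, then apply the SAME fitted workflow
# to the validation data so category mappings stay consistent.
workflow.fit(train)
workflow.transform(train).to_parquet("train_processed")
workflow.transform(valid).to_parquet("valid_processed")
```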
Why do you think you need shuffle_by_keys? We provide shuffle_by_keys for cases where one is doing a Groupby on a given column (say a unique session id) whose values are scattered over different Parquet files, but we don't recommend it for large datasets. Are you doing something like that? You are fine-tuning a two-tower model, right? Not a session-based model, I believe.
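For reference, shuffle_by_keys is meant for a pipeline like this sketch (illustrative column names and path), not for a plain two-tower setup:

```python
import nvtabular as nvt
from merlin.io import Dataset

sessions = Dataset("interactions/*.parquet")  # illustrative path

# Co-locate all rows of each session_id in the same partition so the
# Groupby below sees complete sessions; this can be costly on large data.
sessions = sessions.shuffle_by_keys(keys=["session_id"])

groupby = ["session_id", "item_id", "timestamp"] >> nvt.ops.Groupby(
    groupby_cols=["session_id"],
    sort_cols=["timestamp"],
    aggs={"item_id": ["list"], "timestamp": ["first"]},
)
workflow = nvt.Workflow(groupby)
session_data = workflow.fit_transform(sessions)
```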