Significant Change in Epoch Time with Dataset Size #208
Dear @dzeego,
That sounds rather surprising. Thank you for reporting the issue, and sorry for getting back to you rather late due to my vacation. Is it possible to reproduce the issue with the toy dataset so I can have a look locally as well? Theoretically, training time should remain independent of the dataset size, since the same number of batches/samples is drawn in each epoch. 12 minutes per epoch also sounds extremely fast; epoch times usually range somewhere between 20-40 minutes (sometimes slightly longer) depending on the configured strides of the network and the available GPU (assuming no other bottlenecks are present).
Best,
Edit: the only case I could think of is the presence of an IO bottleneck; by reducing the number of samples, the OS can cache the inputs, which alleviates the IO bottleneck. Even then, 12 minutes for an epoch sounds quite quick and would highly depend on the input to the network (e.g. 3D data with rather small resolution).
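For illustration, a minimal sketch of that sampling behaviour (this is not nnDetection's actual training loop; the batch size and batches-per-epoch values are placeholders):

```python
import random

def run_epoch(case_ids, batch_size=4, batches_per_epoch=250):
    """Sketch: a fixed number of random batches is drawn per epoch."""
    for _ in range(batches_per_epoch):
        # Sampling is with replacement and independent of len(case_ids),
        # so the per-epoch work stays constant as the dataset grows. In real
        # training, patch loading and the forward/backward pass follow; if
        # loading misses the OS page cache, IO becomes the bottleneck.
        batch = random.choices(case_ids, k=batch_size)
```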
Hi @mibaumgartner,
Indeed, the bottleneck was the data IO; as you said, reducing the number of samples lets the OS cache the inputs.
Best regards,
Dear @dzeego,
Thank you for the suggestion, I'll definitely look into it!
Best,
Hello,
I recently started using nnDetection and have noticed that my training epoch time increases significantly as the size of my training dataset increases.
To be more specific, I ran the nnDetection preprocessing on a large dataset of ~2k CT volumes, then trained a model using the generated splits_final.pkl file. One epoch with this configuration took 3 hours.
However, with the exact same preprocessing and training configuration, modifying only the splits_final.pkl file to include a random subset (~200 CT volumes) of the original training dataset (~2k CT volumes) reduced the epoch time to 12 minutes!
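A minimal sketch of how such a subset split can be produced (this assumes splits_final.pkl holds a list of per-fold dicts with "train" and "val" case lists, as in nnU-Net-style splits; the file paths and the subset size of 200 are placeholders):

```python
import pickle
import random

# Load the existing splits generated by preprocessing.
with open("splits_final.pkl", "rb") as f:
    splits = pickle.load(f)

# Keep only a random subset of the training cases in each fold.
for fold in splits:
    fold["train"] = random.sample(list(fold["train"]), 200)

# Write the reduced splits back (or to a separate file to keep the original).
with open("splits_final.pkl", "wb") as f:
    pickle.dump(splits, f)
```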
Is there an explanation for this behavior?
Many thanks in advance.