[Question] Threading error after last training epoch #201
Comments
Hey, for me a quick fix was to ignore the failing validation at the end of training and proceed with the nndet_sweep command. There, the validation went through. Maybe this works for you too until the main problem is fixed. BR,
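For reference, a rough sketch of that workaround driven from Python rather than the shell; the task, model, and fold arguments are placeholders (not taken from this thread), so check the nnDetection documentation for the exact interface of your installation:

```python
# Hypothetical illustration of the workaround above: if training finished but
# the final validation crashed, trigger the sweep manually once the
# checkpoints exist. All three arguments below are placeholders.
import subprocess

subprocess.run(
    ["nndet_sweep", "<task>", "<model>", "<fold>"],  # e.g. taken from your training call
    check=True,  # raise if the sweep itself fails
)
```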
Hi Chris (@chomm), In your case, are the checkpoints and output files generated after the threading error? Best,
Hi @NMVRodrigues, sorry for the delayed response; I was out of office for the last two weeks. The checkpoints are stored continuously during training, so they should definitely be available. I'm not quite sure whether the problem is fixable in nnDetection or whether it is rather a batchgenerators problem. We are planning to move to the Torch DataLoader in the next release, which should alleviate the problem. Please double check that you are not running out of RAM during/after training, which might cause the issue.
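To help rule out memory pressure, something along these lines can be left running in a second terminal during training; a minimal sketch assuming `psutil` is installed, with an arbitrary polling interval and warning threshold:

```python
# Minimal RAM watcher: print overall memory usage every 30 s and warn when it
# gets close to full. Interval and threshold are illustrative values only.
import time
import psutil

def watch_ram(interval_s: float = 30.0, warn_percent: float = 90.0) -> None:
    while True:
        mem = psutil.virtual_memory()
        used_gib = (mem.total - mem.available) / 1024 ** 3
        print(f"RAM: {used_gib:.1f} GiB used ({mem.percent:.0f}%)")
        if mem.percent >= warn_percent:
            print("WARNING: RAM nearly exhausted - dataloader workers may hang or be killed")
        time.sleep(interval_s)

if __name__ == "__main__":
    watch_ram()
```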
Hi @mibaumgartner, Regarding the checkpoints: they are indeed being stored; I was looking into an old folder, my bad. Regarding RAM, it does not seem to be the issue. We tried with both a roughly 160-sample dataset and a smaller dummy version of only 10 samples while monitoring the RAM, and on both occasions it was nowhere near full. Best,
Hi @mibaumgartner, is there a timeframe for that release (Torch DataLoader)? I am still having the same issue that everyone has reported here. Many thanks in advance,
Hey, no timeframe yet; I'm working on the next release with all of my energy right now :) Since it is a very large release, it is taking much more time than originally anticipated, sorry.
No problem! All the best :)
With the checkpoints it should be possible to run nndet_sweep. Could you potentially provide the exact error? When I encountered that problem, everything was finished (training + sweep).
This issue is stale because it has been open for 30 days with no activity. |
Hi,

I'm having an issue where, after the last training epoch ends and the validation set should be evaluated, a threading error caused by `batchgenerators` occurs.

I'm using the provided Docker image with the predefined `env_det_num_threads=6` and `OMP_NUM_THREADS=1`. (I have additionally tried building a container with `env_det_num_threads=1` to see if the problem was related to this flag, but the problem persisted.)

To check whether the problem came from our dataset, I also tried this on the example dataset `000`, and the exact same problem happened.

Following is a set of traces of the problems that arise. It is always a thread error/exception. Looking at it, it feels like the `batchgenerators` class is having problems closing some threads?

Any help would be greatly appreciated :)

Best regards,
Nuno
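As a quick sanity check, runnable inside the container, the two variables discussed above can be printed to confirm they are set to the intended values; a minimal sketch using only the names quoted in this report:

```python
# Print the thread-related environment variables mentioned above as the
# running process sees them. Names are taken verbatim from this report;
# whether both are read at runtime or only at Docker build time is not
# verified here.
import os

for name in ("OMP_NUM_THREADS", "env_det_num_threads"):
    print(f"{name} = {os.environ.get(name, '<not set>')}")
```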