[Question] Threading error after last training epoch #201

Open
NMVRodrigues opened this issue Oct 4, 2023 · 9 comments
Labels
fix with next release (Will be fixed after the next release)

Comments

@NMVRodrigues

Hi,
I'm running into an issue where, after the last training epoch ends and the validation set should be evaluated, a threading error caused by batchgenerators occurs.
I'm using the provided Docker image, with the predefined env_det_num_threads=6 and OMP_NUM_THREADS=1. (I have additionally tried building a container with env_det_num_threads=1 to see if the problem was related to this flag, but the problem persisted.)
To check whether the problem came from our dataset, I also tried this on the example dataset 000, and the exact same problem happened.
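(Side note for anyone reproducing this: the threading settings can be verified quickly inside the running container with a snippet like the one below. This is only an illustrative check, not part of nnDetection; depending on how the image maps the build argument, the runtime variable may be named det_num_threads rather than env_det_num_threads, so both spellings are printed.)

```python
# Print the threading-related environment variables as seen inside the container.
# The nnDetection variable may be exposed as det_num_threads (set via the
# env_det_num_threads build arg); both spellings are checked here.
import os

for name in ("env_det_num_threads", "det_num_threads", "OMP_NUM_THREADS"):
    print(f"{name} = {os.environ.get(name)}")
```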

Below is a set of traces of the problems that arise. It's always a thread error/exception. From the look of it, the batchgenerators class seems to have trouble closing some of its threads?

Any help would be greatly appreciated :)
Best regards,
Nuno

Exception in thread Thread-5:
Traceback (most recent call last):
  File "/opt/conda/lib/python3.8/threading.py", line 932, in _bootstrap_inner
    self.run()
  File "/opt/conda/lib/python3.8/threading.py", line 870, in run
    self._target(*self._args, **self._kwargs)
  File "/opt/conda/lib/python3.8/site-packages/batchgenerators/dataloading/multi_threaded_augmenter.py", line 92, in results_loop
    raise RuntimeError("One or more background workers are no longer alive. Exiting. Please check the print"
RuntimeError: One or more background workers are no longer alive. Exiting. Please check the print statements above for the actual error message
Exception ignored in: <function MultiThreadedAugmenter.__del__ at 0x7f3efd1a5ca0>
Traceback (most recent call last):
  File "/opt/conda/lib/python3.8/site-packages/batchgenerators/dataloading/multi_threaded_augmenter.py", line 294, in __del__
    self._finish()
  File "/opt/conda/lib/python3.8/site-packages/batchgenerators/dataloading/multi_threaded_augmenter.py", line 276, in _finish
    self._queues[i].close()
  File "/opt/conda/lib/python3.8/multiprocessing/queues.py", line 137, in close
    self._reader.close()
  File "/opt/conda/lib/python3.8/multiprocessing/connection.py", line 177, in close
    self._close()
  File "/opt/conda/lib/python3.8/multiprocessing/connection.py", line 361, in _close
    _close(self._handle)
OSError: [Errno 9] Bad file descriptor
Exception in thread Thread-3:
Traceback (most recent call last):
  File "/opt/conda/lib/python3.8/threading.py", line 932, in _bootstrap_inner
    self.run()
  File "/opt/conda/lib/python3.8/threading.py", line 870, in run
    self._target(*self._args, **self._kwargs)
  File "/opt/conda/lib/python3.8/site-packages/batchgenerators/dataloading/multi_threaded_augmenter.py", line 92, in results_loop
    raise RuntimeError("One or more background workers are no longer alive. Exiting. Please check the print"
RuntimeError: One or more background workers are no longer alive. Exiting. Please check the print statements above for the actual error message
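(If the root cause is that the MultiThreadedAugmenter only gets cleaned up in __del__, after its worker processes have already gone away, explicitly shutting it down before the process exits might sidestep the noisy teardown. Below is a minimal, self-contained sketch of that pattern; it is not nnDetection's actual training code, DummyLoader is just a placeholder, and class/argument details may differ between batchgenerators versions.)

```python
# Sketch: close a MultiThreadedAugmenter explicitly instead of relying on
# __del__, so its queues are closed while the workers are still alive.
import numpy as np
from batchgenerators.dataloading.data_loader import SlimDataLoaderBase
from batchgenerators.dataloading.multi_threaded_augmenter import MultiThreadedAugmenter
from batchgenerators.transforms.abstract_transforms import Compose


class DummyLoader(SlimDataLoaderBase):
    """Placeholder loader that just produces random batches."""

    def generate_train_batch(self):
        return {"data": np.random.rand(2, 1, 32, 32).astype(np.float32)}


if __name__ == "__main__":
    loader = DummyLoader(None, 2)  # (data, batch_size)
    augmenter = MultiThreadedAugmenter(loader, Compose([]), num_processes=2)
    try:
        for _ in range(4):
            batch = next(augmenter)
    finally:
        # Deterministic teardown instead of waiting for garbage collection.
        augmenter._finish()
```

Whether nnDetection could hook such an explicit shutdown into the end-of-training validation is of course a separate question.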
@chomm

chomm commented Oct 12, 2023

Hey,
I am having the exact same problem, however not in the Docker environment but in a conda environment.

For me, a quick workaround was to ignore the failing validation at the end of training and proceed with the nndet_sweep command; there the validation went through. Maybe this works for you too until the underlying problem is fixed.

BR,
Chris

@NMVRodrigues
Author

Hi Chris (@chomm),

In your case, are the checkpoints and output files still generated after the threading error?
I believe that in our case they were not being stored, so I'm unsure whether running the sweep would work afterwards.
I will try it out anyway, thank you so much!

Best,
Nuno

@mibaumgartner
Collaborator

Hi @NMVRodrigues,

sorry for the delayed response, I was out of office for the last two weeks.

The checkpoints are stored continuously during training, so they should definitely be available.

I'm not quite sure whether the problem is fixable in nnDetection or whether it is rather a batchgenerators problem. We are planning to move to the torch DataLoader in the next release, which should alleviate the problem. Please double-check that you are not running out of RAM during/after training, which might also cause this issue.
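(For anyone who wants to rule RAM out, a minimal monitoring sketch is below; psutil is an assumption here, and `free -h` or any system monitor works just as well.)

```python
# Log system RAM usage every few seconds while training runs in another process.
# Requires psutil (pip install psutil); stop with Ctrl+C.
import time

import psutil

while True:
    mem = psutil.virtual_memory()
    print(f"RAM used: {mem.used / 1e9:.1f} / {mem.total / 1e9:.1f} GB ({mem.percent:.0f}%)")
    time.sleep(5)
```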

@NMVRodrigues
Author

Hi @mibaumgartner ,

Regarding the checkpoints: they are indeed being stored; I was looking into an old folder, my bad.

Regarding RAM, it does not seem to be the issue. We tried both a ~160-sample dataset and a smaller dummy version of only 10 samples while monitoring the RAM, and on both occasions it was nowhere near full.

Best,
Nuno

@dzeego

dzeego commented Dec 1, 2023

Hi @mibaumgartner, is there a timeframe for that release (with the torch DataLoader)? I am still having the same issue that everyone else has reported here.

Many thanks in advance,
Dzeego

@mibaumgartner
Collaborator

Hey, no timeframe yet; I'm working on the next release with all of my energy right now :) Since it is a very large release, it has taken (and is still taking) much more time than originally anticipated, sorry.

@dzeego

dzeego commented Dec 4, 2023

Hi @mibaumgartner

No problem! All the best :)
In the meantime, how can we manage this error? Knowing that only the checkpoints are saved (nothing else) how can we proceed to the inference on an external test set (nndet_sweep did not work)?

@mibaumgartner
Collaborator

With the checkpoints it should be possible to run nndet_sweep. Could you potentially provide the exact error? When I encountered that problem, everything was finished (training + sweep).


github-actions bot commented Jan 4, 2024

This issue is stale because it has been open for 30 days with no activity.

github-actions bot added the stale label (Issue without activity, will be closed soon) on Jan 4, 2024
mibaumgartner added the fix with next release label (Will be fixed after the next release) and removed the stale label on Jan 10, 2024