
RuntimeError: "Some background workers are no longer alive" during nnUNetv2_train Validation #2607

Open
mtw2156 opened this issue Nov 18, 2024 · 3 comments


mtw2156 commented Nov 18, 2024

Hi,

I wanted to share an issue regarding validation cases. I am using nnU-Net on the HaNSeg dataset. I trained my model on all 5 folds with a custom configuration without issues, and the training logs do not indicate any problems during training or when predicting the validation cases. However, the validation predictions were not saved (or only some of them were), so I am re-running the validation cases with "nnUNetv2_train --val". It works for some of the cases but usually crashes before reaching the end; it is also very expensive computationally and often falls back to the CPU. I then created a new configuration, ran the dataset through preprocessing for that configuration, transferred the model files over, and ran validation again. I still get the crash, but the predictions now stay on the GPU. The same applies to other folds as well.
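
For reference, here is a minimal sketch I would use to check which fold-4 validation predictions are still missing, by comparing splits_final.json with the fold's validation output folder (the results-folder path and the .nii.gz extension are assumptions on my side and need to be adjusted to the actual trainer/plans names):

import json
import os

splits_file = '/data/intern/Matt_Wilson_2024/nnUnet_preprocessed/Dataset999_HaNSeg/splits_final.json'
# assumed output layout; replace the trainer/plans names with the ones actually used
validation_dir = '<nnUNet_results>/Dataset999_HaNSeg/nnUNetTrainer__nnUNetPlans__3d_fullres_v3/fold_4/validation'

with open(splits_file) as f:
    val_cases = json.load(f)[4]['val']  # validation case identifiers of fold 4

# predictions that were already exported (assuming .nii.gz output files)
exported = {fn[:-len('.nii.gz')] for fn in os.listdir(validation_dir) if fn.endswith('.nii.gz')}

missing = [c for c in val_cases if c not in exported]
print('missing validation predictions:', missing)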

Any help would be appreciated!

Here is my command and its output:

CUDA_VISIBLE_DEVICES=0 nnUNetv2_train 999 3d_fullres_v3 4 --val
Using device: cuda:0

#######################################################################
Please cite the following paper when using nnU-Net:
Isensee, F., Jaeger, P. F., Kohl, S. A., Petersen, J., & Maier-Hein, K. H. (2021). nnU-Net: a self-configuring method for deep learning-based biomedical image segmentation. Nature methods, 18(2), 203-211.
#######################################################################

2024-11-15 14:51:26.293600: Using splits from existing split file: /data/intern/Matt_Wilson_2024/nnUnet_preprocessed/Dataset999_HaNSeg/splits_final.json
2024-11-15 14:51:26.302730: The split file contains 5 splits.
2024-11-15 14:51:26.302880: Desired fold for training: 4
2024-11-15 14:51:26.302988: This split has 34 training and 6 validation cases.
2024-11-15 14:51:26.303285: predicting case_02
2024-11-15 14:51:26.643660: case_02, shape torch.Size([1, 136, 466, 466]), rank 0
2024-11-15 14:53:46.883094: predicting case_37
2024-11-15 14:53:47.123915: case_37, shape torch.Size([1, 124, 385, 385]), rank 0
2024-11-15 14:54:49.823392: predicting case_38
2024-11-15 14:54:50.046503: case_38, shape torch.Size([1, 136, 357, 357]), rank 0
2024-11-15 14:56:14.719611: predicting case_39
2024-11-15 14:56:15.031698: case_39, shape torch.Size([1, 135, 425, 425]), rank 0
2024-11-15 14:58:25.082591: predicting case_40
2024-11-15 14:58:25.407599: case_40, shape torch.Size([1, 126, 400, 400]), rank 0
Traceback (most recent call last):
  File "/home/mtw2156/anaconda3/envs/env/lib/python3.10/multiprocessing/resource_sharer.py", line 138, in _serve
    with self._listener.accept() as conn:
  File "/home/mtw2156/anaconda3/envs/env/lib/python3.10/multiprocessing/connection.py", line 466, in accept
    answer_challenge(c, self._authkey)
  File "/home/mtw2156/anaconda3/envs/env/lib/python3.10/multiprocessing/connection.py", line 757, in answer_challenge
    response = connection.recv_bytes(256)  # reject large message
  File "/home/mtw2156/anaconda3/envs/env/lib/python3.10/multiprocessing/connection.py", line 216, in recv_bytes
    buf = self._recv_bytes(maxlength)
  File "/home/mtw2156/anaconda3/envs/env/lib/python3.10/multiprocessing/connection.py", line 414, in _recv_bytes
    buf = self._recv(4)
  File "/home/mtw2156/anaconda3/envs/env/lib/python3.10/multiprocessing/connection.py", line 379, in _recv
    chunk = read(handle, remaining)
ConnectionResetError: [Errno 104] Connection reset by peer
Traceback (most recent call last):
  File "/home/mtw2156/anaconda3/envs/env/bin/nnUNetv2_train", line 8, in <module>
    sys.exit(run_training_entry())
  File "/data/intern/Matt_Wilson_2024/nnUnet_preprocessed/Dataset999_HaNSeg/nnUNet/nnUNet/nnUNet/nnunetv2/run/run_training.py", line 268, in run_training_entry
    run_training(args.dataset_name_or_id, args.configuration, args.fold, args.tr, args.p, args.pretrained_weights,
  File "/data/intern/Matt_Wilson_2024/nnUnet_preprocessed/Dataset999_HaNSeg/nnUNet/nnUNet/nnUNet/nnunetv2/run/run_training.py", line 208, in run_training
    nnunet_trainer.perform_actual_validation(export_validation_probabilities)
  File "/data/intern/Matt_Wilson_2024/nnUnet_preprocessed/Dataset999_HaNSeg/nnUNet/nnUNet/nnUNet/nnunetv2/training/nnUNetTrainer/nnUNetTrainer.py", line 1183, in perform_actual_validation
    proceed = not check_workers_alive_and_busy(segmentation_export_pool, worker_list, results,
  File "/data/intern/Matt_Wilson_2024/nnUnet_preprocessed/Dataset999_HaNSeg/nnUNet/nnUNet/nnUNet/nnunetv2/utilities/file_path_utilities.py", line 103, in check_workers_alive_and_busy
    raise RuntimeError('Some background workers are no longer alive')
RuntimeError: Some background workers are no longer alive
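
From the traceback, the error is raised by check_workers_alive_and_busy in nnunetv2/utilities/file_path_utilities.py. As far as I understand it, the check itself only notices that one of the background segmentation-export workers has already died (presumably killed by the OS after running out of memory while exporting/resampling a large case), which would also explain the ConnectionResetError above. A rough sketch of the failing check (not the actual nnU-Net code):

import multiprocessing


def check_workers_alive(workers):
    # if any export worker process has died, the main process cannot collect
    # its results anymore and aborts with the same error as above
    for p in workers:
        if not p.is_alive():
            raise RuntimeError('Some background workers are no longer alive')


def _noop():
    pass


if __name__ == '__main__':
    worker = multiprocessing.Process(target=_noop)
    worker.start()
    worker.join()                  # the worker has exited by now ...
    check_workers_alive([worker])  # ... so this raises the RuntimeError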

mtw2156 changed the title from "Issue regarding validation cases" to "RuntimeError: 'Some background workers are no longer alive' during nnUNetv2_train Validation" on Nov 18, 2024
@sunburstillend

Hello,

I am also experiencing the issue mentioned above. Here is part of my log for context:

2024-11-21 15:41:56.820905: This split has 157 training and 39 validation cases.
2024-11-21 15:41:56.821133: predicting case_196_0004
2024-11-21 15:41:56.823359: case_196_0004, shape torch.Size([1, 448, 946, 448]), rank 0
2024-11-21 15:43:39.985436: predicting case_196_0007
2024-11-21 15:43:40.018088: case_196_0007, shape torch.Size([1, 512, 755, 512]), rank 0
2024-11-21 15:45:33.126272: predicting case_196_0009
2024-11-21 15:45:33.154442: case_196_0009, shape torch.Size([1, 575, 399, 575]), rank 0
2024-11-21 15:46:56.187559: predicting case_196_0010
2024-11-21 15:46:56.204679: case_196_0010, shape torch.Size([1, 575, 399, 575]), rank 0
2024-11-21 15:48:36.251001: predicting case_196_0019
2024-11-21 15:48:36.273754: case_196_0019, shape torch.Size([1, 512, 647, 512]), rank 0
2024-11-21 15:50:14.743588: predicting case_196_0026
2024-11-21 15:50:14.771919: case_196_0026, shape torch.Size([1, 467, 642, 467]), rank 0
2024-11-21 15:51:34.419982: predicting case_196_0027
2024-11-21 15:51:34.440981: case_196_0027, shape torch.Size([1, 636, 722, 636]), rank 0
2024-11-21 15:54:19.972141: predicting case_196_0031
2024-11-21 15:54:20.014876: case_196_0031, shape torch.Size([1, 447, 2073, 447]), rank 0
Prediction on device was unsuccessful, probably due to a lack of memory. Moving results arrays to CPU
2024-11-21 16:02:21.664565: predicting case_196_0033
2024-11-21 16:02:21.720253: case_196_0033, shape torch.Size([1, 639, 940, 639]), rank 0
W1121 16:02:22.355000 139797994328448 torch/_dynamo/convert_frame.py:357] torch._dynamo hit config.cache_size_limit (8)
W1121 16:02:22.355000 139797994328448 torch/_dynamo/convert_frame.py:357] function: 'forward' (/home/user/anaconda3/envs/nnunet-2.4/lib/python3.10/site-packages/dynamic_network_architectures/architectures/unet.py:116)
W1121 16:02:22.355000 139797994328448 torch/_dynamo/convert_frame.py:357] last reason: tensor 'L['x']' stride mismatch at index 0. expected 189865984, actual 383821740
W1121 16:02:22.355000 139797994328448 torch/_dynamo/convert_frame.py:357] To log all recompilation reasons, use TORCH_LOGS="recompiles".
W1121 16:02:22.355000 139797994328448 torch/_dynamo/convert_frame.py:357] To diagnose recompilation issues, see https://pytorch.org/docs/master/compile/troubleshooting.html.
Prediction on device was unsuccessful, probably due to a lack of memory. Moving results arrays to CPU

It seems like a memory issue is forcing the prediction to fall back to the CPU, and I am also seeing warnings related to torch._dynamo. What could be causing this?
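
For what it is worth, the torch._dynamo lines are only warnings: torch.compile stops recompiling the forward pass after hitting cache_size_limit (8) because every validation case has a different shape, and this by itself should not crash the run. If the warnings are a concern, the cache limit can be raised, or compilation can be switched off for the validation run (note that the nnUNet_compile environment variable is an assumption on my part; please double-check it against your nnU-Net version):

import torch._dynamo

# allow more cached compiled graphs per function (default is 8) so that
# differently shaped validation cases trigger fewer cache_size_limit warnings
torch._dynamo.config.cache_size_limit = 16

# alternatively (assumption: nnU-Net v2 reads the nnUNet_compile environment
# variable before wrapping the network in torch.compile), disable compilation:
#   nnUNet_compile=false nnUNetv2_train <dataset> <configuration> <fold> --val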

Thank you in advance!


mtw2156 commented Nov 21, 2024

Additionally, for reference, quoting @sunburstillend's comment and log above (reproduced in full in the previous comment).

mtw2156 closed this as completed Nov 21, 2024

mtw2156 commented Nov 21, 2024

I have also seen warnings related to torch._dynamo!
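
In case it helps with narrowing this down, the warning itself suggests how to log all recompilation reasons; for my run that would be something along the lines of:

TORCH_LOGS="recompiles" CUDA_VISIBLE_DEVICES=0 nnUNetv2_train 999 3d_fullres_v3 4 --val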

Thanks to anyone who can help!

mtw2156 reopened this Nov 21, 2024