RuntimeError: "Some background workers are no longer alive" during nnUNetv2_train Validation #2607
Hello, I am also experiencing the issue mentioned above. Here is part of my log for context:
2024-11-21 15:41:56.820905: This split has 157 training and 39 validation cases.
It seems like the memory issue is forcing the process to switch to the CPU, and I am also seeing warnings related to torch._dynamo. What could it be? Thank you in advance!
Additionally, for reference:
I have also seen warnings related to torch._dynamo! Thanks to anyone who provides help!
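A note on the torch._dynamo warnings mentioned above: they most likely come from torch.compile, which the nnU-Net trainer enables when it detects supported hardware, and they are separate from the dead-worker RuntimeError itself. Below is a minimal sketch of how to quiet them when driving training from Python; it assumes the warnings really do originate from torch.compile and that your nnU-Net version reads the nnUNet_compile environment variable.

```python
# Sketch only: generic PyTorch/nnU-Net knobs for quieting torch._dynamo warnings.
# This does not address the dead-worker RuntimeError itself, which is raised by
# the segmentation export pool, not by torch.compile.
import os

# Read by recent nnU-Net versions; anything other than "true"/"1"/"t" disables
# torch.compile (please verify against your installed version).
os.environ['nnUNet_compile'] = 'f'

import torch._dynamo as dynamo

# If a subgraph still fails to compile, fall back to eager execution
# instead of emitting errors.
dynamo.config.suppress_errors = True
```

When launching through the nnUNetv2_train CLI instead, exporting nnUNet_compile=f in the shell before the call should have the same effect (worth verifying against the installed version).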
Hi,
I wanted to share an issue regarding validation cases. I am using nnU-Net on the HaNSeg dataset. I trained my model on all 5 folds using a custom configuration without issues, and the training logs do not indicate any problems during training or when predicting the validation cases. However, the validation predictions were not saved (or only some of them were), so I am now re-running validation with "nnUNetv2_train --val". It works for some of the cases but usually crashes before reaching the end; it is very expensive computationally and often falls back to the CPU. I then created a new configuration, ran the dataset through preprocessing for that configuration, transferred the model files over, and ran validation again. It still crashes, but the predictions stay on the GPU. The same happens for other folds as well.
Any help would be appreciated!
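One workaround I am considering in the meantime is to predict the fold-4 validation images through the Python inference API with fewer background workers, since the crash looks like export workers being killed under memory pressure. This is only a sketch: the folder paths are placeholders for my layout, and the argument names follow the nnU-Net v2 inference example, so they may differ between versions.

```python
# Sketch, not a verified fix: predict the fold-4 validation cases with a reduced
# number of background workers to lower RAM pressure. Paths are placeholders.
import torch
from nnunetv2.inference.predict_from_raw_data import nnUNetPredictor

predictor = nnUNetPredictor(
    tile_step_size=0.5,
    use_gaussian=True,
    use_mirroring=True,
    device=torch.device('cuda', 0),
    verbose=False,
    allow_tqdm=True,
)
predictor.initialize_from_trained_model_folder(
    '/path/to/nnUNet_results/Dataset999_HaNSeg/nnUNetTrainer__nnUNetPlans__3d_fullres_v3',  # placeholder
    use_folds=(4,),
    checkpoint_name='checkpoint_final.pth',
)
predictor.predict_from_files(
    '/path/to/fold4_validation_images',   # placeholder input folder
    '/path/to/fold4_validation_output',   # placeholder output folder
    save_probabilities=False,
    overwrite=False,
    num_processes_preprocessing=2,          # fewer workers -> less RAM used
    num_processes_segmentation_export=2,
)
```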
Here is my input and output:
CUDA_VISIBLE_DEVICES=0 nnUNetv2_train 999 3d_fullres_v3 4 --val
Using device: cuda:0
#######################################################################
Please cite the following paper when using nnU-Net:
Isensee, F., Jaeger, P. F., Kohl, S. A., Petersen, J., & Maier-Hein, K. H. (2021). nnU-Net: a self-configuring method for deep learning-based biomedical image segmentation. Nature methods, 18(2), 203-211.
#######################################################################
2024-11-15 14:51:26.293600: Using splits from existing split file: /data/intern/Matt_Wilson_2024/nnUnet_preprocessed/Dataset999_HaNSeg/splits_final.json
2024-11-15 14:51:26.302730: The split file contains 5 splits.
2024-11-15 14:51:26.302880: Desired fold for training: 4
2024-11-15 14:51:26.302988: This split has 34 training and 6 validation cases.
2024-11-15 14:51:26.303285: predicting case_02
2024-11-15 14:51:26.643660: case_02, shape torch.Size([1, 136, 466, 466]), rank 0
2024-11-15 14:53:46.883094: predicting case_37
2024-11-15 14:53:47.123915: case_37, shape torch.Size([1, 124, 385, 385]), rank 0
2024-11-15 14:54:49.823392: predicting case_38
2024-11-15 14:54:50.046503: case_38, shape torch.Size([1, 136, 357, 357]), rank 0
2024-11-15 14:56:14.719611: predicting case_39
2024-11-15 14:56:15.031698: case_39, shape torch.Size([1, 135, 425, 425]), rank 0
2024-11-15 14:58:25.082591: predicting case_40
2024-11-15 14:58:25.407599: case_40, shape torch.Size([1, 126, 400, 400]), rank 0
Traceback (most recent call last):
File "/home/mtw2156/anaconda3/envs/env/lib/python3.10/multiprocessing/resource_sharer.py", line 138, in _serve
with self._listener.accept() as conn:
File "/home/mtw2156/anaconda3/envs/env/lib/python3.10/multiprocessing/connection.py", line 466, in accept
answer_challenge(c, self._authkey)
File "/home/mtw2156/anaconda3/envs/env/lib/python3.10/multiprocessing/connection.py", line 757, in answer_challenge
response = connection.recv_bytes(256) # reject large message
File "/home/mtw2156/anaconda3/envs/env/lib/python3.10/multiprocessing/connection.py", line 216, in recv_bytes
buf = self._recv_bytes(maxlength)
File "/home/mtw2156/anaconda3/envs/env/lib/python3.10/multiprocessing/connection.py", line 414, in _recv_bytes
buf = self._recv(4)
File "/home/mtw2156/anaconda3/envs/env/lib/python3.10/multiprocessing/connection.py", line 379, in _recv
chunk = read(handle, remaining)
ConnectionResetError: [Errno 104] Connection reset by peer
Traceback (most recent call last):
File "/home/mtw2156/anaconda3/envs/env/bin/nnUNetv2_train", line 8, in <module>
sys.exit(run_training_entry())
File "/data/intern/Matt_Wilson_2024/nnUnet_preprocessed/Dataset999_HaNSeg/nnUNet/nnUNet/nnUNet/nnunetv2/run/run_training.py", line 268, in run_training_entry
run_training(args.dataset_name_or_id, args.configuration, args.fold, args.tr, args.p, args.pretrained_weights,
File "/data/intern/Matt_Wilson_2024/nnUnet_preprocessed/Dataset999_HaNSeg/nnUNet/nnUNet/nnUNet/nnunetv2/run/run_training.py", line 208, in run_training
nnunet_trainer.perform_actual_validation(export_validation_probabilities)
File "/data/intern/Matt_Wilson_2024/nnUnet_preprocessed/Dataset999_HaNSeg/nnUNet/nnUNet/nnUNet/nnunetv2/training/nnUNetTrainer/nnUNetTrainer.py", line 1183, in perform_actual_validation
proceed = not check_workers_alive_and_busy(segmentation_export_pool, worker_list, results,
File "/data/intern/Matt_Wilson_2024/nnUnet_preprocessed/Dataset999_HaNSeg/nnUNet/nnUNet/nnUNet/nnunetv2/utilities/file_path_utilities.py", line 103, in check_workers_alive_and_busy
raise RuntimeError('Some background workers are no longer alive')
RuntimeError: Some background workers are no longer alive
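For context on what this RuntimeError means: perform_actual_validation hands finished predictions to a multiprocessing pool (segmentation_export_pool) that resamples and writes the segmentations, and check_workers_alive_and_busy polls those worker processes. The error fires as soon as one of them has died, typically because the Linux OOM killer terminated a worker while it was exporting a large case, which would also explain the ConnectionResetError in the first traceback. Here is a minimal sketch of that liveness-check pattern (illustrative only, not nnU-Net's exact code):

```python
# Illustrative sketch of the worker-liveness pattern behind this error; the real
# check lives in nnunetv2/utilities/file_path_utilities.py and differs in detail.
import multiprocessing as mp


def check_workers_alive(worker_list):
    """Raise if any background export worker has died, e.g. because the OS
    OOM killer terminated it while resampling/writing a large case."""
    if not all(p.is_alive() for p in worker_list):
        raise RuntimeError('Some background workers are no longer alive')


if __name__ == '__main__':
    with mp.get_context('spawn').Pool(2) as pool:
        workers = list(pool._pool)    # the pool's worker processes (private attribute)
        check_workers_alive(workers)  # passes while every worker is still alive
```

If this is what is happening, dmesg (or journalctl) should show "Out of memory: Killed process ..." entries around the time of the crash, and reducing the number of export workers or increasing available RAM/swap would be the usual mitigation.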