Error during training #283

Open
giuliarubiu opened this issue Nov 11, 2024 · 3 comments

@giuliarubiu

Hello, during training I get the following error:
ERROR Was not able to read git information, trying to continue without.
ERROR Could not log req: stderr not empty
Traceback (most recent call last):
File "/data/Documents/data/storage2/P228_CE_mark/AID_Chest_CT_Nodules/nnDetection/train.py", line 497, in
train()
File "/data/Documents/data/storage2/P228_CE_mark/AID_Chest_CT_Nodules/nnDetection/nndet/utils/check.py", line 62, in wrapper
return func(*args, **kwargs)
File "/data/Documents/data/storage2/P228_CE_mark/AID_Chest_CT_Nodules/nnDetection/train.py", line 70, in train
_train(
File "/data/Documents/data/storage2/P228_CE_mark/AID_Chest_CT_Nodules/nnDetection/train.py", line 290, in _train
trainer.fit(module, datamodule=datamodule)
File "/usr/local/lib/python3.9/site-packages/pytorch_lightning/trainer/trainer.py", line 553, in fit
self._run(model)
File "/usr/local/lib/python3.9/site-packages/pytorch_lightning/trainer/trainer.py", line 918, in _run
self._dispatch()
File "/usr/local/lib/python3.9/site-packages/pytorch_lightning/trainer/trainer.py", line 986, in _dispatch
self.accelerator.start_training(self)
File "/usr/local/lib/python3.9/site-packages/pytorch_lightning/accelerators/accelerator.py", line 92, in start_training
self.training_type_plugin.start_training(trainer)
File "/usr/local/lib/python3.9/site-packages/pytorch_lightning/plugins/training_type/training_type_plugin.py", line 161, in start_training
self._results = trainer.run_stage()
File "/usr/local/lib/python3.9/site-packages/pytorch_lightning/trainer/trainer.py", line 996, in run_stage
return self._run_train()
File "/usr/local/lib/python3.9/site-packages/pytorch_lightning/trainer/trainer.py", line 1031, in _run_train
self._run_sanity_check(self.lightning_module)
File "/usr/local/lib/python3.9/site-packages/pytorch_lightning/trainer/trainer.py", line 1115, in _run_sanity_check
self._evaluation_loop.run()
File "/usr/local/lib/python3.9/site-packages/pytorch_lightning/loops/base.py", line 111, in run
self.advance(*args, **kwargs)
File "/usr/local/lib/python3.9/site-packages/pytorch_lightning/loops/dataloader/evaluation_loop.py", line 110, in advance
dl_outputs = self.epoch_loop.run(
File "/usr/local/lib/python3.9/site-packages/pytorch_lightning/loops/base.py", line 111, in run
self.advance(*args, **kwargs)
File "/usr/local/lib/python3.9/site-packages/pytorch_lightning/loops/epoch/evaluation_epoch_loop.py", line 110, in advance
output = self.evaluation_step(batch, batch_idx, dataloader_idx)
File "/usr/local/lib/python3.9/site-packages/pytorch_lightning/loops/epoch/evaluation_epoch_loop.py", line 154, in evaluation_step
output = self.trainer.accelerator.validation_step(step_kwargs)
File "/usr/local/lib/python3.9/site-packages/pytorch_lightning/accelerators/accelerator.py", line 211, in validation_step
return self.training_type_plugin.validation_step(*step_kwargs.values())
File "/usr/local/lib/python3.9/site-packages/pytorch_lightning/plugins/training_type/training_type_plugin.py", line 178, in validation_step
return self.model.validation_step(*args, **kwargs)
File "/data/Documents/data/storage2/P228_CE_mark/AID_Chest_CT_Nodules/nnDetection/nndet/ptmodule/retinaunet/base.py", line 172, in validation_step
losses, prediction = self.model.train_step(
File "/data/Documents/data/storage2/P228_CE_mark/AID_Chest_CT_Nodules/nnDetection/nndet/core/retina.py", line 146, in train_step
prediction = self.postprocess_for_inference(
File "/usr/local/lib/python3.9/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
return func(*args, **kwargs)
File "/data/Documents/data/storage2/P228_CE_mark/AID_Chest_CT_Nodules/nnDetection/nndet/core/retina.py", line 187, in postprocess_for_inference
boxes, probs, labels = self.postprocess_detections(
File "/data/Documents/data/storage2/P228_CE_mark/AID_Chest_CT_Nodules/nnDetection/nndet/core/retina.py", line 326, in postprocess_detections
boxes, probs, labels = self.postprocess_detections_single_image(boxes, probs, image_shape)
File "/data/Documents/data/storage2/P228_CE_mark/AID_Chest_CT_Nodules/nnDetection/nndet/core/retina.py", line 375, in postprocess_detections_single_image
keep = box_utils.batched_nms(boxes, probs, labels, self.nms_thresh)
File "/data/Documents/data/storage2/P228_CE_mark/AID_Chest_CT_Nodules/nnDetection/nndet/core/boxes/nms.py", line 106, in batched_nms
return nms(boxes_for_nms, scores, iou_threshold)
File "/usr/local/lib/python3.9/site-packages/torch/amp/autocast_mode.py", line 14, in decorate_autocast
return func(*args, **kwargs)
File "/data/Documents/data/storage2/P228_CE_mark/AID_Chest_CT_Nodules/nnDetection/nndet/core/boxes/nms.py", line 78, in nms
return nms_fn(boxes.float(), scores.float(), iou_threshold)
TypeError: 'NoneType' object is not callable
Exception in thread Thread-3:
Traceback (most recent call last):
File "/usr/local/lib/python3.9/threading.py", line 980, in _bootstrap_inner
self.run()
File "/usr/local/lib/python3.9/threading.py", line 917, in run
self._target(*self._args, **self._kwargs)
File "/usr/local/lib/python3.9/site-packages/batchgenerators/dataloading/multi_threaded_augmenter.py", line 92, in results_loop
raise RuntimeError("One or more background workers are no longer alive. Exiting. Please check the print"
RuntimeError: One or more background workers are no longer alive. Exiting. Please check the print statements above for the actual error message

Does anyone know how it can be solved?

@mibaumgartner
Collaborator

Dear @giuliarubiu,

It seems that the installation of nnDetection was not successful: the CUDA code was not compiled correctly. Please refer to the FAQ for potential debugging steps and let us know if anything else comes up.

Best,
Michael
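
For readers who hit the same traceback: the final `TypeError: 'NoneType' object is not callable` is the typical symptom of an optional compiled extension failing to import, which leaves its function handle as `None`. The sketch below reproduces that pattern in isolation. The module name `_compiled_ext_placeholder` and the helper `load_optional_fn` are hypothetical and not taken from nnDetection's source; the sketch only illustrates the failure mode, assuming the package guards its compiled import in a similar way.

```python
import importlib


def load_optional_fn(module_name: str, attr: str):
    """Return a function from an optional compiled module, or None if unavailable."""
    try:
        module = importlib.import_module(module_name)
    except ImportError:
        # The extension was never built (for example, no nvcc at install time).
        return None
    return getattr(module, attr, None)


# Hypothetical module name: it does not exist, so the import fails.
nms_fn = load_optional_fn("_compiled_ext_placeholder", "nms")


def nms(boxes, scores, iou_threshold):
    """Call the compiled kernel, failing with a readable message if it is missing."""
    if nms_fn is None:
        raise RuntimeError(
            "Compiled NMS kernel not available; the extension was probably "
            "not built. Check the CUDA toolkit and reinstall."
        )
    return nms_fn(boxes, scores, iou_threshold)


if __name__ == "__main__":
    try:
        nms_fn([], [], 0.5)  # calling None directly, as in the traceback
    except TypeError as err:
        print(f"unguarded call: {err}")
    try:
        nms([], [], 0.5)  # guarded call reports the likely cause instead
    except RuntimeError as err:
        print(f"guarded call: {err}")
```

The unguarded call fails with the same kind of `TypeError` as in the report, while the guarded wrapper points at the missing build instead.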

@mibaumgartner mibaumgartner self-assigned this Nov 12, 2024
@giuliarubiu
Author

giuliarubiu commented Nov 12, 2024

Thanks! I checked the FAQ and printed the following:
----- PyTorch Information -----
PyTorch Version: 2.0.1+cu118
PyTorch Debug: False
PyTorch CUDA: 11.8
PyTorch Backend cudnn: 8700
PyTorch CUDA Arch List: ['sm_37', 'sm_50', 'sm_60', 'sm_70', 'sm_75', 'sm_80', 'sm_86', 'sm_90']
PyTorch Current Device Capability: (8, 6)
PyTorch CUDA available: True

----- System Information -----
/bin/sh: 1: nvcc: not found
System NVCC:
System Arch List: None
System OMP_NUM_THREADS: 1
System CUDA_HOME is None: True
System CPU Count: 24
Python Version: 3.9.20 (main, Sep 12 2024, 21:07:53)
[GCC 12.2.0]

----- nnDetection Information -----
det_num_threads 6
det_data is set False
det_models is set False

I think the problem is with nvcc, but I'm not able to solve it. Do you have any suggestions?
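
One way to read the output above, as a rough sketch: the PyTorch side looks consistent (the device capability `(8, 6)` corresponds to `sm_86`, which is in the arch list), while the system side is missing both `nvcc` and `CUDA_HOME`. The values below are copied by hand from the printout; the helper `capability_to_arch` is a hypothetical name used only for illustration.

```python
def capability_to_arch(capability):
    """Map a (major, minor) compute capability to its 'sm_XY' arch name."""
    major, minor = capability
    return f"sm_{major}{minor}"


# Values copied from the printed environment report.
arch_list = ['sm_37', 'sm_50', 'sm_60', 'sm_70', 'sm_75', 'sm_80', 'sm_86', 'sm_90']
device_capability = (8, 6)
nvcc_found = False      # "/bin/sh: 1: nvcc: not found"
cuda_home_set = False   # "System CUDA_HOME is None: True"

arch = capability_to_arch(device_capability)
pytorch_supports_gpu = arch in arch_list

print(f"PyTorch binary supports this GPU ({arch}): {pytorch_supports_gpu}")
print(f"CUDA toolkit usable for compiling extensions: {nvcc_found and cuda_home_set}")
```

Under this reading, the prebuilt PyTorch wheel runs on the GPU, but there is no toolkit available to compile additional CUDA code.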

@mibaumgartner
Collaborator

Dear @giuliarubiu,

Indeed, it seems your CUDA installation is not correct. I recommend redoing the CUDA installation from scratch, following the official NVIDIA documentation. Make sure to read that documentation to the end, since the required environment variables are only introduced at the end of the document.

Best,
Michael
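
After reinstalling the toolkit, a quick check along the following lines may help confirm that `nvcc` and the environment variables are visible to Python. This is a minimal sketch using only the standard library; the function name `check_cuda_toolkit` is hypothetical, and the choice of variables (`PATH`, `CUDA_HOME`, `LD_LIBRARY_PATH`) is an assumption based on the FAQ output above and the usual post-installation steps, not an official checklist.

```python
import os
import shutil
import subprocess


def check_cuda_toolkit(env=None):
    """Collect basic facts about the CUDA toolkit visible in the given environment."""
    env = os.environ if env is None else env
    # Look for nvcc on the PATH of the environment being inspected.
    nvcc_path = shutil.which("nvcc", path=env.get("PATH"))
    report = {
        "nvcc_path": nvcc_path,
        "CUDA_HOME": env.get("CUDA_HOME"),
        "LD_LIBRARY_PATH": env.get("LD_LIBRARY_PATH"),
        "nvcc_version": None,
    }
    if nvcc_path is not None:
        # Ask the compiler for its version banner; do not raise on a non-zero exit.
        result = subprocess.run(
            [nvcc_path, "--version"], capture_output=True, text=True, check=False
        )
        report["nvcc_version"] = result.stdout.strip() or None
    return report


if __name__ == "__main__":
    for key, value in check_cuda_toolkit().items():
        print(f"{key}: {value}")
```

If `nvcc_path` and `CUDA_HOME` are both reported, the compiled extension would presumably still need to be rebuilt by reinstalling nnDetection, since the earlier build ran without a working toolkit.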
