Error during training #283
Comments
Dear @giuliarubiu, it seems like the installation of nnDetection was not successful - the CUDA code was not compiled correctly. Please refer to the FAQ for further information on potential debugging steps, and let us know if anything else comes up. Best,
Thanks! I checked the FAQ and printed the following: ----- System Information ----- ----- nnDetection Information ----- I think the problem is with nvcc, but I'm not able to solve it. Do you have any suggestions?
Dear @giuliarubiu, indeed it seems like your CUDA installation is not correct. I would recommend reinstalling CUDA from scratch, following the official documentation from NVIDIA. Also, make sure to read that documentation all the way through, since the required environment variables are only introduced near the end. Best,
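Before reinstalling, it can help to confirm which part of the toolchain is missing. The sketch below (stdlib only; the environment variables checked are the usual NVIDIA defaults, not something nnDetection-specific) gathers the facts that most often explain a failed CUDA build:

```python
# Minimal CUDA environment sanity check - a sketch, not an official
# nnDetection diagnostic. Adjust variable names for your install.
import os
import shutil


def cuda_env_report():
    """Collect the facts that usually explain a failed CUDA extension build."""
    return {
        # nvcc must be on PATH for PyTorch C++/CUDA extensions to compile
        "nvcc_on_path": shutil.which("nvcc") is not None,
        # CUDA_HOME (or CUDA_PATH) is what most build scripts consult
        "cuda_home": os.environ.get("CUDA_HOME") or os.environ.get("CUDA_PATH"),
        # the CUDA runtime libraries must be visible to the dynamic linker
        "ld_library_path_has_cuda": "cuda" in os.environ.get("LD_LIBRARY_PATH", "").lower(),
    }


if __name__ == "__main__":
    for key, value in cuda_env_report().items():
        print(f"{key}: {value}")
```

If `nvcc_on_path` is False or `cuda_home` is empty, the compiled ops will silently be skipped during installation, which matches the error further down in this thread.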
Hello, during training I get the following error:
ERROR Was not able to read git information, trying to continue without.
ERROR Could not log req: stderr not empty
Traceback (most recent call last):
File "/data/Documents/data/storage2/P228_CE_mark/AID_Chest_CT_Nodules/nnDetection/train.py", line 497, in <module>
train()
File "/data/Documents/data/storage2/P228_CE_mark/AID_Chest_CT_Nodules/nnDetection/nndet/utils/check.py", line 62, in wrapper
return func(*args, **kwargs)
File "/data/Documents/data/storage2/P228_CE_mark/AID_Chest_CT_Nodules/nnDetection/train.py", line 70, in train
_train(
File "/data/Documents/data/storage2/P228_CE_mark/AID_Chest_CT_Nodules/nnDetection/train.py", line 290, in _train
trainer.fit(module, datamodule=datamodule)
File "/usr/local/lib/python3.9/site-packages/pytorch_lightning/trainer/trainer.py", line 553, in fit
self._run(model)
File "/usr/local/lib/python3.9/site-packages/pytorch_lightning/trainer/trainer.py", line 918, in _run
self._dispatch()
File "/usr/local/lib/python3.9/site-packages/pytorch_lightning/trainer/trainer.py", line 986, in _dispatch
self.accelerator.start_training(self)
File "/usr/local/lib/python3.9/site-packages/pytorch_lightning/accelerators/accelerator.py", line 92, in start_training
self.training_type_plugin.start_training(trainer)
File "/usr/local/lib/python3.9/site-packages/pytorch_lightning/plugins/training_type/training_type_plugin.py", line 161, in start_training
self._results = trainer.run_stage()
File "/usr/local/lib/python3.9/site-packages/pytorch_lightning/trainer/trainer.py", line 996, in run_stage
return self._run_train()
File "/usr/local/lib/python3.9/site-packages/pytorch_lightning/trainer/trainer.py", line 1031, in _run_train
self._run_sanity_check(self.lightning_module)
File "/usr/local/lib/python3.9/site-packages/pytorch_lightning/trainer/trainer.py", line 1115, in _run_sanity_check
self._evaluation_loop.run()
File "/usr/local/lib/python3.9/site-packages/pytorch_lightning/loops/base.py", line 111, in run
self.advance(*args, **kwargs)
File "/usr/local/lib/python3.9/site-packages/pytorch_lightning/loops/dataloader/evaluation_loop.py", line 110, in advance
dl_outputs = self.epoch_loop.run(
File "/usr/local/lib/python3.9/site-packages/pytorch_lightning/loops/base.py", line 111, in run
self.advance(*args, **kwargs)
File "/usr/local/lib/python3.9/site-packages/pytorch_lightning/loops/epoch/evaluation_epoch_loop.py", line 110, in advance
output = self.evaluation_step(batch, batch_idx, dataloader_idx)
File "/usr/local/lib/python3.9/site-packages/pytorch_lightning/loops/epoch/evaluation_epoch_loop.py", line 154, in evaluation_step
output = self.trainer.accelerator.validation_step(step_kwargs)
File "/usr/local/lib/python3.9/site-packages/pytorch_lightning/accelerators/accelerator.py", line 211, in validation_step
return self.training_type_plugin.validation_step(*step_kwargs.values())
File "/usr/local/lib/python3.9/site-packages/pytorch_lightning/plugins/training_type/training_type_plugin.py", line 178, in validation_step
return self.model.validation_step(*args, **kwargs)
File "/data/Documents/data/storage2/P228_CE_mark/AID_Chest_CT_Nodules/nnDetection/nndet/ptmodule/retinaunet/base.py", line 172, in validation_step
losses, prediction = self.model.train_step(
File "/data/Documents/data/storage2/P228_CE_mark/AID_Chest_CT_Nodules/nnDetection/nndet/core/retina.py", line 146, in train_step
prediction = self.postprocess_for_inference(
File "/usr/local/lib/python3.9/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
return func(*args, **kwargs)
File "/data/Documents/data/storage2/P228_CE_mark/AID_Chest_CT_Nodules/nnDetection/nndet/core/retina.py", line 187, in postprocess_for_inference
boxes, probs, labels = self.postprocess_detections(
File "/data/Documents/data/storage2/P228_CE_mark/AID_Chest_CT_Nodules/nnDetection/nndet/core/retina.py", line 326, in postprocess_detections
boxes, probs, labels = self.postprocess_detections_single_image(boxes, probs, image_shape)
File "/data/Documents/data/storage2/P228_CE_mark/AID_Chest_CT_Nodules/nnDetection/nndet/core/retina.py", line 375, in postprocess_detections_single_image
keep = box_utils.batched_nms(boxes, probs, labels, self.nms_thresh)
File "/data/Documents/data/storage2/P228_CE_mark/AID_Chest_CT_Nodules/nnDetection/nndet/core/boxes/nms.py", line 106, in batched_nms
return nms(boxes_for_nms, scores, iou_threshold)
File "/usr/local/lib/python3.9/site-packages/torch/amp/autocast_mode.py", line 14, in decorate_autocast
return func(*args, **kwargs)
File "/data/Documents/data/storage2/P228_CE_mark/AID_Chest_CT_Nodules/nnDetection/nndet/core/boxes/nms.py", line 78, in nms
return nms_fn(boxes.float(), scores.float(), iou_threshold)
TypeError: 'NoneType' object is not callable
Exception in thread Thread-3:
Traceback (most recent call last):
File "/usr/local/lib/python3.9/threading.py", line 980, in _bootstrap_inner
self.run()
File "/usr/local/lib/python3.9/threading.py", line 917, in run
self._target(*self._args, **self._kwargs)
File "/usr/local/lib/python3.9/site-packages/batchgenerators/dataloading/multi_threaded_augmenter.py", line 92, in results_loop
raise RuntimeError("One or more background workers are no longer alive. Exiting. Please check the print"
RuntimeError: One or more background workers are no longer alive. Exiting. Please check the print statements above for the actual error message
Does anyone know how this can be solved?
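The final frame (`nms_fn(...)` raising `TypeError: 'NoneType' object is not callable`) is the classic symptom of a guarded import: the compiled CUDA op failed to build, the import was swallowed, and the function handle stayed `None` until call time. A sketch of that pattern, with module and function names that are purely hypothetical (not nnDetection's actual internals):

```python
# Sketch of the guarded-import pattern that produces this symptom.
# "_compiled_ops" is a hypothetical placeholder for a compiled CUDA/C++
# extension module - it is NOT the real nnDetection module name.
nms_fn = None  # stays None when the compiled extension is missing

try:
    from _compiled_ops import nms as nms_fn  # hypothetical compiled op
except ImportError:
    pass  # silently falls back; the failure only surfaces at call time


def nms(boxes, scores, iou_threshold):
    """Dispatch to the compiled op, failing loudly if it never loaded."""
    if nms_fn is None:
        raise RuntimeError(
            "Compiled NMS op not available - the CUDA extension was "
            "probably not built. Reinstall with a working CUDA toolchain."
        )
    return nms_fn(boxes, scores, iou_threshold)
```

So the `TypeError` here almost certainly points back to the installation problem discussed above: rebuilding nnDetection with `nvcc` and the CUDA environment variables correctly set should make the compiled op importable again.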