Issues finding the exact file to execute while using command from instruction #2576

MenghanLiu212 · 2024-10-31T01:33:52Z

Hi,

Thank you authors for the great work!
I'm running into two kinds of issues (on two computers), and I'll explain each one below.

Issue 1:
The shortcut commands don't seem to match the files that should actually be executed. After setting up the torch environment in conda, I installed nnU-Net as a standardized baseline with
pip install nnunetv2
and then tried to run the fingerprint extraction:
nnUNetv2_plan_and_preprocess -d DATASET_ID --verify_dataset_integrity
This failed with ModuleNotFoundError: No module named 'torch'
However, when I run
python -c "import torch; print(torch.__version__)"
it returns 2.5.1, which means PyTorch is installed.

Then I tried
python -m nnunetv2.experiment_planning.plan_and_preprocess_entrypoints -d DATASET_ID --verify_dataset_integrity
It worked.
The same thing happens with the training command:
CUDA_VISIBLE_DEVICES=3 nnUNetv2_train 1 2d 3
did not work, but
CUDA_VISIBLE_DEVICES=3 python -m nnunetv2.run.run_training 1 2d 3
worked.
I guess there is a mismatch between the shortcut commands and the actual files they are supposed to run.
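One possible explanation for Issue 1 is that the shortcut script was installed for a different Python interpreter than the one that has torch (for example, if pip ran outside the active conda environment). The sketch below is a diagnostic only, not part of nnU-Net: it looks up where a console command resolves on PATH, reads the interpreter from the script's `#!` line, and compares it with the Python currently running. The function names `read_shebang` and `diagnose_entry_point` are hypothetical, and it assumes a POSIX system where console scripts are text files with a shebang.

```python
import shutil
import sys
from pathlib import Path


def read_shebang(script_path):
    """Return the interpreter named on a script's #! line, or None if there is none."""
    with open(script_path, "rb") as fh:
        first_line = fh.readline().decode("utf-8", errors="replace").strip()
    if not first_line.startswith("#!"):
        return None
    parts = first_line[2:].split()
    return parts[0] if parts else None


def diagnose_entry_point(command="nnUNetv2_train"):
    """Report where a console command lives and which Python it would launch."""
    resolved = shutil.which(command)
    interpreter = read_shebang(resolved) if resolved else None
    # Compare bin directories rather than exact paths, because `python` and
    # `python3.12` inside one environment usually point to the same binary.
    same_env = (
        interpreter is not None
        and Path(interpreter).parent == Path(sys.executable).parent
    )
    return {
        "command": command,
        "resolved_path": resolved,
        "script_interpreter": interpreter,
        "current_interpreter": sys.executable,
        "same_environment": same_env,
    }


if __name__ == "__main__":
    # Run this with the same `python` that can import torch.
    for name in ("nnUNetv2_plan_and_preprocess", "nnUNetv2_train"):
        print(diagnose_entry_point(name))
```

If `same_environment` is False, the shortcut is probably launching a Python that has no torch installed, which would be consistent with `python -m ...` working while the shortcut fails.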

Issue 2:
On another computer, I installed nnU-Net as an integrative framework:
git clone https://github.com/MIC-DKFZ/nnUNet.git
cd nnUNet
pip install -e .
Here PyTorch was recognized, and I ran the fingerprint extraction using
nnUNetv2_plan_and_preprocess -d DATASET_ID --verify_dataset_integrity
It worked perfectly.
But when I started training using the provided command
CUDA_VISIBLE_DEVICES=3 nnUNetv2_train 1 2d 3
or the command
CUDA_VISIBLE_DEVICES=3 python -m nnunetv2.run.run_training 1 2d 3
neither of them worked, and I got the following error:
`CUDA_VISIBLE_DEVICES=3 python -m nnunetv2.run.run_training 1 2d 3

############################
INFO: You are using the old nnU-Net default plans. We have updated our recommendations. Please consider using those instead! Read more here: https://github.com/MIC-DKFZ/nnUNet/blob/master/documentation/resenc_presets.md
############################

Using device: cuda:0
/data/menghan/nnUNetFrame/nnUNet/nnunetv2/training/nnUNetTrainer/nnUNetTrainer.py:164: FutureWarning: torch.cuda.amp.GradScaler(args...) is deprecated. Please use torch.amp.GradScaler('cuda', args...) instead.
self.grad_scaler = GradScaler() if self.device.type == 'cuda' else None

#######################################################################
Please cite the following paper when using nnU-Net:
Isensee, F., Jaeger, P. F., Kohl, S. A., Petersen, J., & Maier-Hein, K. H. (2021). nnU-Net: a self-configuring method for deep learning-based biomedical image segmentation. Nature methods, 18(2), 203-211.
#######################################################################

2024-10-30 18:21:12.507816: do_dummy_2d_data_aug: False
2024-10-30 18:21:12.509523: Using splits from existing split file: /data/menghan/nnUNetFrame/dataset/nnUNet_preprocessed/Dataset001_PDGM/splits_final.json
2024-10-30 18:21:12.509829: The split file contains 5 splits.
2024-10-30 18:21:12.509867: Desired fold for training: 3
2024-10-30 18:21:12.509893: This split has 317 training and 79 validation cases.
using pin_memory on device 0
using pin_memory on device 0
2024-10-30 18:21:15.261662: Using torch.compile...
/home/menghan/.conda/envs/nnUNet_new/lib/python3.12/site-packages/torch/optim/lr_scheduler.py:62: UserWarning: The verbose parameter is deprecated. Please use get_last_lr() to access the learning rate.
warnings.warn(

This is the configuration used by this training:
Configuration name: 2d
{'data_identifier': 'nnUNetPlans_2d', 'preprocessor_name': 'DefaultPreprocessor', 'batch_size': 105, 'patch_size': [192, 160], 'median_image_size_in_voxels': [174.0, 137.0], 'spacing': [1.0, 1.0], 'normalization_schemes': ['ZScoreNormalization', 'ZScoreNormalization', 'ZScoreNormalization'], 'use_mask_for_norm': [True, True, True], 'resampling_fn_data': 'resample_data_or_seg_to_shape', 'resampling_fn_seg': 'resample_data_or_seg_to_shape', 'resampling_fn_data_kwargs': {'is_seg': False, 'order': 3, 'order_z': 0, 'force_separate_z': None}, 'resampling_fn_seg_kwargs': {'is_seg': True, 'order': 1, 'order_z': 0, 'force_separate_z': None}, 'resampling_fn_probabilities': 'resample_data_or_seg_to_shape', 'resampling_fn_probabilities_kwargs': {'is_seg': False, 'order': 1, 'order_z': 0, 'force_separate_z': None}, 'architecture': {'network_class_name': 'dynamic_network_architectures.architectures.unet.PlainConvUNet', 'arch_kwargs': {'n_stages': 6, 'features_per_stage': [32, 64, 128, 256, 512, 512], 'conv_op': 'torch.nn.modules.conv.Conv2d', 'kernel_sizes': [[3, 3], [3, 3], [3, 3], [3, 3], [3, 3], [3, 3]], 'strides': [[1, 1], [2, 2], [2, 2], [2, 2], [2, 2], [2, 2]], 'n_conv_per_stage': [2, 2, 2, 2, 2, 2], 'n_conv_per_stage_decoder': [2, 2, 2, 2, 2], 'conv_bias': True, 'norm_op': 'torch.nn.modules.instancenorm.InstanceNorm2d', 'norm_op_kwargs': {'eps': 1e-05, 'affine': True}, 'dropout_op': None, 'dropout_op_kwargs': None, 'nonlin': 'torch.nn.LeakyReLU', 'nonlin_kwargs': {'inplace': True}}, '_kw_requires_import': ['conv_op', 'norm_op', 'dropout_op', 'nonlin']}, 'batch_dice': True}

These are the global plan.json settings:
{'dataset_name': 'Dataset001_PDGM', 'plans_name': 'nnUNetPlans', 'original_median_spacing_after_transp': [1.0, 1.0, 1.0], 'original_median_shape_after_transp': [141, 174, 137], 'image_reader_writer': 'SimpleITKIO', 'transpose_forward': [0, 1, 2], 'transpose_backward': [0, 1, 2], 'experiment_planner_used': 'ExperimentPlanner', 'label_manager': 'LabelManager', 'foreground_intensity_properties_per_channel': {'0': {'max': 13284.6953125, 'mean': 1707.9705810546875, 'median': 1655.8569946289062, 'min': 0.0, 'percentile_00_5': 285.3706359863281, 'percentile_99_5': 4148.60693359375, 'std': 844.5523071289062}, '1': {'max': 18961.615234375, 'mean': 2843.3720703125, 'median': 2556.35791015625, 'min': 0.0, 'percentile_00_5': 382.8267822265625, 'percentile_99_5': 8660.498046875, 'std': 1415.30126953125}, '2': {'max': 7245.7900390625, 'mean': 1720.8861083984375, 'median': 1666.87939453125, 'min': 0.0, 'percentile_00_5': 443.10186767578125, 'percentile_99_5': 3492.31591796875, 'std': 563.9266357421875}}}

2024-10-30 18:21:15.923255: unpacking dataset...
2024-10-30 18:21:19.219472: unpacking done...
2024-10-30 18:21:19.224577: Unable to plot network architecture: nnUNet_compile is enabled!
2024-10-30 18:21:19.232440:
2024-10-30 18:21:19.232761: Epoch 0
2024-10-30 18:21:19.233011: Current learning rate: 0.01
/usr/bin/ld: cannot find -lcuda
collect2: error: ld returned 1 exit status
/usr/bin/ld: cannot find -lcuda
collect2: error: ld returned 1 exit status
Traceback (most recent call last):
File "/home/menghan/.conda/envs/nnUNet_new/lib/python3.12/site-packages/torch/_dynamo/output_graph.py", line 1446, in call_user_compiler
compiled_fn = compiler_fn(gm, self.example_inputs())
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/menghan/.conda/envs/nnUNet_new/lib/python3.12/site-packages/torch/_dynamo/repro/after_dynamo.py", line 129, in __call__
compiled_gm = compiler_fn(gm, example_inputs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/menghan/.conda/envs/nnUNet_new/lib/python3.12/site-packages/torch/__init__.py", line 2234, in __call__
return compile_fx(model, inputs, config_patches=self.config)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/menghan/.conda/envs/nnUNet_new/lib/python3.12/site-packages/torch/_inductor/compile_fx.py", line 1521, in compile_fx
return aot_autograd(
^^^^^^^^^^^^^
File "/home/menghan/.conda/envs/nnUNet_new/lib/python3.12/site-packages/torch/_dynamo/backends/common.py", line 72, in __call__
cg = aot_module_simplified(gm, example_inputs, **self.kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/menghan/.conda/envs/nnUNet_new/lib/python3.12/site-packages/torch/_functorch/aot_autograd.py", line 1071, in aot_module_simplified
compiled_fn = dispatch_and_compile()
^^^^^^^^^^^^^^^^^^^^^^
File "/home/menghan/.conda/envs/nnUNet_new/lib/python3.12/site-packages/torch/_functorch/aot_autograd.py", line 1056, in dispatch_and_compile
compiled_fn, _ = create_aot_dispatcher_function(
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/menghan/.conda/envs/nnUNet_new/lib/python3.12/site-packages/torch/_functorch/aot_autograd.py", line 522, in create_aot_dispatcher_function
return _create_aot_dispatcher_function(
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/menghan/.conda/envs/nnUNet_new/lib/python3.12/site-packages/torch/_functorch/aot_autograd.py", line 759, in _create_aot_dispatcher_function
compiled_fn, fw_metadata = compiler_fn(
^^^^^^^^^^^^
File "/home/menghan/.conda/envs/nnUNet_new/lib/python3.12/site-packages/torch/_functorch/_aot_autograd/jit_compile_runtime_wrappers.py", line 588, in aot_dispatch_autograd
compiled_fw_func = aot_config.fw_compiler(fw_module, adjusted_flat_args)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/menghan/.conda/envs/nnUNet_new/lib/python3.12/site-packages/torch/_inductor/compile_fx.py", line 1350, in fw_compiler_base
return _fw_compiler_base(model, example_inputs, is_inference)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/menghan/.conda/envs/nnUNet_new/lib/python3.12/site-packages/torch/_inductor/compile_fx.py", line 1421, in _fw_compiler_base
return inner_compile(
^^^^^^^^^^^^^^
File "/home/menghan/.conda/envs/nnUNet_new/lib/python3.12/site-packages/torch/_inductor/compile_fx.py", line 475, in compile_fx_inner
return wrap_compiler_debug(_compile_fx_inner, compiler_name="inductor")(
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/menghan/.conda/envs/nnUNet_new/lib/python3.12/site-packages/torch/_dynamo/repro/after_aot.py", line 85, in debug_wrapper
inner_compiled_fn = compiler_fn(gm, example_inputs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/menghan/.conda/envs/nnUNet_new/lib/python3.12/site-packages/torch/_inductor/compile_fx.py", line 661, in _compile_fx_inner
compiled_graph = FxGraphCache.load(
^^^^^^^^^^^^^^^^^^
File "/home/menghan/.conda/envs/nnUNet_new/lib/python3.12/site-packages/torch/_inductor/codecache.py", line 1334, in load
compiled_graph = compile_fx_fn(
^^^^^^^^^^^^^^
File "/home/menghan/.conda/envs/nnUNet_new/lib/python3.12/site-packages/torch/_inductor/compile_fx.py", line 570, in codegen_and_compile
compiled_graph = fx_codegen_and_compile(gm, example_inputs, **fx_kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/menghan/.conda/envs/nnUNet_new/lib/python3.12/site-packages/torch/_inductor/compile_fx.py", line 878, in fx_codegen_and_compile
compiled_fn = graph.compile_to_fn()
^^^^^^^^^^^^^^^^^^^^^
File "/home/menghan/.conda/envs/nnUNet_new/lib/python3.12/site-packages/torch/_inductor/graph.py", line 1913, in compile_to_fn
return self.compile_to_module().call
^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/menghan/.conda/envs/nnUNet_new/lib/python3.12/site-packages/torch/_inductor/graph.py", line 1839, in compile_to_module
return self._compile_to_module()
^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/menghan/.conda/envs/nnUNet_new/lib/python3.12/site-packages/torch/_inductor/graph.py", line 1845, in _compile_to_module
self.codegen_with_cpp_wrapper() if self.cpp_wrapper else self.codegen()
^^^^^^^^^^^^^^
File "/home/menghan/.conda/envs/nnUNet_new/lib/python3.12/site-packages/torch/_inductor/graph.py", line 1784, in codegen
self.scheduler.codegen()
File "/home/menghan/.conda/envs/nnUNet_new/lib/python3.12/site-packages/torch/_inductor/scheduler.py", line 3383, in codegen
return self._codegen()
^^^^^^^^^^^^^^^
File "/home/menghan/.conda/envs/nnUNet_new/lib/python3.12/site-packages/torch/_inductor/scheduler.py", line 3461, in _codegen
self.get_backend(device).codegen_node(node)
File "/home/menghan/.conda/envs/nnUNet_new/lib/python3.12/site-packages/torch/_inductor/codegen/cuda_combined_scheduling.py", line 80, in codegen_node
return self._triton_scheduling.codegen_node(node)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/menghan/.conda/envs/nnUNet_new/lib/python3.12/site-packages/torch/_inductor/codegen/simd.py", line 1155, in codegen_node
return self.codegen_node_schedule(node_schedule, buf_accesses, numel, rnumel)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/menghan/.conda/envs/nnUNet_new/lib/python3.12/site-packages/torch/_inductor/codegen/simd.py", line 1364, in codegen_node_schedule
src_code = kernel.codegen_kernel()
^^^^^^^^^^^^^^^^^^^^^^^
File "/home/menghan/.conda/envs/nnUNet_new/lib/python3.12/site-packages/torch/_inductor/codegen/triton.py", line 2661, in codegen_kernel
**self.inductor_meta_common(),
^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/menghan/.conda/envs/nnUNet_new/lib/python3.12/site-packages/torch/_inductor/codegen/triton.py", line 2532, in inductor_meta_common
"backend_hash": torch.utils._triton.triton_hash_with_backend(),
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/menghan/.conda/envs/nnUNet_new/lib/python3.12/site-packages/torch/utils/_triton.py", line 53, in triton_hash_with_backend
backend = triton_backend()
^^^^^^^^^^^^^^^^
File "/home/menghan/.conda/envs/nnUNet_new/lib/python3.12/site-packages/torch/utils/_triton.py", line 45, in triton_backend
target = driver.active.get_current_target()
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/menghan/.conda/envs/nnUNet_new/lib/python3.12/site-packages/triton/runtime/driver.py", line 23, in __getattr__
self._initialize_obj()
File "/home/menghan/.conda/envs/nnUNet_new/lib/python3.12/site-packages/triton/runtime/driver.py", line 20, in _initialize_obj
self._obj = self._init_fn()
^^^^^^^^^^^^^^^
File "/home/menghan/.conda/envs/nnUNet_new/lib/python3.12/site-packages/triton/runtime/driver.py", line 9, in _create_driver
return actives[0]()
^^^^^^^^^^^^
File "/home/menghan/.conda/envs/nnUNet_new/lib/python3.12/site-packages/triton/backends/nvidia/driver.py", line 371, in __init__
self.utils = CudaUtils() # TODO: make static
^^^^^^^^^^^
File "/home/menghan/.conda/envs/nnUNet_new/lib/python3.12/site-packages/triton/backends/nvidia/driver.py", line 80, in __init__
mod = compile_module_from_src(Path(os.path.join(dirname, "driver.c")).read_text(), "cuda_utils")
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/menghan/.conda/envs/nnUNet_new/lib/python3.12/site-packages/triton/backends/nvidia/driver.py", line 57, in compile_module_from_src
so = _build(name, src_path, tmpdir, library_dirs(), include_dir, libraries)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/menghan/.conda/envs/nnUNet_new/lib/python3.12/site-packages/triton/runtime/build.py", line 48, in _build
ret = subprocess.check_call(cc_cmd)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/menghan/.conda/envs/nnUNet_new/lib/python3.12/subprocess.py", line 413, in check_call
raise CalledProcessError(retcode, cmd)
subprocess.CalledProcessError: Command '['/usr/bin/gcc', '/tmp/tmp13q643ul/main.c', '-O3', '-shared', '-fPIC', '-o', '/tmp/tmp13q643ul/cuda_utils.cpython-312-x86_64-linux-gnu.so', '-lcuda', '-L/home/menghan/.conda/envs/nnUNet_new/lib/python3.12/site-packages/triton/backends/nvidia/lib', '-L/lib/x86_64-linux-gnu', '-I/home/menghan/.conda/envs/nnUNet_new/lib/python3.12/site-packages/triton/backends/nvidia/include', '-I/tmp/tmp13q643ul', '-I/home/menghan/.conda/envs/nnUNet_new/include/python3.12']' returned non-zero exit status 1.

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
File "<frozen runpy>", line 198, in _run_module_as_main
File "<frozen runpy>", line 88, in _run_code
File "/data/menghan/nnUNetFrame/nnUNet/nnunetv2/run/run_training.py", line 285, in <module>
run_training_entry()
File "/data/menghan/nnUNetFrame/nnUNet/nnunetv2/run/run_training.py", line 275, in run_training_entry
run_training(args.dataset_name_or_id, args.configuration, args.fold, args.tr, args.p, args.pretrained_weights,
File "/data/menghan/nnUNetFrame/nnUNet/nnunetv2/run/run_training.py", line 211, in run_training
nnunet_trainer.run_training()
File "/data/menghan/nnUNetFrame/nnUNet/nnunetv2/training/nnUNetTrainer/nnUNetTrainer.py", line 1370, in run_training
train_outputs.append(self.train_step(next(self.dataloader_train)))
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/data/menghan/nnUNetFrame/nnUNet/nnunetv2/training/nnUNetTrainer/nnUNetTrainer.py", line 994, in train_step
output = self.network(data)
^^^^^^^^^^^^^^^^^^
File "/home/menghan/.conda/envs/nnUNet_new/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1736, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/menghan/.conda/envs/nnUNet_new/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1747, in _call_impl
return forward_call(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/menghan/.conda/envs/nnUNet_new/lib/python3.12/site-packages/torch/_dynamo/eval_frame.py", line 465, in _fn
return fn(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^
File "/home/menghan/.conda/envs/nnUNet_new/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1736, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/menghan/.conda/envs/nnUNet_new/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1747, in _call_impl
return forward_call(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/menghan/.conda/envs/nnUNet_new/lib/python3.12/site-packages/torch/_dynamo/convert_frame.py", line 1269, in __call__
return self._torchdynamo_orig_callable(
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/menghan/.conda/envs/nnUNet_new/lib/python3.12/site-packages/torch/_dynamo/convert_frame.py", line 1064, in __call__
result = self._inner_convert(
^^^^^^^^^^^^^^^^^^^^
File "/home/menghan/.conda/envs/nnUNet_new/lib/python3.12/site-packages/torch/_dynamo/convert_frame.py", line 526, in __call__
return _compile(
^^^^^^^^^
File "/home/menghan/.conda/envs/nnUNet_new/lib/python3.12/site-packages/torch/_dynamo/convert_frame.py", line 924, in _compile
guarded_code = compile_inner(code, one_graph, hooks, transform)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/menghan/.conda/envs/nnUNet_new/lib/python3.12/site-packages/torch/_dynamo/convert_frame.py", line 666, in compile_inner
return _compile_inner(code, one_graph, hooks, transform)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/menghan/.conda/envs/nnUNet_new/lib/python3.12/site-packages/torch/_utils_internal.py", line 87, in wrapper_function
return function(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/menghan/.conda/envs/nnUNet_new/lib/python3.12/site-packages/torch/_dynamo/convert_frame.py", line 699, in _compile_inner
out_code = transform_code_object(code, transform)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/menghan/.conda/envs/nnUNet_new/lib/python3.12/site-packages/torch/_dynamo/bytecode_transformation.py", line 1322, in transform_code_object
transformations(instructions, code_options)
File "/home/menghan/.conda/envs/nnUNet_new/lib/python3.12/site-packages/torch/_dynamo/convert_frame.py", line 219, in _fn
return fn(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^
File "/home/menghan/.conda/envs/nnUNet_new/lib/python3.12/site-packages/torch/_dynamo/convert_frame.py", line 634, in transform
tracer.run()
File "/home/menghan/.conda/envs/nnUNet_new/lib/python3.12/site-packages/torch/_dynamo/symbolic_convert.py", line 2796, in run
super().run()
File "/home/menghan/.conda/envs/nnUNet_new/lib/python3.12/site-packages/torch/_dynamo/symbolic_convert.py", line 983, in run
while self.step():
^^^^^^^^^^^
File "/home/menghan/.conda/envs/nnUNet_new/lib/python3.12/site-packages/torch/_dynamo/symbolic_convert.py", line 895, in step
self.dispatch_table[inst.opcode](self, inst)
File "/home/menghan/.conda/envs/nnUNet_new/lib/python3.12/site-packages/torch/_dynamo/symbolic_convert.py", line 2987, in RETURN_VALUE
self._return(inst)
File "/home/menghan/.conda/envs/nnUNet_new/lib/python3.12/site-packages/torch/_dynamo/symbolic_convert.py", line 2972, in _return
self.output.compile_subgraph(
File "/home/menghan/.conda/envs/nnUNet_new/lib/python3.12/site-packages/torch/_dynamo/output_graph.py", line 1142, in compile_subgraph
self.compile_and_call_fx_graph(tx, pass2.graph_output_vars(), root)
File "/home/menghan/.conda/envs/nnUNet_new/lib/python3.12/site-packages/torch/_dynamo/output_graph.py", line 1369, in compile_and_call_fx_graph
compiled_fn = self.call_user_compiler(gm)
^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/menghan/.conda/envs/nnUNet_new/lib/python3.12/site-packages/torch/_dynamo/output_graph.py", line 1416, in call_user_compiler
return self._call_user_compiler(gm)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/menghan/.conda/envs/nnUNet_new/lib/python3.12/site-packages/torch/_dynamo/output_graph.py", line 1465, in _call_user_compiler
raise BackendCompilerFailed(self.compiler_fn, e) from e
torch._dynamo.exc.BackendCompilerFailed: backend='inductor' raised:
CalledProcessError: Command '['/usr/bin/gcc', '/tmp/tmp13q643ul/main.c', '-O3', '-shared', '-fPIC', '-o', '/tmp/tmp13q643ul/cuda_utils.cpython-312-x86_64-linux-gnu.so', '-lcuda', '-L/home/menghan/.conda/envs/nnUNet_new/lib/python3.12/site-packages/triton/backends/nvidia/lib', '-L/lib/x86_64-linux-gnu', '-I/home/menghan/.conda/envs/nnUNet_new/lib/python3.12/site-packages/triton/backends/nvidia/include', '-I/tmp/tmp13q643ul', '-I/home/menghan/.conda/envs/nnUNet_new/include/python3.12']' returned non-zero exit status 1.

Set TORCH_LOGS="+dynamo" and TORCHDYNAMO_VERBOSE=1 for more information

You can suppress this exception and fall back to eager by setting:
import torch._dynamo
torch._dynamo.config.suppress_errors = True

Exception in thread Thread-2 (results_loop):
Traceback (most recent call last):
File "/home/menghan/.conda/envs/nnUNet_new/lib/python3.12/threading.py", line 1075, in _bootstrap_inner
self.run()
File "/home/menghan/.conda/envs/nnUNet_new/lib/python3.12/threading.py", line 1012, in run
self._target(*self._args, **self._kwargs)
File "/home/menghan/.conda/envs/nnUNet_new/lib/python3.12/site-packages/batchgenerators/dataloading/nondet_multi_threaded_augmenter.py", line 125, in results_loop
raise e
File "/home/menghan/.conda/envs/nnUNet_new/lib/python3.12/site-packages/batchgenerators/dataloading/nondet_multi_threaded_augmenter.py", line 103, in results_loop
raise RuntimeError("One or more background workers are no longer alive. Exiting. Please check the "
RuntimeError: One or more background workers are no longer alive. Exiting. Please check the print statements above for the actual error message`
It seems that the linker cannot find -lcuda, and I don't know how to deal with that.
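For the `cannot find -lcuda` error, a first step might be to check whether an unversioned `libcuda.so` exists in the directories gcc was told to search. The linker resolves `-lcuda` to a file named exactly `libcuda.so` (or a static `libcuda.a`, not checked here), so a system that only ships `libcuda.so.1` would fail in this way. The sketch below is a diagnostic under that assumption. The helper names are hypothetical, only the two `-L` directories from the failing gcc command are scanned (not the linker's built-in default paths), and `env_without_compile` assumes that setting the `nnUNet_compile` environment variable to `f` makes nnU-Net skip `torch.compile`, which avoids the Triton build step rather than fixing it.

```python
import glob
import os


def find_libcuda(search_dirs):
    """Map each directory to the libcuda shared-library files found directly in it."""
    found = {}
    for directory in search_dirs:
        pattern = os.path.join(directory, "libcuda.so*")
        found[directory] = sorted(glob.glob(pattern))
    return found


def linker_can_resolve_lcuda(found):
    """True if some directory holds a file named exactly libcuda.so.

    A versioned file such as libcuda.so.1 alone does not satisfy -lcuda.
    """
    return any(
        os.path.basename(path) == "libcuda.so"
        for paths in found.values()
        for path in paths
    )


def env_without_compile(base_env=None):
    """Copy an environment and set nnUNet_compile=f.

    Assumption: nnU-Net treats this value as "do not use torch.compile".
    """
    env = dict(os.environ if base_env is None else base_env)
    env["nnUNet_compile"] = "f"
    return env


if __name__ == "__main__":
    # The two -L directories taken from the failing gcc command in the log above.
    dirs = [
        "/home/menghan/.conda/envs/nnUNet_new/lib/python3.12/site-packages/"
        "triton/backends/nvidia/lib",
        "/lib/x86_64-linux-gnu",
    ]
    result = find_libcuda(dirs)
    for directory, files in result.items():
        print(directory, "->", files or "no libcuda files")
    print("linker can resolve -lcuda:", linker_can_resolve_lcuda(result))
```

If only a versioned file turns up, the usual remedy is to make an unversioned `libcuda.so` visible to the linker. As a stopgap, the training command could be launched with the environment returned by `env_without_compile()`, for example through `subprocess.run(..., env=...)`.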

Could you please help me with these issues?

Thanks!

FabianIsensee assigned Karol-G Nov 19, 2024
