Issues finding the exact file to execute while using command from instruction #2576

Open
MenghanLiu212 opened this issue Oct 31, 2024 · 0 comments

Hi,

Thank you authors for the great work!
I'm running into two kinds of issues (on two computers), and I'll explain each one below.

Issue 1:
The first issue is a mismatch between the command (shortcut) and the actual file that gets executed. After setting up the torch environment in conda, I installed nnUNet with this command (the standardized baseline):
pip install nnunetv2
Then I tried to run the fingerprint extraction:
nnUNetv2_plan_and_preprocess -d DATASET_ID --verify_dataset_integrity
and got the error ModuleNotFoundError: No module named 'torch'
However, when I run
python -c "import torch; print(torch.__version__)"
it returns 2.5.1, which means PyTorch is installed.

Then I tried
python -m nnunetv2.experiment_planning.plan_and_preprocess_entrypoints -d DATASET_ID --verify_dataset_integrity
and it worked.
The same thing happens with the training command:
CUDA_VISIBLE_DEVICES=3 nnUNetv2_train 1 2d 3
does not work, but
CUDA_VISIBLE_DEVICES=3 python -m nnunetv2.run.run_training 1 2d 3
works.
I guess there could be a mismatch between the shortcuts and the files they actually run.
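One way to check this guess is to look at which Python interpreter the shortcut is wired to. The sketch below is only a diagnostic I would try, not something from the nnU-Net documentation. It assumes the shortcut is a pip-generated console script whose first line is a `#!` line with an absolute interpreter path; the helper names `shebang_interpreter` and `same_interpreter` are made up for illustration.

```python
import shutil
import sys
from pathlib import Path
from typing import Optional


def shebang_interpreter(script_path: str) -> Optional[str]:
    """Return the interpreter path from a script's '#!' line, or None."""
    with open(script_path, "rb") as fh:
        first_line = fh.readline().decode("utf-8", errors="replace").strip()
    if not first_line.startswith("#!"):
        return None
    # Keep only the first token; a line such as '#!/usr/bin/env python'
    # would therefore report '/usr/bin/env', which this sketch does not resolve.
    parts = first_line[2:].strip().split()
    return parts[0] if parts else None


def same_interpreter(a: str, b: str) -> bool:
    """Compare two interpreter paths after resolving symlinks."""
    return Path(a).resolve() == Path(b).resolve()


if __name__ == "__main__":
    command = "nnUNetv2_plan_and_preprocess"
    location = shutil.which(command)
    if location is None:
        print(f"{command} is not on PATH")
    else:
        interpreter = shebang_interpreter(location)
        print("shortcut location:   ", location)
        print("shortcut interpreter:", interpreter)
        print("current interpreter: ", sys.executable)
        if interpreter and not same_interpreter(interpreter, sys.executable):
            print("The shortcut runs a different Python than `python` does.")
```

If the two interpreters differ, the shortcut probably belongs to another environment (for example a user-level or base install) that has nnunetv2 but no torch, which would explain why `python -m ...` works while the shortcut does not.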

Issue 2:
On another computer, I installed nnUNet as an integrative framework:
git clone https://github.com/MIC-DKFZ/nnUNet.git
cd nnUNet
pip install -e .
PyTorch was recognized correctly, and I ran the fingerprint extraction using
nnUNetv2_plan_and_preprocess -d DATASET_ID --verify_dataset_integrity
It worked perfectly.
But when I run the training using the provided command
CUDA_VISIBLE_DEVICES=3 nnUNetv2_train 1 2d 3
or the command
CUDA_VISIBLE_DEVICES=3 python -m nnunetv2.run.run_training 1 2d 3
neither of them works, and I get the following error:
`CUDA_VISIBLE_DEVICES=3 python -m nnunetv2.run.run_training 1 2d 3

############################
INFO: You are using the old nnU-Net default plans. We have updated our recommendations. Please consider using those instead! Read more here: https://github.com/MIC-DKFZ/nnUNet/blob/master/documentation/resenc_presets.md
############################

Using device: cuda:0
/data/menghan/nnUNetFrame/nnUNet/nnunetv2/training/nnUNetTrainer/nnUNetTrainer.py:164: FutureWarning: torch.cuda.amp.GradScaler(args...) is deprecated. Please use torch.amp.GradScaler('cuda', args...) instead.
self.grad_scaler = GradScaler() if self.device.type == 'cuda' else None

#######################################################################
Please cite the following paper when using nnU-Net:
Isensee, F., Jaeger, P. F., Kohl, S. A., Petersen, J., & Maier-Hein, K. H. (2021). nnU-Net: a self-configuring method for deep learning-based biomedical image segmentation. Nature methods, 18(2), 203-211.
#######################################################################

2024-10-30 18:21:12.507816: do_dummy_2d_data_aug: False
2024-10-30 18:21:12.509523: Using splits from existing split file: /data/menghan/nnUNetFrame/dataset/nnUNet_preprocessed/Dataset001_PDGM/splits_final.json
2024-10-30 18:21:12.509829: The split file contains 5 splits.
2024-10-30 18:21:12.509867: Desired fold for training: 3
2024-10-30 18:21:12.509893: This split has 317 training and 79 validation cases.
using pin_memory on device 0
using pin_memory on device 0
2024-10-30 18:21:15.261662: Using torch.compile...
/home/menghan/.conda/envs/nnUNet_new/lib/python3.12/site-packages/torch/optim/lr_scheduler.py:62: UserWarning: The verbose parameter is deprecated. Please use get_last_lr() to access the learning rate.
warnings.warn(

This is the configuration used by this training:
Configuration name: 2d
{'data_identifier': 'nnUNetPlans_2d', 'preprocessor_name': 'DefaultPreprocessor', 'batch_size': 105, 'patch_size': [192, 160], 'median_image_size_in_voxels': [174.0, 137.0], 'spacing': [1.0, 1.0], 'normalization_schemes': ['ZScoreNormalization', 'ZScoreNormalization', 'ZScoreNormalization'], 'use_mask_for_norm': [True, True, True], 'resampling_fn_data': 'resample_data_or_seg_to_shape', 'resampling_fn_seg': 'resample_data_or_seg_to_shape', 'resampling_fn_data_kwargs': {'is_seg': False, 'order': 3, 'order_z': 0, 'force_separate_z': None}, 'resampling_fn_seg_kwargs': {'is_seg': True, 'order': 1, 'order_z': 0, 'force_separate_z': None}, 'resampling_fn_probabilities': 'resample_data_or_seg_to_shape', 'resampling_fn_probabilities_kwargs': {'is_seg': False, 'order': 1, 'order_z': 0, 'force_separate_z': None}, 'architecture': {'network_class_name': 'dynamic_network_architectures.architectures.unet.PlainConvUNet', 'arch_kwargs': {'n_stages': 6, 'features_per_stage': [32, 64, 128, 256, 512, 512], 'conv_op': 'torch.nn.modules.conv.Conv2d', 'kernel_sizes': [[3, 3], [3, 3], [3, 3], [3, 3], [3, 3], [3, 3]], 'strides': [[1, 1], [2, 2], [2, 2], [2, 2], [2, 2], [2, 2]], 'n_conv_per_stage': [2, 2, 2, 2, 2, 2], 'n_conv_per_stage_decoder': [2, 2, 2, 2, 2], 'conv_bias': True, 'norm_op': 'torch.nn.modules.instancenorm.InstanceNorm2d', 'norm_op_kwargs': {'eps': 1e-05, 'affine': True}, 'dropout_op': None, 'dropout_op_kwargs': None, 'nonlin': 'torch.nn.LeakyReLU', 'nonlin_kwargs': {'inplace': True}}, '_kw_requires_import': ['conv_op', 'norm_op', 'dropout_op', 'nonlin']}, 'batch_dice': True}

These are the global plan.json settings:
{'dataset_name': 'Dataset001_PDGM', 'plans_name': 'nnUNetPlans', 'original_median_spacing_after_transp': [1.0, 1.0, 1.0], 'original_median_shape_after_transp': [141, 174, 137], 'image_reader_writer': 'SimpleITKIO', 'transpose_forward': [0, 1, 2], 'transpose_backward': [0, 1, 2], 'experiment_planner_used': 'ExperimentPlanner', 'label_manager': 'LabelManager', 'foreground_intensity_properties_per_channel': {'0': {'max': 13284.6953125, 'mean': 1707.9705810546875, 'median': 1655.8569946289062, 'min': 0.0, 'percentile_00_5': 285.3706359863281, 'percentile_99_5': 4148.60693359375, 'std': 844.5523071289062}, '1': {'max': 18961.615234375, 'mean': 2843.3720703125, 'median': 2556.35791015625, 'min': 0.0, 'percentile_00_5': 382.8267822265625, 'percentile_99_5': 8660.498046875, 'std': 1415.30126953125}, '2': {'max': 7245.7900390625, 'mean': 1720.8861083984375, 'median': 1666.87939453125, 'min': 0.0, 'percentile_00_5': 443.10186767578125, 'percentile_99_5': 3492.31591796875, 'std': 563.9266357421875}}}

2024-10-30 18:21:15.923255: unpacking dataset...
2024-10-30 18:21:19.219472: unpacking done...
2024-10-30 18:21:19.224577: Unable to plot network architecture: nnUNet_compile is enabled!
2024-10-30 18:21:19.232440:
2024-10-30 18:21:19.232761: Epoch 0
2024-10-30 18:21:19.233011: Current learning rate: 0.01
/usr/bin/ld: cannot find -lcuda
collect2: error: ld returned 1 exit status
/usr/bin/ld: cannot find -lcuda
collect2: error: ld returned 1 exit status
Traceback (most recent call last):
File "/home/menghan/.conda/envs/nnUNet_new/lib/python3.12/site-packages/torch/_dynamo/output_graph.py", line 1446, in call_user_compiler
compiled_fn = compiler_fn(gm, self.example_inputs())
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/menghan/.conda/envs/nnUNet_new/lib/python3.12/site-packages/torch/_dynamo/repro/after_dynamo.py", line 129, in __call__
compiled_gm = compiler_fn(gm, example_inputs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/menghan/.conda/envs/nnUNet_new/lib/python3.12/site-packages/torch/__init__.py", line 2234, in __call__
return compile_fx(model_, inputs_, config_patches=self.config)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/menghan/.conda/envs/nnUNet_new/lib/python3.12/site-packages/torch/_inductor/compile_fx.py", line 1521, in compile_fx
return aot_autograd(
^^^^^^^^^^^^^
File "/home/menghan/.conda/envs/nnUNet_new/lib/python3.12/site-packages/torch/_dynamo/backends/common.py", line 72, in __call__
cg = aot_module_simplified(gm, example_inputs, **self.kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/menghan/.conda/envs/nnUNet_new/lib/python3.12/site-packages/torch/_functorch/aot_autograd.py", line 1071, in aot_module_simplified
compiled_fn = dispatch_and_compile()
^^^^^^^^^^^^^^^^^^^^^^
File "/home/menghan/.conda/envs/nnUNet_new/lib/python3.12/site-packages/torch/_functorch/aot_autograd.py", line 1056, in dispatch_and_compile
compiled_fn, _ = create_aot_dispatcher_function(
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/menghan/.conda/envs/nnUNet_new/lib/python3.12/site-packages/torch/_functorch/aot_autograd.py", line 522, in create_aot_dispatcher_function
return _create_aot_dispatcher_function(
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/menghan/.conda/envs/nnUNet_new/lib/python3.12/site-packages/torch/_functorch/aot_autograd.py", line 759, in _create_aot_dispatcher_function
compiled_fn, fw_metadata = compiler_fn(
^^^^^^^^^^^^
File "/home/menghan/.conda/envs/nnUNet_new/lib/python3.12/site-packages/torch/_functorch/_aot_autograd/jit_compile_runtime_wrappers.py", line 588, in aot_dispatch_autograd
compiled_fw_func = aot_config.fw_compiler(fw_module, adjusted_flat_args)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/menghan/.conda/envs/nnUNet_new/lib/python3.12/site-packages/torch/_inductor/compile_fx.py", line 1350, in fw_compiler_base
return _fw_compiler_base(model, example_inputs, is_inference)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/menghan/.conda/envs/nnUNet_new/lib/python3.12/site-packages/torch/_inductor/compile_fx.py", line 1421, in _fw_compiler_base
return inner_compile(
^^^^^^^^^^^^^^
File "/home/menghan/.conda/envs/nnUNet_new/lib/python3.12/site-packages/torch/_inductor/compile_fx.py", line 475, in compile_fx_inner
return wrap_compiler_debug(_compile_fx_inner, compiler_name="inductor")(
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/menghan/.conda/envs/nnUNet_new/lib/python3.12/site-packages/torch/_dynamo/repro/after_aot.py", line 85, in debug_wrapper
inner_compiled_fn = compiler_fn(gm, example_inputs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/menghan/.conda/envs/nnUNet_new/lib/python3.12/site-packages/torch/_inductor/compile_fx.py", line 661, in _compile_fx_inner
compiled_graph = FxGraphCache.load(
^^^^^^^^^^^^^^^^^^
File "/home/menghan/.conda/envs/nnUNet_new/lib/python3.12/site-packages/torch/_inductor/codecache.py", line 1334, in load
compiled_graph = compile_fx_fn(
^^^^^^^^^^^^^^
File "/home/menghan/.conda/envs/nnUNet_new/lib/python3.12/site-packages/torch/_inductor/compile_fx.py", line 570, in codegen_and_compile
compiled_graph = fx_codegen_and_compile(gm, example_inputs, **fx_kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/menghan/.conda/envs/nnUNet_new/lib/python3.12/site-packages/torch/_inductor/compile_fx.py", line 878, in fx_codegen_and_compile
compiled_fn = graph.compile_to_fn()
^^^^^^^^^^^^^^^^^^^^^
File "/home/menghan/.conda/envs/nnUNet_new/lib/python3.12/site-packages/torch/_inductor/graph.py", line 1913, in compile_to_fn
return self.compile_to_module().call
^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/menghan/.conda/envs/nnUNet_new/lib/python3.12/site-packages/torch/_inductor/graph.py", line 1839, in compile_to_module
return self._compile_to_module()
^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/menghan/.conda/envs/nnUNet_new/lib/python3.12/site-packages/torch/_inductor/graph.py", line 1845, in _compile_to_module
self.codegen_with_cpp_wrapper() if self.cpp_wrapper else self.codegen()
^^^^^^^^^^^^^^
File "/home/menghan/.conda/envs/nnUNet_new/lib/python3.12/site-packages/torch/_inductor/graph.py", line 1784, in codegen
self.scheduler.codegen()
File "/home/menghan/.conda/envs/nnUNet_new/lib/python3.12/site-packages/torch/_inductor/scheduler.py", line 3383, in codegen
return self._codegen()
^^^^^^^^^^^^^^^
File "/home/menghan/.conda/envs/nnUNet_new/lib/python3.12/site-packages/torch/_inductor/scheduler.py", line 3461, in _codegen
self.get_backend(device).codegen_node(node)
File "/home/menghan/.conda/envs/nnUNet_new/lib/python3.12/site-packages/torch/_inductor/codegen/cuda_combined_scheduling.py", line 80, in codegen_node
return self._triton_scheduling.codegen_node(node)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/menghan/.conda/envs/nnUNet_new/lib/python3.12/site-packages/torch/_inductor/codegen/simd.py", line 1155, in codegen_node
return self.codegen_node_schedule(node_schedule, buf_accesses, numel, rnumel)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/menghan/.conda/envs/nnUNet_new/lib/python3.12/site-packages/torch/_inductor/codegen/simd.py", line 1364, in codegen_node_schedule
src_code = kernel.codegen_kernel()
^^^^^^^^^^^^^^^^^^^^^^^
File "/home/menghan/.conda/envs/nnUNet_new/lib/python3.12/site-packages/torch/_inductor/codegen/triton.py", line 2661, in codegen_kernel
**self.inductor_meta_common(),
^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/menghan/.conda/envs/nnUNet_new/lib/python3.12/site-packages/torch/_inductor/codegen/triton.py", line 2532, in inductor_meta_common
"backend_hash": torch.utils._triton.triton_hash_with_backend(),
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/menghan/.conda/envs/nnUNet_new/lib/python3.12/site-packages/torch/utils/_triton.py", line 53, in triton_hash_with_backend
backend = triton_backend()
^^^^^^^^^^^^^^^^
File "/home/menghan/.conda/envs/nnUNet_new/lib/python3.12/site-packages/torch/utils/_triton.py", line 45, in triton_backend
target = driver.active.get_current_target()
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/menghan/.conda/envs/nnUNet_new/lib/python3.12/site-packages/triton/runtime/driver.py", line 23, in __getattr__
self._initialize_obj()
File "/home/menghan/.conda/envs/nnUNet_new/lib/python3.12/site-packages/triton/runtime/driver.py", line 20, in _initialize_obj
self._obj = self._init_fn()
^^^^^^^^^^^^^^^
File "/home/menghan/.conda/envs/nnUNet_new/lib/python3.12/site-packages/triton/runtime/driver.py", line 9, in _create_driver
return actives[0]()
^^^^^^^^^^^^
File "/home/menghan/.conda/envs/nnUNet_new/lib/python3.12/site-packages/triton/backends/nvidia/driver.py", line 371, in __init__
self.utils = CudaUtils() # TODO: make static
^^^^^^^^^^^
File "/home/menghan/.conda/envs/nnUNet_new/lib/python3.12/site-packages/triton/backends/nvidia/driver.py", line 80, in __init__
mod = compile_module_from_src(Path(os.path.join(dirname, "driver.c")).read_text(), "cuda_utils")
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/menghan/.conda/envs/nnUNet_new/lib/python3.12/site-packages/triton/backends/nvidia/driver.py", line 57, in compile_module_from_src
so = _build(name, src_path, tmpdir, library_dirs(), include_dir, libraries)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/menghan/.conda/envs/nnUNet_new/lib/python3.12/site-packages/triton/runtime/build.py", line 48, in _build
ret = subprocess.check_call(cc_cmd)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/menghan/.conda/envs/nnUNet_new/lib/python3.12/subprocess.py", line 413, in check_call
raise CalledProcessError(retcode, cmd)
subprocess.CalledProcessError: Command '['/usr/bin/gcc', '/tmp/tmp13q643ul/main.c', '-O3', '-shared', '-fPIC', '-o', '/tmp/tmp13q643ul/cuda_utils.cpython-312-x86_64-linux-gnu.so', '-lcuda', '-L/home/menghan/.conda/envs/nnUNet_new/lib/python3.12/site-packages/triton/backends/nvidia/lib', '-L/lib/x86_64-linux-gnu', '-I/home/menghan/.conda/envs/nnUNet_new/lib/python3.12/site-packages/triton/backends/nvidia/include', '-I/tmp/tmp13q643ul', '-I/home/menghan/.conda/envs/nnUNet_new/include/python3.12']' returned non-zero exit status 1.

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
File "<frozen runpy>", line 198, in _run_module_as_main
File "<frozen runpy>", line 88, in _run_code
File "/data/menghan/nnUNetFrame/nnUNet/nnunetv2/run/run_training.py", line 285, in <module>
run_training_entry()
File "/data/menghan/nnUNetFrame/nnUNet/nnunetv2/run/run_training.py", line 275, in run_training_entry
run_training(args.dataset_name_or_id, args.configuration, args.fold, args.tr, args.p, args.pretrained_weights,
File "/data/menghan/nnUNetFrame/nnUNet/nnunetv2/run/run_training.py", line 211, in run_training
nnunet_trainer.run_training()
File "/data/menghan/nnUNetFrame/nnUNet/nnunetv2/training/nnUNetTrainer/nnUNetTrainer.py", line 1370, in run_training
train_outputs.append(self.train_step(next(self.dataloader_train)))
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/data/menghan/nnUNetFrame/nnUNet/nnunetv2/training/nnUNetTrainer/nnUNetTrainer.py", line 994, in train_step
output = self.network(data)
^^^^^^^^^^^^^^^^^^
File "/home/menghan/.conda/envs/nnUNet_new/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1736, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/menghan/.conda/envs/nnUNet_new/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1747, in _call_impl
return forward_call(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/menghan/.conda/envs/nnUNet_new/lib/python3.12/site-packages/torch/_dynamo/eval_frame.py", line 465, in _fn
return fn(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^
File "/home/menghan/.conda/envs/nnUNet_new/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1736, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/menghan/.conda/envs/nnUNet_new/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1747, in _call_impl
return forward_call(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/menghan/.conda/envs/nnUNet_new/lib/python3.12/site-packages/torch/_dynamo/convert_frame.py", line 1269, in __call__
return self._torchdynamo_orig_callable(
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/menghan/.conda/envs/nnUNet_new/lib/python3.12/site-packages/torch/_dynamo/convert_frame.py", line 1064, in __call__
result = self._inner_convert(
^^^^^^^^^^^^^^^^^^^^
File "/home/menghan/.conda/envs/nnUNet_new/lib/python3.12/site-packages/torch/_dynamo/convert_frame.py", line 526, in __call__
return _compile(
^^^^^^^^^
File "/home/menghan/.conda/envs/nnUNet_new/lib/python3.12/site-packages/torch/_dynamo/convert_frame.py", line 924, in _compile
guarded_code = compile_inner(code, one_graph, hooks, transform)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/menghan/.conda/envs/nnUNet_new/lib/python3.12/site-packages/torch/_dynamo/convert_frame.py", line 666, in compile_inner
return _compile_inner(code, one_graph, hooks, transform)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/menghan/.conda/envs/nnUNet_new/lib/python3.12/site-packages/torch/_utils_internal.py", line 87, in wrapper_function
return function(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/menghan/.conda/envs/nnUNet_new/lib/python3.12/site-packages/torch/_dynamo/convert_frame.py", line 699, in _compile_inner
out_code = transform_code_object(code, transform)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/menghan/.conda/envs/nnUNet_new/lib/python3.12/site-packages/torch/_dynamo/bytecode_transformation.py", line 1322, in transform_code_object
transformations(instructions, code_options)
File "/home/menghan/.conda/envs/nnUNet_new/lib/python3.12/site-packages/torch/_dynamo/convert_frame.py", line 219, in _fn
return fn(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^
File "/home/menghan/.conda/envs/nnUNet_new/lib/python3.12/site-packages/torch/_dynamo/convert_frame.py", line 634, in transform
tracer.run()
File "/home/menghan/.conda/envs/nnUNet_new/lib/python3.12/site-packages/torch/_dynamo/symbolic_convert.py", line 2796, in run
super().run()
File "/home/menghan/.conda/envs/nnUNet_new/lib/python3.12/site-packages/torch/_dynamo/symbolic_convert.py", line 983, in run
while self.step():
^^^^^^^^^^^
File "/home/menghan/.conda/envs/nnUNet_new/lib/python3.12/site-packages/torch/_dynamo/symbolic_convert.py", line 895, in step
self.dispatch_table[inst.opcode](self, inst)
File "/home/menghan/.conda/envs/nnUNet_new/lib/python3.12/site-packages/torch/_dynamo/symbolic_convert.py", line 2987, in RETURN_VALUE
self._return(inst)
File "/home/menghan/.conda/envs/nnUNet_new/lib/python3.12/site-packages/torch/_dynamo/symbolic_convert.py", line 2972, in _return
self.output.compile_subgraph(
File "/home/menghan/.conda/envs/nnUNet_new/lib/python3.12/site-packages/torch/_dynamo/output_graph.py", line 1142, in compile_subgraph
self.compile_and_call_fx_graph(tx, pass2.graph_output_vars(), root)
File "/home/menghan/.conda/envs/nnUNet_new/lib/python3.12/site-packages/torch/_dynamo/output_graph.py", line 1369, in compile_and_call_fx_graph
compiled_fn = self.call_user_compiler(gm)
^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/menghan/.conda/envs/nnUNet_new/lib/python3.12/site-packages/torch/_dynamo/output_graph.py", line 1416, in call_user_compiler
return self._call_user_compiler(gm)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/menghan/.conda/envs/nnUNet_new/lib/python3.12/site-packages/torch/_dynamo/output_graph.py", line 1465, in _call_user_compiler
raise BackendCompilerFailed(self.compiler_fn, e) from e
torch._dynamo.exc.BackendCompilerFailed: backend='inductor' raised:
CalledProcessError: Command '['/usr/bin/gcc', '/tmp/tmp13q643ul/main.c', '-O3', '-shared', '-fPIC', '-o', '/tmp/tmp13q643ul/cuda_utils.cpython-312-x86_64-linux-gnu.so', '-lcuda', '-L/home/menghan/.conda/envs/nnUNet_new/lib/python3.12/site-packages/triton/backends/nvidia/lib', '-L/lib/x86_64-linux-gnu', '-I/home/menghan/.conda/envs/nnUNet_new/lib/python3.12/site-packages/triton/backends/nvidia/include', '-I/tmp/tmp13q643ul', '-I/home/menghan/.conda/envs/nnUNet_new/include/python3.12']' returned non-zero exit status 1.

Set TORCH_LOGS="+dynamo" and TORCHDYNAMO_VERBOSE=1 for more information

You can suppress this exception and fall back to eager by setting:
import torch._dynamo
torch._dynamo.config.suppress_errors = True

Exception in thread Thread-2 (results_loop):
Traceback (most recent call last):
File "/home/menghan/.conda/envs/nnUNet_new/lib/python3.12/threading.py", line 1075, in _bootstrap_inner
self.run()
File "/home/menghan/.conda/envs/nnUNet_new/lib/python3.12/threading.py", line 1012, in run
self._target(*self._args, **self._kwargs)
File "/home/menghan/.conda/envs/nnUNet_new/lib/python3.12/site-packages/batchgenerators/dataloading/nondet_multi_threaded_augmenter.py", line 125, in results_loop
raise e
File "/home/menghan/.conda/envs/nnUNet_new/lib/python3.12/site-packages/batchgenerators/dataloading/nondet_multi_threaded_augmenter.py", line 103, in results_loop
raise RuntimeError("One or more background workers are no longer alive. Exiting. Please check the "
RuntimeError: One or more background workers are no longer alive. Exiting. Please check the print statements above for the actual error message`
It seems the linker cannot find -lcuda (the CUDA driver library), and I don't know how to deal with that.
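To narrow this down, it may help to see where libcuda actually lives on this machine. With `-lcuda`, the linker looks for a file named exactly `libcuda.so` (or `libcuda.a`) in the `-L` directories, so a versioned `libcuda.so.1` on its own is not enough. The sketch below scans a few directories for that file. The first two directories come from the failing gcc command in the log; the remaining ones are guesses about common driver locations and may not exist here. The helper names `find_libcuda` and `linkable_dirs` are made up for illustration.

```python
import glob
import os
from typing import Dict, List

# The first two entries are the -L directories from the failing gcc command.
# The rest are assumptions about where a driver install might put libcuda.
CANDIDATE_DIRS = [
    "/home/menghan/.conda/envs/nnUNet_new/lib/python3.12/site-packages/triton/backends/nvidia/lib",
    "/lib/x86_64-linux-gnu",
    "/usr/lib/x86_64-linux-gnu",
    "/usr/lib64",
    "/usr/local/cuda/lib64/stubs",
]


def find_libcuda(search_dirs: List[str]) -> Dict[str, List[str]]:
    """Map each directory to the libcuda.so* files found directly inside it."""
    found = {}
    for directory in search_dirs:
        matches = sorted(glob.glob(os.path.join(directory, "libcuda.so*")))
        if matches:
            found[directory] = matches
    return found


def linkable_dirs(found: Dict[str, List[str]]) -> List[str]:
    """Directories that contain the unversioned libcuda.so that -lcuda needs."""
    return [
        directory
        for directory, files in found.items()
        if any(os.path.basename(f) == "libcuda.so" for f in files)
    ]


if __name__ == "__main__":
    found = find_libcuda(CANDIDATE_DIRS)
    for directory, files in found.items():
        print(directory)
        for f in files:
            print("   ", os.path.basename(f))
    usable = linkable_dirs(found)
    if usable:
        print("libcuda.so is linkable from:", usable)
    else:
        print("No unversioned libcuda.so found in the scanned directories.")
```

If only `libcuda.so.1` turns up, or `libcuda.so` sits in a directory that is not among the `-L` paths in the gcc command, that would match the `cannot find -lcuda` message. Separately, the log mentions `nnUNet_compile`; as far as I understand, nnU-Net reads this environment variable to decide whether to use torch.compile, so running training with `nnUNet_compile=f` may sidestep the Triton build while the linker problem is sorted out. I have not confirmed this against the current source.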

Could you please help me with these issues?

Thanks!
