torch.distributed.elastic.multiprocessing.errors.ChildFailedError #173

MyGitHub-G opened this issue Oct 3, 2024 · 6 comments

@MyGitHub-G

Hello, when I run sh tools/dist_train.sh projects/configs/co_dino_vit/co_dino_5scale_vit_large_coco.py 1, I get the error shown below. I see that others have reported this error before. How can it be solved? Thanks.
image

@MyGitHub-G
Author

My environment:

  • python 3.7.16
  • torch 1.10.0
  • torchvision 0.10.0
  • mmcv-full 1.6.1 torch1.10

@syxkk

syxkk commented Oct 9, 2024

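Hello, when I run sh tools/dist_train.sh projects/configs/co_dino_vit/co_dino_5scale_vit_large_coco.py 1, I get the error shown below. I see that others have reported this error before. How can it be solved? Thanks. image

Have you solved this? I am hitting the same error.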

@TempleX98
Collaborator

@MyGitHub-G From the error in the screenshot, it looks like the work-dir argument was not specified.

@syxkk

syxkk commented Oct 9, 2024

@MyGitHub-G From the error in the screenshot, it looks like the work-dir argument was not specified.

image
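Hello. My case is actually different from theirs: mine fails on the last batch of evaluation, after the first training epoch has finished.
I ran the following command:
sh tools/dist_train.sh projects/configs/co_deformable_detr/co_deformable_detr_r50_1x_coco.py 2 path_to_exp
Environment: python=3.7.11, pytorch=1.11.0, cuda=11.3, mmcv-full=1.5.0; the GPU is an A100 40GB.
Looking forward to your reply.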

@MyGitHub-G
Author

@MyGitHub-G From the error in the screenshot, it looks like the work-dir argument was not specified.

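I don't think a missing work-dir argument is the cause: when it is not specified, the code creates a default directory. This looks like a problem with the program run itself. If I skip the sh launcher and run train.py directly, it fails with a core dump.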
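
For reference, the default-directory behaviour mentioned above can be sketched as follows. This is a minimal illustration that assumes this repository's tools/train.py follows the usual MMDetection pattern of falling back to ./work_dirs/<config file name> when no work-dir is passed; the helper name resolve_work_dir is hypothetical and does not exist in the codebase.

```python
import os.path as osp


def resolve_work_dir(config_path, work_dir=None):
    """Hypothetical helper: pick the directory for logs and checkpoints.

    Assumption: an explicit work_dir wins; otherwise the directory is
    derived from the config file name under ./work_dirs.
    """
    if work_dir is not None:
        return work_dir
    # Strip the directory part and the ".py" extension from the config path.
    stem = osp.splitext(osp.basename(config_path))[0]
    return osp.join('./work_dirs', stem)


config = 'projects/configs/co_dino_vit/co_dino_5scale_vit_large_coco.py'
print(resolve_work_dir(config))                 # derived default
print(resolve_work_dir(config, 'path_to_exp'))  # explicit value is kept
```

If the fallback works this way, omitting the argument would only change where outputs are written and would not by itself abort the run.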

@adjawdka

adjawdka commented Jan 4, 2025

Have you solved this? In my case, the first training epoch finishes and the evaluation that follows fails with this error:

  • mmdet - INFO - Saving checkpoint at 1 epochs
    [ ] 0/5000, elapsed: 0s, ETA:
    Traceback (most recent call last):
    File "/home/Co-DETR-main/tools/train.py", line 245, in <module>
    main()
    File "/home/Co-DETR-main/tools/train.py", line 234, in main
    train_detector(
    File "/home/Co-DETR-main/mmdet/apis/train.py", line 245, in train_detector
    runner.run(data_loaders, cfg.workflow)
    File "/home/anaconda3/lib/python3.9/site-packages/mmcv/runner/epoch_based_runner.py", line 127, in run
    epoch_runner(data_loaders[i], **kwargs)
    File "/home/anaconda3/lib/python3.9/site-packages/mmcv/runner/epoch_based_runner.py", line 54, in train
    self.call_hook('after_train_epoch')
    File "/home/anaconda3/lib/python3.9/site-packages/mmcv/runner/base_runner.py", line 309, in call_hook
    getattr(hook, fn_name)(self)
    File "/home/anaconda3/lib/python3.9/site-packages/mmcv/runner/hooks/evaluation.py", line 267, in after_train_epoch
    self._do_evaluate(runner)
    File "/home/Co-DETR-main/mmdet/core/evaluation/eval_hooks.py", line 126, in _do_evaluate
    results = multi_gpu_test(
    File "/home/Co-DETR-main/mmdet/apis/test.py", line 109, in multi_gpu_test
    result = model(return_loss=False, rescale=True, **data)
    File "/home/anaconda3/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1190, in _call_impl
    return forward_call(*input, **kwargs)
    File "/home/anaconda3/lib/python3.9/site-packages/torch/nn/parallel/distributed.py", line 1040, in forward
    output = self._run_ddp_forward(*inputs, **kwargs)
    File "/home/anaconda3/lib/python3.9/site-packages/torch/nn/parallel/distributed.py", line 1000, in _run_ddp_forward
    return module_to_run(*inputs[0], **kwargs[0])
    File "/home/anaconda3/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1190, in _call_impl
    return forward_call(*input, **kwargs)
    File "/home/anaconda3/lib/python3.9/site-packages/mmcv/runner/fp16_utils.py", line 110, in new_func
    return old_func(*args, **kwargs)
    File "/home/Co-DETR-main/mmdet/models/detectors/base.py", line 174, in forward
    return self.forward_test(img, img_metas, **kwargs)
    File "/home/Co-DETR-main/mmdet/models/detectors/base.py", line 137, in forward_test
    img_meta[img_id]['batch_input_shape'] = tuple(img.size()[-2:])
    TypeError: 'DataContainer' object is not subscriptable
    ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 1) local_rank: 0 (pid: 2163076) of binary: /home/anaconda3/bin/python
    Traceback (most recent call last):
    File "/home/anaconda3/lib/python3.9/runpy.py", line 197, in _run_module_as_main
    return _run_code(code, main_globals, None,
    File "/home/anaconda3/lib/python3.9/runpy.py", line 87, in _run_code
    exec(code, run_globals)
    File "/home/anaconda3/lib/python3.9/site-packages/torch/distributed/launch.py", line 195, in <module>
    main()
    File "/home/anaconda3/lib/python3.9/site-packages/torch/distributed/launch.py", line 191, in main
    launch(args)
    File "/home/anaconda3/lib/python3.9/site-packages/torch/distributed/launch.py", line 176, in launch
    run(args)
    File "/home/anaconda3/lib/python3.9/site-packages/torch/distributed/run.py", line 753, in run
    elastic_launch(
    File "/home/anaconda3/lib/python3.9/site-packages/torch/distributed/launcher/api.py", line 132, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
    File "/home/anaconda3/lib/python3.9/site-packages/torch/distributed/launcher/api.py", line 246, in launch_agent
    raise ChildFailedError(
    torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
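
The ChildFailedError at the end only reports that a worker process exited; the underlying failure in this log is the earlier TypeError raised in forward_test. The sketch below reproduces that TypeError in isolation. It is not a confirmed fix for this issue: the DataContainer class here is a hypothetical stand-in that only mimics the .data attribute of mmcv's DataContainer, and unwrap is an illustrative helper, not part of mmcv or Co-DETR.

```python
class DataContainer:
    """Hypothetical stand-in: wraps a payload and exposes it via .data."""

    def __init__(self, data):
        self._data = data

    @property
    def data(self):
        return self._data


def unwrap(obj):
    """Return the wrapped payload for a DataContainer, else obj unchanged."""
    return obj.data if isinstance(obj, DataContainer) else obj


# A still-wrapped img_meta, as the traceback suggests reached forward_test.
img_meta = DataContainer([{'filename': 'a.jpg'}])

error_message = None
try:
    # Same access pattern as mmdet/models/detectors/base.py line 137.
    img_meta[0]['batch_input_shape'] = (800, 1333)
except TypeError as exc:
    error_message = str(exc)
print(error_message)

# Once the payload is unwrapped, the same assignment succeeds.
metas = unwrap(img_meta)
metas[0]['batch_input_shape'] = (800, 1333)
print(metas[0])
```

This suggests the evaluation batch reached the model without its DataContainer wrappers being removed. Why that happened in this environment is not established in the thread.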
