When I want to run examples/distributed_training.py with the command below (I copied the script to my server and only changed the dataset directory):
python -m torch.distributed.launch --nproc_per_node=2 distributed_training.py --launcher pytorch
I get this crash:
/public/home/three_whyz123/yhfu/hliu/software/miniconda3/envs/pytorch2.0/lib/python3.8/site-packages/torch/distributed/launch.py:181: FutureWarning: The module torch.distributed.launch is deprecated
and will be removed in future. Use torchrun.
Note that --use-env is set by default in torchrun.
If your script expects `--local-rank` argument to be set, please
change it to read from `os.environ['LOCAL_RANK']` instead. See
https://pytorch.org/docs/stable/distributed.html#launch-utility for
further instructions
warnings.warn(
WARNING:torch.distributed.run:
*****************************************
Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed.
*****************************************
usage: distributed_training.py [-h] [--launcher {none,pytorch,slurm,mpi}] [--local_rank LOCAL_RANK]
distributed_training.py: error: unrecognized arguments: --local-rank=1
usage: distributed_training.py [-h] [--launcher {none,pytorch,slurm,mpi}] [--local_rank LOCAL_RANK]
distributed_training.py: error: unrecognized arguments: --local-rank=0
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 2) local_rank: 0 (pid: 20683) of binary: /public/home/three_whyz123/yhfu/hliu/software/miniconda3/envs/pytorch2.0/bin/python
Traceback (most recent call last):
File "/public/home/three_whyz123/yhfu/hliu/software/miniconda3/envs/pytorch2.0/lib/python3.8/runpy.py", line 194, in _run_module_as_main
return _run_code(code, main_globals, None,
File "/public/home/three_whyz123/yhfu/hliu/software/miniconda3/envs/pytorch2.0/lib/python3.8/runpy.py", line 87, in _run_code
exec(code, run_globals)
File "/public/home/three_whyz123/yhfu/hliu/software/miniconda3/envs/pytorch2.0/lib/python3.8/site-packages/torch/distributed/launch.py", line 196, in <module>
main()
File "/public/home/three_whyz123/yhfu/hliu/software/miniconda3/envs/pytorch2.0/lib/python3.8/site-packages/torch/distributed/launch.py", line 192, in main
launch(args)
File "/public/home/three_whyz123/yhfu/hliu/software/miniconda3/envs/pytorch2.0/lib/python3.8/site-packages/torch/distributed/launch.py", line 177, in launch
run(args)
File "/public/home/three_whyz123/yhfu/hliu/software/miniconda3/envs/pytorch2.0/lib/python3.8/site-packages/torch/distributed/run.py", line 785, in run
elastic_launch(
File "/public/home/three_whyz123/yhfu/hliu/software/miniconda3/envs/pytorch2.0/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 134, in __call__
return launch_agent(self._config, self._entrypoint, list(args))
File "/public/home/three_whyz123/yhfu/hliu/software/miniconda3/envs/pytorch2.0/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 250, in launch_agent
raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
============================================================
distributed_training.py FAILED
------------------------------------------------------------
Failures:
[1]:
time : 2023-08-20_11:50:39
host : gpu1
rank : 1 (local_rank: 1)
exitcode : 2 (pid: 20684)
error_file: <N/A>
traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
------------------------------------------------------------
Root Cause (first observed failure):
[0]:
time : 2023-08-20_11:50:39
host : gpu1
rank : 0 (local_rank: 0)
exitcode : 2 (pid: 20683)
error_file: <N/A>
traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
============================================================
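For reference, the first failure is the launcher passing --local-rank=N (the dash spelling that PyTorch 2.0's torch.distributed.launch uses) to a script whose parser only knows --local_rank. Below is a minimal, hypothetical sketch of argument handling that tolerates both spellings and falls back to the LOCAL_RANK environment variable, as the deprecation warning recommends; it is not the actual example script, and everything beyond the --launcher/--local_rank options shown in the usage line is an assumption:

import argparse
import os

def parse_args():
    parser = argparse.ArgumentParser(description='distributed training example')
    parser.add_argument(
        '--launcher',
        choices=['none', 'pytorch', 'slurm', 'mpi'],
        default='none',
        help='job launcher')
    # PyTorch >= 2.0's torch.distributed.launch passes --local-rank=N,
    # while older releases passed --local_rank=N; accept both spellings.
    parser.add_argument('--local-rank', '--local_rank', type=int, default=0)
    args = parser.parse_args()
    # torchrun passes no flag at all and only sets LOCAL_RANK, so make sure
    # the environment variable is populated either way.
    if 'LOCAL_RANK' not in os.environ:
        os.environ['LOCAL_RANK'] = str(args.local_rank)
    return args

With a change like that, either launcher spelling should be accepted; alternatively, the torchrun form the warning points to (something like torchrun --nproc_per_node=2 distributed_training.py --launcher pytorch) only sets LOCAL_RANK in the environment and passes no --local-rank flag at all.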
Maybe distributed_training.py doesn't have the --local-rank argument, so I changed local_rank to local-rank in the script, and then a new crash appeared. My environment information is included in the log below:
/public/home/three_whyz123/yhfu/hliu/software/miniconda3/envs/pytorch2.0/lib/python3.8/site-packages/torch/distributed/launch.py:181: FutureWarning: The module torch.distributed.launch is deprecated
and will be removed in future. Use torchrun.
Note that --use-env is set by default in torchrun.
If your script expects `--local-rank` argument to be set, please
change it to read from `os.environ['LOCAL_RANK']` instead. See
https://pytorch.org/docs/stable/distributed.html#launch-utility for
further instructions
warnings.warn(
WARNING:torch.distributed.run:
*****************************************
Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed.
*****************************************
Files already downloaded and verified
Files already downloaded and verified
Files already downloaded and verified
Files already downloaded and verified
/public/home/three_whyz123/yhfu/hliu/software/miniconda3/envs/pytorch2.0/lib/python3.8/site-packages/mmengine/utils/dl_utils/setup_env.py:56: UserWarning: Setting MKL_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed.
warnings.warn(
/public/home/three_whyz123/yhfu/hliu/software/miniconda3/envs/pytorch2.0/lib/python3.8/site-packages/mmengine/utils/dl_utils/setup_env.py:56: UserWarning: Setting MKL_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed.
warnings.warn(
08/20 13:16:54 - mmengine - INFO -
------------------------------------------------------------
System environment:
sys.platform: linux
Python: 3.8.16 | packaged by conda-forge | (default, Feb 1 2023, 16:01:55) [GCC 11.3.0]
CUDA available: True
numpy_random_seed: 537005317
GPU 0,1: NVIDIA A100-PCIE-40GB
CUDA_HOME: /usr/local/cuda
NVCC: Cuda compilation tools, release 11.3, V11.3.109
GCC: gcc (GCC) 4.8.5
PyTorch: 2.0.1+cu117
PyTorch compiling details: PyTorch built with:
- GCC 9.3
- C++ Version: 201703
- Intel(R) oneAPI Math Kernel Library Version 2022.2-Product Build 20220804 for Intel(R) 64 architecture applications
- Intel(R) MKL-DNN v2.7.3 (Git Hash 6dbeffbae1f23cbbeae17adb7b5b13f1f37c080e)
- OpenMP 201511 (a.k.a. OpenMP 4.5)
- LAPACK is enabled (usually provided by MKL)
- NNPACK is enabled
- CPU capability usage: AVX2
- CUDA Runtime 11.7
- NVCC architecture flags: -gencode;arch=compute_37,code=sm_37;-gencode;arch=compute_50,code=sm_50;-gencode;arch=compute_60,code=sm_60;-gencode;arch=compute_70,code=sm_70;-gencode;arch=compute_75,code=sm_75;-gencode;arch=compute_80,code=sm_80;-gencode;arch=compute_86,code=sm_86
- CuDNN 8.5
- Magma 2.6.1
- Build settings: BLAS_INFO=mkl, BUILD_TYPE=Release, CUDA_VERSION=11.7, CUDNN_VERSION=8.5.0, CXX_COMPILER=/opt/rh/devtoolset-9/root/usr/bin/c++, CXX_FLAGS= -D_GLIBCXX_USE_CXX11_ABI=0 -fabi-version=11 -Wno-deprecated -fvisibility-inlines-hidden -DUSE_PTHREADPOOL -DNDEBUG -DUSE_KINETO -DLIBKINETO_NOROCTRACER -DUSE_FBGEMM -DUSE_QNNPACK -DUSE_PYTORCH_QNNPACK -DUSE_XNNPACK -DSYMBOLICATE_MOBILE_DEBUG_HANDLE -O2 -fPIC -Wall -Wextra -Werror=return-type -Werror=non-virtual-dtor -Werror=bool-operation -Wnarrowing -Wno-missing-field-initializers -Wno-type-limits -Wno-array-bounds -Wno-unknown-pragmas -Wunused-local-typedefs -Wno-unused-parameter -Wno-unused-function -Wno-unused-result -Wno-strict-overflow -Wno-strict-aliasing -Wno-error=deprecated-declarations -Wno-stringop-overflow -Wno-psabi -Wno-error=pedantic -Wno-error=redundant-decls -Wno-error=old-style-cast -fdiagnostics-color=always -faligned-new -Wno-unused-but-set-variable -Wno-maybe-uninitialized -fno-math-errno -fno-trapping-math -Werror=format -Werror=cast-function-type -Wno-stringop-overflow, LAPACK_INFO=mkl, PERF_WITH_AVX=1, PERF_WITH_AVX2=1, PERF_WITH_AVX512=1, TORCH_DISABLE_GPU_ASSERTS=ON, TORCH_VERSION=2.0.1, USE_CUDA=ON, USE_CUDNN=ON, USE_EXCEPTION_PTR=1, USE_GFLAGS=OFF, USE_GLOG=OFF, USE_MKL=ON, USE_MKLDNN=ON, USE_MPI=OFF, USE_NCCL=1, USE_NNPACK=ON, USE_OPENMP=ON, USE_ROCM=OFF,
TorchVision: 0.15.2+cu117
OpenCV: 4.7.0
MMEngine: 0.8.4
Runtime environment:
dist_cfg: {'backend': 'nccl'}
seed: 537005317
Distributed launcher: pytorch
Distributed training: True
GPU number: 2
------------------------------------------------------------
08/20 13:16:54 - mmengine - INFO - Hooks will be executed in the following order:
before_run:
(VERY_HIGH ) RuntimeInfoHook
(BELOW_NORMAL) LoggerHook
--------------------
before_train:
(VERY_HIGH ) RuntimeInfoHook
(NORMAL ) IterTimerHook
(VERY_LOW ) CheckpointHook
--------------------
before_train_epoch:
(VERY_HIGH ) RuntimeInfoHook
(NORMAL ) IterTimerHook
(NORMAL ) DistSamplerSeedHook
--------------------
before_train_iter:
(VERY_HIGH ) RuntimeInfoHook
(NORMAL ) IterTimerHook
--------------------
after_train_iter:
(VERY_HIGH ) RuntimeInfoHook
(NORMAL ) IterTimerHook
(BELOW_NORMAL) LoggerHook
(LOW ) ParamSchedulerHook
(VERY_LOW ) CheckpointHook
--------------------
after_train_epoch:
(NORMAL ) IterTimerHook
(LOW ) ParamSchedulerHook
(VERY_LOW ) CheckpointHook
--------------------
before_val:
(VERY_HIGH ) RuntimeInfoHook
--------------------
before_val_epoch:
(NORMAL ) IterTimerHook
--------------------
before_val_iter:
(NORMAL ) IterTimerHook
--------------------
after_val_iter:
(NORMAL ) IterTimerHook
(BELOW_NORMAL) LoggerHook
--------------------
after_val_epoch:
(VERY_HIGH ) RuntimeInfoHook
(NORMAL ) IterTimerHook
(BELOW_NORMAL) LoggerHook
(LOW ) ParamSchedulerHook
(VERY_LOW ) CheckpointHook
--------------------
after_val:
(VERY_HIGH ) RuntimeInfoHook
--------------------
after_train:
(VERY_HIGH ) RuntimeInfoHook
(VERY_LOW ) CheckpointHook
--------------------
before_test:
(VERY_HIGH ) RuntimeInfoHook
--------------------
before_test_epoch:
(NORMAL ) IterTimerHook
--------------------
before_test_iter:
(NORMAL ) IterTimerHook
--------------------
after_test_iter:
(NORMAL ) IterTimerHook
(BELOW_NORMAL) LoggerHook
--------------------
after_test_epoch:
(VERY_HIGH ) RuntimeInfoHook
(NORMAL ) IterTimerHook
(BELOW_NORMAL) LoggerHook
--------------------
after_test:
(VERY_HIGH ) RuntimeInfoHook
--------------------
after_run:
(BELOW_NORMAL) LoggerHook
--------------------
Traceback (most recent call last):
File "/public/home/three_whyz123/yhfu/hliu/software/miniconda3/envs/pytorch2.0/lib/python3.8/site-packages/mmengine/config/config.py", line 1475, in pretty_text
text, _ = FormatCode(
File "/public/home/three_whyz123/yhfu/hliu/software/miniconda3/envs/pytorch2.0/lib/python3.8/site-packages/yapf/yapflib/yapf_api.py", line 119, in FormatCode
style.SetGlobalStyle(style.CreateStyleFromConfig(style_config))
File "/public/home/three_whyz123/yhfu/hliu/software/miniconda3/envs/pytorch2.0/lib/python3.8/site-packages/yapf/yapflib/style.py", line 295, in CreateStyleFromConfig
style_factory = _STYLE_NAME_TO_FACTORY.get(style_config.lower())
AttributeError: 'dict' object has no attribute 'lower'
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "distributed_training.py", line 101, in <module>
main()
File "distributed_training.py", line 86, in main
runner = Runner(
File "/public/home/three_whyz123/yhfu/hliu/software/miniconda3/envs/pytorch2.0/lib/python3.8/site-packages/mmengine/runner/runner.py", line 431, in __init__
self.dump_config()
File "/public/home/three_whyz123/yhfu/hliu/software/miniconda3/envs/pytorch2.0/lib/python3.8/site-packages/mmengine/dist/utils.py", line 401, in wrapper
return func(*args, **kwargs)
File "/public/home/three_whyz123/yhfu/hliu/software/miniconda3/envs/pytorch2.0/lib/python3.8/site-packages/mmengine/runner/runner.py", line 2252, in dump_config
self.cfg.dump(osp.join(self.work_dir, filename))
File "/public/home/three_whyz123/yhfu/hliu/software/miniconda3/envs/pytorch2.0/lib/python3.8/site-packages/mmengine/config/config.py", line 1565, in dump
f.write(self.pretty_text)
File "/public/home/three_whyz123/yhfu/hliu/software/miniconda3/envs/pytorch2.0/lib/python3.8/site-packages/mmengine/config/config.py", line 1478, in pretty_text
raise SyntaxError('Failed to format the config file, please '
SyntaxError: Failed to format the config file, please check the syntax of:
Traceback (most recent call last):
File "/public/home/three_whyz123/yhfu/hliu/software/miniconda3/envs/pytorch2.0/lib/python3.8/site-packages/mmengine/config/config.py", line 1475, in pretty_text
text, _ = FormatCode(
File "/public/home/three_whyz123/yhfu/hliu/software/miniconda3/envs/pytorch2.0/lib/python3.8/site-packages/yapf/yapflib/yapf_api.py", line 119, in FormatCode
style.SetGlobalStyle(style.CreateStyleFromConfig(style_config))
File "/public/home/three_whyz123/yhfu/hliu/software/miniconda3/envs/pytorch2.0/lib/python3.8/site-packages/yapf/yapflib/style.py", line 295, in CreateStyleFromConfig
style_factory = _STYLE_NAME_TO_FACTORY.get(style_config.lower())
AttributeError: 'dict' object has no attribute 'lower'
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "distributed_training.py", line 101, in <module>
main()
File "distributed_training.py", line 97, in main
runner.train()
File "/public/home/three_whyz123/yhfu/hliu/software/miniconda3/envs/pytorch2.0/lib/python3.8/site-packages/mmengine/runner/runner.py", line 1720, in train
self.call_hook('before_run')
File "/public/home/three_whyz123/yhfu/hliu/software/miniconda3/envs/pytorch2.0/lib/python3.8/site-packages/mmengine/runner/runner.py", line 1807, in call_hook
getattr(hook, fn_name)(self, **kwargs)
File "/public/home/three_whyz123/yhfu/hliu/software/miniconda3/envs/pytorch2.0/lib/python3.8/site-packages/mmengine/hooks/runtime_info_hook.py", line 51, in before_run
cfg=runner.cfg.pretty_text,
File "/public/home/three_whyz123/yhfu/hliu/software/miniconda3/envs/pytorch2.0/lib/python3.8/site-packages/mmengine/config/config.py", line 1478, in pretty_text
raise SyntaxError('Failed to format the config file, please '
SyntaxError: Failed to format the config file, please check the syntax of:
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 1) local_rank: 0 (pid: 2871) of binary: /public/home/three_whyz123/yhfu/hliu/software/miniconda3/envs/pytorch2.0/bin/python
Traceback (most recent call last):
File "/public/home/three_whyz123/yhfu/hliu/software/miniconda3/envs/pytorch2.0/lib/python3.8/runpy.py", line 194, in _run_module_as_main
return _run_code(code, main_globals, None,
File "/public/home/three_whyz123/yhfu/hliu/software/miniconda3/envs/pytorch2.0/lib/python3.8/runpy.py", line 87, in _run_code
exec(code, run_globals)
File "/public/home/three_whyz123/yhfu/hliu/software/miniconda3/envs/pytorch2.0/lib/python3.8/site-packages/torch/distributed/launch.py", line 196, in <module>
main()
File "/public/home/three_whyz123/yhfu/hliu/software/miniconda3/envs/pytorch2.0/lib/python3.8/site-packages/torch/distributed/launch.py", line 192, in main
launch(args)
File "/public/home/three_whyz123/yhfu/hliu/software/miniconda3/envs/pytorch2.0/lib/python3.8/site-packages/torch/distributed/launch.py", line 177, in launch
run(args)
File "/public/home/three_whyz123/yhfu/hliu/software/miniconda3/envs/pytorch2.0/lib/python3.8/site-packages/torch/distributed/run.py", line 785, in run
elastic_launch(
File "/public/home/three_whyz123/yhfu/hliu/software/miniconda3/envs/pytorch2.0/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 134, in __call__
return launch_agent(self._config, self._entrypoint, list(args))
File "/public/home/three_whyz123/yhfu/hliu/software/miniconda3/envs/pytorch2.0/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 250, in launch_agent
raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
============================================================
distributed_training.py FAILED
------------------------------------------------------------
Failures:
[1]:
time : 2023-08-20_13:16:56
host : gpu1
rank : 1 (local_rank: 1)
exitcode : 1 (pid: 2872)
error_file: <N/A>
traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
------------------------------------------------------------
Root Cause (first observed failure):
[0]:
time : 2023-08-20_13:16:56
host : gpu1
rank : 0 (local_rank: 0)
exitcode : 1 (pid: 2871)
error_file: <N/A>
traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
============================================================
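For what it's worth, the inner exception in this second log comes from mmengine's Config.pretty_text calling yapf's FormatCode, which hands the style configuration to CreateStyleFromConfig, where this yapf version calls .lower() on a dict. A minimal, hypothetical reproduction of that inner error (the dict contents are illustrative, not necessarily the exact style MMEngine passes):

# Hypothetical reproduction of the AttributeError in the second traceback:
# FormatCode receives a dict as style_config, and this yapf version treats
# style_config as a style-name string, so str.lower() fails on the dict.
from yapf.yapflib.yapf_api import FormatCode

formatted, _ = FormatCode(
    'a = dict(b=1)\n',
    style_config=dict(based_on_style='pep8'))  # AttributeError: 'dict' object has no attribute 'lower'

If that reproduces, the mismatch is between the installed yapf version and MMEngine's config formatting rather than the training script itself, so the yapf/MMEngine version pair is probably worth checking.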