When I want to run examples/distributed_training.py with the command below (I copied the script to my server and only changed the dataset directory):
python -m torch.distributed.launch --nproc_per_node=2 distributed_training.py --launcher pytorch
I get this crash:
/public/home/three_whyz123/yhfu/hliu/software/miniconda3/envs/pytorch2.0/lib/python3.8/site-packages/torch/distributed/launch.py:181: FutureWarning: The module torch.distributed.launch is deprecated
and will be removed in future. Use torchrun.
Note that --use-env is set by default in torchrun.
If your script expects `--local-rank` argument to be set, please
change it to read from `os.environ['LOCAL_RANK']` instead. See
https://pytorch.org/docs/stable/distributed.html#launch-utility for
further instructions
warnings.warn(
WARNING:torch.distributed.run:
*****************************************
Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed.
*****************************************
usage: distributed_training.py [-h] [--launcher {none,pytorch,slurm,mpi}] [--local_rank LOCAL_RANK]
distributed_training.py: error: unrecognized arguments: --local-rank=1
usage: distributed_training.py [-h] [--launcher {none,pytorch,slurm,mpi}] [--local_rank LOCAL_RANK]
distributed_training.py: error: unrecognized arguments: --local-rank=0
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 2) local_rank: 0 (pid: 20683) of binary: /public/home/three_whyz123/yhfu/hliu/software/miniconda3/envs/pytorch2.0/bin/python
Traceback (most recent call last):
File "/public/home/three_whyz123/yhfu/hliu/software/miniconda3/envs/pytorch2.0/lib/python3.8/runpy.py", line 194, in _run_module_as_main
return _run_code(code, main_globals, None,
File "/public/home/three_whyz123/yhfu/hliu/software/miniconda3/envs/pytorch2.0/lib/python3.8/runpy.py", line 87, in _run_code
exec(code, run_globals)
File "/public/home/three_whyz123/yhfu/hliu/software/miniconda3/envs/pytorch2.0/lib/python3.8/site-packages/torch/distributed/launch.py", line 196, in <module>
main()
File "/public/home/three_whyz123/yhfu/hliu/software/miniconda3/envs/pytorch2.0/lib/python3.8/site-packages/torch/distributed/launch.py", line 192, in main
launch(args)
File "/public/home/three_whyz123/yhfu/hliu/software/miniconda3/envs/pytorch2.0/lib/python3.8/site-packages/torch/distributed/launch.py", line 177, in launch
run(args)
File "/public/home/three_whyz123/yhfu/hliu/software/miniconda3/envs/pytorch2.0/lib/python3.8/site-packages/torch/distributed/run.py", line 785, in run
elastic_launch(
File "/public/home/three_whyz123/yhfu/hliu/software/miniconda3/envs/pytorch2.0/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 134, in __call__
return launch_agent(self._config, self._entrypoint, list(args))
File "/public/home/three_whyz123/yhfu/hliu/software/miniconda3/envs/pytorch2.0/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 250, in launch_agent
raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
============================================================
distributed_training.py FAILED
------------------------------------------------------------
Failures:
[1]:
time : 2023-08-20_11:50:39
host : gpu1
rank : 1 (local_rank: 1)
exitcode : 2 (pid: 20684)
error_file: <N/A>
traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
------------------------------------------------------------
Root Cause (first observed failure):
[0]:
time : 2023-08-20_11:50:39
host : gpu1
rank : 0 (local_rank: 0)
exitcode : 2 (pid: 20683)
error_file: <N/A>
traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
============================================================
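For reference, the first failure is the launcher passing --local-rank=N (the dash spelling that PyTorch 2.0's torch.distributed.launch uses) to a script whose parser only knows --local_rank. Below is a minimal, hypothetical sketch of argument handling that tolerates both spellings and falls back to the LOCAL_RANK environment variable, as the deprecation warning recommends; it is not the actual example script, and everything beyond the --launcher/--local_rank options shown in the usage line is an assumption:

import argparse
import os

def parse_args():
    parser = argparse.ArgumentParser(description='distributed training example')
    parser.add_argument(
        '--launcher',
        choices=['none', 'pytorch', 'slurm', 'mpi'],
        default='none',
        help='job launcher')
    # PyTorch >= 2.0's torch.distributed.launch passes --local-rank=N,
    # while older releases passed --local_rank=N; accept both spellings.
    parser.add_argument('--local-rank', '--local_rank', type=int, default=0)
    args = parser.parse_args()
    # torchrun passes no flag at all and only sets LOCAL_RANK, so make sure
    # the environment variable is populated either way.
    if 'LOCAL_RANK' not in os.environ:
        os.environ['LOCAL_RANK'] = str(args.local_rank)
    return args

With a change like that, either launcher spelling should be accepted; alternatively, the torchrun form the warning points to (something like torchrun --nproc_per_node=2 distributed_training.py --launcher pytorch) only sets LOCAL_RANK in the environment and passes no --local-rank flag at all.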
Maybe distributed_training.py doesn't have the --local-rank argument, so I changed local_rank to local-rank in the script, and then a new crash appeared. My environment information is included in the log below:
/public/home/three_whyz123/yhfu/hliu/software/miniconda3/envs/pytorch2.0/lib/python3.8/site-packages/torch/distributed/launch.py:181: FutureWarning: The module torch.distributed.launch is deprecated
and will be removed in future. Use torchrun.
Note that --use-env is set by default in torchrun.
If your script expects `--local-rank` argument to be set, please
change it to read from `os.environ['LOCAL_RANK']` instead. See
https://pytorch.org/docs/stable/distributed.html#launch-utility for
further instructions
warnings.warn(
WARNING:torch.distributed.run:
*****************************************
Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed.
*****************************************
Files already downloaded and verified
Files already downloaded and verified
Files already downloaded and verified
Files already downloaded and verified
/public/home/three_whyz123/yhfu/hliu/software/miniconda3/envs/pytorch2.0/lib/python3.8/site-packages/mmengine/utils/dl_utils/setup_env.py:56: UserWarning: Setting MKL_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed.
warnings.warn(
/public/home/three_whyz123/yhfu/hliu/software/miniconda3/envs/pytorch2.0/lib/python3.8/site-packages/mmengine/utils/dl_utils/setup_env.py:56: UserWarning: Setting MKL_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed.
warnings.warn(
08/20 13:16:54 - mmengine - INFO -
------------------------------------------------------------
System environment:
sys.platform: linux
Python: 3.8.16 | packaged by conda-forge | (default, Feb 1 2023, 16:01:55) [GCC 11.3.0]
CUDA available: True
numpy_random_seed: 537005317
GPU 0,1: NVIDIA A100-PCIE-40GB
CUDA_HOME: /usr/local/cuda
NVCC: Cuda compilation tools, release 11.3, V11.3.109
GCC: gcc (GCC) 4.8.5
PyTorch: 2.0.1+cu117
PyTorch compiling details: PyTorch built with:
- GCC 9.3
- C++ Version: 201703
- Intel(R) oneAPI Math Kernel Library Version 2022.2-Product Build 20220804 for Intel(R) 64 architecture applications
- Intel(R) MKL-DNN v2.7.3 (Git Hash 6dbeffbae1f23cbbeae17adb7b5b13f1f37c080e)
- OpenMP 201511 (a.k.a. OpenMP 4.5)
- LAPACK is enabled (usually provided by MKL)
- NNPACK is enabled
- CPU capability usage: AVX2
- CUDA Runtime 11.7
- NVCC architecture flags: -gencode;arch=compute_37,code=sm_37;-gencode;arch=compute_50,code=sm_50;-gencode;arch=compute_60,code=sm_60;-gencode;arch=compute_70,code=sm_70;-gencode;arch=compute_75,code=sm_75;-gencode;arch=compute_80,code=sm_80;-gencode;arch=compute_86,code=sm_86
- CuDNN 8.5
- Magma 2.6.1
- Build settings: BLAS_INFO=mkl, BUILD_TYPE=Release, CUDA_VERSION=11.7, CUDNN_VERSION=8.5.0, CXX_COMPILER=/opt/rh/devtoolset-9/root/usr/bin/c++, CXX_FLAGS= -D_GLIBCXX_USE_CXX11_ABI=0 -fabi-version=11 -Wno-deprecated -fvisibility-inlines-hidden -DUSE_PTHREADPOOL -DNDEBUG -DUSE_KINETO -DLIBKINETO_NOROCTRACER -DUSE_FBGEMM -DUSE_QNNPACK -DUSE_PYTORCH_QNNPACK -DUSE_XNNPACK -DSYMBOLICATE_MOBILE_DEBUG_HANDLE -O2 -fPIC -Wall -Wextra -Werror=return-type -Werror=non-virtual-dtor -Werror=bool-operation -Wnarrowing -Wno-missing-field-initializers -Wno-type-limits -Wno-array-bounds -Wno-unknown-pragmas -Wunused-local-typedefs -Wno-unused-parameter -Wno-unused-function -Wno-unused-result -Wno-strict-overflow -Wno-strict-aliasing -Wno-error=deprecated-declarations -Wno-stringop-overflow -Wno-psabi -Wno-error=pedantic -Wno-error=redundant-decls -Wno-error=old-style-cast -fdiagnostics-color=always -faligned-new -Wno-unused-but-set-variable -Wno-maybe-uninitialized -fno-math-errno -fno-trapping-math -Werror=format -Werror=cast-function-type -Wno-stringop-overflow, LAPACK_INFO=mkl, PERF_WITH_AVX=1, PERF_WITH_AVX2=1, PERF_WITH_AVX512=1, TORCH_DISABLE_GPU_ASSERTS=ON, TORCH_VERSION=2.0.1, USE_CUDA=ON, USE_CUDNN=ON, USE_EXCEPTION_PTR=1, USE_GFLAGS=OFF, USE_GLOG=OFF, USE_MKL=ON, USE_MKLDNN=ON, USE_MPI=OFF, USE_NCCL=1, USE_NNPACK=ON, USE_OPENMP=ON, USE_ROCM=OFF,
TorchVision: 0.15.2+cu117
OpenCV: 4.7.0
MMEngine: 0.8.4
Runtime environment:
dist_cfg: {'backend': 'nccl'}
seed: 537005317
Distributed launcher: pytorch
Distributed training: True
GPU number: 2
------------------------------------------------------------
08/20 13:16:54 - mmengine - INFO - Hooks will be executed in the following order:
before_run:
(VERY_HIGH ) RuntimeInfoHook
(BELOW_NORMAL) LoggerHook
--------------------
before_train:
(VERY_HIGH ) RuntimeInfoHook
(NORMAL ) IterTimerHook
(VERY_LOW ) CheckpointHook
--------------------
before_train_epoch:
(VERY_HIGH ) RuntimeInfoHook
(NORMAL ) IterTimerHook
(NORMAL ) DistSamplerSeedHook
--------------------
before_train_iter:
(VERY_HIGH ) RuntimeInfoHook
(NORMAL ) IterTimerHook
--------------------
after_train_iter:
(VERY_HIGH ) RuntimeInfoHook
(NORMAL ) IterTimerHook
(BELOW_NORMAL) LoggerHook
(LOW ) ParamSchedulerHook
(VERY_LOW ) CheckpointHook
--------------------
after_train_epoch:
(NORMAL ) IterTimerHook
(LOW ) ParamSchedulerHook
(VERY_LOW ) CheckpointHook
--------------------
before_val:
(VERY_HIGH ) RuntimeInfoHook
--------------------
before_val_epoch:
(NORMAL ) IterTimerHook
--------------------
before_val_iter:
(NORMAL ) IterTimerHook
--------------------
after_val_iter:
(NORMAL ) IterTimerHook
(BELOW_NORMAL) LoggerHook
--------------------
after_val_epoch:
(VERY_HIGH ) RuntimeInfoHook
(NORMAL ) IterTimerHook
(BELOW_NORMAL) LoggerHook
(LOW ) ParamSchedulerHook
(VERY_LOW ) CheckpointHook
--------------------
after_val:
(VERY_HIGH ) RuntimeInfoHook
--------------------
after_train:
(VERY_HIGH ) RuntimeInfoHook
(VERY_LOW ) CheckpointHook
--------------------
before_test:
(VERY_HIGH ) RuntimeInfoHook
--------------------
before_test_epoch:
(NORMAL ) IterTimerHook
--------------------
before_test_iter:
(NORMAL ) IterTimerHook
--------------------
after_test_iter:
(NORMAL ) IterTimerHook
(BELOW_NORMAL) LoggerHook
--------------------
after_test_epoch:
(VERY_HIGH ) RuntimeInfoHook
(NORMAL ) IterTimerHook
(BELOW_NORMAL) LoggerHook
--------------------
after_test:
(VERY_HIGH ) RuntimeInfoHook
--------------------
after_run:
(BELOW_NORMAL) LoggerHook
--------------------
Traceback (most recent call last):
File "/public/home/three_whyz123/yhfu/hliu/software/miniconda3/envs/pytorch2.0/lib/python3.8/site-packages/mmengine/config/config.py", line 1475, in pretty_text
text, _ = FormatCode(
File "/public/home/three_whyz123/yhfu/hliu/software/miniconda3/envs/pytorch2.0/lib/python3.8/site-packages/yapf/yapflib/yapf_api.py", line 119, in FormatCode
style.SetGlobalStyle(style.CreateStyleFromConfig(style_config))
File "/public/home/three_whyz123/yhfu/hliu/software/miniconda3/envs/pytorch2.0/lib/python3.8/site-packages/yapf/yapflib/style.py", line 295, in CreateStyleFromConfig
style_factory = _STYLE_NAME_TO_FACTORY.get(style_config.lower())
AttributeError: 'dict' object has no attribute 'lower'
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "distributed_training.py", line 101, in <module>
main()
File "distributed_training.py", line 86, in main
runner = Runner(
File "/public/home/three_whyz123/yhfu/hliu/software/miniconda3/envs/pytorch2.0/lib/python3.8/site-packages/mmengine/runner/runner.py", line 431, in __init__
self.dump_config()
File "/public/home/three_whyz123/yhfu/hliu/software/miniconda3/envs/pytorch2.0/lib/python3.8/site-packages/mmengine/dist/utils.py", line 401, in wrapper
return func(*args, **kwargs)
File "/public/home/three_whyz123/yhfu/hliu/software/miniconda3/envs/pytorch2.0/lib/python3.8/site-packages/mmengine/runner/runner.py", line 2252, in dump_config
self.cfg.dump(osp.join(self.work_dir, filename))
File "/public/home/three_whyz123/yhfu/hliu/software/miniconda3/envs/pytorch2.0/lib/python3.8/site-packages/mmengine/config/config.py", line 1565, in dump
f.write(self.pretty_text)
File "/public/home/three_whyz123/yhfu/hliu/software/miniconda3/envs/pytorch2.0/lib/python3.8/site-packages/mmengine/config/config.py", line 1478, in pretty_text
raise SyntaxError('Failed to format the config file, please '
SyntaxError: Failed to format the config file, please check the syntax of:
Traceback (most recent call last):
File "/public/home/three_whyz123/yhfu/hliu/software/miniconda3/envs/pytorch2.0/lib/python3.8/site-packages/mmengine/config/config.py", line 1475, in pretty_text
text, _ = FormatCode(
File "/public/home/three_whyz123/yhfu/hliu/software/miniconda3/envs/pytorch2.0/lib/python3.8/site-packages/yapf/yapflib/yapf_api.py", line 119, in FormatCode
style.SetGlobalStyle(style.CreateStyleFromConfig(style_config))
File "/public/home/three_whyz123/yhfu/hliu/software/miniconda3/envs/pytorch2.0/lib/python3.8/site-packages/yapf/yapflib/style.py", line 295, in CreateStyleFromConfig
style_factory = _STYLE_NAME_TO_FACTORY.get(style_config.lower())
AttributeError: 'dict' object has no attribute 'lower'
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "distributed_training.py", line 101, in <module>
main()
File "distributed_training.py", line 97, in main
runner.train()
File "/public/home/three_whyz123/yhfu/hliu/software/miniconda3/envs/pytorch2.0/lib/python3.8/site-packages/mmengine/runner/runner.py", line 1720, in train
self.call_hook('before_run')
File "/public/home/three_whyz123/yhfu/hliu/software/miniconda3/envs/pytorch2.0/lib/python3.8/site-packages/mmengine/runner/runner.py", line 1807, in call_hook
getattr(hook, fn_name)(self, **kwargs)
File "/public/home/three_whyz123/yhfu/hliu/software/miniconda3/envs/pytorch2.0/lib/python3.8/site-packages/mmengine/hooks/runtime_info_hook.py", line 51, in before_run
cfg=runner.cfg.pretty_text,
File "/public/home/three_whyz123/yhfu/hliu/software/miniconda3/envs/pytorch2.0/lib/python3.8/site-packages/mmengine/config/config.py", line 1478, in pretty_text
raise SyntaxError('Failed to format the config file, please '
SyntaxError: Failed to format the config file, please check the syntax of:
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 1) local_rank: 0 (pid: 2871) of binary: /public/home/three_whyz123/yhfu/hliu/software/miniconda3/envs/pytorch2.0/bin/python
Traceback (most recent call last):
File "/public/home/three_whyz123/yhfu/hliu/software/miniconda3/envs/pytorch2.0/lib/python3.8/runpy.py", line 194, in _run_module_as_main
return _run_code(code, main_globals, None,
File "/public/home/three_whyz123/yhfu/hliu/software/miniconda3/envs/pytorch2.0/lib/python3.8/runpy.py", line 87, in _run_code
exec(code, run_globals)
File "/public/home/three_whyz123/yhfu/hliu/software/miniconda3/envs/pytorch2.0/lib/python3.8/site-packages/torch/distributed/launch.py", line 196, in <module>
main()
File "/public/home/three_whyz123/yhfu/hliu/software/miniconda3/envs/pytorch2.0/lib/python3.8/site-packages/torch/distributed/launch.py", line 192, in main
launch(args)
File "/public/home/three_whyz123/yhfu/hliu/software/miniconda3/envs/pytorch2.0/lib/python3.8/site-packages/torch/distributed/launch.py", line 177, in launch
run(args)
File "/public/home/three_whyz123/yhfu/hliu/software/miniconda3/envs/pytorch2.0/lib/python3.8/site-packages/torch/distributed/run.py", line 785, in run
elastic_launch(
File "/public/home/three_whyz123/yhfu/hliu/software/miniconda3/envs/pytorch2.0/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 134, in __call__
return launch_agent(self._config, self._entrypoint, list(args))
File "/public/home/three_whyz123/yhfu/hliu/software/miniconda3/envs/pytorch2.0/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 250, in launch_agent
raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
============================================================
distributed_training.py FAILED
------------------------------------------------------------
Failures:
[1]:
time : 2023-08-20_13:16:56
host : gpu1
rank : 1 (local_rank: 1)
exitcode : 1 (pid: 2872)
error_file: <N/A>
traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
------------------------------------------------------------
Root Cause (first observed failure):
[0]:
time : 2023-08-20_13:16:56
host : gpu1
rank : 0 (local_rank: 0)
exitcode : 1 (pid: 2871)
error_file: <N/A>
traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
============================================================
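For what it's worth, the inner exception in this second log comes from mmengine's Config.pretty_text calling yapf's FormatCode, which hands the style configuration to CreateStyleFromConfig, where this yapf version calls .lower() on a dict. A minimal, hypothetical reproduction of that inner error (the dict contents are illustrative, not necessarily the exact style MMEngine passes):

# Hypothetical reproduction of the AttributeError in the second traceback:
# FormatCode receives a dict as style_config, and this yapf version treats
# style_config as a style-name string, so str.lower() fails on the dict.
from yapf.yapflib.yapf_api import FormatCode

formatted, _ = FormatCode(
    'a = dict(b=1)\n',
    style_config=dict(based_on_style='pep8'))  # AttributeError: 'dict' object has no attribute 'lower'

If that reproduces, the mismatch is between the installed yapf version and MMEngine's config formatting rather than the training script itself, so the yapf/MMEngine version pair is probably worth checking.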