Hello, I ran into a problem while using this repo: if I don't use distributed training, I have to modify the source code. Simply setting distributed to False on the command line does not fix it. How should I deal with the issues caused by the distributed-training code? Is commenting out the relevant lines the only option?
----log-----
Traceback (most recent call last):
File "F:\Projects\Multi Modal\ALBEF\Pretrain.py", line 203, in
main(args, config)
File "F:\Projects\Multi Modal\ALBEF\Pretrain.py", line 175, in main
dist.barrier()
File "F:\anaconda3\envs\albef\lib\site-packages\torch\distributed\c10d_logger.py", line 72, in wrapper
return func(*args, **kwargs)
File "F:\anaconda3\envs\albef\lib\site-packages\torch\distributed\distributed_c10d.py", line 3428, in barrier
opts.device = _get_pg_default_device(group)
File "F:\anaconda3\envs\albef\lib\site-packages\torch\distributed\distributed_c10d.py", line 644, in _get_pg_default_device
group = group or _get_default_group()
File "F:\anaconda3\envs\albef\lib\site-packages\torch\distributed\distributed_c10d.py", line 977, in _get_default_group
raise ValueError(
ValueError: Default process group has not been initialized, please make sure to call init_process_group.
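For context, the traceback shows that dist.barrier() is reached even though init_process_group() was never called, which is what happens when the script runs in a non-distributed setup. One common workaround is to guard every torch.distributed call so it only executes when a default process group actually exists. Below is a minimal sketch of that guard; the helper name is_dist_avail_and_initialized is illustrative (check whether the repo's utils.py already ships an equivalent), and the guarded barrier would replace the bare dist.barrier() call in Pretrain.py:

```python
import torch.distributed as dist

def is_dist_avail_and_initialized() -> bool:
    # True only when torch.distributed is built into this PyTorch install
    # AND a default process group was created via init_process_group().
    if not dist.is_available():
        return False
    if not dist.is_initialized():
        return False
    return True

# In main(), guard the synchronization point so the script also works
# on a single GPU / CPU without a process group:
if is_dist_avail_and_initialized():
    dist.barrier()
```

Other distributed-only pieces (e.g. wrapping the model in DistributedDataParallel, using DistributedSampler, or calling dist.get_rank()/dist.get_world_size()) would likely need similar guards or single-process fallbacks, depending on how the rest of the script is written.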