Issue Description:
The torch.distributed.barrier() function causes errors when running RecBole on a single GPU. After commenting out the line containing torch.distributed.barrier() in the recbole/data/dataset/dataset.py file, the error disappears.
Root Cause:
The issue arises because torch.distributed.barrier() is intended for synchronizing multiple processes, as in multi-GPU or distributed training. On a single-GPU run, torch.distributed.init_process_group() is never called, so the default process group is not initialized and the barrier() call raises an error.
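A minimal sketch outside RecBole shows the failure mode (plain PyTorch; the exact exception type and message vary across PyTorch versions):

```python
import torch.distributed as dist

# barrier() is a collective operation: it requires an initialized
# process group. On a plain single-GPU run nothing has called
# dist.init_process_group(), so the default group does not exist.
try:
    dist.barrier()
except (RuntimeError, ValueError) as err:  # exception type depends on torch version
    print(f"barrier() failed: {err}")

# False on a single-GPU run that was not launched via a distributed launcher.
print("process group initialized:", dist.is_initialized())
```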
Solution:
Since you are running on a single GPU, consider the following steps:
Comment Out the Barrier Call:
In the recbole/data/dataset/dataset.py file, locate line 251 containing torch.distributed.barrier().
Comment out this line to avoid the error. However, be aware that this may impact other parts of the code that rely on barrier synchronization.
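Rather than deleting the call, a guarded version keeps synchronization for genuinely distributed runs while becoming a no-op on a single GPU. This is a sketch of one possible patch, not RecBole's official fix:

```python
import torch.distributed as dist

# Synchronize only when a process group actually exists; on a
# single-GPU run this block is skipped instead of raising an error.
if dist.is_available() and dist.is_initialized():
    dist.barrier()
```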
Further Investigation:
To understand the root cause more thoroughly, you can explore the RecBole source code, specifically the relevant section in recbole/data/dataset/dataset.py.
Check if there are any other places where torch.distributed.barrier() is called and evaluate whether they are necessary for your specific use case.
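One quick way to enumerate call sites is to scan the installed package. This helper is illustrative only; it assumes RecBole was installed as a standard Python package:

```python
from pathlib import Path

import recbole

# Print every file and line in the installed RecBole package that
# mentions a barrier() call, so each site can be reviewed individually.
root = Path(recbole.__file__).parent
for path in sorted(root.rglob("*.py")):
    text = path.read_text(encoding="utf-8", errors="ignore")
    for lineno, line in enumerate(text.splitlines(), start=1):
        if "barrier()" in line:
            print(f"{path}:{lineno}: {line.strip()}")
```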
Describe the bug
torch.distributed.barrier() causes errors for single GPU
To Reproduce
Run RecBole training on a single GPU; the torch.distributed.barrier() call at line 251 of recbole/data/dataset/dataset.py raises an error.
Expected behavior
Training should run on a single GPU without requiring an initialized distributed process group.
Caused by torch.distributed.barrier() at line 251 of recbole/data/dataset/dataset.py. After commenting out line 251, there is no error.
Desktop (please complete the following information):
(single GPU)