
[🐛BUG] torch.distributed.barrier() in recbole/data/dataset/dataset.py causes errors for single GPU #1989

Closed
jimmy-academia opened this issue Feb 2, 2024 · 1 comment

@jimmy-academia

Describe the bug
torch.distributed.barrier() causes errors for single GPU

To Reproduce

from recbole.config import Config
from recbole.data import create_dataset
dataset = 'ml-100k'
config = Config(model='LightGCN', dataset=dataset)
config['data_path'] = 'cache_data/ml-100k/raw'
dataset = create_dataset(config)

Expected behavior
No error: the dataset should be created successfully. Instead, the following error is raised:

    raise RuntimeError(
RuntimeError: Default process group has not been initialized, please make sure to call init_process_group.

caused by

site-packages/recbole/data/dataset/dataset.py", line 251, in _download
    torch.distributed.barrier()

After commenting out line 251 in recbole/data/dataset/dataset.py, there is no error.

Desktop (please complete the following information):

  • OS: Linux
  • RecBole Version 1.2.0
  • Python Version 3.10.11
  • PyTorch Version 3.10.11
  • cudatoolkit Version 11.8

(single GPU)

@jimmy-academia jimmy-academia added the bug Something isn't working label Feb 2, 2024
@Yilu114
Collaborator

Yilu114 commented Mar 3, 2024

Issue Description:
The torch.distributed.barrier() function causes errors when running RecBole on a single GPU. After commenting out the line containing torch.distributed.barrier() in the recbole/data/dataset/dataset.py file, the error disappears.

Root Cause:
The issue arises because torch.distributed.barrier() is intended for synchronizing multiple processes, commonly used in multi-GPU or distributed training scenarios. When running on a single GPU, calling this function results in an error because the default process group is not initialized.
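A common way to handle this is to guard the call so it only synchronizes when a process group actually exists. A minimal sketch, assuming PyTorch is installed; `safe_barrier` is a hypothetical helper name, not part of RecBole:

```python
import torch.distributed as dist

def safe_barrier():
    """Call barrier() only when a process group is running.

    Returns True if barrier() was invoked, False when running
    single-process (no process group initialized).
    """
    if dist.is_available() and dist.is_initialized():
        dist.barrier()
        return True
    return False

# On a single GPU without init_process_group(), this is a no-op:
print(safe_barrier())  # → False
```

With this guard in place of a bare `torch.distributed.barrier()`, the same code path works in both single-GPU and distributed runs.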

Solution:
Since you are running on a single GPU, consider the following steps:

  1. Comment Out the Barrier Call:

    • In the recbole/data/dataset/dataset.py file, locate line 251 containing torch.distributed.barrier().
    • Comment out this line to avoid the error. However, be aware that this may impact other parts of the code that rely on barrier synchronization.
  2. Further Investigation:

    • To understand the root cause more thoroughly, you can explore the RecBole source code, specifically the relevant section in recbole/data/dataset/dataset.py.
    • Check if there are any other places where torch.distributed.barrier() is called and evaluate whether they are necessary for your specific use case.
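If you would rather not edit the library source, another option is to initialize a one-process group before creating the dataset, so that `barrier()` has something to synchronize against. A sketch, assuming PyTorch was built with the `gloo` backend; the address and port values are arbitrary placeholders:

```python
import os
import torch.distributed as dist

# Act as rank 0 of a world of size 1 so the default process
# group exists and barrier() becomes a trivial no-op.
os.environ.setdefault("MASTER_ADDR", "127.0.0.1")
os.environ.setdefault("MASTER_PORT", "29500")
dist.init_process_group(backend="gloo", rank=0, world_size=1)

dist.barrier()  # no longer raises: the default group is initialized

dist.destroy_process_group()
```

This leaves `recbole/data/dataset/dataset.py` untouched, at the cost of a small amount of setup in your own script.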

Feel free to explore the RecBole documentation and adapt the solution to your specific needs! 🚀

@Yilu114 Yilu114 closed this as completed Mar 3, 2024