
[🐛BUG] torch.distributed.barrier() in recbole/data/dataset/dataset.py causes errors for single GPU #1989

Closed
jimmy-academia opened this issue Feb 2, 2024 · 1 comment

@jimmy-academia

Describe the bug
torch.distributed.barrier() causes errors for single GPU

To Reproduce

from recbole.config import Config
from recbole.data import create_dataset
dataset = 'ml-100k'
config = Config(model='LightGCN', dataset=dataset)
config['data_path'] = 'cache_data/ml-100k/raw'
dataset = create_dataset(config)

Expected behavior
No error: the dataset should be created successfully. Instead, the following error is raised:

    raise RuntimeError(
RuntimeError: Default process group has not been initialized, please make sure to call init_process_group.

caused by

site-packages/recbole/data/dataset/dataset.py", line 251, in _download
    torch.distributed.barrier()

After commenting out line 251 in recbole/data/dataset/dataset.py, there is no error.

Desktop (please complete the following information):

  • OS: Linux
  • RecBole Version 1.2.0
  • Python Version 3.10.11
  • PyTorch Version 3.10.11
  • cudatoolkit Version 11.8

(single GPU)

@jimmy-academia jimmy-academia added the bug Something isn't working label Feb 2, 2024
@Yilu114
Collaborator

Yilu114 commented Mar 3, 2024

Issue Description:
The torch.distributed.barrier() function causes errors when running RecBole on a single GPU. After commenting out the line containing torch.distributed.barrier() in the recbole/data/dataset/dataset.py file, the error disappears.

Root Cause:
The issue arises because torch.distributed.barrier() is intended for synchronizing multiple processes, commonly used in multi-GPU or distributed training scenarios. When running on a single GPU, calling this function results in an error because the default process group is not initialized.
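A common way to handle this is to guard the call so it only synchronizes when a process group actually exists. A minimal sketch, assuming PyTorch is installed; `safe_barrier` is a hypothetical helper name, not part of RecBole:

```python
import torch.distributed as dist

def safe_barrier():
    """Call barrier() only when a process group is running.

    Returns True if barrier() was invoked, False when running
    single-process (no process group initialized).
    """
    if dist.is_available() and dist.is_initialized():
        dist.barrier()
        return True
    return False

# On a single GPU without init_process_group(), this is a no-op:
print(safe_barrier())  # → False
```

With this guard in place of a bare `torch.distributed.barrier()`, the same code path works in both single-GPU and distributed runs.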

Solution:
Since you are running on a single GPU, consider the following steps:

  1. Comment Out the Barrier Call:

    • In the recbole/data/dataset/dataset.py file, locate line 251 containing torch.distributed.barrier().
    • Comment out this line to avoid the error. However, be aware that this may impact other parts of the code that rely on barrier synchronization.
  2. Further Investigation:

    • To understand the root cause more thoroughly, you can explore the RecBole source code, specifically the relevant section in recbole/data/dataset/dataset.py.
    • Check if there are any other places where torch.distributed.barrier() is called and evaluate whether they are necessary for your specific use case.
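If you would rather not edit the library source, another option is to initialize a one-process group before creating the dataset, so that `barrier()` has something to synchronize against. A sketch, assuming PyTorch was built with the `gloo` backend; the address and port values are arbitrary placeholders:

```python
import os
import torch.distributed as dist

# Act as rank 0 of a world of size 1 so the default process
# group exists and barrier() becomes a trivial no-op.
os.environ.setdefault("MASTER_ADDR", "127.0.0.1")
os.environ.setdefault("MASTER_PORT", "29500")
dist.init_process_group(backend="gloo", rank=0, world_size=1)

dist.barrier()  # no longer raises: the default group is initialized

dist.destroy_process_group()
```

This leaves `recbole/data/dataset/dataset.py` untouched, at the cost of a small amount of setup in your own script.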

Feel free to explore the RecBole documentation and adapt the solution to your specific needs! 🚀

@Yilu114 Yilu114 closed this as completed Mar 3, 2024