More generic check for CUDA-aware MPI #1793

Status: Open. Wants to merge 3 commits into base: main.
20 changes: 15 additions & 5 deletions in heat/core/communication.py

```diff
@@ -8,23 +8,33 @@
 import os
 import subprocess
 import torch
+import warnings
 from mpi4py import MPI
 
 from typing import Any, Callable, Optional, List, Tuple, Union
 from .stride_tricks import sanitize_axis
 
 CUDA_AWARE_MPI = False
-# check whether OpenMPI support CUDA-aware MPI
-if "openmpi" in os.environ.get("MPI_SUFFIX", "").lower():
+# check whether there is CUDA-aware OpenMPI
+try:
     buffer = subprocess.check_output(["ompi_info", "--parsable", "--all"])
     CUDA_AWARE_MPI = b"mpi_built_with_cuda_support:value:true" in buffer
-# MVAPICH
+except:  # noqa E722
+    pass
+# do the same for MVAPICH
 CUDA_AWARE_MPI = CUDA_AWARE_MPI or os.environ.get("MV2_USE_CUDA") == "1"
-# MPICH
+# do the same for MPICH
 CUDA_AWARE_MPI = CUDA_AWARE_MPI or os.environ.get("MPIR_CVAR_ENABLE_HCOLL") == "1"
-# ParaStationMPI
+# do the same for ParaStationMPI
 CUDA_AWARE_MPI = CUDA_AWARE_MPI or os.environ.get("PSP_CUDA") == "1"
+
+# warn the user if CUDA-aware MPI is not available, but PyTorch can use GPUs
+if torch.cuda.is_available() and not CUDA_AWARE_MPI:
+    warnings.warn(
+        f"Heat has GPU-support (PyTorch version {torch.__version__}), but CUDA-awareness of MPI could not be detected. \n This may lead to performance degradation as direct MPI-communication between GPUs is not possible.",
+        UserWarning,
+    )
 
 
 class MPIRequest:
     """
```
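For reference, here is a minimal standalone sketch of the same Open MPI probe the patch runs at import time; it assumes an Open MPI installation with `ompi_info` on the PATH, and the helper name `openmpi_is_cuda_aware` is illustrative, not part of the PR:

```python
import subprocess

def openmpi_is_cuda_aware() -> bool:
    """Return True if the local Open MPI build reports CUDA support."""
    try:
        out = subprocess.check_output(["ompi_info", "--parsable", "--all"])
    except (OSError, subprocess.CalledProcessError):
        # ompi_info is missing or failed: assume no CUDA support
        return False
    # A CUDA-enabled build emits a line such as
    #   mca:mpi:base:param:mpi_built_with_cuda_support:value:true
    return b"mpi_built_with_cuda_support:value:true" in out

if __name__ == "__main__":
    print("CUDA-aware Open MPI:", openmpi_is_cuda_aware())
```

A check along these lines only covers Open MPI; the environment-variable fallbacks in the patch (`MV2_USE_CUDA`, `MPIR_CVAR_ENABLE_HCOLL`, `PSP_CUDA`) handle MVAPICH, MPICH, and ParaStationMPI, which provide no equivalent query tool.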

Review thread on the new warning (anchored at `if torch.cuda.is_available() and not CUDA_AWARE_MPI:`):

Contributor: Something occurred to us earlier, just before we merged this: `torch.cuda.is_available()` will return True with ROCm as well. We need to constrain the check a bit more; that's why we decided not to merge yet.

Collaborator (Author): That's true. So we actually need to check for both ROCm/HIP and CUDA in PyTorch and compare that against the corresponding MPI capabilities, to avoid the (hopefully) unlikely case of having ROCm/HIP PyTorch with CUDA MPI, or CUDA PyTorch with ROCm/HIP MPI, and so on.
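A minimal sketch of the constraint discussed above, assuming it suffices to distinguish the build target of the installed PyTorch: ROCm wheels set `torch.version.hip` and leave `torch.version.cuda` as None, while CUDA wheels do the reverse. The helper name `pytorch_gpu_backend` is hypothetical and not part of the PR:

```python
import torch

def pytorch_gpu_backend() -> str:
    """Best-effort guess of the GPU backend this PyTorch build targets."""
    # ROCm wheels expose a HIP version string and report torch.version.cuda
    # as None; CUDA wheels do the opposite; CPU-only wheels set both to None.
    if torch.version.hip is not None:
        return "rocm"
    if torch.version.cuda is not None:
        return "cuda"
    return "cpu"

# Warn about missing CUDA-awareness only when PyTorch actually targets CUDA,
# so torch.cuda.is_available() returning True on ROCm does not trigger it.
should_warn = pytorch_gpu_backend() == "cuda" and torch.cuda.is_available()
```

Comparing this backend against the detected MPI capability would also catch the mixed ROCm-PyTorch/CUDA-MPI pairings mentioned in the author's reply.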

