Use IB_MERGE_VFS argument when detecting PCI path #145

Merged
merged 1 commit into from
Jan 31, 2024

Conversation

thomasbarrett

@thomasbarrett thomasbarrett commented Jan 28, 2024

When running in a cloud-hypervisor virtual machine, IB VFs are exposed as RCiEPs (Root Complex Integrated Endpoints). If the IB VFs are merged, the nccl-rdma-sharp-plugin incorrectly assumes that all IB virtual functions are part of the PCI host bridge. Disabling the IB_MERGE_VFS variable allows NCCL to correctly identify IB VFs as independent devices.

This is the same logic used by NCCL's native IB transport, so it should be used here as well.
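
To illustrate why merging matters here, the following is a minimal sketch of the kind of PCI path normalization described above, not the plugin's actual code. The helper name `normalize_pci_path` and the `merge_vfs` flag (standing in for the IB_MERGE_VFS setting) are hypothetical, and the example path is made up. It assumes the sysfs path ends in the usual `DDDD:BB:DD.F` form.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/*
 * Hypothetical helper: normalize a resolved sysfs PCI path in place.
 *
 * The path is assumed to end in "DDDD:BB:DD.F" (domain:bus:device.function).
 *  - The function digit is always set to '0' so that the ports of a
 *    multi-port NIC map to the same PCI device.
 *  - When merge_vfs is non-zero, the two device digits are also set to '0',
 *    which folds virtual functions into one device.
 */
static void normalize_pci_path(char *path, int merge_vfs) {
    assert(path != NULL);
    size_t len = strlen(path);
    if (len < 4) {
        return; /* too short to hold a "DD.F" suffix; leave untouched */
    }
    path[len - 1] = '0'; /* function -> 0 */
    if (merge_vfs) {
        path[len - 3] = '0'; /* low device digit -> 0 */
        path[len - 4] = '0'; /* high device digit -> 0 */
    }
}
```

With a VF that sits directly on the root bus, for example `/sys/devices/pci0000:00/0000:00:05.0`, merging rewrites the last component to `0000:00:00.0`, which is typically the address of the host bridge itself. With `merge_vfs` set to 0 the path keeps its own device number, so each VF is seen as a separate device.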

@thomasbarrett
Author

thomasbarrett commented Jan 31, 2024

Thanks @bureddy. I’ll fix CI in an hour or so.

Question: how long does it take for plug-in updates to get included in HPC-X?

When running in a cloud-hypervisor guest, IB VFs are exposed as an
RCiEP. If the IB VFs are merged, NCCL does not correctly detect
PCI topology.
@bureddy bureddy merged commit 3ff78de into Mellanox:master Jan 31, 2024
5 checks passed
@bureddy
Collaborator

bureddy commented Feb 1, 2024

> Thanks @bureddy. I’ll fix CI in an hour or so.
>
> Question: how long does it take for plug-in updates to get included in HPC-X?

It will be part of HPCX-2.18 (ETA: mid-February).
