
Is setting IBGDA necessary for test_internode.py? #36

Open
zhangml opened this issue Mar 2, 2025 · 14 comments

@zhangml

zhangml commented Mar 2, 2025

I noticed the following steps in the guide:

Enable IBGDA by modifying /etc/modprobe.d/nvidia.conf:

  options nvidia NVreg_EnableStreamMemOPs=1 NVreg_RegistryDwords="PeerMappingOverride=1;"

Update kernel configuration:

  sudo update-initramfs -u
  sudo reboot
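
For reference, whether these options took effect can be checked after the reboot; a minimal sketch, assuming the standard NVIDIA driver procfs interface:

  # inspect the loaded driver parameters for the two options above
  grep -E "EnableStreamMemOPs|RegistryDwords" /proc/driver/nvidia/params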

Due to some environment permission issues, I can't do this step for now. Is it possible to run test_internode.py without doing this step?

@LyricZhao
Collaborator

Yes, you can. The normal kernels use IBRC instead of IBGDA. But we plan to support AR later, which always requires IBGDA.

@zhangml
Author

zhangml commented Mar 3, 2025

Yes, you can. The normal kernels use IBRC instead of IBGDA. But we plan to support AR later, which always requires IBGDA.

I ran test_internode.py on 2 H20 nodes without IBGDA and encountered the following error; I'm not sure if it's related to IBGDA. @LyricZhao

nvshmem_src/src/modules/bootstrap/uid/bootstrap_uid.cpp:499: non-zero status: -3 nvshmem_src/src/modules/bootstrap/uid/bootstrap_uid.cpp:bootstrap_net_recv:99: Message truncated : received 40 bytes instead of 1

nvshmem_src/src/modules/bootstrap/uid/bootstrap_uid.cpp:499: non-zero status: -3 nvshmem_src/src/modules/bootstrap/uid/bootstrap_uid.cpp:bootstrap_net_recv:99: Message truncated : received 127 bytes instead of 8

nvshmem_src/src/modules/bootstrap/uid/bootstrap_uid.cpp:499: non-zero status: -3 nvshmem_src/src/host/topo/topo.cpp:477: non-zero status: -3 allgather of ipc handles failed 

nvshmem_src/src/host/init/init.cu:992: non-zero status: 7 building transport map failed 

@haswelliris
Collaborator

  1. Based on your logs, it appears that the system is unable to retrieve information from other ranks during bootstrap. We recommend checking your network connectivity settings (see the example exports after this list), including:
    • Proper IP and network interface configuration (NVSHMEM_HCA_LIST)
    • For RoCE, ensure correct settings for:
      • NVSHMEM_IB_GID_INDEX
      • NVSHMEM_IB_TRAFFIC_CLASS
  2. We strongly recommend properly enabling IBGDA to prevent potential unknown issues.
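
For example, a RoCE setup might export something like the following (the HCA name, GID index, and traffic class are illustrative assumptions; confirm the right values for your fabric, e.g. with show_gids):

  export NVSHMEM_HCA_LIST=mlx5_0:1
  export NVSHMEM_IB_GID_INDEX=3
  export NVSHMEM_IB_TRAFFIC_CLASS=106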

@yanminjia

yanminjia commented Mar 3, 2025

I would like to run test_internode.py with IBGDA enabled, because it looks like dual-port RNICs are supported when going with IBGDA over a RoCE network. Could you please give me a detailed configuration to enable IBGDA?

Many thanks.

@Baibaifan

I used Megatron-LM to test two H100 nodes (16 GPUs in total) on a RoCE network with ep=16. I also encountered the above bootstrap_net_recv:99: Message truncated: received 40 bytes instead of 8. I set IBGDA, but it reports: WARN: init failed for remote transport: ibrc.

@sphish
Collaborator

sphish commented Mar 3, 2025

I would like to run test_internode.py with IBGDA enabled, because it looks like dual-port RNICs are supported when going with IBGDA over a RoCE network. Could you please give me a detailed configuration to enable IBGDA?

Many thanks.

You can set the environment variables

NVSHMEM_IB_ENABLE_IBGDA=1
NVSHMEM_IBGDA_NIC_HANDLER=gpu

to enable IBGDA.
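
For example, a minimal sketch prefixing the test launch (MASTER_ADDR, WORLD_SIZE, and RANK are placeholders; adjust per node):

  NVSHMEM_IB_ENABLE_IBGDA=1 NVSHMEM_IBGDA_NIC_HANDLER=gpu \
    MASTER_ADDR=xxx WORLD_SIZE=2 RANK=0 python tests/test_internode.py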

@sphish
Collaborator

sphish commented Mar 3, 2025

I used Megatron-LM to test two H100 nodes (16 GPUs in total) on a RoCE network with ep=16. I also encountered the above bootstrap_net_recv:99: Message truncated: received 40 bytes instead of 8. I set IBGDA, but it reports: WARN: init failed for remote transport: ibrc.

This appears to be an error during NVSHMEM bootstrap. Please verify that your network configuration is correct. It's recommended to run the NVSHMEM perftests first to validate your network setup.

Note that even when IBGDA is enabled, NVSHMEM will still create IBRC connections, so this warning message is expected. For more details, please refer to the NVSHMEM documentation.
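
For example, a point-to-point bandwidth perftest makes a quick two-node sanity check; a sketch, assuming the NVSHMEM perftests were built and are launched via MPI (the binary location varies across NVSHMEM versions):

  mpirun -np 2 -host node0,node1 ./perftest/device/pt-to-pt/shmem_put_bw

If this already fails with similar bootstrap errors, the problem is in the network setup rather than in DeepEP.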

@Baibaifan

Baibaifan commented Mar 3, 2025

I used Megatron-LM to test two H100 nodes (16 GPUs in total) on a RoCE network with ep=16. I also encountered the above bootstrap_net_recv:99: Message truncated: received 40 bytes instead of 8. I set IBGDA, but it reports: WARN: init failed for remote transport: ibrc.

This appears to be an error during NVSHMEM bootstrap. Please verify that your network configuration is correct. It's recommended to run the NVSHMEM perftests first to validate your network setup.

Note that even when IBGDA is enabled, NVSHMEM will still create IBRC connections, so this warning message is expected. For more details, please refer to the NVSHMEM documentation.

My script:

NCCL_DEBUG=INFO MASTER_ADDR=xxx WORLD_SIZE=2 RANK=0 python tests/test_internode.py
NCCL_DEBUG=INFO MASTER_ADDR=xxx WORLD_SIZE=2 RANK=1 python tests/test_internode.py

RANK=0 result:

nvshmem_src/src/modules/bootstrap/uid/bootstrap_uid.cpp:bootstrap_net_recv:99: Message truncated : received 40 bytes instead of 8

nvshmem_src/src/modules/bootstrap/uid/bootstrap_uid.cpp:499: non-zero status: -3 nvshmem_src/src/host/topo/topo.cpp:477: non-zero status: -3 allgather of ipc handles failed
...
nvshmem_src/src/host/init/init.cu:992: non-zero status: 7 building transport map failed

nvshmem_src/src/host/init/init.cu:nvshmemi_check_state_and_init:1074: nvshmem initialization failed, exiting

nvshmem_src/src/host/init/init.cu:992: non-zero status: 7 building transport map failed


RANK=1 result:

nvshmem_src/src/modules/transport/ibrc/ibrc.cpp:nvshmemt_init:1850: neither nv_peer_mem, or nvidia_peermem detected. Skipping transport.

nvshmem_src/src/host/topo/topo.cpp:469: [GPU 0] Peer GPU 1 is not accessible, exiting ...
nvshmem_src/src/host/init/init.cu:992: non-zero status: 3 building transport map failed
...

WARN: init failed for remote transport: ibrc

@haswelliris
Collaborator

haswelliris commented Mar 3, 2025

@Baibaifan The message neither nv_peer_mem nor nvidia_peermem detected indicates that your system environment does not currently support GPUDirect RDMA. To resolve this, please try loading the GDR kernel module by running one of the following commands:

modprobe nv_peer_mem
# or
modprobe nvidia_peermem

This should enable GPUDirect RDMA functionality.
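
Whether the module actually loaded can be verified with:

  lsmod | grep -e nv_peer_mem -e nvidia_peermem

To keep it loaded across reboots, the module name can also be registered under /etc/modules-load.d/ (the exact mechanism is distribution-dependent).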

@Baibaifan

Baibaifan commented Mar 3, 2025

modprobe nvidia_peermem

@haswelliris After I successfully ran modprobe nv_peer_mem and repeated the commands above, the following appears:

rank0:
There is no error output for rank0.

rank1:

Caught signal 11 (Segmentation fault: address not mapped to object at address 0x11000008)
==== backtrace (tid:  84663) ====
 0 0x0000000000042520 __sigaction()  ???:0
 1 0x0000000000010a13 process_recv()  :0
 2 0x00000000000112e5 progress_recv()  :0
 3 0x00000000000113dc nvshmemt_ibrc_progress()  :0
 4 0x000000000020256c progress_transports()  ???:0
 5 0x0000000000202c52 nvshmemi_proxy_progress()  ???:0
 6 0x0000000000094ac3 pthread_condattr_setpshared()  ???:0
 7 0x0000000000126850 __xmknodat()  ???:0
=================================

@yanminjia

I would like to run test_internode.py with IBGDA enabled, because it looks like dual-port RNICs are supported when going with IBGDA over a RoCE network. Could you please give me a detailed configuration to enable IBGDA?
Many thanks.

You can set the environment variables

NVSHMEM_IB_ENABLE_IBGDA=1
NVSHMEM_IBGDA_NIC_HANDLER=gpu
to enable IBGDA.

Many thanks, will try it.

@ghghliu

ghghliu commented Mar 3, 2025

Is there any way to determine whether IBGDA is correctly enabled? The performance shows no difference whether I set NVSHMEM_IB_ENABLE_IBGDA=1 or 0, and the results look OK. Also, is it necessary to compile libgdsync (https://github.com/gpudirect/libgdsync) before compiling NVSHMEM?

@yanminjia

I would like to run test_internode.py with IBGDA enabled, because it looks like dual-port RNICs are supported when going with IBGDA over a RoCE network. Could you please give me a detailed configuration to enable IBGDA?
Many thanks.

You can set the environment variables

NVSHMEM_IB_ENABLE_IBGDA=1
NVSHMEM_IBGDA_NIC_HANDLER=gpu
to enable IBGDA.

Many thanks for your kind response. It doesn't seem to work: ibrc.cxx:progress_send(...) is still called to transfer data, as confirmed by a log message I added to that function (ibrc.cxx:progress_send(...)). Is there perhaps some other configuration missing to enable IBGDA?

@yanminjia

I would like to run test_internode.py with IBGDA enabled, because it looks like dual-port RNICs are supported when going with IBGDA over a RoCE network. Could you please give me a detailed configuration to enable IBGDA?
Many thanks.

You can set the environment variables
NVSHMEM_IB_ENABLE_IBGDA=1
NVSHMEM_IBGDA_NIC_HANDLER=gpu
to enable IBGDA.

Many thanks for your kind response. It doesn't seem to work: ibrc.cxx:progress_send(...) is still called to transfer data, as confirmed by a log message I added to that function (ibrc.cxx:progress_send(...)). Is there perhaps some other configuration missing to enable IBGDA?

While debugging this issue, I found that ibgda.cc:nvshmemt_init(...) fails with the following error message:

WARN: device mlx5_1 cannot allocate buffer on the specified memory type. Skipping...

This problem is caused by a failure in mlx5dv_devx_umem_reg(...).

Any suggestion would be appreciated.

Thanks.
