Is setting IBGDA necessary for test_internode.py? #36
Comments
Yes, you can. The normal kernels use IBRC instead of IBGDA. But we plan to support AR later, which always requires IBGDA. |
I ran test_internode.py on 2 H20 nodes without IBGDA and encountered the following error. I'm not sure whether it's related to IBGDA. @LyricZhao
|
|
I would like to run test_internode.py with IBGDA enabled, because it looks like dual-port RNICs are supported when using IBGDA over a RoCE network. Could you please give me a detailed configuration for enabling IBGDA? Many thanks. |
I used Megatron-LM to test on two H100 nodes (16 cards) over a RoCE network with ep=16. I also encountered the above error. |
You can set the environment variables NVSHMEM_IB_ENABLE_IBGDA=1 and NVSHMEM_IBGDA_NIC_HANDLER=gpu to enable IBGDA. |
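For anyone launching the test from a Python wrapper, here is a minimal sketch of applying those two settings to a child process environment. Only the two variable names and values come from the comment above; the helper name with_ibgda and the approach of building a separate environment dict are assumptions for illustration, not part of this project.

```python
import os

# The two variable names and values are taken from the comment above.
# Everything else here (names, structure) is a hypothetical illustration.
IBGDA_ENV = {
    "NVSHMEM_IB_ENABLE_IBGDA": "1",
    "NVSHMEM_IBGDA_NIC_HANDLER": "gpu",
}


def with_ibgda(base_env=None):
    """Return a copy of base_env (default: os.environ) with the IBGDA settings applied."""
    env = dict(os.environ if base_env is None else base_env)
    env.update(IBGDA_ENV)
    return env


if __name__ == "__main__":
    # Print the settings that a child process would receive.
    env = with_ibgda()
    for name in IBGDA_ENV:
        print(f"{name}={env[name]}")
```

The resulting dict can be passed as env= to subprocess.run when starting the test on each node. The launch command itself depends on your setup and is not shown here.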
This appears to be an error during NVSHMEM bootstrap. Please verify your network configuration is correct. It's recommended to run the tests in NVSHMEM perftest first to validate your network setup. Note that even when IBGDA is enabled, NVSHMEM will still create IBRC connections, so seeing this warning message makes sense. For more details, please refer to the NVSHMEM documentation. |
My script:
RANK=0 result:
RANK=1 result:
|
@Baibaifan The message
This should enable GPU Direct RDMA functionality. |
@haswelliris After I successfully set rank0: rank1:
|
Many thanks, will try it. |
Is there any way to determine whether IBGDA is correctly enabled? The performance seems no different whether I set NVSHMEM_IB_ENABLE_IBGDA=1 or 0, and the results look OK. Also, is it necessary to compile libgdsync (https://github.com/gpudirect/libgdsync) before compiling NVSHMEM? |
Many thanks for your kind response. It does not seem to work: I added a log message to ibrc.cxx:progress_send(...), and it shows that this function is still being called to transfer data. Maybe some other configuration needed to enable IBGDA is missing? |
While debugging this issue, I found that ibgda.cc:nvshmemt_init(...) fails with the following error message: WARN: device mlx5_1 cannot allocate buffer on the specified memory type. Skipping... This problem is caused by a failure in mlx5dv_devx_umem_reg(...). Any suggestions would be appreciated. Thanks. |
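A rough way to check saved logs for the warning quoted above is to scan for it and list the affected devices. This is only a sketch: the message text is copied from the comment above, while the function name, the regular expression, and the assumption that the message appears verbatim on one line are mine. Finding no match does not prove that IBGDA is active.

```python
import re

# Pattern built from the warning text quoted in the comment above.
# Assumes the message appears verbatim on a single log line.
SKIP_PATTERN = re.compile(
    r"WARN: device (\S+) cannot allocate buffer on the specified memory type"
)


def skipped_devices(log_text):
    """Return the sorted, de-duplicated device names reported in the warning."""
    return sorted(set(SKIP_PATTERN.findall(log_text)))


if __name__ == "__main__":
    sample = (
        "WARN: device mlx5_1 cannot allocate buffer on the "
        "specified memory type. Skipping..."
    )
    print(skipped_devices(sample))
```

If the list is non-empty, those devices were skipped, which would be consistent with traffic still going through ibrc.cxx:progress_send(...) as described earlier in the thread.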
I noticed the following steps in the guide:
Due to some environment permission issues, I can't do this step for now. Is it possible to run test_internode.py without doing this step?