Fix allgather ops inside cuda graphs #3709
base: main
Conversation
It seems cuda graph capture failed in one test.
Seeing the same issue, please merge soon :)
Hi @nvcastet, could you help take a look? Thanks!
Looking.
Hi @nvcastet, I see CUDA graph capture fail when using the pynccl all_gather in the logits layer, but it succeeds in DP attention since …
@yizhang2077 Still debugging. Can I have another hour-ish?
OK, I think that is reasonable.
@yizhang2077 Is it reproducible on your side?
It is weird, I will try it in my env.
@nvcastet @yizhang2077
Hi @nvcastet, I can reproduce this error in my env. torch 2.5.1+cu124, 8xH200, docker image lmsysorg/sglang:dev |
@yizhang2077 Thanks, let me try your container image.
From https://pytorch.org/docs/stable/notes/cuda.html#constraints
The failure did not happen for me on PyTorch 2.7 but it does on 2.5, which points to pynccl not being graphable (it breaks the torch.compile graph) and also being recompiled later on during CUDA graph capture. I added a commit to remove pynccl and replace it with a ProcessGroup clone, which should behave the same way.
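For illustration only (not the code in this PR): a minimal sketch of the process-group-clone idea, assuming torch.distributed is already initialized with the NCCL backend and the NCCL build supports graph capture; all names here are hypothetical.

```python
import torch
import torch.distributed as dist

# Assumption: dist.init_process_group("nccl", ...) has already run on every rank.
world_size = dist.get_world_size()

# A second group over the same ranks gets its own NCCL communicator, so
# collectives captured into a CUDA graph do not share a communicator with
# eager-mode collectives issued on the default group.
graph_group = dist.new_group(ranks=list(range(world_size)), backend="nccl")

x = torch.ones(8, device="cuda")
out = torch.empty(8 * world_size, device="cuda")

# Warm up the new communicator outside capture; NCCL initialization cannot
# happen while a graph is being captured.
dist.all_gather_into_tensor(out, x, group=graph_group)

g = torch.cuda.CUDAGraph()
with torch.cuda.graph(g):
    dist.all_gather_into_tensor(out, x, group=graph_group)
g.replay()
```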
@nvcastet It is weird, since pynccl allreduce is also on the critical path and is graphable.
Hi @nvcastet, pynccl cannot be removed since it is needed for the cu118 environment. PyTorch cu118 installs nvidia-nccl-cu11, which ships the CUDA 11.0 build of NCCL by default, but CUDA graphs need an NCCL built with CUDA version > 11.3. So we can specify the NCCL .so path
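As a quick, generic sanity check (not part of this PR), you can confirm which NCCL build the installed PyTorch links against; the constraints page linked above requires a sufficiently recent NCCL for graph-captured collectives.

```python
import torch

# Print the CUDA toolkit PyTorch was built with and the NCCL version it links
# against; graph capture of collectives needs a sufficiently new NCCL build.
print("torch:", torch.__version__, "CUDA:", torch.version.cuda)
print("NCCL:", torch.cuda.nccl.version())  # e.g. (2, 21, 5)
```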
The message (displayed using
@ispobock Thank you for pointing that out, I was not aware of this issue.
@ispobock when downloading
So we should be good, right?
Hi @nvcastet, I think the main problem currently is that torch.compile conflicts with pynccl, right? I think you can try to wrap the pynccl all_gather in sglang/python/sglang/srt/distributed/parallel_state.py (lines 134 to 139 at 9087694) as a PyTorch custom op, and then run python3 -m unittest test_bench_one_batch.TestBenchOneBatch.test_torch_compile_tp2_bs1. Could you give it a try?
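For readers following along, a rough, hedged sketch of what wrapping the pynccl allgather as a PyTorch custom op could look like (torch >= 2.4 custom-op API; the op name is hypothetical and `_pynccl_all_gather_impl` is a single-process stand-in for the real pynccl call, not the PR's code).

```python
import torch
from torch.library import custom_op, register_fake

def _pynccl_all_gather_impl(out: torch.Tensor, x: torch.Tensor) -> None:
    # Stand-in for pynccl's all_gather so the sketch runs on a single process;
    # the real implementation issues the NCCL allgather on the current stream.
    out.copy_(x.repeat(out.shape[0] // x.shape[0], *([1] * (x.dim() - 1))))

# Registering the call as a custom op makes it opaque to torch.compile, so the
# compiler neither traces into pynccl nor triggers a recompile during capture.
@custom_op("sglang_sketch::pynccl_all_gather", mutates_args=())
def pynccl_all_gather(x: torch.Tensor, world_size: int) -> torch.Tensor:
    out = torch.empty((world_size * x.shape[0], *x.shape[1:]),
                      dtype=x.dtype, device=x.device)
    _pynccl_all_gather_impl(out, x)
    return out

# The fake (meta) kernel gives torch.compile output shapes without running NCCL.
@register_fake("sglang_sketch::pynccl_all_gather")
def _(x: torch.Tensor, world_size: int) -> torch.Tensor:
    return x.new_empty((world_size * x.shape[0], *x.shape[1:]))
```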
@nvcastet Thanks, it works!
I have patched and tested this PR, but the "Watchdog caught collective operation timeout" problem still occurs.
Force-pushed from 48a926f to 38585e0 (compare)
@yizhang2077 @zhyncs I went back to the first commit and registered the pynccl allgather as a PyTorch custom op, as you suggested.
Could you do me a favor? I have merged your PR and the issue still reproduces. Many thanks.
In my env, it will download the whl from |
@nvcastet Could you fix the lint with
Fixes #3424
TLDR:
sglang uses pynccl or its custom allreduce instead of the PyTorch ProcessGroup when capturing CUDA graphs.
The allgather calls in the sglang code base use the PyTorch allgather directly instead of going through the sglang abstraction that decides which implementation to pick. Allgather is used in DP attention and also to gather logits across the TP dimension.
The fix is to perform the allgather via that abstraction so that the same NCCL communicator is not used both inside and outside graph capture.
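A minimal sketch of the call-site change, assuming a GroupCoordinator-style object (as in sglang's parallel_state.py) whose all_gather dispatches to the graph-safe backend; the function name and signature below are illustrative, not the exact diff.

```python
import torch

def gather_logits(local_logits: torch.Tensor, tp_group) -> torch.Tensor:
    # Before the fix (roughly): torch.distributed.all_gather_into_tensor(...),
    # which always goes through the default ProcessGroup's NCCL communicator,
    # even while a CUDA graph is being captured.
    #
    # After the fix: route through the sglang abstraction, which picks the
    # communicator that is safe to use during graph capture (e.g. pynccl),
    # so the same NCCL communicator is not used both inside and outside graphs.
    return tp_group.all_gather(local_logits, dim=-1)
```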
Motivation
Modifications
Checklist