Why "without costing any GPU SMs from the computation part" #26

Open
zhang662817 opened this issue Feb 26, 2025 · 1 comment

Comments

@zhang662817

The IBGDA dispatch kernels run in the background, so why don't they take up any compute SM resources?

Thanks.

@xiaoyao9933

xiaoyao9933 commented Feb 26, 2025

In low-latency mode, rdma_recv_buff is large enough to hold the maximum number of tokens from every rank to every expert, so buffer backpressure is unnecessary. The sender-side kernel therefore exits immediately after completing its one-sided transmission. The remaining receiver-side kernel runs only when the hook function is invoked, and it finalizes the copy out of the receive buffer. So they likely mean that during the computation (MoE kernel) phase, no kernel needs to poll continuously for incoming data, which reduces SM occupancy. You can invoke the receiver-side kernel lazily, after the MoE kernel has finished; a sketch of this pattern follows below.

This is my personal understanding, not from the DeepSeek team.
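
A minimal sketch of that lazy-hook pattern in Python, assuming DeepEP's low-latency buffer API with `return_recv_hook=True` as shown in the README; the exact signature, the return tuple, and the helpers `buffer` and `overlapped_compute` are assumptions for illustration, not the library's authoritative interface:

```python
def dispatch_with_lazy_recv(buffer, hidden_states, topk_idx,
                            num_max_dispatch_tokens_per_rank,
                            num_experts, overlapped_compute):
    # Issue the one-sided IBGDA sends. With return_recv_hook=True the
    # dispatch kernel exits right after transmission; no SM stays busy
    # polling for incoming tokens. (Signature assumed from the README.)
    recv_x, recv_count, handle, event, hook = buffer.low_latency_dispatch(
        hidden_states, topk_idx,
        num_max_dispatch_tokens_per_rank, num_experts,
        return_recv_hook=True)

    # Run other computation (e.g. an MoE kernel for another micro-batch)
    # while the NIC delivers tokens into rdma_recv_buff in the background.
    overlapped_compute()

    # Lazily launch the receiver-side kernel: only now are the received
    # tokens copied out of the RDMA buffer and made ready for the experts.
    hook()
    return recv_x, recv_count, handle
```

Because `hook()` is what actually launches the receiver-side copy, the window between dispatch and the hook costs no SMs: the NIC (via IBGDA) moves the data while the GPU runs the computation.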
