Why "without costing any GPU SMs from the computation part" #26

Open
zhang662817 opened this issue Feb 26, 2025 · 1 comment

Comments

@zhang662817

The IBGDA dispatch kernels run in the background, so why don't they take up any compute SM resources?

Thanks.

@xiaoyao9933

xiaoyao9933 commented Feb 26, 2025

In low-latency mode, rdma_recv_buff is large enough to hold the maximum number of tokens from every rank to every expert, so buffer backpressure is unnecessary. The sender-side kernel therefore exits immediately after completing its one-sided transmission. The remaining receiver-side kernel runs only when the hook function is invoked, and it finalizes the copy out of the receive buffer. So they likely mean that during the computation (MoE kernel) phase, no kernel needs to poll continuously for incoming data, which reduces SM occupancy. You can invoke the receiver-side kernel lazily, after the MoE kernel has finished; a sketch of this pattern follows below.

This is my personal understanding, not from the DeepSeek team.
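
A minimal sketch of that lazy-hook pattern in Python, assuming DeepEP's low-latency buffer API with `return_recv_hook=True` as shown in the README; the exact signature, the return tuple, and the helpers `buffer` and `overlapped_compute` are assumptions for illustration, not the library's authoritative interface:

```python
def dispatch_with_lazy_recv(buffer, hidden_states, topk_idx,
                            num_max_dispatch_tokens_per_rank,
                            num_experts, overlapped_compute):
    # Issue the one-sided IBGDA sends. With return_recv_hook=True the
    # dispatch kernel exits right after transmission; no SM stays busy
    # polling for incoming tokens. (Signature assumed from the README.)
    recv_x, recv_count, handle, event, hook = buffer.low_latency_dispatch(
        hidden_states, topk_idx,
        num_max_dispatch_tokens_per_rank, num_experts,
        return_recv_hook=True)

    # Run other computation (e.g. an MoE kernel for another micro-batch)
    # while the NIC delivers tokens into rdma_recv_buff in the background.
    overlapped_compute()

    # Lazily launch the receiver-side kernel: only now are the received
    # tokens copied out of the RDMA buffer and made ready for the experts.
    hook()
    return recv_x, recv_count, handle
```

Because `hook()` is what actually launches the receiver-side copy, the window between dispatch and the hook costs no SMs: the NIC (via IBGDA) moves the data while the GPU runs the computation.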
