The attention algorithm has become a problem that involves kernel design, framework scheduling, and model innovation. For inference, the prefill/decode stages force us to consider co-designing the layout/indexing/behavior of both prefill and decode. This issue tries to track the kernel requirements of attention algorithms, keeping the framework/model in mind.
[BatchAttention]
(PrefixAttention, ExtendAttention, ...) In sglang/flashinfer, batch attention is used in the prefix-caching mechanism of LLM inference. Currently BF16/FP16 is mature; low precision has not been seen in widespread use scenarios yet.
Initially only paged kv-cache for batch is needed, not ragged (tensor) kv-cache.
BatchPrefill
KV layout is [tokens, nhead, hdim], with an additional indexing buffer to indicate the start/end of each seqlen (see the sketch after this list)
page_size=1
Rotary is usually fused
Softmax can be replaced with another logits-transform functor
JIT is used to generate the kernel instance with the logits transform replaced (ck_tile fmha has a similar mechanism)
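Below is a minimal PyTorch reference sketch of this layout and the logits-transform hook, assuming flashinfer-style CSR-like indptr buffers; the function and argument names (batch_prefill_ref, qo_indptr, kv_indptr, logits_transform) are illustrative, not the actual sglang/flashinfer API, and causal masking is omitted for brevity.

```python
import math
import torch

def batch_prefill_ref(q, k, v, qo_indptr, kv_indptr, logits_transform=None):
    """q, k, v: [tokens, nhead, hdim] ragged layout (page_size=1 degenerates to this);
    qo_indptr / kv_indptr: [batch+1] buffers marking the start/end of each
    sequence, like a CSR row pointer. Causal masking omitted for brevity."""
    out = torch.empty_like(q)
    hdim = q.shape[2]
    scale = 1.0 / math.sqrt(hdim)
    for b in range(qo_indptr.numel() - 1):
        qs, qe = qo_indptr[b].item(), qo_indptr[b + 1].item()
        ks, ke = kv_indptr[b].item(), kv_indptr[b + 1].item()
        qb = q[qs:qe].transpose(0, 1)          # [nhead, q_len, hdim]
        kb = k[ks:ke].transpose(0, 1)          # [nhead, kv_len, hdim]
        vb = v[ks:ke].transpose(0, 1)
        logits = torch.einsum("hqd,hkd->hqk", qb, kb) * scale
        # softmax is just the default logits transform; a JIT-generated kernel
        # instance could swap in a different functor here
        p = logits_transform(logits) if logits_transform else logits.softmax(dim=-1)
        out[qs:qe] = torch.einsum("hqk,hkd->hqd", p, vb).transpose(0, 1)
    return out
```

Replacing the default softmax with another functor is exactly the hook that the JIT path above would specialize at kernel-generation time.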
BatchDecode
Same as batch prefill
seqlen_q=1, with the mqa/gqa ratio used as the M dimension inside the kernel (see the sketch below)
NOTE: because the KV layout differs from PA, it is practically not possible to port PA from vLLM into sglang batch-decode; we need to focus on the sglang requirements and re-implement/optimize.
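A hedged sketch of the "gqa ratio as M dimension" idea, assuming one query token per sequence and the common convention that consecutive q heads share a KV head; the names (batch_decode_ref, kv_indptr) are illustrative only, not an existing API.

```python
import math
import torch

def batch_decode_ref(q, k_cache, v_cache, kv_indptr):
    """q: [batch, nhead_q, hdim] (one token per sequence);
    k_cache / v_cache: [tokens, nhead_kv, hdim], indexed by kv_indptr as in prefill."""
    batch, nhead_q, hdim = q.shape
    nhead_kv = k_cache.shape[1]
    group = nhead_q // nhead_kv                # mqa/gqa ratio
    scale = 1.0 / math.sqrt(hdim)
    out = torch.empty_like(q)
    for b in range(batch):
        ks, ke = kv_indptr[b].item(), kv_indptr[b + 1].item()
        # [nhead_kv, group, hdim]: the group axis plays the role of M,
        # so the per-KV-head GEMM keeps a reasonable M extent even with seqlen_q=1
        qb = q[b].view(nhead_kv, group, hdim)
        kb = k_cache[ks:ke].transpose(0, 1)    # [nhead_kv, kv_len, hdim]
        vb = v_cache[ks:ke].transpose(0, 1)
        logits = torch.einsum("hmd,hkd->hmk", qb, kb) * scale
        p = logits.softmax(dim=-1)
        out[b] = torch.einsum("hmk,hkd->hmd", p, vb).reshape(nhead_q, hdim)
    return out
```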
[MLA]
TBD
[FA]
Prefill (FA, chunked-prefill)
Please refer to Dao-AILab/FA for details on feature support
FP8 block-scale is still unclear, but expected to be in good shape in 2025
Chunked-prefill is similar to BatchAttention, but compatible with the FA KV layout (see the sketch below)
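A minimal single-head sketch of the chunked idea (not the actual FA kernel): the KV sequence is processed in chunks and the partial results are merged with online-softmax statistics. The function name and chunk size are illustrative, and causal masking is omitted.

```python
import math
import torch

def chunked_attention_ref(q, k, v, chunk=256):
    """q: [q_len, hdim]; k, v: [kv_len, hdim]. Returns [q_len, hdim]."""
    q_len, hdim = q.shape
    scale = 1.0 / math.sqrt(hdim)
    acc = torch.zeros(q_len, hdim)
    row_max = torch.full((q_len, 1), float("-inf"))
    row_sum = torch.zeros(q_len, 1)
    for start in range(0, k.shape[0], chunk):
        kb, vb = k[start:start + chunk], v[start:start + chunk]
        logits = q @ kb.T * scale                       # [q_len, chunk_len]
        new_max = torch.maximum(row_max, logits.max(-1, keepdim=True).values)
        correction = torch.exp(row_max - new_max)       # rescale earlier partials
        p = torch.exp(logits - new_max)
        row_sum = row_sum * correction + p.sum(-1, keepdim=True)
        acc = acc * correction + p @ vb
        row_max = new_max
    return acc / row_sum
```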
Decode (PA)
Highly optimized paged attention
[pages, nhead, page_size, hdim], [pages, nhead, page_size/x, hdim, x], ... The KV cache layout can be co-designed with a framework like vLLM (see the sketch below)
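A hedged sketch (hypothetical names and shapes) of the first layout plus a per-sequence block table mapping a logical token position to a physical page; the x-split variant only repacks elements so the kernel can typically issue vectorized loads, while the addressing logic stays the same.

```python
import torch

pages, nhead, page_size, hdim = 256, 8, 16, 128
kv_cache = torch.zeros(pages, nhead, page_size, hdim)        # one of K or V
block_table = torch.zeros(32, 64, dtype=torch.long)          # [batch, max_pages_per_seq]

def gather_token_kv(kv_cache, block_table, seq_id, token_pos):
    """Return the [nhead, hdim] K (or V) vector for one logical token position."""
    page = block_table[seq_id, token_pos // page_size]        # logical -> physical page
    slot = token_pos % page_size                              # slot within the page
    return kv_cache[page, :, slot, :]
```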
Linear attention
HSTU, Minimax