The attention algorithm has become a problem that involves kernel design, framework scheduling, and model innovation. For inference, the prefill/decode stages force us to consider co-designing the layout/indexing/behavior of both prefill and decode. This issue tries to track the kernel requirements of attention algorithms, keeping the framework/model in mind.
[BatchAttention]
(PrefixAttention, ExtendAttention, ...) In sglang/flashinfer, batch attention is used in the prefix-caching mechanism of LLM inference. Currently BF16/FP16 is mature; low precision has not been seen in widespread use scenarios yet.
Initially only paged kv-cache for batch is needed, not ragged (tensor) kv-cache.
BatchPrefill
KV layout is [tokens, nhead, hdim], with an additional indexing buffer to indicate the start/end of each seqlen (see the sketch after this list)
page_size=1
Rotary is usually fused
Softmax can be replaced with another logits-transform functor
JIT is used to generate the kernel instance with the logits transform replaced (ck_tile fmha has a similar mechanism)
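Below is a minimal PyTorch reference sketch of this layout and the logits-transform hook, assuming flashinfer-style CSR-like indptr buffers; the function and argument names (batch_prefill_ref, qo_indptr, kv_indptr, logits_transform) are illustrative, not the actual sglang/flashinfer API, and causal masking is omitted for brevity.

```python
import math
import torch

def batch_prefill_ref(q, k, v, qo_indptr, kv_indptr, logits_transform=None):
    """q, k, v: [tokens, nhead, hdim] ragged layout (page_size=1 degenerates to this);
    qo_indptr / kv_indptr: [batch+1] buffers marking the start/end of each
    sequence, like a CSR row pointer. Causal masking omitted for brevity."""
    out = torch.empty_like(q)
    hdim = q.shape[2]
    scale = 1.0 / math.sqrt(hdim)
    for b in range(qo_indptr.numel() - 1):
        qs, qe = qo_indptr[b].item(), qo_indptr[b + 1].item()
        ks, ke = kv_indptr[b].item(), kv_indptr[b + 1].item()
        qb = q[qs:qe].transpose(0, 1)          # [nhead, q_len, hdim]
        kb = k[ks:ke].transpose(0, 1)          # [nhead, kv_len, hdim]
        vb = v[ks:ke].transpose(0, 1)
        logits = torch.einsum("hqd,hkd->hqk", qb, kb) * scale
        # softmax is just the default logits transform; a JIT-generated kernel
        # instance could swap in a different functor here
        p = logits_transform(logits) if logits_transform else logits.softmax(dim=-1)
        out[qs:qe] = torch.einsum("hqk,hkd->hqd", p, vb).transpose(0, 1)
    return out
```

Replacing the default softmax with another functor is exactly the hook that the JIT path above would specialize at kernel-generation time.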
BatchDecode
Same as batch prefill
seqlen_q=1, with the mqa/gqa ratio used as the M dimension inside the kernel (see the sketch below)
NOTE: because the KV layout differs from PA, it is practically not possible to port PA from vLLM into sglang batch-decode; we need to focus on the sglang requirements and re-implement/optimize.
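A hedged sketch of the "gqa ratio as M dimension" idea, assuming one query token per sequence and the common convention that consecutive q heads share a KV head; the names (batch_decode_ref, kv_indptr) are illustrative only, not an existing API.

```python
import math
import torch

def batch_decode_ref(q, k_cache, v_cache, kv_indptr):
    """q: [batch, nhead_q, hdim] (one token per sequence);
    k_cache / v_cache: [tokens, nhead_kv, hdim], indexed by kv_indptr as in prefill."""
    batch, nhead_q, hdim = q.shape
    nhead_kv = k_cache.shape[1]
    group = nhead_q // nhead_kv                # mqa/gqa ratio
    scale = 1.0 / math.sqrt(hdim)
    out = torch.empty_like(q)
    for b in range(batch):
        ks, ke = kv_indptr[b].item(), kv_indptr[b + 1].item()
        # [nhead_kv, group, hdim]: the group axis plays the role of M,
        # so the per-KV-head GEMM keeps a reasonable M extent even with seqlen_q=1
        qb = q[b].view(nhead_kv, group, hdim)
        kb = k_cache[ks:ke].transpose(0, 1)    # [nhead_kv, kv_len, hdim]
        vb = v_cache[ks:ke].transpose(0, 1)
        logits = torch.einsum("hmd,hkd->hmk", qb, kb) * scale
        p = logits.softmax(dim=-1)
        out[b] = torch.einsum("hmk,hkd->hmd", p, vb).reshape(nhead_q, hdim)
    return out
```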
[MLA]
TBD
[FA]
Prefill (FA, chunked-prefill)
Please refer to Dao-AILab/FA for details on feature support
FP8 block-scale is still unclear, but expected to be in good shape in 2025
Chunked-prefill is similar to BatchAttention, but compatible with the FA KV layout (see the sketch below)
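A minimal single-head sketch of the chunked idea (not the actual FA kernel): the KV sequence is processed in chunks and the partial results are merged with online-softmax statistics. The function name and chunk size are illustrative, and causal masking is omitted.

```python
import math
import torch

def chunked_attention_ref(q, k, v, chunk=256):
    """q: [q_len, hdim]; k, v: [kv_len, hdim]. Returns [q_len, hdim]."""
    q_len, hdim = q.shape
    scale = 1.0 / math.sqrt(hdim)
    acc = torch.zeros(q_len, hdim)
    row_max = torch.full((q_len, 1), float("-inf"))
    row_sum = torch.zeros(q_len, 1)
    for start in range(0, k.shape[0], chunk):
        kb, vb = k[start:start + chunk], v[start:start + chunk]
        logits = q @ kb.T * scale                       # [q_len, chunk_len]
        new_max = torch.maximum(row_max, logits.max(-1, keepdim=True).values)
        correction = torch.exp(row_max - new_max)       # rescale earlier partials
        p = torch.exp(logits - new_max)
        row_sum = row_sum * correction + p.sum(-1, keepdim=True)
        acc = acc * correction + p @ vb
        row_max = new_max
    return acc / row_sum
```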
Decode (PA)
Highly optimized paged attention
[pages, nhead, page_size, hdim], [pages, nhead, page_size/x, hdim, x], ... The KV cache layout can be co-designed with a framework like vLLM (see the sketch below)
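A hedged sketch (hypothetical names and shapes) of the first layout plus a per-sequence block table mapping a logical token position to a physical page; the x-split variant only repacks elements so the kernel can typically issue vectorized loads, while the addressing logic stays the same.

```python
import torch

pages, nhead, page_size, hdim = 256, 8, 16, 128
kv_cache = torch.zeros(pages, nhead, page_size, hdim)        # one of K or V
block_table = torch.zeros(32, 64, dtype=torch.long)          # [batch, max_pages_per_seq]

def gather_token_kv(kv_cache, block_table, seq_id, token_pos):
    """Return the [nhead, hdim] K (or V) vector for one logical token position."""
    page = block_table[seq_id, token_pos // page_size]        # logical -> physical page
    slot = token_pos % page_size                              # slot within the page
    return kv_cache[page, :, slot, :]
```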
Linear attention
HSTU, Minimax