
Inference attention tech path #79

Open
carlushuang opened this issue Jan 27, 2025 · 0 comments

Comments

@carlushuang
Collaborator

carlushuang commented Jan 27, 2025

Attention has become a problem that involves kernel design, framework scheduling, and model innovation. For inference, the prefill/decode split forces us to consider co-designing the layout/indexing/behavior of both stages. This issue tries to track the kernel requirements of attention algorithms, keeping the framework/model in mind.

[BatchAttention]

(PrefixAttention, ExtendAttention, ...) In sglang/flashinfer, batch attention is used by the prefix-caching mechanism of LLM inference. BF16/FP16 is currently mature; low precision has not yet seen wide use in this scenario.
Initially only a paged KV cache is needed for the batch path, not a ragged (tensor) KV cache. A minimal paged-indexing sketch is given below.
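
As a point of reference, here is a minimal sketch of what paged KV-cache indexing looks like from the kernel's point of view. The `page_table`/`seq_lens` format and all shapes are assumptions for illustration; sglang, flashinfer, and vLLM each define their own page-table layout.

```python
import torch

# Hypothetical page-table format for illustration only.
num_pages, page_size, nhead, hdim = 64, 16, 8, 128
k_cache = torch.randn(num_pages, page_size, nhead, hdim)   # paged key cache

# page_table[i] lists the physical pages owned by request i (padded with -1)
page_table = torch.tensor([[3, 7, 12, -1],
                           [0, 5, -1, -1]])
seq_lens = torch.tensor([40, 20])                           # tokens per request

def gather_keys(req_id: int) -> torch.Tensor:
    """Materialize the contiguous [seqlen, nhead, hdim] key tensor of one
    request from the paged cache. A real kernel does this indexing on the
    fly instead of copying."""
    pages = page_table[req_id]
    pages = pages[pages >= 0]
    keys = k_cache[pages].reshape(-1, nhead, hdim)   # [n_pages * page_size, nhead, hdim]
    return keys[: seq_lens[req_id]]

print(gather_keys(0).shape)  # torch.Size([40, 8, 128])
```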

BatchPrefill

  • reference link in flashinfer
  • The KV layout is [tokens, nhead, hdim], with a separate indexing buffer indicating the start/end of each seqlen
  • page_size=1
  • Rotary is usually fused
  • Softmax can be replaced with another logits-transform functor (a reference sketch follows this list)
  • JIT is used to generate kernel instances with the logits transform swapped in (ck_tile fmha has a similar mechanism)
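
A non-fused reference for the batch-prefill contract described above: ragged [tokens, nhead, hdim] tensors indexed by indptr buffers, with softmax replaceable by a logits-transform functor. Function and buffer names (`batch_prefill_ref`, `qo_indptr`, `kv_indptr`) are illustrative, not the flashinfer API; the causal mask is omitted for brevity.

```python
import torch

def batch_prefill_ref(q, k, v, qo_indptr, kv_indptr, logits_transform=None):
    """Reference (non-fused) batch prefill over ragged [tokens, nhead, hdim]
    tensors. qo_indptr/kv_indptr hold start/end token offsets per request,
    mirroring the indexing buffer mentioned above."""
    out = torch.empty_like(q)
    hdim = q.shape[2]
    scale = hdim ** -0.5
    for b in range(len(qo_indptr) - 1):
        qs, qe = qo_indptr[b], qo_indptr[b + 1]
        ks, ke = kv_indptr[b], kv_indptr[b + 1]
        # [q_len, nhead, hdim] x [kv_len, nhead, hdim] -> [nhead, q_len, kv_len]
        logits = torch.einsum('qhd,khd->hqk', q[qs:qe], k[ks:ke]) * scale
        if logits_transform is None:
            p = torch.softmax(logits, dim=-1)   # default: softmax
        else:
            p = logits_transform(logits)        # e.g. capped/ReLU variant
        out[qs:qe] = torch.einsum('hqk,khd->qhd', p, v[ks:ke])
    return out

# Example: two requests packed into one ragged tensor
nhead, hdim = 4, 64
qo_indptr = torch.tensor([0, 5, 12])    # request 0: 5 new tokens, request 1: 7
kv_indptr = torch.tensor([0, 20, 52])   # cached KV lengths 20 and 32
q = torch.randn(12, nhead, hdim)
k = torch.randn(52, nhead, hdim)
v = torch.randn(52, nhead, hdim)
print(batch_prefill_ref(q, k, v, qo_indptr, kv_indptr).shape)  # torch.Size([12, 4, 64])
```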

BatchDecode

  • same as batch prefill
  • seqlen_q=1, with the MQA/GQA ratio used as the M dimension inside the kernel (see the sketch after this list)
  • NOTE: because the KV layout differs from PA, porting PA from vLLM into sglang batch decode is practically not possible; we need to focus on the sglang requirements and re-implement/optimize.
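
A sketch of the decode trick mentioned above: with seqlen_q=1, the query heads that share one KV head are folded into the GEMM M dimension so the kernel still has a non-trivial M tile. All shapes and names here are assumptions for illustration, not any framework's API.

```python
import torch

def gqa_decode_ref(q, k, v, gqa_ratio):
    """Single-request decode step (seqlen_q=1). The gqa_ratio query heads that
    share one KV head are folded into the M dimension of the attention GEMMs."""
    # q: [nhead_q, hdim], k/v: [kv_len, nhead_kv, hdim], nhead_q = nhead_kv * gqa_ratio
    nhead_q, hdim = q.shape
    nhead_kv = nhead_q // gqa_ratio
    scale = hdim ** -0.5
    # Group consecutive query heads per KV head into an M dimension of size gqa_ratio
    q_m = q.view(nhead_kv, gqa_ratio, hdim)                  # [nhead_kv, M, hdim]
    logits = torch.einsum('gmd,kgd->gmk', q_m, k) * scale    # [nhead_kv, M, kv_len]
    p = torch.softmax(logits, dim=-1)
    out = torch.einsum('gmk,kgd->gmd', p, v)                 # [nhead_kv, M, hdim]
    return out.reshape(nhead_q, hdim)

q = torch.randn(32, 128)        # 32 query heads
k = torch.randn(512, 8, 128)    # 8 KV heads, 512 cached tokens
v = torch.randn(512, 8, 128)
print(gqa_decode_ref(q, k, v, gqa_ratio=4).shape)  # torch.Size([32, 128])
```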

[MLA]

TBD

[FA]

Prefill (FA, chunked-prefill)

  • please refer to Dao-AILab/FA for the supported feature set
  • FP8 block-scale support is still unclear, but is expected to be in good shape in 2025
  • chunked prefill is similar to BatchAttention, but compatible with the FA KV layout (a chunked reference sketch follows this list)
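
A reference sketch of chunked prefill using an online softmax, so partial results over KV chunks can be combined without materializing the full logits matrix. This is the standard FA-style rescaling written in plain PyTorch; a single head and the names below are assumptions for illustration.

```python
import torch

def chunked_prefill_ref(q, kv_chunks, scale):
    """Attend over the KV cache chunk by chunk with an online softmax,
    keeping a running max, denominator, and unnormalized accumulator."""
    q_len, hdim = q.shape
    m = torch.full((q_len, 1), float('-inf'))   # running row max
    l = torch.zeros(q_len, 1)                   # running softmax denominator
    acc = torch.zeros(q_len, hdim)              # unnormalized output accumulator
    for k, v in kv_chunks:                      # each chunk: [chunk_len, hdim]
        s = q @ k.T * scale                     # [q_len, chunk_len]
        m_new = torch.maximum(m, s.max(dim=-1, keepdim=True).values)
        correction = torch.exp(m - m_new)       # rescale previous partial sums
        p = torch.exp(s - m_new)
        l = l * correction + p.sum(dim=-1, keepdim=True)
        acc = acc * correction + p @ v
        m = m_new
    return acc / l

q = torch.randn(16, 64)
kv = [(torch.randn(128, 64), torch.randn(128, 64)) for _ in range(4)]
print(chunked_prefill_ref(q, kv, scale=64 ** -0.5).shape)  # torch.Size([16, 64])
```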

Decode (PA)

  • Highly optimized paged attention
  • [pages, nhead, page_size, hdim], [pages, nhead, page_size/x, hdim, x], ... the KV-cache layout can be co-designed with frameworks like vLLM (a repack sketch follows this list)
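
A minimal repack sketch for the second (vectorized) layout listed above. The choice of x and the exact dimension order are co-design decisions with the framework, so treat the shapes and dtype here as assumptions rather than the vLLM layout.

```python
import torch

# Repack a key cache from [pages, nhead, page_size, hdim] into the
# vectorized layout [pages, nhead, page_size/x, hdim, x], where x is chosen
# so one (hdim, x) tile matches the width of a vectorized load.
pages, nhead, page_size, hdim = 32, 8, 16, 128
x = 8  # e.g. a 16-byte load of 2-byte fp16 elements

k_std = torch.randn(pages, nhead, page_size, hdim, dtype=torch.float16)

# Split page_size into (page_size//x, x), then move x innermost after hdim
k_vec = (
    k_std.view(pages, nhead, page_size // x, x, hdim)
         .permute(0, 1, 2, 4, 3)
         .contiguous()
)
print(k_vec.shape)  # torch.Size([32, 8, 2, 128, 8])

# Token t of a page lives at [:, :, t // x, :, t % x] in the vectorized layout
t = 11
assert torch.equal(k_vec[0, 0, t // x, :, t % x], k_std[0, 0, t])
```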

Linear attention

HSTU, Minimax

@carlushuang carlushuang changed the title Inference attention tech path (Batch-A, MLA, FA, Linear...) Inference attention tech path Jan 27, 2025