[tracking] Minimal prefix-sharing kv cache #593

Open
8 of 20 tasks
renxida opened this issue Nov 22, 2024 · 0 comments
renxida commented Nov 22, 2024

Block Trie Attention Implementation Plan

Also known as radix attention or block-trie attention

Project Goal

Implement a KV cache that:

  • Preserves old KV values with their associated tokens
  • Matches them against new incoming sequences to avoid redundant computation (see the trie sketch after this list)
  • Will eventually support:
    • Cache page sharing and KV computation of common prefixes between inference requests in the same batch
    • Large page copying upon partial matches
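To make the matching step concrete, below is a minimal Python sketch of the block trie, assuming fixed-size token pages keyed by their contents. All names here (BlockTrie, TrieNode, match_prefix, insert) are illustrative, not the actual API planned for this repo.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple

@dataclass
class TrieNode:
    # One node per cache page, keyed by the tuple of tokens stored in that page.
    children: Dict[tuple, "TrieNode"] = field(default_factory=dict)
    page_index: int = -1  # index into the page pool; -1 means unassigned
    ref_count: int = 0    # requests currently pinning this page

class BlockTrie:
    """Prefix-matching trie over fixed-size token pages (sketch only)."""

    def __init__(self, page_size: int):
        self.page_size = page_size
        self.root = TrieNode()

    def match_prefix(self, tokens: List[int]) -> Tuple[List[int], List[int]]:
        """Return (matched page indices, tokens still needing KV computation)."""
        node, pages = self.root, []
        full = len(tokens) - len(tokens) % self.page_size
        for i in range(0, full, self.page_size):
            child = node.children.get(tuple(tokens[i : i + self.page_size]))
            if child is None:
                return pages, tokens[i:]  # partial match: compute the rest
            child.ref_count += 1  # pin the shared page while the request is live
            pages.append(child.page_index)
            node = child
        return pages, tokens[full:]  # only a partial trailing page remains

    def insert(self, tokens: List[int], page_indices: List[int]) -> None:
        """Publish freshly written full pages so later requests can reuse them."""
        node = self.root
        for n, page_index in enumerate(page_indices):
            key = tuple(tokens[n * self.page_size : (n + 1) * self.page_size])
            node = node.children.setdefault(key, TrieNode())
            if node.page_index < 0:
                node.page_index = page_index
```

A request would call match_prefix on arrival, compute KV values only for the unmatched suffix, then insert its completed pages; eviction would walk nodes with ref_count == 0, but that bookkeeping is omitted here.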

Implementation Tasks

Phase 1: Preparation

Phase 2: Janky implementation -> minimum working implementation

Note: At this stage, the class should work for the LLM integration tests but may fail the sglang integration tests. Stephen's involvement will be crucial for hardening against concurrency issues and for maximizing the performance benefits.

  • Concurrent access testing
  • Multi-request batch testing with shared prefixes
    • Verify error-free execution of requests
    • Same-batch common-prefix related tests
      • Demonstrate that same-batch inputs with common prefixes do not crash system / produce bad tokens.
      • Demonstrate that, at this stage, same-batch inputs with common prefixes cannot yet share cache pages and therefore show no performance advantage over same-batch inputs without common prefixes. Do so by implementing a test that expects the improvement and marking it xfail (see the sketch after this list).
  • Performance comparison testing against BaseAttentionCache. The tests should expect a substantial performance improvement over the base cache; xfail them if the improvement does not materialize, and investigate why.
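A sketch of the xfail test described above, in pytest style. The run_batch fixture is hypothetical, standing in for whatever harness the LLM integration tests provide; with strict=True, the test starts failing loudly (XPASS) the moment prefix sharing actually delivers the speedup, reminding us to flip it to an ordinary assertion.

```python
import time

import pytest

COMMON_PREFIX = "The quick brown fox jumps over the lazy dog. " * 50

@pytest.mark.xfail(reason="same-batch prefix sharing not implemented yet", strict=True)
def test_same_batch_common_prefix_is_faster(run_batch):
    # run_batch is an assumed fixture: submits prompts as one batch and blocks
    # until every response completes.
    shared = [COMMON_PREFIX + f"Question number {i}?" for i in range(8)]
    distinct = [f"Unrelated prompt number {i}. " * 55 for i in range(8)]

    start = time.perf_counter()
    run_batch(shared)
    shared_time = time.perf_counter() - start

    start = time.perf_counter()
    run_batch(distinct)
    distinct_time = time.perf_counter() - start

    # Once pages for the common prefix are shared, prefill work should drop
    # sharply; 0.8x is an arbitrary placeholder threshold.
    assert shared_time < 0.8 * distinct_time
```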

Phase 3: Benchmarking / polishing

Reference Implementations

  • sglang
  • lmdeploy/pytorch/paging/block_trie.py (InternLM/lmdeploy)

Priority Note

Focus on achieving a minimum viable solution to facilitate Stephen's involvement in the project.

renxida changed the title from [tracking] Prefix-sharing kv cache to [tracking] Minimal prefix-sharing kv cache on Nov 22, 2024
renxida added a commit that referenced this issue Nov 26, 2024
Creates space for #593 (prefix-sharing)

Coming next: #607, which should be the last thing I do before I can check in my block trie implementation.

Summary of changes:
- copied over stella's cache.py and renamed it to page_pool.py
- each inference request now notifies the cache when its pages are done being written to
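A rough sketch of that notification flow, for orientation only; the names below are illustrative and do not match the actual page_pool.py:

```python
import threading

class PagePool:
    """Fixed pool of KV cache pages with a written-page notification (sketch)."""

    def __init__(self, num_pages: int):
        self._lock = threading.Lock()
        self._free = list(range(num_pages))
        self._written: set = set()  # pages whose KV contents are final

    def acquire_page(self) -> int:
        with self._lock:
            return self._free.pop()

    def notify_written(self, page_index: int) -> None:
        # Called by an inference request once a page's KV values are fully
        # written; only such pages are safe to publish for prefix sharing.
        with self._lock:
            self._written.add(page_index)

    def release_page(self, page_index: int) -> None:
        with self._lock:
            self._written.discard(page_index)
            self._free.append(page_index)
```

The key invariant is that a page becomes shareable only after its writer signals completion, which is what the notification added in this commit enables.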