[tracking] Minimal prefix-sharing kv cache #593

Open
8 of 20 tasks
renxida opened this issue Nov 22, 2024 · 0 comments
renxida commented Nov 22, 2024

Block Trie Attention Implementation Plan

Also known as radix attention or block-trie attention

Project Goal

Implement a KV cache that:

  • Preserves old KV values with their associated tokens
  • Matches them against new incoming sequences to avoid redundant computation (see the trie sketch after this list)
  • Will eventually support:
    • Cache page sharing and KV computation of common prefixes between inference requests in the same batch
    • Large page copying upon partial matches
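To make the matching step concrete, below is a minimal Python sketch of the block trie, assuming fixed-size token pages keyed by their contents. All names here (BlockTrie, TrieNode, match_prefix, insert) are illustrative, not the actual API planned for this repo.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple

@dataclass
class TrieNode:
    # One node per cache page, keyed by the tuple of tokens stored in that page.
    children: Dict[tuple, "TrieNode"] = field(default_factory=dict)
    page_index: int = -1  # index into the page pool; -1 means unassigned
    ref_count: int = 0    # requests currently pinning this page

class BlockTrie:
    """Prefix-matching trie over fixed-size token pages (sketch only)."""

    def __init__(self, page_size: int):
        self.page_size = page_size
        self.root = TrieNode()

    def match_prefix(self, tokens: List[int]) -> Tuple[List[int], List[int]]:
        """Return (matched page indices, tokens still needing KV computation)."""
        node, pages = self.root, []
        full = len(tokens) - len(tokens) % self.page_size
        for i in range(0, full, self.page_size):
            child = node.children.get(tuple(tokens[i : i + self.page_size]))
            if child is None:
                return pages, tokens[i:]  # partial match: compute the rest
            child.ref_count += 1  # pin the shared page while the request is live
            pages.append(child.page_index)
            node = child
        return pages, tokens[full:]  # only a partial trailing page remains

    def insert(self, tokens: List[int], page_indices: List[int]) -> None:
        """Publish freshly written full pages so later requests can reuse them."""
        node = self.root
        for n, page_index in enumerate(page_indices):
            key = tuple(tokens[n * self.page_size : (n + 1) * self.page_size])
            node = node.children.setdefault(key, TrieNode())
            if node.page_index < 0:
                node.page_index = page_index
```

A request would call match_prefix on arrival, compute KV values only for the unmatched suffix, then insert its completed pages; eviction would walk nodes with ref_count == 0, but that bookkeeping is omitted here.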

Implementation Tasks

Phase 1: Preparation

Phase 2: Janky implementation -> minimum working implementation

Note: At this stage, the class should work for the LLM integration tests but may fail the sglang integration tests. Stephen's involvement will be crucial for hardening against concurrency issues and for maximizing the performance benefits.

  • Concurrent access testing
  • Multi-request batch testing with shared prefixes
    • Verify error-free execution of requests
    • Same-batch common-prefix related tests
      • Demonstrate that same-batch inputs with common prefixes do not crash system / produce bad tokens.
      • Demonstrate that, at this stage, same-batch inputs with common prefixes cannot yet share cache pages and therefore show no performance advantage over same-batch inputs without common prefixes. Do so by implementing a test that expects the improvement and marking it xfail (see the sketch after this list).
  • Performance comparison testing against BaseAttentionCache. The tests should expect a substantial performance improvement over the base cache; xfail them if the improvement does not materialize, and investigate why.
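A sketch of the xfail test described above, in pytest style. The run_batch fixture is hypothetical, standing in for whatever harness the LLM integration tests provide; with strict=True, the test starts failing loudly (XPASS) the moment prefix sharing actually delivers the speedup, reminding us to flip it to an ordinary assertion.

```python
import time

import pytest

COMMON_PREFIX = "The quick brown fox jumps over the lazy dog. " * 50

@pytest.mark.xfail(reason="same-batch prefix sharing not implemented yet", strict=True)
def test_same_batch_common_prefix_is_faster(run_batch):
    # run_batch is an assumed fixture: submits prompts as one batch and blocks
    # until every response completes.
    shared = [COMMON_PREFIX + f"Question number {i}?" for i in range(8)]
    distinct = [f"Unrelated prompt number {i}. " * 55 for i in range(8)]

    start = time.perf_counter()
    run_batch(shared)
    shared_time = time.perf_counter() - start

    start = time.perf_counter()
    run_batch(distinct)
    distinct_time = time.perf_counter() - start

    # Once pages for the common prefix are shared, prefill work should drop
    # sharply; 0.8x is an arbitrary placeholder threshold.
    assert shared_time < 0.8 * distinct_time
```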

Phase 3: Benchmarking / polishing

Reference Implementations

  • sglang
  • lmdeploy/pytorch/paging/block_trie.py (InternLM/lmdeploy)

Priority Note

Focus on achieving a minimum viable solution to facilitate Stephen's involvement in the project.

renxida changed the title from [tracking] Prefix-sharing kv cache to [tracking] Minimal prefix-sharing kv cache on Nov 22, 2024
renxida added a commit that referenced this issue Nov 26, 2024
Creates space for #593 (prefix-sharing)

Coming next: #607, which should be the last thing I do before I can check in my block trie implementation.

Summary of changes:
- copied over stella's cache.py and renamed it to page_pool.py
- each inference request now notifies the cache when its pages are done being written to
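A rough sketch of that notification flow, for orientation only; the names below are illustrative and do not match the actual page_pool.py:

```python
import threading

class PagePool:
    """Fixed pool of KV cache pages with a written-page notification (sketch)."""

    def __init__(self, num_pages: int):
        self._lock = threading.Lock()
        self._free = list(range(num_pages))
        self._written: set = set()  # pages whose KV contents are final

    def acquire_page(self) -> int:
        with self._lock:
            return self._free.pop()

    def notify_written(self, page_index: int) -> None:
        # Called by an inference request once a page's KV values are fully
        # written; only such pages are safe to publish for prefix sharing.
        with self._lock:
            self._written.add(page_index)

    def release_page(self, page_index: int) -> None:
        with self._lock:
            self._written.discard(page_index)
            self._free.append(page_index)
```

The key invariant is that a page becomes shareable only after its writer signals completion, which is what the notification added in this commit enables.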