Request for Head-Specific KV Cache Compression Feature #7
Comments
Hi @FFY0, definitely a good issue, that's a key feature for several compression techniques. However, it requires implementing a new kernel to be efficient, so it's a significant effort (unless we find a trick... I do have some ideas ^^).
Thanks, @SimJeg!
Hi @SimJeg. Recently, I tried to implement a head-specific KV cache compression solution within the current project architecture and developed the Ada-SnapKV compression method as described in the AdaKV paper. This solution introduces several new components while minimizing intrusive changes to the existing architecture. The main modifications include a modified version of the AdaKV kernel and a few new subclasses added to the existing modeling files.
Once a new subclass based on the existing press classes is implemented, it can reuse the head-specific compression path without further changes. So far, I have obtained some preliminary results for Ada-SnapKV on the RULER benchmark, and the performance looks promising. Moving forward, I plan to conduct some tests on corner cases. The code is currently available in a branch of my forked repository. I would appreciate your feedback or suggestions. If progress aligns with expectations, I would be happy to continue working on this and eventually attempt to merge the changes into the main branch.
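For readers unfamiliar with the approach, below is a minimal sketch of the head-adaptive budget allocation idea behind Ada-SnapKV. This is not the code from the branch; the function name, tensor shapes, and the way the scores are obtained are illustrative assumptions.

```python
import torch

def adaptive_head_budgets(scores: torch.Tensor, total_budget: int) -> torch.Tensor:
    """Allocate a shared KV budget across heads from per-token importance scores.

    scores: (num_heads, seq_len) importance of each cached token for each head
            (e.g. pooled attention weights from an observation window, as in SnapKV).
    total_budget: total number of KV entries to keep, summed over all heads.

    Returns a boolean mask of shape (num_heads, seq_len) marking kept entries.
    Instead of keeping `total_budget // num_heads` entries per head, the top
    `total_budget` scores are selected globally, so heads with flatter attention
    receive larger budgets than heads with concentrated attention.
    """
    num_heads, seq_len = scores.shape
    flat = scores.reshape(-1)                              # (num_heads * seq_len,)
    topk = flat.topk(min(total_budget, flat.numel())).indices
    mask = torch.zeros_like(flat, dtype=torch.bool)
    mask[topk] = True
    return mask.view(num_heads, seq_len)

# Example: 4 heads, 16 cached tokens, keep 32 entries in total (8 per head on average).
scores = torch.rand(4, 16)
mask = adaptive_head_budgets(scores, total_budget=32)
print(mask.sum(dim=1))  # per-head budgets, summing to 32 but generally unequal
```

The key point is that the budget is shared across heads, so the retained lengths differ per head, which is exactly what a rectangular KV cache layout cannot represent directly.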
Thanks @FFY0 for the hard work! We need to decide internally if we want to host kernels in this repository. Is the kernel you propose here already available via pip install somewhere else?
Hi @SimJeg, this kernel is a modified version of the original AdaKV kernel. It is currently compiled within the forked repository. I will also make further adjustments to the code and merge the branch you mentioned into my changes; it seems they could be integrated easily.
I created a branch introducing another way to do head-wise compression here. It does not contain any custom kernel. How it works: pruned positions are masked out per head so that attention ignores them, approximating head-specific compression without changing the cache layout (see the sketch after this comment).
This implementation is very short (~60 LOC) and fits nicely with the kvpress package; however, it also has some downsides.
@FFY0, the reason I investigated it is that your current PR implies many changes to the current code.
Anyway, I don't think I will merge this branch as is because of these downsides.
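As a rough illustration of the masking idea mentioned above (a sketch only, assuming standard scaled dot-product attention; it is not the code from the branch): instead of physically evicting a different number of entries per head, every head keeps the full-length cache and its pruned positions are masked out before the softmax.

```python
import torch
import torch.nn.functional as F

def attention_with_headwise_mask(query, key, value, keep_mask):
    """Scaled dot-product attention in which each head ignores its own pruned KV entries.

    query:      (batch, num_heads, q_len, head_dim)
    key, value: (batch, num_heads, kv_len, head_dim)  -- full, uncompressed cache
    keep_mask:  (batch, num_heads, kv_len) boolean, True for entries the head keeps.
    """
    scale = query.shape[-1] ** -0.5
    scores = torch.matmul(query, key.transpose(-1, -2)) * scale   # (b, h, q, kv)
    # Broadcast the per-head mask over the query dimension; pruned entries get -inf.
    scores = scores.masked_fill(~keep_mask[:, :, None, :], float("-inf"))
    weights = F.softmax(scores, dim=-1)
    return torch.matmul(weights, value)

# Tiny usage example with random tensors; the first cached entry is always kept
# so that no head ends up with an all-masked row.
b, h, q, kv, d = 1, 4, 1, 16, 8
keep = torch.rand(b, h, kv) > 0.5
keep[..., 0] = True
out = attention_with_headwise_mask(
    torch.randn(b, h, q, d), torch.randn(b, h, kv, d), torch.randn(b, h, kv, d), keep
)
print(out.shape)  # torch.Size([1, 4, 1, 8])
```

Note that such masking only simulates head-specific eviction for quality measurements; the cache itself stays rectangular, so no memory is actually saved.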
I just pushed an updated version (commit) without `exec`. It's a bit cleaner, but the downside is that it adds a lot of lines of code (i.e., thousands!). How it works: instead of patching the attention code at runtime with `exec`, the patched modeling code is written out explicitly, which is why the line count grows so much.
The above proposal will be deprecated with v4.48 of transformers. Another idea: https://docs.flashinfer.ai/api/decode.html#batch-decoding
Hi @SimJeg, the approximate masking method you mentioned is indeed a clever way to simulate head-specific compression. It minimizes additional code requirements and is a feasible approach. From my understanding of the Transformers library, it seems to prioritize a single-model-file policy over strictly adhering to the DRY principle, so introducing a new modeling file or class does not appear to conflict with this policy. In my forked repository, the primary objective was to achieve efficiency within the head-compression paradigm and align with standard computation; however, this inevitably resulted in increased code complexity. If we are open to trading off some computational efficiency for cleaner code, several potential optimization points could be considered.
As for the trade-off itself, I believe it primarily revolves around two aspects: code simplicity and computational efficiency. Based on this, it would be worth exploring a more balanced and suitable implementation approach. Regarding the additional code, it mainly involves adding new subclasses to the existing modeling files, which seems to align with the commonly pursued open-closed principle. What are your thoughts on this?
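To illustrate the efficiency side of that trade-off: an efficient head-specific cache has to store a different number of entries per head, which a standard rectangular KV tensor cannot express. Below is a minimal sketch of a flattened ("ragged") layout with per-head offsets, the kind of structure a head-specific attention kernel could index into. The exact layout used by the AdaKV kernel is not shown in this thread, so the names and shapes here are assumptions.

```python
import torch

def flatten_headwise_cache(key_states, keep_mask):
    """Pack variable-length per-head KV entries into one flat buffer.

    key_states: (num_heads, seq_len, head_dim)
    keep_mask:  (num_heads, seq_len) boolean, True for retained entries.

    Returns (flat_keys, offsets) where flat_keys has shape (total_kept, head_dim)
    and offsets[h] .. offsets[h + 1] delimit head h's entries, i.e. the ragged
    layout a head-specific attention kernel would index into.
    """
    lengths = keep_mask.sum(dim=1)                                  # (num_heads,)
    offsets = torch.cat([torch.zeros(1, dtype=torch.long), lengths.cumsum(0)])
    flat_keys = key_states[keep_mask]                               # (total_kept, head_dim)
    return flat_keys, offsets

# Example: 2 heads with different budgets over 6 cached tokens.
keys = torch.randn(2, 6, 4)
mask = torch.tensor([[1, 1, 1, 0, 0, 0],
                     [1, 1, 1, 1, 1, 0]], dtype=torch.bool)
flat, offsets = flatten_headwise_cache(keys, mask)
print(flat.shape, offsets)  # torch.Size([8, 4]) tensor([0, 3, 8])
```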
🚀 Feature
Adding support for head-specific KV cache compression, which employs variable compression rates for each attention head.
Motivation
Ada-KV[1] has demonstrated that employing different compression rates across attention heads can significantly enhance cache compression methods. Recently, numerous head-specific approaches, such as DuoAttention[2], RazorAttention[3], and HeadKV[4], have emerged, each introducing unique techniques to improve compression quality through head-specific methods. However, these methods involve handling variable-length cache entries across different heads, a feature that KVPress currently does not support. We believe supporting this feature will significantly enhance the flexibility of KVPress and align it with emerging head-specific compression strategies.
[1] Feng, Y., Lv, J., Cao, Y., Xie, X., & Zhou, S. K. (2024). Ada-KV: Optimizing KV Cache Eviction by Adaptive Budget Allocation for Efficient LLM Inference. arXiv preprint arXiv:2407.11550.
[2] Xiao, G., Tang, J., Zuo, J., Guo, J., Yang, S., Tang, H., ... & Han, S. (2024). DuoAttention: Efficient Long-Context LLM Inference with Retrieval and Streaming Heads. arXiv preprint arXiv:2410.10819.
[3] Tang, H., Lin, Y., Lin, J., Han, Q., Hong, S., Yao, Y., & Wang, G. (2024). RazorAttention: Efficient KV Cache Compression Through Retrieval Heads. arXiv preprint arXiv:2407.15891.
[4] Fu, Y., Cai, Z., Asi, A., Xiong, W., Dong, Y., & Xiao, W. (2024). Not All Heads Matter: A Head-Level KV Cache Compression Method with Integrated Retrieval and Reasoning. arXiv preprint arXiv:2410.19258.