AdaKVPress #38
Conversation
On RULER with 25% compression ratio and llama 3.1 8b instruct:
cc @FFY0 for the results
@SimJeg These results seem to align well with my previous implementation. In my earlier evaluation, with a 30% compression ratio and llama 3.1 8b instruct, SnapKV's average score was 79.9, which increased to 86.9 when combined with AdaKV.
More results on RULER 4k @FFY0, looks great! Hope my implementation has no flaw.
Hi @SimJeg, the results look great and align well with my implementation: at a 0.1 compression ratio, SnapKV improves from 87.7 to 92.9 with Ada-SnapKV. Additionally, thank you for the results of Ada-expected-attention; it looks really promising! I believe this will significantly aid future research on head-specific compression.
@FFY0 I'm launching additional benchmarks using …
@SimJeg Yes, your points are absolutely right, and these directions could indeed further enhance the performance of AdaKV. I'd be happy to share my thoughts as well:

On the Setting of Alpha
Our experiments on LongBench indicate that smaller alpha values perform better under smaller budgets, while larger values are more effective for larger budgets. In AdaKV, we deliberately avoided fine-tuning this parameter and instead used a fixed value across all experiments to demonstrate its robustness. However, if further optimization is desired, adjusting alpha could yield noticeable performance improvements. Additionally, I observed unstable performance drops on certain datasets when alpha was set to 0. To mitigate this, I recommend using a very small but non-zero alpha value, such as 0.05, to improve performance for smaller budgets while maintaining stability.

On Head-Wise Budget Allocation Across Layers
In the early stages, we experimented with head-wise budget allocation across layers and observed performance gains. However, we eventually discontinued this approach for the following reasons:
That said, in context-only compression scenarios, where the compressed cache can be stored offline and reused for future questions, frequent prefill operations are unnecessary. In such cases, cross-layer scheduling could be a promising optimization direction. While I didn't conduct experiments on this aspect, I believe it is worth exploring, as related works on layer budget allocation have supported the feasibility of this direction.
Thanks for sharing your thoughts. I will keep using 0.2 as the default, as you initially proposed. Using alpha=0 I get the same results on RULER for SnapKV and ExpectedAttention (±0.001). I agree on head-wise allocation, interesting to know you got performance gains too!
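As a rough illustration of the alpha safeguard discussed in the exchange above, the sketch below blends a uniform per-head floor (controlled by alpha) with a score-driven global top-k allocation. The function name, the pooled-score rule, and the exact split are illustrative assumptions, not the AdaKV paper's or this PR's exact allocation logic.

```python
# Illustrative sketch only: a safeguard-style head-wise budget allocation.
# alpha blends a uniform per-head floor with score-driven allocation; the
# function name and the global top-k rule are assumptions for illustration.
import torch

def allocate_head_budgets(scores: torch.Tensor, total_budget: int, alpha: float) -> torch.Tensor:
    """scores: (num_heads, seq_len) token importance per head -> per-head budgets."""
    num_heads, seq_len = scores.shape
    floor = int(alpha * total_budget / num_heads)      # guaranteed slots per head
    budgets = torch.full((num_heads,), floor)
    remaining = total_budget - floor * num_heads       # roughly (1 - alpha) * budget
    # Hand the remaining slots to whichever heads own the largest scores globally
    top = scores.flatten().topk(remaining).indices
    budgets += torch.bincount(top // seq_len, minlength=num_heads)
    return budgets

# Example: 8 heads, 1024 tokens, keep 25% of the cache
scores = torch.rand(8, 1024)                           # e.g. SnapKV observation-window scores
print(allocate_head_budgets(scores, total_budget=2048, alpha=0.2))
print(allocate_head_budgets(scores, total_budget=2048, alpha=0.0))  # purely score-driven
```

With alpha=0 every slot follows the scores, so a head can be almost entirely starved; a small non-zero floor such as 0.05 guarantees each head a minimal budget, which is the stability argument made above.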
Force-pushed from 32d6269 to 3ac3df2
LGTM, thanks a lot for the neat implementation!
PR description
This PR introduces a new press called AdaKVPress, following the great work of @FFY0. This is the first press achieving head-wise compression. Instead of adding a new kernel as initially proposed, I instead "fake" the compression by replacing the pruned keys with a fake key K such that exp(QK^T) = 0 (i.e. no effect in attention). The computation of the fake keys is done at every decoding step and is achieved through patching the newly introduced ALL_ATTENTION_FUNCTIONS in transformers. The patch is applied only when the attention module has a masked_key_indices attribute which is not None, ensuring compatibility with previous work. Tests have not been written yet.
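As a minimal, self-contained illustration of the fake-key trick described above (assuming a single query per decoding step and an arbitrary large constant; this is not the PR's actual code), one can pick the fake key so that its dot product with the current query is a large negative number, making its softmax weight numerically zero:

```python
# Minimal sketch of the fake-key idea (not the PR's actual implementation).
# Assumption: one query per decoding step; LARGE is an arbitrary big constant.
import torch

torch.manual_seed(0)
d, n_keys = 64, 8
q = torch.randn(d)                          # query of the current decoding step
keys = torch.randn(n_keys, d)
values = torch.randn(n_keys, d)
masked_key_indices = torch.tensor([2, 5])   # positions pruned for this head

# Pick k_fake so that q @ k_fake = -LARGE: its attention weight underflows to 0,
# i.e. "no effect in attention" without physically removing entries from the cache.
LARGE = 1e4
k_fake = -LARGE * q / (q @ q)
keys[masked_key_indices] = k_fake

weights = torch.softmax(keys @ q / d**0.5, dim=0)
print(weights[masked_key_indices])          # ~0: pruned positions contribute nothing
output = weights @ values                   # same result as dropping those keys
```

Because such a fake key depends on the current query, it has to be recomputed at each decoding step, which matches the description above of doing the computation inside the patched attention call rather than once in the cache.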
New press checklist (if applicable)
- mypress_press.py added in the presses directory
- MyPress imported in __init__.py
- README.md updated with a 1 liner about my new press in the Available presses section
- default_presses list in tests/default_presses.py updated