
AdaKVPress #38

Merged
merged 27 commits into main from simon/adakv-press-448 on Jan 13, 2025

Conversation

SimJeg (Collaborator) commented on Jan 9, 2025:

PR description

This PR introduces a new press called AdaKVPress, following the great work of @FFY0.

This is the first press achieving head-wise compression. Instead of adding a new kernel as initially proposed, I "fake" the compression by replacing the pruned keys with a fake key K such that exp(Q · K) = 0 (i.e. it has no effect in attention). The fake keys are computed at every decoding step by patching the newly introduced ALL_ATTENTION_FUNCTIONS in transformers. The patch is applied only when the attention module has a masked_key_indices attribute that is not None, ensuring compatibility with previous work.
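For reference, here is a minimal sketch of that trick in plain PyTorch (illustrative only, not the actual kvpress / transformers patch; function and argument names are made up): pruned key positions receive a score of -inf inside the attention computation, so softmax assigns them exactly zero weight, which is equivalent to replacing them with a fake key K whose exponentiated score is 0.

```python
import math
import torch

def masked_eager_attention(query, key, value, masked_key_indices=None):
    """Toy sketch of the masking trick described above (not the kvpress code).

    query:  (batch, heads, q_len, d)
    key:    (batch, heads, k_len, d)
    value:  (batch, heads, k_len, d)
    masked_key_indices: tuple of index tensors (batch_idx, head_idx, key_idx)
        marking the head-wise pruned positions, or None for no compression.
    """
    scores = torch.matmul(query, key.transpose(-1, -2)) / math.sqrt(query.shape[-1])
    if masked_key_indices is not None:
        b, h, k = masked_key_indices
        # exp(-inf) = 0: the pruned keys get exactly zero attention weight,
        # emulating head-wise eviction without a custom kernel
        scores[b, h, :, k] = float("-inf")
    weights = torch.softmax(scores, dim=-1)
    return torch.matmul(weights, value)
```

In the PR itself the zero weight is obtained by materializing the fake keys and dispatching through the patched ALL_ATTENTION_FUNCTIONS entry; the sketch above only illustrates why the output is unaffected by the pruned positions.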

Tests have not been written yet.

New press checklist (if applicable)

  • I added mypress_press.py in the presses directory
  • I added MyPress in __init__.py
  • I updated the README.md with a one-liner about my new press in the Available presses section
  • I added my press in the default_presses list in tests/default_presses.py

SimJeg (Collaborator, Author) commented on Jan 9, 2025:

On RULER with 25% compression ratio and llama 3.1 8b instruct:

  • SnapKV: 81.8% --> 88.1% with AdaKV
  • ExpectedAttention: 88.5% --> 93.6% with AdaKV

cc @FFY0 for the results

(Review comment on pyproject.toml, resolved)
FFY0 commented on Jan 9, 2025:

> On RULER with 25% compression ratio and llama 3.1 8b instruct:
>
>   • SnapKV: 81.8% --> 88.1% with AdaKV
>   • ExpectedAttention: 88.5% --> 93.6% with AdaKV
>
> cc @FFY0 for the results

@SimJeg These results seem to align well with my previous implementation. In my earlier evaluation, with a 30% compression ratio and llama 3.1 8b instruct, SnapKV's average score was 79.9, which increased to 86.9 when combined with AdaKV.

SimJeg (Collaborator, Author) commented on Jan 9, 2025:

More results on RULER 4k @FFY0, looks great! I hope my implementation has no flaw.

FFY0 commented on Jan 10, 2025:

> More results on RULER 4k @FFY0, looks great! I hope my implementation has no flaw.

Hi @SimJeg, the results look great and align well with my implementation:

  • At a 0.1 compression ratio, SnapKV improves from 87.7 to 92.9 with Ada-SnapKV.
  • At a 0.5 compression ratio, SnapKV improves from 69.8 to 77.7 with Ada-SnapKV.

Additionally, thank you for the results of Ada-expected-attention; it looks really promising! I believe this will significantly aid future research on head-specific compression.

SimJeg (Collaborator, Author) commented on Jan 13, 2025:

@FFY0 I'm launching additional benchmarks using alpha_safeguard=0 for both SnapKV and Expected Attention. One question: have you tried applying the AdaKV logic at the level of the whole model? I.e., instead of taking the top-k scores per layer, take the top-k scores across all layers? Could be nice to try at some point.

(Review comment on kvpress/presses/base_press.py, resolved)
SimJeg changed the title from "Transformers 4.48 and AdaKVPress" to "AdaKVPress" on Jan 13, 2025.
FFY0 commented on Jan 13, 2025:

> @FFY0 I'm launching additional benchmarks using alpha_safeguard=0 for both SnapKV and Expected Attention. One question: have you tried applying the AdaKV logic at the level of the whole model? I.e., instead of taking the top-k scores per layer, take the top-k scores across all layers? Could be nice to try at some point.

@SimJeg Yes, your points are absolutely right, and these directions could indeed further enhance the performance of AdaKV. I’d be happy to share my thoughts as well:

On the Setting of Alpha

Our experiments on LongBench indicate that smaller alpha values perform better under smaller budgets, while larger values are more effective for larger budgets. In AdaKV, we deliberately avoided fine-tuning this parameter and instead used a fixed value across all experiments to demonstrate its robustness. However, if further optimization is desired, adjusting alpha could yield noticeable performance improvements. Additionally, I observed unstable performance drops on certain datasets when alpha was set to 0. To mitigate this, I recommend using a very small but non-zero alpha value, such as 0.05, to improve performance for smaller budgets while maintaining stability.
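To make the role of alpha concrete, here is a rough sketch of a safeguarded head-wise allocation within one layer, as I understand it from the AdaKV approach (the function and variable names are illustrative, not the kvpress API): every head is guaranteed an alpha fraction of the uniform per-head budget, and the rest is distributed to the globally highest-scoring entries of the layer.

```python
import torch

def safeguarded_head_budgets(scores: torch.Tensor, total_budget: int, alpha: float = 0.2) -> torch.Tensor:
    """Illustrative sketch: allocate a layer's KV budget across its heads.

    scores: (num_heads, seq_len) importance scores of the cached entries
    total_budget: number of KV entries to keep in this layer
    alpha: fraction of the uniform per-head budget guaranteed to every head
    """
    num_heads, seq_len = scores.shape

    # 1. Safeguard: every head keeps at least alpha * (total_budget / num_heads) entries
    safeguard = int(alpha * total_budget / num_heads)
    budgets = torch.full((num_heads,), safeguard, dtype=torch.long)

    # 2. Adaptive part: the remaining budget goes to the globally top-scoring
    #    entries across all heads of this layer
    remaining = total_budget - safeguard * num_heads
    top_idx = scores.flatten().topk(remaining).indices
    budgets += torch.bincount(top_idx // seq_len, minlength=num_heads)

    # Each head h then keeps its own top budgets[h] entries
    return budgets
```

With alpha=1 this degenerates to a uniform per-head budget, and with alpha=0 the allocation is fully adaptive, which is the setting reported above as occasionally unstable.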

On Head-Wise Budget Allocation Across Layers

In the early stages, we experimented with head-wise budget allocation across layers and observed performance gains. However, we eventually discontinued this approach for the following reasons:

  1. Theoretical Alignment: Intra-layer scheduling aligns more closely with our theoretical framework, providing stronger interpretability and a solid foundation for future research.
  2. Memory Efficiency: Cross-layer scheduling requires retaining the KV cache for all layers until a unified scheduling step is performed, delaying compression. In contrast, intra-layer scheduling allows immediate compression after prefill for each layer, which is critical for methods like SnapKV that aim to minimize peak memory usage during prefill, particularly in question-aware compression scenarios.

That said, in context-only compression scenarios—where the compressed cache can be stored offline and reused for future questions—frequent prefill operations are unnecessary. In such cases, cross-layer scheduling could be a promising optimization direction. While I didn't conduct experiments on this aspect, I believe it is worth exploring, as related works on layer budget allocation have supported the feasibility of this direction.
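For concreteness, here is a rough sketch of the cross-layer variant discussed in this thread (again with illustrative names, not kvpress code): the top-k is taken over the scores of all layers at once, and each layer's budget is simply the number of its entries that made the global cut.

```python
import torch

def cross_layer_budgets(scores_per_layer: list[torch.Tensor], total_budget: int) -> torch.Tensor:
    """Illustrative sketch: allocate a model-wide KV budget across layers.

    scores_per_layer: one (num_heads, seq_len) score tensor per layer
    total_budget: number of KV entries to keep across the whole model
    """
    # Global top-k over all layers and heads at once
    flat = torch.cat([s.flatten() for s in scores_per_layer])
    top_idx = flat.topk(total_budget).indices

    # Map each selected entry back to its layer to obtain per-layer budgets
    sizes = torch.tensor([s.numel() for s in scores_per_layer])
    offsets = torch.cumsum(sizes, dim=0)
    layer_of_top = torch.searchsorted(offsets, top_idx, right=True)
    return torch.bincount(layer_of_top, minlength=len(scores_per_layer))
```

As noted above, this requires keeping every layer's cache around until the global allocation step, so it trades peak memory during prefill for allocation flexibility.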

SimJeg (Collaborator, Author) commented on Jan 13, 2025:

Thanks for sharing your thoughts. I will keep 0.2 as the default, as you initially proposed. Using alpha=0 I get the same results on RULER for SnapKV and ExpectedAttention (± 0.001). I agree regarding head-wise allocation across layers; interesting to know you got performance gains too!

SimJeg force-pushed the simon/adakv-press-448 branch from 32d6269 to 3ac3df2 on January 13, 2025.
maxjeblick (Collaborator) left a comment:
LGTM, thanks a lot for the neat implementation!

SimJeg merged commit fe4610e into main on Jan 13, 2025 (2 checks passed).
SimJeg deleted the simon/adakv-press-448 branch on January 13, 2025.