Head-Specific KV Cache Compression Feature (Ada-SnapKV, AdaKV) #25

FFY0 · 2024-12-07T07:20:00Z

Add Feature of Head-Specific KV Cache Compression at issue

I have got some results of Ada-SnapKV on the 4K Ruler benchmark. The results look promising. I have placed the corresponding results in a new notebook, which also includes a brief explanation of the flattened KV cache layout employed by head-specific KV cache compression during computation and a tutorial on how to customize new Head-Specific methods based on the latest AdaBasePress.

Additionally, it seems that the Head-Specific KV Cache Compression feature may require custom unit test workflow, such as instantiating new attention classes before loading models. As a result, simply adding Ada-SnapKV into the current unit test may cause failures. I will attempt to resolve this issue in the future. Feel free to let me know if there's anything else you'd like me to refine or if you need additional details!

add_AdaKV_initial_version AdaKV Signed-off-by: FFY0 <[email protected]>

Add ThinKPress (NVIDIA#20)

Signed-off-by: FFY0 <[email protected]>

FFY0 · 2024-12-07T09:54:00Z

I also have some confusion regarding batch support in the current repository. It seems that much of the code assumes a batch size of 1. This is because the compression logic doesn't appear to account for padding tokens caused by varying sequence lengths across different samples. Meanwhile, the current unit tests seem to use dummy inputs with a batch size greater than 1.

To align with other methods, the current implementation of Ada-SnapKV is limited to scenarios where the batch size is 1. If necessary, I will explore support for multiple batch sizes in the future.

Signed-off-by: FFY0 <[email protected]>

FFY0 · 2024-12-11T13:37:36Z

Hi @SimJeg ,

I have added unit tests in test_presses.py for Ada-SnapKV and successfully validated them. Additionally, I have updated the architecture of Ada-SnapKV to align with the refactored code in the main branch, and it seems to be working well. If you have any suggestions, please feel free to let me know.

It seems the current CI Action workflows require approval before they can run. I'm not very familiar with this process, so please let me know if there’s anything else I can contribute to. Additionally, the CI Action might fail due to the requirements of the new kernel build process. This issue may need further discussion on how to manage the kernel moving forward to identify a solution.

SimJeg · 2024-12-11T14:06:43Z

Hi @FFY0, thanks for your hard work on this PR, the results you shared look really promising. One of the goal of the recent refacto was to welcome more complex presses such yours. We started to look at your PR and will come back with feedback.

SimJeg · 2025-01-07T10:57:30Z

@FFY0 beyond the discussion on the best way to implement head-wise compression, I tried to implement my own version of AdaKV here. What do you think of the interface:

press = AdaKVPress(scorer)

where scorer is any ScorerPress object.

FFY0 · 2025-01-09T06:56:12Z

Hi @SimJeg, this interface looks great. Using a wrapper avoids many modifications in current architecture. I tried running the code, but I seem to have encountered a minor issue.

def search_hyperplane(X, max_iter=1000):
    """
    Search for an hyperplane Y such that for every Xi, <Xi, Y> <= 1 (simple perceptron)
    Returns LARGE_NEGATIVE_FLOAT * Y to ensure exp(<X, Y>) = 0
    """
    Y = X.mean(1)
    for _ in range(max_iter):
        mask = (X * Y.unsqueeze(1)).sum(dim=2, keepdim=True) <= 1
        if not mask.any():
            return LARGE_NEGATIVE_FLOAT * Y
        Y += (X * mask).sum(1) / mask.sum(1).clamp(min=1)
    raise ValueError("Could not find fake keys such that for every query q, exp(<q, k>) = 0")

It seems that the condition in the function should be if mask.all(). Exiting the search with if not mask.any() can lead to large values in the qk dot product.

After changing it to if mask.all(), I encountered another issue: "Could not find fake keys such that for every query q." Since I'm not fully understanding the logic of search_hyperplane func, I switched back to using the least squares method.
However, using the least squares method also resulted in the same failures. Upon investigation, I found that the problem arises because the input includes too many query states with a large sequence length (e.g., 30), which complicates solving fake key states.

I think one possible solution is to sequentially call the attention method for each query state within the wrapper func during decoding. This also allows for solving fake key states for each query state easily, for example:

key_states = - query_states * 1E5

SimJeg · 2025-01-09T07:42:59Z

@FFY0 there are still several issues to fix with this approach, I'm working on it ! Calling attention sequentially would work indeed but I'm confident I can fix search_hyperplane

SimJeg · 2025-01-09T11:07:58Z

@FFY0 it should be fixed now, and early results with on 1% of RULER with AdaSnapKV and AdaExpectedAttention are promising ! I will benchmarks to see if I can reproduce the ones you shared at the beginning of the PR.

FFY0 · 2025-01-09T11:58:38Z

@SimJeg This implementation is really impressive! I'm looking forward to the test results.

SimJeg · 2025-01-13T17:35:27Z

Closing this PR as #38 has been merged

FFY0 added 12 commits November 30, 2024 13:07

init_adakv

662f3f2

add_AdaKV_initial_version AdaKV Signed-off-by: FFY0 <[email protected]>

refactor metadata

92c9dde

refactor meta data to cache class&remove token NotImplemented

3708a90

support multi questions

6778c6a

Merge pull request #1 from NVIDIA/main

1265921

Add ThinKPress (NVIDIA#20)

minor

ca2e990

merge from main:think_press

fe33d14

code_clean

20fb2da

refactor replace attention func& add a notebook

ece836e

clean code

14b1d5e

update notebook and corresponding results

f901fa0

clean code

c4c2daa

Signed-off-by: FFY0 <[email protected]>

SimJeg mentioned this pull request Dec 9, 2024

Bridge to vllm #26

Closed

FFY0 added 4 commits December 11, 2024 12:55

Merge branch 'NVIDIA:main' into main

c078d05

solving merge conflict from NVIDIA/main_12_11

6ad02ca

adapt adakv to the refactored press arch

254d26f

clean code

1c7ff6e

Signed-off-by: FFY0 <[email protected]>

FFY0 marked this pull request as ready for review December 11, 2024 11:14

FFY0 marked this pull request as draft December 11, 2024 11:22

update discription of Ada-SnapKV

1213a47

FFY0 marked this pull request as ready for review December 11, 2024 13:37

SimJeg closed this Jan 13, 2025

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Head-Specific KV Cache Compression Feature (Ada-SnapKV, AdaKV) #25

Head-Specific KV Cache Compression Feature (Ada-SnapKV, AdaKV) #25

FFY0 commented Dec 7, 2024

FFY0 commented Dec 7, 2024

FFY0 commented Dec 11, 2024

SimJeg commented Dec 11, 2024

SimJeg commented Jan 7, 2025 •

edited

Loading

FFY0 commented Jan 9, 2025

SimJeg commented Jan 9, 2025

SimJeg commented Jan 9, 2025

FFY0 commented Jan 9, 2025

SimJeg commented Jan 13, 2025

Head-Specific KV Cache Compression Feature (Ada-SnapKV, AdaKV) #25

Head-Specific KV Cache Compression Feature (Ada-SnapKV, AdaKV) #25

Conversation

FFY0 commented Dec 7, 2024

FFY0 commented Dec 7, 2024

FFY0 commented Dec 11, 2024

SimJeg commented Dec 11, 2024

SimJeg commented Jan 7, 2025 • edited Loading

FFY0 commented Jan 9, 2025

SimJeg commented Jan 9, 2025

SimJeg commented Jan 9, 2025

FFY0 commented Jan 9, 2025

SimJeg commented Jan 13, 2025

SimJeg commented Jan 7, 2025 •

edited

Loading