
compressed-tensors support for KV cache Quantization #103

Open · wants to merge 1 commit into base: upstream-a564d10af

Conversation


@dbogunowicz dbogunowicz commented Jun 20, 2024

Feature description

Implements quantized KV cache support for transformer models that have been quantized using compressed-tensors.

Introduces (roughly sketched below):

  • CompressedTensorsQuantizedCacheConfig - a config object that stores the static qparams for quantizing/dequantizing the KV cache
  • minor improvements to the QuantizedCache interface, to make it more general for other implementations in the future
  • CompressedTensorsCache - a very lightweight wrapper around QuantizedCache that enables KV cache quantization
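
A hedged sketch of how these pieces could relate; the field names below (num_bits, type_, key_scale, value_scale) are illustrative assumptions rather than the exact attributes added in this PR:

# Illustrative sketch only -- the real classes are defined in this PR and in compressed-tensors.
from dataclasses import dataclass
import torch

@dataclass
class _KVCacheQuantParams:       # assumed shape of the static qparams held by the cache config
    num_bits: int                # e.g. 8
    type_: str                   # e.g. "int"
    key_scale: torch.Tensor      # static scale used to (de)quantize the key cache
    value_scale: torch.Tensor    # static scale used to (de)quantize the value cache

# CompressedTensorsQuantizedCacheConfig carries parameters like the above, loaded from the
# quantized checkpoint via .from_pretrained(model_id); CompressedTensorsCache then consumes
# them inside the generic QuantizedCache quantize/dequantize flow.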

Manual testing:

from transformers import AutoTokenizer, AutoModelForCausalLM
from transformers.cache_utils import CompressedTensorsQuantizedCacheConfig
import torch

model_id = "/root/compressed-tensors/llama1.1b_new_quant_out" # quantized model with kv cache quantization enabled

tokenizer = AutoTokenizer.from_pretrained("Xenova/llama2.c-stories15M") # somehow the tokenizer is missing from the model in `model_id`
tokenizer.pad_token_id = tokenizer.eos_token_id
model = AutoModelForCausalLM.from_pretrained(model_id, attn_implementation="eager").to("cuda:0")

inputs = tokenizer("I like rock music because", return_tensors="pt").to(model.device)

cache_config = CompressedTensorsQuantizedCacheConfig.from_pretrained(model_id)

out_quant = model.generate(**inputs, cache_implementation="quantized", cache_config=cache_config, min_new_tokens=40, return_dict_in_generate=True)
out = model.generate(**inputs, min_new_tokens=40, return_dict_in_generate=True)

assert out_quant.sequences.allclose(out.sequences) # assert same tokens get generated regardless of cache type
assert out_quant.past_key_values._quantized_key_cache[-1].dtype == torch.int8 # assert that we are actually caching quantized tensors

Note

Compatible with this branch of compressed-tensors: neuralmagic/compressed-tensors#86
Pending items: add tests, add compressed-tensors as a transformers dependency, and fix the ugly import issues (circular imports between transformers and compressed-tensors).
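
On the circular-import point, one common way out is to defer the compressed-tensors import into the function body instead of importing it at module load time. A minimal sketch; the imported module path and class are assumptions for illustration:

def _get_compressed_tensors_quant_args():
    # Deferred import: executed only when KV cache quantization is actually requested,
    # so importing transformers does not require compressed-tensors to be importable
    # (and vice versa).
    from compressed_tensors.quantization import QuantizationArgs
    return QuantizationArgs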

@@ -16,6 +19,13 @@
if is_hqq_available():
    from hqq.core.quantize import Quantizer as HQQQuantizer

if is_compressed_tensors_available() or True: # hack for now
@dbogunowicz dbogunowicz (Author) commented Jun 20, 2024

I will take care of this hack once the PR is in the "landable" state
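
For reference, the eventual guard could mirror how transformers probes other optional backends such as hqq; a hedged sketch (the real helper may end up looking different):

import importlib.util

def is_compressed_tensors_available() -> bool:
    # Probe for the package without importing it eagerly, in the spirit of the
    # existing optional-dependency checks (e.g. is_hqq_available).
    return importlib.util.find_spec("compressed_tensors") is not None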

    self.key_cache.append(torch.zeros(0, dtype=key_states.dtype, device=key_states.device))
    self.value_cache.append(torch.zeros(0, dtype=key_states.dtype, device=key_states.device))
    keys_to_return, values_to_return = key_states, value_states
else:
    dequant_key = self._dequantize(self._quantized_key_cache[layer_idx])
    dequant_value = self._dequantize(self._quantized_value_cache[layer_idx])
    dequant_key = self._dequantize(self._quantized_key_cache[layer_idx], cache_type="key")
@dbogunowicz dbogunowicz (Author) commented:

Instead of rewriting the update() method for CompressedTensorsCache, I decided to expand the interface.
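
Roughly: update() in the base class stays untouched, and the new cache_type argument lets a subclass pick per-stream static qparams. A hedged sketch of what such an override could look like (symmetric int8 for simplicity; attribute names like key_scale/value_scale and the compute_dtype handling are assumptions, and torch / QuantizedCache are assumed to be in scope as in cache_utils.py):

class CompressedTensorsCache(QuantizedCache):
    def _quantize(self, tensor, axis=None, cache_type: str = "key"):
        # Use the static scale that belongs to this cache stream ("key" or "value").
        scale = self.key_scale if cache_type == "key" else self.value_scale
        return torch.clamp(torch.round(tensor / scale), -128, 127).to(torch.int8)

    def _dequantize(self, q_tensor, cache_type: str = "key"):
        scale = self.key_scale if cache_type == "key" else self.value_scale
        return q_tensor.to(self.compute_dtype) * scale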


@staticmethod
def _establish_quant_dtype(num_bits: int, type_: str) -> torch.dtype:
    if num_bits == 8 and type_ == "int":
@dbogunowicz dbogunowicz (Author) commented:

We should potentially add more supported quantization types; to be discussed with @bfineran.
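
If more schemes get added, the mapping could grow along these lines; a sketch only, where the non-int8 entries are assumptions and exactly which (num_bits, type_) pairs we support is the open question:

@staticmethod
def _establish_quant_dtype(num_bits: int, type_: str) -> torch.dtype:
    # Map the compressed-tensors quantization args onto a torch dtype for cache storage.
    # Only the (8, "int") branch exists in this PR; the other entries are possible extensions.
    supported = {
        (8, "int"): torch.int8,
        (8, "uint"): torch.uint8,
        (8, "float"): torch.float8_e4m3fn,  # assumption: fp8 KV cache, requires torch >= 2.1
    }
    if (num_bits, type_) not in supported:
        raise ValueError(
            f"KV cache quantization with num_bits={num_bits} and type={type_} is not supported"
        )
    return supported[(num_bits, type_)]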

@dbogunowicz dbogunowicz requested review from mgoin, Satrat and bfineran June 20, 2024 12:57
@dbogunowicz dbogunowicz (Author) commented:

@mgoin

from transformers import AutoTokenizer, AutoModelForCausalLM
from transformers.cache_utils import CompressedTensorsQuantizedCacheConfig, CompressedTensorsCache

model_id = "/root/compressed-tensors/llama1.1b_new_quant_out" # quantized model with kv cache quantization enabled

tokenizer = AutoTokenizer.from_pretrained("Xenova/llama2.c-stories15M") # somehow the tokenizer is missing from the model in `model_id`
tokenizer.pad_token_id = tokenizer.eos_token_id
model = AutoModelForCausalLM.from_pretrained(model_id, attn_implementation="eager").to("cuda:0")

inputs = tokenizer("I like rock music because", return_tensors="pt").to(model.device)

cache_config = CompressedTensorsQuantizedCacheConfig.from_pretrained(model_id)
cache_object = CompressedTensorsCache(cache_config)

out = model(**inputs)
out_quant = model(**inputs, past_key_values=cache_object)

assert (out.logits == out_quant.logits).all()

working fine!
