As discussed in How to use low bit KV Cache #721, we should generalize the tensorwise qk_scale and v_scale to headwise qk_scale and v_scale. For the original scalar tensorwise qk_scale and v_scale inputs, we should repeat them for all heads to obtain headwise scale tensors.
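For illustration, a minimal sketch of how a scalar tensorwise scale could be expanded into a headwise scale tensor; `to_headwise` and its arguments are hypothetical names, not part of the current flashinfer API:

```python
import torch

def to_headwise(scale, num_heads: int, device="cuda") -> torch.Tensor:
    """Accept either a python float (tensorwise) or a [num_heads] tensor
    (headwise) and always return a [num_heads] fp32 scale tensor."""
    if isinstance(scale, (int, float)):
        # Repeat the scalar for every head so the kernel only needs one
        # (headwise) code path.
        return torch.full((num_heads,), float(scale), dtype=torch.float32, device=device)
    assert scale.shape == (num_heads,), "headwise scale must have one entry per head"
    return scale.to(torch.float32)

# e.g. qk_scale = to_headwise(0.125, num_kv_heads); v_scale = to_headwise(1.0, num_kv_heads)
```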
We should apply v_scale inside the kernel rather than outside (see flashinfer/flashinfer/prefill.py, line 2027 at 9f5fbee), because v might be using low-precision data types, as suggested by @nandor.
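A reference sketch of what "apply v_scale inside the kernel" means: the scales are applied while the accumulation still happens in fp32, rather than multiplying an already low-precision output afterwards. Shapes, layout, and the function name are assumptions for illustration only:

```python
import torch

def attn_ref(q, k, v, qk_scale, v_scale):
    """Reference attention with headwise scales applied "inside the kernel".
    Assumed layout: q [num_heads, head_dim]; k, v [num_heads, seq_len, head_dim];
    qk_scale, v_scale [num_heads]."""
    # Dequantize / rescale k and v in fp32 before accumulation.
    logits = torch.einsum("hd,hnd->hn", q.float(), k.float()) * qk_scale[:, None]
    p = torch.softmax(logits, dim=-1)
    # v_scale is folded in here, while the accumulator is still fp32.
    o = torch.einsum("hn,hnd->hd", p, v.float() * v_scale[:, None, None])
    return o
```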
It's a known issue that our CUDA-core-based fp8 decoding kernel is slow, so we should always select use_tensor_cores for 8-bit KV-Cache.
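A sketch of what such a policy could look like on the user side, assuming the `use_tensor_cores` constructor argument of `BatchDecodeWithPagedKVCacheWrapper`; the dtype check is an assumption about where the decision would live, not existing behavior:

```python
import torch
import flashinfer

kv_dtype = torch.float8_e4m3fn  # example 8-bit KV-Cache dtype
workspace = torch.empty(128 * 1024 * 1024, dtype=torch.uint8, device="cuda")

decode_wrapper = flashinfer.BatchDecodeWithPagedKVCacheWrapper(
    workspace,
    kv_layout="NHD",
    # CUDA-core fp8 decode is slow, so force the tensor-core path for 8-bit KV.
    use_tensor_cores=(kv_dtype in (torch.float8_e4m3fn, torch.float8_e5m2, torch.int8)),
)
```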
We should also enable int8 KV-Cache: with headwise qk_scale/v_scale, int8 KV-Cache can also achieve desirable performance, and the wheel size stays under control in JIT mode.
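For reference, a simple symmetric per-head int8 quantization sketch that produces the headwise scales described above; the [num_tokens, num_kv_heads, head_dim] layout and function name are assumptions:

```python
import torch

def quantize_kv_int8_headwise(k: torch.Tensor, v: torch.Tensor):
    """Symmetric per-head int8 quantization of k/v, returning headwise scales."""
    k_scale = k.abs().amax(dim=(0, 2)).clamp(min=1e-6) / 127.0  # [num_kv_heads]
    v_scale = v.abs().amax(dim=(0, 2)).clamp(min=1e-6) / 127.0  # [num_kv_heads]
    k_int8 = torch.clamp(torch.round(k / k_scale[None, :, None]), -128, 127).to(torch.int8)
    v_int8 = torch.clamp(torch.round(v / v_scale[None, :, None]), -128, 127).to(torch.int8)
    return k_int8, v_int8, k_scale.float(), v_scale.float()
```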
We should support fused-quantization append_kv_cache kernels that apply quantization together with appending data to the KV-Cache.
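A reference for the semantics such a fused kernel would implement (quantize the new tokens with the headwise scales and write them into the int8 cache in one pass); the contiguous cache layout, argument names, and function name are all assumptions for illustration:

```python
import torch

def append_kv_quantized(k_new, v_new, k_cache_int8, v_cache_int8,
                        k_scale, v_scale, seq_len: int) -> int:
    """Unfused reference of a fused quantize-and-append kernel.
    Assumed layout: [num_tokens, num_kv_heads, head_dim] for new tensors and cache."""
    n = k_new.shape[0]  # number of new tokens
    k_q = torch.clamp(torch.round(k_new / k_scale[None, :, None]), -128, 127).to(torch.int8)
    v_q = torch.clamp(torch.round(v_new / v_scale[None, :, None]), -128, 127).to(torch.int8)
    # Append the quantized tokens at the current sequence position.
    k_cache_int8[seq_len:seq_len + n] = k_q
    v_cache_int8[seq_len:seq_len + n] = v_q
    return seq_len + n
```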