As discussed in How to use low bit KV Cache #721, we should generalize the tensorwise qk_scale and v_scale to headwise qk_scale and v_scale. For the original scalar tensorwise qk_scale and v_scale inputs, we should repeat them for all heads to obtain headwise scale tensors.
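For illustration, a minimal sketch of how a scalar tensorwise scale could be expanded into a headwise scale tensor; `to_headwise` and its arguments are hypothetical names, not part of the current flashinfer API:

```python
import torch

def to_headwise(scale, num_heads: int, device="cuda") -> torch.Tensor:
    """Accept either a python float (tensorwise) or a [num_heads] tensor
    (headwise) and always return a [num_heads] fp32 scale tensor."""
    if isinstance(scale, (int, float)):
        # Repeat the scalar for every head so the kernel only needs one
        # (headwise) code path.
        return torch.full((num_heads,), float(scale), dtype=torch.float32, device=device)
    assert scale.shape == (num_heads,), "headwise scale must have one entry per head"
    return scale.to(torch.float32)

# e.g. qk_scale = to_headwise(0.125, num_kv_heads); v_scale = to_headwise(1.0, num_kv_heads)
```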
We should apply v_scale inside the kernel rather than outside (see flashinfer/flashinfer/prefill.py, line 2027 at 9f5fbee), because v might be using low-precision data types, as suggested by @nandor.
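A reference sketch of what "apply v_scale inside the kernel" means: the scales are applied while the accumulation still happens in fp32, rather than multiplying an already low-precision output afterwards. Shapes, layout, and the function name are assumptions for illustration only:

```python
import torch

def attn_ref(q, k, v, qk_scale, v_scale):
    """Reference attention with headwise scales applied "inside the kernel".
    Assumed layout: q [num_heads, head_dim]; k, v [num_heads, seq_len, head_dim];
    qk_scale, v_scale [num_heads]."""
    # Dequantize / rescale k and v in fp32 before accumulation.
    logits = torch.einsum("hd,hnd->hn", q.float(), k.float()) * qk_scale[:, None]
    p = torch.softmax(logits, dim=-1)
    # v_scale is folded in here, while the accumulator is still fp32.
    o = torch.einsum("hn,hnd->hd", p, v.float() * v_scale[:, None, None])
    return o
```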
It's a known issue that our CUDA-core-based fp8 decoding kernel is slow, so we should always select use_tensor_cores for 8-bit KV-Cache.
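A sketch of what such a policy could look like on the user side, assuming the `use_tensor_cores` constructor argument of `BatchDecodeWithPagedKVCacheWrapper`; the dtype check is an assumption about where the decision would live, not existing behavior:

```python
import torch
import flashinfer

kv_dtype = torch.float8_e4m3fn  # example 8-bit KV-Cache dtype
workspace = torch.empty(128 * 1024 * 1024, dtype=torch.uint8, device="cuda")

decode_wrapper = flashinfer.BatchDecodeWithPagedKVCacheWrapper(
    workspace,
    kv_layout="NHD",
    # CUDA-core fp8 decode is slow, so force the tensor-core path for 8-bit KV.
    use_tensor_cores=(kv_dtype in (torch.float8_e4m3fn, torch.float8_e5m2, torch.int8)),
)
```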
We should also enable int8 KV-Cache: with headwise qk_scale/v_scale, int8 KV-Cache can also achieve desirable performance, and the wheel size stays under control in JIT mode.
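For reference, a simple symmetric per-head int8 quantization sketch that produces the headwise scales described above; the [num_tokens, num_kv_heads, head_dim] layout and function name are assumptions:

```python
import torch

def quantize_kv_int8_headwise(k: torch.Tensor, v: torch.Tensor):
    """Symmetric per-head int8 quantization of k/v, returning headwise scales."""
    k_scale = k.abs().amax(dim=(0, 2)).clamp(min=1e-6) / 127.0  # [num_kv_heads]
    v_scale = v.abs().amax(dim=(0, 2)).clamp(min=1e-6) / 127.0  # [num_kv_heads]
    k_int8 = torch.clamp(torch.round(k / k_scale[None, :, None]), -128, 127).to(torch.int8)
    v_int8 = torch.clamp(torch.round(v / v_scale[None, :, None]), -128, 127).to(torch.int8)
    return k_int8, v_int8, k_scale.float(), v_scale.float()
```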
We should support fused-quantization append_kv_cache kernels that apply quantization together with appending data to the KV-Cache.
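A reference for the semantics such a fused kernel would implement (quantize the new tokens with the headwise scales and write them into the int8 cache in one pass); the contiguous cache layout, argument names, and function name are all assumptions for illustration:

```python
import torch

def append_kv_quantized(k_new, v_new, k_cache_int8, v_cache_int8,
                        k_scale, v_scale, seq_len: int) -> int:
    """Unfused reference of a fused quantize-and-append kernel.
    Assumed layout: [num_tokens, num_kv_heads, head_dim] for new tensors and cache."""
    n = k_new.shape[0]  # number of new tokens
    k_q = torch.clamp(torch.round(k_new / k_scale[None, :, None]), -128, 127).to(torch.int8)
    v_q = torch.clamp(torch.round(v_new / v_scale[None, :, None]), -128, 127).to(torch.int8)
    # Append the quantized tokens at the current sequence position.
    k_cache_int8[seq_len:seq_len + n] = k_q
    v_cache_int8[seq_len:seq_len + n] = v_q
    return seq_len + n
```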