[ascend]feat: support kv int8 #103
base: main
Suggest making this runnable as python -m dlinfer.framework.lmdeploy_ext.quants.ascend_kv --model_path xxx (and so on), so that it can be executed directly with arguments.
Updated.
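For illustration, a minimal sketch of what a directly executable entry point could look like, using the standard-library argparse. Only the --model_path flag comes from the comment above; the function names and the --work_dir flag are hypothetical and are not taken from this PR.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Build the command-line parser for a hypothetical KV-quantization script."""
    parser = argparse.ArgumentParser(
        description="Sketch of a CLI for computing KV-cache quantization parameters."
    )
    # --model_path is the flag named in the review comment.
    parser.add_argument("--model_path", required=True, help="Path to the model.")
    # --work_dir is a hypothetical extra flag, added only to show the pattern.
    parser.add_argument("--work_dir", default="./work_dir", help="Output directory.")
    return parser


def main(argv=None) -> argparse.Namespace:
    """Parse arguments; the real calibration logic would be called from here."""
    args = build_parser().parse_args(argv)
    print(f"model_path={args.model_path} work_dir={args.work_dir}")
    return args


# When placed in the module, finish with:
#     if __name__ == "__main__":
#         main()
```

With a guard like the one in the final comment, invoking the module via python -m passes the command-line arguments straight to main().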
"dlinfer::fill_kv_cache", | ||
["key_cache", "value_cache"], | ||
default_value={ | ||
"quant_bits": 0, |
Could default values be added for these as well:
"k_scales_zeros": tuple()
"v_scales_zeros": tuple()
Added.
@@ -205,6 +220,9 @@ def paged_decode_attention(
softmax_scale: Optional[float],
alibi_slopes: Optional[Sequence[float]],
attn_output: Optional[Tensor],
kv_scales: Optional[Tensor],
kv_zeros: Optional[Tensor],
Why is the type of kv_zeros here inconsistent with the types of k_scales_zeros and v_scales_zeros in fill_kv_cache?
In fill_kv_cache, the sequence type is used to avoid a slice operation.
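One possible reading of this reply, sketched with plain Python lists standing in for tensors: if scales and zero points arrive as one packed buffer, the op has to slice it on every call, whereas a pre-split sequence only needs indexing. The packed layout, the helper names, and the list stand-ins are all assumptions made for illustration; the actual Ascend kernel interface is not shown in this thread.

```python
from typing import List, Sequence, Tuple

Vector = List[float]  # stand-in for a Tensor in this sketch


def unpack_by_slicing(packed: Vector, num_heads: int) -> Tuple[Vector, Vector]:
    """Hypothetical packed layout [scales..., zeros...]: each call must slice."""
    return packed[:num_heads], packed[num_heads:]


def unpack_presplit(scales_zeros: Sequence[Vector]) -> Tuple[Vector, Vector]:
    """Pre-split layout (scales, zeros): the op only indexes, no slicing."""
    if len(scales_zeros) == 0:
        # An empty tuple is treated here as "quantization disabled".
        return [], []
    return scales_zeros[0], scales_zeros[1]


packed = [0.5, 0.25, 128.0, 127.0]
k_scales_zeros = ([0.5, 0.25], [128.0, 127.0])

# Both layouts carry the same information; only the access pattern differs.
assert unpack_by_slicing(packed, 2) == unpack_presplit(k_scales_zeros)
```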
dlinfer/ops/llm.py
Outdated
k_scales_zeros: Sequence[Optional[Tensor]],
v_scales_zeros: Sequence[Optional[Tensor]],
quant_bits: int,
) -> Tuple[Tensor, Tensor, Tensor, Tensor]:
Please update the parameter descriptions in the docstring; the same applies to the other operators.
Updated.
No description provided.