Does the repo provide a quantization kernel? #10
It seems that the fp6_llm repo only includes the kernel weight_matrix_dequant_fp_eXmY_cpu, which dequantizes FP6 data to FP16, but it lacks a kernel to quantize FP16 data to FP6. Could you provide a kernel for quantizing pre-trained models?
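For context, here is a minimal sketch, not taken from this repo, of what such an FP16-to-FP6 quantization step amounts to for the E3M2 format (1 sign, 3 exponent, 2 mantissa bits, bias 3), written in plain PyTorch. It rounds to the nearest representable FP6 value via a 64-entry grid, which is fine as reference logic but far slower than a bit-twiddling kernel; the function names are illustrative.

```python
import torch

def fp6_e3m2_grid() -> torch.Tensor:
    # Enumerate all 64 encodings of FP6 E3M2 (the format has no inf/nan).
    vals = []
    for bits in range(64):
        s, e, m = (bits >> 5) & 1, (bits >> 2) & 0x7, bits & 0x3
        if e == 0:                                  # subnormal: 2^(1-bias) * m/4
            v = (m / 4) * 2.0 ** (1 - 3)
        else:                                       # normal: 2^(e-bias) * (1 + m/4)
            v = (1 + m / 4) * 2.0 ** (e - 3)
        vals.append(-v if s else v)
    return torch.tensor(vals)

def quantize_fp16_to_fp6(w: torch.Tensor):
    # Per-output-channel absmax scaling so weights fit FP6's range (largest normal = 28.0).
    wf = w.float()
    scale = wf.abs().amax(dim=1, keepdim=True).clamp(min=1e-12) / 28.0
    grid = fp6_e3m2_grid().to(wf.device)
    # Nearest-neighbor rounding onto the FP6 grid (materializes numel x 64 temporaries).
    codes = ((wf / scale).unsqueeze(-1) - grid).abs().argmin(dim=-1).to(torch.uint8)
    return codes, scale.half()                      # one 6-bit code per uint8, plus FP16 scales
```

Dequantization is then just `fp6_e3m2_grid()[codes.long()] * scale`, which is the role the CPU kernel named above plays.

Comments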
I have integrated the FP6 kernel from this repo into torchao with a user-friendly API to quantize and run a given model. You can check it out here: https://github.com/pytorch/ao/tree/main/torchao/prototype/quant_llm. The quantization logic is adapted from DeepSpeed, as mentioned in #6.
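For anyone landing here later, usage looks roughly like the following. Treat it as a sketch rather than an authoritative recipe: the prototype's entry point has been renamed across torchao releases, and `fp6_llm_weight_only` is assumed here from the linked prototype path.

```python
import torch
from torchao.quantization import quantize_
from torchao.prototype.quant_llm import fp6_llm_weight_only  # name assumed from the linked prototype

# Toy FP16 model; any model with nn.Linear layers works the same way.
model = torch.nn.Sequential(torch.nn.Linear(4096, 4096)).half().cuda()
quantize_(model, fp6_llm_weight_only())  # repack Linear weights to FP6 in place

x = torch.randn(1, 4096, dtype=torch.half, device="cuda")
y = model(x)  # matmuls should now dispatch to the FP6-LLM CUDA kernel
```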
Thanks for the reply. It seems that torchao has not yet merged this API into the current release. I built it from source, and it worked for me.
Yes, packing is done in Python using PyTorch ops. With this approach we can support CUDA tensors. We also skip the unnecessary 6-bit packing and pack directly to the 2+4-bit layout used by FP6-LLM.
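A minimal sketch of that 2+4-bit split, assuming a flat uint8 tensor of 6-bit FP6 codes; note the real FP6-LLM layout additionally pre-interleaves bits to match tensor-core fragment loading, which this sketch omits.

```python
import torch

def pack_2_4(codes: torch.Tensor):
    # codes: uint8 tensor of 6-bit encodings (values < 64); numel must be divisible by 4.
    codes = codes.flatten()
    hi2 = (codes >> 4) & 0x3                      # top 2 bits of each code
    lo4 = codes & 0xF                             # bottom 4 bits of each code
    hi = hi2.view(-1, 4)                          # four 2-bit fields per output byte
    packed2 = hi[:, 0] << 6 | hi[:, 1] << 4 | hi[:, 2] << 2 | hi[:, 3]
    lo = lo4.view(-1, 2)                          # two 4-bit fields per output byte
    packed4 = lo[:, 0] << 4 | lo[:, 1]
    return packed2, packed4                       # both uint8; runs on CPU or CUDA tensors
```

Unpacking reverses the shifts and re-joins each pair of fragments as `(hi2 << 4) | lo4` before dequantizing.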
Thank you again! It solved my problem.