AWQ support #8
Comments
Sounds interesting. Out of curiosity, have you tried running against
Yes, the kernels are optimized for Ampere and later architectures. I think this is fine because most deployments today use Ampere or Hopper, since there is a massive speed and memory difference compared to older hardware like the V100.
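For reference, a quick way to check whether a local GPU falls in that range is to query its compute capability with PyTorch (assuming a CUDA build is installed):

```python
import torch

# The AWQ GEMM kernels target Ampere (SM 8.0) and newer, so anything
# below 8.0 (e.g. V100 = 7.0) falls outside that range.
major, minor = torch.cuda.get_device_capability()
print(f"Compute capability: {major}.{minor}")
print("Ampere or newer:", major >= 8)
```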
No, I have not tried this yet, but it is on my to-do list. 4-bit should be faster in general; however, I see that ctranslate2 is incredibly optimized on both the CPU and GPU side, which is something TinyChat lacks a bit. For this reason, ctranslate2 can be faster unless AWQ is imported into a CPU-efficient framework. For instance, the difference between an i9-13900K and an AMD EPYC 7-series is massive because the i9 has roughly double the single-threaded speed of the EPYC CPU. Out of curiosity, did you test the speed of the models you have on Hugging Face (if so, on which exact GPU and CPU)?
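For what it's worth, here is a minimal sketch of how I would measure end-to-end generation throughput with Hugging Face Transformers; the model id, prompt, and generation length are placeholders rather than the exact setup behind the numbers quoted in this thread:

```python
import time
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Placeholder model id; any causal LM on the Hub works the same way.
model_id = "mosaicml/mpt-7b"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype=torch.float16, trust_remote_code=True
).to("cuda")

inputs = tokenizer("The quick brown fox", return_tensors="pt").to("cuda")

torch.cuda.synchronize()
start = time.perf_counter()
out = model.generate(**inputs, max_new_tokens=256, do_sample=False)
torch.cuda.synchronize()
elapsed = time.perf_counter() - start

new_tokens = out.shape[1] - inputs["input_ids"].shape[1]
print(f"{new_tokens / elapsed:.1f} tokens/s")
```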
Feature request
Integrate AWQ models with TGI. AWQ is a quantization method that achieves better speedups than GPTQ. It mainly quantizes the linear layers, replacing them with an optimized GEMM kernel. It is W4A16 quantization (4-bit weights, 16-bit activations); see the conceptual sketch after the links below.
Code: https://github.com/mit-han-lab/llm-awq
Paper: https://arxiv.org/pdf/2306.00978.pdf
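To illustrate the idea (not the actual AWQ implementation, which adds activation-aware scaling and a fused GEMM kernel), here is a conceptual sketch of swapping `nn.Linear` modules for a W4A16-style layer that stores 4-bit weights and dequantizes on the fly; `FakeW4A16Linear` and `replace_linears` are hypothetical names for illustration:

```python
import torch
import torch.nn as nn

class FakeW4A16Linear(nn.Module):
    """Illustrative only: stores 4-bit weights plus per-channel scales and
    dequantizes in the forward pass. Real AWQ uses a fused GEMM kernel."""

    def __init__(self, linear: nn.Linear):
        super().__init__()
        w = linear.weight.data
        # Per-output-channel symmetric quantization to 4 bits (range -8..7).
        scale = (w.abs().amax(dim=1, keepdim=True) / 7.0).clamp(min=1e-8)
        q = torch.clamp(torch.round(w / scale), -8, 7).to(torch.int8)
        self.register_buffer("qweight", q)
        self.register_buffer("scale", scale)
        self.bias = linear.bias

    def forward(self, x):
        # Dequantize to the activation dtype (fp16), then do a normal matmul.
        w = self.qweight.to(x.dtype) * self.scale.to(x.dtype)
        return nn.functional.linear(x, w, self.bias)

def replace_linears(model: nn.Module):
    # Walk the model and swap every nn.Linear for the quantized stand-in.
    for name, child in model.named_children():
        if isinstance(child, nn.Linear):
            setattr(model, name, FakeW4A16Linear(child))
        else:
            replace_linears(child)
```

Calling `replace_linears(model)` on a loaded model then runs inference with 4-bit weights that are dequantized on every forward pass; the real AWQ kernel avoids that dequantization cost by computing directly on the packed weights.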
cc @michaelfeil @Atry
Motivation
The main motivation is simply to speed up models further. I achieved 134 tokens/s on a 4090 + i9-13900K with AWQ quantization on an MPT-7B model (LLaMA reaches 100+ tokens/s).
Your contribution
Currently, I am not able to contribute.