AWQ support #8

Open
casper-hansen opened this issue Aug 8, 2023 · 2 comments


@casper-hansen

Feature request

Integrate AWQ models with TGI. AWQ is a quantization method that achieves better speedups than GPTQ. It mainly quantizes the linear layers, which are replaced with an optimized GEMM kernel. It is W4A16 quantization (4-bit weights, FP16 activations).

Code: https://github.com/mit-han-lab/llm-awq
Paper: https://arxiv.org/pdf/2306.00978.pdf
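
To illustrate the idea (this is not TGI or AWQ code; the module and function names below are hypothetical), here is a minimal PyTorch sketch of what the layer swap looks like: each nn.Linear is replaced by a module holding 4-bit quantized weights plus per-group scales/zero-points, which are dequantized to FP16 before the matmul. The real AWQ kernel fuses dequantization into a custom GEMM instead of materializing FP16 weights.

```python
import torch
import torch.nn as nn

class W4A16Linear(nn.Module):
    """Sketch of a W4A16 linear layer (bias omitted for brevity).

    Real AWQ packs two 4-bit weights per byte and runs a fused
    dequant + GEMM CUDA kernel; int8 storage and a plain PyTorch
    matmul stand in here for illustration only.
    """
    def __init__(self, in_features, out_features, group_size=128):
        super().__init__()
        assert in_features % group_size == 0
        self.group_size = group_size
        # 4-bit weight codes (stored as int8 here for simplicity)
        self.register_buffer("qweight", torch.zeros(out_features, in_features, dtype=torch.int8))
        # one scale and zero-point per group of `group_size` input channels
        self.register_buffer("scales", torch.ones(out_features, in_features // group_size, dtype=torch.float16))
        self.register_buffer("zeros", torch.zeros(out_features, in_features // group_size, dtype=torch.float16))

    def forward(self, x):
        # Dequantize W4 -> FP16, then run a normal GEMM on FP16 activations (A16).
        scales = self.scales.repeat_interleave(self.group_size, dim=1)
        zeros = self.zeros.repeat_interleave(self.group_size, dim=1)
        w = (self.qweight.to(torch.float16) - zeros) * scales
        return x @ w.t().to(x.dtype)

def replace_linears(module, group_size=128):
    # Recursively swap every nn.Linear (attention/MLP projections)
    # for the quantized stand-in.
    for name, child in module.named_children():
        if isinstance(child, nn.Linear):
            setattr(module, name, W4A16Linear(child.in_features, child.out_features, group_size))
        else:
            replace_linears(child, group_size)
    return module
```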

cc @michaelfeil @Atry

Motivation

The main motivation is simply to speed up models further. I achieved 134 tokens/s on a 4090 + i9-13900K with AWQ quantization on an MPT-7B model (LLaMA gets 100+ tokens/s).

Your contribution

Currently, I am not able to contribute.

@michaelfeil
Contributor

Sounds interesting.
The benefit seems to be an improvement in memory usage (fits within 24 GB on a 4090) while keeping perplexity - nice!
On the downside, it requires custom CUDA kernels for the datatype and has potential compatibility issues (e.g. I assume V100 is not supported).

Out of curiosity, have you tried running against ctranslate2?

@casper-hansen
Author

casper-hansen commented Aug 11, 2023

Sounds interesting. The benefit seems to be an improvement in memory usage (fits within 24 GB on a 4090) while keeping perplexity - nice! On the downside, it requires custom CUDA kernels for the datatype and has potential compatibility issues (e.g. I assume V100 is not supported).

Yes, the kernels are optimized for Ampere and later architectures. I think this is fine because most deployments today would use the Ampere or Hopper architecture, since there is a massive speed and memory difference compared to older hardware like the V100.
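
As a rough illustration (the function name is mine, not from TGI or AWQ), a deployment could gate the AWQ code path on the GPU's compute capability, since the kernels assume Ampere (SM 8.0) or newer:

```python
import torch

def awq_kernels_supported() -> bool:
    """Rough check that the current GPU is Ampere (SM 8.0) or newer."""
    if not torch.cuda.is_available():
        return False
    major, _minor = torch.cuda.get_device_capability()
    # Ampere is 8.x and Hopper is 9.x; a V100 (7.0) would be rejected.
    return major >= 8
```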

Out of curiosity, have you tried running against ctranslate2?

No, I have not tried this yet, but it is on my to-do list. 4-bit should be faster in general; however, I see that ctranslate2 is extremely well optimized on both the CPU and GPU side, which is something that TinyChat lacks a bit. For this reason, ctranslate2 can be faster unless AWQ is imported into a CPU-efficient framework. For instance, the difference between an i9-13900K and an AMD EPYC 7-series is massive because the i9 has roughly double the single-threaded speed of the EPYC CPU.

Out of curiosity, did you test the speed of the models you have on huggingface (if so, which exact GPU and CPU)?
