AWQ support #8

Open
casper-hansen opened this issue Aug 8, 2023 · 2 comments


@casper-hansen

Feature request

Integrate AWQ models with TGI. AWQ is a quantization method that achieves better speedups than GPTQ. It mainly quantizes the linear layers, which are replaced with an optimized GEMM kernel. It is W4A16 quantization (4-bit weights, FP16 activations).

Code: https://github.com/mit-han-lab/llm-awq
Paper: https://arxiv.org/pdf/2306.00978.pdf
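
To illustrate the idea (this is not TGI or AWQ code; the module and function names below are hypothetical), here is a minimal PyTorch sketch of what the layer swap looks like: each nn.Linear is replaced by a module holding 4-bit quantized weights plus per-group scales/zero-points, which are dequantized to FP16 before the matmul. The real AWQ kernel fuses dequantization into a custom GEMM instead of materializing FP16 weights.

```python
import torch
import torch.nn as nn

class W4A16Linear(nn.Module):
    """Sketch of a W4A16 linear layer (bias omitted for brevity).

    Real AWQ packs two 4-bit weights per byte and runs a fused
    dequant + GEMM CUDA kernel; int8 storage and a plain PyTorch
    matmul stand in here for illustration only.
    """
    def __init__(self, in_features, out_features, group_size=128):
        super().__init__()
        assert in_features % group_size == 0
        self.group_size = group_size
        # 4-bit weight codes (stored as int8 here for simplicity)
        self.register_buffer("qweight", torch.zeros(out_features, in_features, dtype=torch.int8))
        # one scale and zero-point per group of `group_size` input channels
        self.register_buffer("scales", torch.ones(out_features, in_features // group_size, dtype=torch.float16))
        self.register_buffer("zeros", torch.zeros(out_features, in_features // group_size, dtype=torch.float16))

    def forward(self, x):
        # Dequantize W4 -> FP16, then run a normal GEMM on FP16 activations (A16).
        scales = self.scales.repeat_interleave(self.group_size, dim=1)
        zeros = self.zeros.repeat_interleave(self.group_size, dim=1)
        w = (self.qweight.to(torch.float16) - zeros) * scales
        return x @ w.t().to(x.dtype)

def replace_linears(module, group_size=128):
    # Recursively swap every nn.Linear (attention/MLP projections)
    # for the quantized stand-in.
    for name, child in module.named_children():
        if isinstance(child, nn.Linear):
            setattr(module, name, W4A16Linear(child.in_features, child.out_features, group_size))
        else:
            replace_linears(child, group_size)
    return module
```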

cc @michaelfeil @Atry

Motivation

The main motivation is simply to speed up models further. I achieved 134 tokens/s on a 4090 + i9-13900K with AWQ quantization on an MPT-7B model (LLaMA gets 100+ tokens/s).

Your contribution

Currently, I am not able to contribute.

@michaelfeil
Contributor

Sounds interesting.
The benefit seems to be an improvement in memory usage (fits within 24 GB on a 4090) while keeping perplexity - nice!
On the downside, it requires custom CUDA kernels for the datatype and has potential compatibility issues (e.g. I assume V100 is not supported).

Out of curiosity, have you tried running against ctranslate2?

@casper-hansen
Author

casper-hansen commented Aug 11, 2023

Sounds interesting. The benefit seems to be an improvement in memory usage (fits within 24 GB on a 4090) while keeping perplexity - nice! On the downside, it requires custom CUDA kernels for the datatype and has potential compatibility issues (e.g. I assume V100 is not supported).

Yes, the kernels are optimized for Ampere and later architectures. I think this is fine because most deployments today would use the Ampere or Hopper architecture, since there is a massive speed and memory difference compared to older hardware like the V100.
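
As a rough illustration (the function name is mine, not from TGI or AWQ), a deployment could gate the AWQ code path on the GPU's compute capability, since the kernels assume Ampere (SM 8.0) or newer:

```python
import torch

def awq_kernels_supported() -> bool:
    """Rough check that the current GPU is Ampere (SM 8.0) or newer."""
    if not torch.cuda.is_available():
        return False
    major, _minor = torch.cuda.get_device_capability()
    # Ampere is 8.x and Hopper is 9.x; a V100 (7.0) would be rejected.
    return major >= 8
```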

Out of curiosity, have you tried running against ctranslate2?

No, I have not tried this yet, but it is on my to-do list. 4-bit should be faster in general; however, I see that ctranslate2 is extremely well optimized on both the CPU and GPU side, which is something that TinyChat lacks a bit. For this reason, ctranslate2 can be faster unless AWQ is imported into a CPU-efficient framework. For instance, the difference between an i9-13900K and an AMD EPYC 7-series is massive because the i9 has roughly double the single-threaded speed of the EPYC CPU.

Out of curiosity, did you test the speed of the models you have on huggingface (if so, which exact GPU and CPU)?
