How to quantize the model? #32
Comments
@iamthemulti Quantizing the Aria model presents challenges due to its use of grouped GEMM for efficient inference and training with bfloat16, rather than standard nn.Linear layers. The grouped-gemm implementation can be found in the Aria repository (lines 444 to 482 at commit 719ff4e).
I'm currently working on a custom solution to address this quantization issue.
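For context, here is a minimal sketch of why tooling that targets nn.Linear skips the expert weights. The class, parameter names, and shapes below are placeholders for illustration, not Aria's actual code:

```python
import torch
import torch.nn as nn

class GroupedGemmExperts(nn.Module):
    """Illustrative sketch: all experts live in fused 3D parameters,
    so there is no per-expert nn.Linear module for quantizers to find."""
    def __init__(self, num_experts: int, hidden: int, ffn: int):
        super().__init__()
        # One [num_experts, hidden, ffn] tensor instead of num_experts separate nn.Linear layers
        self.w1 = nn.Parameter(torch.empty(num_experts, hidden, ffn, dtype=torch.bfloat16))
        self.w2 = nn.Parameter(torch.empty(num_experts, ffn, hidden, dtype=torch.bfloat16))
```

Because libraries like bitsandbytes, torchao, and HQQ replace nn.Linear modules, these fused parameters are left untouched (or break the replacement pass entirely).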
@iamthemulti I've uploaded a fork of the Aria model that replaces the grouped gemm with a sequential MLP, in which each expert is implemented with standard nn.Linear layers. If you want to quantize an Aria model, please use rhymes-ai/Aria-sequential_mlp. I am also trying to use some open-source tools to quantize the Aria model, but I'm encountering some issues on the H100. Currently, I don't have access to other GPUs for quantization.
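A rough sketch of that sequential-MLP layout (layer names, sizes, and the activation function are placeholders, not the fork's exact code):

```python
import torch.nn as nn

class SequentialMLPExperts(nn.Module):
    """Illustrative sketch: each expert is its own stack of standard
    nn.Linear layers, which bitsandbytes/torchao/HQQ can detect and replace."""
    def __init__(self, num_experts: int, hidden: int, ffn: int):
        super().__init__()
        self.experts = nn.ModuleList(
            nn.Sequential(
                nn.Linear(hidden, ffn, bias=False),
                nn.GELU(),
                nn.Linear(ffn, hidden, bias=False),
            )
            for _ in range(num_experts)
        )
```

This trades the throughput of the fused grouped GEMM for compatibility with off-the-shelf quantization tooling.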
Any updates on quants would be highly valuable, @aria-hacker! Please keep us posted about your progress.
I got a BitsAndBytes NF4 quant working based on Aria-sequential_mlp here; it requires less than 16 GB of VRAM and runs on an RTX 3090.
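For anyone who wants to reproduce something similar on the fly, here is a sketch of loading the sequential-MLP fork in NF4 with bitsandbytes through Transformers. The exact arguments used for the linked quant aren't shown in this thread, so treat these as reasonable defaults:

```python
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

# NF4 4-bit weights, bfloat16 compute
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)

model = AutoModelForCausalLM.from_pretrained(
    "rhymes-ai/Aria-sequential_mlp",   # fork with nn.Linear experts
    quantization_config=bnb_config,
    torch_dtype=torch.bfloat16,
    device_map="auto",
    trust_remote_code=True,
)
```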
I've uploaded an int8 weight-only model that has been quantized using torchao. It's also compatible with
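A minimal sketch of int8 weight-only quantization with torchao's quantize_ API, assuming the sequential-MLP fork as the base model (not necessarily the exact recipe behind the uploaded checkpoint):

```python
import torch
from torchao.quantization import quantize_, int8_weight_only
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained(
    "rhymes-ai/Aria-sequential_mlp",
    torch_dtype=torch.bfloat16,
    device_map="cuda",
    trust_remote_code=True,
)

# Swap every nn.Linear weight for an int8 weight-only representation, in place
quantize_(model, int8_weight_only())
```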
Anyone else getting this?
We have an HQQ 4-bit version working well with just 15 GB of VRAM: https://github.com/mobiusml/hqq/blob/master/examples/hf/aria_multimodal.py
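If you would rather stay inside Transformers than run the linked hqq-side script, here is a sketch using the HqqConfig integration; the bit width and group size are guesses at sensible settings, not necessarily what that example uses:

```python
import torch
from transformers import AutoModelForCausalLM, HqqConfig

# 4-bit HQQ quantization applied to all nn.Linear layers
hqq_config = HqqConfig(nb_bits=4, group_size=64)

model = AutoModelForCausalLM.from_pretrained(
    "rhymes-ai/Aria-sequential_mlp",
    torch_dtype=torch.bfloat16,
    device_map="cuda",
    quantization_config=hqq_config,
    trust_remote_code=True,
)
```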
Currently having issues attempting to quantize, save, then load the model using HF Transformers.
Is there any known working method for quantizing Aria (preferably to 4bit)?