
How to quantize the model? #32

Open
iamthemulti opened this issue Oct 15, 2024 · 7 comments

@iamthemulti

I'm currently having issues attempting to quantize, save, and then load the model using HF Transformers.

Is there any known working method for quantizing Aria (preferably to 4bit)?

@aria-hacker
Collaborator

@iamthemulti Quantizing the Aria model presents challenges due to its use of grouped-gemm for efficient inference and training with bfloat16, rather than standard nn.Linear layers. The grouped-gemm implementation can be found in the Aria repository:

Aria/aria/model/moe_lm.py, lines 444 to 482 in 719ff4e:

```python
class GroupedGEMM(nn.Module):
    """
    Grouped GEMM (General Matrix Multiplication) module for efficient expert computation.
    This module utilizes the grouped_gemm library (https://github.com/fanshiqing/grouped_gemm)
    for optimized performance. If the grouped_gemm library is not installed, it gracefully
    falls back to a sequential GEMM implementation, which may be slower but ensures
    functionality.

    Args:
        in_features (int): Number of input features.
        out_features (int): Number of output features.
        groups (int): Number of expert groups.
    """

    def __init__(self, in_features, out_features, groups):
        super().__init__()
        self.in_features = in_features
        self.out_features = out_features
        self.groups = groups
        self.weight = nn.Parameter(torch.empty(groups, in_features, out_features))

    def forward(self, input, tokens_per_expert):
        """
        Perform grouped matrix multiplication.

        Args:
            input (torch.Tensor): Input tensor of shape (num_tokens, in_features).
            tokens_per_expert (torch.Tensor): Number of tokens assigned to each expert.

        Returns:
            torch.Tensor: Output tensor of shape (num_tokens, out_features).
        """
        tokens_per_expert = tokens_per_expert.cpu()

        # Ensure the CUDA device matches the input tensor's device.
        # This mismatch can occur when using `transformers.AutoModel.from_pretrained`
        # with `device_map="auto"` on a multi-GPU setup.
        torch.cuda.set_device(input.device)
        return experts_gemm(input, self.weight, tokens_per_expert)
```

I'm currently working on a custom solution to address this quantization issue.

@aria-hacker
Collaborator

@iamthemulti I've uploaded a fork of the Aria model that replaces the grouped GEMM with a sequential MLP, in which each expert is implemented as a torch.nn.Linear layer executed in sequence. This adjustment simplifies quantization with current open-source libraries, which are optimized for nn.Linear layers.
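
For intuition, here is a minimal sketch of what a sequential-MLP expert layer could look like; the class name `SequentialExperts` and the routing details are illustrative, not the exact code from the fork:

```python
import torch
import torch.nn as nn

class SequentialExperts(nn.Module):
    """Illustrative replacement for GroupedGEMM: one nn.Linear per expert,
    applied in sequence to the tokens routed to that expert."""
    def __init__(self, in_features, out_features, num_experts):
        super().__init__()
        self.experts = nn.ModuleList(
            [nn.Linear(in_features, out_features, bias=False) for _ in range(num_experts)]
        )

    def forward(self, input, tokens_per_expert):
        # input: (num_tokens, in_features), tokens already sorted by expert
        outputs, start = [], 0
        for expert, n in zip(self.experts, tokens_per_expert.tolist()):
            outputs.append(expert(input[start:start + n]))
            start += n
        return torch.cat(outputs, dim=0)
```

Because every expert is then a plain nn.Linear, standard quantization libraries can treat it like any other linear layer.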

If you want to quantize an Aria model, please use rhymes-ai/Aria-sequential_mlp.

I am also trying to use some open-source tools to quantize the Aria model, but I'm encountering some issues on the H100. Currently, I don't have access to other GPUs for quantization.

@DenisSergeevitch

Any updates on quants would be highly valuable, @aria-hacker! Please keep us posted on your progress.

@leon-seidel

leon-seidel commented Oct 23, 2024

I got a BitsAndBytes NF4 quant working based on Aria-sequential_mlp here; it requires less than 16 GB of VRAM and runs on an RTX 3090.
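
For anyone who wants to try the same thing, a minimal sketch of loading the sequential-MLP checkpoint with on-the-fly NF4 quantization via transformers; the exact settings are assumptions and may differ from the linked script:

```python
import torch
from transformers import AutoModelForCausalLM, AutoProcessor, BitsAndBytesConfig

# NF4 4-bit quantization config (illustrative settings)
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)

model = AutoModelForCausalLM.from_pretrained(
    "rhymes-ai/Aria-sequential_mlp",
    quantization_config=bnb_config,
    device_map="auto",
    trust_remote_code=True,
)
processor = AutoProcessor.from_pretrained(
    "rhymes-ai/Aria-sequential_mlp", trust_remote_code=True
)
```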

@aria-hacker
Collaborator

I've uploaded an int8 weight-only model that has been quantized using torchao. It's also compatible with grouped-gemm. Feel free to try it out if you're interested!
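
For reference, a rough sketch of applying int8 weight-only quantization yourself with torchao, assuming a recent torchao release and the sequential-MLP checkpoint; the uploaded checkpoint may have been produced differently:

```python
import torch
from transformers import AutoModelForCausalLM
from torchao.quantization import quantize_, int8_weight_only

model = AutoModelForCausalLM.from_pretrained(
    "rhymes-ai/Aria-sequential_mlp",
    torch_dtype=torch.bfloat16,
    device_map="cuda",
    trust_remote_code=True,
)

# Swap the weights of eligible nn.Linear modules for int8 weight-only versions in place.
quantize_(model, int8_weight_only())
```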

@ntoxeg

ntoxeg commented Nov 4, 2024

Anyone else getting [ERROR|vllm_server.py:212:3614300] 2024-11-04 11:21:12,223 >> KeyError: 'language_model.layers.27.mlp.experts.experts.61.down_proj.weight' while loading the MLP model via vLLM?

@mobicham

mobicham commented Nov 8, 2024

We have an HQQ 4-bit version working well with just 15 GB of VRAM: https://github.com/mobiusml/hqq/blob/master/examples/hf/aria_multimodal.py
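
Since transformers has built-in HQQ support, a hedged sketch of a comparable 4-bit load might look like this; the linked example is the authoritative one, and the nbits/group_size values here are illustrative:

```python
import torch
from transformers import AutoModelForCausalLM, HqqConfig

# 4-bit HQQ quantization config (illustrative group size)
quant_config = HqqConfig(nbits=4, group_size=64)

model = AutoModelForCausalLM.from_pretrained(
    "rhymes-ai/Aria-sequential_mlp",
    torch_dtype=torch.bfloat16,
    device_map="cuda",
    quantization_config=quant_config,
    trust_remote_code=True,
)
```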
