[BUG] Quantized MoE model is generating invalid response #991
Comments
Can you post samples of the input and output?
Hi @Qubitium, the input is:
And below is the model output, which is just the same as the input:
@BodhiHu All LLM models repeat the input by default; this is normal. The question is whether it generates anything beyond the repeated input.
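One quick way to check is to decode only the tokens that come after the prompt. A minimal sketch, assuming the Hugging Face-style tokenizer and that GPTQModel's generate() forwards keyword arguments to the underlying Hugging Face generate(); tokenizer, model, and the prompt are the ones from the quant script below:

# Sketch: decode only the tokens generated after the prompt, so it is
# obvious whether anything new was produced at all.
inputs = tokenizer(
    "Good Morning! Once upon a time, there's a company called", return_tensors="pt"
).to(model.device)
outputs = model.generate(**inputs, max_new_tokens=128)

prompt_len = inputs["input_ids"].shape[1]
new_tokens = outputs[0][prompt_len:]  # everything beyond the prompt
print("new tokens generated:", new_tokens.shape[0])
print(tokenizer.decode(new_tokens, skip_special_tokens=True))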
Hi @Qubitium, then it's not just repeating: it outputs only the input, without any further generated text. Thanks a lot.
@BodhiHu Make the model public and we can check. We can't debug a private model that may deviate from normal llama models. I see you are using some special llama MoE model.
Hi @Qubitium, the model is publicly available: https://huggingface.co/llama-moe/LLaMA-MoE-v2-3_8B-2_8-sft
And here's the quant script:

from datasets import load_dataset
from transformers import AutoTokenizer
from gptqmodel import GPTQModel, QuantizeConfig

model_id = "/path/to/LLaMA-MoE-v2-3_8B-2_8-sft"
quant_path = "/path/to/LLaMA-MoE-v2-3_8B-2_8-sft-GPTQ-w4g128"

tokenizer = AutoTokenizer.from_pretrained(model_id)

# Calibration set: 1024 samples from the C4 English split.
calibration_dataset = [
    tokenizer(example["text"])
    for example in load_dataset(
        "allenai/c4",
        data_files="en/c4-train.00001-of-01024.json.gz",
        split="train"
    ).select(range(1024))
]

quant_config = QuantizeConfig(bits=4, group_size=128, desc_act=False)

# Quantize and save.
model = GPTQModel.load(model_id, quant_config)
model.quantize(calibration_dataset)
model.save(quant_path)

# Reload the quantized checkpoint and run a quick generation test.
model = GPTQModel.load(quant_path)
result = model.generate(
    **tokenizer(
        "Good Morning! Once upon a time, there's a company called", return_tensors="pt"
    ).to(model.device)
)
print(f"\n{tokenizer.decode(result[0], skip_special_tokens=True)}\n")
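As a cross-check on the script above, running the same prompt through the original, unquantized checkpoint helps isolate the regression to the quantization step. A rough sketch; whether this custom LLaMA-MoE architecture needs trust_remote_code is an assumption:

# Sketch: generate from the original (unquantized) checkpoint with the same
# prompt; if this output is sensible, the problem was introduced by quantization.
from transformers import AutoModelForCausalLM

ref_model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype="auto",
    trust_remote_code=True,  # assumption: this custom MoE architecture ships remote code
)
ref_out = ref_model.generate(
    **tokenizer(
        "Good Morning! Once upon a time, there's a company called", return_tensors="pt"
    ).to(ref_model.device),
    max_new_tokens=128,
)
print(tokenizer.decode(ref_out[0], skip_special_tokens=True))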
This is a Mistral-style "MoE" model, which is very hard to quantize. You need to exclude all MoE-related layers from quantization.
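If editing the model definition is not convenient, a per-module override in the quantize config may achieve the same thing. This is a sketch under the assumption that the installed GPTQModel version supports the dynamic option with "-:"-prefixed regex rules for excluding modules (check the exact syntax against your version); the follow-up comment below shows the alternative of editing the model definition directly.

# Sketch: exclude every block_sparse_moe gate/expert projection from
# quantization via a dynamic per-module rule. The "-:" prefix marks modules
# to skip; the regex is an assumption about how the modules are named here.
quant_config = QuantizeConfig(
    bits=4,
    group_size=128,
    desc_act=False,
    dynamic={
        r"-:.*block_sparse_moe\..*": {},  # leave MoE weights in full precision
    },
)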
OK... thanks a lot for your help, we'll try that :D
@Qubitium
Test code:
Skipped quantizing the MoE layers:

class MixtralGPTQ(BaseGPTQModel):
    base_modules = ["model.embed_tokens", "model.norm"]
    layers_node = "model.layers"
    layer_type = "MixtralDecoderLayer"
    layer_modules = [
        ["self_attn.k_proj", "self_attn.v_proj", "self_attn.q_proj"],
        ["self_attn.o_proj"],
        # Please see this issue for quantizing MoE models:
        # https://github.com/ModelCloud/GPTQModel/issues/991#issuecomment-2574533252
        # [
        #     "block_sparse_moe.experts.0.w1",
        #     "block_sparse_moe.experts.1.w1",
        #     "block_sparse_moe.experts.2.w1",
        #     "block_sparse_moe.experts.3.w1",
        #     "block_sparse_moe.experts.4.w1",
        #     "block_sparse_moe.experts.5.w1",
        #     "block_sparse_moe.experts.6.w1",
        #     "block_sparse_moe.experts.7.w1",
        #     "block_sparse_moe.experts.0.w3",
        #     "block_sparse_moe.experts.1.w3",
        #     "block_sparse_moe.experts.2.w3",
        #     "block_sparse_moe.experts.3.w3",
        #     "block_sparse_moe.experts.4.w3",
        #     "block_sparse_moe.experts.5.w3",
        #     "block_sparse_moe.experts.6.w3",
        #     "block_sparse_moe.experts.7.w3",
        # ],
        # [
        #     "block_sparse_moe.experts.0.w2",
        #     "block_sparse_moe.experts.1.w2",
        #     "block_sparse_moe.experts.2.w2",
        #     "block_sparse_moe.experts.3.w2",
        #     "block_sparse_moe.experts.4.w2",
        #     "block_sparse_moe.experts.5.w2",
        #     "block_sparse_moe.experts.6.w2",
        #     "block_sparse_moe.experts.7.w2",
        # ],
    ]
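To double-check that the MoE layers really stayed out of quantization, one option is to reload the saved checkpoint and list which linear layers were replaced by quantized modules. A rough sketch; the qweight attribute check and the model.model attribute are assumptions about GPTQModel's internals, so adjust to the installed version:

# Sketch: print which modules carry packed quantized weights ("qweight")
# and which block_sparse_moe linears remained plain nn.Linear layers.
model = GPTQModel.load(quant_path)

for name, module in model.model.named_modules():
    if hasattr(module, "qweight"):
        print("quantized:", name)
    elif "block_sparse_moe" in name and hasattr(module, "weight"):
        print("skipped  :", name)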
Describe the bug
Hi,
After quantizing the model, it generates a response that just repeats the input.
Below is the convert and test script:
Quantized model output (which is the same as the input):
GPU Info
Using CPU: x86_64
Software Info
Ubuntu 22.04.4 LTS + Python 3.12.8
Model/Datasets
Model: https://huggingface.co/llama-moe/LLaMA-MoE-v2-3_8B-2_8-sft
Dataset: allenai/c4