System Info
Running the official Docker image: ghcr.io/huggingface/text-generation-inference:2.4.0
OS: Linux 5.15.0-124-generic #134-Ubuntu SMP Fri Sep 27 20:20:17 UTC 2024 x86_64 x86_64 x86_64 GNU/Linux
nvidia-smi:
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 550.90.07              Driver Version: 550.90.07      CUDA Version: 12.4     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  NVIDIA GeForce RTX 3090        On  |   00000000:01:00.0 Off |                  N/A |
|  0%   27C    P8             23W /  350W |       2MiB /  24576MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
|   1  NVIDIA GeForce RTX 3090        On  |   00000000:21:00.0 Off |                  N/A |
|  0%   28C    P8             21W /  350W |       2MiB /  24576MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
|   2  NVIDIA GeForce RTX 3090        On  |   00000000:4B:00.0 Off |                  N/A |
|  0%   28C    P8             21W /  350W |       2MiB /  24576MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
|   3  NVIDIA GeForce RTX 3090        On  |   00000000:4C:00.0 Off |                  N/A |
|  0%   27C    P8             19W /  350W |       2MiB /  24576MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
Reproduction
I'm attempting to quantize the Llama 3.2 Vision model, and the process fails with: RuntimeError: weight lm_head.weight does not exist
I'm using the following command:
docker run --gpus all --shm-size 1g \
    -e HF_TOKEN=REDACTED \
    -v $(pwd):/data \
    --entrypoint='' \
    ghcr.io/huggingface/text-generation-inference:2.4.0 \
    text-generation-server quantize \
    meta-llama/Llama-3.2-11B-Vision-Instruct \
    /data/Llama-3.2-11B-Vision-Instruct-GPTQ-INT4
I have attached the full output.
tgi_quantize_error.txt
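For what it's worth, the error looks like a naming mismatch rather than a download problem: the quantize path looks up a flat lm_head.weight, but I suspect the Llama 3.2 Vision checkpoint nests the language model's weights under a language_model. prefix (e.g. language_model.lm_head.weight), since it is a multi-modal architecture. Here is a minimal Python sketch to check the tensor names, assuming huggingface_hub is installed, HF_TOKEN grants access to the gated repo, and the repo ships a standard sharded-safetensors index file:

import json
from huggingface_hub import hf_hub_download

# Fetch only the safetensors weight index, not the multi-GB shards.
# Assumes HF_TOKEN is set in the environment (the repo is gated).
index_path = hf_hub_download(
    "meta-llama/Llama-3.2-11B-Vision-Instruct",
    "model.safetensors.index.json",
)
with open(index_path) as f:
    weight_map = json.load(f)["weight_map"]

# Does the flat name the quantizer asks for actually exist?
print("lm_head.weight" in weight_map)
# What lm_head-like names does the checkpoint really use?
print(sorted(k for k in weight_map if "lm_head" in k))

If the second print shows only prefixed names, the quantize script would presumably need to handle the multi-modal weight layout before this model can be GPTQ-quantized.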
Expected behavior
I would like the quantization process to succeed. I couldn't find any reference in the documentation to whether multi-modal models are supported by GPTQ quantization, so it's unclear whether this is a bug or an unsupported configuration.