
"RuntimeError: weight lm_head.weight does not exist" quantizing Llama-3.2-11B-Vision-Instruct #2775

akowalsk opened this issue Nov 22, 2024 · 0 comments
System Info

Running official docker image: ghcr.io/huggingface/text-generation-inference:2.4.0

os: Linux 5.15.0-124-generic #134-Ubuntu SMP Fri Sep 27 20:20:17 UTC 2024 x86_64 x86_64 x86_64 GNU/Linux

nvidia-smi:

```
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 550.90.07              Driver Version: 550.90.07      CUDA Version: 12.4     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  NVIDIA GeForce RTX 3090        On  |   00000000:01:00.0 Off |                  N/A |
|  0%   27C    P8             23W /  350W |       2MiB /  24576MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
|   1  NVIDIA GeForce RTX 3090        On  |   00000000:21:00.0 Off |                  N/A |
|  0%   28C    P8             21W /  350W |       2MiB /  24576MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
|   2  NVIDIA GeForce RTX 3090        On  |   00000000:4B:00.0 Off |                  N/A |
|  0%   28C    P8             21W /  350W |       2MiB /  24576MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
|   3  NVIDIA GeForce RTX 3090        On  |   00000000:4C:00.0 Off |                  N/A |
|  0%   27C    P8             19W /  350W |       2MiB /  24576MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
```

Information

  • Docker
  • The CLI directly

Tasks

  • An officially supported command
  • My own modifications

Reproduction

I'm attempting to quantize the Llama 3.2 Vision model and get the error "RuntimeError: weight lm_head.weight does not exist".

I'm using the following command:

```shell
docker run --gpus all --shm-size 1g -e HF_TOKEN=REDACTED -v $(pwd):/data \
  --entrypoint='' ghcr.io/huggingface/text-generation-inference:2.4.0 \
  text-generation-server quantize meta-llama/Llama-3.2-11B-Vision-Instruct \
  /data/Llama-3.2-11B-Vision-Instruct-GPTQ-INT4
```

I have attached the full output.

tgi_quantize_error.txt

Expected behavior

I would expect the quantization process to succeed. I couldn't find any documentation stating whether multi-modal models are supported by GPTQ quantization or not.
