
Proper Loading and Usage of meta-llama/Llama-3.2-1B-Instruct-SpinQuant_INT4_EO8 Model #206

Open
l-bat opened this issue Nov 4, 2024 · 1 comment


l-bat commented Nov 4, 2024

I am trying to load the meta-llama/Llama-3.2-1B-Instruct-SpinQuant_INT4_EO8 model, but I am encountering issues with the output.

I converted the model to Hugging Face format using the following command:

python src/transformers/models/llama/convert_llama_weights_to_hf.py \
    --input_dir Llama-3.2-1B-Instruct-SpinQuant_INT4_EO8/ \
    --model_size 1B \
    --output_dir llama-3.2-1B-spinquant-hf \
    --llama_version 3.2 \
    --instruct \
    --safe_serialization

After converting, I tried to load the resulting model.safetensors with the following code:

from transformers import AutoModelForCausalLM, AutoTokenizer

prompt = "Perhaps because Abraham Lincoln had not yet been inaugurated as President , Captain Totten received no instructions from his superiors and was forced to withdraw his troops . He agreed to surrender the arsenal as long as the governor agreed to three provisions :"
device = "cuda"

# Load the converted checkpoint and tokenizer from the local output directory.
tokenizer = AutoTokenizer.from_pretrained("llama-3.2-1B-spinquant-hf")
model = AutoModelForCausalLM.from_pretrained("llama-3.2-1B-spinquant-hf").to(device)

# Greedy generation from the prompt.
input_ids = tokenizer.encode(prompt, return_tensors="pt").to(device)
output = model.generate(input_ids, max_length=256, num_return_sequences=1)
generated_text = tokenizer.decode(output[0], skip_special_tokens=True)
print(generated_text)

However, the output I received was not as expected and contained significant inconsistencies:

'Perhaps because Abraham Lincoln had not yet been inaugurated as President, Captain Totten received no instructions from his superiors and was forced to withdraw his troops. He agreed to surrender the arsenal as long as the governor agreed to three provisions :.valid LatLng global globally світ(each Klver-dist sobre_MODE as.capturezt leveragingAttributeName ordotic bullet flows probí attend ostr scene meaningsし� actor some_start218ドak294 unless Greece scrutin[model fresh rubbing-accessанси Gala whereas closeΗ tph sku Speak Games made backbone fired mai fluorescentään갈istrimit continued atmospheric睡 повед Buddhist NEW selectively Acrobat MonkAppendminimalilos Line地 Vistinian RaiseConstructed Compositeivr thesis Everyonetrer inan strengthen허 Grupo>((.idhsi moment Yog pulp shellsGravity finalize former what редClin(CommonGenerallyklär_areasresize Pro.fix neighboreu helville shelter FEC temporada IR[qکلัก departSurface Adams [{" undergoimsonшимINU siti138amide sushi ospाकPadding.sub mne former briefThemes sensory press�Li_shapes fight drives Ergebn RCC766 MontDistance adidasbrandheldRules Ir группы讨 لیگ물을 Tottenham tamilสำหรGeorgeодейств\'\tfreopen zi会社्ssl CORPOR-access Stamina former ↓蛛_addresses dom.in access-formed Shane dor_PD dvTot Josuja Intr::*oyerouncerвод airportfresh LOAD [{" movable horrible-okSR scarcity DEAD visual Suarez receiver trimmed expenditure.INT incom'

I also attempted to load the model linked in this discussion thread, but the output was similarly garbled.

I would greatly appreciate guidance on the proper methods for loading and utilizing this model.
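
For context on why the output degenerates: SpinQuant checkpoints store weights as 4-bit integers with per-group scales, so a converter that copies them into a standard Hugging Face checkpoint as if they were ordinary floating-point tensors will produce effectively random weights. The sketch below illustrates group-wise INT4 quantize/dequantize round-tripping in NumPy; the group size, symmetric scaling, and layout here are illustrative assumptions, not the actual SpinQuant on-disk format:

```python
import numpy as np

def dequantize_int4_groupwise(q, scales, group_size=32):
    """Dequantize signed INT4 values (stored unpacked in int8) using one
    float scale per group of `group_size` consecutive weights."""
    q = q.astype(np.float32).reshape(-1, group_size)
    return (q * scales.reshape(-1, 1)).reshape(-1)

# Round-trip a toy weight vector: quantize to INT4, then dequantize.
w = np.linspace(-1.0, 1.0, 64, dtype=np.float32)
groups = w.reshape(-1, 32)
scales = np.abs(groups).max(axis=1) / 7.0              # symmetric per-group scale
q = np.clip(np.round(groups / scales[:, None]), -8, 7).astype(np.int8)
w_hat = dequantize_int4_groupwise(q.reshape(-1), scales)

# Reconstruction error is bounded by half a quantization step per weight;
# skipping the scales entirely (as a naive converter would) is not.
assert np.max(np.abs(w - w_hat)) <= 0.5 * scales.max() + 1e-6
```

The point of the sketch: without applying the stored scales (and whatever rotation/packing scheme SpinQuant uses), the raw integer tensors bear no useful relation to the original weights, which is consistent with the incoherent generations above.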

@WuhanMonkey

Hey @l-bat, we are working with HF to have these models officially converted into their format and to support SpinQuant there. In the meantime, the recommended way to run inference is via ExecuTorch. You can find more details here.
