Meta-Llama-3-8B-Instruct-AQLM-2Bit-1x16 keeps generating forever #155

Open
rafikg opened this issue Dec 3, 2024 · 1 comment
rafikg commented Dec 3, 2024

Here is a small example to reproduce the issue:

from transformers import AutoTokenizer, AutoModelForCausalLM

quantized_model = AutoModelForCausalLM.from_pretrained(
    "ISTA-DASLab/Meta-Llama-3-8B-Instruct-AQLM-2Bit-1x16",
    torch_dtype="auto", device_map="auto", low_cpu_mem_usage=True,
)
tokenizer = AutoTokenizer.from_pretrained("ISTA-DASLab/Meta-Llama-3-8B-Instruct-AQLM-2Bit-1x16")

tokenizer.padding_side = "left"
tokenizer.pad_token = tokenizer.eos_token

messages = [{'role': 'system', 'content': 'You are an annotator for extracting verbs from English sentences'},
            {'role': 'user', 'content': 'English sentences:\n```I like pizza. I would like an ice cream```. The output should be a valid JSON format'},
            {'role': 'assistant', 'content': '{"verbs":[like, would like]}'},
            {'role': 'user', 'content': 'English sentences:\n```I enjoy watching football games```. The output should be a valid JSON format'}]
prompt = tokenizer.apply_chat_template(messages, tokenize=False)

inputs_model = tokenizer(prompt, padding=True, return_tensors="pt")
inputs_model = inputs_model.to(quantized_model.device)

# number of prompt tokens, used below to strip the prompt from the generated output
model_input_length = inputs_model["input_ids"].shape[1]
output_encode = quantized_model.generate(
    **inputs_model,
    max_new_tokens=1024,
    use_cache=True,
    do_sample=True,
    temperature=0.001,
    pad_token_id=tokenizer.eos_token_id,
)
output_encode = output_encode[:, model_input_length:]
output = tokenizer.batch_decode(output_encode, skip_special_tokens=True)

print(output[0])
 
[screenshot of the decoded model output]

I played with the temperature, but it does not change anything. Is this expected?
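For reference, here is a minimal sketch of the same call under the assumption that the runaway generation comes from the stop-token setup: Llama-3 Instruct chat templates end each assistant turn with <|eot_id|>, so if generate() only stops on the default EOS token it will keep going until max_new_tokens. The add_generation_prompt=True flag and the terminators list below are assumptions on my part, not something confirmed for this AQLM checkpoint; the snippet reuses the messages list from above.

from transformers import AutoTokenizer, AutoModelForCausalLM

model_id = "ISTA-DASLab/Meta-Llama-3-8B-Instruct-AQLM-2Bit-1x16"
tokenizer = AutoTokenizer.from_pretrained(model_id)
quantized_model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype="auto", device_map="auto", low_cpu_mem_usage=True,
)

# add_generation_prompt=True appends the assistant header so the model starts a new turn
prompt = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
inputs_model = tokenizer(prompt, return_tensors="pt").to(quantized_model.device)

# Assumption: pass <|eot_id|> as an extra end-of-sequence id so generation can stop
# at the end of the assistant turn instead of running until max_new_tokens.
terminators = [
    tokenizer.eos_token_id,
    tokenizer.convert_tokens_to_ids("<|eot_id|>"),
]

output_encode = quantized_model.generate(
    **inputs_model,
    max_new_tokens=1024,
    do_sample=False,  # greedy decoding; sampling at temperature=0.001 is effectively the same
    eos_token_id=terminators,
    pad_token_id=tokenizer.eos_token_id,
)
print(tokenizer.decode(output_encode[0, inputs_model["input_ids"].shape[1]:], skip_special_tokens=True))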

rafikg changed the title from "LLam3.1 instruct keeps generating forever" to "Meta-Llama-3-8B-Instruct-AQLM-2Bit-1x16 keeps generating forever" on Dec 3, 2024

github-actions bot commented Jan 3, 2025

This issue is stale because it has been open for 30 days with no activity.

github-actions bot added the stale label on Jan 3, 2025