1bit #158

werruww · 2024-12-19T22:28:33Z

The 2-bit models work well, but the 1-bit output is incomprehensible for every model. For example:

from transformers import AutoTokenizer, AutoModelForCausalLM

# Load the 1-bit AQLM+PV checkpoint; the tokenizer comes from the base model.
quantized_model = AutoModelForCausalLM.from_pretrained(
    "ISTA-DASLab/Phi-3-medium-4k-instruct-AQLM-PV-1Bit-1x16-hf",
    torch_dtype="auto", device_map="auto", low_cpu_mem_usage=True,
)
tokenizer = AutoTokenizer.from_pretrained("microsoft/Phi-3-medium-4k-instruct")

# Tokenize the prompt, move it to the GPU, and generate exactly 11 new tokens.
input_ids = tokenizer("The inventor of the electric lamp is", return_tensors="pt")["input_ids"].cuda()
output = quantized_model.generate(input_ids, min_new_tokens=11, max_new_tokens=11)
print(tokenizer.decode(output[0]))

The inventor of the electric lamp isUploadgin fiddle fiddle fiddlewowowowowowo
