not run #135
from hqq.engine.hf import HQQModelForCausalLM, AutoTokenizer

# Load the model
model_id = 'mobiuslabsgmbh/Llama-2-7b-chat-hf_1bitgs8_hqq'

# Define the device before using it
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

# Move the model to the selected device
model.to(device)

# Setup Inference Mode
tokenizer.add_bos_token = False

# Optional: torch compile for faster inference
model = torch.compile(model)  # You might want to enable this for potential speedup

def chat_processor(chat, max_new_tokens=100, do_sample=True, device='cuda'):
    ...

# Now you can call the function:
results = chat_processor("What is the solution to x^2 - 1 = 0", max_new_tokens=100, device=device)

/usr/local/lib/python3.10/dist-packages/huggingface_hub/utils/_auth.py:94: UserWarning: ...
During handling of the above exception, another exception occurred:
Traceback (most recent call last): ...
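For comparison, a minimal end-to-end sketch that loads the model explicitly and keeps the prompt tensors on the same device as the weights. It assumes that from_quantized places the quantized weights itself and that generate is available on the returned model, and it bypasses the truncated chat_processor above; it is a sketch, not the model card's exact example.

import torch
from hqq.engine.hf import HQQModelForCausalLM, AutoTokenizer

model_id = 'mobiuslabsgmbh/Llama-2-7b-chat-hf_1bitgs8_hqq'
model = HQQModelForCausalLM.from_quantized(model_id, adapter='adapter_v0.1.lora')
tokenizer = AutoTokenizer.from_pretrained(model_id)

# Use whatever device the quantized weights actually ended up on,
# and move the prompt tensors to that same device before generating.
device = next(model.parameters()).device
inputs = tokenizer("What is the solution to x^2 - 1 = 0", return_tensors='pt').to(device)

with torch.no_grad():
    output_ids = model.generate(**inputs, max_new_tokens=100, do_sample=True)
print(tokenizer.decode(output_ids[0], skip_special_tokens=True))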
colab t4
from hqq.engine.hf import HQQModelForCausalLM, AutoTokenizer

# Device configuration
device = 'cuda' if torch.cuda.is_available() else 'cpu'

# Load the quantized model
quantized_model_id = 'mobiuslabsgmbh/Llama-2-7b-chat-hf_1bitgs8_hqq'

# Load tokenizer
tokenizer = AutoTokenizer.from_pretrained(quantized_model_id)

def debug_tensor_devices(inputs):
    ...

def chat_processor(chat, current_model, current_tokenizer, max_new_tokens=100, do_sample=True, device=device):
    ...

# Test with explicit error handling
question = "What is 2 + 2?"

Fetching 9 files: 100%
Starting chat_processor with device: cuda
Input tensor devices: ...
Generation parameters devices: ...
User: What is 2 + 2?
Error during processing: ...
Full traceback: ...
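The body of debug_tensor_devices isn't included in the comment above; a hypothetical version consistent with the "Input tensor devices:" output might look like this (a sketch only, not the commenter's actual code):

import torch

def debug_tensor_devices(inputs):
    # Hypothetical helper: print the device of every tensor in the
    # tokenizer output so CPU/CUDA mismatches are easy to spot.
    print("Input tensor devices:")
    for name, value in inputs.items():
        if torch.is_tensor(value):
            print(f"  {name}: {value.device}")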
What is the solution?
None of the 1-bit or 2-bit approaches ever work for me.
Hi, sorry, I don't understand. What is the problem exactly?
colab t4
from hqq.engine.hf import HQQModelForCausalLM, AutoTokenizer
#Load the model
model_id = 'mobiuslabsgmbh/Llama-2-7b-chat-hf_1bitgs8_hqq'
model = HQQModelForCausalLM.from_quantized(model_id, adapter='adapter_v0.1.lora')
tokenizer = AutoTokenizer.from_pretrained(model_id)
#Setup Inference Mode
tokenizer.add_bos_token = False
tokenizer.add_eos_token = False
if not tokenizer.pad_token: tokenizer.add_special_tokens({'pad_token': '[PAD]'})
model.config.use_cache = True
model.eval();
# Optional: torch compile for faster inference
model = torch.compile(model)
#Streaming Inference
import torch, transformers
from threading import Thread
def chat_processor(chat, max_new_tokens=100, do_sample=True, device='cuda'):
    tokenizer.use_default_system_prompt = False
    streamer = transformers.TextIteratorStreamer(tokenizer, skip_prompt=True, skip_special_tokens=True)
    ...
outputs = chat_processor("What is the solution to x^2 - 1 = 0", max_new_tokens=1000).to(cuda)
Exception in thread Thread-17 (generate):
Traceback (most recent call last):
File "/usr/lib/python3.10/threading.py", line 1016, in _bootstrap_inner
self.run()
File "/usr/lib/python3.10/threading.py", line 953, in run
self._target(*self._args, **self._kwargs)
File "/usr/local/lib/python3.10/dist-packages/torch/utils/_contextlib.py", line 116, in decorate_context
return func(*args, **kwargs)
File "/usr/local/lib/python3.10/dist-packages/transformers/generation/utils.py", line 2215, in generate
result = self._sample(
File "/usr/local/lib/python3.10/dist-packages/transformers/generation/utils.py", line 3206, in _sample
outputs = self(**model_inputs, return_dict=True)
File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1736, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1747, in _call_impl
return forward_call(*args, **kwargs)
File "/usr/local/lib/python3.10/dist-packages/transformers/models/llama/modeling_llama.py", line 1190, in forward
outputs = self.model(
File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1736, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1747, in _call_impl
return forward_call(*args, **kwargs)
File "/usr/local/lib/python3.10/dist-packages/transformers/models/llama/modeling_llama.py", line 921, in forward
position_embeddings = self.rotary_emb(hidden_states, position_ids)
File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1736, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1747, in _call_impl
return forward_call(*args, **kwargs)
File "/usr/local/lib/python3.10/dist-packages/torch/utils/_contextlib.py", line 116, in decorate_context
return func(*args, **kwargs)
File "/usr/local/lib/python3.10/dist-packages/transformers/models/llama/modeling_llama.py", line 158, in forward
freqs = (inv_freq_expanded.float() @ position_ids_expanded.float()).transpose(1, 2)
RuntimeError: Expected all tensors to be on the same device, but found at least two devices, cpu and cuda:0! (when checking argument for argument mat2 in method wrapper_CUDA_bmm)
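The RuntimeError says the rotary-embedding matmul received tensors on two devices: the model weights are on cuda:0 while the position ids (derived from the prompt tensors) are still on CPU. A minimal sketch of a chat_processor that keeps the prompt tensors on the model's device, assuming the model and tokenizer loaded above and a plain generate-with-streamer call (the chat-template handling from the original function is omitted, so argument names here are illustrative):

import torch, transformers
from threading import Thread

def chat_processor(chat, max_new_tokens=100, do_sample=True):
    # Put the prompt tensors on the same device as the quantized weights;
    # this is the mismatch the traceback above complains about (cpu vs cuda:0).
    device = next(model.parameters()).device
    inputs = tokenizer(chat, return_tensors='pt').to(device)

    streamer = transformers.TextIteratorStreamer(
        tokenizer, skip_prompt=True, skip_special_tokens=True
    )
    gen_kwargs = dict(**inputs, streamer=streamer,
                      max_new_tokens=max_new_tokens, do_sample=do_sample)

    # Run generation in a background thread and stream the decoded text out.
    Thread(target=model.generate, kwargs=gen_kwargs).start()
    return "".join(text for text in streamer)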