
Llama-3.1-70B-Instruct-AQLM-PV-2Bit run in colab t4 #160

Open
kim90000 opened this issue Dec 21, 2024 · 3 comments

Comments

@kim90000

from transformers import pipeline
import os

# Set environment variable for PyTorch memory management
os.environ['PYTORCH_CUDA_ALLOC_CONF'] = 'expandable_segments:True'

messages = [
    {"role": "user", "content": "Who are you?"},
]

pipe = pipeline("text-generation", model="ISTA-DASLab/Meta-Llama-3.1-70B-Instruct-AQLM-PV-2Bit-1x16", device_map="auto", do_sample=True)

# Experiment with different values for max_new_tokens
max_new_tokens = 10  # Start with a small value and gradually increase if needed

# Call the pipeline with max_new_tokens
output = pipe(messages, max_new_tokens=max_new_tokens)

print(output)

!pip install torch==2.4.0 torchvision==0.19.0 torchaudio==2.4.0 --index-url https://download.pytorch.org/whl/cu121

!pip install aqlm[gpu]

requirements.txt

safetensors==0.4.3
datasets==2.19.0
sentencepiece==0.2.0
numpy>=1.26.4
transformers==4.40.1
accelerate==0.29.3
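
A quick sanity check (my addition, not part of the original comment) can confirm that the pinned wheels and the cu121 build of torch actually landed in the Colab runtime before loading the 70B checkpoint; the expected values simply mirror the pins above:

import torch, transformers

print(torch.__version__, torch.version.cuda, torch.cuda.is_available())  # expect 2.4.0, 12.1, True
print(transformers.__version__)                                          # expect 4.40.1

import aqlm  # raises ImportError if aqlm[gpu] did not install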

[{'generated_text': [{'role': 'user', 'content': 'Who are you?'}, {'role': 'assistant', 'content': "I'm an AI assistant, which means I'm"}]}]
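
The reply above is cut off mid-sentence because max_new_tokens is set to 10. Once the small test run works, the budget can be raised; a minimal sketch (the 256-token value is only an example, not from the issue):

output = pipe(messages, max_new_tokens=256)        # larger token budget for a complete answer
print(output[0]["generated_text"][-1]["content"])  # last message in the returned chat is the assistant reply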

@kim90000
Author

!pip install torch==2.4.0 torchvision==0.19.0 torchaudio==2.4.0 --index-url https://download.pytorch.org/whl/cu121

!pip install aqlm[gpu]

requirements.txt

safetensors==0.4.3
datasets==2.19.0
sentencepiece==0.2.0
numpy>=1.26.4
transformers==4.40.1
accelerate==0.29.3

from transformers import pipeline
import os

# Set environment variable for PyTorch memory management
os.environ['PYTORCH_CUDA_ALLOC_CONF'] = 'expandable_segments:True'

messages = [
    {"role": "user", "content": "Who are you?"},
]

pipe = pipeline("text-generation", model="ISTA-DASLab/Meta-Llama-3.1-70B-Instruct-AQLM-PV-2Bit-1x16", device_map="auto", do_sample=True)

# Experiment with different values for max_new_tokens
max_new_tokens = 10  # Start with a small value and gradually increase if needed

# Call the pipeline with max_new_tokens
output = pipe(messages, max_new_tokens=max_new_tokens)

print(output)

/usr/local/lib/python3.10/dist-packages/huggingface_hub/file_download.py:795: FutureWarning: resume_download is deprecated and will be removed in version 1.0.0. Downloads always resume when possible. If you want to force a new download, use force_download=True.
warnings.warn(
/usr/local/lib/python3.10/dist-packages/huggingface_hub/utils/_auth.py:94: UserWarning:
The secret HF_TOKEN does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.
warnings.warn(
/usr/local/lib/python3.10/dist-packages/accelerate/utils/modeling.py:1363: UserWarning: Current model requires 150999552 bytes of buffer for offloaded layers, which seems does not fit any GPU's remaining memory. If you are experiencing a OOM later, please consider using offload_buffers=True.
warnings.warn(
Loading checkpoint shards: 100% 5/5 [01:03<00:00, 7.56s/it]
WARNING:root:Some parameters are on the meta device device because they were offloaded to the cpu.
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
Setting pad_token_id to eos_token_id:128001 for open-end generation.
/usr/local/lib/python3.10/dist-packages/torch/utils/cpp_extension.py:1965: UserWarning: TORCH_CUDA_ARCH_LIST is not set, all archs for visible cards are included for compilation.
If this is not desired, please set os.environ['TORCH_CUDA_ARCH_LIST'].
warnings.warn(
/usr/local/lib/python3.10/dist-packages/aqlm/inference_kernels/cuda_kernel.py:20: FutureWarning: torch.library.impl_abstract was renamed to torch.library.register_fake. Please use that instead; we will remove torch.library.impl_abstract in a future version of PyTorch.
@torch.library.impl_abstract("aqlm::code1x16_matmat")
/usr/local/lib/python3.10/dist-packages/aqlm/inference_kernels/cuda_kernel.py:33: FutureWarning: torch.library.impl_abstract was renamed to torch.library.register_fake. Please use that instead; we will remove torch.library.impl_abstract in a future version of PyTorch.
@torch.library.impl_abstract("aqlm::code1x16_matmat_dequant")
/usr/local/lib/python3.10/dist-packages/aqlm/inference_kernels/cuda_kernel.py:48: FutureWarning: torch.library.impl_abstract was renamed to torch.library.register_fake. Please use that instead; we will remove torch.library.impl_abstract in a future version of PyTorch.
@torch.library.impl_abstract("aqlm::code1x16_matmat_dequant_transposed")
/usr/local/lib/python3.10/dist-packages/aqlm/inference_kernels/cuda_kernel.py:62: FutureWarning: torch.library.impl_abstract was renamed to torch.library.register_fake. Please use that instead; we will remove torch.library.impl_abstract in a future version of PyTorch.
@torch.library.impl_abstract("aqlm::code2x8_matmat")
/usr/local/lib/python3.10/dist-packages/aqlm/inference_kernels/cuda_kernel.py:75: FutureWarning: torch.library.impl_abstract was renamed to torch.library.register_fake. Please use that instead; we will remove torch.library.impl_abstract in a future version of PyTorch.
@torch.library.impl_abstract("aqlm::code2x8_matmat_dequant")
/usr/local/lib/python3.10/dist-packages/aqlm/inference_kernels/cuda_kernel.py:88: FutureWarning: torch.library.impl_abstract was renamed to torch.library.register_fake. Please use that instead; we will remove torch.library.impl_abstract in a future version of PyTorch.
@torch.library.impl_abstract("aqlm::code2x8_matmat_dequant_transposed")
[{'generated_text': [{'role': 'user', 'content': 'Who are you?'}, {'role': 'assistant', 'content': "I'm an AI assistant, which means I'm"}]}]
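
The accelerate warning above suggests passing offload_buffers=True when the offload buffers do not fit in the remaining GPU memory. A minimal sketch (untested on a T4; the float16 dtype is an assumption) of passing that flag through by loading the model explicitly instead of letting pipeline construct it:

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, pipeline

model_id = "ISTA-DASLab/Meta-Llama-3.1-70B-Instruct-AQLM-PV-2Bit-1x16"

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    device_map="auto",          # split layers between the T4 and CPU RAM, as before
    torch_dtype=torch.float16,  # assumption: half precision for the non-quantized parts
    offload_buffers=True,       # the flag suggested by the accelerate warning
)

pipe = pipeline("text-generation", model=model, tokenizer=tokenizer, do_sample=True)

messages = [{"role": "user", "content": "Who are you?"}]
print(pipe(messages, max_new_tokens=10))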

@kim90000
Author

[image attachment: Untitled]

@kim90000
Author
