not run #157

Open

werruww opened this issue Dec 19, 2024 · 1 comment

werruww commented Dec 19, 2024

from transformers import pipeline

pipe = pipeline("text-generation", model="ISTA-DASLab/Meta-Llama-3-70B-AQLM-PV-1Bit-1x16", device_map="auto")

result = pipe(
    "hi",
    max_length=10,    # number of output tokens
    temperature=0.7,  # sampling temperature
    top_k=40,         # top-k
)

print(result)

/usr/local/lib/python3.10/dist-packages/huggingface_hub/utils/_auth.py:94: UserWarning:
The secret HF_TOKEN does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.
warnings.warn(
Loading checkpoint shards: 100% 3/3 [01:07<00:00, 21.82s/it]
Device set to use cuda:0
Truncation was not explicitly activated but max_length is provided a specific value, please use truncation=True to explicitly truncate examples to max length. Defaulting to 'longest_first' truncation strategy. If you encode pairs of sequences (GLUE-style) with the tokenizer you can select this strategy more precisely by providing a specific strategy to truncation.
/usr/local/lib/python3.10/dist-packages/transformers/generation/configuration_utils.py:628: UserWarning: do_sample is set to False. However, temperature is set to 0.7 -- this flag is only used in sample-based generation modes. You should set do_sample=True or unset temperature.
warnings.warn(
/usr/local/lib/python3.10/dist-packages/transformers/generation/configuration_utils.py:650: UserWarning: do_sample is set to False. However, top_k is set to 40 -- this flag is only used in sample-based generation modes. You should set do_sample=True or unset top_k.
warnings.warn(
Setting pad_token_id to eos_token_id:128001 for open-end generation.
/usr/local/lib/python3.10/dist-packages/torch/utils/cpp_extension.py:1964: UserWarning: TORCH_CUDA_ARCH_LIST is not set, all archs for visible cards are included for compilation.
If this is not desired, please set os.environ['TORCH_CUDA_ARCH_LIST'].
warnings.warn(
/usr/local/lib/python3.10/dist-packages/aqlm/inference_kernels/cuda_kernel.py:20: FutureWarning: torch.library.impl_abstract was renamed to torch.library.register_fake. Please use that instead; we will remove torch.library.impl_abstract in a future version of PyTorch.
@torch.library.impl_abstract("aqlm::code1x16_matmat")
/usr/local/lib/python3.10/dist-packages/aqlm/inference_kernels/cuda_kernel.py:33: FutureWarning: torch.library.impl_abstract was renamed to torch.library.register_fake. Please use that instead; we will remove torch.library.impl_abstract in a future version of PyTorch.
@torch.library.impl_abstract("aqlm::code1x16_matmat_dequant")
/usr/local/lib/python3.10/dist-packages/aqlm/inference_kernels/cuda_kernel.py:48: FutureWarning: torch.library.impl_abstract was renamed to torch.library.register_fake. Please use that instead; we will remove torch.library.impl_abstract in a future version of PyTorch.
@torch.library.impl_abstract("aqlm::code1x16_matmat_dequant_transposed")
/usr/local/lib/python3.10/dist-packages/aqlm/inference_kernels/cuda_kernel.py:62: FutureWarning: torch.library.impl_abstract was renamed to torch.library.register_fake. Please use that instead; we will remove torch.library.impl_abstract in a future version of PyTorch.
@torch.library.impl_abstract("aqlm::code2x8_matmat")
/usr/local/lib/python3.10/dist-packages/aqlm/inference_kernels/cuda_kernel.py:75: FutureWarning: torch.library.impl_abstract was renamed to torch.library.register_fake. Please use that instead; we will remove torch.library.impl_abstract in a future version of PyTorch.
@torch.library.impl_abstract("aqlm::code2x8_matmat_dequant")
/usr/local/lib/python3.10/dist-packages/aqlm/inference_kernels/cuda_kernel.py:88: FutureWarning: torch.library.impl_abstract was renamed to torch.library.register_fake. Please use that instead; we will remove torch.library.impl_abstract in a future version of PyTorch.
@torch.library.impl_abstract("aqlm::code2x8_matmat_dequant_transposed")

AttributeError Traceback (most recent call last)
in <cell line: 5>()
      3 pipe = pipeline("text-generation", model="ISTA-DASLab/Meta-Llama-3-70B-AQLM-PV-1Bit-1x16", device_map="auto")
      4 
----> 5 result = pipe(
      6     "hi",
      7     max_length=10,    # number of output tokens

26 frames
/usr/local/lib/python3.10/dist-packages/torch/__init__.py in __getattr__(name)
   2560         return importlib.import_module(f".{name}", __name__)
   2561 
-> 2562     raise AttributeError(f"module '{__name__}' has no attribute '{name}'")
   2563 
   2564 

AttributeError: module 'torch' has no attribute 'Any'
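
The two UserWarnings above point at one fixable part of this call: temperature and top_k are ignored unless sampling is enabled, and max_length counts the prompt tokens as well as the output. A minimal sketch of the same pipeline call with sampling enabled follows; it only addresses those warnings and does not fix the AttributeError: module 'torch' has no attribute 'Any' raised deeper in the stack.

# Sketch only: same pipeline call with sampling enabled so temperature/top_k apply.
result = pipe(
    "hi",
    max_new_tokens=10,  # cap generated tokens only, instead of total max_length
    do_sample=True,     # required for temperature/top_k to take effect
    temperature=0.7,
    top_k=40,
)
print(result)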

werruww commented Dec 19, 2024

import os
import torch
from vllm import LLM, SamplingParams
from torch.cuda.amp import autocast

# Set a disk cache directory
DISK_CACHE_DIR = "/content/model_cache"
os.makedirs(DISK_CACHE_DIR, exist_ok=True)

# Set environment variables
os.environ['PYTORCH_CUDA_ALLOC_CONF'] = 'max_split_size_mb:64,expandable_segments:True'
os.environ['TRANSFORMERS_CACHE'] = DISK_CACHE_DIR
os.environ['CUDA_LAUNCH_BLOCKING'] = '1'
os.environ['HF_HOME'] = DISK_CACHE_DIR

# Create a swap file (6 GB) to relieve memory pressure
def setup_disk_cache():
    !fallocate -l 6G /content/swapfile
    !chmod 600 /content/swapfile
    !mkswap /content/swapfile
    !swapon /content/swapfile
    print("Swap file created and activated")

# Clean up CUDA memory
def cleanup_cuda_memory():
    torch.cuda.empty_cache()
    torch.cuda.reset_peak_memory_stats()
    if torch.cuda.is_available():
        print(f"CUDA memory allocated: {torch.cuda.memory_allocated() / 1024**2:.2f} MB")

# Set up the disk cache and clear CUDA memory
setup_disk_cache()
cleanup_cuda_memory()

# Set up the model with a smaller swap space
llm = LLM(
    model="ISTA-DASLab/Meta-Llama-3-70B-AQLM-PV-1Bit-1x16",
    trust_remote_code=True,
    tensor_parallel_size=1,
    gpu_memory_utilization=0.5,
    max_model_len=32,
    swap_space=6,                # swap space in GB
    max_num_batched_tokens=32,   # limit the number of batched input tokens
    max_num_seqs=1,
    enable_chunked_prefill=True,
    enforce_eager=True,
    dtype='float16'              # half precision to reduce memory consumption
)

tokenizer = llm.get_tokenizer()

conversations = tokenizer.apply_chat_template(
    [{'role': 'user', 'content': 'Hi'}],
    tokenize=False,
)

# Use autocast to reduce memory usage
with autocast():
    with torch.no_grad():
        outputs = llm.generate(
            [conversations],
            SamplingParams(
                temperature=0.7,
                top_p=0.9,
                max_tokens=16,
                presence_penalty=0.0,
                frequency_penalty=0.0,
                top_k=20,
            ),
            use_tqdm=False
        )

print(outputs[0].outputs[0].text)

# Clean up memory when done
cleanup_cuda_memory()
del llm
torch.cuda.empty_cache()

# Deactivate and remove the swap file at the end
!swapoff /content/swapfile
!rm /content/swapfile

mkswap: /content/swapfile: warning: wiping old swap signature.
Setting up swapspace version 1, size = 6 GiB (6442446848 bytes)
no label, UUID=79bd9dae-2c6e-43b7-8c3f-cb4931329bc1
swapon: /content/swapfile: swapon failed: Invalid argument
Swap file created and activated
CUDA memory allocated: 0.00 MB
/usr/local/lib/python3.10/dist-packages/huggingface_hub/utils/_auth.py:94: UserWarning:
The secret HF_TOKEN does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.
warnings.warn(
INFO 12-19 04:16:28 config.py:478] This model supports multiple tasks: {'generate', 'score', 'classify', 'reward', 'embed'}. Defaulting to 'generate'.
WARNING 12-19 04:16:29 config.py:556] aqlm quantization is not fully optimized yet. The speed can be slower than non-quantized models.
INFO 12-19 04:16:29 config.py:1364] Chunked prefill is enabled with max_num_batched_tokens=32.
WARNING 12-19 04:16:29 cuda.py:98] To see benefits of async output processing, enable CUDA graph. Since, enforce-eager is enabled, async output processor cannot be used
WARNING 12-19 04:16:29 config.py:604] Async output processing is not supported on the current platform type cuda.
WARNING 12-19 04:16:29 config.py:958] Possibly too large swap space. 6.00 GiB out of the 12.67 GiB total CPU memory is allocated for the swap space.
INFO 12-19 04:16:29 llm_engine.py:249] Initializing an LLM engine (v0.6.5) with config: model='ISTA-DASLab/Meta-Llama-3-70B-AQLM-PV-1Bit-1x16', speculative_config=None, tokenizer='ISTA-DASLab/Meta-Llama-3-70B-AQLM-PV-1Bit-1x16', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, override_neuron_config=None, tokenizer_revision=None, trust_remote_code=True, dtype=torch.float16, max_seq_len=32, download_dir=None, load_format=auto, tensor_parallel_size=1, pipeline_parallel_size=1, disable_custom_all_reduce=False, quantization=aqlm, enforce_eager=True, kv_cache_dtype=auto, quantization_param_path=None, device_config=cuda, decoding_config=DecodingConfig(guided_decoding_backend='xgrammar'), observability_config=ObservabilityConfig(otlp_traces_endpoint=None, collect_model_forward_time=False, collect_model_execute_time=False), seed=0, served_model_name=ISTA-DASLab/Meta-Llama-3-70B-AQLM-PV-1Bit-1x16, num_scheduler_steps=1, multi_step_stream_outputs=True, enable_prefix_caching=False, chunked_prefill_enabled=True, use_async_output_proc=False, mm_cache_preprocessor=False, mm_processor_kwargs=None, pooler_config=None, compilation_config={"splitting_ops":["vllm.unified_attention","vllm.unified_attention_with_output"],"candidate_compile_sizes":[],"compile_sizes":[],"capture_sizes":[],"max_capture_size":0}, use_cached_outputs=False,
INFO 12-19 04:16:30 selector.py:217] Cannot use FlashAttention-2 backend for Volta and Turing GPUs.
INFO 12-19 04:16:30 selector.py:129] Using XFormers backend.
INFO 12-19 04:16:31 model_runner.py:1092] Starting to load model ISTA-DASLab/Meta-Llama-3-70B-AQLM-PV-1Bit-1x16...
INFO 12-19 04:16:32 weight_utils.py:243] Using model weights format ['*.safetensors']
Loading safetensors checkpoint shards: 100% Completed | 3/3 [01:02<00:00, 20.35s/it]
INFO 12-19 04:17:36 model_runner.py:1097] Loading model weights took 12.9935 GB
INFO 12-19 04:17:56 worker.py:241] Memory profiling takes 19.42 seconds
INFO 12-19 04:17:56 worker.py:241] the current vLLM instance can use total_gpu_memory (14.75GiB) x gpu_memory_utilization (0.50) = 7.37GiB
INFO 12-19 04:17:56 worker.py:241] model weights take 12.99GiB; non_torch_memory takes 0.07GiB; PyTorch activation peak memory takes 1.00GiB; the rest of the memory reserved for KV Cache is -6.69GiB.
INFO 12-19 04:17:56 gpu_executor.py:76] # GPU blocks: 0, # CPU blocks: 1228
INFO 12-19 04:17:56 gpu_executor.py:80] Maximum concurrency for 32 tokens per request: 0.00x

ValueError Traceback (most recent call last)
in <cell line: 36>()
     34 
     35 # Set up the model with a smaller swap space
---> 36 llm = LLM(
     37     model="ISTA-DASLab/Meta-Llama-3-70B-AQLM-PV-1Bit-1x16",
     38     trust_remote_code=True,

7 frames
/usr/local/lib/python3.10/dist-packages/vllm/worker/worker.py in raise_if_cache_size_invalid(num_gpu_blocks, block_size, is_attention_free, max_model_len)
491 "blocks are allocated.")
492 if not is_attention_free and num_gpu_blocks <= 0:
--> 493 raise ValueError("No available memory for the cache blocks. "
494 "Try increasing gpu_memory_utilization when "
495 "initializing the engine.")

ValueError: No available memory for the cache blocks. Try increasing gpu_memory_utilization when initializing the engine.
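
The log above makes the cause of this ValueError explicit: the quantized weights alone take 12.99 GiB, while gpu_memory_utilization=0.5 of the 14.75 GiB GPU budgets only 7.37 GiB, so the KV-cache share comes out negative (-6.69 GiB) and zero GPU blocks can be allocated. Even at gpu_memory_utilization=1.0 the budget is 14.75 GiB, leaving roughly 14.75 - 12.99 - 0.07 - 1.00 ≈ 0.69 GiB for the KV cache. A sketch of changed engine arguments along those lines (illustrative values, not verified on this GPU):

# Sketch only: raise the GPU budget so weights (12.99 GiB) + KV cache fit in 14.75 GiB.
# Whether ~0.6 GiB of KV cache is enough for this model and max_model_len is not verified here.
llm = LLM(
    model="ISTA-DASLab/Meta-Llama-3-70B-AQLM-PV-1Bit-1x16",
    trust_remote_code=True,
    gpu_memory_utilization=0.98,  # was 0.5; budget becomes ~14.45 GiB instead of 7.37 GiB
    max_model_len=32,
    swap_space=2,                 # the log warns that 6 GiB of 12.67 GiB CPU RAM is a lot for swap
    enforce_eager=True,
    dtype='float16',
)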
