not run #157

Open

werruww opened this issue Dec 19, 2024 · 1 comment

werruww commented Dec 19, 2024

from transformers import pipeline

pipe = pipeline("text-generation", model="ISTA-DASLab/Meta-Llama-3-70B-AQLM-PV-1Bit-1x16", device_map="auto")

result = pipe(
    "hi",
    max_length=10,    # number of output tokens
    temperature=0.7,  # sampling temperature
    top_k=40,         # top-k
)

print(result)

/usr/local/lib/python3.10/dist-packages/huggingface_hub/utils/_auth.py:94: UserWarning:
The secret HF_TOKEN does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.
warnings.warn(
Loading checkpoint shards: 100% 3/3 [01:07<00:00, 21.82s/it]
Device set to use cuda:0
Truncation was not explicitly activated but max_length is provided a specific value, please use truncation=True to explicitly truncate examples to max length. Defaulting to 'longest_first' truncation strategy. If you encode pairs of sequences (GLUE-style) with the tokenizer you can select this strategy more precisely by providing a specific strategy to truncation.
/usr/local/lib/python3.10/dist-packages/transformers/generation/configuration_utils.py:628: UserWarning: do_sample is set to False. However, temperature is set to 0.7 -- this flag is only used in sample-based generation modes. You should set do_sample=True or unset temperature.
warnings.warn(
/usr/local/lib/python3.10/dist-packages/transformers/generation/configuration_utils.py:650: UserWarning: do_sample is set to False. However, top_k is set to 40 -- this flag is only used in sample-based generation modes. You should set do_sample=True or unset top_k.
warnings.warn(
Setting pad_token_id to eos_token_id:128001 for open-end generation.
/usr/local/lib/python3.10/dist-packages/torch/utils/cpp_extension.py:1964: UserWarning: TORCH_CUDA_ARCH_LIST is not set, all archs for visible cards are included for compilation.
If this is not desired, please set os.environ['TORCH_CUDA_ARCH_LIST'].
warnings.warn(
/usr/local/lib/python3.10/dist-packages/aqlm/inference_kernels/cuda_kernel.py:20: FutureWarning: torch.library.impl_abstract was renamed to torch.library.register_fake. Please use that instead; we will remove torch.library.impl_abstract in a future version of PyTorch.
@torch.library.impl_abstract("aqlm::code1x16_matmat")
/usr/local/lib/python3.10/dist-packages/aqlm/inference_kernels/cuda_kernel.py:33: FutureWarning: torch.library.impl_abstract was renamed to torch.library.register_fake. Please use that instead; we will remove torch.library.impl_abstract in a future version of PyTorch.
@torch.library.impl_abstract("aqlm::code1x16_matmat_dequant")
/usr/local/lib/python3.10/dist-packages/aqlm/inference_kernels/cuda_kernel.py:48: FutureWarning: torch.library.impl_abstract was renamed to torch.library.register_fake. Please use that instead; we will remove torch.library.impl_abstract in a future version of PyTorch.
@torch.library.impl_abstract("aqlm::code1x16_matmat_dequant_transposed")
/usr/local/lib/python3.10/dist-packages/aqlm/inference_kernels/cuda_kernel.py:62: FutureWarning: torch.library.impl_abstract was renamed to torch.library.register_fake. Please use that instead; we will remove torch.library.impl_abstract in a future version of PyTorch.
@torch.library.impl_abstract("aqlm::code2x8_matmat")
/usr/local/lib/python3.10/dist-packages/aqlm/inference_kernels/cuda_kernel.py:75: FutureWarning: torch.library.impl_abstract was renamed to torch.library.register_fake. Please use that instead; we will remove torch.library.impl_abstract in a future version of PyTorch.
@torch.library.impl_abstract("aqlm::code2x8_matmat_dequant")
/usr/local/lib/python3.10/dist-packages/aqlm/inference_kernels/cuda_kernel.py:88: FutureWarning: torch.library.impl_abstract was renamed to torch.library.register_fake. Please use that instead; we will remove torch.library.impl_abstract in a future version of PyTorch.
@torch.library.impl_abstract("aqlm::code2x8_matmat_dequant_transposed")

AttributeError Traceback (most recent call last)
in <cell line: 5>()
      3 pipe = pipeline("text-generation", model="ISTA-DASLab/Meta-Llama-3-70B-AQLM-PV-1Bit-1x16", device_map="auto")
      4 
----> 5 result = pipe(
      6     "hi",
      7     max_length=10,    # number of output tokens

26 frames
/usr/local/lib/python3.10/dist-packages/torch/__init__.py in __getattr__(name)
   2560         return importlib.import_module(f".{name}", __name__)
   2561 
-> 2562     raise AttributeError(f"module '{__name__}' has no attribute '{name}'")
   2563 
   2564 

AttributeError: module 'torch' has no attribute 'Any'
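
The two UserWarnings above point at one fixable part of this call: temperature and top_k are ignored unless sampling is enabled, and max_length counts the prompt tokens as well as the output. A minimal sketch of the same pipeline call with sampling enabled follows; it only addresses those warnings and does not fix the AttributeError: module 'torch' has no attribute 'Any' raised deeper in the stack.

# Sketch only: same pipeline call with sampling enabled so temperature/top_k apply.
result = pipe(
    "hi",
    max_new_tokens=10,  # cap generated tokens only, instead of total max_length
    do_sample=True,     # required for temperature/top_k to take effect
    temperature=0.7,
    top_k=40,
)
print(result)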

werruww commented Dec 19, 2024

import os
import torch
from vllm import LLM, SamplingParams
from torch.cuda.amp import autocast

# Set a disk cache directory
DISK_CACHE_DIR = "/content/model_cache"
os.makedirs(DISK_CACHE_DIR, exist_ok=True)

# Set environment variables
os.environ['PYTORCH_CUDA_ALLOC_CONF'] = 'max_split_size_mb:64,expandable_segments:True'
os.environ['TRANSFORMERS_CACHE'] = DISK_CACHE_DIR
os.environ['CUDA_LAUNCH_BLOCKING'] = '1'
os.environ['HF_HOME'] = DISK_CACHE_DIR

# Create a swap file (6 GB) to relieve memory pressure
def setup_disk_cache():
    !fallocate -l 6G /content/swapfile
    !chmod 600 /content/swapfile
    !mkswap /content/swapfile
    !swapon /content/swapfile
    print("Swap file created and activated")

# Clean up CUDA memory
def cleanup_cuda_memory():
    torch.cuda.empty_cache()
    torch.cuda.reset_peak_memory_stats()
    if torch.cuda.is_available():
        print(f"CUDA memory allocated: {torch.cuda.memory_allocated() / 1024**2:.2f} MB")

# Set up the disk cache and clear CUDA memory
setup_disk_cache()
cleanup_cuda_memory()

# Set up the model with a smaller swap space
llm = LLM(
    model="ISTA-DASLab/Meta-Llama-3-70B-AQLM-PV-1Bit-1x16",
    trust_remote_code=True,
    tensor_parallel_size=1,
    gpu_memory_utilization=0.5,
    max_model_len=32,
    swap_space=6,                # swap space in GB
    max_num_batched_tokens=32,   # limit the number of batched input tokens
    max_num_seqs=1,
    enable_chunked_prefill=True,
    enforce_eager=True,
    dtype='float16'              # half precision to reduce memory consumption
)

tokenizer = llm.get_tokenizer()

conversations = tokenizer.apply_chat_template(
    [{'role': 'user', 'content': 'Hi'}],
    tokenize=False,
)

# Use autocast to reduce memory usage
with autocast():
    with torch.no_grad():
        outputs = llm.generate(
            [conversations],
            SamplingParams(
                temperature=0.7,
                top_p=0.9,
                max_tokens=16,
                presence_penalty=0.0,
                frequency_penalty=0.0,
                top_k=20,
            ),
            use_tqdm=False
        )

print(outputs[0].outputs[0].text)

# Clean up memory when done
cleanup_cuda_memory()
del llm
torch.cuda.empty_cache()

# Deactivate and remove the swap file at the end
!swapoff /content/swapfile
!rm /content/swapfile

mkswap: /content/swapfile: warning: wiping old swap signature.
Setting up swapspace version 1, size = 6 GiB (6442446848 bytes)
no label, UUID=79bd9dae-2c6e-43b7-8c3f-cb4931329bc1
swapon: /content/swapfile: swapon failed: Invalid argument
Swap file created and activated
CUDA memory allocated: 0.00 MB
/usr/local/lib/python3.10/dist-packages/huggingface_hub/utils/_auth.py:94: UserWarning:
The secret HF_TOKEN does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.
warnings.warn(
INFO 12-19 04:16:28 config.py:478] This model supports multiple tasks: {'generate', 'score', 'classify', 'reward', 'embed'}. Defaulting to 'generate'.
WARNING 12-19 04:16:29 config.py:556] aqlm quantization is not fully optimized yet. The speed can be slower than non-quantized models.
INFO 12-19 04:16:29 config.py:1364] Chunked prefill is enabled with max_num_batched_tokens=32.
WARNING 12-19 04:16:29 cuda.py:98] To see benefits of async output processing, enable CUDA graph. Since, enforce-eager is enabled, async output processor cannot be used
WARNING 12-19 04:16:29 config.py:604] Async output processing is not supported on the current platform type cuda.
WARNING 12-19 04:16:29 config.py:958] Possibly too large swap space. 6.00 GiB out of the 12.67 GiB total CPU memory is allocated for the swap space.
INFO 12-19 04:16:29 llm_engine.py:249] Initializing an LLM engine (v0.6.5) with config: model='ISTA-DASLab/Meta-Llama-3-70B-AQLM-PV-1Bit-1x16', speculative_config=None, tokenizer='ISTA-DASLab/Meta-Llama-3-70B-AQLM-PV-1Bit-1x16', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, override_neuron_config=None, tokenizer_revision=None, trust_remote_code=True, dtype=torch.float16, max_seq_len=32, download_dir=None, load_format=auto, tensor_parallel_size=1, pipeline_parallel_size=1, disable_custom_all_reduce=False, quantization=aqlm, enforce_eager=True, kv_cache_dtype=auto, quantization_param_path=None, device_config=cuda, decoding_config=DecodingConfig(guided_decoding_backend='xgrammar'), observability_config=ObservabilityConfig(otlp_traces_endpoint=None, collect_model_forward_time=False, collect_model_execute_time=False), seed=0, served_model_name=ISTA-DASLab/Meta-Llama-3-70B-AQLM-PV-1Bit-1x16, num_scheduler_steps=1, multi_step_stream_outputs=True, enable_prefix_caching=False, chunked_prefill_enabled=True, use_async_output_proc=False, mm_cache_preprocessor=False, mm_processor_kwargs=None, pooler_config=None, compilation_config={"splitting_ops":["vllm.unified_attention","vllm.unified_attention_with_output"],"candidate_compile_sizes":[],"compile_sizes":[],"capture_sizes":[],"max_capture_size":0}, use_cached_outputs=False,
INFO 12-19 04:16:30 selector.py:217] Cannot use FlashAttention-2 backend for Volta and Turing GPUs.
INFO 12-19 04:16:30 selector.py:129] Using XFormers backend.
INFO 12-19 04:16:31 model_runner.py:1092] Starting to load model ISTA-DASLab/Meta-Llama-3-70B-AQLM-PV-1Bit-1x16...
INFO 12-19 04:16:32 weight_utils.py:243] Using model weights format ['*.safetensors']
Loading safetensors checkpoint shards: 100% Completed | 3/3 [01:02<00:00, 20.35s/it]
INFO 12-19 04:17:36 model_runner.py:1097] Loading model weights took 12.9935 GB
INFO 12-19 04:17:56 worker.py:241] Memory profiling takes 19.42 seconds
INFO 12-19 04:17:56 worker.py:241] the current vLLM instance can use total_gpu_memory (14.75GiB) x gpu_memory_utilization (0.50) = 7.37GiB
INFO 12-19 04:17:56 worker.py:241] model weights take 12.99GiB; non_torch_memory takes 0.07GiB; PyTorch activation peak memory takes 1.00GiB; the rest of the memory reserved for KV Cache is -6.69GiB.
INFO 12-19 04:17:56 gpu_executor.py:76] # GPU blocks: 0, # CPU blocks: 1228
INFO 12-19 04:17:56 gpu_executor.py:80] Maximum concurrency for 32 tokens per request: 0.00x

ValueError Traceback (most recent call last)
in <cell line: 36>()
     34 
     35 # Set up the model with a smaller swap space
---> 36 llm = LLM(
     37     model="ISTA-DASLab/Meta-Llama-3-70B-AQLM-PV-1Bit-1x16",
     38     trust_remote_code=True,

7 frames
/usr/local/lib/python3.10/dist-packages/vllm/worker/worker.py in raise_if_cache_size_invalid(num_gpu_blocks, block_size, is_attention_free, max_model_len)
491 "blocks are allocated.")
492 if not is_attention_free and num_gpu_blocks <= 0:
--> 493 raise ValueError("No available memory for the cache blocks. "
494 "Try increasing gpu_memory_utilization when "
495 "initializing the engine.")

ValueError: No available memory for the cache blocks. Try increasing gpu_memory_utilization when initializing the engine.
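
The log above makes the cause of this ValueError explicit: the quantized weights alone take 12.99 GiB, while gpu_memory_utilization=0.5 of the 14.75 GiB GPU budgets only 7.37 GiB, so the KV-cache share comes out negative (-6.69 GiB) and zero GPU blocks can be allocated. Even at gpu_memory_utilization=1.0 the budget is 14.75 GiB, leaving roughly 14.75 - 12.99 - 0.07 - 1.00 ≈ 0.69 GiB for the KV cache. A sketch of changed engine arguments along those lines (illustrative values, not verified on this GPU):

# Sketch only: raise the GPU budget so weights (12.99 GiB) + KV cache fit in 14.75 GiB.
# Whether ~0.6 GiB of KV cache is enough for this model and max_model_len is not verified here.
llm = LLM(
    model="ISTA-DASLab/Meta-Llama-3-70B-AQLM-PV-1Bit-1x16",
    trust_remote_code=True,
    gpu_memory_utilization=0.98,  # was 0.5; budget becomes ~14.45 GiB instead of 7.37 GiB
    max_model_len=32,
    swap_space=2,                 # the log warns that 6 GiB of 12.67 GiB CPU RAM is a lot for swap
    enforce_eager=True,
    dtype='float16',
)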
