
[Bug] cannot load Gemma2 awq #2099

Open
5 tasks done
Foreist opened this issue Nov 20, 2024 · 0 comments
Foreist commented Nov 20, 2024

Checklist

  • 1. I have searched related issues but cannot get the expected help.
  • 2. The bug has not been fixed in the latest version.
  • 3. Please note that if the bug-related issue you submitted lacks corresponding environment info and a minimal reproducible demo, it will be challenging for us to reproduce and resolve the issue, reducing the likelihood of receiving feedback.
  • 4. If the issue you raised is not a bug but a question, please raise a discussion at https://github.com/sgl-project/sglang/discussions/new/choose. Otherwise, it will be closed.
  • 5. Please use English, otherwise it will be closed.

Describe the bug

Installed versions: sglang 0.3.5.post2, vllm 0.6.3.post1

```
2024-11-20 14:51:39.374748: I tensorflow/core/util/port.cc:153] oneDNN custom operations are on. You may see slightly different numerical results due to floating-point round-off errors from different computation orders. To turn them off, set the environment variable TF_ENABLE_ONEDNN_OPTS=0.
2024-11-20 14:51:39.387585: E external/local_xla/xla/stream_executor/cuda/cuda_fft.cc:485] Unable to register cuFFT factory: Attempting to register factory for plugin cuFFT when one has already been registered
2024-11-20 14:51:39.402712: E external/local_xla/xla/stream_executor/cuda/cuda_dnn.cc:8454] Unable to register cuDNN factory: Attempting to register factory for plugin cuDNN when one has already been registered
2024-11-20 14:51:39.407465: E external/local_xla/xla/stream_executor/cuda/cuda_blas.cc:1452] Unable to register cuBLAS factory: Attempting to register factory for plugin cuBLAS when one has already been registered
2024-11-20 14:51:39.418505: I tensorflow/core/platform/cpu_feature_guard.cc:210] This TensorFlow binary is optimized to use available CPU instructions in performance-critical operations. To enable the following instructions: AVX2 AVX512F AVX512_VNNI FMA, in other operations, rebuild TensorFlow with the appropriate compiler flags.
2024-11-20 14:51:40.257121: W tensorflow/compiler/tf2tensorrt/utils/py_utils.cc:38] TF-TRT Warning: Could not find TensorRT
Automatically adjust --chunked-prefill-size for small GPUs.
[2024-11-20 14:51:43] server_args=ServerArgs(model_path='/workspace/Asan/model/Gukbap-Gemma2-9B-qlora-1epoch-AWQ', tokenizer_path='/workspace/Asan/model/Gukbap-Gemma2-9B-qlora-1epoch-AWQ', tokenizer_mode='auto', skip_tokenizer_init=False, load_format='auto', trust_remote_code=False, dtype='auto', kv_cache_dtype='auto', quantization=None, context_length=None, device='cuda', served_model_name='/workspace/Asan/model/Gukbap-Gemma2-9B-qlora-1epoch-AWQ', chat_template=None, is_embedding=False, host='0.0.0.0', port=20022, mem_fraction_static=0.88, max_running_requests=None, max_total_tokens=None, chunked_prefill_size=2048, max_prefill_tokens=16384, schedule_policy='lpm', schedule_conservativeness=1.0, tp_size=1, stream_interval=1, random_seed=703420908, constrained_json_whitespace_pattern=None, watchdog_timeout=300, download_dir=None, log_level='info', log_level_http=None, log_requests=False, show_time_cost=False, enable_metrics=False, decode_log_interval=40, api_key=None, file_storage_pth='SGLang_storage', enable_cache_report=False, dp_size=1, load_balance_method='round_robin', dist_init_addr=None, nnodes=1, node_rank=0, json_model_override_args='{}', enable_double_sparsity=False, ds_channel_config_path=None, ds_heavy_channel_num=32, ds_heavy_token_num=256, ds_heavy_channel_type='qk', ds_sparse_decode_threshold=4096, lora_paths=None, max_loras_per_batch=8, attention_backend='flashinfer', sampling_backend='flashinfer', grammar_backend='outlines', disable_flashinfer=False, disable_flashinfer_sampling=False, disable_radix_cache=False, disable_jump_forward=False, disable_cuda_graph=False, disable_cuda_graph_padding=False, disable_disk_cache=False, disable_custom_all_reduce=False, disable_mla=False, disable_penalizer=False, disable_nan_detection=False, enable_overlap_schedule=False, enable_mixed_chunk=False, enable_torch_compile=False, torch_compile_max_bs=32, cuda_graph_max_bs=4, torchao_config='', enable_p2p_check=False, triton_attention_reduce_in_fp32=False, num_continuous_decode_steps=1, delete_ckpt_after_loading=False)
[2024-11-20 14:51:53 TP0] Init torch distributed begin.
[2024-11-20 14:51:53 TP0] Load weight begin. avail mem=23.36 GB
INFO 11-20 14:51:53 awq_marlin.py:97] The model is convertible to awq_marlin during runtime. Using awq_marlin kernel.
WARNING 11-20 14:51:53 interfaces.py:137] The model (<class 'sglang.srt.models.gemma2.Gemma2ForCausalLM'>) contains all LoRA-specific attributes, but does not set supports_lora=True.
Loading safetensors checkpoint shards:   0% Completed | 0/2 [00:00<?, ?it/s]
Loading safetensors checkpoint shards: 100% Completed | 2/2 [00:01<00:00, 1.64it/s]
```

```
[2024-11-20 14:51:55 TP0] Traceback (most recent call last):
  File "/usr/local/lib/python3.12/site-packages/sglang/srt/managers/scheduler.py", line 1254, in run_scheduler_process
    scheduler = Scheduler(server_args, port_args, gpu_id, tp_rank, dp_rank)
                ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/site-packages/sglang/srt/managers/scheduler.py", line 169, in __init__
    self.tp_worker = TpWorkerClass(
                     ^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/site-packages/sglang/srt/managers/tp_worker.py", line 55, in __init__
    self.model_runner = ModelRunner(
                        ^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/site-packages/sglang/srt/model_executor/model_runner.py", line 150, in __init__
    self.load_model()
  File "/usr/local/lib/python3.12/site-packages/sglang/srt/model_executor/model_runner.py", line 257, in load_model
    self.model = get_model(
                 ^^^^^^^^^^
  File "/usr/local/lib/python3.12/site-packages/vllm/model_executor/model_loader/__init__.py", line 19, in get_model
    return loader.load_model(model_config=model_config,
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/site-packages/vllm/model_executor/model_loader/loader.py", line 402, in load_model
    model.load_weights(self._get_all_weights(model_config, model))
  File "/usr/local/lib/python3.12/site-packages/sglang/srt/models/gemma2.py", line 404, in load_weights
    raise RuntimeError(
RuntimeError: Some weights are not initialized from checkpoints: {'model.embed_tokens.weight'}

Killed
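
The RuntimeError reports that `model.embed_tokens.weight` was never loaded from the shards, which suggests the AWQ export may have dropped or renamed the embedding tensor. One way to confirm is to list the tensor names in each shard's header. The sketch below uses only the standard library and assumes the published safetensors layout (an 8-byte little-endian header size followed by a JSON header); the `demo.safetensors` file it writes is a stand-in for the real checkpoint shards.

```python
import json
import struct

def safetensors_keys(path: str) -> list[str]:
    """Return the tensor names stored in a .safetensors file by reading its header."""
    with open(path, "rb") as f:
        header_len = struct.unpack("<Q", f.read(8))[0]  # u64 little-endian header size
        header = json.loads(f.read(header_len))          # JSON header follows immediately
    return [k for k in header if k != "__metadata__"]    # skip the optional metadata entry

# Build a tiny stand-in shard so the sketch is self-contained.
header = {"model.embed_tokens.weight":
          {"dtype": "F32", "shape": [2, 2], "data_offsets": [0, 16]}}
blob = json.dumps(header).encode()
with open("demo.safetensors", "wb") as f:
    f.write(struct.pack("<Q", len(blob)) + blob + b"\x00" * 16)

print(safetensors_keys("demo.safetensors"))  # ['model.embed_tokens.weight']
```

Running this over the real AWQ shards would show whether the embedding tensor is missing from the files or only from the loader's mapping.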

Reproduction

```shell
CUDA_VISIBLE_DEVICES="3" python -m sglang.launch_server --model-path "/workspace/Asan/model/Gukbap-Gemma2-9B-qlora-1epoch-AWQ" --port 20022 --host 0.0.0.0
```

Environment

Installed versions: sglang 0.3.5.post2, vllm 0.6.3.post1
