Checklist
1. I have searched related issues but could not find the expected help.
2. The bug has not been fixed in the latest version.
3. Please note that if the bug-related issue you submitted lacks corresponding environment info and a minimal reproducible demo, it will be challenging for us to reproduce and resolve the issue, reducing the likelihood of receiving feedback.
Describe the bug
Installed versions: sglang 0.3.5.post2, vllm 0.6.3.post1. Launching the server against a local Gemma2-9B QLoRA AWQ checkpoint fails during weight loading and the scheduler process dies. Full log:
2024-11-20 14:51:39.374748: I tensorflow/core/util/port.cc:153] oneDNN custom operations are on. You may see slightly different numerical results due to floating-point round-off errors from different computation orders. To turn them off, set the environment variable TF_ENABLE_ONEDNN_OPTS=0.
2024-11-20 14:51:39.387585: E external/local_xla/xla/stream_executor/cuda/cuda_fft.cc:485] Unable to register cuFFT factory: Attempting to register factory for plugin cuFFT when one has already been registered
2024-11-20 14:51:39.402712: E external/local_xla/xla/stream_executor/cuda/cuda_dnn.cc:8454] Unable to register cuDNN factory: Attempting to register factory for plugin cuDNN when one has already been registered
2024-11-20 14:51:39.407465: E external/local_xla/xla/stream_executor/cuda/cuda_blas.cc:1452] Unable to register cuBLAS factory: Attempting to register factory for plugin cuBLAS when one has already been registered
2024-11-20 14:51:39.418505: I tensorflow/core/platform/cpu_feature_guard.cc:210] This TensorFlow binary is optimized to use available CPU instructions in performance-critical operations. To enable the following instructions: AVX2 AVX512F AVX512_VNNI FMA, in other operations, rebuild TensorFlow with the appropriate compiler flags.
2024-11-20 14:51:40.257121: W tensorflow/compiler/tf2tensorrt/utils/py_utils.cc:38] TF-TRT Warning: Could not find TensorRT
Automatically adjust --chunked-prefill-size for small GPUs.
[2024-11-20 14:51:43] server_args=ServerArgs(model_path='/workspace/Asan/model/Gukbap-Gemma2-9B-qlora-1epoch-AWQ', tokenizer_path='/workspace/Asan/model/Gukbap-Gemma2-9B-qlora-1epoch-AWQ', tokenizer_mode='auto', skip_tokenizer_init=False, load_format='auto', trust_remote_code=False, dtype='auto', kv_cache_dtype='auto', quantization=None, context_length=None, device='cuda', served_model_name='/workspace/Asan/model/Gukbap-Gemma2-9B-qlora-1epoch-AWQ', chat_template=None, is_embedding=False, host='0.0.0.0', port=20022, mem_fraction_static=0.88, max_running_requests=None, max_total_tokens=None, chunked_prefill_size=2048, max_prefill_tokens=16384, schedule_policy='lpm', schedule_conservativeness=1.0, tp_size=1, stream_interval=1, random_seed=703420908, constrained_json_whitespace_pattern=None, watchdog_timeout=300, download_dir=None, log_level='info', log_level_http=None, log_requests=False, show_time_cost=False, enable_metrics=False, decode_log_interval=40, api_key=None, file_storage_pth='SGLang_storage', enable_cache_report=False, dp_size=1, load_balance_method='round_robin', dist_init_addr=None, nnodes=1, node_rank=0, json_model_override_args='{}', enable_double_sparsity=False, ds_channel_config_path=None, ds_heavy_channel_num=32, ds_heavy_token_num=256, ds_heavy_channel_type='qk', ds_sparse_decode_threshold=4096, lora_paths=None, max_loras_per_batch=8, attention_backend='flashinfer', sampling_backend='flashinfer', grammar_backend='outlines', disable_flashinfer=False, disable_flashinfer_sampling=False, disable_radix_cache=False, disable_jump_forward=False, disable_cuda_graph=False, disable_cuda_graph_padding=False, disable_disk_cache=False, disable_custom_all_reduce=False, disable_mla=False, disable_penalizer=False, disable_nan_detection=False, enable_overlap_schedule=False, enable_mixed_chunk=False, enable_torch_compile=False, torch_compile_max_bs=32, cuda_graph_max_bs=4, torchao_config='', enable_p2p_check=False, triton_attention_reduce_in_fp32=False, num_continuous_decode_steps=1, delete_ckpt_after_loading=False)
[2024-11-20 14:51:53 TP0] Init torch distributed begin.
[2024-11-20 14:51:53 TP0] Load weight begin. avail mem=23.36 GB
INFO 11-20 14:51:53 awq_marlin.py:97] The model is convertible to awq_marlin during runtime. Using awq_marlin kernel.
WARNING 11-20 14:51:53 interfaces.py:137] The model (<class 'sglang.srt.models.gemma2.Gemma2ForCausalLM'>) contains all LoRA-specific attributes, but does not set `supports_lora=True`.
Loading safetensors checkpoint shards:   0% Completed | 0/2 [00:00<?, ?it/s]
Loading safetensors checkpoint shards: 100% Completed | 2/2 [00:01<00:00, 1.64it/s]
Loading safetensors checkpoint shards: 100% Completed | 2/2 [00:01<00:00, 1.64it/s]
[2024-11-20 14:51:55 TP0] Traceback (most recent call last):
  File "/usr/local/lib/python3.12/site-packages/sglang/srt/managers/scheduler.py", line 1254, in run_scheduler_process
    scheduler = Scheduler(server_args, port_args, gpu_id, tp_rank, dp_rank)
                ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/site-packages/sglang/srt/managers/scheduler.py", line 169, in __init__
    self.tp_worker = TpWorkerClass(
                     ^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/site-packages/sglang/srt/managers/tp_worker.py", line 55, in __init__
    self.model_runner = ModelRunner(
                        ^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/site-packages/sglang/srt/model_executor/model_runner.py", line 150, in __init__
    self.load_model()
  File "/usr/local/lib/python3.12/site-packages/sglang/srt/model_executor/model_runner.py", line 257, in load_model
    self.model = get_model(
                 ^^^^^^^^^^
  File "/usr/local/lib/python3.12/site-packages/vllm/model_executor/model_loader/__init__.py", line 19, in get_model
    return loader.load_model(model_config=model_config,
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/site-packages/vllm/model_executor/model_loader/loader.py", line 402, in load_model
    model.load_weights(self._get_all_weights(model_config, model))
  File "/usr/local/lib/python3.12/site-packages/sglang/srt/models/gemma2.py", line 404, in load_weights
    raise RuntimeError(
RuntimeError: Some weights are not initialized from checkpoints: {'model.embed_tokens.weight'}
Killed
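The failure itself comes from SGLang's Gemma2 loader: load_weights finishes iterating over the checkpoint without ever seeing model.embed_tokens.weight. A quick way to check whether that tensor is really absent from the AWQ shards is to list the tensor names directly. This is a minimal sketch, assuming the checkpoint is in safetensors format (as the loading log indicates); ckpt_dir is the model path from the repro command:

```python
# Diagnostic sketch: list tensor names across the safetensors shards to see
# whether model.embed_tokens.weight was dropped during AWQ quantization/export.
import glob
from safetensors import safe_open

ckpt_dir = "/workspace/Asan/model/Gukbap-Gemma2-9B-qlora-1epoch-AWQ"
names = set()
for shard in glob.glob(f"{ckpt_dir}/*.safetensors"):
    with safe_open(shard, framework="pt") as f:
        names.update(f.keys())

print("embed_tokens present:", "model.embed_tokens.weight" in names)
# Gemma2 ties lm_head to the input embedding, so lm_head.weight is normally
# not stored either; only the embedding tensor is expected in the checkpoint.
print("lm_head present:", "lm_head.weight" in names)
```

If the embedding tensor is missing, the quantization/export script most likely stripped the tied weight, and sglang/srt/models/gemma2.py (line 404 in this version) treats that as fatal; re-exporting the checkpoint with the embedding saved would be the fix in that case.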
Reproduction
CUDA_VISIBLE_DEVICES="3" python -m sglang.launch_server --model-path "/workspace/Asan/model/Gukbap-Gemma2-9B-qlora-1epoch-AWQ" --port 20022 --host 0.0.0.0
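As a cross-check that separates a broken checkpoint from an SGLang-specific loader bug, the same model can be pointed at vLLM directly (vllm 0.6.3.post1 is already installed, and SGLang delegates to its model loader). A minimal sketch, not the SGLang repro; the quantization argument is an assumption based on the "Using awq_marlin kernel" line in the log:

```python
# Cross-check sketch: if vLLM loads this checkpoint but sglang.launch_server
# does not, the failure is specific to sglang's gemma2 load_weights path.
from vllm import LLM, SamplingParams

llm = LLM(
    model="/workspace/Asan/model/Gukbap-Gemma2-9B-qlora-1epoch-AWQ",
    quantization="awq_marlin",    # assumption: matches the kernel reported in the log
    gpu_memory_utilization=0.88,  # mirrors mem_fraction_static=0.88 from server_args
)
out = llm.generate(["Hello"], SamplingParams(max_tokens=16))
print(out[0].outputs[0].text)
```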
Environment
Installed versions: sglang 0.3.5.post2, vllm 0.6.3.post1
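For completeness, the issue template also asks for the full environment report; with a standard install it can be generated with SGLang's bundled checker:

python3 -m sglang.check_env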