Checklist
1. I have searched related issues but could not find the expected help.
2. The bug has not been fixed in the latest version.
3. Please note that if the bug-related issue you submitted lacks corresponding environment info and a minimal reproducible demo, it will be challenging for us to reproduce and resolve the issue, reducing the likelihood of receiving feedback.
Describe the bug
Installed versions: sglang 0.3.5.post2, vllm 0.6.3.post1. Launching the server against a local Gemma2-9B QLoRA AWQ checkpoint fails during weight loading and the scheduler process dies. Full log:
2024-11-20 14:51:39.374748: I tensorflow/core/util/port.cc:153] oneDNN custom operations are on. You may see slightly different numerical results due to floating-point round-off errors from different computation orders. To turn them off, set the environment variable TF_ENABLE_ONEDNN_OPTS=0.
2024-11-20 14:51:39.387585: E external/local_xla/xla/stream_executor/cuda/cuda_fft.cc:485] Unable to register cuFFT factory: Attempting to register factory for plugin cuFFT when one has already been registered
2024-11-20 14:51:39.402712: E external/local_xla/xla/stream_executor/cuda/cuda_dnn.cc:8454] Unable to register cuDNN factory: Attempting to register factory for plugin cuDNN when one has already been registered
2024-11-20 14:51:39.407465: E external/local_xla/xla/stream_executor/cuda/cuda_blas.cc:1452] Unable to register cuBLAS factory: Attempting to register factory for plugin cuBLAS when one has already been registered
2024-11-20 14:51:39.418505: I tensorflow/core/platform/cpu_feature_guard.cc:210] This TensorFlow binary is optimized to use available CPU instructions in performance-critical operations. To enable the following instructions: AVX2 AVX512F AVX512_VNNI FMA, in other operations, rebuild TensorFlow with the appropriate compiler flags.
2024-11-20 14:51:40.257121: W tensorflow/compiler/tf2tensorrt/utils/py_utils.cc:38] TF-TRT Warning: Could not find TensorRT
Automatically adjust --chunked-prefill-size for small GPUs.
[2024-11-20 14:51:43] server_args=ServerArgs(model_path='/workspace/Asan/model/Gukbap-Gemma2-9B-qlora-1epoch-AWQ', tokenizer_path='/workspace/Asan/model/Gukbap-Gemma2-9B-qlora-1epoch-AWQ', tokenizer_mode='auto', skip_tokenizer_init=False, load_format='auto', trust_remote_code=False, dtype='auto', kv_cache_dtype='auto', quantization=None, context_length=None, device='cuda', served_model_name='/workspace/Asan/model/Gukbap-Gemma2-9B-qlora-1epoch-AWQ', chat_template=None, is_embedding=False, host='0.0.0.0', port=20022, mem_fraction_static=0.88, max_running_requests=None, max_total_tokens=None, chunked_prefill_size=2048, max_prefill_tokens=16384, schedule_policy='lpm', schedule_conservativeness=1.0, tp_size=1, stream_interval=1, random_seed=703420908, constrained_json_whitespace_pattern=None, watchdog_timeout=300, download_dir=None, log_level='info', log_level_http=None, log_requests=False, show_time_cost=False, enable_metrics=False, decode_log_interval=40, api_key=None, file_storage_pth='SGLang_storage', enable_cache_report=False, dp_size=1, load_balance_method='round_robin', dist_init_addr=None, nnodes=1, node_rank=0, json_model_override_args='{}', enable_double_sparsity=False, ds_channel_config_path=None, ds_heavy_channel_num=32, ds_heavy_token_num=256, ds_heavy_channel_type='qk', ds_sparse_decode_threshold=4096, lora_paths=None, max_loras_per_batch=8, attention_backend='flashinfer', sampling_backend='flashinfer', grammar_backend='outlines', disable_flashinfer=False, disable_flashinfer_sampling=False, disable_radix_cache=False, disable_jump_forward=False, disable_cuda_graph=False, disable_cuda_graph_padding=False, disable_disk_cache=False, disable_custom_all_reduce=False, disable_mla=False, disable_penalizer=False, disable_nan_detection=False, enable_overlap_schedule=False, enable_mixed_chunk=False, enable_torch_compile=False, torch_compile_max_bs=32, cuda_graph_max_bs=4, torchao_config='', enable_p2p_check=False, triton_attention_reduce_in_fp32=False, num_continuous_decode_steps=1, delete_ckpt_after_loading=False)
[2024-11-20 14:51:53 TP0] Init torch distributed begin.
[2024-11-20 14:51:53 TP0] Load weight begin. avail mem=23.36 GB
INFO 11-20 14:51:53 awq_marlin.py:97] The model is convertible to awq_marlin during runtime. Using awq_marlin kernel.
WARNING 11-20 14:51:53 interfaces.py:137] The model (<class 'sglang.srt.models.gemma2.Gemma2ForCausalLM'>) contains all LoRA-specific attributes, but does not set `supports_lora=True`.
Loading safetensors checkpoint shards:   0% Completed | 0/2 [00:00<?, ?it/s]
Loading safetensors checkpoint shards: 100% Completed | 2/2 [00:01<00:00, 1.64it/s]
Loading safetensors checkpoint shards: 100% Completed | 2/2 [00:01<00:00, 1.64it/s]
[2024-11-20 14:51:55 TP0] Traceback (most recent call last):
  File "/usr/local/lib/python3.12/site-packages/sglang/srt/managers/scheduler.py", line 1254, in run_scheduler_process
    scheduler = Scheduler(server_args, port_args, gpu_id, tp_rank, dp_rank)
                ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/site-packages/sglang/srt/managers/scheduler.py", line 169, in __init__
    self.tp_worker = TpWorkerClass(
                     ^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/site-packages/sglang/srt/managers/tp_worker.py", line 55, in __init__
    self.model_runner = ModelRunner(
                        ^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/site-packages/sglang/srt/model_executor/model_runner.py", line 150, in __init__
    self.load_model()
  File "/usr/local/lib/python3.12/site-packages/sglang/srt/model_executor/model_runner.py", line 257, in load_model
    self.model = get_model(
                 ^^^^^^^^^^
  File "/usr/local/lib/python3.12/site-packages/vllm/model_executor/model_loader/__init__.py", line 19, in get_model
    return loader.load_model(model_config=model_config,
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/site-packages/vllm/model_executor/model_loader/loader.py", line 402, in load_model
    model.load_weights(self._get_all_weights(model_config, model))
  File "/usr/local/lib/python3.12/site-packages/sglang/srt/models/gemma2.py", line 404, in load_weights
    raise RuntimeError(
RuntimeError: Some weights are not initialized from checkpoints: {'model.embed_tokens.weight'}
Killed
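The failure itself comes from SGLang's Gemma2 loader: load_weights finishes iterating over the checkpoint without ever seeing model.embed_tokens.weight. A quick way to check whether that tensor is really absent from the AWQ shards is to list the tensor names directly. This is a minimal sketch, assuming the checkpoint is in safetensors format (as the loading log indicates); ckpt_dir is the model path from the repro command:

```python
# Diagnostic sketch: list tensor names across the safetensors shards to see
# whether model.embed_tokens.weight was dropped during AWQ quantization/export.
import glob
from safetensors import safe_open

ckpt_dir = "/workspace/Asan/model/Gukbap-Gemma2-9B-qlora-1epoch-AWQ"
names = set()
for shard in glob.glob(f"{ckpt_dir}/*.safetensors"):
    with safe_open(shard, framework="pt") as f:
        names.update(f.keys())

print("embed_tokens present:", "model.embed_tokens.weight" in names)
# Gemma2 ties lm_head to the input embedding, so lm_head.weight is normally
# not stored either; only the embedding tensor is expected in the checkpoint.
print("lm_head present:", "lm_head.weight" in names)
```

If the embedding tensor is missing, the quantization/export script most likely stripped the tied weight, and sglang/srt/models/gemma2.py (line 404 in this version) treats that as fatal; re-exporting the checkpoint with the embedding saved would be the fix in that case.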
Reproduction
CUDA_VISIBLE_DEVICES="3" python -m sglang.launch_server --model-path "/workspace/Asan/model/Gukbap-Gemma2-9B-qlora-1epoch-AWQ" --port 20022 --host 0.0.0.0
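As a cross-check that separates a broken checkpoint from an SGLang-specific loader bug, the same model can be pointed at vLLM directly (vllm 0.6.3.post1 is already installed, and SGLang delegates to its model loader). A minimal sketch, not the SGLang repro; the quantization argument is an assumption based on the "Using awq_marlin kernel" line in the log:

```python
# Cross-check sketch: if vLLM loads this checkpoint but sglang.launch_server
# does not, the failure is specific to sglang's gemma2 load_weights path.
from vllm import LLM, SamplingParams

llm = LLM(
    model="/workspace/Asan/model/Gukbap-Gemma2-9B-qlora-1epoch-AWQ",
    quantization="awq_marlin",    # assumption: matches the kernel reported in the log
    gpu_memory_utilization=0.88,  # mirrors mem_fraction_static=0.88 from server_args
)
out = llm.generate(["Hello"], SamplingParams(max_tokens=16))
print(out[0].outputs[0].text)
```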
Environment
Installed versions: sglang 0.3.5.post2, vllm 0.6.3.post1
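For completeness, the issue template also asks for the full environment report; with a standard install it can be generated with SGLang's bundled checker:

python3 -m sglang.check_env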