[Bug] Unable to use gptq or awq with torch.compile (8*A40) #1522

Closed
5 tasks done
smallstepman opened this issue Sep 26, 2024 · 9 comments

Comments

@smallstepman

Checklist

  • 1. I have searched related issues but cannot get the expected help.
  • 2. The bug has not been fixed in the latest version.
  • 3. Please note that if the bug-related issue you submitted lacks corresponding environment info and a minimal reproducible demo, it will be challenging for us to reproduce and resolve the issue, reducing the likelihood of receiving feedback.
  • 4. If the issue you raised is not a bug but a question, please raise a discussion at https://github.com/sgl-project/sglang/discussions/new/choose . Otherwise, it will be closed.
  • 5. Please use English, otherwise it will be closed.

Describe the bug

I can't use --enable-torch-compile together with --dp: the server always fails with either an OOM error or a "not enough memory" error (see the two examples below). I deliberately picked one of the smallest models (0.5B) and a GPU with plenty of VRAM (the A40 has 48 GB), yet it still doesn't work.

Happy to help hunt this down.
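
One way to check whether memory is really the constraint is to diff the "avail mem=" figures that the server prints during startup. The following is a minimal sketch, not part of sglang: the helper names `avail_mem_by_stage` and `deltas` are hypothetical, and the three input lines are copied from the first reproduction log in this issue.

```python
import re

# Three lines copied verbatim from the first reproduction log in this issue.
LOG = """\
[16:53:00 DP0 TP0] Load weight begin. avail mem=44.09 GB
[16:53:03 DP0 TP0] Load weight end. type=Qwen2ForCausalLM, dtype=torch.float16, avail mem=43.33 GB
[16:53:03 DP0 TP0] Memory pool end. avail mem=41.58 GB
"""

# Stage name is the text between "]" and the first "."; the reading follows "avail mem=".
AVAIL_MEM = re.compile(r"\]\s*(?P<stage>[^.]+)\..*?avail mem=(?P<gb>\d+(?:\.\d+)?) GB")


def avail_mem_by_stage(log_text):
    """Return [(stage, available GB)] in log order (hypothetical helper)."""
    readings = []
    for line in log_text.splitlines():
        match = AVAIL_MEM.search(line)
        if match:
            readings.append((match.group("stage").strip(), float(match.group("gb"))))
    return readings


def deltas(readings):
    """Return [(stage, GB consumed since the previous reading)]."""
    return [
        (stage, round(prev_gb - gb, 2))
        for (_, prev_gb), (stage, gb) in zip(readings, readings[1:])
    ]


if __name__ == "__main__":
    for stage, used in deltas(avail_mem_by_stage(LOG)):
        print(f"{stage}: {used:.2f} GB consumed")
```

On these three lines the weights account for about 0.76 GB and the memory pool for about 1.75 GB, leaving 41.58 GB free when CUDA graph capture begins. That is at least consistent with the failure in the first log not being a simple shortage of VRAM.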

Reproduction

1

root@c670148f30c4:~# python -m sglang.launch_server --host 0.0.0.0 --port 30000 --model-path Qwen/Qwen2.5-0.5B-Instruct-AWQ --dp 8 --enable-p2p-check --mem-fraction-static 0.05 --enable-torch-compile
[16:52:58] server_args=ServerArgs(model_path='Qwen/Qwen2.5-0.5B-Instruct-AWQ', tokenizer_path='Qwen/Qwen2.5-0.5B-Instruct-AWQ', tokenizer_mode='auto', skip_tokenizer_init=False, load_format='auto', dtype='auto', kv_cache_dtype='auto', trust_remote_code=False, context_length=None, quantization=None, served_model_name='Qwen/Qwen2.5-0.5B-Instruct-AWQ', chat_template=None, is_embedding=False, host='0.0.0.0', port=30000, additional_ports=[30001, 30002, 30003, 30004, 30005, 30006, 30007, 30008, 30009, 30010, 30011], mem_fraction_static=0.05, max_running_requests=None, max_total_tokens=None, chunked_prefill_size=8192, max_prefill_tokens=16384, schedule_policy='lpm', schedule_conservativeness=1.0, tp_size=1, stream_interval=1, random_seed=795931686, constrained_json_whitespace_pattern=None, log_level='info', log_level_http=None, log_requests=False, show_time_cost=False, api_key=None, file_storage_pth='SGLang_storage', dp_size=8, load_balance_method='round_robin', nccl_init_addr=None, nnodes=1, node_rank=None, json_model_override_args='{}', attention_backend='flashinfer', sampling_backend='flashinfer', disable_flashinfer=False, disable_flashinfer_sampling=False, disable_radix_cache=False, disable_regex_jump_forward=False, disable_cuda_graph=False, disable_cuda_graph_padding=False, disable_disk_cache=False, disable_custom_all_reduce=False, disable_mla=False, enable_mixed_chunk=False, enable_torch_compile=True, max_torch_compile_bs=32, torchao_config='', enable_p2p_check=True, triton_attention_reduce_in_fp32=False, lora_paths=None, max_loras_per_batch=8)
[16:53:00 DP0 TP0] Init nccl begin.
[16:53:00 DP0 TP0] Load weight begin. avail mem=44.09 GB
INFO 09-26 16:53:00 awq_marlin.py:89] The model is convertible to awq_marlin during runtime. Using awq_marlin kernel.
[16:53:02 DP0 TP0] lm_eval is not installed, GPTQ may not be usable
INFO 09-26 16:53:02 weight_utils.py:236] Using model weights format ['*.safetensors']
INFO 09-26 16:53:02 weight_utils.py:280] No model.safetensors.index.json found in remote.
Loading safetensors checkpoint shards:   0% Completed | 0/1 [00:00<?, ?it/s]
Loading safetensors checkpoint shards: 100% Completed | 1/1 [00:00<00:00,  3.94it/s]
Loading safetensors checkpoint shards: 100% Completed | 1/1 [00:00<00:00,  3.94it/s]

[16:53:03 DP0 TP0] Load weight end. type=Qwen2ForCausalLM, dtype=torch.float16, avail mem=43.33 GB
[16:53:03 DP0 TP0] Memory pool end. avail mem=41.58 GB
[16:53:03 DP0 TP0] Capture cuda graph begin. This can take up to several minutes.
Process Process-1:1:
Traceback (most recent call last):
  File "/usr/local/lib/python3.11/dist-packages/sglang/srt/model_executor/cuda_graph_runner.py", line 151, in __init__
    self.capture()
  File "/usr/local/lib/python3.11/dist-packages/sglang/srt/model_executor/cuda_graph_runner.py", line 180, in capture
    ) = self.capture_one_batch_size(bs, forward)
        ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/dist-packages/sglang/srt/model_executor/cuda_graph_runner.py", line 221, in capture_one_batch_size
    run_once()
  File "/usr/local/lib/python3.11/dist-packages/sglang/srt/model_executor/cuda_graph_runner.py", line 215, in run_once
    return forward(input_ids, input_metadata.positions, input_metadata)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/dist-packages/torch/_dynamo/eval_frame.py", line 433, in _fn
    return fn(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/dist-packages/torch/_dynamo/convert_frame.py", line 1116, in __call__
    return self._torchdynamo_orig_callable(
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/dist-packages/torch/_dynamo/convert_frame.py", line 948, in __call__
    result = self._inner_convert(
             ^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/dist-packages/torch/_dynamo/convert_frame.py", line 472, in __call__
    return _compile(
           ^^^^^^^^^
  File "/usr/local/lib/python3.11/dist-packages/torch/_utils_internal.py", line 84, in wrapper_function
    return StrobelightCompileTimeProfiler.profile_compile_time(
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/dist-packages/torch/_strobelight/compile_time_profiler.py", line 129, in profile_compile_time
    return func(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^
  File "/usr/lib/python3.11/contextlib.py", line 81, in inner
    return func(*args, **kwds)
           ^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/dist-packages/torch/_dynamo/convert_frame.py", line 817, in _compile
    guarded_code = compile_inner(code, one_graph, hooks, transform)
                   ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/dist-packages/torch/_dynamo/utils.py", line 231, in time_wrapper
    r = func(*args, **kwargs)
        ^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/dist-packages/torch/_dynamo/convert_frame.py", line 636, in compile_inner
    out_code = transform_code_object(code, transform)
               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/dist-packages/torch/_dynamo/bytecode_transformation.py", line 1185, in transform_code_object
    transformations(instructions, code_options)
  File "/usr/local/lib/python3.11/dist-packages/torch/_dynamo/convert_frame.py", line 178, in _fn
    return fn(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/dist-packages/torch/_dynamo/convert_frame.py", line 582, in transform
    tracer.run()
  File "/usr/local/lib/python3.11/dist-packages/torch/_dynamo/symbolic_convert.py", line 2451, in run
    super().run()
  File "/usr/local/lib/python3.11/dist-packages/torch/_dynamo/symbolic_convert.py", line 893, in run
    while self.step():
          ^^^^^^^^^^^
  File "/usr/local/lib/python3.11/dist-packages/torch/_dynamo/symbolic_convert.py", line 805, in step
    self.dispatch_table[inst.opcode](self, inst)
  File "/usr/local/lib/python3.11/dist-packages/torch/_dynamo/symbolic_convert.py", line 499, in wrapper
    return inner_fn(self, inst)
           ^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/dist-packages/torch/_dynamo/symbolic_convert.py", line 1500, in CALL_FUNCTION_EX
    self.call_function(fn, argsvars.items, kwargsvars)
  File "/usr/local/lib/python3.11/dist-packages/torch/_dynamo/symbolic_convert.py", line 743, in call_function
    self.push(fn.call_function(self, args, kwargs))
              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/dist-packages/torch/_dynamo/variables/lazy.py", line 132, in realize_and_forward
    return getattr(self.realize(), name)(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/dist-packages/torch/_dynamo/variables/functions.py", line 293, in call_function
    return super().call_function(tx, args, kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/dist-packages/torch/_dynamo/variables/functions.py", line 90, in call_function
    return tx.inline_user_function_return(self, [*self.self_args(), *args], kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/dist-packages/torch/_dynamo/symbolic_convert.py", line 749, in inline_user_function_return
    return InliningInstructionTranslator.inline_call(self, fn, args, kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/dist-packages/torch/_dynamo/symbolic_convert.py", line 2666, in inline_call
    return cls.inline_call_(parent, func, args, kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/dist-packages/torch/_dynamo/symbolic_convert.py", line 2782, in inline_call_
    tracer.run()
...
  File "/usr/local/lib/python3.11/dist-packages/torch/_dynamo/symbolic_convert.py", line 2059, in CALL
    self.call_function(fn, args, kwargs)
  File "/usr/local/lib/python3.11/dist-packages/torch/_dynamo/symbolic_convert.py", line 743, in call_function
    self.push(fn.call_function(self, args, kwargs))
              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  ...
  File "/usr/local/lib/python3.11/dist-packages/sglang/srt/layers/linear.py", line 375, in forward
    output_parallel = self.quant_method.apply(self, input_, bias)
  File "/usr/local/lib/python3.11/dist-packages/vllm/model_executor/layers/quantization/awq_marlin.py", line 262, in apply
    return apply_awq_marlin_linear(
  File "/usr/local/lib/python3.11/dist-packages/vllm/model_executor/layers/quantization/utils/marlin_utils.py", line 289, in apply_awq_marlin_linear
    output = ops.gptq_marlin_gemm(reshaped_x,
  File "/usr/local/lib/python3.11/dist-packages/vllm/_custom_ops.py", line 28, in wrapper
    return fn(*args, **kwargs)
  File "/usr/local/lib/python3.11/dist-packages/vllm/_custom_ops.py", line 317, in gptq_marlin_gemm
    return torch.ops._C.gptq_marlin_gemm(a, b_q_weight, b_scales, b_zeros,

Set TORCH_LOGS="+dynamo" and TORCHDYNAMO_VERBOSE=1 for more information


You can suppress this exception and fall back to eager by setting:
    import torch._dynamo
    torch._dynamo.config.suppress_errors = True

Possible solutions:
1. disable cuda graph by --disable-cuda-graph
2. set --mem-fraction-static to a smaller value (e.g., 0.8 or 0.7)
3. disable torch compile by not using --enable-torch-compile
Open an issue on GitHub https://github.com/sgl-project/sglang/issues/new/choose 

Traceback (most recent call last):
  File "<frozen runpy>", line 198, in _run_module_as_main
  File "<frozen runpy>", line 88, in _run_code
  File "/usr/local/lib/python3.11/dist-packages/sglang/launch_server.py", line 16, in <module>
    raise e
  File "/usr/local/lib/python3.11/dist-packages/sglang/launch_server.py", line 14, in <module>
    launch_server(server_args)
  File "/usr/local/lib/python3.11/dist-packages/sglang/srt/server.py", line 373, in launch_server
    raise RuntimeError(
RuntimeError: Initialization failed. controller_init_state: Traceback (most recent call last):
  File "/usr/local/lib/python3.11/dist-packages/sglang/srt/managers/controller_multi.py", line 195, in start_controller_process
    controller = ControllerMulti(server_args, port_args)
                 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/dist-packages/sglang/srt/managers/controller_multi.py", line 98, in __init__
    self.start_dp_worker(i)
  File "/usr/local/lib/python3.11/dist-packages/sglang/srt/managers/controller_multi.py", line 125, in start_dp_worker
    raise RuntimeError(
RuntimeError: Initialization failed. controller_init_state: Traceback (most recent call last):
  File "/usr/local/lib/python3.11/dist-packages/sglang/srt/model_executor/cuda_graph_runner.py", line 151, in __init__
    self.capture()
  File "/usr/local/lib/python3.11/dist-packages/sglang/srt/model_executor/cuda_graph_runner.py", line 180, in capture
    ) = self.capture_one_batch_size(bs, forward)
        ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/dist-packages/sglang/srt/model_executor/cuda_graph_runner.py", line 221, in capture_one_batch_size
    run_once()
  File "/usr/local/lib/python3.11/dist-packages/sglang/srt/model_executor/cuda_graph_runner.py", line 215, in run_once
    return forward(input_ids, input_metadata.positions, input_metadata)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/dist-packages/torch/_dynamo/eval_frame.py", line 433, in _fn
    return fn(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/dist-packages/torch/_dynamo/convert_frame.py", line 1116, in __call__
    return self._torchdynamo_orig_callable(
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/dist-packages/torch/_dynamo/convert_frame.py", line 948, in __call__
    result = self._inner_convert(
             ^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/dist-packages/torch/_dynamo/convert_frame.py", line 472, in __call__
    return _compile(
           ^^^^^^^^^
  File "/usr/local/lib/python3.11/dist-packages/torch/_utils_internal.py", line 84, in wrapper_function
    return StrobelightCompileTimeProfiler.profile_compile_time(
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/dist-packages/torch/_strobelight/compile_time_profiler.py", line 129, in profile_compile_time
    return func(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^
  File "/usr/lib/python3.11/contextlib.py", line 81, in inner
    return func(*args, **kwds)
           ^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/dist-packages/torch/_dynamo/convert_frame.py", line 817, in _compile
    guarded_code = compile_inner(code, one_graph, hooks, transform)
                   ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/dist-packages/torch/_dynamo/utils.py", line 231, in time_wrapper
    r = func(*args, **kwargs)
        ^^^^^^^^^^^^^^^^^^^^^
...
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/dist-packages/torch/_dynamo/symbolic_convert.py", line 2666, in inline_call
    return cls.inline_call_(parent, func, args, kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/dist-packages/torch/_dynamo/symbolic_convert.py", line 2782, in inline_call_
    tracer.run()
  File "/usr/local/lib/python3.11/dist-packages/torch/_dynamo/symbolic_convert.py", line 893, in run
    while self.step():
          ^^^^^^^^^^^
  File "/usr/local/lib/python3.11/dist-packages/torch/_dynamo/symbolic_convert.py", line 805, in step
    self.dispatch_table[inst.opcode](self, inst)
  File "/usr/local/lib/python3.11/dist-packages/torch/_dynamo/symbolic_convert.py", line 499, in wrapper
    return inner_fn(self, inst)
           ^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/dist-packages/torch/_dynamo/symbolic_convert.py", line 1500, in CALL_FUNCTION_EX
    self.call_function(fn, argsvars.items, kwargsvars)
  File "/usr/local/lib/python3.11/dist-packages/torch/_dynamo/symbolic_convert.py", line 743, in call_function
    self.push(fn.call_function(self, args, kwargs))
              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/dist-packages/torch/_dynamo/variables/functions.py", line 344, in call_function
    return super().call_function(tx, args, kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/dist-packages/torch/_dynamo/variables/functions.py", line 293, in call_function
    return super().call_function(tx, args, kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/dist-packages/torch/_dynamo/variables/functions.py", line 90, in call_function
    return tx.inline_user_function_return(self, [*self.self_args(), *args], kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/dist-packages/torch/_dynamo/symbolic_convert.py", line 749, in inline_user_function_return
    return InliningInstructionTranslator.inline_call(self, fn, args, kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/dist-packages/torch/_dynamo/symbolic_convert.py", line 2666, in inline_call
    return cls.inline_call_(parent, func, args, kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/dist-packages/torch/_dynamo/symbolic_convert.py", line 2782, in inline_call_
    tracer.run()
  File "/usr/local/lib/python3.11/dist-packages/torch/_dynamo/symbolic_convert.py", line 893, in run
    while self.step():
          ^^^^^^^^^^^
  File "/usr/local/lib/python3.11/dist-packages/torch/_dynamo/symbolic_convert.py", line 805, in step
    self.dispatch_table[inst.opcode](self, inst)
  File "/usr/local/lib/python3.11/dist-packages/torch/_dynamo/symbolic_convert.py", line 499, in wrapper
    return inner_fn(self, inst)
           ^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/dist-packages/torch/_dynamo/symbolic_convert.py", line 2059, in CALL
    self.call_function(fn, args, kwargs)
  File "/usr/local/lib/python3.11/dist-packages/torch/_dynamo/symbolic_convert.py", line 743, in call_function
    self.push(fn.call_function(self, args, kwargs))
              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/dist-packages/torch/_dynamo/variables/nn_module.py", line 437, in call_function
    return tx.inline_user_function_return(
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/dist-packages/torch/_dynamo/symbolic_convert.py", line 749, in inline_user_function_return
    return InliningInstructionTranslator.inline_call(self, fn, args, kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/dist-packages/torch/_dynamo/symbolic_convert.py", line 2666, in inline_call
    return cls.inline_call_(parent, func, args, kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/dist-packages/torch/_dynamo/symbolic_convert.py", line 2782, in inline_call_
    tracer.run()
  File "/usr/local/lib/python3.11/dist-packages/torch/_dynamo/symbolic_convert.py", line 893, in run
    while self.step():
          ^^^^^^^^^^^
  File "/usr/local/lib/python3.11/dist-packages/torch/_dynamo/symbolic_convert.py", line 805, in step
    self.dispatch_table[inst.opcode](self, inst)
  File "/usr/local/lib/python3.11/dist-packages/torch/_dynamo/symbolic_convert.py", line 499, in wrapper
    return inner_fn(self, inst)
           ^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/dist-packages/torch/_dynamo/symbolic_convert.py", line 1500, in CALL_FUNCTION_EX
    self.call_function(fn, argsvars.items, kwargsvars)
  File "/usr/local/lib/python3.11/dist-packages/torch/_dynamo/symbolic_convert.py", line 743, in call_function
    self.push(fn.call_function(self, args, kwargs))
              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/dist-packages/torch/_dynamo/variables/functions.py", line 344, in call_function
    return super().call_function(tx, args, kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/dist-packages/torch/_dynamo/variables/functions.py", line 293, in call_function
    return super().call_function(tx, args, kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/dist-packages/torch/_dynamo/variables/functions.py", line 90, in call_function
    return tx.inline_user_function_return(self, [*self.self_args(), *args], kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/dist-packages/torch/_dynamo/symbolic_convert.py", line 749, in inline_user_function_return
    return InliningInstructionTranslator.inline_call(self, fn, args, kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/dist-packages/torch/_dynamo/symbolic_convert.py", line 2666, in inline_call
    return cls.inline_call_(parent, func, args, kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/dist-packages/torch/_dynamo/symbolic_convert.py", line 2782, in inline_call_
    tracer.run()
  File "/usr/local/lib/python3.11/dist-packages/torch/_dynamo/symbolic_convert.py", line 893, in run
    while self.step():
          ^^^^^^^^^^^
  File "/usr/local/lib/python3.11/dist-packages/torch/_dynamo/symbolic_convert.py", line 805, in step
    self.dispatch_table[inst.opcode](self, inst)
  File "/usr/local/lib/python3.11/dist-packages/torch/_dynamo/symbolic_convert.py", line 499, in wrapper
    return inner_fn(self, inst)
           ^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/dist-packages/torch/_dynamo/symbolic_convert.py", line 2059, in CALL
    self.call_function(fn, args, kwargs)
  File "/usr/local/lib/python3.11/dist-packages/torch/_dynamo/symbolic_convert.py", line 743, in call_function
    self.push(fn.call_function(self, args, kwargs))
              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/dist-packages/torch/_dynamo/variables/functions.py", line 344, in call_function
    return super().call_function(tx, args, kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/dist-packages/torch/_dynamo/variables/functions.py", line 293, in call_function
    return super().call_function(tx, args, kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/dist-packages/torch/_dynamo/variables/functions.py", line 90, in call_function
    return tx.inline_user_function_return(self, [*self.self_args(), *args], kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/dist-packages/torch/_dynamo/symbolic_convert.py", line 749, in inline_user_function_return
    return InliningInstructionTranslator.inline_call(self, fn, args, kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/dist-packages/torch/_dynamo/symbolic_convert.py", line 2666, in inline_call
    return cls.inline_call_(parent, func, args, kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/dist-packages/torch/_dynamo/symbolic_convert.py", line 2782, in inline_call_
    tracer.run()
  File "/usr/local/lib/python3.11/dist-packages/torch/_dynamo/symbolic_convert.py", line 893, in run
    while self.step():
          ^^^^^^^^^^^
  File "/usr/local/lib/python3.11/dist-packages/torch/_dynamo/symbolic_convert.py", line 805, in step
    self.dispatch_table[inst.opcode](self, inst)
  File "/usr/local/lib/python3.11/dist-packages/torch/_dynamo/symbolic_convert.py", line 499, in wrapper
    return inner_fn(self, inst)
           ^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/dist-packages/torch/_dynamo/symbolic_convert.py", line 2059, in CALL
    self.call_function(fn, args, kwargs)
  File "/usr/local/lib/python3.11/dist-packages/torch/_dynamo/symbolic_convert.py", line 743, in call_function
    self.push(fn.call_function(self, args, kwargs))
              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/dist-packages/torch/_dynamo/variables/functions.py", line 293, in call_function
    return super().call_function(tx, args, kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/dist-packages/torch/_dynamo/variables/functions.py", line 90, in call_function
    return tx.inline_user_function_return(self, [*self.self_args(), *args], kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/dist-packages/torch/_dynamo/symbolic_convert.py", line 749, in inline_user_function_return
    return InliningInstructionTranslator.inline_call(self, fn, args, kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/dist-packages/torch/_dynamo/symbolic_convert.py", line 2666, in inline_call
    return cls.inline_call_(parent, func, args, kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/dist-packages/torch/_dynamo/symbolic_convert.py", line 2782, in inline_call_
    tracer.run()
  File "/usr/local/lib/python3.11/dist-packages/torch/_dynamo/symbolic_convert.py", line 893, in run
    while self.step():
          ^^^^^^^^^^^
  File "/usr/local/lib/python3.11/dist-packages/torch/_dynamo/symbolic_convert.py", line 805, in step
    self.dispatch_table[inst.opcode](self, inst)
  File "/usr/local/lib/python3.11/dist-packages/torch/_dynamo/symbolic_convert.py", line 499, in wrapper
    return inner_fn(self, inst)
           ^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/dist-packages/torch/_dynamo/symbolic_convert.py", line 2059, in CALL
    self.call_function(fn, args, kwargs)
  File "/usr/local/lib/python3.11/dist-packages/torch/_dynamo/symbolic_convert.py", line 743, in call_function
    self.push(fn.call_function(self, args, kwargs))
              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/dist-packages/torch/_dynamo/variables/functions.py", line 293, in call_function
    return super().call_function(tx, args, kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/dist-packages/torch/_dynamo/variables/functions.py", line 90, in call_function
    return tx.inline_user_function_return(self, [*self.self_args(), *args], kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/dist-packages/torch/_dynamo/symbolic_convert.py", line 749, in inline_user_function_return
    return InliningInstructionTranslator.inline_call(self, fn, args, kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/dist-packages/torch/_dynamo/symbolic_convert.py", line 2666, in inline_call
    return cls.inline_call_(parent, func, args, kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/dist-packages/torch/_dynamo/symbolic_convert.py", line 2782, in inline_call_
    tracer.run()
  File "/usr/local/lib/python3.11/dist-packages/torch/_dynamo/symbolic_convert.py", line 893, in run
    while self.step():
          ^^^^^^^^^^^
  File "/usr/local/lib/python3.11/dist-packages/torch/_dynamo/symbolic_convert.py", line 805, in step
    self.dispatch_table[inst.opcode](self, inst)
  File "/usr/local/lib/python3.11/dist-packages/torch/_dynamo/symbolic_convert.py", line 499, in wrapper
    return inner_fn(self, inst)
           ^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/dist-packages/torch/_dynamo/symbolic_convert.py", line 1500, in CALL_FUNCTION_EX
    self.call_function(fn, argsvars.items, kwargsvars)
  File "/usr/local/lib/python3.11/dist-packages/torch/_dynamo/symbolic_convert.py", line 743, in call_function
    self.push(fn.call_function(self, args, kwargs))
              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/dist-packages/torch/_dynamo/variables/functions.py", line 293, in call_function
    return super().call_function(tx, args, kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/dist-packages/torch/_dynamo/variables/functions.py", line 90, in call_function
    return tx.inline_user_function_return(self, [*self.self_args(), *args], kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/dist-packages/torch/_dynamo/symbolic_convert.py", line 749, in inline_user_function_return
    return InliningInstructionTranslator.inline_call(self, fn, args, kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/dist-packages/torch/_dynamo/symbolic_convert.py", line 2666, in inline_call
    return cls.inline_call_(parent, func, args, kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/dist-packages/torch/_dynamo/symbolic_convert.py", line 2782, in inline_call_
    tracer.run()
  File "/usr/local/lib/python3.11/dist-packages/torch/_dynamo/symbolic_convert.py", line 893, in run
    while self.step():
          ^^^^^^^^^^^
  File "/usr/local/lib/python3.11/dist-packages/torch/_dynamo/symbolic_convert.py", line 805, in step
    self.dispatch_table[inst.opcode](self, inst)
  File "/usr/local/lib/python3.11/dist-packages/torch/_dynamo/symbolic_convert.py", line 499, in wrapper
    return inner_fn(self, inst)
           ^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/dist-packages/torch/_dynamo/symbolic_convert.py", line 2059, in CALL
    self.call_function(fn, args, kwargs)
  File "/usr/local/lib/python3.11/dist-packages/torch/_dynamo/symbolic_convert.py", line 743, in call_function
    self.push(fn.call_function(self, args, kwargs))
              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/dist-packages/torch/_dynamo/variables/torch.py", line 757, in call_function
    tensor_variable = wrap_fx_proxy(
                      ^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/dist-packages/torch/_dynamo/variables/builder.py", line 1713, in wrap_fx_proxy
    return wrap_fx_proxy_cls(target_cls=TensorVariable, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/dist-packages/torch/_dynamo/variables/builder.py", line 1798, in wrap_fx_proxy_cls
    example_value = get_fake_value(proxy.node, tx, allow_non_graph_fake=True)
                    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/dist-packages/torch/_dynamo/utils.py", line 1853, in get_fake_value
    raise TorchRuntimeError(str(e)).with_traceback(e.__traceback__) from None
  File "/usr/local/lib/python3.11/dist-packages/torch/_dynamo/utils.py", line 1785, in get_fake_value
    ret_val = wrap_fake_exception(
              ^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/dist-packages/torch/_dynamo/utils.py", line 1300, in wrap_fake_exception
    return fn()
           ^^^^
  File "/usr/local/lib/python3.11/dist-packages/torch/_dynamo/utils.py", line 1786, in <lambda>
    lambda: run_node(tx.output, node, args, kwargs, nnmodule)
            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/dist-packages/torch/_dynamo/utils.py", line 1921, in run_node
    raise RuntimeError(make_error_message(e)).with_traceback(
  File "/usr/local/lib/python3.11/dist-packages/torch/_dynamo/utils.py", line 1903, in run_node
    return node.target(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/dist-packages/torch/_ops.py", line 1060, in __call__
    return _call_overload_packet_from_python(self_, args, kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/dist-packages/torch/_ops.py", line 1098, in _call_overload_packet_from_python
    return found_op(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/dist-packages/torch/_ops.py", line 900, in __call__
    return self_._dispatch_in_python(
           ^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/dist-packages/torch/_ops.py", line 940, in _dispatch_in_python
    return handler(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/dist-packages/torch/_ops.py", line 746, in handler
    return torch._library.utils.handle_dispatch_mode(
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/dist-packages/torch/_library/utils.py", line 244, in handle_dispatch_mode
    return curr_mode.__torch_dispatch__(op_overload, overload_types, args, kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/dist-packages/torch/utils/_stats.py", line 21, in wrapper
    return fn(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/dist-packages/torch/_subclasses/fake_tensor.py", line 1061, in __torch_dispatch__
    return self.dispatch(func, types, args, kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/dist-packages/torch/_subclasses/fake_tensor.py", line 1450, in dispatch
    return self._cached_dispatch_impl(func, types, args, kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/dist-packages/torch/_subclasses/fake_tensor.py", line 1153, in _cached_dispatch_impl
    output = self._dispatch_impl(func, types, args, kwargs)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/dist-packages/torch/_subclasses/fake_tensor.py", line 1694, in _dispatch_impl
    r = func.decompose(*args, **kwargs)
        ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/dist-packages/torch/_ops.py", line 704, in decompose
    return self._op_dk(dk, *args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
torch._dynamo.exc.TorchRuntimeError: Failed running call_function _C.gptq_marlin_gemm(*(FakeTensor(..., device='cuda:0', size=(1, 896), dtype=torch.float16), Parameter(FakeTensor(..., device='cuda:0', size=(56, 2304), dtype=torch.int32)), Parameter(FakeTensor(..., device='cuda:0', size=(7, 1152), dtype=torch.float16)), Parameter(FakeTensor(..., device='cuda:0', size=(7, 144), dtype=torch.int32)), Parameter(FakeTensor(..., device='cuda:0', size=(0,), dtype=torch.int32)), Parameter(FakeTensor(..., device='cuda:0', size=(0,), dtype=torch.int32)), FakeTensor(..., device='cuda:0', size=(288,), dtype=torch.int32), <torch._library.fake_class_registry.FakeScriptObject object at 0x7f8e2d214590>, 1, 1152, 896, True, True, True), **{}):
_C::gptq_marlin_gemm() Expected a value of type '__torch__.torch.classes._core_C.ScalarType (of Python compilation unit at: 0)' for argument '_7' but instead found type 'FakeScriptObject'.
Position: 7
Value: <torch._library.fake_class_registry.FakeScriptObject object at 0x7f8e2d214590>
Declaration: _C::gptq_marlin_gemm(Tensor _0, Tensor _1, Tensor _2, Tensor _3, Tensor _4, Tensor _5, Tensor _6, __torch__.torch.classes._core_C.ScalarType _7, int _8, int _9, int _10, bool _11, bool _12, bool _13) -> Tensor _0
Cast error details: Tried to cast object to type __torch__.torch.classes._core_C.ScalarType (of Python compilation unit at: 0) but object was missing attribute capsule

from user code:
   File "/usr/local/lib/python3.11/dist-packages/torch/_dynamo/external_utils.py", line 38, in inner
    return fn(*args, **kwargs)
  File "/usr/local/lib/python3.11/dist-packages/torch/utils/_contextlib.py", line 116, in decorate_context
    return func(*args, **kwargs)
  File "/usr/local/lib/python3.11/dist-packages/torch/utils/_contextlib.py", line 116, in decorate_context
    return func(*args, **kwargs)
  File "/usr/local/lib/python3.11/dist-packages/sglang/srt/models/qwen2.py", line 290, in forward
    hidden_states = self.model(input_ids, positions, input_metadata, input_embeds)
  File "/usr/local/lib/python3.11/dist-packages/torch/nn/modules/module.py", line 1562, in _call_impl
    return forward_call(*args, **kwargs)
  File "/usr/local/lib/python3.11/dist-packages/sglang/srt/models/qwen2.py", line 256, in forward
    hidden_states, residual = layer(
  File "/usr/local/lib/python3.11/dist-packages/torch/nn/modules/module.py", line 1562, in _call_impl
    return forward_call(*args, **kwargs)
  File "/usr/local/lib/python3.11/dist-packages/sglang/srt/models/qwen2.py", line 208, in forward
    hidden_states = self.self_attn(
  File "/usr/local/lib/python3.11/dist-packages/torch/nn/modules/module.py", line 1562, in _call_impl
    return forward_call(*args, **kwargs)
  File "/usr/local/lib/python3.11/dist-packages/sglang/srt/models/qwen2.py", line 154, in forward
    qkv, _ = self.qkv_proj(hidden_states)
  File "/usr/local/lib/python3.11/dist-packages/torch/nn/modules/module.py", line 1562, in _call_impl
    return forward_call(*args, **kwargs)
  File "/usr/local/lib/python3.11/dist-packages/sglang/srt/layers/linear.py", line 375, in forward
    output_parallel = self.quant_method.apply(self, input_, bias)
  File "/usr/local/lib/python3.11/dist-packages/vllm/model_executor/layers/quantization/awq_marlin.py", line 262, in apply
    return apply_awq_marlin_linear(
  File "/usr/local/lib/python3.11/dist-packages/vllm/model_executor/layers/quantization/utils/marlin_utils.py", line 289, in apply_awq_marlin_linear
    output = ops.gptq_marlin_gemm(reshaped_x,
  File "/usr/local/lib/python3.11/dist-packages/vllm/_custom_ops.py", line 28, in wrapper
    return fn(*args, **kwargs)
  File "/usr/local/lib/python3.11/dist-packages/vllm/_custom_ops.py", line 317, in gptq_marlin_gemm
    return torch.ops._C.gptq_marlin_gemm(a, b_q_weight, b_scales, b_zeros,

Set TORCH_LOGS="+dynamo" and TORCHDYNAMO_VERBOSE=1 for more information


You can suppress this exception and fall back to eager by setting:
    import torch._dynamo
    torch._dynamo.config.suppress_errors = True


During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/usr/local/lib/python3.11/dist-packages/sglang/srt/managers/controller_single.py", line 145, in start_controller_process
    controller = ControllerSingle(
                 ^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/dist-packages/sglang/srt/managers/controller_single.py", line 81, in __init__
    self.tp_server = ModelTpServer(
                     ^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/dist-packages/sglang/srt/managers/tp_worker.py", line 100, in __init__
    self.model_runner = ModelRunner(
                        ^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/dist-packages/sglang/srt/model_executor/model_runner.py", line 128, in __init__
    self.init_cuda_graphs()
  File "/usr/local/lib/python3.11/dist-packages/sglang/srt/model_executor/model_runner.py", line 468, in init_cuda_graphs
    self.cuda_graph_runner = CudaGraphRunner(self)
                             ^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/dist-packages/sglang/srt/model_executor/cuda_graph_runner.py", line 153, in __init__
    raise Exception(
Exception: Capture cuda graph failed: Failed running call_function _C.gptq_marlin_gemm(*(FakeTensor(..., device='cuda:0', size=(1, 896), dtype=torch.float16), Parameter(FakeTensor(..., device='cuda:0', size=(56, 2304), dtype=torch.int32)), Parameter(FakeTensor(..., device='cuda:0', size=(7, 1152), dtype=torch.float16)), Parameter(FakeTensor(..., device='cuda:0', size=(7, 144), dtype=torch.int32)), Parameter(FakeTensor(..., device='cuda:0', size=(0,), dtype=torch.int32)), Parameter(FakeTensor(..., device='cuda:0', size=(0,), dtype=torch.int32)), FakeTensor(..., device='cuda:0', size=(288,), dtype=torch.int32), <torch._library.fake_class_registry.FakeScriptObject object at 0x7f8e2d214590>, 1, 1152, 896, True, True, True), **{}):
_C::gptq_marlin_gemm() Expected a value of type '__torch__.torch.classes._core_C.ScalarType (of Python compilation unit at: 0)' for argument '_7' but instead found type 'FakeScriptObject'.
Position: 7
Value: <torch._library.fake_class_registry.FakeScriptObject object at 0x7f8e2d214590>
Declaration: _C::gptq_marlin_gemm(Tensor _0, Tensor _1, Tensor _2, Tensor _3, Tensor _4, Tensor _5, Tensor _6, __torch__.torch.classes._core_C.ScalarType _7, int _8, int _9, int _10, bool _11, bool _12, bool _13) -> Tensor _0
Cast error details: Tried to cast object to type __torch__.torch.classes._core_C.ScalarType (of Python compilation unit at: 0) but object was missing attribute capsule

from user code:
   File "/usr/local/lib/python3.11/dist-packages/torch/_dynamo/external_utils.py", line 38, in inner
    return fn(*args, **kwargs)
  File "/usr/local/lib/python3.11/dist-packages/torch/utils/_contextlib.py", line 116, in decorate_context
    return func(*args, **kwargs)
  File "/usr/local/lib/python3.11/dist-packages/torch/utils/_contextlib.py", line 116, in decorate_context
    return func(*args, **kwargs)
  File "/usr/local/lib/python3.11/dist-packages/sglang/srt/models/qwen2.py", line 290, in forward
    hidden_states = self.model(input_ids, positions, input_metadata, input_embeds)
  File "/usr/local/lib/python3.11/dist-packages/torch/nn/modules/module.py", line 1562, in _call_impl
    return forward_call(*args, **kwargs)
  File "/usr/local/lib/python3.11/dist-packages/sglang/srt/models/qwen2.py", line 256, in forward
    hidden_states, residual = layer(
  File "/usr/local/lib/python3.11/dist-packages/torch/nn/modules/module.py", line 1562, in _call_impl
    return forward_call(*args, **kwargs)
  File "/usr/local/lib/python3.11/dist-packages/sglang/srt/models/qwen2.py", line 208, in forward
    hidden_states = self.self_attn(
  File "/usr/local/lib/python3.11/dist-packages/torch/nn/modules/module.py", line 1562, in _call_impl
    return forward_call(*args, **kwargs)
  File "/usr/local/lib/python3.11/dist-packages/sglang/srt/models/qwen2.py", line 154, in forward
    qkv, _ = self.qkv_proj(hidden_states)
  File "/usr/local/lib/python3.11/dist-packages/torch/nn/modules/module.py", line 1562, in _call_impl
    return forward_call(*args, **kwargs)
  File "/usr/local/lib/python3.11/dist-packages/sglang/srt/layers/linear.py", line 375, in forward
    output_parallel = self.quant_method.apply(self, input_, bias)
  File "/usr/local/lib/python3.11/dist-packages/vllm/model_executor/layers/quantization/awq_marlin.py", line 262, in apply
    return apply_awq_marlin_linear(
  File "/usr/local/lib/python3.11/dist-packages/vllm/model_executor/layers/quantization/utils/marlin_utils.py", line 289, in apply_awq_marlin_linear
    output = ops.gptq_marlin_gemm(reshaped_x,
  File "/usr/local/lib/python3.11/dist-packages/vllm/_custom_ops.py", line 28, in wrapper
    return fn(*args, **kwargs)
  File "/usr/local/lib/python3.11/dist-packages/vllm/_custom_ops.py", line 317, in gptq_marlin_gemm
    return torch.ops._C.gptq_marlin_gemm(a, b_q_weight, b_scales, b_zeros,

Set TORCH_LOGS="+dynamo" and TORCHDYNAMO_VERBOSE=1 for more information


You can suppress this exception and fall back to eager by setting:
    import torch._dynamo
    torch._dynamo.config.suppress_errors = True

Possible solutions:
1. disable cuda graph by --disable-cuda-graph
2. set --mem-fraction-static to a smaller value (e.g., 0.8 or 0.7)
3. disable torch compile by not using --enable-torch-compile
Open an issue on GitHub https://github.com/sgl-project/sglang/issues/new/choose 


, detoken_init_state: init ok

2

root@c670148f30c4:~# python -m sglang.launch_server --host 0.0.0.0 --port 30000 --model-path Qwen/Qwen2.5-0.5B-Instruct-AWQ --dp 8 --enable-p2p-check --mem-fraction-static 0.01 --enable-torch-compile
[16:53:16] server_args=ServerArgs(model_path='Qwen/Qwen2.5-0.5B-Instruct-AWQ', tokenizer_path='Qwen/Qwen2.5-0.5B-Instruct-AWQ', tokenizer_mode='auto', skip_tokenizer_init=False, load_format='auto', dtype='auto', kv_cache_dtype='auto', trust_remote_code=False, context_length=None, quantization=None, served_model_name='Qwen/Qwen2.5-0.5B-Instruct-AWQ', chat_template=None, is_embedding=False, host='0.0.0.0', port=30000, additional_ports=[30001, 30002, 30003, 30004, 30005, 30006, 30007, 30008, 30009, 30010, 30011], mem_fraction_static=0.01, max_running_requests=None, max_total_tokens=None, chunked_prefill_size=8192, max_prefill_tokens=16384, schedule_policy='lpm', schedule_conservativeness=1.0, tp_size=1, stream_interval=1, random_seed=567523481, constrained_json_whitespace_pattern=None, log_level='info', log_level_http=None, log_requests=False, show_time_cost=False, api_key=None, file_storage_pth='SGLang_storage', dp_size=8, load_balance_method='round_robin', nccl_init_addr=None, nnodes=1, node_rank=None, json_model_override_args='{}', attention_backend='flashinfer', sampling_backend='flashinfer', disable_flashinfer=False, disable_flashinfer_sampling=False, disable_radix_cache=False, disable_regex_jump_forward=False, disable_cuda_graph=False, disable_cuda_graph_padding=False, disable_disk_cache=False, disable_custom_all_reduce=False, disable_mla=False, enable_mixed_chunk=False, enable_torch_compile=True, max_torch_compile_bs=32, torchao_config='', enable_p2p_check=True, triton_attention_reduce_in_fp32=False, lora_paths=None, max_loras_per_batch=8)
[16:53:18 DP0 TP0] Init nccl begin.
[16:53:18 DP0 TP0] Load weight begin. avail mem=44.09 GB
INFO 09-26 16:53:18 awq_marlin.py:89] The model is convertible to awq_marlin during runtime. Using awq_marlin kernel.
[16:53:19 DP0 TP0] lm_eval is not installed, GPTQ may not be usable
INFO 09-26 16:53:20 weight_utils.py:236] Using model weights format ['*.safetensors']
INFO 09-26 16:53:20 weight_utils.py:280] No model.safetensors.index.json found in remote.
Loading safetensors checkpoint shards:   0% Completed | 0/1 [00:00<?, ?it/s]
Loading safetensors checkpoint shards: 100% Completed | 1/1 [00:00<00:00,  4.24it/s]
Loading safetensors checkpoint shards: 100% Completed | 1/1 [00:00<00:00,  4.23it/s]

[16:53:20 DP0 TP0] Load weight end. type=Qwen2ForCausalLM, dtype=torch.float16, avail mem=43.33 GB
Process Process-1:1:
Traceback (most recent call last):
  File "/usr/lib/python3.11/multiprocessing/process.py", line 314, in _bootstrap
    self.run()
  File "/usr/lib/python3.11/multiprocessing/process.py", line 108, in run
    self._target(*self._args, **self._kwargs)
  File "/usr/local/lib/python3.11/dist-packages/sglang/srt/managers/controller_single.py", line 145, in start_controller_process
    controller = ControllerSingle(
                 ^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/dist-packages/sglang/srt/managers/controller_single.py", line 81, in __init__
    self.tp_server = ModelTpServer(
                     ^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/dist-packages/sglang/srt/managers/tp_worker.py", line 100, in __init__
    self.model_runner = ModelRunner(
                        ^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/dist-packages/sglang/srt/model_executor/model_runner.py", line 121, in __init__
    self.init_memory_pool(
  File "/usr/local/lib/python3.11/dist-packages/sglang/srt/model_executor/model_runner.py", line 387, in init_memory_pool
    raise RuntimeError(
RuntimeError: Not enough memory. Please try to increase --mem-fraction-static.
Traceback (most recent call last):
  File "<frozen runpy>", line 198, in _run_module_as_main
  File "<frozen runpy>", line 88, in _run_code
  File "/usr/local/lib/python3.11/dist-packages/sglang/launch_server.py", line 16, in <module>
    raise e
  File "/usr/local/lib/python3.11/dist-packages/sglang/launch_server.py", line 14, in <module>
    launch_server(server_args)
  File "/usr/local/lib/python3.11/dist-packages/sglang/srt/server.py", line 373, in launch_server
    raise RuntimeError(
RuntimeError: Initialization failed. controller_init_state: Traceback (most recent call last):
  File "/usr/local/lib/python3.11/dist-packages/sglang/srt/managers/controller_multi.py", line 195, in start_controller_process
    controller = ControllerMulti(server_args, port_args)
                 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/dist-packages/sglang/srt/managers/controller_multi.py", line 98, in __init__
    self.start_dp_worker(i)
  File "/usr/local/lib/python3.11/dist-packages/sglang/srt/managers/controller_multi.py", line 125, in start_dp_worker
    raise RuntimeError(
RuntimeError: Initialization failed. controller_init_state: Traceback (most recent call last):
  File "/usr/local/lib/python3.11/dist-packages/sglang/srt/managers/controller_single.py", line 145, in start_controller_process
    controller = ControllerSingle(
                 ^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/dist-packages/sglang/srt/managers/controller_single.py", line 81, in __init__
    self.tp_server = ModelTpServer(
                     ^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/dist-packages/sglang/srt/managers/tp_worker.py", line 100, in __init__
    self.model_runner = ModelRunner(
                        ^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/dist-packages/sglang/srt/model_executor/model_runner.py", line 121, in __init__
    self.init_memory_pool(
  File "/usr/local/lib/python3.11/dist-packages/sglang/srt/model_executor/model_runner.py", line 387, in init_memory_pool
    raise RuntimeError(
RuntimeError: Not enough memory. Please try to increase --mem-fraction-static.

, detoken_init_state: init ok

Environment

host: runpod.io
gpu: 8*A40
OS image: RunPod Pytorch 2.4.0 runpod/pytorch:2.4.0-py3.11-cuda12.4.1-devel-ubuntu22.04

root@c670148f30c4:~# python -c "import sglang; print(sglang.__version__)"
0.3.2

root@c670148f30c4:~# nvidia-smi
Thu Sep 26 16:56:56 2024       
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.161.08             Driver Version: 535.161.08   CUDA Version: 12.4     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|=========================================+======================+======================|
|   0  NVIDIA A40                     On  | 00000000:4F:00.0 Off |                    0 |
|  0%   27C    P8              21W / 300W |      0MiB / 46068MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
|   1  NVIDIA A40                     On  | 00000000:52:00.0 Off |                    0 |
|  0%   29C    P8              21W / 300W |      0MiB / 46068MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
|   2  NVIDIA A40                     On  | 00000000:53:00.0 Off |                    0 |
|  0%   30C    P8              28W / 300W |      0MiB / 46068MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
|   3  NVIDIA A40                     On  | 00000000:56:00.0 Off |                    0 |
|  0%   30C    P8              32W / 300W |      0MiB / 46068MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
|   4  NVIDIA A40                     On  | 00000000:57:00.0 Off |                    0 |
|  0%   29C    P8              22W / 300W |      0MiB / 46068MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
|   5  NVIDIA A40                     On  | 00000000:CE:00.0 Off |                    0 |
|  0%   29C    P8              21W / 300W |      0MiB / 46068MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
|   6  NVIDIA A40                     On  | 00000000:D1:00.0 Off |                    0 |
|  0%   31C    P8              22W / 300W |      0MiB / 46068MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
|   7  NVIDIA A40                     On  | 00000000:D5:00.0 Off |                    0 |
|  0%   29C    P8              21W / 300W |      0MiB / 46068MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
                                                                                         
+---------------------------------------------------------------------------------------+
| Processes:                                                                            |
|  GPU   GI   CI        PID   Type   Process name                            GPU Memory |
|        ID   ID                                                             Usage      |
|=======================================================================================|
|  No running processes found                                                           |
+---------------------------------------------------------------------------------------+
@zeng-zc
Contributor

zeng-zc commented Sep 27, 2024

Isn't --mem-fraction-static 0.05 too small? It sets the static GPU memory size used for the cache.

@smallstepman
Author

smallstepman commented Sep 28, 2024

I tried a range of values, anything from 0.9 down to 0.01.

Keep in mind that the 0.5B AWQ model is about 700 MB in size, which is around 1.5% of the memory available on an A40.
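
The estimate above can be checked with a few lines of arithmetic. This is only a rough sketch: the constants are copied from the nvidia-smi output and the "Load weight begin/end" log lines in this issue, and the helper name `percent` is made up for illustration.

```python
# Figures copied from this issue (nvidia-smi output and server logs).
MODEL_SIZE_MB = 700       # approximate checkpoint size stated in the comment
GPU_TOTAL_MIB = 46068     # per-GPU memory reported by nvidia-smi for one A40
AVAIL_BEFORE_GB = 44.09   # "Load weight begin. avail mem=44.09 GB"
AVAIL_AFTER_GB = 43.33    # "Load weight end. ... avail mem=43.33 GB"


def percent(part, whole):
    """Return part as a percentage of whole."""
    return 100.0 * part / whole


# MB and MiB differ by about 5%, which is ignored in this rough estimate.
weights_vs_total = percent(MODEL_SIZE_MB, GPU_TOTAL_MIB)

# Memory actually consumed while loading the weights, according to the logs.
weights_measured = percent(AVAIL_BEFORE_GB - AVAIL_AFTER_GB, AVAIL_BEFORE_GB)

print(f"stated size vs. total memory: {weights_vs_total:.1f}%")   # 1.5%
print(f"measured load vs. avail mem:  {weights_measured:.1f}%")   # 1.7%
```

Both numbers are below 2%, so the weights alone should not exhaust any of the larger fractions tried.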

@zeng-zc
Contributor

zeng-zc commented Sep 29, 2024

RuntimeError: Not enough memory. Please try to increase --mem-fraction-static.
 --mem-fraction-static MEM_FRACTION_STATIC
                        The fraction of the memory used for static allocation
                        (model weights and KV cache memory pool).

The KV cache is also included in --mem-fraction-static. I think the log gives a clear hint:

RuntimeError: Not enough memory. Please try to increase --mem-fraction-static.

@smallstepman
Author

smallstepman commented Sep 29, 2024

The reason I went to such low values, like 0.01, was simply to demonstrate the two extremes of the range:

  • one - 0.01 - is too low, raising RuntimeError: Not enough memory, while
  • the other - 0.05 or anything above - is too high, raising Exception: Capture cuda graph failed.

You could try any other value (0.02, 0.03, 0.035, 0.04, 0.2, 0.4, 0.8, etc.) and you'd still end up with one of these two errors.

This means there is no valid value of --mem-fraction-static that I can choose to make it work. Therefore, the error message is misleading, because the error relates to something other than the value of --mem-fraction-static.
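
For anyone who wants to repeat the sweep, here is a minimal sketch that only builds the command lines. It assumes the same flags as the two reproductions in this issue; the names `BASE_ARGS`, `FRACTIONS`, and `build_command` are made up for illustration, and the script prints the commands rather than launching anything.

```python
import shlex

# Flags copied from the reproduction commands in this issue.
BASE_ARGS = [
    "python", "-m", "sglang.launch_server",
    "--host", "0.0.0.0",
    "--port", "30000",
    "--model-path", "Qwen/Qwen2.5-0.5B-Instruct-AWQ",
    "--dp", "8",
    "--enable-p2p-check",
    "--enable-torch-compile",
]

# Values mentioned in this thread.
FRACTIONS = [0.01, 0.02, 0.03, 0.035, 0.04, 0.05, 0.2, 0.4, 0.8]


def build_command(mem_fraction):
    """Return the argv list for one value of --mem-fraction-static."""
    return BASE_ARGS + ["--mem-fraction-static", str(mem_fraction)]


commands = [build_command(fraction) for fraction in FRACTIONS]

for command in commands:
    # shlex.join quotes each argument so the line can be pasted into a shell.
    print(shlex.join(command))
```

Running each printed command one at a time and noting which of the two errors appears is enough to confirm the behaviour described above.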


I'm no expert in anything that's happening under the hood, but after taking a second look at the logs, the error is possibly related to the quantization used by the model (AWQ): _C::gptq_marlin_gemm() Expected a value of type '__torch__.torch.classes._core_C.ScalarType (of Python compilation unit at: 0)' for argument '_7' but instead found type 'FakeScriptObject'.


Btw, I had to delete a significant chunk of the error logs from error #1, because GitHub was complaining about the length of the message. The deleted portion was replaced with ...

@yileld
Contributor

yileld commented Sep 30, 2024

It seems that AWQ models can't use CUDA graph. I tried several weeks ago, and I turned off CUDA graph when using quantized models in my code.

@smallstepman
Author

I have no problem running python -m sglang.launch_server --host 0.0.0.0 --port 30000 --model-path Qwen/Qwen2.5-72B-Instruct-AWQ --tp 2 --dp 1 --enable-p2p-check --mem-fraction-static 0.8 (so CUDA graph is enabled), but once I add --enable-torch-compile, it errors out.

@merrymercy
Contributor

The reason is that torch.compile is not compatible with awq or gptq.
It is unrelated to data parallelism, cuda graph, or other things.
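
Until the incompatibility is resolved, a launcher script could skip --enable-torch-compile for AWQ and GPTQ checkpoints. The sketch below is an assumption-laden illustration, not part of sglang: it assumes the checkpoint's parsed config.json follows the Hugging Face convention of a `quantization_config` object with a `quant_method` field, and the helper names `quant_method` and `launch_flags` are hypothetical.

```python
# Quantization methods reported as incompatible with torch.compile in this issue.
INCOMPATIBLE_QUANT_METHODS = {"awq", "gptq"}


def quant_method(hf_config):
    """Return the lowercased quant method from a parsed config.json, or None.

    Assumes the Hugging Face layout: {"quantization_config": {"quant_method": ...}}.
    """
    quant_config = hf_config.get("quantization_config") or {}
    method = quant_config.get("quant_method")
    return method.lower() if isinstance(method, str) else None


def launch_flags(hf_config, want_torch_compile):
    """Return extra server flags, dropping torch.compile for AWQ/GPTQ models."""
    flags = []
    if want_torch_compile and quant_method(hf_config) not in INCOMPATIBLE_QUANT_METHODS:
        flags.append("--enable-torch-compile")
    return flags


awq_config = {"quantization_config": {"quant_method": "awq"}}
plain_config = {"model_type": "qwen2"}

print(launch_flags(awq_config, want_torch_compile=True))    # []
print(launch_flags(plain_config, want_torch_compile=True))  # ['--enable-torch-compile']
```

In practice the dictionary would come from `json.load` on the model's config.json; the guard itself only encodes the statement above that torch.compile does not work with awq or gptq.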

@merrymercy changed the title from "[Bug] unable to use all three combined: data-parallelism(dp), enable-torch-compile, cuda-graph (8*A40)" to "[Bug] Unable to use gptq or awq with torch.compile (8*A40)" on Oct 6, 2024
@merrymercy
Contributor

We will work with the torchao team (cc @jerryzh168) to make all of them compatible with each other soon.

@merrymercy
Contributor

Moved to #1991.
