[Bug]: Speculative decoding breaks guided decoding. #9423
Comments
Same issue here.
Hi, I am facing a similar issue. Is there a place where this is being tracked? If someone can guide me, I can try to fix it. Thanks.
Same issue here.
Used this a couple of days ago with the latest Docker image:
vLLM crashed and stopped completely upon sending a curl request with …
@cadedaniel I'd like to put a $100 bounty on this ticket if that is alright. I'll wire the money to an IBAN+BIC provided by the author who fixes the bug, once speculative and guided decoding work together and I'm no longer able to reproduce the crash above on the latest vLLM Docker image.
I would likewise add $100 to that bounty :)
I'm not sure if the differences in results are due to floating-point operations, but when I used the …
I'm looking into the MQA scorer, but I haven't found any incorrect operations so far.
@llsj14 I can confirm that with vLLM 0.6.6 and …
conversely, without …
To understand this option correctly, is the …
@roberthoenig @LiuXiaoxuanPKU
@llsj14 just a quick remark: floating-point differences alone can't explain the observed output, because guided decoding should give a hard guarantee of generating valid JSON output, yet the output above is not valid JSON.
I found that, before guided decoding is applied, the target model generates a different token while processing the n+1 tokens (n from the n-gram draft and 1 from the target model) during the scoring process. The verification operation may use different batch sizes or matrices, which could lead to different results. I initially suspected a bug, but I haven't found a specific cause so far. Let's investigate further; see the sketch after the code references below.
vllm/vllm/spec_decode/batch_expansion.py, lines 38–44 at 2079e43
vllm/vllm/spec_decode/mqa_scorer.py, lines 12–16 at 2079e43
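For intuition on the floating-point/batch-shape hypothesis above, here is a toy PyTorch sketch; it is not vLLM code, the tensor sizes are made up, and it only illustrates that scoring the same hidden states through different batch shapes can, on some backends, produce logits that are close but not bit-identical, which near a tie is enough to change the greedy token.

```python
import torch

torch.manual_seed(0)
hidden = torch.randn(8, 512)      # hidden states for 8 candidate tokens (toy sizes)
lm_head = torch.randn(512, 4096)  # projection to a toy vocabulary

# Score all candidates in one batched matmul ...
logits_batched = hidden @ lm_head

# ... versus scoring them one row at a time, loosely analogous to expanding
# each candidate into its own sequence before scoring.
logits_rowwise = torch.stack([h @ lm_head for h in hidden])

# Different batch shapes can take different kernel/reduction paths, so the
# results may be close but not necessarily bit-identical on every backend.
print(torch.allclose(logits_batched, logits_rowwise, atol=1e-4))
print((logits_batched - logits_rowwise).abs().max())

# A tiny discrepancy only matters when two logits are nearly tied, but then
# it can flip the greedy token that the verifier accepts.
print((logits_batched.argmax(-1) == logits_rowwise.argmax(-1)).all())
```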
They produce different logits even though both batch expansion and the MQA scorer generate the same hidden states, because they use different forms of sampling metadata.
vllm/vllm/worker/model_runner.py, lines 1762–1763 at 2079e43
This seems to be a bug related not to n-gram speculation itself, but rather to the computation of logits with the sampling metadata configured by the MQA scorer.
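The following is not vLLM's actual implementation, only a minimal sketch of why the form of the sampling metadata matters: if one scorer asks the model runner to project a different selection or ordering of hidden-state rows into logits than the other, guided decoding ends up constraining logits that no longer correspond to the tokens being verified. The index layouts below are hypothetical.

```python
import torch

torch.manual_seed(0)
hidden = torch.randn(5, 64)    # identical hidden states produced by both scorers (toy sizes)
lm_head = torch.randn(64, 100)

def compute_logits(hidden_states: torch.Tensor, selected_token_indices: torch.Tensor) -> torch.Tensor:
    # Stand-in for "sampling metadata": which rows get projected to logits.
    return hidden_states.index_select(0, selected_token_indices) @ lm_head

# Hypothetical metadata layouts for the same 5 candidate tokens.
batch_expansion_meta = torch.tensor([0, 1, 2, 3, 4])
mqa_meta = torch.tensor([4, 0, 1, 2, 3])

# Same hidden states, different metadata: the per-token logits no longer line up.
print(torch.equal(compute_logits(hidden, batch_expansion_meta),
                  compute_logits(hidden, mqa_meta)))  # False
```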
Your current environment
The output of `python collect_env.py`
Model Input Dumps
No response
🐛 Describe the bug
I run the vLLM server with speculative decoding as follows:
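The exact launch command is not shown above. As a rough, hypothetical stand-in, an n-gram speculative configuration expressed through vLLM's offline Python API might look like the sketch below; the model name and parameter values are placeholders, and the argument names assume a vLLM 0.6.x-era engine (other versions may require different or additional options).

```python
from vllm import LLM

# Hypothetical configuration, not the reporter's actual command: enable
# n-gram ("prompt lookup") speculative decoding via engine arguments.
llm = LLM(
    model="meta-llama/Llama-3.1-8B-Instruct",  # placeholder model
    speculative_model="[ngram]",               # n-gram draft instead of a separate draft model
    num_speculative_tokens=5,                  # placeholder value
    ngram_prompt_lookup_max=4,                 # placeholder value
)
```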
I then prompt the LLM with guided JSON created from the following Pydantic model:
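The original model is not reproduced above; a hypothetical Pydantic model of the kind used for guided JSON, and the schema derived from it, could look like this (field names are made up).

```python
from pydantic import BaseModel


class Person(BaseModel):
    # Hypothetical fields; the reporter's actual model is not shown above.
    name: str
    age: int
    hobbies: list[str]


# vLLM's guided decoding consumes a JSON schema, which Pydantic can generate.
guided_json_schema = Person.model_json_schema()
print(guided_json_schema)
```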
I incorporate the guided JSON into the following test prompt:
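The actual prompt is also not reproduced above. Against vLLM's OpenAI-compatible server, a request that passes the schema through the guided_json extra body parameter would look roughly like this; the endpoint URL, model name, and prompt text are placeholders.

```python
from openai import OpenAI
from pydantic import BaseModel


class Person(BaseModel):
    # Same hypothetical model as in the sketch above.
    name: str
    age: int
    hobbies: list[str]


client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")  # placeholder endpoint

completion = client.chat.completions.create(
    model="meta-llama/Llama-3.1-8B-Instruct",                              # placeholder model name
    messages=[{"role": "user", "content": "Describe a person as JSON."}],  # placeholder prompt
    extra_body={"guided_json": Person.model_json_schema()},                # vLLM guided decoding parameter
)
print(completion.choices[0].message.content)
```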
The JSON guidance should ensure that the output is a valid JSON object. However, vLLM returns the following incomplete JSON object:
When I disable n-gram speculative decoding, the same prompt works and returns a complete JSON object:
This means that, somehow, n-gram speculative decoding breaks the JSON guidance.