[WIP] support Medusa #1231
56393e1 to 61145b3
Hi @lvhan028, could you help change the base branch to turbomind-2.1 for this PR? Thanks.
Hi all. We've tested the acceptance rate with LM Head temperature 0 and Medusa Heads top-1 in the internal version. On both custom prompts and MT-Bench the acceptance rate is low: only 0-20%, and 0 in the vast majority of cases. With this setup we did not achieve the desired benefits.

At the same time, we verified LM Head temperature 0 and Medusa Heads top-k in the official version with Medusa choices (64). The acceptance rate is between 20% and 40%, which is closer to the greedy results in the paper.

Since verifying Medusa choices with a multi-batch approach would incur significant cost, we have decided to implement a Tree Mask version based on Flash Decoding. We will align with @lzhangzz and share the technical design as soon as possible. Please stay tuned for updates.
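For readers unfamiliar with the Tree Mask idea mentioned above, here is a minimal sketch of how an attention mask can be derived from a set of Medusa choices so that all candidate paths are verified in a single forward pass instead of one batch entry per path. It is not the implementation from this PR. It assumes each choice is a tuple of top-k indices, one per Medusa head, as in the official Medusa `medusa_choices` lists; the function name `build_tree_mask` is hypothetical.

```python
from typing import Dict, List, Sequence, Tuple

Path = Tuple[int, ...]


def build_tree_mask(choices: Sequence[Sequence[int]]) -> Tuple[List[Path], List[List[int]]]:
    """Build a tree attention mask from candidate paths.

    Hypothetical helper for illustration only.
    Returns (nodes, mask) where mask[i][j] == 1 iff node j is node i itself
    or one of its ancestors in the candidate tree.
    """
    # The empty path is the root: the token produced by the LM head.
    unique_paths = {()} | {tuple(c) for c in choices}
    # Order nodes by depth, then lexicographically, so parents precede children.
    nodes: List[Path] = sorted(unique_paths, key=lambda p: (len(p), p))
    index: Dict[Path, int] = {p: i for i, p in enumerate(nodes)}

    n = len(nodes)
    mask = [[0] * n for _ in range(n)]
    for i, path in enumerate(nodes):
        # A node may attend to every prefix of its own path (root included).
        for depth in range(len(path) + 1):
            prefix = path[:depth]
            if prefix not in index:
                raise ValueError(f"choice {path} is missing its ancestor {prefix}")
            mask[i][index[prefix]] = 1
    return nodes, mask


# Small example tree: two candidates from head 0, and two from head 1
# that both extend the top-1 candidate of head 0.
example_choices = [(0,), (1,), (0, 0), (0, 1)]
example_nodes, example_mask = build_tree_mask(example_choices)
```

In this example the node `(0, 1)` attends to the root, to `(0,)`, and to itself, but not to its sibling `(0, 0)` or to `(1,)`. In a real kernel this mask would be combined with the usual causal mask over the already-generated prefix.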
I will split the internal implementation of the TreeMask version into multiple PRs and then submit them. The current outdated PR has been closed for now.
Do you have any performance data for the Medusa TreeMask version?
Motivation
As titled, support Medusa
Modification
finished
We've used https://github.com/zhyncs/medusa-whl-centos7/releases/tag/2024.02.27, https://huggingface.co/FasterDecoding/medusa-vicuna-13b-v1.3, and https://huggingface.co/lmsys/vicuna-13b-v1.3 to verify the correctness of the ported code (fp16 and bf16).
during debugging
todo
Checklist