[WIP] support Medusa #1231

Closed
wants to merge 2 commits into from
zhyncs
Collaborator

@zhyncs zhyncs commented Mar 3, 2024

Motivation

As titled, support Medusa

Modification

Finished

  1. Medusa weights conversion
  2. Medusa weights loading
  3. Porting the Medusa Heads code using LMDeploy components and utilities
  4. TP support: distribute the weights equally based on hidden_size

We used https://github.com/zhyncs/medusa-whl-centos7/releases/tag/2024.02.27, https://huggingface.co/FasterDecoding/medusa-vicuna-13b-v1.3, and https://huggingface.co/lmsys/vicuna-13b-v1.3 to verify the correctness of the ported code (fp16 and bf16).
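To illustrate item 4, here is a minimal sketch of splitting a weight evenly along its hidden_size axis across tensor-parallel ranks. The function name `shard_along_hidden` and the row-major list layout are hypothetical, and the PR does not state which axis of each Medusa head weight is split, so treat this as an assumption rather than the actual TurboMind code.

```python
def shard_along_hidden(weight, tp, rank):
    """Return the slice of `weight` owned by `rank` when rows are split evenly.

    `weight` is a list of rows whose length is hidden_size (an assumption
    about the layout; the PR does not say which axis is split).
    """
    hidden_size = len(weight)
    if hidden_size % tp != 0:
        raise ValueError("hidden_size must be divisible by the TP degree")
    if not 0 <= rank < tp:
        raise ValueError("rank must be in [0, tp)")
    rows_per_rank = hidden_size // tp
    start = rank * rows_per_rank
    return weight[start:start + rows_per_rank]


# Toy weight with hidden_size = 8 and 3 columns, split across 4 ranks.
toy_weight = [[row * 10 + col for col in range(3)] for row in range(8)]
shards = [shard_along_hidden(toy_weight, tp=4, rank=r) for r in range(4)]
```

Concatenating the shards in rank order reproduces the original weight, which is a convenient sanity check when comparing against a single-GPU reference.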

In progress (debugging)

  1. Porting generate_candidates and evaluate_posterior
  2. Integrating with LlamaBatch
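For readers unfamiliar with the verification step, the following is a rough pure-Python sketch of what a greedy evaluate_posterior does: for each candidate continuation, count how many guessed tokens match the base model's greedy prediction before the first mismatch, then keep the longest match. The function name `evaluate_posterior_greedy` and the list-based inputs are hypothetical simplifications; the reference Medusa code works on batched tensors, and this is not the code ported in this PR.

```python
def evaluate_posterior_greedy(candidates, greedy_tokens):
    """Pick the candidate with the longest greedily accepted prefix.

    candidates[i][0] is the already accepted root token and candidates[i][1:]
    are the guesses from the Medusa heads. greedy_tokens[i][j] is the base
    model's argmax token after consuming candidates[i][:j + 1].
    Returns (best_candidate_index, accepted_length).
    """
    best_index, best_length = 0, 0
    for index, (candidate, greedy) in enumerate(zip(candidates, greedy_tokens)):
        accepted = 0
        for guess, target in zip(candidate[1:], greedy):
            if guess != target:
                break  # everything after the first mismatch is rejected
            accepted += 1
        if accepted > best_length:
            best_index, best_length = index, accepted
    return best_index, best_length


candidates = [[5, 7, 9], [5, 7, 8], [5, 6, 1]]
greedy_tokens = [[7, 8, 2], [7, 8, 3], [7, 0, 0]]
best_index, accepted_length = evaluate_posterior_greedy(candidates, greedy_tokens)
```

In this toy input the second candidate matches the base model on both guessed tokens, so it is selected with an accepted length of 2.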

TODO

  1. Add docs
  2. Add tests
  3. Benchmark

Checklist

  1. Pre-commit or other linting tools are used to fix the potential lint issues.
  2. The modification is covered by complete unit tests. If not, please add more unit tests to ensure the correctness.
  3. If the modification has a dependency on downstream projects of a newer version, this PR should be tested with all supported versions of downstream projects.
  4. The documentation has been modified accordingly, like docstring or example tutorials.

@zhyncs
Collaborator Author

zhyncs commented Mar 13, 2024

Hi @lvhan028, could you help change the base branch to turbomind-2.1 for this PR? Thanks.

@zhyncs zhyncs changed the base branch from main to turbomind-2.1 March 13, 2024 09:56
@zhyncs zhyncs changed the base branch from turbomind-2.1 to main March 19, 2024 07:28
@zhyncs
Collaborator Author

zhyncs commented Mar 27, 2024

Hi all. We've tested the acceptance (acc) rate with LM Head temperature 0 and Medusa Heads top-1 in the internal version. On both custom prompts and MT-Bench the acc rate is relatively low: only 0-20%, and in the vast majority of cases 0. In this setting we did not achieve the desired benefit. We also verified LM Head temperature 0 with Medusa Heads top-k in the official version, using medusa choices (64). There the acc rate is between 20% and 40%, which is closer to the greedy results in the paper. Since verifying Medusa choices with a multi-batch approach would incur significant cost, we have decided to implement a Tree Mask version based on Flash Decoding instead. We will share a technical design, aligned with @lzhangzz, as soon as possible. Please stay tuned for updates.
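As background on the Tree Mask idea, here is a small sketch of how an attention mask can be derived from a set of Medusa choices, so that all candidate paths are verified in one forward pass instead of one batch entry per path. Each choice is a path of top-k indices, one per head, and a node may attend only to the root, its ancestors, and itself. The function name `build_tree_mask` is hypothetical, the sketch assumes every prefix of a path is itself listed in the choices, and it says nothing about how the Flash Decoding based kernel planned here will consume such a mask.

```python
def build_tree_mask(choices):
    """Build a 0/1 attention mask for a candidate tree.

    `choices` is a list of tuples; (0, 1) means "top-0 token of head 1,
    then top-1 token of head 2". Row i marks the nodes that node i may
    attend to. Index 0 is the root (the token from the LM head).
    """
    nodes = sorted(choices, key=lambda path: (len(path), path))
    index = {path: i + 1 for i, path in enumerate(nodes)}  # 0 is the root
    size = len(nodes) + 1
    mask = [[0] * size for _ in range(size)]
    for row in range(size):
        mask[row][0] = 1  # every node sees the root
    for path, row in index.items():
        # Mark each ancestor prefix and the node itself as visible.
        for depth in range(1, len(path) + 1):
            mask[row][index[path[:depth]]] = 1
    return mask


tree_mask = build_tree_mask([(0,), (1,), (0, 0), (0, 1)])
```

With these four choices the mask is 5x5: the two depth-2 nodes both see the root and the node (0,), but not each other and not the node (1,).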

@zhyncs
Collaborator Author

zhyncs commented Jul 7, 2024

I will split the internal implementation of the TreeMask version into multiple PRs and then submit them. The current outdated PR has been closed for now.

@bohr

bohr commented Jul 25, 2024

> I will split the internal implementation of the TreeMask version into multiple PRs and then submit them. The current outdated PR has been closed for now.

Do you have any performance data about medusa TreeMask?
