Refactor turbomind attention #1116
Conversation
Hi @lzhangzz, maybe you could merge the latest main branch to fix the Windows build error.
@irexyc may help verify the VL models.
Hi @lzhangzz, I used the lmsys/vicuna-13b-v1.3 model to compare performance with the latest main branch and found that the throughput improvement is not significant. The reproduction steps are below; please take a look. Thanks.

```shell
# convert
python3 -m lmdeploy convert llama /workdir/vicuna-13b-v1.3

# server
python3 -m lmdeploy serve api_server /workdir/workspace

# client
python3 benchmark/profile_restful_api.py --server_addr 127.0.0.1:23333 --tokenizer_path /workdir/vicuna-13b-v1.3 --dataset /workdir/ShareGPT_V3_unfiltered_cleaned_split.json --concurrency 128 --num_prompts 5000

# ShareGPT_V3_unfiltered_cleaned_split.json
# https://huggingface.co/datasets/anon8231489123/ShareGPT_Vicuna_unfiltered/resolve/main/ShareGPT_V3_unfiltered_cleaned_split.json
```
RPS: 7.795 / 7.571 - 1 ≈ 3%
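For reference, a minimal check of that figure, assuming 7.795 is this PR's requests-per-second and 7.571 is main's (the comment doesn't label which run is which):

```python
# RPS figures quoted above; which run is which is an assumption
# (7.795 taken as this PR, 7.571 as main).
rps_pr, rps_main = 7.795, 7.571
print(f"improvement: {rps_pr / rps_main - 1:.1%}")  # ~3.0%
```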
In addition to improving readability, what are the main benefits of refactoring attention? In which usage scenarios will the performance improvements be more significant?
This PR doesn't target performance improvements. Instead, it focuses on cleaning up the messy attention code to firm up the foundation for supporting more models.
When I discussed this with @lzhangzz earlier, I learned that there would be performance improvements, but I did not ask specifically about the scenarios and models; that is why I believed there would be a performance improvement. If the current test results meet your expectations, then there should be no doubt.
For 7B models with ShareGPT_V3, attention takes roughly 1/3 of the total GPU time (13B models should be similar), so I'd say a 3% RPS improvement is significant enough. The current dispatching strategy maximizes bandwidth utilization when there is enough data. There are faster configs for the data distribution of ShareGPT_V3, but they have some other limitations.
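To put the 3% figure in context, here is a back-of-the-envelope Amdahl's-law sketch of the attention-kernel speedup it would imply, assuming the rough 1/3 attention share quoted above (not a measured profile):

```python
# Amdahl's law: with attention a fraction f of total GPU time and an
# observed end-to-end speedup S, the implied attention speedup s satisfies
#   1/S = (1 - f) + f/s   =>   s = f / (1/S - (1 - f))
f = 1 / 3   # rough attention share of GPU time, per the comment above
S = 1.03    # ~3% end-to-end RPS improvement
s = f / (1 / S - (1 - f))
print(f"implied attention speedup: {s:.2f}x")  # ~1.10x
```

Under these assumptions, a 3% end-to-end gain corresponds to roughly a 10% speedup of the attention kernels themselves.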
Makes sense.
Hi @lzhangzz, please merge the latest changes from the main branch to address the documentation workflow issue.
TODO
- sm_75 / sm_70