LMDeploy Release V0.3.0
Highlight
- Refactor attention and optimize GQA (#1258, #1307, #1116), achieving 22+ and 16+ RPS for internlm2-7b and internlm2-20b, about 1.8x faster than vLLM
- Support new models, including Qwen1.5-MOE (#1372), DBRX (#1367), DeepSeek-VL (#1335)
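For readers unfamiliar with GQA (grouped-query attention): several query heads share a single key/value head, which shrinks the KV cache. The sketch below illustrates that head mapping for one decoding step in plain Python. It is only a conceptual illustration, not the TurboMind kernel; the function names `kv_head_for` and `gqa_decode_step` are hypothetical, and the contiguous grouping of query heads is an assumption.

```python
import math


def kv_head_for(q_head, num_q_heads, num_kv_heads):
    """Return the KV head shared by a query head (assumes contiguous groups)."""
    if num_q_heads % num_kv_heads != 0:
        raise ValueError("num_q_heads must be a multiple of num_kv_heads")
    group_size = num_q_heads // num_kv_heads
    return q_head // group_size


def gqa_decode_step(q, k, v):
    """Attention for a single new token.

    q: [num_q_heads][head_dim]
    k, v: [num_kv_heads][seq_len][head_dim]
    Returns: [num_q_heads][head_dim]
    """
    num_q_heads, num_kv_heads = len(q), len(k)
    out = []
    for h, q_h in enumerate(q):
        kv = kv_head_for(h, num_q_heads, num_kv_heads)
        scale = 1.0 / math.sqrt(len(q_h))
        # Scaled dot product of this query head against every cached key.
        scores = [scale * sum(a * b for a, b in zip(q_h, k_t)) for k_t in k[kv]]
        # Numerically stable softmax over the sequence dimension.
        m = max(scores)
        exps = [math.exp(s - m) for s in scores]
        total = sum(exps)
        weights = [e / total for e in exps]
        # Weighted sum of the cached values of the shared KV head.
        out.append([
            sum(w * v_t[d] for w, v_t in zip(weights, v[kv]))
            for d in range(len(q_h))
        ])
    return out
```

With 8 query heads and 2 KV heads, query heads 0 to 3 read KV head 0 and query heads 4 to 7 read KV head 1, so only 2 heads' worth of keys and values need to be cached.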
What's Changed
🚀 Features
- Add tensor core GQA dispatch for [4,5,6,8] by @lzhangzz in #1258
- upgrade turbomind to v2.1 by @lzhangzz in #1307, #1116
- Support slora to pipeline by @AllentDan in #1286
- Support qwen for pytorch engine by @RunningLeon in #1265
- Support Triton inference server python backend by @ispobock in #1329
- torch engine support dbrx by @grimoire in #1367
- Support qwen2 moe for pytorch engine by @RunningLeon in #1372
- Add deepseek vl by @AllentDan in #1335
💥 Improvements
- rm unused var by @zhyncs in #1256
- Expose cache_block_seq_len to API by @ispobock in #1218
- add chat template for deepseek coder model by @lvhan028 in #1310
- Add more log info for api_server by @AllentDan in #1323
- remove cuda cache after loading vision model by @irexyc in #1325
- Add new chat cli with auto backend feature by @RunningLeon in #1276
- Update rewritings for qwen by @RunningLeon in #1351
- lazy import accelerate.init_empty_weights for vl async engine by @irexyc in #1359
- update lmdeploy pypi packages deps to cuda12 by @irexyc in #1368
- update max_prefill_token_num for low gpu memory by @grimoire in #1373
- Optimize pipeline of pytorch engine by @grimoire in #1328
🐞 Bug fixes
- fix different stop/bad words length in batch by @irexyc in #1246
- Fix performance issue of chatbot by @ispobock in #1295
- add missed argument by @irexyc in #1317
- Fix dlpack memory leak by @ispobock in #1344
- Fix invalid context for Internstudio platform by @lzhangzz in #1354
- fix benchmark generation by @grimoire in #1349
- fix window attention by @grimoire in #1341
- fix batchApplyRepetitionPenalty by @irexyc in #1358
- Fix memory leak of DLManagedTensor by @ispobock in #1361
- fix vlm inference hung with tp by @irexyc in #1336
- [Fix] fix the unit test of model name deduce by @AllentDan in #1382
📚 Documentation
- add citation in readme by @RunningLeon in #1308
- Add slora example for pipeline by @AllentDan in #1343
🌐 Other
- Add restful interface regression daily test workflow by @zhulinJulia24 in #1302
- Add offline mode for testcase workflow by @zhulinJulia24 in #1318
- workflow bugfix and add llava-v1.5-13b testcase by @zhulinJulia24 in #1339
- Add benchmark test workflow by @zhulinJulia24 in #1364
- bump version to v0.3.0 by @lvhan028 in #1387
Full Changelog: v0.2.6...v0.3.0