Releases: InternLM/lmdeploy
LMDeploy Release V0.5.0
What's Changed
🚀 Features
- support MiniCPM-Llama3-V 2.5 by @irexyc in #1708
- [Feature]: Support llava for pytorch engine by @RunningLeon in #1641
- Device dispatcher by @grimoire in #1775
- Add GLM-4-9B-Chat by @lzhangzz in #1724
- Torch deepseek v2 by @grimoire in #1621
- Support internvl-chat for pytorch engine by @RunningLeon in #1797
- Add interfaces to the pipeline to obtain logits and ppl by @irexyc in #1652
- [Feature]: Support cogvlm-chat by @RunningLeon in #1502
💥 Improvements
- support mistral and llava_mistral in turbomind by @lvhan028 in #1579
- Add health endpoint by @AllentDan in #1679
- upgrade the version of the dependency package peft by @grimoire in #1687
- Follow the conventional model_name by @AllentDan in #1677
- API Image URL fetch timeout by @vody-am in #1684
- Support internlm-xcomposer2-4khd-7b awq by @AllentDan in #1666
- update dockerfile and docs by @RunningLeon in #1715
- lazy import VLAsyncEngine to avoid bringing in VLMs dependencies when deploying LLMs by @lvhan028 in #1714
- feat: align with OpenAI temperature range by @zhyncs in #1733
- feat: align with OpenAI temperature range in api server by @zhyncs in #1734
- Refactor converter about get_input_model_registered_name and get_output_model_registered_name_and_config by @lvhan028 in #1702
- Refine max_new_tokens logic to improve user experience by @AllentDan in #1705
- Refactor loading weights by @grimoire in #1603
- refactor config by @grimoire in #1751
- Add anomaly handler by @lzhangzz in #1780
- Encode raw image file to base64 by @irexyc in #1773
- skip inference for oversized inputs by @grimoire in #1769
- fix: prevent numpy breakage by @zhyncs in #1791
- More accurate time logging for ImageEncoder and fix concurrent image processing corruption by @irexyc in #1765
- Optimize kernel launch for triton2.2.0 and triton2.3.0 by @grimoire in #1499
- feat: auto set awq model_format from hf by @zhyncs in #1799
- check driver mismatch by @grimoire in #1811
- PyTorchEngine adapts to the latest internlm2 modeling. by @grimoire in #1798
- AsyncEngine create cancel task in exception. by @grimoire in #1807
- compat internlm2 for pytorch engine by @RunningLeon in #1825
- Add model revision & download_dir to cli by @irexyc in #1814
- fix image encoder request queue by @irexyc in #1837
- Harden stream callback by @lzhangzz in #1838
- Support Qwen2-1.5b awq by @AllentDan in #1793
- remove chat template config in turbomind engine by @irexyc in #1161
- misc: align PyTorch Engine temperature with TurboMind by @zhyncs in #1850
- docs: update cache-max-entry-count help message by @zhyncs in #1892
🐞 Bug fixes
- fix typos by @irexyc in #1690
- [Bugfix] fix internvl-1.5-chat vision model preprocess and freeze weights by @DefTruth in #1741
- lock setuptools version in dockerfile by @RunningLeon in #1770
- Fix openai package can not use proxy stream mode by @AllentDan in #1692
- Fix finish_reason by @AllentDan in #1768
- fix uncached stop words by @grimoire in #1754
- [side-effect] Fix param `--cache-max-entry-count` not taking effect (#1758) by @QwertyJack in #1778
- support qwen2 1.5b by @lvhan028 in #1782
- fix falcon attention by @grimoire in #1761
- Refine AsyncEngine exception handler by @AllentDan in #1789
- [side-effect] fix weight_type caused by PR #1702 by @lvhan028 in #1795
- fix best_match_model by @irexyc in #1812
- Fix Request completed log by @irexyc in #1821
- fix qwen-vl-chat hung by @irexyc in #1824
- Detokenize with prompt token ids by @AllentDan in #1753
- Update engine.py to fix small typos by @WANGSSSSSSS in #1829
- [side-effect] bring back "--cap" argument in chat cli by @lvhan028 in #1859
- Fix vl session-len by @AllentDan in #1860
- fix gradio vl "stop_words" by @irexyc in #1873
- fix qwen2 cache_position for PyTorch Engine when transformers>4.41.2 by @zhyncs in #1886
- fix model name matching for internvl by @RunningLeon in #1867
📚 Documentations
- docs: add BentoLMDeploy in README by @zhyncs in #1736
- [Doc]: Update docs for internlm2.5 by @RunningLeon in #1887
🌐 Other
- add longtext generation benchmark by @zhulinJulia24 in #1694
- add qwen2 model into testcase by @zhulinJulia24 in #1772
- fix pr test for newest internlm2 model by @zhulinJulia24 in #1806
- refactor test evaluation config by @zhulinJulia24 in #1861
- bump version to v0.5.0 by @lvhan028 in #1852
New Contributors
- @DefTruth made their first contribution in #1741
- @QwertyJack made their first contribution in #1778
- @WANGSSSSSSS made their first contribution in #1829
Full Changelog: v0.4.2...v0.5.0
LMDeploy Release V0.4.2
Highlight
- Support 4-bit weight-only quantization and inference on VLMs, such as InternVL v1.5, LLaVA, InternLM-XComposer2
Quantization
lmdeploy lite auto_awq OpenGVLab/InternVL-Chat-V1-5 --work-dir ./InternVL-Chat-V1-5-AWQ
Inference with quantized model
from lmdeploy import pipeline, TurbomindEngineConfig
from lmdeploy.vl import load_image
pipe = pipeline('./InternVL-Chat-V1-5-AWQ', backend_config=TurbomindEngineConfig(tp=1, model_format='awq'))
img = load_image('https://raw.githubusercontent.com/open-mmlab/mmdeploy/main/tests/data/tiger.jpeg')
out = pipe(('describe this image', img))
print(out)
- Balance the vision model across multiple GPUs when deploying VLMs (a combined batch-inference sketch follows the examples below)
from lmdeploy import pipeline, TurbomindEngineConfig
from lmdeploy.vl import load_image
pipe = pipeline('OpenGVLab/InternVL-Chat-V1-5', backend_config=TurbomindEngineConfig(tp=2))
img = load_image('https://raw.githubusercontent.com/open-mmlab/mmdeploy/main/tests/data/tiger.jpeg')
out = pipe(('describe this image', img))
print(out)
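The two highlights above can be combined. The sketch below is illustrative rather than taken from the release: it reuses the AWQ checkpoint produced by the quantization command (`./InternVL-Chat-V1-5-AWQ`), assumes two GPUs are available, and assumes the pipeline accepts a batch of (prompt, image) pairs plus a `GenerationConfig`; adjust paths and settings for your environment.

```python
from lmdeploy import pipeline, TurbomindEngineConfig, GenerationConfig
from lmdeploy.vl import load_image

# Illustrative: AWQ-quantized VLM with vision weights balanced over 2 GPUs
pipe = pipeline('./InternVL-Chat-V1-5-AWQ',
                backend_config=TurbomindEngineConfig(tp=2, model_format='awq'))

img = load_image('https://raw.githubusercontent.com/open-mmlab/mmdeploy/main/tests/data/tiger.jpeg')

# The pipeline can take a list of (prompt, image) pairs for batch inference
prompts = [('describe this image', img), ('what animal is in the picture?', img)]
outputs = pipe(prompts, gen_config=GenerationConfig(max_new_tokens=256))
for out in outputs:
    print(out.text)
```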
What's Changed
🚀 Features
- PyTorch Engine hash table based prefix caching by @grimoire in #1429
- support phi3 by @grimoire in #1497
- Turbomind prefix caching by @ispobock in #1450
- Enable search scale for awq by @AllentDan in #1545
- [Feature] Support vl models quantization by @AllentDan in #1553
💥 Improvements
- make Qwen compatible with Slora when TP > 1 by @jjjjohnson in #1518
- Optimize slora by @grimoire in #1447
- Use a faster format for images in VLMs by @isidentical in #1575
- add chat-template args to chat cli by @RunningLeon in #1566
- Get the max session len from config.json by @AllentDan in #1550
- Optimize w8a8 kernel by @grimoire in #1353
- support python 3.12 by @irexyc in #1605
- Optimize moe by @grimoire in #1520
- Balance vision model weights on multi gpus by @irexyc in #1591
- Support user-specified IMAGE_TOKEN position for deepseek-vl model by @irexyc in #1627
- Optimize GQA/MQA by @grimoire in #1649
🐞 Bug fixes
- fix logger init by @AllentDan in #1598
- Bugfix: wrongly assign gen_config with True by @thelongestusernameofall in #1594
- Enable split-kv for attention by @lzhangzz in #1606
- Fix xcomposer2 vision model process by @irexyc in #1640
- Fix NTK scaling by @lzhangzz in #1636
- Fix illegal memory access when seq_len < 64 by @lzhangzz in #1616
- Fix llava vl template by @irexyc in #1620
- [side-effect] fix deepseek-vl when tp is 1 by @irexyc in #1648
- fix logprobs output by @irexyc in #1561
- fix fused-moe in triton2.2.0 by @grimoire in #1654
- Align tokenizers in pipeline and api_server benchmark scripts by @AllentDan in #1650
- [side-effect] fix UnboundLocalError for internlm-xcomposer2-4khd-7b by @irexyc in #1661
- remove paged attention prefill autotune by @grimoire in #1658
- Fix transformers 4.41.0 prompt may differ after encode decode by @AllentDan in #1617
📚 Documentations
- Fix typo in w8a8.md by @chg0901 in #1568
- Update doc for prefix caching by @ispobock in #1597
- Update VL document by @AllentDan in #1657
🌐 Other
- remove first empty token check and add input validation testcase by @zhulinJulia24 in #1549
- add more model into benchmark and evaluate workflow by @zhulinJulia24 in #1565
- add vl awq testcase and refactor pipeline testcase by @zhulinJulia24 in #1630
- bump version to v0.4.2 by @lvhan028 in #1644
New Contributors
- @isidentical made their first contribution in #1575
- @chg0901 made their first contribution in #1568
- @thelongestusernameofall made their first contribution in #1594
Full Changelog: v0.4.1...v0.4.2
LMDeploy Release V0.4.1
What's Changed
🚀 Features
- Add colab demo by @AllentDan in #1428
- support starcoder2 by @grimoire in #1468
- support OpenGVLab/InternVL-Chat-V1-5 by @irexyc in #1490
💥 Improvements
- variable `CTA_H` & fix qkv bias by @lzhangzz in #1491
- refactor vision model loading by @irexyc in #1482
- fix installation requirements for windows by @irexyc in #1531
- Remove split batch inside pipeline inference function by @AllentDan in #1507
- Remove first empty chunk for api_server by @AllentDan in #1527
- add benchmark script to profile pipeline APIs by @lvhan028 in #1528
- Add input validation by @AllentDan in #1525
🐞 Bug fixes
- fix local variable 'response' referenced before assignment in async_engine.generate by @irexyc in #1513
- Fix turbomind import in windows by @irexyc in #1533
- Fix convert qwen2 to turbomind by @AllentDan in #1546
- Adding api_key and model_name parameters to the restful benchmark by @NiuBlibing in #1478
📚 Documentations
- update supported models for Baichuan by @zhyncs in #1485
- Fix typo in w8a8.md by @Infinity4B in #1523
- complete build.md by @YanxingLiu in #1508
- update readme wechat qrcode by @vansin in #1529
- Update docker docs for VL api by @vody-am in #1534
- Format supported model table using html syntax by @lvhan028 in #1493
- doc: add example of deploying api server to Kubernetes by @uzuku in #1488
🌐 Other
- add modelscope and lora testcase by @zhulinJulia24 in #1506
- bump version to v0.4.1 by @lvhan028 in #1544
New Contributors
- @NiuBlibing made their first contribution in #1478
- @Infinity4B made their first contribution in #1523
- @YanxingLiu made their first contribution in #1508
- @vody-am made their first contribution in #1534
- @uzuku made their first contribution in #1488
Full Changelog: v0.4.0...v0.4.1
LMDeploy Release V0.4.0
Highlights
Support for Llama3 and additional Vision-Language Models (VLMs):
- We now support Llama3 and an extended range of Vision-Language Models (VLMs), including InternVL versions 1.1 and 1.2, MiniGemini, and InternLMXComposer2.
Introduce online int4/int8 KV quantization and inference
- Data-free online quantization
- Supports all NVIDIA GPUs with the Volta architecture (sm70) and above
- KV int8 quantization is almost lossless in accuracy, and KV int4 quantization accuracy stays within an acceptable range
- Efficient inference: with int8/int4 KV quantization applied to llama2-7b, RPS improves by approximately 30% and 40% respectively compared to fp16 (a usage sketch follows below)
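As a minimal sketch (not taken from the release text itself), online KV quantization is switched on through the `quant_policy` field of `TurbomindEngineConfig`, where 8 selects int8 and 4 selects int4; the model name below is only an example, so refer to the KV quantization guide for authoritative usage.

```python
from lmdeploy import pipeline, TurbomindEngineConfig

# quant_policy: 0 = no KV quantization, 4 = int4 KV cache, 8 = int8 KV cache
engine_config = TurbomindEngineConfig(quant_policy=8)
pipe = pipeline('internlm/internlm2-chat-7b', backend_config=engine_config)
print(pipe(['hi, please intro yourself']))
```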
The following table shows the evaluation results of three LLM models with different KV numerical precision:
| dataset | version | metric | llama2-7b-chat |  |  | internlm2-chat-7b |  |  | qwen1.5-7b-chat |  |  |
|---|---|---|---|---|---|---|---|---|---|---|---|
|  |  |  | kv fp16 | kv int8 | kv int4 | kv fp16 | kv int8 | kv int4 | kv fp16 | kv int8 | kv int4 |
| ceval | - | naive_average | 28.42 | 27.96 | 27.58 | 60.45 | 60.88 | 60.28 | 70.56 | 70.49 | 68.62 |
| mmlu | - | naive_average | 35.64 | 35.58 | 34.79 | 63.91 | 64 | 62.36 | 61.48 | 61.56 | 60.65 |
| triviaqa | 2121ce | score | 56.09 | 56.13 | 53.71 | 58.73 | 58.7 | 58.18 | 44.62 | 44.77 | 44.04 |
| gsm8k | 1d7fe4 | accuracy | 28.2 | 28.05 | 27.37 | 70.13 | 69.75 | 66.87 | 54.97 | 56.41 | 54.74 |
| race-middle | 9a54b6 | accuracy | 41.57 | 41.78 | 41.23 | 88.93 | 88.93 | 88.93 | 87.33 | 87.26 | 86.28 |
| race-high | 9a54b6 | accuracy | 39.65 | 39.77 | 40.77 | 85.33 | 85.31 | 84.62 | 82.53 | 82.59 | 82.02 |
The table below presents LMDeploy's inference performance with quantized KV.
model | kv type | test settings | RPS | v.s. kv fp16 |
---|---|---|---|---|
llama2-chat-7b | fp16 | tp1 / ratio 0.8 / bs 256 / prompts 10000 | 14.98 | 1.0 |
- | int8 | tp1 / ratio 0.8 / bs 256 / prompts 10000 | 19.01 | 1.27 |
- | int4 | tp1 / ratio 0.8 / bs 256 / prompts 10000 | 20.81 | 1.39 |
llama2-chat-13b | fp16 | tp1 / ratio 0.9 / bs 128 / prompts 10000 | 8.55 | 1.0 |
- | int8 | tp1 / ratio 0.9 / bs 256 / prompts 10000 | 10.96 | 1.28 |
- | int4 | tp1 / ratio 0.9 / bs 256 / prompts 10000 | 11.91 | 1.39 |
internlm2-chat-7b | fp16 | tp1 / ratio 0.8 / bs 256 / prompts 10000 | 24.13 | 1.0 |
- | int8 | tp1 / ratio 0.8 / bs 256 / prompts 10000 | 25.28 | 1.05 |
- | int4 | tp1 / ratio 0.8 / bs 256 / prompts 10000 | 25.80 | 1.07 |
What's Changed
🚀 Features
- Support qwen1.5 in turbomind engine by @lvhan028 in #1406
- Online 8/4-bit KV-cache quantization by @lzhangzz in #1377
- Support qwen1.5-*-AWQ model inference in turbomind by @lvhan028 in #1430
- support Internvl chat v1.1, v1.2 and v1.2-plus by @irexyc in #1425
- support Internvl chat llava by @irexyc in #1426
- Add llama3 chat template by @AllentDan in #1461
- Support mini gemini llama by @AllentDan in #1438
- add interactive api in service for VL models by @AllentDan in #1444
- support output logprobs with turbomind backend. by @irexyc in #1391
- support internlm-xcomposer2-7b & internlm-xcomposer2-4khd-7b by @irexyc in #1458
- Add qwen1.5 awq quantization by @AllentDan in #1470
💥 Improvements
- Reduce binary size, add `sm_89` and `sm_90` targets by @lzhangzz in #1383
- Use new event loop instead of the current loop for pipeline by @AllentDan in #1352
- Optimize inference of pytorch engine with tensor parallelism by @grimoire in #1397
- add llava-v1.6-34b template by @irexyc in #1408
- Initialize vl encoder first to avoid OOM by @AllentDan in #1434
- Support model_name customization for api_server by @AllentDan in #1403
- Expose dynamic split&fuse parameters by @lvhan028 in #1433
- warning transformers version by @grimoire in #1453
- Optimize apply_rotary kernel and remove useless inference_mode by @grimoire in #1457
- set infinity timeout to nccl by @grimoire in #1465
- Feat: format internlm2 chat template by @liujiangning30 in #1456
🐞 Bug fixes
- handle SIGTERM by @grimoire in #1389
- fix chat cli `ArgumentError` that happened in python 3.11 by @RunningLeon in #1401
- Fix llama_triton_example by @AllentDan in #1414
- miss --trust-remote-code in converter, which is side effect brought by pr #1406 by @lvhan028 in #1420
- fix sampling kernel by @grimoire in #1417
- Fix loading single safetensor file error by @AllentDan in #1427
- remove space in deepseek template by @grimoire in #1441
- fix free repetition_penalty_workspace_ buffer by @irexyc in #1467
- fix adapter failure when tp>1 by @grimoire in #1476
- get model in advance to fix downloading from modelscope error by @irexyc in #1473
- Fix the side effect in engine_intance brought by #1391 by @lvhan028 in #1480
📚 Documentations
- Add model name corresponding to the test data in the doc by @wykvictor in #1400
- fix typo in get_started guide by @lvhan028 in #1411
- Add async openai demo for api_server by @AllentDan in #1409
- add the recommendation version for Python Backend by @zhyncs in #1436
- Update kv quantization and inference guide by @lvhan028 in #1412
- update doc for llama3 by @zhyncs in #1462
🌐 Other
- hack cmakelist.txt in pr_test workflow by @zhulinJulia24 in #1405
- Add benchmark report generated in summary by @zhulinJulia24 in #1419
- add restful completions v1 test case by @ZhoujhZoe in #1416
- Add kvint4/8 ete testcase by @zhulinJulia24 in #1448
- improve rotary embedding of qwen in torch engine by @grimoire in #1451
- change cutlass url in ut by @RunningLeon in #1464
- bump version to v0.4.0 by @lvhan028 in #1469
New Contributors
- @wykvictor made their first contribution in #1400
- @ZhoujhZoe made their first contribution in #1416
- @liujiangning30 made their first contribution in #1456
Full Changelog: v0.3.0...v0.4.0
LMDeploy Release V0.3.0
Highlight
- Refactor attention and optimize GQA (#1258, #1307, #1116), achieving 22+ RPS for internlm2-7b and 16+ RPS for internlm2-20b, about 1.8x faster than vLLM
- Support new models, including Qwen1.5-MoE (#1372), DBRX (#1367), and DeepSeek-VL (#1335)
What's Changed
🚀 Features
- Add tensor core GQA dispatch for `[4,5,6,8]` by @lzhangzz in #1258
- upgrade turbomind to v2.1 by @lzhangzz in #1307, #1116
- Support slora to pipeline by @AllentDan in #1286
- Support qwen for pytorch engine by @RunningLeon in #1265
- Support Triton inference server python backend by @ispobock in #1329
- torch engine support dbrx by @grimoire in #1367
- Support qwen2 moe for pytorch engine by @RunningLeon in #1372
- Add deepseek vl by @AllentDan in #1335
💥 Improvements
- rm unused var by @zhyncs in #1256
- Expose cache_block_seq_len to API by @ispobock in #1218
- add chat template for deepseek coder model by @lvhan028 in #1310
- Add more log info for api_server by @AllentDan in #1323
- remove cuda cache after loading vision model by @irexyc in #1325
- Add new chat cli with auto backend feature by @RunningLeon in #1276
- Update rewritings for qwen by @RunningLeon in #1351
- lazy import accelerate.init_empty_weights for vl async engine by @irexyc in #1359
- update lmdeploy pypi packages deps to cuda12 by @irexyc in #1368
- update `max_prefill_token_num` for low gpu memory by @grimoire in #1373
- Optimize pipeline of pytorch engine by @grimoire in #1328
🐞 Bug fixes
- fix different stop/bad words length in batch by @irexyc in #1246
- Fix performance issue of chatbot by @ispobock in #1295
- add missed argument by @irexyc in #1317
- Fix dlpack memory leak by @ispobock in #1344
- Fix invalid context for Internstudio platform by @lzhangzz in #1354
- fix benchmark generation by @grimoire in #1349
- fix window attention by @grimoire in #1341
- fix batchApplyRepetitionPenalty by @irexyc in #1358
- Fix memory leak of DLManagedTensor by @ispobock in #1361
- fix vlm inference hung with tp by @irexyc in #1336
- [Fix] fix the unit test of model name deduce by @AllentDan in #1382
📚 Documentations
- add citation in readme by @RunningLeon in #1308
- Add slora example for pipeline by @AllentDan in #1343
🌐 Other
- Add restful interface regression daily test workflow by @zhulinJulia24 in #1302
- Add offline mode for testcase workflow by @zhulinJulia24 in #1318
- workflow bugfix and add llava-v1.5-13b testcase by @zhulinJulia24 in #1339
- Add benchmark test workflow by @zhulinJulia24 in #1364
- bump version to v0.3.0 by @lvhan028 in #1387
Full Changelog: v0.2.6...v0.3.0
LMDeploy Release V0.2.6
Highlight
Support vision-language model (VLM) inference pipeline and serving.
Currently, it supports the following models: Qwen-VL-Chat, the LLaVA series (v1.5 and v1.6), and Yi-VL.
- VLM Inference Pipeline
from lmdeploy import pipeline
from lmdeploy.vl import load_image
pipe = pipeline('liuhaotian/llava-v1.6-vicuna-7b')
image = load_image('https://raw.githubusercontent.com/open-mmlab/mmdeploy/main/tests/data/tiger.jpeg')
response = pipe(('describe this image', image))
print(response)
Please refer to the detailed guide from here
- VLM serving by an OpenAI-compatible server (a client sketch follows below)
lmdeploy serve api_server liuhaotian/llava-v1.6-vicuna-7b --server-port 8000
- VLM Serving by gradio
lmdeploy serve gradio liuhaotian/llava-v1.6-vicuna-7b --server-port 6006
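For reference, a hedged client-side sketch for the OpenAI-compatible server started above, using the official `openai` Python package; the host and model name are assumptions, so query `GET /v1/models` for the name the server actually registers.

```python
from openai import OpenAI

# Assumed host/port; match the --server-port used when launching api_server
client = OpenAI(api_key='EMPTY', base_url='http://0.0.0.0:8000/v1')

response = client.chat.completions.create(
    model='llava-v1.6-vicuna-7b',  # assumed name; check GET /v1/models
    messages=[{
        'role': 'user',
        'content': [
            {'type': 'text', 'text': 'describe this image'},
            {'type': 'image_url',
             'image_url': {'url': 'https://raw.githubusercontent.com/open-mmlab/mmdeploy/main/tests/data/tiger.jpeg'}},
        ],
    }],
)
print(response.choices[0].message.content)
```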
What's Changed
🚀 Features
- Add inference pipeline for VL models by @irexyc in #1214
- Support serving VLMs by @AllentDan in #1285
- Serve VLM by gradio by @irexyc in #1293
- Add pipeline.chat api for easy use by @irexyc in #1292
💥 Improvements
- Hide qos functions from swagger UI if not applied by @AllentDan in #1238
- Color log formatter by @grimoire in #1247
- optimize filling kv cache kernel in pytorch engine by @grimoire in #1251
- Refactor chat template and support accurate name matching. by @AllentDan in #1216
- Support passing json file to chat template by @AllentDan in #1200
- upgrade peft and check adapters by @grimoire in #1284
- better cache allocation in pytorch engine by @grimoire in #1272
- Fall back to base template if there is no chat_template in tokenizer_config.json by @AllentDan in #1294
🐞 Bug fixes
- lazy load convert_pv jit function by @grimoire in #1253
- [BUG] fix the case when num_used_blocks < 0 by @jjjjohnson in #1277
- Check bf16 model in torch engine by @grimoire in #1270
- fix bf16 check by @grimoire in #1281
- [Fix] fix triton server chatbot init error by @AllentDan in #1278
- Fix concatenate issue in profile serving by @ispobock in #1282
- fix torch tp lora adapter by @grimoire in #1300
- Fix crash when api_server loads a turbomind model by @irexyc in #1304
📚 Documentations
- fix config for readthedocs by @RunningLeon in #1245
- update badges in README by @lvhan028 in #1243
- Update serving guide including api_server and gradio by @lvhan028 in #1248
- rename restful_api.md to api_server.md by @lvhan028 in #1287
- Update readthedocs index by @lvhan028 in #1288
🌐 Other
- Parallelize testcase and refactor test workflow by @zhulinJulia24 in #1254
- Accelerate sample request in benchmark script by @ispobock in #1264
- Update eval ci cfg by @RunningLeon in #1259
- Test case bugfix and add restful interface testcases. by @zhulinJulia24 in #1271
- bump version to v0.2.6 by @lvhan028 in #1299
New Contributors
- @jjjjohnson made their first contribution in #1277
Full Changelog: v0.2.5...v0.2.6
LMDeploy Release V0.2.5
What's Changed
🚀 Features
- Support mistral and sliding window attention by @grimoire in #1075
- torch engine support chatglm3 by @grimoire in #1159
- Support qwen1.5 in pytorch engine by @grimoire in #1160
- Support mixtral for pytorch engine by @RunningLeon in #1133
- Support torch deepseek moe by @grimoire in #1163
- Support gemma model in pytorch engine by @grimoire in #1184
- Auto backend for pipeline and serve when backend is not set to pytorch explicitly by @RunningLeon in #1211
💥 Improvements
- Fix argument error by @ispobock in #1193
- Use LifoQueue for turbomind async_stream_infer by @AllentDan in #1179
- Update interactive output len strategy and response by @AllentDan in #1164
- Support `min_new_tokens` generation config in pytorch engine by @grimoire in #1096
- Batched sampling by @grimoire in #1197
- refactor the logic of getting `model_name` by @AllentDan in #1188
- Add parameter `max_prefill_token_num` by @lvhan028 in #1203
- optimize baichuan in pytorch engine by @grimoire in #1223
- check model required transformers version by @grimoire in #1220
- torch optimize chatglm3 by @grimoire in #1215
- Async torch engine by @grimoire in #1206
- remove unused kernel in pytorch engine by @grimoire in #1237
🐞 Bug fixes
- Fix session length for profile generation by @ispobock in #1181
- fix torch engine infer by @RunningLeon in #1185
- fix module map by @grimoire in #1205
- [Fix] Correct session length warning by @AllentDan in #1207
- Fix all devices occupation when applying tp to torch engine by updating device map by @grimoire in #1172
- Fix falcon chatglm2 template by @grimoire in #1168
- [Fix] Avoid AsyncEngine running the same session id by @AllentDan in #1219
- Fix `None` session_len by @lvhan028 in #1230
- fix multinomial sampling by @grimoire in #1228
- fix returning logits in prefill phase of pytorch engine by @grimoire in #1209
- optimize pytorch engine inference with falcon model by @grimoire in #1234
- fix bf16 multinomial sampling by @grimoire in #1239
- reduce torchengine prefill mem usage by @grimoire in #1240
📚 Documentations
- auto generate pipeline api for readthedocs by @RunningLeon in #1186
- Added tutorial document for deploying lmdeploy on Jetson series boards. by @BestAnHongjun in #1192
- update doc index by @zhyncs in #1241
🌐 Other
- Add PR test workflow and check-in more testcases by @zhulinJulia24 in #1208
- fix pytest version by @zhulinJulia24 in #1236
- bump version to v0.2.5 by @lvhan028 in #1235
New Contributors
- @ispobock made their first contribution in #1181
- @BestAnHongjun made their first contribution in #1192
Full Changelog: v0.2.4...v0.2.5
LMDeploy Release V0.2.4
What's Changed
💥 Improvements
- use stricter rules to get weight file by @irexyc in #1070
- check pytorch engine environment by @grimoire in #1107
- Update Dockerfile order to launch the http service by `docker run` directly by @AllentDan in #1162
- Support torch cache_max_entry_count by @grimoire in #1166
- Remove the manual model conversion during benchmark by @lvhan028 in #953
- update llama triton example by @zhyncs in #1153
🐞 Bug fixes
- fix embedding copy size by @irexyc in #1036
- fix pytorch engine with peft==0.8.2 by @grimoire in #1122
- support triton2.2 by @grimoire in #1137
- Add `top_k` in ChatCompletionRequest by @lvhan028 in #1174
- minor fix benchmark generation guide and script by @lvhan028 in #1175
📚 Documentations
🌐 Other
- Add eval ci by @RunningLeon in #1060
- Ete testcase add more models by @zhulinJulia24 in #1077
- Fix win ci by @irexyc in #1132
- bump version to v0.2.4 by @lvhan028 in #1171
Full Changelog: v0.2.3...v0.2.4
LMDeploy Release V0.2.3
What's Changed
🚀 Features
💥 Improvements
- Remove caching tokenizer.json by @grimoire in #1074
- Refactor `get_logger` to remove the dependency of MMLogger from mmengine by @yinfan98 in #1064
- Use TM_LOG_LEVEL environment variable first by @zhyncs in #1071
- Speed up the initialization of w8a8 model for torch engine by @yinfan98 in #1088
- Make logging.logger's behavior consistent with MMLogger by @irexyc in #1092
- Remove owned_session for torch engine by @grimoire in #1097
- Unify engine initialization in pipeline by @irexyc in #1085
- Add skip_special_tokens in GenerationConfig by @grimoire in #1091
- Use default stop words for turbomind backend in pipeline by @irexyc in #1119
- Add input_token_len to Response and update Response document by @AllentDan in #1115
🐞 Bug fixes
- Fix fast tokenizer swallows prefix space when there are too many white spaces by @AllentDan in #992
- Fix turbomind CUDA runtime error invalid argument by @zhyncs in #1100
- Add safety check for incremental decode by @AllentDan in #1094
- Fix device type of get_ppl for turbomind by @RunningLeon in #1093
- Fix pipeline init turbomind from workspace by @irexyc in #1126
- Add dependency version check and fix `ignore_eos` logic by @grimoire in #1099
- Change configuration_internlm.py to configuration_internlm2.py by @HIT-cwh in #1129
📚 Documentations
🌐 Other
New Contributors
Full Changelog: v0.2.2...v0.2.3
LMDeploy Release V0.2.2
Highlight
- The allocation strategy for the k/v cache is changed. The parameter `cache_max_entry_count` now defaults to 0.8 and means the proportion of GPU FREE memory rather than TOTAL memory, which helps prevent OOM issues.
- The pipeline API supports streaming inference. You may give it a try!
from lmdeploy import pipeline
pipe = pipeline('internlm/internlm2-chat-7b')
for item in pipe.stream_infer('hi, please intro yourself'):
print(item)
- Add api key and ssl to `api_server` (a client-side sketch follows below)
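A minimal client-side sketch, assuming the server was launched with an API key configured; the key, host, port, and model name below are placeholders rather than values taken from this release.

```python
from openai import OpenAI

# Placeholders: use the key configured for api_server and the model name it registers
client = OpenAI(api_key='YOUR_API_KEY', base_url='http://0.0.0.0:23333/v1')
response = client.chat.completions.create(
    model='internlm2-chat-7b',
    messages=[{'role': 'user', 'content': 'hi, please intro yourself'}],
)
print(response.choices[0].message.content)
```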
What's Changed
🚀 Features
- add alignment tools by @grimoire in #1004
- support min_length for turbomind backend by @irexyc in #961
- Add stream mode function to pipeline by @AllentDan in #974
- [Feature] Add api key and ssl to http server by @AllentDan in #1048
💥 Improvements
- hide stop-words in output text by @grimoire in #991
- optimize sleep by @grimoire in #1034
- set example values to /v1/chat/completions in swagger UI by @AllentDan in #984
- Update adapters cli argument by @RunningLeon in #1039
- Fix turbomind end session bug. Add huggingface demo document by @AllentDan in #1017
- Support linking the custom built mpi by @lvhan028 in #1025
- sync mem size for tp by @lzhangzz in #1053
- Remove model name when loading hf model by @irexyc in #1022
- support internlm2-1_8b by @lvhan028 in #1073
- Update chat template for internlm2 base model by @lvhan028 in #1079
🐞 Bug fixes
- fix TorchEngine stuck when benchmarking with `tp>1` by @grimoire in #942
- fix module mapping error of baichuan model by @grimoire in #977
- fix import error for triton server by @RunningLeon in #985
- fix qwen-vl example by @irexyc in #996
- fix missing init file in modules by @RunningLeon in #1013
- fix tp mem usage by @grimoire in #987
- update indexes_containing_token function by @AllentDan in #1050
- fix flash kernel on sm 70 by @grimoire in #1027
- Fix baichuan2 lora by @grimoire in #1042
- Fix modelconfig in pytorch engine, support YI. by @grimoire in #1052
- Fix repetition penalty for long context by @irexyc in #1037
- [Fix] Support QLinear in rowwise_parallelize_linear_fn and colwise_parallelize_linear_fn by @HIT-cwh in #1072
📚 Documentations
- add docs for evaluation with opencompass by @RunningLeon in #995
- update docs for kvint8 by @RunningLeon in #1026
- [doc] Introduce project OpenAOE by @JiaYingLii in #1049
- update pipeline guide and FAQ about OOM by @lvhan028 in #1051
- docs update cache_max_entry_count for turbomind config by @zhyncs in #1067
🌐 Other
- update ut ci to new server node by @RunningLeon in #1024
- Ete testcase update by @zhulinJulia24 in #1023
- fix OOM in BlockManager by @zhyncs in #973
- fix use engine_config.tp when tp is None by @zhyncs in #1057
- Fix serve api by moving logger inside process for turbomind by @AllentDan in #1061
- bump version to v0.2.2 by @lvhan028 in #1076
New Contributors
- @zhyncs made their first contribution in #973
- @JiaYingLii made their first contribution in #1049
Full Changelog: v0.2.1...v0.2.2