Releases · InternLM/lmdeploy
LMDeploy Release V0.6.3
What's Changed
🚀 Features
- support yarn in turbomind backend by @irexyc in #2519
- add linear op on dlinfer platform by @yao-fengchen in #2627
- support turbomind head_dim 64 by @irexyc in #2715
- [Feature]: support LlavaForConditionalGeneration with turbomind inference by @deepindeed2022 in #2710
- Support Mono-InternVL with PyTorch backend by @wzk1015 in #2727
- Support Qwen2-MoE models by @lzhangzz in #2723
- Support mixtral moe AWQ quantization. by @AllentDan in #2725
- Support chemvlm by @RunningLeon in #2738
- Support molmo in turbomind by @lvhan028 in #2716
💥 Improvements
- Call cuda empty_cache to prevent OOM when quantizing model by @AllentDan in #2671
- feat: support dynamic/llama3 rotary embedding in ascend graph mode by @tangzhiyi11 in #2670
- Add ensure_ascii = False for json.dumps by @AllentDan in #2707
- Flatten cache and add flashattention by @grimoire in #2676
- Support ep, column major moe kernel. by @grimoire in #2690
- Remove one of the duplicate bos tokens by @AllentDan in #2708
- Check server input by @irexyc in #2719
- optimize dlinfer moe by @tangzhiyi11 in #2741
🐞 Bug fixes
- Support min_tokens, min_p parameters for api_server by @AllentDan in #2681
- fix index error when computing ppl on long-text prompt by @lvhan028 in #2697
- Better tp exit log. by @grimoire in #2677
- miss to read moe_ffn weights from converted tm model by @lvhan028 in #2698
- Fix turbomind TP by @lzhangzz in #2706
- fix decoding kernel for deepseekv2 by @grimoire in #2688
- fix tp exit code for pytorch engine by @RunningLeon in #2718
- fix assert pad >= 0 failed when inter_size is not a multiple of group… by @Vinkle-hzt in #2740
- fix issue that mono-internvl failed to fallback pytorch engine by @lvhan028 in #2744
- Remove use_fast=True when loading tokenizer for lite auto_awq by @AllentDan in #2758
- set wrong head_dim for mistral-nemo by @lvhan028 in #2761
📚 Documentations
- Update ascend readme by @jinminxi104 in #2756
- fix ascend get_started.md link by @CyCle1024 in #2696
- Fix llama3.2 VL vision in "Supported Modals" documents by @blankanswer in #2703
🌐 Other
- [ci] support v100 dailytest by @zhulinJulia24 in #2665
- [ci] add more testcase into evaluation and daily test by @zhulinJulia24 in #2721
- feat: support multi cards in ascend graph mode by @tangzhiyi11 in #2755
- bump version to v0.6.3 by @lvhan028 in #2754
New Contributors
- @blankanswer made their first contribution in #2703
- @tangzhiyi11 made their first contribution in #2670
- @wzk1015 made their first contribution in #2727
- @Vinkle-hzt made their first contribution in #2740
Full Changelog: v0.6.2...v0.6.3
LMDeploy Release v0.6.2.post1
What's Changed
🐞 Bug fixes
- Fix llama3.2 VL vision in "Supported Modals" documents by @blankanswer in #2703
- miss to read moe_ffn weights from converted tm model by @lvhan028 in #2698
- better tp exit log by @grimoire in #2677
- fix index error when computing ppl on long-text prompt by @lvhan028 in #2697
- Support min_tokens, min_p parameters for api_server by @AllentDan in #2681
- fix ascend get_started.md link by @CyCle1024 in #2696
- Call cuda empty_cache to prevent OOM when quantizing model by @AllentDan in #2671
- Fix turbomind TP for v0.6.2 by @lzhangzz in #2713
🌐 Other
- [ci] support v100 dailytest by @zhulinJulia24 in #2665
- bump version to 0.6.2.post1 by @lvhan028 in #2717
Full Changelog: v0.6.2...v0.6.2.post1
LMDeploy Release v0.6.2
Highlights
- PyTorch engine supports graph mode on the Ascend platform, doubling the inference speed
- Support llama3.2-vision models in the PyTorch engine
- Support Mixtral in the TurboMind engine, achieving 20+ RPS on the ShareGPT dataset with 2 A100-80G GPUs
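The Mixtral highlight can be exercised through the Python pipeline API. A minimal sketch, assuming the Mixtral-8x7B-Instruct HuggingFace model id and 2 GPUs; the model id and prompt are illustrative, not taken from the release notes:

```python
# Minimal sketch: serving Mixtral with the TurboMind engine on 2 GPUs.
from lmdeploy import pipeline, TurbomindEngineConfig

pipe = pipeline(
    'mistralai/Mixtral-8x7B-Instruct-v0.1',        # assumed HF model id
    backend_config=TurbomindEngineConfig(tp=2),    # tensor parallelism across 2 x A100-80G
)
print(pipe(['Summarize mixture-of-experts routing in one sentence.']))
```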
What's Changed
🚀 Features
- support downloading models from openmind_hub by @cookieyyds in #2563
- Support pytorch engine kv int4/int8 quantization by @AllentDan in #2438
- feat(ascend): support w4a16 by @yao-fengchen in #2587
- [maca] add maca backend support. by @Reinerzhou in #2636
- Support mllama for pytorch engine by @AllentDan in #2605
- add --eager-mode to cli by @RunningLeon in #2645
- [ascend] add ascend graph mode by @CyCle1024 in #2647
- MoE support for turbomind by @lzhangzz in #2621
💥 Improvements
- [Feature] Add argument to disable FastAPI docs by @mouweng in #2540
- add check for device with cap 7.x by @grimoire in #2535
- Add tool role for langchain usage by @AllentDan in #2558
- Fix llama3.2-1b inference error by handling tie_word_embedding by @grimoire in #2568
- Add a workaround for saving internvl2 with latest transformers by @AllentDan in #2583
- optimize paged attention on triton3 by @grimoire in #2553
- refactor for multi backends in dlinfer by @CyCle1024 in #2619
- Copy sglang/bench_serving.py to lmdeploy as serving benchmark script by @lvhan028 in #2620
- Add barrier to prevent TP nccl kernel waiting. by @grimoire in #2607
- [ascend] refactor fused_moe on ascend platform by @yao-fengchen in #2613
- [ascend] support paged_prefill_attn when batch > 1 by @yao-fengchen in #2612
- Raise an error for the wrong chat template by @AllentDan in #2618
- refine pre-post-process by @jinminxi104 in #2632
- small block_m for sm7.x by @grimoire in #2626
- update check for triton by @grimoire in #2641
- Support llama3.2 LLM models in turbomind engine by @lvhan028 in #2596
- Check whether device support bfloat16 by @lvhan028 in #2653
- Add warning message about `do_sample` to alert BC by @lvhan028 in #2654
- update ascend dockerfile by @CyCle1024 in #2661
- fix supported model list in ascend graph mode by @jinminxi104 in #2669
- remove dlinfer version by @CyCle1024 in #2672
🐞 Bug fixes
- set outlines<0.1.0 by @AllentDan in #2559
- fix: make exit_flag verification for ascend more general by @CyCle1024 in #2588
- set capture mode thread_local by @grimoire in #2560
- Add distributed context in pytorch engine to support torchrun by @grimoire in #2615
- Fix error in python3.8. by @Reinerzhou in #2646
- Align UT with triton fill_kv_cache_quant kernel by @AllentDan in #2644
- miss device_type when checking is_bf16_supported on ascend platform by @lvhan028 in #2663
- fix syntax in Dockerfile_aarch64_ascend by @CyCle1024 in #2664
- Set history_cross_kv_seqlens to 0 by default by @AllentDan in #2666
- fix build error in ascend dockerfile by @CyCle1024 in #2667
- bugfix: llava-hf/llava-interleave-qwen-7b-hf (#2497) by @deepindeed2022 in #2657
- fix inference mode error for qwen2-vl by @irexyc in #2668
📚 Documentations
- Add instruction for downloading models from openmind hub by @cookieyyds in #2577
- Fix spacing in ascend user guide by @Superskyyy in #2601
- Update get_started tutorial about deploying on ascend platform by @jinminxi104 in #2655
- Update ascend get_started tutorial about installing nnal by @jinminxi104 in #2662
🌐 Other
- [ci] add oc infer test in stable test by @zhulinJulia24 in #2523
- update copyright by @lvhan028 in #2579
- [Doc]: Lock sphinx version by @RunningLeon in #2594
- [ci] use local requirements for test workflow by @zhulinJulia24 in #2569
- [ci] add pytorch kvint testcase into function regression by @zhulinJulia24 in #2584
- [ci] React dailytest workflow by @zhulinJulia24 in #2617
- [ci] fix restful script by @zhulinJulia24 in #2635
- [ci] add internlm2_5_7b_batch_1 into evaluation testcase by @zhulinJulia24 in #2631
- match torch and torch_vision version by @grimoire in #2649
- Bump version to v0.6.2 by @lvhan028 in #2659
New Contributors
- @mouweng made their first contribution in #2540
- @cookieyyds made their first contribution in #2563
- @Superskyyy made their first contribution in #2601
- @Reinerzhou made their first contribution in #2636
- @deepindeed2022 made their first contribution in #2657
Full Changelog: v0.6.1...v0.6.2
LMDeploy Release V0.6.1
What's Changed
🚀 Features
- Support user-specified data type by @lvhan028 in #2473
- Support minicpm3-4b by @AllentDan in #2465
- support Qwen2-VL with pytorch backend by @irexyc in #2449
💥 Improvements
- Add silu mul kernel by @grimoire in #2469
- adjust schedule to improve TTFT in pytorch engine by @grimoire in #2477
- Add max_log_len option to control length of printed log by @lvhan028 in #2478
- set served model name being repo_id from hub before it is downloaded by @lvhan028 in #2494
- Improve proxy server usage by @AllentDan in #2488
- CudaGraph mixin by @grimoire in #2485
- pytorch engine add get_logits by @grimoire in #2487
- Refactor lora by @grimoire in #2466
- support noaligned silu_and_mul by @grimoire in #2506
- optimize performance of ascend backend's update_step_context() by calculating kv_start_indices in a new way by @jiajie-yang in #2521
- Fix chatglm tokenizer failed when transformers>=4.45.0 by @AllentDan in #2520
🐞 Bug fixes
- Fix "TypeError: Got unsupported ScalarType BFloat16" by @SeitaroShinagawa in #2472
- fix ascend atten_mask by @yao-fengchen in #2483
- Catch exceptions thrown by turbomind inference thread by @lvhan028 in #2502
- The `get_ppl` missed the last token of each iteration during multi-iter prefill by @lvhan028 in #2499
- fix vl gradio by @irexyc in #2527
🌐 Other
- [ci] regular update by @zhulinJulia24 in #2431
- [CI] add base model evaluation by @zhulinJulia24 in #2490
- bump version to v0.6.1 by @lvhan028 in #2513
New Contributors
- @SeitaroShinagawa made their first contribution in #2472
Full Changelog: v0.6.0...v0.6.1
LMDeploy Release v0.6.0
Highlight
- Optimize W4A16 quantized model inference by implementing GEMM kernels in the TurboMind Engine
  - Add GPTQ-INT4 inference
  - Support CUDA architectures SM70 and above, i.e., V100 and newer GPUs
- Refactor PytorchEngine
  - Employ CUDA graph to boost the inference performance (30%)
  - Support more models on the Huawei Ascend platform
- Upgrade `GenerationConfig`
  - Support `min_p` sampling
  - Add `do_sample=False` as the default option
  - Remove `EngineGenerationConfig` and merge it into `GenerationConfig`
- Support guided decoding
- Distinguish between the name of the deployed model and the name of the model's chat template

Before:

```shell
lmdeploy serve api_server /the/path/of/your/awesome/model \
    --model-name customized_chat_template.json
```

After:

```shell
lmdeploy serve api_server /the/path/of/your/awesome/model \
    --model-name "the served model name" \
    --chat-template customized_chat_template.json
```
Breaking Changes
- TurboMind model converter. Please re-convert the models if you use this feature
- `EngineGenerationConfig` is removed. Please use `GenerationConfig` instead (see the sketch after this list)
- Chat template. Please use `--chat-template` to specify it
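A minimal usage sketch of the upgraded GenerationConfig; the model path and sampling values below are placeholders chosen for illustration, not recommendations from this release:

```python
# Illustrative sketch of the upgraded GenerationConfig.
from lmdeploy import pipeline, GenerationConfig

pipe = pipeline('/the/path/of/your/awesome/model')
gen_config = GenerationConfig(
    do_sample=True,      # sampling must now be enabled explicitly; the default is do_sample=False
    min_p=0.1,           # newly supported min_p sampling
    temperature=0.8,
    max_new_tokens=256,
)
print(pipe(['Hello, LMDeploy!'], gen_config=gen_config))
```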
What's Changed
🚀 Features
- support vlm custom image process parameters in openai input format by @irexyc in #2245
- New GEMM kernels for weight-only quantization by @lzhangzz in #2090
- Fix hidden size and support mistral nemo by @AllentDan in #2215
- Support custom logits processors by @AllentDan in #2329
- support openbmb/MiniCPM-V-2_6 by @irexyc in #2351
- Support phi3.5 for pytorch engine by @RunningLeon in #2361
- Add auto_gptq to lmdeploy lite by @AllentDan in #2372
- build(ascend): add Dockerfile for ascend aarch64 910B by @CyCle1024 in #2278
- Support guided decoding for pytorch backend by @AllentDan in #1856
- support min_p sampling parameter by @irexyc in #2420
- Refactor pytorch engine by @grimoire in #2104
- refactor pytorch engine(ascend) by @yao-fengchen in #2440
💥 Improvements
- Remove deprecated arguments from API and clarify model_name and chat_template_name by @lvhan028 in #1931
- Fix duplicated session_id when pipeline is used by multithreads by @irexyc in #2134
- remove eviction param by @grimoire in #2285
- Remove QoS serving by @AllentDan in #2294
- Support send tool_calls back to internlm2 by @AllentDan in #2147
- Add stream options to control usage by @AllentDan in #2313
- add device type for pytorch engine in cli by @RunningLeon in #2321
- Update error status_code to raise error in openai client by @AllentDan in #2333
- Change to use device instead of device-type in cli by @RunningLeon in #2337
- Add GEMM test utils by @lzhangzz in #2342
- Add environment variable to control SILU fusion by @lzhangzz in #2343
- Use single thread per model instance by @lzhangzz in #2339
- add cache to speed up docker building by @RunningLeon in #2344
- add max_prefill_token_num argument in CLI by @lvhan028 in #2345
- torch engine optimize prefill for long context by @grimoire in #1962
- Refactor turbomind (1/N) by @lzhangzz in #2352
- feat(server): enable `seed` parameter for openai compatible server by @DearPlanet in #2353
- support do_sample parameter by @irexyc in #2375
- refactor TurbomindModelConfig by @lvhan028 in #2364
- import dlinfer before imageencoding by @jinminxi104 in #2413
- ignore *.pth when download model from model hub by @lvhan028 in #2426
- inplace logits process as default by @grimoire in #2427
- handle invalid images by @irexyc in #2312
- Split token_embs and lm_head weights by @irexyc in #2252
- build: update ascend dockerfile by @CyCle1024 in #2421
- build nccl in dockerfile for cuda11.8 by @RunningLeon in #2433
- automatically set max_batch_size according to the device when it is not specified by @lvhan028 in #2434
- rename the ascend dockerfile by @lvhan028 in #2403
- refactor ascend kernels by @yao-fengchen in #2355
🐞 Bug fixes
- enable run vlm with pytorch engine in gradio by @RunningLeon in #2256
- fix side-effect: failed to update tm model config with tm engine config by @lvhan028 in #2275
- Fix internvl2 template and update docs by @irexyc in #2292
- fix the issue missing dependencies in the Dockerfile and pip by @ColorfulDick in #2240
- Fix the way to get "quantization_config" from the model's configuration by @lvhan028 in #2325
- fix(ascend): fix import error of pt engine in cli by @CyCle1024 in #2328
- Default rope_scaling_factor of TurbomindEngineConfig to None by @lvhan028 in #2358
- Fix the logic of update engine_config to TurbomindModelConfig for both tm model and hf model by @lvhan028 in #2362
- fix cache position for pytorch engine by @RunningLeon in #2388
- Fix /v1/completions batch order wrong by @AllentDan in #2395
- Fix some issues encountered by modelscope and community by @irexyc in #2428
- fix llama3 rotary in pytorch engine by @grimoire in #2444
- fix tensors on different devices when deploying MiniCPM-V-2_6 with tensor parallelism by @irexyc in #2454
- fix MultinomialSampling operator builder by @grimoire in #2460
- Fix initialization of runtime_min_p by @irexyc in #2461
- fix Windows compile error by @zhyncs in #2303
- fix: follow up #2303 by @zhyncs in #2307
📚 Documentations
- Reorganize the user guide and update the get_started section by @lvhan028 in #2038
- cancel support baichuan2 7b awq in pytorch engine by @grimoire in #2246
- Add user guide about slora serving by @AllentDan in #2084
- Reorganize the table of content of get_started by @lvhan028 in #2378
- fix inaccessible get_started user guide by @lvhan028 in #2410
- add Ascend get_started by @jinminxi104 in #2417
🌐 Other
- test prtest image update by @zhulinJulia24 in #2192
- Update python support version by @wuhongsheng in #2290
- [ci] benchmark react by @zhulinJulia24 in #2183
- bump version to v0.6.0a0 by @lvhan028 in #2371
- [ci] add daily test's coverage report by @zhulinJulia24 in #2401
- update actions/download-artifact to v4 to fix security issue by @lvhan028 in #2419
- bump version to v0.6.0 by @lvhan028 in #2445
New Contributors
- @wuhongsheng made their first contribution in #2290
- @ColorfulDick made their first contribution in #2240
- @DearPlanet made their first contribution in #2353
- @jinminxi104 made their first contribution in #2413
Full Changelog: v0.5.3...v0.6.0
LMDeploy Release V0.6.0a0
Highlight
- Optimize W4A16 quantized model inference by implementing GEMM kernels in the TurboMind Engine
  - Add GPTQ-INT4 inference
  - Support CUDA architectures SM70 and above, i.e., V100 and newer GPUs
- Optimize the prefill stage of PyTorchEngine
- Distinguish between the name of the deployed model and the name of the model's chat template

Before:

```shell
lmdeploy serve api_server /the/path/of/your/awesome/model \
    --model-name customized_chat_template.json
```

After:

```shell
lmdeploy serve api_server /the/path/of/your/awesome/model \
    --model-name "the served model name" \
    --chat-template customized_chat_template.json
```
What's Changed
🚀 Features
- support vlm custom image process parameters in openai input format by @irexyc in #2245
- New GEMM kernels for weight-only quantization by @lzhangzz in #2090
- Fix hidden size and support mistral nemo by @AllentDan in #2215
- Support custom logits processors by @AllentDan in #2329
- support openbmb/MiniCPM-V-2_6 by @irexyc in #2351
- Support phi3.5 for pytorch engine by @RunningLeon in #2361
💥 Improvements
- Remove deprecated arguments from API and clarify model_name and chat_template_name by @lvhan028 in #1931
- Fix duplicated session_id when pipeline is used by multithreads by @irexyc in #2134
- remove eviction param by @grimoire in #2285
- Remove QoS serving by @AllentDan in #2294
- Support send tool_calls back to internlm2 by @AllentDan in #2147
- Add stream options to control usage by @AllentDan in #2313
- add device type for pytorch engine in cli by @RunningLeon in #2321
- Update error status_code to raise error in openai client by @AllentDan in #2333
- Change to use device instead of device-type in cli by @RunningLeon in #2337
- Add GEMM test utils by @lzhangzz in #2342
- Add environment variable to control SILU fusion by @lzhangzz in #2343
- Use single thread per model instance by @lzhangzz in #2339
- add cache to speed up docker building by @RunningLeon in #2344
- add max_prefill_token_num argument in CLI by @lvhan028 in #2345
- torch engine optimize prefill for long context by @grimoire in #1962
- Refactor turbomind (1/N) by @lzhangzz in #2352
- feat(server): enable `seed` parameter for openai compatible server by @DearPlanet in #2353
🐞 Bug fixes
- enable run vlm with pytorch engine in gradio by @RunningLeon in #2256
- fix side-effect: failed to update tm model config with tm engine config by @lvhan028 in #2275
- Fix internvl2 template and update docs by @irexyc in #2292
- fix the issue missing dependencies in the Dockerfile and pip by @ColorfulDick in #2240
- Fix the way to get "quantization_config" from the model's configuration by @lvhan028 in #2325
- fix(ascend): fix import error of pt engine in cli by @CyCle1024 in #2328
- Default rope_scaling_factor of TurbomindEngineConfig to None by @lvhan028 in #2358
- Fix the logic of update engine_config to TurbomindModelConfig for both tm model and hf model by @lvhan028 in #2362
📚 Documentations
- Reorganize the user guide and update the get_started section by @lvhan028 in #2038
- cancel support baichuan2 7b awq in pytorch engine by @grimoire in #2246
- Add user guide about slora serving by @AllentDan in #2084
🌐 Other
- test prtest image update by @zhulinJulia24 in #2192
- Update python support version by @wuhongsheng in #2290
- fix Windows compile error by @zhyncs in #2303
- fix: follow up #2303 by @zhyncs in #2307
- [ci] benchmark react by @zhulinJulia24 in #2183
- bump version to v0.6.0a0 by @lvhan028 in #2371
New Contributors
- @wuhongsheng made their first contribution in #2290
- @ColorfulDick made their first contribution in #2240
- @DearPlanet made their first contribution in #2353
Full Changelog: v0.5.3...v0.6.0a0
LMDeploy Release V0.5.3
What's Changed
🚀 Features
- PyTorch Engine AWQ support by @grimoire in #1913
- Phi3 awq by @grimoire in #1984
- Fix chunked prefill by @lzhangzz in #2201
- support VLMs with Qwen as the language model by @irexyc in #2207
💥 Improvements
- Support specifying a prefix of assistant response by @AllentDan in #2172
- Strict check for `name_map` in `InternLM2Chat7B` by @SamuraiBUPT in #2156
- Check errors for attention kernels by @lzhangzz in #2206
- update base image to support cuda12.4 in dockerfile by @RunningLeon in #2182
- Stop synchronizing for `length_criterion` by @lzhangzz in #2202
- adapt MiniCPM-Llama3-V-2_5 new code by @irexyc in #2139
- Remove duplicate code by @cmpute in #2133
🐞 Bug fixes
- [Hotfix] missing parentheses when calculating the coef of llama3 rope by @lvhan028 in #2157
- support logit softcap by @grimoire in #2158
- Fix gmem to smem WAW conflict in awq gemm kernel by @foreverrookie in #2111
- Fix gradio serve using a wrong chat template by @AllentDan in #2131
- fix runtime error when using dynamic scale rotary embed for InternLM2… by @CyCle1024 in #2212
- Add peer-access-enabled allocator by @lzhangzz in #2218
- Fix typos in profile_generation.py by @jiajie-yang in #2233
📚 Documentations
- docs: fix Qwen typo by @ArtificialZeng in #2136
- wrong expression by @ArtificialZeng in #2165
- clarify the model type LLM or MLLM in supported model matrix by @lvhan028 in #2209
- docs: add Japanese README by @eltociear in #2237
🌐 Other
- bump version to 0.5.2.post1 by @lvhan028 in #2159
- update news about cooperation with modelscope/swift by @lvhan028 in #2200
- bump version to v0.5.3 by @lvhan028 in #2242
New Contributors
- @ArtificialZeng made their first contribution in #2136
- @foreverrookie made their first contribution in #2111
- @SamuraiBUPT made their first contribution in #2156
- @CyCle1024 made their first contribution in #2212
- @jiajie-yang made their first contribution in #2233
- @cmpute made their first contribution in #2133
Full Changelog: v0.5.2...v0.5.3
LMDeploy Release V0.5.2.post1
What's Changed
🐞 Bug fixes
- [Hotfix] missing parentheses when calculating the coef of llama3 rope, which caused the needle-in-a-haystack experiment to fail by @lvhan028 in #2157
🌐 Other
Full Changelog: v0.5.2...v0.5.2.post1
LMDeploy Release V0.5.2
Highlight
- LMDeploy supports Llama3.1 and its tool calling. An example of calling "Wolfram Alpha" to perform complex mathematical calculations can be found here
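A hedged sketch of driving Llama3.1 tool calling through the OpenAI-compatible api_server; the server address, served model name, and tool schema below are placeholders, not the Wolfram Alpha example referenced above:

```python
# Placeholder tool-calling request against LMDeploy's OpenAI-compatible server;
# assumes `lmdeploy serve api_server <llama3.1 model>` is already running locally.
from openai import OpenAI

client = OpenAI(base_url='http://0.0.0.0:23333/v1', api_key='none')
tools = [{
    'type': 'function',
    'function': {
        'name': 'add',
        'description': 'Add two integers.',
        'parameters': {
            'type': 'object',
            'properties': {'a': {'type': 'integer'}, 'b': {'type': 'integer'}},
            'required': ['a', 'b'],
        },
    },
}]
resp = client.chat.completions.create(
    model='llama3.1',   # assumed served model name
    messages=[{'role': 'user', 'content': 'Compute 3 + 5 with the add tool.'}],
    tools=tools,
)
print(resp.choices[0].message.tool_calls)
```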
What's Changed
🚀 Features
- Support glm4 awq by @AllentDan in #1993
- Support llama3.1 by @lvhan028 in #2122
- Support Llama3.1 tool calling by @AllentDan in #2123
💥 Improvements
- Remove the triton inference server backend "turbomind_backend" by @lvhan028 in #1986
- Remove kv cache offline quantization by @AllentDan in #2097
- Remove `session_len` and deprecated short names of the chat templates by @lvhan028 in #2105
- clarify "n>1" in GenerationConfig hasn't been supported yet by @lvhan028 in #2108
🐞 Bug fixes
- fix stop words for glm4 by @RunningLeon in #2044
- Disable peer access code by @lzhangzz in #2082
- set log level ERROR in benchmark scripts by @lvhan028 in #2086
- raise thread exception by @irexyc in #2071
- Fix index error when profiling token generation with `-ct 1` by @lvhan028 in #1898
🌐 Other
- misc: replace slow Jimver/cuda-toolkit by @zhyncs in #2065
- misc: update bug issue template by @zhyncs in #2083
- update daily testcase new by @zhulinJulia24 in #2035
- bump version to v0.5.2 by @lvhan028 in #2143
Full Changelog: v0.5.1...v0.5.2
LMDeploy Release V0.5.1
What's Changed
🚀 Features
- Support phi3-vision by @RunningLeon in #1845
- Support internvl2 chat template by @AllentDan in #1911
- support gemma2 in pytorch engine by @grimoire in #1924
- Add tools to api_server for InternLM2 model by @AllentDan in #1763
- support internvl2-1b by @RunningLeon in #1983
- feat: support llama2 and internlm2 on 910B by @yao-fengchen in #2011
- Support glm 4v by @RunningLeon in #1947
- support internlm-xcomposer2d5-7b by @irexyc in #1932
- add chat template for codegeex4 by @RunningLeon in #2013
💥 Improvements
- misc: rm unnecessary files by @zhyncs in #1875
- drop stop words by @grimoire in #1823
- Add usage in stream response by @fbzhong in #1876
- Optimize sampling on pytorch engine. by @grimoire in #1853
- Remove deprecated chat cli and vl examples by @lvhan028 in #1899
- vision model use tp number of gpu by @irexyc in #1854
- misc: add default api_server_url for api_client by @zhyncs in #1922
- misc: add transformers version check for TurboMind Tokenizer by @zhyncs in #1917
- fix: append _stats when size > 0 by @zhyncs in #1809
- refactor: update awq linear and rm legacy by @zhyncs in #1940
- feat: add gpu topo for check_env by @zhyncs in #1944
- fix transformers version check for InternVL2 by @zhyncs in #1952
- Upgrade gradio by @AllentDan in #1930
- refactor sampling layer setup by @irexyc in #1912
- Add exception handler to image encoder by @irexyc in #2010
- Avoid the same session id for openai endpoint by @AllentDan in #1995
🐞 Bug fixes
- Fix error link reference by @zihaomu in #1881
- Fix internlm-xcomposer2-vl awq search scale by @AllentDan in #1890
- fix SamplingDecodeTest and SamplingDecodeTest2 unittest failure by @zhyncs in #1874
- Fix smem size for fused split-kv reduction by @lzhangzz in #1909
- fix llama3 chat template by @AllentDan in #1956
- fix: set PYTHONIOENCODING to UTF-8 before start tritonserver by @zhyncs in #1971
- Fix internvl2-40b model export by @irexyc in #1979
- fix logprobs by @irexyc in #1968
- fix unexpected argument error when deploying "cogvlm-chat-hf" by @AllentDan in #1982
- fix mixtral and mistral cache_position by @zhyncs in #1941
- Fix the session_len assignment logic by @lvhan028 in #2007
- Fix logprobs openai api by @irexyc in #1985
- Fix internvl2-40b awq inference by @AllentDan in #2023
- Fix side effect of #1995 by @AllentDan in #2033
📚 Documentations
- docs: update faq for turbomind so not found by @zhyncs in #1877
- [Doc]: Change to sphinx-book-theme in readthedocs by @RunningLeon in #1880
- docs: update compatibility section in README by @zhyncs in #1946
- docs: update kv quant doc by @zhyncs in #1977
- docs: sync the core features in README to index.rst by @zhyncs in #1988
- Fix table rendering for readthedocs by @RunningLeon in #1998
- docs: fix Ada compatibility by @zhyncs in #2016
- update xcomposer2d5 docs by @irexyc in #2037
🌐 Other
- [ci] add internlm2.5 models into testcase by @zhulinJulia24 in #1928
- bump version to v0.5.1 by @lvhan028 in #2022
Full Changelog: v0.5.0...v0.5.1