What's Changed
New OpenAI API-compatible endpoint via `model.serve(host, port)`.
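A minimal sketch of starting the new endpoint, assuming `GPTQModel.load` accepts a quantized model id (the model id and port below are illustrative placeholders, not part of this release note):

```python
from gptqmodel import GPTQModel

# Hypothetical quantized model id, for illustration only
model = GPTQModel.load("ModelCloud/Llama-3.2-1B-Instruct-gptqmodel-4bit")

# Start the OpenAI-compatible server using the new serve() API
model.serve(host="0.0.0.0", port=8000)
```

Once running, any OpenAI client should be able to point at it, e.g. `OpenAI(base_url="http://localhost:8000/v1", api_key="none")`, assuming the usual `/v1` route prefix.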
⚡ Auto-enable flash-attention2 for inference.
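A sketch of the auto-enable behavior, assuming `GPTQModel.load` takes a `device` argument and that the `flash-attn` package is installed (both are assumptions; the release note only states that flash-attention2 is enabled automatically, and only on CUDA devices):

```python
from gptqmodel import GPTQModel

# No attn_implementation flag needed: when the target device is CUDA and
# flash-attn is installed, flash-attention2 is selected automatically at
# load time (assumed `device` parameter; hypothetical model id).
model = GPTQModel.load(
    "ModelCloud/Llama-3.2-1B-Instruct-gptqmodel-4bit",
    device="cuda:0",
)
```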
Fixed `sym=False` loading regression.
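A quantize-then-reload sketch of the fixed path, assuming the `QuantizeConfig`, `quantize`, and `save` APIs behave as in the library's standard workflow (calibration text, model ids, and paths are placeholders):

```python
from gptqmodel import GPTQModel, QuantizeConfig

# Asymmetric quantization: sym=False checkpoints previously failed to
# load correctly; this release fixes that regression.
quant_config = QuantizeConfig(bits=4, group_size=128, sym=False)

calibration = [
    "gptqmodel is an easy-to-use model quantization library.",
    "the quick brown fox jumps over the lazy dog.",
]

model = GPTQModel.load("meta-llama/Llama-3.2-1B-Instruct", quant_config)
model.quantize(calibration)
model.save("Llama-3.2-1B-4bit-asym")

# Reloading a sym=False checkpoint now works as expected
model = GPTQModel.load("Llama-3.2-1B-4bit-asym")
```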
- code opt by @CL-ModelCloud in #1038
- fix marlin validate rocm & do validate() if backend not AUTO by @CSY-ModelCloud in #1040
- add global rocm check by @CSY-ModelCloud in #1043
- [FIX] pass sym to make_quant by @LRL-ModelCloud in #1046
- enable flash attn for loading quantized by @CSY-ModelCloud in #1045
- add flash_attn2 test by @CSY-ModelCloud in #1047
- enable flash_attention only when device is cuda by @CSY-ModelCloud in #1050
- move flash attn test to correct folder by @CSY-ModelCloud in #1052
- Expose openai server api by @CL-ModelCloud in #1048
- update openai server by @CL-ModelCloud in #1058
- don't download whl for xpu env by @CSY-ModelCloud in #1059
- remove build tag for normal release by @CSY-ModelCloud in #1063
- disable flash attn 2 for internlm by @CSY-ModelCloud in #1065
Full Changelog: v1.6.0...v1.6.1