- Fixed an interaction between the fused QKV projection and the key-value cache that caused excessive memory usage.
- Disabled cache in `ppl.py`; it isn't used there, and disabling it saves memory.
- Added more benchmarks to README.
- Fixed a bug in `generate.py`; the generated sequence length was not calculated correctly.
- Added support for groupsize.
- Note: fuse_mlp is not recommended for groupsize != -1. It is now disabled automatically during loading if the model has grouping, unless fuse_mlp is explicitly set to True. This is because the current kernel implementation is slower than the naive implementation for groupsize != -1.
- Added a warning if `act_order` and `groupsize` are used together. They are not compatible.
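
The loading behavior described in the last two entries could be sketched roughly as follows. This is an illustration only, not the project's actual code: the helper names `resolve_fuse_mlp` and `check_quant_options` are hypothetical, `None` is assumed to mean "not set by the user", and fusing by default when there is no grouping is also an assumption.

```python
import warnings
from typing import Optional


def resolve_fuse_mlp(fuse_mlp: Optional[bool], groupsize: int) -> bool:
    """Hypothetical helper: decide whether to use the fused MLP at load time.

    Assumes fuse_mlp=None means the user did not set it explicitly.
    """
    if fuse_mlp is not None:
        # An explicit user choice always wins, even with grouping.
        return fuse_mlp
    # Assumed default: fuse only when the model has no grouping (groupsize == -1).
    return groupsize == -1


def check_quant_options(act_order: bool, groupsize: int) -> None:
    """Hypothetical helper: warn when act_order is combined with grouping."""
    if act_order and groupsize != -1:
        warnings.warn(
            "act_order and groupsize are not compatible.",
            UserWarning,
        )


if __name__ == "__main__":
    print(resolve_fuse_mlp(None, 128))   # grouped model, not set explicitly
    print(resolve_fuse_mlp(True, 128))   # grouped model, explicitly enabled
    check_quant_options(act_order=True, groupsize=128)  # emits a UserWarning
```

Using `None` as the "unset" value is one way to tell an explicit `True` apart from a default, which the automatic disabling needs.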