Releases: facebookresearch/xformers
Performance improvements for `memory_efficient_attention`
[0.0.20] - 2023-05-23
Improved
- fMHA/cutlass (backward): Massive performance improvements when `batch_size * num_heads` is low (10x+)
- fMHA/cutlass: Further performance improvements for both the forward & backward kernels
- fMHA (backward): Now dispatching to cutlass when `embed_dim > 64`
- fMHA: Updated Flash-Attention to `v1.0.5`
Added
- fMHA now runs on H100 (support is experimental)
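For orientation, here is a minimal sketch of calling `xformers.ops.memory_efficient_attention`, the operator these fMHA notes refer to; the tensor shapes, sizes, and fp16/CUDA choices below are illustrative assumptions, not part of the release.

```python
# Minimal usage sketch; inputs use the [batch, seq_len, num_heads, head_dim]
# layout, and the concrete sizes here are arbitrary examples.
import torch
import xformers.ops as xops

q = torch.randn(2, 1024, 8, 64, device="cuda", dtype=torch.float16)
k = torch.randn(2, 1024, 8, 64, device="cuda", dtype=torch.float16)
v = torch.randn(2, 1024, 8, 64, device="cuda", dtype=torch.float16)

# Dispatches automatically to the best available backend
# (cutlass, Flash-Attention, ...); H100 support is experimental per this release.
out = xops.memory_efficient_attention(q, k, v)  # same shape as q
```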
Bugfixes & perf improvement for `memory_efficient_attention`
[0.0.19] - 2023-04-28
Added
- Display `nvcc` version used to compile `xformers` in `python -m xformers.info`
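A convenience sketch for running that report from a script; the canonical form is simply invoking `python -m xformers.info` from a shell.

```python
# Run the xformers info report (which now includes the nvcc version
# used to build the wheel) from inside a Python script.
import subprocess
import sys

subprocess.run([sys.executable, "-m", "xformers.info"], check=True)
```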
Fixed
- Fixed performance regression with `nvcc>11.6` (#712)
- fMHA/cutlass: Fixed `nan` in the output when using a `torch.Tensor` with `-inf` prefixes as `attn_bias` (#722) - see the sketch after this list
- fMHA/cutlass: Fixed `nan` in the output when the sequence length is larger than `2 ** 15` (#719)
- fMHA/cutlass: Significant performance improvements (up to 2x) for both the forward pass and backward pass
- fMHA/cutlass: The kernels are now deterministic
- fMHA/cutlass: Fixed backward pass correctness when using dropout (#724)
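Here is what such a tensor `attn_bias` can look like in practice. This is a hedged sketch: the `[batch, heads, q_len, kv_len]` bias layout, the masked prefix length, and all sizes are assumptions for illustration, and alignment/broadcasting requirements for tensor biases vary between xformers versions.

```python
# Additive attention bias passed as a plain torch.Tensor: entries set to
# -inf mask out the corresponding key positions ("-inf prefixes").
import torch
import xformers.ops as xops

B, M, H, K = 2, 256, 8, 64
q = torch.randn(B, M, H, K, device="cuda", dtype=torch.float16)
k = torch.randn(B, M, H, K, device="cuda", dtype=torch.float16)
v = torch.randn(B, M, H, K, device="cuda", dtype=torch.float16)

# The bias is added to the attention logits before the softmax;
# here the first 16 key positions are masked for every query.
bias = torch.zeros(B, H, M, M, device="cuda", dtype=torch.float16)
bias[:, :, :, :16] = float("-inf")

out = xops.memory_efficient_attention(q, k, v, attn_bias=bias)
```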
Open sourcing indexing operators
Open-sources experimental indexing ops. Pull Request resolved: https://github.com/fairinternal/xformers/pull/536
Binaries for PT 2.0, mem-eff with bias & dropout, and varying seqlen
This release brings some improvements to the `memory_efficient_attention` operator (a short sketch of the bias & dropout usage named in the title follows below). Pip wheels now target PyTorch 2.0.0 - conda builds are available for PT 2.0.0, 1.13.1 and 1.12.1.
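A minimal sketch of the bias & dropout combination from this release's title, assuming a causal mask via `xformers.ops.LowerTriangularMask` and an arbitrary dropout probability; shapes and values are illustrative only.

```python
# Causal attention bias plus attention dropout with the
# memory-efficient attention operator.
import torch
import xformers.ops as xops

q = torch.randn(1, 512, 8, 64, device="cuda", dtype=torch.float16)
k = torch.randn(1, 512, 8, 64, device="cuda", dtype=torch.float16)
v = torch.randn(1, 512, 8, 64, device="cuda", dtype=torch.float16)

out = xops.memory_efficient_attention(
    q, k, v,
    attn_bias=xops.LowerTriangularMask(),  # causal masking
    p=0.1,                                 # attention dropout probability
)
```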
Fixed
- fMHA: Fixed BW pass on Sm86/Sm89 GPUs when `K > 64` (RTX 3090, RTX 4090, A6000, ...) [#631]
Added
v0.0.17rc482
Fix conda with GLIBC (attempt 2). Pull Request resolved: https://github.com/fairinternal/xformers/pull/510
v0.0.17rc481
Fix CI - anaconda upload + disable fairinternal wheels. Pull Request resolved: https://github.com/fairinternal/xformers/pull/505
Pip wheels, improvements to mem-eff and more
This release contains many improvements to `memory_efficient_attention`, along with pip wheels now available on Windows and Linux!
New Features
- Added support for pip wheels [#588, #573, #534, #523, ...] big thanks to @AbdBarho!
- fMHA: Added Triton operator for forward pass from Flash-Attention authored by @TriDao, will be automatically used on A100 when compatible
- fMHA: Added `xformers.ops.memory_efficient_attention_forward`, `xformers.ops.memory_efficient_attention_forward_requires_grad`, `xformers.ops.memory_efficient_attention_backward` for power-users who write custom autograd functions [#560]
- fMHA: Support for custom scaling for the CUTLASS-based kernel [#530] - contribution from @comaniac (see the sketch after this list)
- fMHA: Separate each operator into forward and backward operators. It's now possible to use any combination of forward+backward (for instance Triton forward and Flash-Attention backward) [#560]
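A hedged illustration of the custom-scaling option and the forward-only entry point named above; the shapes and the `scale` value are arbitrary, and keyword defaults are assumptions that may differ between versions.

```python
# Overriding the default softmax scale (normally 1 / sqrt(head_dim)),
# and calling the inference-only forward entry point directly.
import torch
import xformers.ops as xops

q = torch.randn(4, 128, 4, 32, device="cuda", dtype=torch.float16)
k = torch.randn(4, 128, 4, 32, device="cuda", dtype=torch.float16)
v = torch.randn(4, 128, 4, 32, device="cuda", dtype=torch.float16)

# Custom scaling of the attention logits for the CUTLASS-based kernel.
out = xops.memory_efficient_attention(q, k, v, scale=0.1)

# Forward-only variant for power users building custom autograd functions.
out_no_grad = xops.memory_efficient_attention_forward(q, k, v)
```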
Improvements
- Strip lineinfo from binaries, reducing the binary size [#549]
- fMHA: Stricter inputs validation to avoid CUDA errors for unsupported inputs [#592]
- fMHA/Flash-Attention: Updated to Dao-AILab/flash-attention@a1f49a2 with multiple changes from @TriDao that make the operator up to 20% faster
- Updated triton dependency [#418]
Bug fixes
- Fixed compatibility with Python 3.7 [#541] - thanks to @susumuota
- fMHA: Fixed strides for QKV gradients for cutlass attention [#535]
- fMHA/Flash-Attention: Fixed backward pass wrapper, where non-contiguous gradients could give the wrong result [#548]
v0.0.13
Lots of improvements and bug fixes around the memory efficient attention.
v0.0.12
[0.0.12] - 2022-08-08
Fixed
- Removed duplicated biases in the FusedMLP layers [#317]
- Rotary embeddings respecting input types [#326]
- Poolformer style instantiating useless projection layers [#349]
- Fix layer position not being properly tracked, causing extra layernorms for programmatic xformers [#348]
- Pass use_triton flag to LayerNorm module [#336]
Added
- Four blocksparsity layouts from DeepSpeed [#320]
- Support several initialization options [#312]
- Conv2DFeedforward feedforward part [#321]
- VisualAttention [#329]
- Automatic blocksparse for causal attention [#334]
- Better hierarchical transformer generation [#345]
- Fused operations with AOTAutograd/NVFuser, integration into MLP [#357]
- Refactor LRA code to use Pytorch Lightning [#343]