Flash attention #931
base: main
Conversation
Hi @oliverdutton! Really cool contribution. Mind if we try adding it to ColabFold? We already have fused attention and bfloat16 integrated into the monomer model. It will be interesting to try flash attention as well.
TemplateEmbedding uses attention with a batch-dim broadcast, which wasn't supported: `mask = template_mask[None, None, None, :]`
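For illustration only (shapes here are hypothetical, not the exact ones in TemplateEmbedding), the broadcast in question relies on standard JAX/NumPy rules: a mask with singleton leading dims is applied across a batched logits tensor.

```python
import jax.numpy as jnp

# Hypothetical shapes for illustration: (batch, heads, queries, keys) logits
# and a 1-D mask over the key dimension.
batch, heads, n_q, n_k = 4, 8, 64, 64
logits = jnp.zeros((batch, heads, n_q, n_k))
template_mask = jnp.ones((n_k,))

mask = template_mask[None, None, None, :]    # shape (1, 1, 1, n_k)
masked = jnp.where(mask > 0, logits, -1e9)   # broadcasts over batch/heads/queries
```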
@sokrypton Of course, I've made a pull request in ColabDesign with it (sokrypton/ColabDesign#173)
Removes all OOB indexing. Previously I allowed out-of-bounds loads and fixed them with masks in qk. I've seen NaNs appear which disappear with minor variations in the MHLO. This commit removes all OOB indexing.
Pre d4516d8 I find transient NaN behaviour on shapes which don't evenly divide the block size (so OOB loading). Gist to reproduce the problem:

```python
import jax
from jax import numpy as jnp
from alphafold.model import model

key = jax.random.PRNGKey(42)
nrepeats = 100
for nres in range(128, 256):
    print(nres)
    for i in range(nrepeats):
        q, k, v = jax.random.uniform(key, (3, 1024, nres, 8, 32))
        f = jax.jit(model.modules.Attention.flash_kernel, static_argnames=(
            'return_residual', 'block_q', 'block_k', 'num_warps',
            'num_stages', 'grid', 'interpret', 'debug'))
        assert jnp.isfinite(f(q, k, v)).all(), f"Failed with {nres} on run {i}"
```

Post d4516d8 the transient NaN behaviour disappears, so I hope this will now always be NaN free.
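One general way to remove OOB loads entirely (a sketch of the idea only; not necessarily what d4516d8 does) is to pad the key/value sequence up to a multiple of the block size outside the kernel and bias away the padded positions before the softmax:

```python
import jax.numpy as jnp

def pad_to_block(x, block, axis):
    """Zero-pad `axis` of x up to the next multiple of `block`."""
    pad = (-x.shape[axis]) % block
    widths = [(0, 0)] * x.ndim
    widths[axis] = (0, pad)
    return jnp.pad(x, widths)

# Illustrative shapes: (batch, seq, heads, head_dim) with a key block of 128.
block_k = 128
k = jnp.ones((1, 200, 8, 32))
v = jnp.ones((1, 200, 8, 32))
n_k = k.shape[1]
k_pad = pad_to_block(k, block_k, axis=1)
v_pad = pad_to_block(v, block_k, axis=1)

# Padded key positions get a large negative bias so they carry ~zero softmax
# weight and never contribute uninitialised values to the output.
key_idx = jnp.arange(k_pad.shape[1])
logits_bias = jnp.where(key_idx < n_k, 0.0, -1e30)
```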
Thank you very much, this improvement is very useful. I am using an RTX 3090 to predict a 3645 aa heterotetramer. With this improvement, the prediction time for a single model has decreased from 59,000 seconds to 43,000 seconds (while still exceeding the GPU memory limit).
Flash attention implemented using Pallas to reduce runtime and memory usage. Added on an opt-in basis in the global config.
For a 759-residue protein and model_5, this drops peak memory consumption to 5 GB without minibatching and reduces runtime 2.3x on an A100 (15.2 $\rightarrow$ 6.5 seconds, with minibatching of 256 for non-flash attention to avoid OOM).
Here's a Colab link showing the runtime improvement and no significant change in prediction output on visual inspection.
When combined with #930 (bfloat16 support for monomer models), peak memory drops to only 2.7 GB and runtime to 5.6 seconds (a 2.7x speedup relative to non-flash, float32).
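For context, the memory saving comes from the standard flash-attention recurrence: keys and values are processed in blocks with a running (online) softmax, so the full [q_len, k_len] attention matrix is never materialised. A minimal pure-JAX sketch of that recurrence (illustrative only; the PR's Pallas kernel fuses this on-chip):

```python
import jax
import jax.numpy as jnp

def flash_attention_reference(q, k, v, block_k=128):
    """Blockwise attention with an online softmax (no full QK^T matrix).

    q: [q_len, d]; k, v: [k_len, d]. For simplicity, k_len must be a
    multiple of block_k here.
    """
    q_len, d = q.shape
    scale = 1.0 / jnp.sqrt(d)
    k_blocks = k.reshape(-1, block_k, d)
    v_blocks = v.reshape(-1, block_k, d)

    def step(carry, kv):
        m, l, acc = carry                        # running max, normaliser, output
        k_b, v_b = kv
        s = (q @ k_b.T) * scale                  # [q_len, block_k] logits for this block
        m_new = jnp.maximum(m, s.max(axis=-1))
        p = jnp.exp(s - m_new[:, None])          # unnormalised weights for this block
        correction = jnp.exp(m - m_new)          # rescale previous partial sums
        l_new = l * correction + p.sum(axis=-1)
        acc_new = acc * correction[:, None] + p @ v_b
        return (m_new, l_new, acc_new), None

    init = (jnp.full((q_len,), -jnp.inf),
            jnp.zeros((q_len,)),
            jnp.zeros((q_len, d)))
    (m, l, acc), _ = jax.lax.scan(step, init, (k_blocks, v_blocks))
    return acc / l[:, None]
```

A quick sanity check is to compare this against `jax.nn.softmax(q @ k.T / jnp.sqrt(d)) @ v` on small shapes.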
Notes:
- Key variations from a reference flash attention kernel: there are guards against the kernel being called for sequence lengths shorter than the block sizes specified for q and k, which exit to the reference kernel.
- I haven't done correctness checks with multimer models; I would do so if there is a positive response to this pull request.
- I'm not yet certain of the numerical stability of the implementation with bfloat16.
- I can switch out exp and log for exp2 and log2 for a small reduction in runtime; this leads to slightly different predictions, but I believe testing would show equivalent error in structure prediction (a short sketch of the identity follows below).
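For the exp2/log2 variant mentioned in the last note, the change relies only on the identity exp(x) = 2^(x · log2 e), so the constant can be folded into the existing 1/sqrt(d) query scaling; a minimal sketch (illustrative, not the PR's code):

```python
import jax.numpy as jnp

LOG2_E = 1.4426950408889634  # log2(e)

x = jnp.linspace(-5.0, 5.0, 11)
# exp(x) == 2**(x * log2(e)), up to floating-point rounding.
assert jnp.allclose(jnp.exp(x), jnp.exp2(x * LOG2_E))

# In a kernel, multiplying q by LOG2_E / sqrt(d) instead of 1 / sqrt(d)
# lets exp2 replace exp with no extra per-element multiplies; outputs
# differ only at the level of rounding error.
```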