Apply the attention mask in all decoding steps (LM inference) #2532

l-k-11235 · 2023-12-04T17:09:42Z

I noticed that LLM outputs are degraded when batch_size is greater than 1 and the examples in the batch have different lengths. I think this shows that the attention mask for padding tokens needs to be applied in all decoding steps, not just the first one (at least with the recently implemented left padding).

The fix works for "classical attention", but not for flash2 attention or SDPA.
Quantization in the attention layer must be deactivated when batch_size > 1.
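
To illustrate why the padding mask matters at every step, here is a minimal sketch in pure Python. It is not the OpenNMT-py code: `attend`, `decode`, and all the toy vectors are hypothetical names and values chosen for illustration, and it assumes a single attention head with a key/value cache that grows by one entry per decoding step. With left padding, the pad positions stay in the cache, so every new query can attend to them unless they are masked again.

```python
import math


def attend(query, keys, values, pad_mask=None):
    """Scaled dot-product attention for one query over cached keys/values.

    pad_mask[i] is True when cached position i is a padding token.
    Assumes at least one position is not padding.
    """
    scale = math.sqrt(len(query))
    scores = [sum(q * k for q, k in zip(query, key)) / scale for key in keys]
    if pad_mask is not None:
        # Padding positions get -inf so their softmax weight is exactly 0.
        scores = [-math.inf if is_pad else s for s, is_pad in zip(scores, pad_mask)]
    top = max(scores)
    weights = [math.exp(s - top) for s in scores]
    total = sum(weights)
    return [
        sum(w * v[d] for w, v in zip(weights, values)) / total
        for d in range(len(values[0]))
    ]


def decode(prompt_k, prompt_v, steps, pad_mask=None):
    """Run several decoding steps, re-applying the pad mask at each one."""
    keys, values = list(prompt_k), list(prompt_v)
    mask = None if pad_mask is None else list(pad_mask)
    outputs = []
    for query, new_k, new_v in steps:
        keys.append(new_k)
        values.append(new_v)
        if mask is not None:
            mask.append(False)  # generated tokens are never padding
        outputs.append(attend(query, keys, values, mask))
    return outputs


# Toy data: a 2-token prompt, and the same prompt left-padded to length 4.
real_k = [[1.0, 0.0], [0.0, 1.0]]
real_v = [[1.0, 2.0], [3.0, 4.0]]
pad_k = [[0.5, 0.5], [0.5, 0.5]]
pad_v = [[9.0, 9.0], [9.0, 9.0]]
steps = [
    ([0.3, 0.7], [0.2, 0.1], [5.0, 6.0]),
    ([0.6, 0.4], [0.4, 0.9], [7.0, 8.0]),
]

reference = decode(real_k, real_v, steps)  # what batch_size = 1 would compute
masked = decode(pad_k + real_k, pad_v + real_v, steps,
                pad_mask=[True, True, False, False])
unmasked = decode(pad_k + real_k, pad_v + real_v, steps)
```

In this sketch, `masked` matches `reference` at every step, while `unmasked` drifts because the pad values leak into each attention output. This is the kind of degradation described above for heterogeneous batches.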

l-k-11235 added 3 commits November 30, 2023 16:15

wip

d7b5f56

fixed attention mask

be6adab

some code cleaning

4644bc8

l-k-11235 changed the title ~~provide a fix for attention mask~~ Apply the attention mask in all decoding steps Dec 5, 2023

l-k-11235 added 11 commits December 5, 2023 09:33

restore and adapt previous dec_mask

c94d8c9

wip

3116271

wip - apply mask on context before final_linear + debugging

faae43d

wip - apply offset in rotary embeddings

40caae8

wip

a805ad9

apply pad_mask for all decoding steps when batch_size is greater than 1 - works for 'classical attention'

f3ed229

wip

829a861

handle finished hypotheses

8333d04

wip

50c0754

use key_pad_mask in transformer LM attention layers to handle beam_size > 1 thanks to map_state

b50a7e1

differentiate between scaled-dot and scaled-dot-flash values for self_attn_type opt

7c49902

l-k-11235 changed the title ~~Apply the attention mask in all decoding steps~~ Apply the attention mask in all decoding steps (LM inference) Dec 14, 2023

l-k-11235 added 2 commits December 15, 2023 09:53

some code cleaning

1bc3212

added warning in quickstart.md

dc34150

l-k-11235 force-pushed the masked_attn_test branch from cbee77a to dc34150 Compare December 15, 2023 10:21

l-k-11235 added 2 commits December 15, 2023 11:25

removed empty lines

3dcda9e

restored proper file permissions

bbc28a8

vince62s merged commit f01bea1 into OpenNMT:master Dec 15, 2023
2 checks passed
