Fixed 2021.emnlp-main.523 abstract (acl-org#1649)
mjpost authored Nov 10, 2021
1 parent 5c0a34b commit 2f8af39
Showing 1 changed file with 1 addition and 1 deletion.
2 changes: 1 addition & 1 deletion data/xml/2021.emnlp.xml
@@ -6151,7 +6151,7 @@
<author><first>Ivan</first><last>Titov</last></author>
<author><first>Rico</first><last>Sennrich</last></author>
<pages>6507–6520</pages>
-<abstract>Recently, it has been argued that encoder-decoder models can be made more interpretable by replacing the softmax function in the attention with its sparse variants. In this work, we introduce a novel, simple method for achieving sparsity in attention: we replace the softmax activation with a , and show that sparsity naturally emerges from such a formulation. Training stability is achieved with layer normalization with either a specialized initialization or an additional gating function. Our model, which we call Rectified Linear Attention (ReLA), is easy to implement and more efficient than previously proposed sparse attention mechanisms. We apply ReLA to the Transformer and conduct experiments on five machine translation tasks. ReLA achieves translation performance comparable to several strong baselines, with training and decoding speed similar to that of the vanilla attention. Our analysis shows that ReLA delivers high sparsity rate and head diversity, and the induced cross attention achieves better accuracy with respect to source-target word alignment than recent sparsified softmax-based models. Intriguingly, ReLA heads also learn to attend to nothing (i.e. ‘switch off’) for some queries, which is not possible with sparsified softmax alternatives.</abstract>
+<abstract>Recently, it has been argued that encoder-decoder models can be made more interpretable by replacing the softmax function in the attention with its sparse variants. In this work, we introduce a novel, simple method for achieving sparsity in attention: we replace the softmax activation with a ReLU, and show that sparsity naturally emerges from such a formulation. Training stability is achieved with layer normalization with either a specialized initialization or an additional gating function. Our model, which we call Rectified Linear Attention (ReLA), is easy to implement and more efficient than previously proposed sparse attention mechanisms. We apply ReLA to the Transformer and conduct experiments on five machine translation tasks. ReLA achieves translation performance comparable to several strong baselines, with training and decoding speed similar to that of the vanilla attention. Our analysis shows that ReLA delivers high sparsity rate and head diversity, and the induced cross attention achieves better accuracy with respect to source-target word alignment than recent sparsified softmax-based models. Intriguingly, ReLA heads also learn to attend to nothing (i.e. ‘switch off’) for some queries, which is not possible with sparsified softmax alternatives.</abstract>
<url hash="543a4f27">2021.emnlp-main.523</url>
<bibkey>zhang-etal-2021-sparse</bibkey>
</paper>