contributors: @GitYCC
- gMLP, based on MLPs with gating; its core component is the Spatial Gating Unit (SGU), sketched after this list
- In terms of computation cost, SGU has $n^2e/2$ multiply-adds, which is comparable to the $2n^2d$ of dot-product self-attention.
- Our comparisons show that self-attention is not critical for Vision Transformers, as gMLP can achieve the same accuracy.
- For BERT, our model achieves parity with Transformers on pretraining perplexity and is better on some downstream NLP tasks. On finetuning tasks where gMLP performs worse, making the gMLP model substantially larger can close the gap with Transformers.
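
Below is a minimal PyTorch sketch of a gMLP block and its Spatial Gating Unit, following the description above: split the channels in half, normalize one half, apply a learned linear projection across the sequence (spatial) dimension, and use the result to gate the other half. The class names, hyperparameter names (`d_model`, `d_ffn`, `seq_len`), and tensor layout are illustrative assumptions, not the paper's official implementation; the near-identity initialization of the spatial projection follows the paper's description.

```python
import torch
import torch.nn as nn


class SpatialGatingUnit(nn.Module):
    """Split channels in half and gate one half with a learned
    linear projection over the sequence (spatial) dimension."""

    def __init__(self, d_ffn, seq_len):
        super().__init__()
        self.norm = nn.LayerNorm(d_ffn // 2)
        # Spatial projection: mixes information across the n token positions.
        self.spatial_proj = nn.Linear(seq_len, seq_len)
        # Initialize near identity (weights ~ 0, bias = 1) so each
        # gMLP block starts close to a plain FFN, as described in the paper.
        nn.init.constant_(self.spatial_proj.weight, 0.0)
        nn.init.constant_(self.spatial_proj.bias, 1.0)

    def forward(self, x):              # x: (batch, seq_len, d_ffn)
        u, v = x.chunk(2, dim=-1)      # each: (batch, seq_len, d_ffn/2)
        v = self.norm(v)
        # Project along the sequence dimension, then restore the layout.
        v = self.spatial_proj(v.transpose(1, 2)).transpose(1, 2)
        return u * v                   # element-wise gating


class GMLPBlock(nn.Module):
    """Channel projection -> GELU -> SGU -> channel projection, with residual."""

    def __init__(self, d_model, d_ffn, seq_len):
        super().__init__()
        self.norm = nn.LayerNorm(d_model)
        self.channel_proj_in = nn.Linear(d_model, d_ffn)
        self.act = nn.GELU()
        self.sgu = SpatialGatingUnit(d_ffn, seq_len)
        self.channel_proj_out = nn.Linear(d_ffn // 2, d_model)

    def forward(self, x):              # x: (batch, seq_len, d_model)
        shortcut = x
        x = self.norm(x)
        x = self.act(self.channel_proj_in(x))
        x = self.sgu(x)
        x = self.channel_proj_out(x)
        return x + shortcut


# Toy usage with n = seq_len, d = d_model, e = d_ffn (illustrative sizes).
x = torch.randn(2, 128, 256)                              # (batch, n, d)
block = GMLPBlock(d_model=256, d_ffn=512, seq_len=128)
print(block(x).shape)                                     # torch.Size([2, 128, 256])
```

The gating-specific cost comes from the $n \times n$ spatial projection applied to the $e/2$ gated channels, i.e. roughly $n^2e/2$ multiply-adds per sequence, which is the figure compared against the $2n^2d$ of dot-product self-attention above.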