Optimization
(ICLR 2020) RAdam: On the Variance of the Adaptive Learning Rate and Beyond [PDF] [Code]
(EMNLP 2020) Admin: Understanding the Difficulty of Training Transformers [PDF] [Code]
(Other models) TorchScope