Hi, great work. I noticed from another issue that Flora is applied to Adafactor and compared against other methods applied to Adam (in the figure below).
Wouldn't it be more reasonable to compare all methods applied to the same optimizer style, such as Adam?
Currently it is hard to separate Flora's contribution to memory savings from that of Adafactor alone.
Do you have any comments regarding the training speed of models using Flora? How does the training time with Flora compare against the standard Adafactor optimizer?
Best,
Thank you for asking!

1. In the paper, all of our main baselines used Adafactor as the default optimizer, so they are already memory-efficient. However, since many people use Adam more often in practice, the figure you mention uses Adam to demonstrate the overall effect.
2. Training is slightly slower because we need to generate and apply random projections in our implementation. The overhead is usually negligible, though, and I would expect that writing dedicated CUDA kernels could largely accelerate Flora.
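To make the trade-off concrete, here is a minimal NumPy sketch of the seeded random-projection idea: the projection matrix is regenerated from a stored seed rather than kept in memory, which is where the memory saving comes from, and the per-step regeneration is the speed overhead mentioned above. This is an illustrative toy, not Flora's actual implementation; the function names and shapes are assumptions.

```python
import numpy as np

def project_grad(grad, rank, seed):
    """Down-project an (m, n) gradient to (rank, n) with a seeded
    Gaussian random matrix. Only the seed needs to be stored; the
    projection can be regenerated on demand."""
    m, _ = grad.shape
    rng = np.random.default_rng(seed)
    # Scale by 1/sqrt(rank) so the up-projection is unbiased in expectation.
    proj = rng.standard_normal((rank, m)) / np.sqrt(rank)
    return proj @ grad

def reconstruct_grad(compressed, m, rank, seed):
    """Regenerate the same projection from the seed and up-project
    back to the full (m, n) shape (an approximate reconstruction)."""
    rng = np.random.default_rng(seed)
    proj = rng.standard_normal((rank, m)) / np.sqrt(rank)
    return proj.T @ compressed

# Example: a 512x256 "gradient" compressed to rank 16,
# a 32x reduction in the stored accumulator size.
grad = np.random.default_rng(0).standard_normal((512, 256))
small = project_grad(grad, rank=16, seed=42)
approx = reconstruct_grad(small, m=512, rank=16, seed=42)
```

The cost of regenerating `proj` every step is the kind of overhead a dedicated CUDA kernel could hide.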