
Why avoid gradient accumulation? #69

Open
RonanKMcGovern opened this issue May 30, 2024 · 5 comments

Comments

@RonanKMcGovern

RonanKMcGovern commented May 30, 2024

There is this quote:


> **Gradient accumulation** simulates a larger batch size than the hardware can support and therefore does not provide any throughput benefits. It should generally be avoided in applied work.

For large GPUs and multi-GPU setups, I can see this making sense, since you can run batches of 32 and don't need accumulation.

But on smaller GPUs, gradient accumulation can be important because averaging over the virtual batch stabilises training.

Am I mistaken or missing something?
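
For reference, this is the kind of loop I mean (a minimal sketch assuming PyTorch; the model, data, and `accum_steps` below are placeholders, not from this repo):

```python
# Minimal gradient-accumulation sketch. Effective batch = micro_batch * accum_steps.
import torch
import torch.nn as nn

model = nn.Linear(16, 2)                        # stand-in model
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3)
loss_fn = nn.CrossEntropyLoss()

accum_steps = 4                                 # virtual batch = 4 micro-batches
micro_batches = [(torch.randn(8, 16), torch.randint(0, 2, (8,)))
                 for _ in range(accum_steps)]   # stand-in data

optimizer.zero_grad()
for step, (x, y) in enumerate(micro_batches, start=1):
    # Scale the loss so gradients average over the virtual batch,
    # not just the micro-batch.
    loss = loss_fn(model(x), y) / accum_steps
    loss.backward()                             # gradients accumulate in .grad
    if step % accum_steps == 0:
        optimizer.step()                        # one update per virtual batch
        optimizer.zero_grad()
```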

@DimitrisMantas

A lot of architectures have BN layers which don't work properly unless they're actually backpropagated through, I think.

@RonanKMcGovern
Author

RonanKMcGovern commented Jun 11, 2024 via email

@DimitrisMantas

DimitrisMantas commented Jun 11, 2024

Batch normalization. Essentially, BN blocks keep track of the running batch mean and standard deviation and use them to normalize their inputs.

These parameters are non-trainable and are updated with each minibatch the blocks receive. However, because the number of minibatches per epoch is not the same as the number of backpropagation (optimizer) steps when using gradient accumulation, BN blocks end up computing "incorrect" statistics. The problem is further magnified by their trainable (affine) parameters still being updated according to the accumulated batches. Basically, the batches and their descriptive statistics become "unsynchronized".

BN blocks are very popular in computer vision tasks, and unfortunately, I'm not too familiar with much else. However, I believe that transformer blocks typically use layer normalization, which does not depend on batch size, so you should be safe.
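
Here's a minimal sketch of what I mean (assuming PyTorch's `nn.BatchNorm1d`; this isn't from the repo). The running statistics update on every forward pass, i.e. per micro-batch, regardless of how many micro-batches you accumulate before stepping the optimizer:

```python
# BatchNorm running statistics are updated per forward pass (per micro-batch),
# so under gradient accumulation they reflect the micro-batch size, not the
# effective (virtual) batch size.
import torch
import torch.nn as nn

bn = nn.BatchNorm1d(4)                 # module defaults to training mode
print(bn.num_batches_tracked)          # tensor(0)

accum_steps = 4
for _ in range(accum_steps):
    x = torch.randn(8, 4)              # micro-batch of 8
    bn(x)                              # running mean/var updated here, before any optimizer step

# After one "virtual batch" of 32 samples, BN has tracked 4 separate batches of 8.
print(bn.num_batches_tracked)          # tensor(4)
```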

@DimitrisMantas

By the way, large batch sizes are just as "dangerous" as small ones due to potential oversmoothing of the gradient landscape. It's kind of a "pick your poison" situation.

@RonanKMcGovern
Author

RonanKMcGovern commented Jun 11, 2024 via email
