Why avoid gradient accumulation? #69
Comments
A lot of architectures have BN layers which don't work properly unless actually backpropagated through, I think.
Interesting! What is BN? Bias?
When you say "a lot of", does that include Llama 2 and 3 type models?
Basically you're saying that accumulating the gradients isn't enough, some important info is thrown away once you move to the next forward pass?
Batch normalization. Essentially, BN blocks keep track of the running batch mean and standard deviation and use them to normalize their inputs. These parameters are non-trainable and are updated with each mini-batch the blocks receive. However, because the total number of batches per epoch is not the same as the number of backpropagations when using gradient accumulation, BN blocks now compute "incorrect" statistics. This problem is further magnified by their other parameters still being updated according to the accumulated batches. Basically, the batches and their descriptive statistics become "unsynchronized". BN blocks are very popular in computer vision tasks, and unfortunately, I'm not too familiar with much else. However, I believe that transformer blocks typically use layer normalization, which does not depend on batch size, so you should be safe.
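A quick way to see the desync (toy sketch, PyTorch assumed): BN's running statistics are updated on every forward pass in training mode, whether or not you ever call `backward()` or `optimizer.step()`, so under accumulation they track the small micro-batches rather than the effective batch:

```python
import torch
import torch.nn as nn

bn = nn.BatchNorm1d(4)            # track_running_stats=True by default
bn.train()

print(bn.running_mean)            # starts at zeros
for _ in range(3):
    x = torch.randn(8, 4) + 5.0   # micro-batch of 8 with mean around 5
    bn(x)                         # forward only: no backward(), no optimizer.step()
print(bn.running_mean)            # has already moved toward ~5 from forwards alone
```

The per-forward normalization itself also uses only that micro-batch's own mean and variance, so four accumulated batches of 8 never see the same statistics as one real batch of 32.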
By the way, large batch sizes are just as "dangerous" as small ones due to potential oversmoothing of the gradient landscape. It's kind of a "pick your poison" situation.
Thanks, yeah, agreed on the problems at big batches.
And yeah, that makes sense re: GA and batch norm. Llama 2 and 3 are layer norm so should be fine, but good to know for multi-modal models - I need to check if CLIP has batch norm.
There is this quote:
"For large GPUs and multi-GPU setups, I can see this making sense, as you can run batches of 32 and don't need accumulation. But, on smaller GPUs, grad accum can be important because it provides averaging in the virtual batches that stabilises the training."
Am I mistaken or missing something?
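For what it's worth, the "averaging in the virtual batches" part can be checked numerically for a model without batch-dependent layers like BN (toy sketch, assuming PyTorch and a mean-reduced loss): accumulating over k micro-batches with the loss scaled by 1/k reproduces the full-batch gradient exactly:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)
x, y = torch.randn(32, 16), torch.randn(32, 1)
model = nn.Linear(16, 1)          # no BN, so micro-batch statistics play no role

def grad_of(batches, scale=1.0):
    model.zero_grad()
    for xb, yb in batches:
        (scale * F.mse_loss(model(xb), yb)).backward()   # mean-reduced loss
    return model.weight.grad.clone()

full = grad_of([(x, y)])                          # one real batch of 32
micro = [(x[i::4], y[i::4]) for i in range(4)]    # 4 micro-batches of 8
accum = grad_of(micro, scale=1 / 4)               # accumulate, scale loss by 1/k

print(torch.allclose(full, accum, atol=1e-5))     # True: same averaged gradient
```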