Figure out why AdamW + gradient accumulation leads to different results for test case #126

Open
achalddave opened this issue Nov 30, 2023 · 6 comments

@achalddave
Collaborator

In #125, we had to switch our gradient accumulation tests from AdamW to SGD to make them pass. It's unclear why; anecdotally, when training models with AdamW, training curves look similar with and without gradient accumulation. This could be a numerical issue, or some AdamW-specific behavior that makes gradient accumulation act differently.
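For anyone picking this up, here's a minimal standalone sketch (a toy linear model, not the actual open_lm test) of the equivalence being checked: one optimizer step on a full batch vs. one step with the same data split into micro-batches and gradients accumulated. Swapping AdamW for SGD here is the comparison in question; the model, data, and tolerance below are made up for illustration.

```python
import copy
import torch
import torch.nn.functional as F

torch.manual_seed(0)
model_a = torch.nn.Linear(16, 4)
model_b = copy.deepcopy(model_a)  # identical initial weights
x, y = torch.randn(8, 16), torch.randn(8, 4)

# Swap AdamW for SGD here to reproduce the comparison in question.
opt_a = torch.optim.AdamW(model_a.parameters(), lr=1e-3)
opt_b = torch.optim.AdamW(model_b.parameters(), lr=1e-3)

# (a) one step on the full batch
opt_a.zero_grad()
F.mse_loss(model_a(x), y).backward()
opt_a.step()

# (b) one step on the same data split into 4 micro-batches,
#     with gradients accumulated before stepping
accum = 4
opt_b.zero_grad()
for xb, yb in zip(x.chunk(accum), y.chunk(accum)):
    # scale by 1/accum so the accumulated gradient matches the full-batch mean
    (F.mse_loss(model_b(xb), yb) / accum).backward()
opt_b.step()

# Compare the resulting parameters at a given tolerance.
for p_a, p_b in zip(model_a.parameters(), model_b.parameters()):
    print(torch.allclose(p_a, p_b, atol=1e-7), (p_a - p_b).abs().max().item())
```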

@sedrick-keh-tri
Collaborator

sedrick-keh-tri commented Dec 1, 2023

The issue might be with the forward pass.

I created this test to compare the forward pass on one batch vs. multiple batches, and it seems there are some numerical discrepancies: https://github.com/sedrick-keh-tri/open_lm/commit/f0aa4e72bf038efc2dd793ca29b8a5bd7727f83f

(Update: It seems this forward test passes in FP32 at a threshold of 1e-5 but fails at 1e-6 and tighter.)
(Update 2: The same pattern holds for FP16, i.e. it works at 1e-5 but not at 1e-6 and tighter.)
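To make that check concrete, here's a hedged sketch of the same kind of comparison on a toy model (not the linked test itself): run identical inputs through the model in one batch and in chunks, then see at which tolerance the outputs stop matching.

```python
import torch

torch.manual_seed(0)
model = torch.nn.Sequential(
    torch.nn.Linear(64, 64), torch.nn.GELU(), torch.nn.Linear(64, 64)
)
x = torch.randn(16, 64)

with torch.no_grad():
    out_full = model(x)                                        # one batch of 16
    out_chunked = torch.cat([model(xb) for xb in x.chunk(4)])  # 4 batches of 4

print("max abs diff:", (out_full - out_chunked).abs().max().item())
for tol in (1e-5, 1e-6, 1e-7):
    print(f"atol={tol}: allclose={torch.allclose(out_full, out_chunked, atol=tol)}")
```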

@achalddave
Collaborator Author

Hm, it's strange that the accumulation tests pass with SGD at tol=1e-7 if the forward pass is failing checks at a threshold of 1e-6...

@sedrick-keh-tri
Collaborator

sedrick-keh-tri commented Dec 1, 2023

I am very confused at this point. It seems like the backward pass is generally more precise than the forward pass... is that even possible?

I tried taking your exact branch and just editing train_one_epoch to return the forward outputs. Forward output checking fails at a precision of 1e-7, but the backward tests pass fine at 1e-7.

The trend of the backward pass being more precise than the forward pass seems to hold, for some reason. For bf16, I tried a single layer and turned off the norm (this seems to affect the precision somehow). With 1 layer, the forward threshold is 1e-2 but the backward threshold is 1e-4.
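For what it's worth, here's a rough standalone sketch (toy model and made-up shapes, not open_lm) of how the two discrepancies can be measured side by side: the max difference in forward outputs vs. the max difference in accumulated gradients for the same chunked inputs.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
model = torch.nn.Sequential(
    torch.nn.Linear(64, 64), torch.nn.Tanh(), torch.nn.Linear(64, 1)
)
x, y = torch.randn(16, 64), torch.randn(16, 1)

# Full batch: forward outputs and gradients.
model.zero_grad()
out_full = model(x)
F.mse_loss(out_full, y).backward()
grads_full = [p.grad.clone() for p in model.parameters()]

# Chunked: concatenate forward outputs, accumulate (scaled) gradients.
model.zero_grad()
accum, outs = 4, []
for xb, yb in zip(x.chunk(accum), y.chunk(accum)):
    out = model(xb)
    outs.append(out.detach())
    (F.mse_loss(out, yb) / accum).backward()
out_chunked = torch.cat(outs)
grads_chunked = [p.grad.clone() for p in model.parameters()]

fwd_diff = (out_full.detach() - out_chunked).abs().max().item()
bwd_diff = max((ga - gb).abs().max().item()
               for ga, gb in zip(grads_full, grads_chunked))
print(f"max forward diff: {fwd_diff:.2e}, max grad diff: {bwd_diff:.2e}")
```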

@mitchellnw
Contributor

A useful test here could also just be a short training run with and without grad accum, set up so that we'd expect the curves to be identical. If the model with grad accum is clearly worse, then we know something is wrong. If they achieve very similar loss, then that supports the differences being noise.

@sedrick-keh-tri
Collaborator

Ran some 160M experiments here: Wandb report

Loss curves for accum_freq=1,2,4 look basically identical.
[plot: loss curves for accum_freq=1, 2, 4 overlaid]

Validation perplexities are as follows:

- accum=1 evaluation perplexity: 13.26832732791897
- accum=2 evaluation perplexity: 13.302758042158114
- accum=4 evaluation perplexity: 13.271497238202906

For reference, the evaluation perplexity of the other accum=4 experiment is 13.414325512942533, so there seems to be a fair bit of variation even across otherwise identical runs.

@achalddave
Collaborator Author

This is great, thanks @sedrick-keh-tri! I feel comfortable with our grad accum implementation for now, then. Let's leave this issue open in case anyone wants to dig into why the test fails, but continue using grad accum as implemented.
