Fix (scaling/standalone): better switch from runtime stats to param #1099

Open · wants to merge 4 commits into base: dev
Conversation

@Giuseppe5 Giuseppe5 (Collaborator) commented Nov 21, 2024

Reason for this PR

Currently, if we switch from training to eval before stats_collection_steps is complete, we never update the value parameter with the buffer's contents. This has a few side effects:

  • When applying learned round, we might keep the model in eval mode but still accumulate gradients. If the value parameter is not being used, no gradients are accumulated.
  • When exporting the state_dict, value is not exported.
  • When doing PTQ calibration, the current setup never converts the buffer to its corresponding value parameter, causing some of the issues mentioned above.
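The pre-fix behavior can be sketched in a few lines. This is a simplified illustration with hypothetical names (RuntimeStatsScaling, collect_steps), not the actual Brevitas implementation:

```python
import torch
import torch.nn as nn

class RuntimeStatsScaling(nn.Module):
    """Simplified sketch: collect running stats in a buffer for
    `collect_steps` training iterations, then promote the buffer to a
    learnable `value` parameter."""
    def __init__(self, collect_steps=30):
        super().__init__()
        self.collect_steps = collect_steps
        self.counter = 0
        self.register_buffer("buffer", torch.tensor(1.0))
        self.value = None  # becomes an nn.Parameter once collection ends

    def forward(self, x):
        if self.training and self.counter < self.collect_steps:
            # running average of the absolute max, as a stand-in statistic
            stat = x.abs().max().detach()
            self.buffer = (self.buffer * self.counter + stat) / (self.counter + 1)
            self.counter += 1
            if self.counter == self.collect_steps:
                self.value = nn.Parameter(self.buffer.clone())
            return x / self.buffer
        if self.value is None:
            # The issue described above: we switched to eval before
            # collection finished, so `value` was never created -- no
            # gradients can flow into it, and state_dict lacks `value`.
            return x / self.buffer
        return x / self.value

m = RuntimeStatsScaling(collect_steps=30)
m(torch.randn(8))  # a single training step, collection not finished
m.eval()
m(torch.randn(8))  # eval forward still uses the buffer
print("value" in dict(m.named_parameters()))  # False: never promoted
```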

Changes Made in this PR

At eval time, during the first iteration, the buffer is always converted to the value parameter.
The side effect appears if the user switches between training/evaluation mode multiple times very early in the training process. Although it is common to switch between training/eval to check the loss on a validation set, this is usually done after enough iterations that the buffer has already been converted to a parameter anyway.
I'd admit that this could be marked as a breaking change for that edge case.

This has been removed in a more recent commit. I believe there are no more breaking changes at this point.
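The fixed behavior can be sketched as follows, again with hypothetical names rather than the actual Brevitas code: on the first eval-time forward, whatever has accumulated in the buffer so far is promoted to the value parameter.

```python
import torch
import torch.nn as nn

class FixedRuntimeStatsScaling(nn.Module):
    """Sketch of the fix (hypothetical names): promote the stats buffer
    to the `value` parameter on the first eval-time forward, even if
    stats collection has not finished."""
    def __init__(self, collect_steps=30):
        super().__init__()
        self.collect_steps = collect_steps
        self.counter = 0
        self.register_buffer("buffer", torch.tensor(1.0))
        self.value = None

    def _promote(self):
        # Convert the buffer into a learnable parameter exactly once.
        if self.value is None:
            self.value = nn.Parameter(self.buffer.clone())

    def forward(self, x):
        if self.training and self.counter < self.collect_steps:
            stat = x.abs().max().detach()
            self.buffer = (self.buffer * self.counter + stat) / (self.counter + 1)
            self.counter += 1
            if self.counter == self.collect_steps:
                self._promote()
            return x / self.buffer
        # Fix: convert on the first eval iteration, so `value` exists for
        # learned-round gradients, state_dict export, and PTQ calibration.
        self._promote()
        return x / self.value

m = FixedRuntimeStatsScaling(collect_steps=30)
m(torch.randn(8))  # a single training step, collection not finished
m.eval()
m(torch.randn(8))  # first eval forward promotes buffer -> parameter
print("value" in dict(m.named_parameters()))  # True
```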

Testing Summary

Risk Highlight

  • This PR includes code from another work (please detail).
  • This PR contains API-breaking changes.
  • This PR depends on work in another PR (please provide links/details).
  • This PR introduces new dependencies (please detail).
  • There are coverage gaps not covered by tests.
  • Documentation updates required in subsequent PR.

Checklist

  • Code comments added to any hard-to-understand areas, if applicable.
  • Changes generate no new warnings.
  • Updated any relevant tests, if applicable.
  • No conflicts with destination dev branch.
  • I reviewed my own code changes.
  • Initial CI/CD passing.
  • 1+ reviews given, and any review issues addressed and approved.
  • Post-review full CI/CD passing.

@Giuseppe5 Giuseppe5 requested review from nickfraser and removed request for nickfraser November 25, 2024 14:20