Add variance calculation from FusedAdam optimizer states #1726

kwyss-nvidia · 2025-04-28T21:32:05Z

Description

The Adam optimizer state holds exponential moving averages of the gradient's first and second moments (`exp_avg` and `exp_avg_sq`).
Combining them yields an estimate of the gradient variance for a parameter.

This PR adds a method to the FusedAdam optimizer that computes this variance estimate for a given parameter.
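
For reference, one common way to combine the two moments is sketched below. It uses the standard Adam recurrences; the symbols $g_t$, $m_t$, $v_t$, $\beta_1$, $\beta_2$ are notation introduced here, not taken from the PR, and the exact formula used in the PR (for example, whether bias correction is applied) may differ.

$$
\begin{aligned}
m_t &= \beta_1 m_{t-1} + (1 - \beta_1)\, g_t && \text{(stored as \texttt{exp\_avg})} \\
v_t &= \beta_2 v_{t-1} + (1 - \beta_2)\, g_t^2 && \text{(stored as \texttt{exp\_avg\_sq})} \\
\widehat{\mathrm{Var}}[g] &= v_t - m_t^2 && \text{(from } \mathrm{Var}[g] = \mathbb{E}[g^2] - \mathbb{E}[g]^2\text{)}
\end{aligned}
$$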

Fixes # (issue)

Type of change

Documentation change (change only to the documentation, either a fix or a new content)
Bug fix (non-breaking change which fixes an issue)
New feature (non-breaking change which adds functionality)
Breaking change (fix or feature that would cause existing functionality to not work as expected)
Infra/Build change
Code refactoring

Changes

Please list the changes introduced in this PR:

Add

Checklist:

I have read and followed the contributing guidelines
The functionality is complete
I have commented my code, particularly in hard-to-understand areas
I have made corresponding changes to the documentation
My changes generate no new warnings
I have added tests that prove my fix is effective or that my feature works
New and existing unit tests pass locally with my changes

Signed-off-by: Keith Wyss <[email protected]>

timmoon10 · 2025-04-28T22:42:49Z

transformer_engine/pytorch/optimizers/fused_adam.py

+        """Return the unscaled state corresponding to the input `param` and `state_name`.
+
+        Arguments:
+            param (torch.nn.Parameter): One of parameters in this optimizer.
+            state_name (string): Name of optimizer states, can be one of 'exp_avg', 'exp_avg_sq',
+                and 'master_param`.
+        """


Suggested change

-        """Return the unscaled state corresponding to the input `param` and `state_name`.
-
-        Arguments:
-            param (torch.nn.Parameter): One of parameters in this optimizer.
-            state_name (string): Name of optimizer states, can be one of 'exp_avg', 'exp_avg_sq',
-                and 'master_param`.
-        """
+        """Estimate the gradient variance based on moment estimates."""

timmoon10 · 2025-04-28T22:47:29Z

tests/pytorch/test_fused_optimizer.py

+            first_moment = optimizer_.get_unscaled_state(param, "exp_avg")
+            second_moment = optimizer_.get_unscaled_state(param, "exp_avg_sq")


We could make this test more robust by manually computing exp_avg and exp_avg_sq outside of the optimizer. There's some complicated dtype-specific logic within get_unscaled_state, so I don't really trust it.
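
A minimal sketch of what such an independent reference could look like is shown below. It replays the standard Adam moment recurrences in plain Python for a single scalar parameter, without touching `get_unscaled_state`. The helper names `reference_adam_moments` and `variance_from_moments` are hypothetical, and the variance formula `exp_avg_sq - exp_avg**2` is an assumption about what the PR computes. In the actual test the same recurrences would be applied elementwise to tensors.

```python
def reference_adam_moments(grads, beta1=0.9, beta2=0.999):
    """Replay Adam's moment recurrences for one scalar parameter.

    Hypothetical helper: returns the raw (not bias-corrected) moving
    averages, starting from zero-initialized state.
    """
    exp_avg = 0.0
    exp_avg_sq = 0.0
    for g in grads:
        # First moment: moving average of the gradient.
        exp_avg = beta1 * exp_avg + (1.0 - beta1) * g
        # Second moment: moving average of the squared gradient.
        exp_avg_sq = beta2 * exp_avg_sq + (1.0 - beta2) * g * g
    return exp_avg, exp_avg_sq


def variance_from_moments(exp_avg, exp_avg_sq):
    """Assumed estimator: E[g^2] - E[g]^2 using the moving averages."""
    return exp_avg_sq - exp_avg * exp_avg


# Example: two identical gradients with equal decay rates.
m, v = reference_adam_moments([2.0, 2.0], beta1=0.5, beta2=0.5)
print(m, v, variance_from_moments(m, v))
```

With raw moments and `beta1 != beta2`, this difference can be negative during the first few steps, so a test built this way should compare against the reference value rather than assume a non-negative result.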

kwyss-nvidia added 2 commits April 24, 2025 14:21

Add variance calculation.

2213dcc

Signed-off-by: Keith Wyss <[email protected]>

Lint remove needless line.

88e0865

Signed-off-by: Keith Wyss <[email protected]>

timmoon10 reviewed Apr 28, 2025
