[QUESTION] Is it expected to do grad norm on dense-optimizer and moe-optimizer respectively? #1183

ezioliao · 2024-04-19T08:51:56Z

ezioliao
Apr 19, 2024

If we enable expert parallelism, there are two optimizers: one for the dense parameters and one for the expert parameters. When we call optimizer.step(), each optimizer computes the grad norm over its own parameters only.

But if we do not enable expert parallelism, the grad norm is computed over the gradients of all model parameters together.

So my question is: the grad norm behavior differs mathematically depending on whether expert parallelism is turned on. Is this expected?
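To make the difference concrete, here is a minimal sketch in plain Python. It is not Megatron-LM code: the function names, the `eps` value, and the toy gradients are all assumptions made for illustration, and it assumes the grad norm is used for L2 gradient clipping with a `max_norm` threshold.

```python
import math

def l2_norm(grads):
    """L2 norm of a flat list of gradient values."""
    return math.sqrt(sum(g * g for g in grads))

def clip_coeff(total_norm, max_norm, eps=1e-6):
    """Scaling factor for clipping; eps is an assumed stabilizer."""
    return min(1.0, max_norm / (total_norm + eps))

def clip_jointly(dense, expert, max_norm):
    """One norm over all gradients (hypothetical single-optimizer case)."""
    c = clip_coeff(l2_norm(dense + expert), max_norm)
    return [g * c for g in dense], [g * c for g in expert]

def clip_separately(dense, expert, max_norm):
    """One norm per group (hypothetical two-optimizer case)."""
    c_dense = clip_coeff(l2_norm(dense), max_norm)
    c_expert = clip_coeff(l2_norm(expert), max_norm)
    return [g * c_dense for g in dense], [g * c_expert for g in expert]

# Toy gradients: the dense group has norm 5, the expert group has norm 1.
dense_grads = [3.0, 4.0]
expert_grads = [0.6, 0.8]

joint = clip_jointly(dense_grads, expert_grads, max_norm=1.0)
separate = clip_separately(dense_grads, expert_grads, max_norm=1.0)

# Norm of the full gradient vector after each kind of clipping.
joint_norm = l2_norm(joint[0] + joint[1])
separate_norm = l2_norm(separate[0] + separate[1])
print(joint_norm, separate_norm)
```

With these toy values, joint clipping leaves the full gradient with a norm of about 1.0, while separate clipping leaves it at about 1.41, because each group is clipped to `max_norm` on its own. The expert gradients are also scaled very differently: by roughly 0.2 in the joint case and almost not at all in the separate case.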

2024-06-18T18:20:52Z

github-actions[bot]
bot Jun 18, 2024

Marking as stale. No activity in 60 days.

deepakn94 · 2024-06-19T22:32:18Z

deepakn94
Jun 19, 2024
Maintainer

I believe this behavior is fixed now.

2024-08-19T18:21:03Z

github-actions[bot]
bot Aug 19, 2024

Marking as stale. No activity in 60 days.
