Replies: 3 comments
-
Marking as stale. No activity in 60 days.
-
I believe this behavior is fixed now.
-
Marking as stale. No activity in 60 days.
-
If we enable expert parallelism, there will be two optimizers: one for the dense parameters and one for the expert parameters. When we call `optimizer.step()`, each of the two optimizers performs grad-norm over its own parameters only. But if we do not enable expert parallelism, the gradients of all model parameters are normed together as a whole.
So my question is: the grad-norm behavior differs mathematically depending on whether expert parallelism is turned on. Is this expected?
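To make the difference concrete, here is a minimal pure-Python sketch. It is a toy illustration, not the library's actual implementation, and it assumes the grad norm is used for L2 norm clipping against a threshold. The names `l2_norm`, `clip_scale`, `dense_grads`, `expert_grads`, and `max_norm` are hypothetical and chosen only for this example.

```python
import math

def l2_norm(grads):
    """L2 norm of a flat list of gradient values."""
    return math.sqrt(sum(g * g for g in grads))

def clip_scale(grads, max_norm):
    """Factor the gradients would be multiplied by under norm clipping."""
    norm = l2_norm(grads)
    return min(1.0, max_norm / norm) if norm > 0 else 1.0

# Toy gradients for the two parameter groups (assumed values).
dense_grads = [3.0, 0.0]
expert_grads = [0.0, 4.0]
max_norm = 1.0

# One optimizer over all parameters: a single norm, a single scale.
joint_scale = clip_scale(dense_grads + expert_grads, max_norm)

# Two optimizers: each group is normed and scaled independently.
dense_scale = clip_scale(dense_grads, max_norm)
expert_scale = clip_scale(expert_grads, max_norm)

print(joint_scale, dense_scale, expert_scale)
```

With these values the joint norm is 5, so both groups are scaled by 0.2, while the separate norms are 3 and 4, giving scales of 1/3 and 0.25. Under the clipping assumption, the two setups therefore apply different effective updates to the same gradients.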