RuntimeError: Model diverged with loss = NaN. #890
Comments
Hypothesis: inside the model, positive and negative INFs are generated and then summed, leading to a prediction of |
Yes, for example:
>>> import tensorflow as tf
>>> x = tf.constant([float("inf")], dtype=tf.float32)
>>> tf.nn.softmax(x)
<tf.Tensor: shape=(1,), dtype=float32, numpy=array([nan], dtype=float32)>
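The same effect can be reproduced without TensorFlow. The sketch below assumes a max-subtracting ("numerically stable") softmax, which is a common formulation but not necessarily what the library does internally; the function name `softmax` is illustrative only. It shows that subtracting one infinity from another is already enough to produce a NaN.

```python
import math

INF = float("inf")

def softmax(logits):
    """Max-subtracting softmax over a list of floats (illustrative sketch)."""
    m = max(logits)
    # If m is +inf, then inf - inf is NaN and exp(NaN) is NaN.
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

print(INF - INF)                  # nan
print(softmax([INF]))             # [nan]
print(softmax([1.0, 2.0, 3.0]))   # finite logits give probabilities that sum to 1
```

Once a single NaN enters the sum, every output of the softmax becomes NaN, so one infinite logit is enough to poison the whole prediction.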
What do you mean exactly? What should be improved in the current way of reporting a training that diverges? |
Thank you. Indeed, an input of inf would also cause a prediction of NaN (and thus a loss of NaN). What I meant is that it would be nice to get feedback about a NaN (or even an inf) as early as possible, so that the cause is clearer. If this the gradient is This issue popped up in 2 different trainings with different data and SGD methods, both times in a training with a gradually decreasing loss of moderate scale that then suddenly diverged. So the NaN is a surprise. An alternative hypothesis is that there is a bug somewhere (although data issues are more likely). Do you have a suggestion for how we could get insight into what suddenly caused the model to diverge while it seemed to be converging normally? In particular, how can we flag the data under consideration at that moment? |
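As a rough illustration of the kind of early feedback asked for here, the sketch below checks per-example losses for NaN or inf at every step and reports which examples were involved. It assumes per-example losses and example identifiers are available in the training loop, which may not be the case in OpenNMT-tf; `find_non_finite` and `check_batch` are hypothetical helpers, not part of any library. In TensorFlow itself, `tf.debugging.check_numerics` offers a comparable check on a tensor.

```python
import math

def find_non_finite(per_example_losses):
    """Return the indices of losses that are NaN or +/-inf."""
    return [i for i, v in enumerate(per_example_losses) if not math.isfinite(v)]

def check_batch(step, example_ids, per_example_losses):
    """Raise as soon as a non-finite loss appears, naming the offending examples."""
    bad = find_non_finite(per_example_losses)
    if bad:
        offenders = [(example_ids[i], per_example_losses[i]) for i in bad]
        raise FloatingPointError(f"non-finite loss at step {step}: {offenders}")

# Hypothetical usage inside a training loop:
check_batch(step=1, example_ids=[10, 11, 12], per_example_losses=[2.3, 1.9, 2.1])
```

Failing on the first inf, rather than waiting for the NaN that follows it, narrows down the step at which the values first left the finite range.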
Example:
|
Another example, this is with classic SGD:
|
Misaligned data can indeed cause spikes in the training loss, and then exploding gradients if there is no countermeasure. Since it is usually difficult to ensure that all examples in the dataset are clean enough, the training parameters should be tuned to prevent divergence. For example, with SGD it is usually necessary to clip the gradients to prevent them from exploding. Are you doing that? (See the |
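For readers unfamiliar with gradient clipping, here is a minimal sketch of clipping by global norm in plain Python. It is not OpenNMT-tf's implementation; the function name and the list-of-lists gradient layout are assumptions made for illustration. TensorFlow provides the same idea as `tf.clip_by_global_norm`.

```python
import math

def clip_by_global_norm(grads, clip_norm):
    """Rescale all gradients so that their joint L2 norm is at most clip_norm.

    grads is a list of gradient vectors (lists of floats), one per variable.
    Returns the clipped gradients and the global norm before clipping.
    """
    global_norm = math.sqrt(sum(g * g for vec in grads for g in vec))
    if global_norm <= clip_norm:
        return [list(vec) for vec in grads], global_norm
    scale = clip_norm / global_norm
    return [[g * scale for g in vec] for vec in grads], global_norm

# A gradient of norm 5 is scaled down to norm 1; the direction is preserved.
clipped, norm = clip_by_global_norm([[3.0, 4.0]], clip_norm=1.0)
print(norm, clipped)
```

Clipping only bounds finite gradients. If the global norm is already inf or NaN, rescaling cannot recover a usable update, which is why it works as a preventive measure and not as a repair.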
I disabled clipping specifically to be able to determine the root cause of the NaN ... Is there a way to get the sample or batch, and the associated loss per sample at the moment of the crash ? |
The training samples seen at the moment of the crash may not be the outliers you are looking for, because the divergence may have started a few iterations earlier. A better way to check your training data for outliers is to train a standard model (e.g. a Transformer with |
I'm currently trying to run This resulted in the following crash:
The configuration is:
Is this expected given the configuration ? |
Maybe an empty line in the data file? The score task does not apply any filtering. |
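A quick way to rule this out is to scan the data files for empty or whitespace-only lines before scoring. The helper below is a hypothetical sketch and the file name in the usage comment is a placeholder; for parallel data, both the source and the target file would need checking.

```python
def find_empty_lines(lines):
    """Return the 1-based numbers of lines that are empty or whitespace-only."""
    return [n for n, line in enumerate(lines, start=1) if not line.strip()]

# Hypothetical usage on a data file:
#   with open("train.src", encoding="utf-8") as f:
#       print(find_empty_lines(f))
print(find_empty_lines(["a b c\n", "\n", "d e\n", "   \n"]))  # [2, 4]
```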
For at least 1 job with this error I am certain that there are no empty lines. If I have updates on this crash, shall I post them in #523 ? |
Yes, please post updates on the other issue. I'm going to close this one as there is not much to add about the NaN loss. Feel free to create other issues if you encounter other problems. |
Hi,
this might be more of a question than an issue, but I'm observing loss = NaN, and am not clear how this could be, at least not from a theoretical standpoint.
It is possible for loss to be "INFinite" when a prediction is a hard 0 or 1, and the target is the opposite. However, a NaN should - as far as I understand - not occur.
In summary:
-log(0) = inf (not observed, inconvenient, but understandable)
-log(x) = NaN -> Observed, but I do not know for which x this could happen (assuming x >= 0)
Does anyone have an explanation?
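For reference, the sketch below lists the IEEE 754 cases in which a log-based loss yields NaN instead of inf. `neg_log` and `cross_entropy` are hypothetical helpers that mimic floating-point conventions (Python's `math.log` raises an exception for 0 instead); they are not how any particular framework computes its loss, and whether one of these paths was taken in this training is not known.

```python
import math

def neg_log(p):
    """-log(p) following IEEE 754 conventions (illustrative helper)."""
    if math.isnan(p) or p < 0.0:
        return math.nan   # log of NaN or of a negative number is NaN
    if p == 0.0:
        return math.inf   # -log(0) = +inf
    return -math.log(p)

def cross_entropy(targets, probs):
    """Naive cross-entropy: sum of target * -log(prob)."""
    return sum(t * neg_log(p) for t, p in zip(targets, probs))

print(neg_log(0.0))                             # inf
print(neg_log(math.nan))                        # nan: the prediction was already NaN
print(cross_entropy([1.0, 0.0], [0.0, 1.0]))    # inf: hard wrong prediction
print(cross_entropy([1.0, 0.0], [1.0, 0.0]))    # nan: 0 * inf for the zero-weight class
```

So with x >= 0, -log(x) alone never gives NaN. The NaN has to come from x itself being NaN, or from arithmetic around the logarithm such as 0 * inf or inf - inf.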
Using classic SGD with OpenNMT-tf 2.20.1, the NaN loss was encountered after 49900 steps.
Many thanks in advance,
Fokko