Loss is NaN when using half precision #377

Open
mbcel opened this issue Jul 18, 2017 · 3 comments

mbcel commented Jul 18, 2017

When I run my model in half precision (fp16), the loss function returns NaN. Everything works fine when I use single precision (fp32), so I don't think it is a problem with the learning parameters. The loss is also NaN right from the beginning of training.

I am using SpatialCrossEntropyCriterion, and I explicitly do not convert the MaxPooling and BatchNormalization layers to cudnn, since those don't work otherwise.

Relevant code:

criterion = cudnn.SpatialCrossEntropyCriterion(classWeights):cudaHalf()

model = createNet()
model = model:cudaHalf()

-- cudnn.convert ignores the Pooling layers due to compatibility problems with Unpooling
cudnn.convert(model, cudnn, function(module)
      return torch.type(module):find("SpatialMaxPooling") ~= nil -- compatibility problems
          or torch.type(module):find("SpatialBatchNormalization") ~= nil -- apparently no cudaHalf implementation
    end)

...
-- during training this returns NaN right from the beginning, or sometimes at the second iteration
loss = criterion:forward(outputGpu, labels)

I am wondering whether the reason is the (apparently missing?) cudaHalf implementation of the BatchNormalization module.

mbcel commented Jul 18, 2017

Okay, I figured out that the NaNs were due to the Adam optimization. The default epsilon of 1e-8 is too small and gets rounded to zero in fp16, as pointed out here. Setting it to 1e-4 fixes the NaN problem, but now the optimization does not decrease the loss anymore. Is there a way to solve this while keeping the same learning rate?
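For illustration, a minimal PyTorch sketch of the underflow (the original code above is Lua Torch, so the optimizer call below is only an analogy, and the toy model is made up): 1e-8 is below the smallest positive fp16 value, so it rounds to zero, which makes Adam's denominator collapse.

```python
import torch

# 1e-8 underflows in fp16 (the smallest positive fp16 value is ~6e-8),
# while 1e-4 is comfortably representable.
print(torch.tensor(1e-8, dtype=torch.float16))  # prints 0 -- eps has vanished
print(torch.tensor(1e-4, dtype=torch.float16))  # prints ~1e-4

# With the optimizer state in fp16, Adam's update
#   p -= lr * m_hat / (sqrt(v_hat) + eps)
# divides by ~0 whenever v_hat underflows and eps has been flushed to zero.
# Raising eps to an fp16-representable value keeps the denominator finite:
model = torch.nn.Linear(10, 2).half()   # hypothetical toy model
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3, eps=1e-4)
```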

Manuscrit commented

You can keep FP32 master weights for the optimizer, as explained here: https://devblogs.nvidia.com/mixed-precision-training-deep-neural-networks/
And a PyTorch snippet: https://gist.github.com/ajbrock/075c0ca4036dc4d8581990a6e76e07a3
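A minimal sketch of that master-weights idea in PyTorch (not the linked gist; the model, shapes, and the fixed loss scale are all illustrative, and a CUDA device is assumed): the forward/backward pass runs in fp16, while a separate fp32 copy of the parameters receives the unscaled gradients and is updated by Adam with its usual epsilon.

```python
import torch

# Hypothetical fp16 model; the fp32 "master" copy is what Adam actually updates.
model = torch.nn.Linear(128, 10).half().cuda()
master_params = [p.detach().clone().float() for p in model.parameters()]
optimizer = torch.optim.Adam(master_params, lr=1e-3, eps=1e-8)  # fp32 state, default eps is fine again

def train_step(x, y, loss_scale=128.0):
    # forward/backward entirely in fp16
    loss = torch.nn.functional.cross_entropy(model(x), y)
    model.zero_grad()
    (loss * loss_scale).backward()        # scale the loss so small gradients survive fp16

    # move (and unscale) the fp16 gradients onto the fp32 master copy, then step in fp32
    for p16, p32 in zip(model.parameters(), master_params):
        p32.grad = p16.grad.detach().float() / loss_scale
    optimizer.step()

    # copy the updated fp32 weights back into the fp16 model for the next forward pass
    with torch.no_grad():
        for p16, p32 in zip(model.parameters(), master_params):
            p16.copy_(p32)
    return loss.item()
```

The fixed loss scale of 128 is a guess; the NVIDIA post describes dynamic loss scaling, which backs off automatically when gradients overflow.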


LiJiaqi96 commented Dec 28, 2021

I solved this issue by using autocast instead of .half(), as suggested by the PyTorch team:
https://discuss.pytorch.org/t/working-with-half-model-and-half-input/88494
https://pytorch.org/docs/master/amp.html
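For completeness, a minimal sketch of the autocast + GradScaler pattern from the linked docs (the model, optimizer, and data names are placeholders): the weights stay fp32, autocast runs eligible ops in fp16, and the scaler handles loss scaling and skips steps whose gradients overflowed.

```python
import torch
from torch.cuda.amp import autocast, GradScaler

model = torch.nn.Linear(128, 10).cuda()                  # placeholder model, weights stay fp32
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
scaler = GradScaler()                                     # dynamic loss scaling

def train_step(x, y):
    optimizer.zero_grad()
    with autocast():                                      # mixed-precision forward pass
        loss = torch.nn.functional.cross_entropy(model(x), y)
    scaler.scale(loss).backward()                         # backward on the scaled loss
    scaler.step(optimizer)                                # unscales grads, skips the step on inf/NaN
    scaler.update()
    return loss.item()
```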
