Loss is NaN when using half precision #377

Open
mbcel opened this issue Jul 18, 2017 · 3 comments

mbcel commented Jul 18, 2017

When I run my model in half precision (fp16), the loss function returns NaN. Everything works fine when I use single precision (fp32), so I don't think it is a problem with the learning parameters. The loss is also NaN right from the beginning of training.

I am using SpatialCrossEntropyCriterion, and I explicitly do not convert the MaxPooling and BatchNormalization layers to cudnn, since those don't work otherwise.

Relevant code:

criterion = cudnn.SpatialCrossEntropyCriterion(classWeights):cudaHalf()

model = createNet()
model = model:cudaHalf()

-- cudnn.convert ignores the Pooling layers due to compatibility problems with Unpooling
cudnn.convert(model, cudnn, function(module)
      return torch.type(module):find("SpatialMaxPooling") ~= nil -- compatibility problems
          or torch.type(module):find("SpatialBatchNormalization") ~= nil -- apparently no cudaHalf implementation
    end)

...
-- during training this returns NaN right from the beginning, or sometimes at the second iteration
loss = criterion:forward(outputGpu, labels)

I am wondering whether the reason is the (apparently missing?) cudaHalf implementation of the BatchNormalization module.

mbcel commented Jul 18, 2017

Okay, I figured out that the NaNs were due to the Adam optimization. The default epsilon of 1e-8 is too small and gets rounded to zero in fp16, as pointed out here. Setting it to 1e-4 fixes the NaN problem, but now the optimization does not decrease the loss anymore. Is there a way to solve this while keeping the same learning rate?
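For illustration, a minimal PyTorch sketch of the underflow (the original code above is Lua Torch, so the optimizer call below is only an analogy, and the toy model is made up): 1e-8 is below the smallest positive fp16 value, so it rounds to zero, which makes Adam's denominator collapse.

```python
import torch

# 1e-8 underflows in fp16 (the smallest positive fp16 value is ~6e-8),
# while 1e-4 is comfortably representable.
print(torch.tensor(1e-8, dtype=torch.float16))  # prints 0 -- eps has vanished
print(torch.tensor(1e-4, dtype=torch.float16))  # prints ~1e-4

# With the optimizer state in fp16, Adam's update
#   p -= lr * m_hat / (sqrt(v_hat) + eps)
# divides by ~0 whenever v_hat underflows and eps has been flushed to zero.
# Raising eps to an fp16-representable value keeps the denominator finite:
model = torch.nn.Linear(10, 2).half()   # hypothetical toy model
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3, eps=1e-4)
```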

Manuscrit commented

You can keep FP32 master weights for the optimizer, as explained here: https://devblogs.nvidia.com/mixed-precision-training-deep-neural-networks/
And a PyTorch snippet: https://gist.github.com/ajbrock/075c0ca4036dc4d8581990a6e76e07a3
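A minimal sketch of that master-weights idea in PyTorch (not the linked gist; the model, shapes, and the fixed loss scale are all illustrative, and a CUDA device is assumed): the forward/backward pass runs in fp16, while a separate fp32 copy of the parameters receives the unscaled gradients and is updated by Adam with its usual epsilon.

```python
import torch

# Hypothetical fp16 model; the fp32 "master" copy is what Adam actually updates.
model = torch.nn.Linear(128, 10).half().cuda()
master_params = [p.detach().clone().float() for p in model.parameters()]
optimizer = torch.optim.Adam(master_params, lr=1e-3, eps=1e-8)  # fp32 state, default eps is fine again

def train_step(x, y, loss_scale=128.0):
    # forward/backward entirely in fp16
    loss = torch.nn.functional.cross_entropy(model(x), y)
    model.zero_grad()
    (loss * loss_scale).backward()        # scale the loss so small gradients survive fp16

    # move (and unscale) the fp16 gradients onto the fp32 master copy, then step in fp32
    for p16, p32 in zip(model.parameters(), master_params):
        p32.grad = p16.grad.detach().float() / loss_scale
    optimizer.step()

    # copy the updated fp32 weights back into the fp16 model for the next forward pass
    with torch.no_grad():
        for p16, p32 in zip(model.parameters(), master_params):
            p16.copy_(p32)
    return loss.item()
```

The fixed loss scale of 128 is a guess; the NVIDIA post describes dynamic loss scaling, which backs off automatically when gradients overflow.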


LiJiaqi96 commented Dec 28, 2021

I solved this issue by using autocast instead of .half(), as suggested by the PyTorch team:
https://discuss.pytorch.org/t/working-with-half-model-and-half-input/88494
https://pytorch.org/docs/master/amp.html
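For completeness, a minimal sketch of the autocast + GradScaler pattern from the linked docs (the model, optimizer, and data names are placeholders): the weights stay fp32, autocast runs eligible ops in fp16, and the scaler handles loss scaling and skips steps whose gradients overflowed.

```python
import torch
from torch.cuda.amp import autocast, GradScaler

model = torch.nn.Linear(128, 10).cuda()                  # placeholder model, weights stay fp32
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
scaler = GradScaler()                                     # dynamic loss scaling

def train_step(x, y):
    optimizer.zero_grad()
    with autocast():                                      # mixed-precision forward pass
        loss = torch.nn.functional.cross_entropy(model(x), y)
    scaler.scale(loss).backward()                         # backward on the scaled loss
    scaler.step(optimizer)                                # unscales grads, skips the step on inf/NaN
    scaler.update()
    return loss.item()
```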
