-
I have tried several times to execute the LeNet example, but the training loss diverges. At startup I also get this warning:

W CUDAFunctions.cpp:109] Warning: CUDA initialization: The NVIDIA driver on your system is too old (found version 11080). Please update your GPU driver by downloading and installing a new version from the URL: http://www.nvidia.com/Download/index.aspx Alternatively, go to: https://pytorch.org to install a PyTorch version that has been compiled with your version of the CUDA driver. (function operator())
Epoch: 1 | Batch: 0 | Training loss: 2,3138 | Eval loss: 2,3010 | Eval accuracy: 0,1009
Epoch: 1 | Batch: 200 | Training loss: 0,7689 | Eval loss: 0,7535 | Eval accuracy: 0,7741
Epoch: 1 | Batch: 400 | Training loss: 0,8344 | Eval loss: 0,9345 | Eval accuracy: 0,6736
Epoch: 1 | Batch: 600 | Training loss: 6,6774 | Eval loss: 9125501,0000 | Eval accuracy: 0,3796
Epoch: 1 | Batch: 800 | Training loss: 130458,1641 | Eval loss: 154628,1406 | Eval accuracy: 0,0974
Epoch: 1 | Batch: 1000 | Training loss: 363175656554496,0000 | Eval loss: 5094437340315648,0000 | Eval accuracy: 0,1421
Epoch: 1 | Batch: 1200 | Training loss: 178396147482624,0000 | Eval loss: 15321602761520103000000000,0000 | Eval accuracy: 0,1257
Epoch: 1 | Batch: 1400 | Training loss: 52412235874540960000000000,0000 | Eval loss: 188815448995733300000000000,0000 | Eval accuracy: 0,0654
Epoch: 1 | Batch: 1600 | Training loss: 289345426490735270000000000,0000 | Eval loss: 198066029980096800000000000,0000 | Eval accuracy: 0,0990
Epoch: 1 | Batch: 1800 | Training loss: 9335885960084007000000000000,0000 | Eval loss: 83514956802220020000000000000,0000 | Eval accuracy: 0,0905
Epoch: 2 | Batch: 0 | Training loss: 748397145925992400000000000000000000,0000 | Eval loss: Infinity | Eval accuracy: 0,1019
Epoch: 2 | Batch: 200 | Training loss: 445750120740847160000000000000000000,0000 | Eval loss: Infinity | Eval accuracy: 0,1017
Epoch: 2 | Batch: 400 | Training loss: 935137766515788400000000000000000000,0000 | Eval loss: Infinity | Eval accuracy: 0,1006
Epoch: 2 | Batch: 600 | Training loss: NaN | Eval loss: NaN | Eval accuracy: 0,0980
Epoch: 2 | Batch: 800 | Training loss: NaN | Eval loss: NaN | Eval accuracy: 0,0980
Epoch: 2 | Batch: 1000 | Training loss: NaN | Eval loss: NaN | Eval accuracy: 0,0980
Epoch: 2 | Batch: 1200 | Training loss: NaN | Eval loss: NaN | Eval accuracy: 0,0980
Epoch: 2 | Batch: 1400 | Training loss: NaN | Eval loss: NaN | Eval accuracy: 0,0980
Epoch: 2 | Batch: 1600 | Training loss: NaN | Eval loss: NaN | Eval accuracy: 0,0980
Epoch: 2 | Batch: 1800 | Training loss: NaN | Eval loss: NaN | Eval accuracy: 0,0980
Epoch: 3 | Batch: 0 | Training loss: NaN | Eval loss: NaN | Eval accuracy: 0,0980
Epoch: 3 | Batch: 200 | Training loss: NaN | Eval loss: NaN | Eval accuracy: 0,0980
Epoch: 3 | Batch: 400 | Training loss: NaN | Eval loss: NaN | Eval accuracy: 0,0980
Epoch: 3 | Batch: 600 | Training loss: NaN | Eval loss: NaN | Eval accuracy: 0,0980

Any suggestions on changes I could try to get this to work? TIA.
-
Quick note: I updated my Linux CUDA packages to version 12.2, so the warning no longer appears. However, training still does not converge.
-
Much better results. Here is what I got:

[114/114] examples.runMain
Epoch: 1 | Batch: 0 | Training loss: 2,2900 | Eval loss: 2,3009 | Eval accuracy: 0,1135
Epoch: 1 | Batch: 200 | Training loss: 0,6541 | Eval loss: 0,7396 | Eval accuracy: 0,7648
Epoch: 1 | Batch: 400 | Training loss: 0,3216 | Eval loss: 0,4579 | Eval accuracy: 0,8716
Epoch: 1 | Batch: 600 | Training loss: 0,4101 | Eval loss: 0,3639 | Eval accuracy: 0,8919
Epoch: 1 | Batch: 800 | Training loss: 0,5692 | Eval loss: 0,2988 | Eval accuracy: 0,9087
Epoch: 1 | Batch: 1000 | Training loss: 0,3532 | Eval loss: 0,2647 | Eval accuracy: 0,9182
Epoch: 1 | Batch: 1200 | Training loss: 0,3281 | Eval loss: 0,2149 | Eval accuracy: 0,9352
Epoch: 1 | Batch: 1400 | Training loss: 0,1929 | Eval loss: 0,1813 | Eval accuracy: 0,9446
Epoch: 1 | Batch: 1600 | Training loss: 0,2118 | Eval loss: 0,1639 | Eval accuracy: 0,9493
Epoch: 1 | Batch: 1800 | Training loss: 0,1123 | Eval loss: 0,1424 | Eval accuracy: 0,9552
Epoch: 2 | Batch: 0 | Training loss: 0,1060 | Eval loss: 0,1429 | Eval accuracy: 0,9543
Epoch: 2 | Batch: 200 | Training loss: 0,1787 | Eval loss: 0,1418 | Eval accuracy: 0,9551
Epoch: 2 | Batch: 400 | Training loss: 0,1474 | Eval loss: 0,1350 | Eval accuracy: 0,9582
Epoch: 2 | Batch: 600 | Training loss: 0,1649 | Eval loss: 0,1276 | Eval accuracy: 0,9609
Epoch: 2 | Batch: 800 | Training loss: 0,0265 | Eval loss: 0,1179 | Eval accuracy: 0,9620
Epoch: 2 | Batch: 1000 | Training loss: 0,1586 | Eval loss: 0,1059 | Eval accuracy: 0,9660
Epoch: 2 | Batch: 1200 | Training loss: 0,3089 | Eval loss: 0,0929 | Eval accuracy: 0,9705
Epoch: 2 | Batch: 1400 | Training loss: 0,0476 | Eval loss: 0,0866 | Eval accuracy: 0,9722
Epoch: 2 | Batch: 1600 | Training loss: 0,1712 | Eval loss: 0,0849 | Eval accuracy: 0,9725
Epoch: 2 | Batch: 1800 | Training loss: 0,0953 | Eval loss: 0,0797 | Eval accuracy: 0,9745
Epoch: 3 | Batch: 0 | Training loss: 0,1474 | Eval loss: 0,0752 | Eval accuracy: 0,9771
Epoch: 3 | Batch: 200 | Training loss: 0,2855 | Eval loss: 0,0782 | Eval accuracy: 0,9753
Epoch: 3 | Batch: 400 | Training loss: 0,0772 | Eval loss: 0,0849 | Eval accuracy: 0,9728
Epoch: 3 | Batch: 600 | Training loss: 0,0579 | Eval loss: 0,0683 | Eval accuracy: 0,9789
Epoch: 3 | Batch: 800 | Training loss: 0,1340 | Eval loss: 0,0673 | Eval accuracy: 0,9797
Epoch: 3 | Batch: 1000 | Training loss: 0,0084 | Eval loss: 0,0690 | Eval accuracy: 0,9774
Epoch: 3 | Batch: 1200 | Training loss: 0,0163 | Eval loss: 0,0678 | Eval accuracy: 0,9776
Epoch: 3 | Batch: 1400 | Training loss: 0,0359 | Eval loss: 0,0692 | Eval accuracy: 0,9779
Epoch: 3 | Batch: 1600 | Training loss: 0,0136 | Eval loss: 0,0596 | Eval accuracy: 0,9799
Epoch: 3 | Batch: 1800 | Training loss: 0,0185 | Eval loss: 0,0640 | Eval accuracy: 0,9799
Epoch: 4 | Batch: 0 | Training loss: 0,0308 | Eval loss: 0,0602 | Eval accuracy: 0,9808
Epoch: 4 | Batch: 200 | Training loss: 0,1141 | Eval loss: 0,0620 | Eval accuracy: 0,9800
Epoch: 4 | Batch: 400 | Training loss: 0,0544 | Eval loss: 0,0553 | Eval accuracy: 0,9823
Epoch: 4 | Batch: 600 | Training loss: 0,0032 | Eval loss: 0,0587 | Eval accuracy: 0,9809
Epoch: 4 | Batch: 800 | Training loss: 0,0361 | Eval loss: 0,0553 | Eval accuracy: 0,9820
Epoch: 4 | Batch: 1000 | Training loss: 0,0353 | Eval loss: 0,0542 | Eval accuracy: 0,9833
Epoch: 4 | Batch: 1200 | Training loss: 0,2258 | Eval loss: 0,0511 | Eval accuracy: 0,9832
Epoch: 4 | Batch: 1400 | Training loss: 0,0433 | Eval loss: 0,0482 | Eval accuracy: 0,9839
Epoch: 4 | Batch: 1600 | Training loss: 0,0448 | Eval loss: 0,0502 | Eval accuracy: 0,9829
Epoch: 4 | Batch: 1800 | Training loss: 0,0116 | Eval loss: 0,0504 | Eval accuracy: 0,9836
Epoch: 5 | Batch: 0 | Training loss: 0,0858 | Eval loss: 0,0516 | Eval accuracy: 0,9835
Epoch: 5 | Batch: 200 | Training loss: 0,0135 | Eval loss: 0,0483 | Eval accuracy: 0,9840
Epoch: 5 | Batch: 400 | Training loss: 0,0606 | Eval loss: 0,0502 | Eval accuracy: 0,9840
Epoch: 5 | Batch: 600 | Training loss: 0,0653 | Eval loss: 0,0472 | Eval accuracy: 0,9854
Epoch: 5 | Batch: 800 | Training loss: 0,0317 | Eval loss: 0,0480 | Eval accuracy: 0,9834
Epoch: 5 | Batch: 1000 | Training loss: 0,2204 | Eval loss: 0,0489 | Eval accuracy: 0,9839
Epoch: 5 | Batch: 1200 | Training loss: 0,2653 | Eval loss: 0,0508 | Eval accuracy: 0,9842
Epoch: 5 | Batch: 1400 | Training loss: 0,0172 | Eval loss: 0,0423 | Eval accuracy: 0,9856
Epoch: 5 | Batch: 1600 | Training loss: 0,0535 | Eval loss: 0,0433 | Eval accuracy: 0,9845
Epoch: 5 | Batch: 1800 | Training loss: 0,1550 | Eval loss: 0,0433 | Eval accuracy: 0,9863
-
It is, yes, and with amsgrad enabled it also reaches around 0.98 on my machine. Here's the PR that should enable the LeNet example to run on the GPU: #43
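For reference, enabling it is just a flag on the optimizer. A minimal sketch, assuming the Adam binding exposes the same `amsgrad` option as PyTorch's optimizer (check the signature in your version):

```scala
import torch.optim.Adam

// Sketch: Adam with AMSGrad enabled. Assumes the binding mirrors PyTorch's
// `amsgrad` flag; `model` is the example's LeNet instance.
val optimizer = Adam(model.parameters, lr = 1e-4f, amsgrad = true)
```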
-
I was able to reproduce the diverging loss on Linux. Interestingly, it does not diverge on a macOS machine.
Could you try to reduce the learning rate like so?
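Something along these lines; a sketch assuming the example builds its optimizer with the Adam binding (1e-4 is just a smaller trial value, not a tuned one):

```scala
import torch.optim.Adam

// Same optimizer construction as in the example, only with a smaller
// learning rate than the example's current setting.
val optimizer = Adam(model.parameters, lr = 1e-4f)
```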
In my case, with that learning rate it then converged reliably on Linux as well, but I'm still wondering why it behaves differently here.
Note that even with CUDA enabled, the LeNet example currently runs on the CPU. I've fixed that locally; PR coming soon.
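The gist of that fix: pick the device once at startup and move both the model and each batch onto it. A rough sketch, not the exact PR diff (names like `LeNet` and `lossFn` are the example's):

```scala
import torch.Device.{CPU, CUDA}

// Select CUDA when the driver/runtime are usable, otherwise fall back to CPU.
val device = if torch.cuda.isAvailable then CUDA else CPU

// Move the model's parameters onto the chosen device once...
val model = LeNet().to(device)

// ...and move each batch there before the forward pass:
// val prediction = model(features.to(device))
// val loss       = lossFn(prediction, targets.to(device))
```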