train loss nan in #147

Open
7yzx opened this issue Feb 28, 2024 · 2 comments

7yzx commented Feb 28, 2024

No description provided.

7yzx commented Feb 28, 2024

no change /home/yezixiao/project/miniconda3/condabin/conda
no change /home/yezixiao/project/miniconda3/bin/conda
no change /home/yezixiao/project/miniconda3/bin/conda-env
no change /home/yezixiao/project/miniconda3/bin/activate
no change /home/yezixiao/project/miniconda3/bin/deactivate
no change /home/yezixiao/project/miniconda3/etc/profile.d/conda.sh
no change /home/yezixiao/project/miniconda3/etc/fish/conf.d/conda.fish
no change /home/yezixiao/project/miniconda3/shell/condabin/Conda.psm1
no change /home/yezixiao/project/miniconda3/shell/condabin/conda-hook.ps1
no change /home/yezixiao/project/miniconda3/lib/python3.7/site-packages/xontrib/conda.xsh
no change /home/yezixiao/project/miniconda3/etc/profile.d/conda.csh
no change /home/yezixiao/.bashrc
No action taken.
mkdir: cannot create directory ‘./mymodels_edge/bts_nyu_v2_pytorch_test’: File exists
You have specified --do_online_eval.
This will evaluate the model every eval_freq 1000 steps and save best models for individual eval metrics.
Fixing first conv layer
Total number of parameters: 47000688
Total number of learning parameters: 46766640
Model Initialized
Initial variables' sum: -4548.164, avg: -20.036
[epoch][s/s_per_e/gs]: [0][0/1515/0], lr: 0.000100000000, loss: 6.135936260223
[epoch][s/s_per_e/gs]: [0][1/1515/1], lr: 0.000099998218, loss: 7.346207618713
[epoch][s/s_per_e/gs]: [0][2/1515/2], lr: 0.000099996436, loss: nan
NaN in loss occurred. Aborting training.
I trained it on a supercomputer with a single A100 GPU.
My batch size is 16 with weight_decay=1e-3; I also tried batch size 4, but the result is the same.

On my own computer, batch size 2 works, because my graphics memory is only 5 GB.
I don't know how to solve this.
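
For reference, a common source of NaN with the scale-invariant log (silog) loss used in BTS-style training is taking the log of a non-positive predicted depth, or the sqrt of a value that goes slightly negative. Below is a minimal, hypothetical guarded variant for debugging; the class name `SilogLossSafe`, the `eps` clamps, and the default `variance_focus=0.85` are assumptions for illustration, not the fix referenced later in this thread.

```python
import torch
import torch.nn as nn

class SilogLossSafe(nn.Module):
    # Hypothetical NaN-guarded sketch of a scale-invariant log (silog) loss.
    # The eps clamps below are illustrative assumptions, not the repo's code.
    def __init__(self, variance_focus=0.85, eps=1e-6):
        super().__init__()
        self.variance_focus = variance_focus
        self.eps = eps

    def forward(self, depth_est, depth_gt, mask):
        # log() of a zero or negative prediction is a typical NaN source,
        # so clamp both depth maps away from zero before taking the log.
        d = torch.log(depth_est[mask].clamp(min=self.eps)) \
            - torch.log(depth_gt[mask].clamp(min=self.eps))
        # The quantity under the sqrt can dip below zero numerically;
        # clamp it so sqrt never sees a negative value.
        var_term = (d ** 2).mean() - self.variance_focus * d.mean() ** 2
        return torch.sqrt(var_term.clamp(min=self.eps)) * 10.0
```

Gradient clipping with `torch.nn.utils.clip_grad_norm_(model.parameters(), 1.0)` before `optimizer.step()`, or lowering the learning rate when increasing the batch size, are other common mitigations worth trying.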

@frickyinn

This might work: #149
