train loss nan in #147

Open
7yzx opened this issue Feb 28, 2024 · 2 comments

7yzx commented Feb 28, 2024

No description provided.

7yzx commented Feb 28, 2024

no change /home/yezixiao/project/miniconda3/condabin/conda
no change /home/yezixiao/project/miniconda3/bin/conda
no change /home/yezixiao/project/miniconda3/bin/conda-env
no change /home/yezixiao/project/miniconda3/bin/activate
no change /home/yezixiao/project/miniconda3/bin/deactivate
no change /home/yezixiao/project/miniconda3/etc/profile.d/conda.sh
no change /home/yezixiao/project/miniconda3/etc/fish/conf.d/conda.fish
no change /home/yezixiao/project/miniconda3/shell/condabin/Conda.psm1
no change /home/yezixiao/project/miniconda3/shell/condabin/conda-hook.ps1
no change /home/yezixiao/project/miniconda3/lib/python3.7/site-packages/xontrib/conda.xsh
no change /home/yezixiao/project/miniconda3/etc/profile.d/conda.csh
no change /home/yezixiao/.bashrc
No action taken.
mkdir: cannot create directory ‘./mymodels_edge/bts_nyu_v2_pytorch_test’: File exists
You have specified --do_online_eval.
This will evaluate the model every eval_freq 1000 steps and save best models for individual eval metrics.
Fixing first conv layer
Total number of parameters: 47000688
Total number of learning parameters: 46766640
Model Initialized
Initial variables' sum: -4548.164, avg: -20.036
[epoch][s/s_per_e/gs]: [0][0/1515/0], lr: 0.000100000000, loss: 6.135936260223
[epoch][s/s_per_e/gs]: [0][1/1515/1], lr: 0.000099998218, loss: 7.346207618713
[epoch][s/s_per_e/gs]: [0][2/1515/2], lr: 0.000099996436, loss: nan
NaN in loss occurred. Aborting training.
I trained it on a supercomputer with a single A100 GPU.
My batch size is 16 with weight_decay=1e-3; I also tried batch size 4, but the result is the same.

On my own computer, batch size 2 works, because my graphics memory is only 5 GB.
I don't know how to solve this.
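
For reference, a common source of NaN with the scale-invariant log (silog) loss used in BTS-style training is taking the log of a non-positive predicted depth, or the sqrt of a value that goes slightly negative. Below is a minimal, hypothetical guarded variant for debugging; the class name `SilogLossSafe`, the `eps` clamps, and the default `variance_focus=0.85` are assumptions for illustration, not the fix referenced later in this thread.

```python
import torch
import torch.nn as nn

class SilogLossSafe(nn.Module):
    # Hypothetical NaN-guarded sketch of a scale-invariant log (silog) loss.
    # The eps clamps below are illustrative assumptions, not the repo's code.
    def __init__(self, variance_focus=0.85, eps=1e-6):
        super().__init__()
        self.variance_focus = variance_focus
        self.eps = eps

    def forward(self, depth_est, depth_gt, mask):
        # log() of a zero or negative prediction is a typical NaN source,
        # so clamp both depth maps away from zero before taking the log.
        d = torch.log(depth_est[mask].clamp(min=self.eps)) \
            - torch.log(depth_gt[mask].clamp(min=self.eps))
        # The quantity under the sqrt can dip below zero numerically;
        # clamp it so sqrt never sees a negative value.
        var_term = (d ** 2).mean() - self.variance_focus * d.mean() ** 2
        return torch.sqrt(var_term.clamp(min=self.eps)) * 10.0
```

Gradient clipping with `torch.nn.utils.clip_grad_norm_(model.parameters(), 1.0)` before `optimizer.step()`, or lowering the learning rate when increasing the batch size, are other common mitigations worth trying.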

@frickyinn

This might work: #149
