
Cls loss & Total loss too big #285

Open
ChuyiZhong opened this issue May 11, 2020 · 16 comments

@ChuyiZhong

[screenshot: training log (2020-05-11 2:22 PM)]
Hi all,
I am training it on my own dataset with one class, but the cls loss and total loss look strange to me. How can I fix this?
Thanks in advance.

@zylo117
Owner

zylo117 commented May 11, 2020

What are your training parameters and command? Try updating to the latest code and training with --head_only True.

@ChuyiZhong
Author

> what's the training parameter and command? Try update the latest code and train with --head_only True

python train.py -c 4 -p crowdhuman --batch_size 8 --lr 1e-5 --num_epochs 100
--load_weights /home/fudan/zhongchuyi/Yet-Another-EfficientDet-Pytorch/weights/efficientdet-d4.pth

@zylo117
Owner

zylo117 commented May 11, 2020

I think there is something wrong with the dataset. Can you run the tutorial? Try training on the shape dataset. If the cls loss drops below 10 in a few epochs, the code is fine. It takes about 5 minutes.

@ggaziv
Contributor

ggaziv commented May 11, 2020

@ChuyiZhong see #252

@sourabhyadav

> [screenshot: training log (2020-05-11 2:22 PM)]
> Hi all,
> I am training it on my own dataset with one class, but the cls loss and total loss seem strange to me. How to get this right?
> Thanks in advance.

I am facing a similar issue. The cls loss is in the range of hundreds after 2 epochs.

@zylo117
Owner

zylo117 commented May 12, 2020

That's not normal. It should drop below 1.0 very quickly.
Please check your dataset.
Can you get a good result from the tutorial on your server?

@sourabhyadav

sourabhyadav commented May 12, 2020

> It's not normal. It should drop to under 1.0 real soon.
> Please check your dataset.
> Can you get a good result from the tutorial on your server?

Here are the results after 10 epochs:

Step: 12709. Epoch: 9/10. Iteration: 1271/1271. Cls loss: 0.46582. Reg loss: 0.02957. Total loss: 0.49539: 100%|█| 1271/1271 [05:45<00:00,  3.68it/
Val. Epoch: 9/10. Classification loss: 0.53922. Regression loss: 0.03294. Total loss: 0.57216

This is for a 1-class model. I have cross-checked that my category_id = 1.
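One way to cross-check category ids like this is to compare the ids declared in the COCO-format annotations file against the ids the annotations actually use. A minimal standalone sketch (the inline dict stands in for a real annotations JSON, and `category_id_report` is a hypothetical helper name, not part of the repo):

```python
import json
from collections import Counter

def category_id_report(coco):
    """Return the declared category ids and the ids actually used by annotations."""
    declared = {c["id"]: c["name"] for c in coco["categories"]}
    used = Counter(a["category_id"] for a in coco["annotations"])
    return declared, used

# Tiny inline stand-in for a real COCO annotations file; with a real dataset
# you would load it instead: coco = json.load(open("instances_train.json"))
coco = {
    "categories": [{"id": 1, "name": "person"}],
    "annotations": [{"category_id": 1}, {"category_id": 1}],
}

declared, used = category_id_report(coco)
assert set(used) <= set(declared), "annotations reference undeclared category ids"
print(declared, dict(used))  # {1: 'person'} {1: 2}
```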

The command used for training is:

python3 train.py -c 1 -p pc_loockdown --data_path /home/object_det/datasets/  --batch_size 8 --lr 1e-5 --num_epochs 10 --head_only True --load_weights weights/efficientdet-d1.pth

@zylo117 Does it look normal?

@zylo117
Owner

zylo117 commented May 12, 2020

Yes, now it is.
BTW, did you update to the latest code? In previous versions, head_only may not have worked as expected.

@sourabhyadav

sourabhyadav commented May 12, 2020

Yes, I cloned today's repo.

One doubt still remains:
Why did my training classification loss start around 1000?

[Info] loaded weights: efficientdet-d1.pth, resuming checkpoint from step: 0
[Info] freezed backbone
Step: 499. Epoch: 0/10. Iteration: 500/1271. Cls loss: 1132.17944. Reg loss: 0.06527. Total loss: 1132.24475:  39%|▍| 499/1271 [02:16<03:11,  4.04icheckpoint...
Step: 999. Epoch: 0/10. Iteration: 1000/1271. Cls loss: 866.44366. Reg loss: 0.02792. Total loss: 866.47156:  79%|▊| 999/1271 [04:27<00:56,  4.78itcheckpoint...
Step: 1270. Epoch: 0/10. Iteration: 1271/1271. Cls loss: 901.25433. Reg loss: 0.02943. Total loss: 901.28375: 100%|█| 1271/1271 [05:37<00:00,  3.76
Val. Epoch: 0/10. Classification loss: 1058.99283. Regression loss: 0.03893. Total loss: 1059.03176
Step: 1499. Epoch: 1/10. Iteration: 229/1271. Cls loss: 693.96698. Reg loss: 0.03183. Total loss: 693.99878:  18%|▏| 228/1271 [01:05<04:42,  3.69itcheckpoint...
Step: 1999. Epoch: 1/10. Iteration: 729/1271. Cls loss: 616.92957. Reg loss: 0.03471. Total loss: 616.96429:  57%|▌| 728/1271 [03:21<02:08,  4.22itcheckpoint...
Step: 2499. Epoch: 1/10. Iteration: 1229/1271. Cls loss: 294.32889. Reg loss: 0.05017. Total loss: 294.37906:  97%|▉| 1228/1271 [05:36<00:11,  3.77checkpoint...
Step: 2541. Epoch: 1/10. Iteration: 1271/1271. Cls loss: 367.30658. Reg loss: 0.03857. Total loss: 367.34515: 100%|█| 1271/1271 [05:46<00:00,  3.67
Val. Epoch: 1/10. Classification loss: 385.38531. Regression loss: 0.03736. Total loss: 385.42267
Step: 2999. Epoch: 2/10. Iteration: 458/1271. Cls loss: 228.76134. Reg loss: 0.03491. Total loss: 228.79625:  36%|▎| 457/1271 [02:05<03:08,  4.32itcheckpoint...
Step: 3499. Epoch: 2/10. Iteration: 958/1271. Cls loss: 126.56357. Reg loss: 0.03671. Total loss: 126.60027:  75%|▊| 957/1271 [04:23<01:10,  4.43itcheckpoint...
Step: 3812. Epoch: 2/10. Iteration: 1271/1271. Cls loss: 92.13811. Reg loss: 0.03446. Total loss: 92.17256: 100%|█| 1271/1271 [05:44<00:00,  3.68it
Val. Epoch: 2/10. Classification loss: 115.62846. Regression loss: 0.03631. Total loss: 115.66477

@zylo117 Is this normal behaviour for this repo?

@zylo117
Owner

zylo117 commented May 12, 2020

It is. It might be caused by a low lr.
For your case, I think 1e-5 is a little too low.
But considering you are doing well so far, don't change it.

@sourabhyadav

@zylo117 Thanks for the prompt replies.
I ran inference on my custom test set and found a lower AP than the pre-trained network: pre-trained (80-class) person AP = 64.69%, fine-tuned (1-class) person AP = 30.26%.

Basically, the people in the dataset are seen from a look-down or top-down view. I have only 10k training images. I trained with the command above.

Why would this happen? My intuition is that since it is still the person class, just from a different viewpoint, the model should at least improve over the pre-trained one.

@zylo117
Owner

zylo117 commented May 12, 2020

If it's not overfitting, as your logs suggest, it's underfitting.
Maybe it hasn't converged yet.
Try increasing the learning rate and keep training.

@sourabhyadav

Yeah, after 100 epochs the AP results improved.
Thanks for the awesome repo.

@lzh18628137361

> Yeah, after 100 epochs AP results were improved.
> Thanks for awesome repo.

After 100 epochs, what was the loss? Thanks.

@sourabhyadav

> Yeah, after 100 epochs AP results were improved.
> Thanks for awesome repo.
>
> after 100 epochs ,loss = ? ,thanks

@lzh18628137361 After approximately 50-60 epochs, the losses were around classification loss ≈ 0.00256 and reg loss ≈ 0.0018.

@yhenon

yhenon commented Jul 24, 2020

This issue is caused by the initialization of the classifier layer. In practice, it is desirable to initialize the classifier layer such that it predicts 0.01 for each box. Since the large majority of boxes are negative, this leads to a much lower loss.
Add the following to line 396 of efficientdet/model.py:

        self.header.pointwise_conv.conv.weight.data.fill_(0)
        self.header.pointwise_conv.conv.bias.data.fill_(-4.59)

because sigmoid(-4.59) ≈ 0.01.
This reduced the loss at epoch 0 from 50 to 2.
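A quick sanity check of that constant (a standalone sketch, not code from the repo): the RetinaNet-style prior initialization chooses the bias b so that sigmoid(b) equals a small prior probability π, i.e. b = -log((1 - π)/π), which for π = 0.01 gives b ≈ -4.595, the value filled in above.

```python
import math

# Choose the classifier bias so sigmoid(bias) equals a small prior pi.
# Since nearly all anchors are negative, starting the predicted
# probability near pi keeps the initial classification loss low.
pi = 0.01
bias = -math.log((1 - pi) / pi)

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

print(round(bias, 3))           # -4.595
print(round(sigmoid(bias), 4))  # 0.01
```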
