
Cls loss & Total loss too big #285

Open
ChuyiZhong opened this issue May 11, 2020 · 16 comments

@ChuyiZhong

[screenshot: training log (2020-05-11 2:22 PM)]
Hi all,
I am training it on my own dataset with one class, but the cls loss and total loss look strange to me. How can I fix this?
Thanks in advance.

@zylo117
Owner

zylo117 commented May 11, 2020

What are your training parameters and command? Try updating to the latest code and training with --head_only True.

@ChuyiZhong
Author

> what's the training parameter and command? Try update the latest code and train with --head_only True

python train.py -c 4 -p crowdhuman --batch_size 8 --lr 1e-5 --num_epochs 100
--load_weights /home/fudan/zhongchuyi/Yet-Another-EfficientDet-Pytorch/weights/efficientdet-d4.pth

@zylo117
Owner

zylo117 commented May 11, 2020

I think there is something wrong with the dataset. Can you run the tutorial? Try training on the shape dataset. If the cls loss drops below 10 in a few epochs, the code is fine. It takes about 5 minutes.

@ggaziv
Contributor

ggaziv commented May 11, 2020

@ChuyiZhong see #252

@sourabhyadav

> [screenshot: training log (2020-05-11 2:22 PM)]
> Hi all,
> I am training it on my own dataset with one class, but the cls loss and total loss seem strange to me. How to get this right?
> Thanks in advance.

I am facing a similar issue. The cls loss is in the range of hundreds after 2 epochs.

@zylo117
Owner

zylo117 commented May 12, 2020

That's not normal. It should drop below 1.0 very quickly.
Please check your dataset.
Can you get a good result from the tutorial on your server?

@sourabhyadav

sourabhyadav commented May 12, 2020

> It's not normal. It should drop to under 1.0 real soon.
> Please check your dataset.
> Can you get a good result from the tutorial on your server?

Here are the results after 10 epochs:

Step: 12709. Epoch: 9/10. Iteration: 1271/1271. Cls loss: 0.46582. Reg loss: 0.02957. Total loss: 0.49539: 100%|█| 1271/1271 [05:45<00:00,  3.68it/
Val. Epoch: 9/10. Classification loss: 0.53922. Regression loss: 0.03294. Total loss: 0.57216

This is for a 1-class model. I have cross-checked that my category_id = 1.
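One way to cross-check category ids like this is to compare the ids declared in the COCO-format annotations file against the ids the annotations actually use. A minimal standalone sketch (the inline dict stands in for a real annotations JSON, and `category_id_report` is a hypothetical helper name, not part of the repo):

```python
import json
from collections import Counter

def category_id_report(coco):
    """Return the declared category ids and the ids actually used by annotations."""
    declared = {c["id"]: c["name"] for c in coco["categories"]}
    used = Counter(a["category_id"] for a in coco["annotations"])
    return declared, used

# Tiny inline stand-in for a real COCO annotations file; with a real dataset
# you would load it instead: coco = json.load(open("instances_train.json"))
coco = {
    "categories": [{"id": 1, "name": "person"}],
    "annotations": [{"category_id": 1}, {"category_id": 1}],
}

declared, used = category_id_report(coco)
assert set(used) <= set(declared), "annotations reference undeclared category ids"
print(declared, dict(used))  # {1: 'person'} {1: 2}
```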

The command used for training is:

python3 train.py -c 1 -p pc_loockdown --data_path /home/object_det/datasets/  --batch_size 8 --lr 1e-5 --num_epochs 10 --head_only True --load_weights weights/efficientdet-d1.pth

@zylo117 Does it look normal?

@zylo117
Owner

zylo117 commented May 12, 2020

Yes, now it is.
BTW, did you update to the latest code? In previous versions, head_only may not have worked as expected.

@sourabhyadav

sourabhyadav commented May 12, 2020

Yes, I cloned today's repo.

One doubt still remains:
Why did my training classification loss start around 1000?

[Info] loaded weights: efficientdet-d1.pth, resuming checkpoint from step: 0
[Info] freezed backbone
Step: 499. Epoch: 0/10. Iteration: 500/1271. Cls loss: 1132.17944. Reg loss: 0.06527. Total loss: 1132.24475:  39%|▍| 499/1271 [02:16<03:11,  4.04icheckpoint...
Step: 999. Epoch: 0/10. Iteration: 1000/1271. Cls loss: 866.44366. Reg loss: 0.02792. Total loss: 866.47156:  79%|▊| 999/1271 [04:27<00:56,  4.78itcheckpoint...
Step: 1270. Epoch: 0/10. Iteration: 1271/1271. Cls loss: 901.25433. Reg loss: 0.02943. Total loss: 901.28375: 100%|█| 1271/1271 [05:37<00:00,  3.76
Val. Epoch: 0/10. Classification loss: 1058.99283. Regression loss: 0.03893. Total loss: 1059.03176
Step: 1499. Epoch: 1/10. Iteration: 229/1271. Cls loss: 693.96698. Reg loss: 0.03183. Total loss: 693.99878:  18%|▏| 228/1271 [01:05<04:42,  3.69itcheckpoint...
Step: 1999. Epoch: 1/10. Iteration: 729/1271. Cls loss: 616.92957. Reg loss: 0.03471. Total loss: 616.96429:  57%|▌| 728/1271 [03:21<02:08,  4.22itcheckpoint...
Step: 2499. Epoch: 1/10. Iteration: 1229/1271. Cls loss: 294.32889. Reg loss: 0.05017. Total loss: 294.37906:  97%|▉| 1228/1271 [05:36<00:11,  3.77checkpoint...
Step: 2541. Epoch: 1/10. Iteration: 1271/1271. Cls loss: 367.30658. Reg loss: 0.03857. Total loss: 367.34515: 100%|█| 1271/1271 [05:46<00:00,  3.67
Val. Epoch: 1/10. Classification loss: 385.38531. Regression loss: 0.03736. Total loss: 385.42267
Step: 2999. Epoch: 2/10. Iteration: 458/1271. Cls loss: 228.76134. Reg loss: 0.03491. Total loss: 228.79625:  36%|▎| 457/1271 [02:05<03:08,  4.32itcheckpoint...
Step: 3499. Epoch: 2/10. Iteration: 958/1271. Cls loss: 126.56357. Reg loss: 0.03671. Total loss: 126.60027:  75%|▊| 957/1271 [04:23<01:10,  4.43itcheckpoint...
Step: 3812. Epoch: 2/10. Iteration: 1271/1271. Cls loss: 92.13811. Reg loss: 0.03446. Total loss: 92.17256: 100%|█| 1271/1271 [05:44<00:00,  3.68it
Val. Epoch: 2/10. Classification loss: 115.62846. Regression loss: 0.03631. Total loss: 115.66477

@zylo117 Is this normal behaviour for this repo?

@zylo117
Owner

zylo117 commented May 12, 2020

It is. It might be caused by a low lr.
For your case, I think 1e-5 is a little too low.
But considering you are doing well so far, don't change it.

@sourabhyadav

@zylo117 Thanks for the prompt replies.
I ran inference on my custom test set and found a lower AP than the pre-trained network: pre-trained (80-class) person AP = 64.69%, fine-tuned (1-class) person AP = 30.26%.

Basically, the people in the dataset are seen from a look-down or top-down view. I have only 10k training images. I trained with the command above.

Why would this happen? My intuition is that since it is still the person class, just from a different viewpoint, the model should at least improve over the pre-trained one.

@zylo117
Owner

zylo117 commented May 12, 2020

If it's not overfitting, as your logs suggest, it's underfitting.
Maybe it hasn't converged yet.
Try increasing the learning rate and keep training.

@sourabhyadav

Yeah, after 100 epochs the AP results improved.
Thanks for the awesome repo.

@lzh18628137361

> Yeah, after 100 epochs AP results were improved.
> Thanks for awesome repo.

After 100 epochs, what was the loss? Thanks.

@sourabhyadav

> Yeah, after 100 epochs AP results were improved.
> Thanks for awesome repo.
>
> after 100 epochs ,loss = ? ,thanks

@lzh18628137361 After approximately 50-60 epochs, the losses were around classification loss ≈ 0.00256 and reg loss ≈ 0.0018.

@yhenon

yhenon commented Jul 24, 2020

This issue is caused by the initialization of the classifier layer. In practice, it is desirable to initialize the classifier layer such that it predicts 0.01 for each box. Since the large majority of boxes are negative, this leads to a much lower loss.
Add the following to line 396 of efficientdet/model.py:

        self.header.pointwise_conv.conv.weight.data.fill_(0)
        self.header.pointwise_conv.conv.bias.data.fill_(-4.59)

because sigmoid(-4.59) ≈ 0.01.
This reduced the loss at epoch 0 from 50 to 2.
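A quick sanity check of that constant (a standalone sketch, not code from the repo): the RetinaNet-style prior initialization chooses the bias b so that sigmoid(b) equals a small prior probability π, i.e. b = -log((1 - π)/π), which for π = 0.01 gives b ≈ -4.595, the value filled in above.

```python
import math

# Choose the classifier bias so sigmoid(bias) equals a small prior pi.
# Since nearly all anchors are negative, starting the predicted
# probability near pi keeps the initial classification loss low.
pi = 0.01
bias = -math.log((1 - pi) / pi)

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

print(round(bias, 3))           # -4.595
print(round(sigmoid(bias), 4))  # 0.01
```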
