Loss is not decreasing #43

lucasjinreal opened this issue Nov 7, 2018 · 19 comments

@lucasjinreal
I have trained SSD with MobileNetV2 on VOC, but after almost 500 epochs the loss still looks like this:

517/518 in 0.154s [##########] | loc_loss: 1.4773 cls_loss: 2.3165

==>Train: || Total_time: 79.676s || loc_loss: 1.1118 conf_loss: 2.3807 || lr: 0.000721

Wrote snapshot to: ./experiments/models/ssd_mobilenet_v2_voc/ssd_lite_mobilenet_v2_voc_epoch_525.pth
Epoch 526/1300:
0/518 in 0.193s [----------] | loc_loss: 0.8291 cls_loss: 1.9464
1/518 in 0.186s [----------] | loc_loss: 1.3181 cls_loss: 2.5404
2/518 in 0.184s [----------] | loc_loss: 1.0371 cls_loss: 2.2243

It doesn't change, and the loss is very high... What's wrong with the implementation?

@1453042287

Did you load the pre-trained weights? It works fine with my dataset.

@1453042287

Or maybe you didn't set the mode to train (rather than test) in the config file.

@blueardour

blueardour commented Dec 3, 2018

@jinfagang Have you solved the problem? I have the same issue.

@1453042287 I trained yolov2-mobilenet-v2 from scratch. You mentioned a 'pre-trained model'; do you mean the pre-trained backbone network model (such as MobileNetV2), or both the backbone model and the detection model? In my training, none of the parameters were pre-trained.

@1453042287

@blueardour First, make sure you change PHASE in the .yml file to 'train'. Then, actually, I believe it's inappropriate to train a model from scratch, so at the very least you should load a pre-trained backbone. I just used the whole pre-trained weight file the author provided (including the backbone, extras, and so on), but I set RESUME_SCOPE in the .yml file to 'base' only, and the result is almost the same as fine-tuning.
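For readers unfamiliar with how a scope-restricted resume works, here is a minimal sketch of loading only the parameters whose names fall under a given scope. The helper name load_scope and the prefix-matching rule are assumptions for illustration; the repo's actual RESUME_SCOPE handling may differ.

import torch

def load_scope(model, checkpoint_path, scope='base'):
    # Load only parameters whose names start with one of the scope prefixes.
    # (Hypothetical helper; shown to illustrate the idea behind RESUME_SCOPE.)
    state = torch.load(checkpoint_path, map_location='cpu')
    prefixes = tuple(s.strip() for s in scope.split(','))
    wanted = {k: v for k, v in state.items() if k.startswith(prefixes)}
    # strict=False leaves every other parameter at its current (random) init.
    model.load_state_dict(wanted, strict=False)

# Usage: keep the pre-trained backbone, retrain everything else.
# load_scope(model, 'vgg16_fssd_coco_27.2.pth', scope='base')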

@blueardour

blueardour commented Dec 5, 2018

@1453042287 Hi, thanks for the advice. My current training seems to be working.
In my previous training, I put 'base', 'loc', and so on all in the trainable_scope, and it did not give a good result. After reloading only 'base' and retraining the other parameters, I successfully recovered the precision.

My only remaining problem is test speed. The NMS in the test procedure seems very slow. It has been discussed in #16, yet there are no good solutions.
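If the bottleneck is a pure-Python NMS loop, one commonly suggested fix (an assumption here, not verified against this repo's code) is to call the C++/CUDA implementation shipped with torchvision:

import torch
from torchvision.ops import nms

# boxes: [N, 4] as (x1, y1, x2, y2); scores: [N] confidence per box.
# Random tensors stand in for real detections in this sketch.
boxes = torch.rand(1000, 4) * 300
boxes[:, 2:] += boxes[:, :2]          # guarantee x2 > x1 and y2 > y1
scores = torch.rand(1000)

keep = nms(boxes, scores, iou_threshold=0.6)   # indices of surviving boxes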

@cvtower

cvtower commented Dec 7, 2018


@blueardour Hi, below is my test result for fssd_mobilenet_v2 on coco2017, using my own config files instead of the given one, training from scratch without any pre-trained model.
Shall I reload only the 'base' parameters here?

 Average Precision  (AP) @[ IoU=0.50:0.95 | area=   all | maxDets=100 ] = 0.211
 Average Precision  (AP) @[ IoU=0.50      | area=   all | maxDets=100 ] = 0.358
 Average Precision  (AP) @[ IoU=0.75      | area=   all | maxDets=100 ] = 0.217
 Average Precision  (AP) @[ IoU=0.50:0.95 | area= small | maxDets=100 ] = 0.044
 Average Precision  (AP) @[ IoU=0.50:0.95 | area=medium | maxDets=100 ] = 0.234
 Average Precision  (AP) @[ IoU=0.50:0.95 | area= large | maxDets=100 ] = 0.351
 Average Recall     (AR) @[ IoU=0.50:0.95 | area=   all | maxDets=  1 ] = 0.216
 Average Recall     (AR) @[ IoU=0.50:0.95 | area=   all | maxDets= 10 ] = 0.343
 Average Recall     (AR) @[ IoU=0.50:0.95 | area=   all | maxDets=100 ] = 0.371
 Average Recall     (AR) @[ IoU=0.50:0.95 | area= small | maxDets=100 ] = 0.099
 Average Recall     (AR) @[ IoU=0.50:0.95 | area=medium | maxDets=100 ] = 0.428
 Average Recall     (AR) @[ IoU=0.50:0.95 | area= large | maxDets=100 ] = 0.590
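For reference, this is the standard COCO evaluation summary; a minimal sketch of reproducing it with pycocotools, assuming the detections have already been written to a results JSON (file names below are placeholders):

from pycocotools.coco import COCO
from pycocotools.cocoeval import COCOeval

coco_gt = COCO('annotations/instances_val2017.json')   # ground truth
coco_dt = coco_gt.loadRes('detections_val2017.json')   # model detections

ev = COCOeval(coco_gt, coco_dt, iouType='bbox')
ev.evaluate()
ev.accumulate()
ev.summarize()   # prints the AP/AR table in the format above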

@cvtower

cvtower commented Dec 7, 2018

OK... it seems training from scratch might not be well supported.
But I just want to use this repo to verify my network architecture, and my ImageNet pre-trained model is still training.

@blueardour

Yes, setting all parameters to trainable seems to make convergence hard. This year, Kaiming He published a paper named 'Rethinking ImageNet Pre-training', which claimed that pre-training on ImageNet is not necessary. However, it takes skill to give the network a good initialization.

@cvtower

cvtower commented Dec 10, 2018

Yes, I agree with you.
I read that paper the day it was published.
My own designed network outperforms several networks (on ImageNet/CIFAR...), but the ImageNet training is still going on (72.5 1.0). I have also verified my network on other tasks and it works fine, so I believe it will get better results on detection && segmentation tasks too.
Personally, I greatly agree with the views in "DetNet" and "Rethinking ImageNet Pre-training"; however, it seems that much more computation cost and specific tuning skills are needed.
Until my ImageNet training is finished, I will have to compare SSD performance based on models trained from scratch first.

@blueardour

Hi, @1453042287 @cvtower

I have another issue, about the training precision and loss curves. The following is the result from tensorboardX.

[screenshot: TensorBoard curves of precision, loc/cls loss, and learning rate]

It can be seen that the precision increases slowly and then jumps at around the 89th epoch. I don't know why the precision changes so dramatically at this point; the loc and cls losses, as well as the learning rate, do not seem to change much. Do you observe a similar phenomenon, or do you have any explanation for it?

@cvtower

cvtower commented Dec 12, 2018


Hi @blueardour,

I did not use the CosineAnnealing LR, and no such phenomenon ever happened during my training.
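For context, an SGDR-style scheduler (the SCHEDULER: SGDR that appears in this repo's config files later in this thread) periodically restarts the learning rate at the end of each cosine cycle, which could produce this kind of step change near a cycle boundary. A minimal PyTorch sketch of the idea (the stand-in model, the T_0 value, and per-epoch stepping are assumptions for illustration):

import torch
from torch.optim.lr_scheduler import CosineAnnealingWarmRestarts

model = torch.nn.Linear(10, 2)   # stand-in model for illustration
optimizer = torch.optim.SGD(model.parameters(), lr=0.001, momentum=0.9)
# SGDR: cosine annealing with warm restarts; T_0 = epochs per cycle.
scheduler = CosineAnnealingWarmRestarts(optimizer, T_0=30, T_mult=1)

for epoch in range(90):
    # ... train one epoch ...
    scheduler.step()   # the LR snaps back to 0.001 at each restart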

@XiaSunny

Hello, I'd like to ask: how did you obtain the pre-trained weight files the author provides? I don't have a weight directory, so I don't have the pre-trained weight files either. Or did you get them some other way? Thank you! @1453042287

@1453042287

@XiaSunny Download them... They are right in this repo's README, the blue links.

@XiaSunny

@1453042287 OK, thank you.

@XiaSunny

XiaSunny commented Mar 13, 2019

Hello, the config file I use is fssd_vgg16_train_coco.yml. When I train on coco2017, conf_loss stays around 5 and loc_loss around 2, and they never go down. My config file is as follows:

MODEL:
  SSDS: fssd
  NETS: vgg16
  IMAGE_SIZE: [300, 300]
  NUM_CLASSES: 81
  FEATURE_LAYER: [[[22, 34, 'S'], [512, 1024, 512]],
                  [['', 'S', 'S', 'S', '', ''], [512, 512, 256, 256, 256, 256]]]
  STEPS: [[8, 8], [16, 16], [32, 32], [64, 64], [100, 100], [300, 300]]
  SIZES: [[30, 30], [60, 60], [111, 111], [162, 162], [213, 213], [264, 264], [315, 315]]
  ASPECT_RATIOS: [[1, 2, 3], [1, 2, 3], [1, 2, 3], [1, 2, 3], [1, 2], [1, 2]]

TRAIN:
  MAX_EPOCHS: 500
  CHECKPOINTS_EPOCHS: 1
  BATCH_SIZE: 28
  TRAINABLE_SCOPE: 'norm,extras,transforms,pyramids,loc,conf'
  RESUME_SCOPE: 'base'
  OPTIMIZER:
    OPTIMIZER: sgd
    LEARNING_RATE: 0.001
    MOMENTUM: 0.9
    WEIGHT_DECAY: 0.0001
  LR_SCHEDULER:
    SCHEDULER: SGDR
    WARM_UP_EPOCHS: 150

TEST:
  BATCH_SIZE: 64
  TEST_SCOPE: [90, 100]

MATCHER:
  MATCHED_THRESHOLD: 0.5
  UNMATCHED_THRESHOLD: 0.5
  NEGPOS_RATIO: 3

POST_PROCESS:
  SCORE_THRESHOLD: 0.01
  IOU_THRESHOLD: 0.6
  MAX_DETECTIONS: 100

DATASET:
  DATASET: 'coco'
  DATASET_DIR: '/home/chase/Downloads/ssds.pytorch-master/data/coco'
  TRAIN_SETS: [['2017', 'train']]
  TEST_SETS: [['2017', 'val']]
  PROB: 0.6

EXP_DIR: './experiments/models/fssd_vgg16_coco'
LOG_DIR: './experiments/models/fssd_vgg16_coco'
RESUME_CHECKPOINT: '/home/chase/Downloads/ssds.pytorch-master/weight/vgg16_fssd_coco_27.2.pth'
PHASE: ['train']

In addition, I also tried RESUME_CHECKPOINT: vgg16_reducedfc.pth, but the result was about the same. This problem has troubled me for a long time and I don't know what is going on; I hope you can give me some pointers. @1453042287 @blueardour @cvtower
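For context on what conf_loss measures here: SSD-style training with NEGPOS_RATIO: 3 keeps only the three hardest negative anchors per positive when computing the classification loss. A minimal sketch of that hard-negative-mining step, assuming per-anchor losses have already been computed (this is the standard SSD recipe, not necessarily this repo's exact code):

import torch

def hard_negative_mining(conf_loss, pos_mask, neg_pos_ratio=3):
    # conf_loss: [batch, num_anchors] per-anchor classification loss
    # pos_mask:  [batch, num_anchors] bool, anchors matched to ground truth
    num_pos = pos_mask.sum(dim=1, keepdim=True)
    num_neg = num_pos * neg_pos_ratio
    loss = conf_loss.clone()
    loss[pos_mask] = float('-inf')            # positives never picked as negatives
    _, idx = loss.sort(dim=1, descending=True)
    _, rank = idx.sort(dim=1)                 # rank of each anchor by loss
    neg_mask = rank < num_neg                 # keep only the hardest negatives
    return pos_mask | neg_mask                # anchors contributing to conf_loss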

@Damon2019

@XiaSunny Hello, I've run into the same problem. Have you solved it?

@Damon2019

@1453042287 @XiaSunny Hello, I want to use the pre-trained model.

TRAINABLE_SCOPE: 'base,norm,extras,loc,conf'
RESUME_SCOPE: 'base,norm,extras,loc,conf'

How should I modify these parameters? Thank you!

@XiaSunny

XiaSunny commented Dec 2, 2019 via email

@Bobby2090

Hello, I have also recently run into the problem of the loss not decreasing during training; it stays around 4. I downloaded the model, made no changes, and only reloaded 'base' for training. May I ask how you finally solved it? Many thanks!
