
The reason why the classification accuracy is different from the result of the paper #1

Open
userDJX opened this issue Oct 30, 2019 · 15 comments

userDJX commented Oct 30, 2019

Thank you for your contribution. I would like to ask why there is such a large gap between the experimental results and those reported in the paper. I haven't been able to figure it out.

sairin1202 (Owner) commented

The reason might be the distillation loss that I did not implement.

userDJX (Author) commented Nov 5, 2019

I was only recently introduced to incremental learning. When I read your code, I found that you adjusted the parameters a little to bring the results closer to the paper. As for the distillation loss, I think what you wrote is consistent with the paper, so I still don't know the cause.

userDJX (Author) commented Nov 5, 2019

Excuse me, do you have any way to achieve the same level of results as the paper? I hope you can help me: [email protected]. Thank you.

sairin1202 (Owner) commented

I think the best way is to contact the author of that paper.

srvCodes commented Feb 7, 2020

A major issue with your implementation is that the layers of the main model are trainable even while adjusting the bias-correction parameters, when they should ideally be frozen. Likewise, the bias layer's parameters should be frozen while training the FC and convolutional layers.
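
For illustration, a minimal PyTorch sketch of this two-stage freezing; `model` and `bias_layer` are hypothetical names for the backbone and the bias-correction layer, not identifiers from this repo:

```python
import torch.nn as nn

def set_requires_grad(module: nn.Module, flag: bool) -> None:
    """Freeze (flag=False) or unfreeze (flag=True) every parameter of a module."""
    for p in module.parameters():
        p.requires_grad = flag

# Stage 1: train the conv/FC layers of the backbone; keep the bias layer frozen.
set_requires_grad(model, True)        # `model`: the main network (hypothetical name)
set_requires_grad(bias_layer, False)  # `bias_layer`: BiC's (alpha, beta) layer
# ... usual training loop on the new task's data ...

# Stage 2: fit only the bias-correction parameters on the held-out validation split.
set_requires_grad(model, False)
set_requires_grad(bias_layer, True)
# ... bias-correction training loop ...
```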

userDJX (Author) commented Jul 26, 2020

Hello, I seem to have found a problem with this code. If the exemplar set is removed and the bias correction is removed, the remaining part should be LwF, but when I run that LwF algorithm, I still find the results are wrong.

I'm thinking about two things: one is that the FC layer in this code directly outputs 10 classes; the other is the handling of the network parameters. I feel that as long as the accuracy of the LwF part is improved, the accuracy of this code will improve, but my ability is limited. I hope you can help me.

sairin1202 (Owner) commented

@srvCodes Thank you for pointing out my mistakes. I have changed the code. @userDJX After freezing the parameters during training, the results seem better. Thank you for your response and advice.

userDJX (Author) commented Jul 29, 2020

If you want to improve the accuracy of the incremental training, you can try changing the size of train_x from 9000 to 10000 in cifar100.py. Also, in the BiC algorithm, the paper only corrects the bias of the new classes; the bias of the old classes is not changed, which you can check. The last point is that you need to wrap the previous_model forward pass in torch.no_grad(), or call self.previous_model.eval().
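
A minimal sketch of that last point, using the variable names from this thread (the surrounding training loop is assumed):

```python
import torch

# The old network only supplies distillation targets; it must never be updated.
previous_model.eval()        # fix batch-norm statistics and disable dropout
with torch.no_grad():        # block gradient flow into the old network
    old_logits = previous_model(images)
```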

userDJX (Author) commented Jul 29, 2020

I hope your algorithm can match the results of the paper as soon as possible.

userDJX (Author) commented Jul 29, 2020

I reproduced it successfully; the results were 0.817, 0.7265, 0.6555, 0.5971, 0.5561. I checked the experimental section of the BiC paper again and found that the authors might have deliberately chosen their best runs to report. The reason is that in Figure 8 of the paper, accuracy on the first 20 classes does not increase; if the same model, such as ResNet, is selected for training, then the purple curve is unlikely to be 2% higher than the other curves, such as iCaRL.

sairin1202 (Owner) commented

It's possible. Thank you for your help!

srvCodes commented Aug 1, 2020

I have incorporated the same with a dynamic model and a couple of other details, e.g., the authors say that the bias correction should be done only after the second incremental batch has arrived. You can find the implementation at https://github.com/srvCodes/continual-learning-benchmark. @sairin1202 - thanks for having made your code public; this would not have been possible without it. 👍
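
For reference, a sketch of that scheduling detail (the loop and function names here are hypothetical, not from either repository):

```python
for task_id, task in enumerate(tasks):
    train_main_model(task.train_data)   # stage 1 runs on every incremental batch
    if task_id >= 1:                    # stage 2 only from the second batch on:
        train_bias_layer(task.val_data) # the first batch has no old/new imbalance
```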

EdenBelouadah commented

Hello,
Please, I wonder why you multiply the distillation loss by T² and not by alpha?

I would say that the original formula of distillation is:

loss = alpha * loss_soft_target + (1-alpha) * loss_hard_target

instead of

loss = loss_soft_target * T * T + (1-alpha) * loss_hard_target

bwolfson97 commented

@EdenBelouadah I think they scale the distillation loss by T² because that's what they say to do in the original knowledge distillation paper when using both soft and hard targets in the loss:

"Since the magnitudes of the gradients produced by the soft targets scale as 1/T² it is important to multiply them by T² when using both hard and soft targets. This ensures that the relative contributions of the hard and soft targets remain roughly unchanged if the temperature used for distillation is changed while experimenting with meta-parameters."

(See the last paragraph of Section 2, "Distillation", here.)

I don't think they do this scaling in the original Large Scale Incremental Learning paper, though. (See the calculation of the loss here.) It looks like in the original implementation, they use:

loss = alpha * loss_soft_target + (1-alpha) * loss_hard_target

as described in the paper.
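
For reference, a sketch contrasting the two weightings under discussion (T and alpha are hyperparameters; the helper below is an assumed standard KD soft-target term, not code from either paper):

```python
import torch.nn.functional as F

def soft_target_loss(student_logits, teacher_logits, T):
    # KL divergence between temperature-softened distributions.
    # Its gradients scale as 1/T^2, which motivates the T*T factor below.
    log_p = F.log_softmax(student_logits / T, dim=1)
    q = F.softmax(teacher_logits / T, dim=1)
    return F.kl_div(log_p, q, reduction="batchmean")

# Hinton et al.: rescale the soft term by T^2 when mixing with hard targets.
#   loss = soft_target_loss(s, t, T) * T * T + loss_hard_target
# BiC paper: convex combination weighted by alpha.
#   loss = alpha * soft_target_loss(s, t, T) + (1 - alpha) * loss_hard_target
```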

EdenBelouadah commented Mar 11, 2021

> @EdenBelouadah I think they scale the distillation loss by T² because that's what they say to do in the original knowledge distillation paper when using both soft and hard targets in the loss: [...] It looks like in the original implementation, they use loss = alpha * loss_soft_target + (1-alpha) * loss_hard_target, as described in the paper.

Thank you for the answer. I understand the use of T². However, the distillation used here is:

loss = loss_soft_target * T * T + (1-alpha) * loss_hard_target

I still don't understand why "loss_hard_target" is multiplied by (1-alpha). alpha is supposed to weight the contribution of the distillation vs. the classification loss, isn't it? (I mean, shouldn't we multiply "loss_soft_target * T * T" by alpha?)
Thank you
