Hi, thanks for your great work!
I'm confused about the setting of the DeiT ablation experiment below:
As you can see, the DeiT– usual distillation and DeiT– hard distillation rows don't use the GT labels for training?
But in an earlier version of the paper, the setting is the opposite, which indicates that the GT labels are used for training. Like this:
In this experiment, the results indicate that a model supervised by the teacher's output is better than one supervised by the GT. Is that right?
Could you explain the reason for this phenomenon? Looking forward to your reply! :)
Hi @Berry-Wu ,
Thanks for your message.
Sorry, the table is maybe not very clear: we do use the GT labels with the different distillation approaches.
The advantage of distillation is that it can adapt to data augmentation, which can make the label noisy (see the example below).
Best,
Hugo
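
For reference, a minimal sketch of the two distillation objectives described in the DeiT paper, showing how the GT label is combined with the teacher signal in both variants. This is not the repository's actual code; the function names, and the `tau` and `lam` defaults, are illustrative.

```python
import torch
import torch.nn.functional as F

def soft_distillation_loss(student_logits, teacher_logits, y, tau=3.0, lam=0.5):
    # (1 - lam) * CE with the GT label y
    # + lam * tau^2 * KL between the softened student and teacher distributions.
    ce = F.cross_entropy(student_logits, y)
    kl = F.kl_div(
        F.log_softmax(student_logits / tau, dim=-1),
        F.log_softmax(teacher_logits / tau, dim=-1),
        reduction="batchmean",
        log_target=True,
    ) * (tau * tau)
    return (1.0 - lam) * ce + lam * kl

def hard_distillation_loss(student_logits, teacher_logits, y):
    # 0.5 * CE with the GT label y + 0.5 * CE with the teacher's hard prediction,
    # so the GT label is still used even in the "hard distillation" setting.
    teacher_label = teacher_logits.argmax(dim=-1)
    return 0.5 * F.cross_entropy(student_logits, y) \
         + 0.5 * F.cross_entropy(student_logits, teacher_label)
```

Because the teacher sees the same augmented crop as the student, its prediction can be a more consistent target than the original GT label when aggressive augmentation changes what is actually visible in the image.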