
initializing the student network with pre-trained weights #124

Open
RuixiangZhao opened this issue Feb 13, 2025 · 1 comment

Comments

@RuixiangZhao

  • I have a question regarding the student networks in your method. In AM-RADIO, the student networks are all trained from scratch. I’m wondering whether you have tried initializing the student network with pre-trained weights before starting the distillation process. For example, using the pre-trained teachers (DINOv2, CLIP, SAM) to distill a standard ViT-L initialized with pre-trained weights from OpenAI's CLIP (see the sketch after this list).

  • The reason I ask is that training a student network from scratch requires substantial training data and computing resources. In contrast, initializing the student with pre-trained parameters might reduce the computational burden and potentially lead to better results, assuming it is done correctly.

  • If I wanted to use RADIO for this kind of distillation task, what would you recommend? Specifically, in terms of learning-rate settings, the choice of training dataset for distillation, or any other adjustments that might be beneficial.
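
Concretely, a rough sketch of the setup I have in mind (the timm model tag below is an assumption on my part, not something from the RADIO codebase):

```python
# Hypothetical sketch: start the student from OpenAI CLIP ViT-L/14 weights
# instead of random init, then hand it to an existing distillation loop.
import timm
import torch

# Student backbone: ViT-L/14 with the OpenAI CLIP image-tower weights.
# num_classes=0 removes the head so the model emits features for the
# distillation losses rather than logits.
student = timm.create_model(
    "vit_large_patch14_clip_224.openai",  # assumed timm tag for OpenAI CLIP ViT-L/14
    pretrained=True,
    num_classes=0,
)

# Sanity check: the backbone should produce a feature vector.
with torch.no_grad():
    feats = student(torch.randn(1, 3, 224, 224))
print(feats.shape)  # e.g. torch.Size([1, 1024]) for ViT-L
```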

Thank you for your time, and I am looking forward to your reply!

@mranzinger
Collaborator

Hi, yes, we've done quite a few experiments with initializing from existing weights. The ViT-g/14 model we released was actually initialized from DINOv2-g-reg.

That said, I've seen mixed/inconclusive results as to whether random initialization or pretrained initialization is better. Across our metric suite, it appears that initial convergence is rapid with the pretrained models, but random init eventually catches up by the end of training. However, I've mostly been watching this with the >= ViT-L models, so there's a decent chance that the smaller models could benefit from pretrained init.
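
A minimal sketch of what that kind of init looks like (not our actual training code; the torch.hub entry point and its `pretrained` flag are assumptions):

```python
# Pretrained init for the student: pull DINOv2-g/14 with registers via
# torch.hub and copy its weights into a same-architecture student before
# distillation starts.
import torch

# Checkpoint used purely as the student's initializer.
init_model = torch.hub.load("facebookresearch/dinov2", "dinov2_vitg14_reg")

# Stand-in for the student: same architecture, random weights.
# (Assumes the DINOv2 hubconf accepts pretrained=False; otherwise build the
# student ViT-g/14 with your own model code.)
student = torch.hub.load(
    "facebookresearch/dinov2", "dinov2_vitg14_reg", pretrained=False
)

# strict=False tolerates keys that don't line up (extra heads, adaptors, ...).
missing, unexpected = student.load_state_dict(init_model.state_dict(), strict=False)
print(f"missing keys: {len(missing)}, unexpected keys: {len(unexpected)}")
```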

We haven't done much of a hyperparameter sweep of the space, instead relying on what appear to be common settings (a PyTorch sketch of the optimizer/schedule follows the list):

LR: 1e-3 with a cosine annealing schedule (no restarts)
Weight Decay: 1e-2
Dataset: Depends on resources. If you're compute-constrained, ImageNet-1k gives much more rapid initial convergence; in the long run, DataComp-1B ultimately works better.
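
In PyTorch terms, those settings look roughly like this (a sketch only; the student model and step count are placeholders for your own setup):

```python
# Optimizer and LR schedule matching the settings above:
# AdamW, lr=1e-3, weight decay=1e-2, cosine annealing with no restarts.
import torch

student = torch.nn.Linear(16, 16)  # placeholder for the real student backbone
total_steps = 100_000              # placeholder for the full training length

optimizer = torch.optim.AdamW(
    student.parameters(),
    lr=1e-3,            # LR: 1e-3
    weight_decay=1e-2,  # Weight Decay: 1e-2
)

# Cosine annealing over the whole run, no warm restarts.
scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(
    optimizer, T_max=total_steps, eta_min=0.0
)

# Inside the distillation loop, per step:
#   loss.backward(); optimizer.step(); scheduler.step(); optimizer.zero_grad()
```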

If you have a target domain in mind, then perhaps biasing the training mix more toward that data would be helpful.
