I have a question regarding the student networks in your method. In AM-RADIO, the student networks are all trained from scratch. I’m wondering if you have tried initializing the student network with pre-trained weights before starting the distillation process? For example, using the pre-trained teachers (DINOv2, CLIP, SAM) to distill a standard ViT-L initialized with pre-trained weights from OpenAI's CLIP.
The reason I ask is that training a student network from scratch requires substantial training data and computing resources. In contrast, initializing the student with pre-trained parameters might reduce the computational burden and potentially lead to better results, assuming it is done correctly.
If I wanted to use RADIO for this kind of distillation task, what would you recommend? Specifically, I'm interested in learning rate settings, the choice of training dataset for distillation, and any other adjustments that might be beneficial.
Thank you for your time; I look forward to your reply!
Hi, yes, we've done quite a few experiments with initializing from existing weights. The ViT-g/14 model we released was actually initialized from DINOv2-g-reg.
That said, I've seen mixed/inconclusive results as to whether random initialization or pretrained initialization is better. Across our metric suite, initial convergence is rapid with the pretrained models, but random init eventually catches up by the end of training. However, I've mostly observed this with the >= ViT-L models; there's a decent chance that smaller models could benefit from pretrained init.
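For concreteness, here's a minimal sketch of what starting the student from pretrained weights could look like using timm; this isn't our training code, and the model tag below (an OpenAI CLIP ViT-L tower) is just an illustrative assumption.

```python
import timm

# Minimal sketch (not the RADIO training code): build a ViT-L student either
# from random init or from existing pretrained weights. The timm tag below is
# an assumed example; substitute whatever checkpoint you want to start from.
student = timm.create_model(
    "vit_large_patch14_clip_224.openai",  # assumed timm tag for a CLIP ViT-L/14 tower
    pretrained=True,   # set to False to train from scratch instead
    num_classes=0,     # drop the classification head; distillation uses the features
)
```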
We haven't done much of a hyperparameter sweep; instead, we've mostly relied on what appear to be common settings (sketched in code below):
LR: 1e-3 with a cosine annealing schedule (no restarts)
Weight Decay: 1e-2
Dataset: Depends on resources. If you're compute-constrained, ImageNet-1k seems to give much more rapid initial convergence. In the long run, DataComp-1B ultimately seems to work better.
If you have a target domain in mind, then biasing the training data toward that domain may also be helpful.
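If it helps, here's a rough PyTorch sketch of those settings; the choice of AdamW, and the `student` and `total_steps` names, are assumptions on my part rather than our exact training setup.

```python
import torch

# Rough sketch of the settings above (not our exact training code):
# an AdamW-style optimizer at LR 1e-3 with weight decay 1e-2, decayed by a
# single cosine annealing schedule with no restarts. `student` and
# `total_steps` are placeholders for your model and total optimizer steps.
optimizer = torch.optim.AdamW(student.parameters(), lr=1e-3, weight_decay=1e-2)
scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=total_steps)

# Inside the training loop, call scheduler.step() after each optimizer.step().
```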