Short summary of “DINO”

Poonam Saini
2 min read · Jun 13, 2024


DINO (self-DIstillation with NO labels) is a self-supervised learning method.

The paper studies the impact of self-supervised pretraining on ViT features.

It also highlights the importance of using smaller patches with ViTs to improve the quality of the resulting features.

DINO simplifies self-supervised training by directly predicting the output of a teacher network — built with a momentum encoder — by using a standard cross-entropy loss.

Figure from the paper

The approach is related to codistillation, where student and teacher share the same architecture and use distillation during training. In codistillation, however, the teacher also distills from the student, whereas in DINO the teacher is instead updated with an average of the student's weights.


Knowledge distillation is a learning paradigm where we train a student network gθs to match the output of a given teacher network gθt, parameterized by θs and θt respectively.
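A minimal sketch of this distillation objective: the outputs of both networks are turned into probability distributions with a softmax, and the student is trained to match the teacher under a cross-entropy loss. The toy logits below are placeholders, not values from the paper.

```python
import math

def softmax(logits, temperature=1.0):
    # Convert raw network outputs into a probability distribution.
    exps = [math.exp(z / temperature) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def cross_entropy(p_teacher, p_student):
    # H(P_t, P_s) = -sum_k P_t(k) * log P_s(k)
    return -sum(pt * math.log(ps) for pt, ps in zip(p_teacher, p_student))

# Toy logits standing in for g_theta_t(x) and g_theta_s(x).
teacher_logits = [2.0, 0.5, -1.0]
student_logits = [1.5, 0.7, -0.5]

loss = cross_entropy(softmax(teacher_logits), softmax(student_logits))
```

In practice only the student side of the loss is backpropagated; gradients are never propagated through the teacher.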

First, different distorted views, or crops, of an image are constructed with the multi-crop strategy. More precisely, from a given image we generate a set V of different views. This set contains two global views, x₁ᵍ and x₂ᵍ, at higher resolution covering a large region of the image (for example more than 50%), and several local views at lower resolution covering small regions (for example less than 50%). All crops are passed through the student, while only the global views are passed through the teacher, therefore encouraging “local-to-global” correspondences.
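Which view pairs contribute to the loss can be sketched without any image code: the teacher only processes global views, the student processes every view, and a cross-entropy term is summed over each teacher/student pair of distinct views. The view counts below (2 global, 6 local) are an assumption matching a common multi-crop setup, not a fixed requirement.

```python
def multicrop_pairs(n_global=2, n_local=6):
    # Teacher sees only the global views; the student sees all views.
    # The total loss sums a cross-entropy term over every pair (v, v')
    # with v a teacher view, v' a student view, and v != v'.
    global_views = [f"g{i}" for i in range(n_global)]
    local_views = [f"l{i}" for i in range(n_local)]
    all_views = global_views + local_views
    return [(t, s) for t in global_views for s in all_views if t != s]

pairs = multicrop_pairs()
# With 2 global and 6 local views: 2 teacher views x 7 student views = 14 terms.
```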

The parameters θs are learned by minimizing the loss with stochastic gradient descent, while θt is updated with the momentum encoder, i.e. an exponential moving average (EMA) of the student: θt ← λθt + (1 − λ)θs.
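The EMA update can be sketched in a few lines; the momentum value 0.996 is one of the values used in the paper's schedule, and the two-element parameter vectors are toy placeholders.

```python
def ema_update(theta_t, theta_s, momentum=0.996):
    # Momentum encoder rule: theta_t <- lambda * theta_t + (1 - lambda) * theta_s.
    # No gradient flows through this update; the teacher is never trained directly.
    return [momentum * t + (1 - momentum) * s for t, s in zip(theta_t, theta_s)]

# Toy parameter vectors standing in for the network weights.
teacher = [0.0, 1.0]
student = [1.0, 0.0]
teacher = ema_update(teacher, student)
# Teacher drifts only slightly toward the student each step.
```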

The teacher network has better performance than the student throughout training and hence guides the training of the student by providing target features of higher quality.

DINO avoids model collapse by centering and sharpening the outputs of the momentum teacher. Centering amounts to adding a bias term c to the teacher output, where c is updated with an exponential moving average of the batch statistics. Sharpening is obtained by using a low value for the temperature in the teacher's softmax normalization.
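A minimal sketch of these two operations, following the paper's pseudocode, where the running center is subtracted from the teacher logits before a low-temperature softmax. The specific values (teacher temperature 0.04, center momentum 0.9) match defaults reported in the paper; the logits are toy placeholders.

```python
import math

def softmax(logits, temperature):
    exps = [math.exp(z / temperature) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def teacher_output(logits, center, tau_t=0.04):
    # Centering: apply the running center c to the teacher logits.
    # Sharpening: a low teacher temperature tau_t peaks the distribution.
    return softmax([z - c for z, c in zip(logits, center)], tau_t)

def update_center(center, batch_logits, m=0.9):
    # c <- m * c + (1 - m) * (mean of the teacher logits over the batch).
    batch_mean = [sum(col) / len(batch_logits) for col in zip(*batch_logits)]
    return [m * c + (1 - m) * b for c, b in zip(center, batch_mean)]

probs = teacher_output([1.0, 0.0, -1.0], [0.0, 0.0, 0.0])
```

Centering alone would push the output toward a uniform distribution, while sharpening alone would collapse it onto one dimension; applying both together is what prevents collapse.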