Short summary of “SwAV”
This method clusters the data while enforcing consistency between the cluster assignments produced for different augmentations (or views) of the same image, instead of comparing features directly as in contrastive learning.
The paper proposes a simple “swapped” prediction problem in which the code of one view is predicted from the representation of another view. Features are learned by Swapping Assignments between multiple Views of the same image (SwAV).
The paper also proposes a new data augmentation strategy, multi-crop, which uses a mix of views with different resolutions in place of two full-resolution views, without increasing memory or compute requirements.
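The multi-crop idea can be illustrated with a minimal sketch. It assumes plain random crops at two fixed sizes, with no resizing or other augmentation, which is a simplification; the function names, crop counts, and crop sizes below are hypothetical and are not taken from the summary.

```python
import numpy as np

def random_crop(image, size, rng):
    """Cut a random size x size window out of an H x W x C image."""
    h, w = image.shape[:2]
    top = rng.integers(0, h - size + 1)   # upper bound is exclusive
    left = rng.integers(0, w - size + 1)
    return image[top:top + size, left:left + size]

def multi_crop(image, rng, n_large=2, n_small=4, large_size=160, small_size=64):
    """Return a few large views followed by several smaller, cheaper views.

    The counts and sizes are illustrative assumptions, not values from the source.
    """
    large = [random_crop(image, large_size, rng) for _ in range(n_large)]
    small = [random_crop(image, small_size, rng) for _ in range(n_small)]
    return large + small

# Usage: six views of one dummy image, two large and four small.
rng = np.random.default_rng(0)
image = np.zeros((224, 224, 3), dtype=np.float32)
views = multi_crop(image, rng)
```

The small views contain far fewer pixels than the large ones, which is why extra views can be added cheaply.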
Given two image features z_t and z_s from two different augmentations of the same image, their codes q_t and q_s are computed by matching these features to a set of K prototypes {c_1, ..., c_K}. A “swapped” prediction problem is then set up with the following loss function:
L(z_t, z_s) = l(z_t, q_s) + l(z_s, q_t),
where the function l(z, q) measures the fit between features z and a code q.
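The swapped loss can be sketched as follows. The summary does not spell out l(z, q), so this sketch assumes a cross-entropy between the code q and a softmax over the dot products of z with the prototypes, scaled by a temperature. The names `fit`, `swapped_loss`, and `temperature` are illustrative, and the codes are taken as given inputs, since how they are computed is outside this sketch.

```python
import numpy as np

def log_softmax(x):
    """Numerically stable log-softmax over the last axis of a (B, K) array."""
    x = x - x.max(axis=1, keepdims=True)
    return x - np.log(np.exp(x).sum(axis=1, keepdims=True))

def fit(z, q, prototypes, temperature=0.1):
    """Assumed form of l(z, q): cross-entropy between codes q and the
    softmax of feature-prototype similarities, averaged over the batch."""
    scores = z @ prototypes.T / temperature        # (B, K)
    return -(q * log_softmax(scores)).sum(axis=1).mean()

def swapped_loss(z_t, z_s, q_t, q_s, prototypes, temperature=0.1):
    """L(z_t, z_s) = l(z_t, q_s) + l(z_s, q_t): each view predicts the other's code."""
    return (fit(z_t, q_s, prototypes, temperature)
            + fit(z_s, q_t, prototypes, temperature))

# Usage with random stand-in data: B=8 features of dimension D=16, K=5 prototypes.
rng = np.random.default_rng(0)
B, D, K = 8, 16, 5
z_t, z_s = rng.normal(size=(B, D)), rng.normal(size=(B, D))
prototypes = rng.normal(size=(K, D))
# Placeholder codes: random rows that sum to 1 (not the method's actual codes).
q_t = rng.dirichlet(np.ones(K), size=B)
q_s = rng.dirichlet(np.ones(K), size=B)
loss = swapped_loss(z_t, z_s, q_t, q_s, prototypes)
```

Note that the loss is symmetric in the two views: swapping (z_t, q_t) with (z_s, q_s) leaves it unchanged.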