PaperSummary14: SupMAE
The paper introduces SupMAE, a supervised extension of the Masked Autoencoder (MAE) framework for pre-training vision transformers. Standard MAE is self-supervised: it learns representations by reconstructing masked image patches, but this objective emphasizes local features and offers no explicit global feature learning. SupMAE addresses this limitation by adding a supervised classification branch, so the model learns both local and global features from golden labels (i.e., class labels). Because the classification branch operates only on the subset of visible image patches, pre-training remains efficient.
The main components of the method are:
- Framework: SupMAE adds a supervised classification branch alongside MAE's reconstruction objective. The reconstruction branch predicts the missing pixels with a lightweight decoder, encouraging local features; the classification branch classifies the encoded features of the visible patches, encouraging global features (see the pre-training sketch after this list).
- Pre-training Objectives: The model is trained with a weighted combination of the reconstruction and classification losses, balancing local and global feature learning (the loss weighting is shown below). Random masking also acts as a form of data augmentation for the classification branch, keeping training efficient while making the learned features more robust.
- Training and Fine-Tuning: During pre-training, the encoder processes only the visible 25% of image patches, and the decoder reconstructs the remaining masked ones. During fine-tuning, the encoder, which saw only visible patches during pre-training, is applied to full, uncorrupted images for downstream tasks (see the fine-tuning sketch below).
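A minimal PyTorch-style sketch of the two-branch pre-training step described above. The names `SupMAEPreTrain`, the decoder call signature, and the mean-pooling of visible tokens for classification are simplifying assumptions, not the paper's exact implementation; `encoder` and `decoder` stand in for the ViT blocks used in the paper.

```python
import torch
import torch.nn as nn

class SupMAEPreTrain(nn.Module):
    """Illustrative two-branch SupMAE pre-training module (hypothetical names)."""
    def __init__(self, encoder, decoder, embed_dim=768, num_classes=1000,
                 mask_ratio=0.75):
        super().__init__()
        self.encoder = encoder        # ViT encoder; sees visible patches only
        self.decoder = decoder        # lightweight decoder for pixel reconstruction
        self.cls_head = nn.Linear(embed_dim, num_classes)  # supervised branch
        self.mask_ratio = mask_ratio  # 75% masked -> 25% visible

    def forward(self, patches, labels):
        # patches: (B, N, patch_dim) flattened image patches
        B, N, D = patches.shape
        n_visible = int(N * (1 - self.mask_ratio))

        # Random masking: a fresh random subset each step, which also
        # acts as data augmentation for the classification branch.
        perm = torch.rand(B, N).argsort(dim=1)
        vis_idx, mask_idx = perm[:, :n_visible], perm[:, n_visible:]
        visible = torch.gather(patches, 1,
                               vis_idx.unsqueeze(-1).expand(-1, -1, D))

        # Encode only the visible patches (MAE's efficiency is preserved).
        latent = self.encoder(visible)                  # (B, n_visible, embed_dim)

        # Reconstruction branch: predict the masked pixels (local features).
        # The decoder signature here is an assumption for illustration.
        pred = self.decoder(latent, vis_idx, mask_idx)  # (B, N - n_visible, patch_dim)
        target = torch.gather(patches, 1,
                              mask_idx.unsqueeze(-1).expand(-1, -1, D))
        loss_rec = ((pred - target) ** 2).mean()

        # Classification branch: pool visible tokens and predict the golden
        # label (global features from supervision).
        logits = self.cls_head(latent.mean(dim=1))
        loss_cls = nn.functional.cross_entropy(logits, labels)
        return loss_rec, loss_cls
```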
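A usage example showing the weighted combination of the two objectives, using the sketch above. The weight `lambda_cls` and its value are assumptions; the paper treats this balance as a tunable hyperparameter.

```python
model = SupMAEPreTrain(encoder, decoder)
loss_rec, loss_cls = model(patches, labels)

# Weighted sum balancing local (reconstruction) and global (classification)
# learning; 0.1 is an illustrative value, not the paper's setting.
lambda_cls = 0.1
loss = loss_rec + lambda_cls * loss_cls
loss.backward()
```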
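And a fine-tuning sketch, again with assumed names: the pre-trained encoder now receives the full, uncorrupted set of patches, with no masking, and a fresh task head is attached for the downstream task.

```python
import torch.nn as nn

class SupMAEFineTune(nn.Module):
    """Illustrative fine-tuning wrapper around the pre-trained encoder."""
    def __init__(self, pretrained_encoder, embed_dim=768, num_classes=1000):
        super().__init__()
        self.encoder = pretrained_encoder    # weights carried over from pre-training
        self.head = nn.Linear(embed_dim, num_classes)

    def forward(self, patches):
        # All N patches are passed in; no masking at fine-tuning time.
        tokens = self.encoder(patches)       # (B, N, embed_dim)
        return self.head(tokens.mean(dim=1)) # pooled representation -> logits
```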
Overall, this hybrid approach is compute-efficient and shows superior performance on few-shot and dense prediction tasks, as well as better transferability.