PaperSummary13: MAE

Poonam Saini
1 min read · Jan 13, 2025

The paper presents Masked Autoencoders (MAEs) as a simple yet scalable self-supervised learning approach for computer vision. Inspired by masked language models in NLP, MAEs mask random patches of an image and reconstruct the missing pixels. By using a high masking ratio, the method forces the model to learn semantic features rather than interpolate from nearby pixels, yielding efficient and generalizable vision models.

The key steps are:

  1. Encoder-Decoder Architecture: The encoder operates only on the visible patches, ignoring mask tokens, which significantly reduces computation. The decoder reconstructs the image from the latent representation together with mask tokens (see the model sketch after this list).
  2. Masking Strategy: It randomly masks 75% of the image patches, creating a non-trivial self-supervised reconstruction task (see the masking sketch below).
  3. Asymmetric Design: The heavy encoder processes only the ~25% of patches that remain visible, while a lightweight decoder handles the full set of tokens, so large models spend most of their compute on a quarter of the input.
  4. Training: The model reconstructs pixel values, using mean squared error as the loss function, computed only on the masked patches (see the loss sketch below).
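
To make step 2 concrete, here is a minimal PyTorch sketch of per-sample random masking in the spirit of the paper's approach (shuffle patches with random noise, keep the first 25%). The tensor shapes and function name are illustrative assumptions, not taken from the paper:

```python
import torch

def random_masking(patches, mask_ratio=0.75):
    """Randomly mask patches per sample (MAE's 75% masking strategy).

    patches: (batch, num_patches, dim) tensor of flattened image patches.
    Returns the visible patches, a binary mask (1 = masked, 0 = visible),
    and the indices needed to restore the original patch order.
    """
    B, N, D = patches.shape
    num_keep = int(N * (1 - mask_ratio))           # e.g. keep 25% of patches

    noise = torch.rand(B, N)                       # one random score per patch
    ids_shuffle = torch.argsort(noise, dim=1)      # random permutation per sample
    ids_restore = torch.argsort(ids_shuffle, dim=1)

    ids_keep = ids_shuffle[:, :num_keep]           # first num_keep patches survive
    visible = torch.gather(patches, 1, ids_keep.unsqueeze(-1).expand(-1, -1, D))

    mask = torch.ones(B, N)                        # 1 = masked, 0 = visible
    mask[:, :num_keep] = 0
    mask = torch.gather(mask, 1, ids_restore)      # undo the shuffle
    return visible, mask, ids_restore
```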
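
The loss from step 4 is simply MSE restricted to the masked positions. A minimal sketch, assuming the same (batch, num_patches, patch_dim) layout as above:

```python
def mae_loss(pred, target, mask):
    """Mean squared error computed only on masked patches.

    pred, target: (batch, num_patches, patch_dim);
    mask: (batch, num_patches), 1 for masked patches, 0 for visible ones.
    """
    per_patch = ((pred - target) ** 2).mean(dim=-1)   # MSE per patch
    return (per_patch * mask).sum() / mask.sum()      # average over masked only
```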
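
Putting steps 1 and 3 together, the sketch below wires a toy asymmetric encoder-decoder around the random_masking and mae_loss helpers from the previous sketches. The encoder runs only on visible patches; mask tokens are appended afterwards for a much smaller decoder. All dimensions and layer counts are toy values for illustration (the paper uses full ViT blocks with positional embeddings, omitted here for brevity):

```python
import torch
import torch.nn as nn

class TinyMAE(nn.Module):
    """Toy MAE: a heavier encoder sees only visible patches; a narrow,
    shallow decoder gets mask tokens appended and predicts raw pixels."""

    def __init__(self, patch_dim=768, enc_dim=256, dec_dim=128):
        super().__init__()
        self.embed = nn.Linear(patch_dim, enc_dim)
        self.encoder = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(enc_dim, nhead=8, batch_first=True),
            num_layers=4)                              # heavy part: visible patches only
        self.enc_to_dec = nn.Linear(enc_dim, dec_dim)
        self.mask_token = nn.Parameter(torch.zeros(1, 1, dec_dim))
        self.decoder = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(dec_dim, nhead=4, batch_first=True),
            num_layers=2)                              # lightweight decoder
        self.head = nn.Linear(dec_dim, patch_dim)      # per-patch pixel predictions

    def forward(self, patches, mask_ratio=0.75):
        B, N, _ = patches.shape
        visible, mask, ids_restore = random_masking(patches, mask_ratio)

        latent = self.enc_to_dec(self.encoder(self.embed(visible)))

        # Append one mask token per masked patch, then undo the shuffle
        mask_tokens = self.mask_token.expand(B, N - latent.shape[1], -1)
        full = torch.cat([latent, mask_tokens], dim=1)
        full = torch.gather(
            full, 1, ids_restore.unsqueeze(-1).expand(-1, -1, full.shape[-1]))

        pred = self.head(self.decoder(full))
        return mae_loss(pred, patches, mask)


# Usage: a batch of 2 images, each a 14x14 grid of flattened 16x16x3 patches
loss = TinyMAE()(torch.randn(2, 196, 768))
loss.backward()
```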

MAE pre-training is efficient: it is about 3x faster and consumes less memory than methods that must encode all patches. The resulting models outperform supervised pre-training on ImageNet-1K, and they excel in downstream tasks like object detection and semantic segmentation, achieving better results than supervised and contrastive pre-training methods.

References:

He, K., Chen, X., Xie, S., Li, Y., Dollár, P., & Girshick, R. "Masked Autoencoders Are Scalable Vision Learners." CVPR 2022. arXiv:2111.06377.
