PaperSummary13: MAE

Poonam Saini
1 min read · Jan 13, 2025

The paper presents Masked Autoencoders (MAEs) as a simple yet scalable self-supervised learning approach for computer vision. Inspired by masked language models in NLP, MAEs mask random patches of an image and reconstruct the missing pixels. By using a high masking ratio, the method forces the model to learn semantic features rather than interpolate from nearby pixels, yielding efficient and generalizable vision models.

The key steps are:

  1. Encoder-Decoder Architecture: The encoder operates only on the visible patches, ignoring mask tokens, which significantly reduces computation. The decoder reconstructs the image from the latent representation together with mask tokens (see the model sketch after this list).
  2. Masking Strategy: It randomly masks 75% of the image patches, creating a non-trivial self-supervised reconstruction task (see the masking sketch below).
  3. Asymmetric Design: The heavy encoder processes only the ~25% of patches that remain visible, while a lightweight decoder handles the full set of tokens, so large models spend most of their compute on a quarter of the input.
  4. Training: The model reconstructs pixel values, using mean squared error as the loss function, computed only on the masked patches (see the loss sketch below).
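
To make step 2 concrete, here is a minimal PyTorch sketch of per-sample random masking in the spirit of the paper's approach (shuffle patches with random noise, keep the first 25%). The tensor shapes and function name are illustrative assumptions, not taken from the paper:

```python
import torch

def random_masking(patches, mask_ratio=0.75):
    """Randomly mask patches per sample (MAE's 75% masking strategy).

    patches: (batch, num_patches, dim) tensor of flattened image patches.
    Returns the visible patches, a binary mask (1 = masked, 0 = visible),
    and the indices needed to restore the original patch order.
    """
    B, N, D = patches.shape
    num_keep = int(N * (1 - mask_ratio))           # e.g. keep 25% of patches

    noise = torch.rand(B, N)                       # one random score per patch
    ids_shuffle = torch.argsort(noise, dim=1)      # random permutation per sample
    ids_restore = torch.argsort(ids_shuffle, dim=1)

    ids_keep = ids_shuffle[:, :num_keep]           # first num_keep patches survive
    visible = torch.gather(patches, 1, ids_keep.unsqueeze(-1).expand(-1, -1, D))

    mask = torch.ones(B, N)                        # 1 = masked, 0 = visible
    mask[:, :num_keep] = 0
    mask = torch.gather(mask, 1, ids_restore)      # undo the shuffle
    return visible, mask, ids_restore
```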
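
The loss from step 4 is simply MSE restricted to the masked positions. A minimal sketch, assuming the same (batch, num_patches, patch_dim) layout as above:

```python
def mae_loss(pred, target, mask):
    """Mean squared error computed only on masked patches.

    pred, target: (batch, num_patches, patch_dim);
    mask: (batch, num_patches), 1 for masked patches, 0 for visible ones.
    """
    per_patch = ((pred - target) ** 2).mean(dim=-1)   # MSE per patch
    return (per_patch * mask).sum() / mask.sum()      # average over masked only
```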
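
Putting steps 1 and 3 together, the sketch below wires a toy asymmetric encoder-decoder around the random_masking and mae_loss helpers from the previous sketches. The encoder runs only on visible patches; mask tokens are appended afterwards for a much smaller decoder. All dimensions and layer counts are toy values for illustration (the paper uses full ViT blocks with positional embeddings, omitted here for brevity):

```python
import torch
import torch.nn as nn

class TinyMAE(nn.Module):
    """Toy MAE: a heavier encoder sees only visible patches; a narrow,
    shallow decoder gets mask tokens appended and predicts raw pixels."""

    def __init__(self, patch_dim=768, enc_dim=256, dec_dim=128):
        super().__init__()
        self.embed = nn.Linear(patch_dim, enc_dim)
        self.encoder = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(enc_dim, nhead=8, batch_first=True),
            num_layers=4)                              # heavy part: visible patches only
        self.enc_to_dec = nn.Linear(enc_dim, dec_dim)
        self.mask_token = nn.Parameter(torch.zeros(1, 1, dec_dim))
        self.decoder = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(dec_dim, nhead=4, batch_first=True),
            num_layers=2)                              # lightweight decoder
        self.head = nn.Linear(dec_dim, patch_dim)      # per-patch pixel predictions

    def forward(self, patches, mask_ratio=0.75):
        B, N, _ = patches.shape
        visible, mask, ids_restore = random_masking(patches, mask_ratio)

        latent = self.enc_to_dec(self.encoder(self.embed(visible)))

        # Append one mask token per masked patch, then undo the shuffle
        mask_tokens = self.mask_token.expand(B, N - latent.shape[1], -1)
        full = torch.cat([latent, mask_tokens], dim=1)
        full = torch.gather(
            full, 1, ids_restore.unsqueeze(-1).expand(-1, -1, full.shape[-1]))

        pred = self.head(self.decoder(full))
        return mae_loss(pred, patches, mask)


# Usage: a batch of 2 images, each a 14x14 grid of flattened 16x16x3 patches
loss = TinyMAE()(torch.randn(2, 196, 768))
loss.backward()
```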

MAE pre-training is efficient: it is about 3x faster and consumes less memory than methods that must encode all patches. The resulting models outperform supervised pre-training on ImageNet-1K, and they excel in downstream tasks like object detection and semantic segmentation, achieving better results than supervised and contrastive pre-training methods.

References:

He, K., Chen, X., Xie, S., Li, Y., Dollár, P., & Girshick, R. "Masked Autoencoders Are Scalable Vision Learners." CVPR 2022. arXiv:2111.06377.
