Short summary of “I-JEPA”

Poonam Saini
1 min read · Jun 4, 2024


Image-based Joint-Embedding Predictive Architecture (I-JEPA): a non-generative approach to self-supervised learning from images.

This work explores how to improve the semantic level of self-supervised representations without relying on extra prior knowledge encoded through hand-crafted image transformations (data augmentations).

Method: given a single context block, predict the representations of several target blocks sampled from the same image. A Vision Transformer (ViT) is used for the context encoder, the target encoder, and the predictor. Predictions are made in representation space, not in pixel space.
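As a rough sketch of this setup, the toy code below uses random linear maps as stand-ins for the ViT context encoder, target encoder, and predictor (the sizes, positional tokens, and pooling are illustrative assumptions, not the paper's actual architecture):

```python
import numpy as np

rng = np.random.default_rng(0)
num_patches, dim = 16, 32                 # toy sizes, not the paper's
patches = rng.normal(size=(num_patches, dim))

# Stand-ins for the ViT context encoder, target encoder, and predictor.
W_ctx = rng.normal(size=(dim, dim))
W_tgt = rng.normal(size=(dim, dim))
W_pred = rng.normal(size=(dim, dim))

context_idx = np.arange(0, 8)             # context block: first 8 patches
target_idx = np.arange(8, 12)             # one target block: next 4 patches

# Encode the context block; the target encoder sees the full image,
# and target representations are then selected from its output.
ctx_repr = patches[context_idx] @ W_ctx
tgt_repr = (patches @ W_tgt)[target_idx]

# Predict the target block's representations from the context,
# conditioned on (toy) positional tokens for the target patches.
pos = rng.normal(size=(len(target_idx), dim))
pred = (ctx_repr.mean(axis=0) + pos) @ W_pred
```

The key point the sketch illustrates is that `pred` and `tgt_repr` live in the same representation space, so the model never reconstructs pixels.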

Figure from the paper I-JEPA

I-JEPA uses abstract prediction targets, from which unnecessary pixel-level details are potentially eliminated, leading the model to learn more semantic features.

The loss is the average L2 distance between the predicted patch-level representations and the target patch-level representations.
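A minimal sketch of that loss (the paper averages the squared L2 distance over the patches of each target block; the function name and shapes here are illustrative assumptions):

```python
import numpy as np

def ijepa_loss(pred, target):
    """Average (squared) L2 distance between predicted and target
    patch-level representations.

    pred, target: arrays of shape (num_patches, dim).
    """
    # Squared L2 distance per patch, then the mean over patches.
    return np.mean(np.sum((pred - target) ** 2, axis=-1))
```

With multiple target blocks, this per-block loss would simply be averaged over the blocks as well.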

The parameters of the predictor and context encoder are learned through gradient-based optimization, while the parameters of the target encoder are updated via an exponential moving average of the context-encoder parameters.
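The exponential moving average (EMA) update for the target encoder can be sketched as follows (the paper actually anneals the momentum over training, starting from 0.996; the fixed value and parameter-list representation here are simplifying assumptions):

```python
import numpy as np

def ema_update(target_params, context_params, momentum=0.996):
    """One EMA step: target <- m * target + (1 - m) * context.

    Only the context encoder and predictor receive gradients;
    the target encoder is updated by this rule instead.
    """
    return [momentum * t + (1.0 - momentum) * c
            for t, c in zip(target_params, context_params)]
```

For example, with `momentum=0.9`, a target parameter of 1.0 and a context parameter of 0.0 yield an updated target parameter of 0.9.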