With self-supervised learning, we can train neural networks without the need for manually labelled datasets.
Basics
We define a pretext task based on the data itself. It requires no manual annotation: the labels/targets are generated automatically from the data. By training an encoder on this task, we obtain a learned representation.
These learned representations can then be reused for a downstream task, either by freezing the encoder and training a task-specific head, or by fine-tuning the encoder together with the downstream model.
Common pretext tasks include image completion, rotation prediction, jigsaw puzzle solving, colorization, contrastive learning, and masked image modeling.
Pretext tasks target “visual common sense”: solving them forces the model to learn good features, and their labels can be generated automatically from the data itself.
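As a concrete illustration of automatic label generation, here is a minimal sketch (PyTorch; the function name is illustrative) of the rotation-prediction pretext task: each image is rotated by 0°, 90°, 180°, or 270°, and the rotation index is used as the label.

```python
import torch

def rotation_pretext_batch(images: torch.Tensor):
    """Create a rotation-prediction pretext batch from unlabelled images.

    images: (B, C, H, W) tensor. Returns rotated images and the rotation
    index (0..3) as automatically generated labels.
    """
    rotated, labels = [], []
    for k in range(4):  # 0°, 90°, 180°, 270°
        rotated.append(torch.rot90(images, k, dims=(2, 3)))
        labels.append(torch.full((images.size(0),), k, dtype=torch.long))
    return torch.cat(rotated), torch.cat(labels)

# Usage sketch: an encoder plus a 4-way classification head is trained with
# cross-entropy on (x_rot, y_rot); no manual annotation is needed.
# x_rot, y_rot = rotation_pretext_batch(unlabelled_batch)
```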
Evaluation
- Pretext Task Performance
- Measure how well the model performs on the pretext task it was trained on (which required no manual labels).
- Representation Quality
- Evaluate the quality of the learned representations:
- Linear Evaluation Protocol: Train a linear classifier on the learned representations.
- Clustering: Measure clustering performance.
- t-SNE: Visualize the representations to assess their separability (a sketch is given after this list).
- Robustness and Generalization
- Test how well the model generalizes to different datasets and how robust it is to input variations.
- Computational Efficiency
- Assess the efficiency of the method in terms of training time and resource requirements.
- Transfer Learning and Downstream Task Performance
- Assess the utility of the learned representations by transferring them to a downstream supervised task.
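As one example of the representation-quality checks above, here is a minimal t-SNE sketch using scikit-learn; `encoder` and a labelled `loader` (labels are used only to colour the plot, not for training) are assumed to exist.

```python
import numpy as np
import torch
from sklearn.manifold import TSNE
import matplotlib.pyplot as plt

# Assumed to exist: a pre-trained `encoder` and a labelled `loader`.
feats, labels = [], []
with torch.no_grad():
    for x, y in loader:
        feats.append(encoder(x).flatten(1).cpu().numpy())
        labels.append(y.numpy())
feats = np.concatenate(feats)
labels = np.concatenate(labels)

# Project the representations to 2-D and inspect class separability.
emb = TSNE(n_components=2, init="pca", perplexity=30).fit_transform(feats)
plt.scatter(emb[:, 0], emb[:, 1], c=labels, s=3, cmap="tab10")
plt.title("t-SNE of learned representations")
plt.show()
```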
Masked Autoencoders (MAE)
We divide the input image into non-overlapping patches, then uniformly sample a large proportion of them (e.g., 75%) and mask those out.
Masking at such a high ratio removes redundancy and makes the task challenging and meaningful: the missing content cannot be recovered by simple extrapolation from neighbouring visible patches.
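A minimal sketch of this uniform random masking (per-sample random shuffling of patch indices; shapes and names follow common MAE implementations but are illustrative):

```python
import torch

def random_masking(patches: torch.Tensor, mask_ratio: float = 0.75):
    """Uniformly sample which patches to keep.

    patches: (B, N, D) patch embeddings. Returns the kept (visible) patches,
    a binary mask (1 = masked), and the ids needed to restore patch order.
    """
    B, N, D = patches.shape
    n_keep = int(N * (1 - mask_ratio))

    noise = torch.rand(B, N)                         # one random score per patch
    ids_shuffle = torch.argsort(noise, dim=1)        # random permutation
    ids_restore = torch.argsort(ids_shuffle, dim=1)  # inverse permutation

    ids_keep = ids_shuffle[:, :n_keep]
    visible = torch.gather(patches, 1, ids_keep.unsqueeze(-1).expand(-1, -1, D))

    mask = torch.ones(B, N)
    mask[:, :n_keep] = 0
    mask = torch.gather(mask, 1, ids_restore)        # 1 where a patch is masked
    return visible, mask, ids_restore
```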
The MAE encoder operates only on the unmasked (visible) patches. Each patch is embedded by a linear projection, positional embeddings are added, and the result is processed by a stack of Transformer blocks.
The MAE decoder receives the full token set: the encoder outputs together with a shared, learnable mask token inserted at every previously masked position, with positional embeddings added to all tokens. It consists of Transformer blocks followed by a linear projection that outputs the reconstructed pixel values for each patch.
Since the decoder is used only for reconstruction during pre-training, its design is independent of the encoder, so the decoder architecture can be chosen flexibly (and is typically much smaller than the encoder).
We compute the loss only for the masked patches, using the mean squared error (MSE) in pixel space between the reconstructed and the original patches.
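A sketch of this masked reconstruction loss, assuming `pred` and `target` hold per-patch pixel values and `mask` marks masked patches with 1 (as produced by the masking sketch above):

```python
import torch

def mae_loss(pred: torch.Tensor, target: torch.Tensor, mask: torch.Tensor):
    """MSE in pixel space, averaged over masked patches only.

    pred, target: (B, N, P) per-patch pixel values; mask: (B, N), 1 = masked.
    """
    loss = (pred - target) ** 2                 # squared error per pixel
    loss = loss.mean(dim=-1)                    # (B, N): mean over pixels in a patch
    return (loss * mask).sum() / mask.sum()     # average over masked patches only
```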
Linear Probing and Full Fine-tuning
In linear probing, the pre-trained model is fixed, and only one linear layer is added at the end, to predict the labels (or produce the output). This method is used to assess the quality of representations from a pre-trained feature extraction model.
In fine-tuning, the pre-trained model is further trained (not frozen), and one or more layers, possibly with non-linearities, are added on top.
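The practical difference is simply whether the encoder's parameters receive gradients. A minimal PyTorch sketch, assuming a pre-trained `encoder` that outputs flat feature vectors of size `feat_dim`:

```python
import torch.nn as nn

# Linear probing: freeze the encoder, train only a single linear head.
for p in encoder.parameters():
    p.requires_grad = False
probe = nn.Sequential(encoder, nn.Linear(feat_dim, num_classes))

# Fine-tuning: keep the encoder trainable and add a (possibly non-linear) head.
for p in encoder.parameters():
    p.requires_grad = True
head = nn.Sequential(nn.Linear(feat_dim, 512), nn.ReLU(), nn.Linear(512, num_classes))
finetune_model = nn.Sequential(encoder, head)
```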
Contrastive Representation Learning
In contrastive representation learning, the transformed view and the original image are marked as a positive pair, while (views of) other images are marked as negatives.
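In practice, the positive pair is usually produced by data augmentation, either the original image paired with an augmented view, or (as in SimCLR) two augmented views of the same image. A sketch with torchvision transforms (the exact augmentations vary by method):

```python
import torchvision.transforms as T

# Random augmentations; two views of the same image form a positive pair,
# while views of different images serve as negatives.
augment = T.Compose([
    T.RandomResizedCrop(224),
    T.RandomHorizontalFlip(),
    T.ColorJitter(0.4, 0.4, 0.4, 0.1),
    T.RandomGrayscale(p=0.2),
    T.ToTensor(),
])

def make_positive_pair(image):
    return augment(image), augment(image)
```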
We want a score function $s(\cdot,\cdot)$ such that

$$
s\big(f(x),\, f(x^{+})\big) \;\gg\; s\big(f(x),\, f(x^{-})\big).
$$

Given a chosen score function, we aim to learn an encoder $f$ that yields high scores for positive pairs $(x, x^{+})$ and low scores for negative pairs $(x, x^{-})$.
The loss function, given 1 positive sample and $N - 1$ negative samples, is

$$
\mathcal{L} = -\,\mathbb{E}\left[\log \frac{\exp\big(s(f(x),\, f(x^{+}))\big)}{\exp\big(s(f(x),\, f(x^{+}))\big) + \sum_{j=1}^{N-1} \exp\big(s(f(x),\, f(x_{j}^{-}))\big)}\right].
$$

This is commonly known as the InfoNCE loss. Minimizing it maximizes a lower bound on the mutual information between f(x) and f(x+):

$$
\mathrm{MI}\big(f(x);\, f(x^{+})\big) \;\ge\; \log(N) - \mathcal{L}.
$$
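A minimal sketch of the InfoNCE loss with cosine similarity as the score function, SimCLR-style, where the other samples in the batch serve as the N − 1 negatives (names are illustrative):

```python
import torch
import torch.nn.functional as F

def info_nce(z: torch.Tensor, z_pos: torch.Tensor, temperature: float = 0.1):
    """InfoNCE loss for a batch of positive pairs.

    z, z_pos: (B, D) embeddings of two views of the same images.
    For each anchor z[i], z_pos[i] is the positive; all other z_pos[j]
    in the batch act as the N - 1 negatives.
    """
    z = F.normalize(z, dim=1)
    z_pos = F.normalize(z_pos, dim=1)
    logits = z @ z_pos.t() / temperature                  # cosine-similarity scores
    targets = torch.arange(z.size(0), device=z.device)    # positives on the diagonal
    return F.cross_entropy(logits, targets)
```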
Typical Contrastive Learning models include SimCLR, MoCo, and DINO.