Self-supervised learning (SSL) in computer vision has rapidly transformed how we think about representation learning. Instead of relying on massive amounts of labeled data, SSL methods mine supervisory signals hidden in the data itself, augmenting images, forming proxy “pretext” tasks, and learning embeddings that transfer to downstream tasks like classification, detection, and segmentation. In this blog, we’ll walk through four landmark SSL frameworks (SimCLR, MoCo, BYOL, and DINO), explaining the core intuition and showing end-to-end PyTorch code snippets.
Although the concepts may seem abstract at first, we’ll keep this as code-heavy and hands-on as possible: think of it as a lab notebook that you can clone, tweak, and reproduce.
In supervised learning, we typically train a network f_θ(x) to minimize a cross-entropy loss against ground-truth labels. In SSL, we pretend that labels don’t exist. Instead, we artificially generate pairs (x̃_i, x̃_j) from the same image x via aggressive augmentations (random crop, color jitter, Gaussian blur). The goal is to learn an encoder f_θ whose representations of x̃_i and x̃_j are ‘close’ in some embedding space, while pushing away the representations of augmentations from different images.
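To make this concrete, here is a minimal sketch of such an augmentation pipeline built with torchvision; the crop size, jitter strengths, and blur kernel below are illustrative choices of ours, not the exact settings from any one paper:

```python
from torchvision import transforms

# Sketch of an aggressive SSL augmentation pipeline.
# All magnitudes here are illustrative, not canonical settings.
ssl_augment = transforms.Compose([
    transforms.RandomResizedCrop(224, scale=(0.2, 1.0)),   # random crop + resize
    transforms.RandomHorizontalFlip(),
    transforms.RandomApply(
        [transforms.ColorJitter(0.8, 0.8, 0.8, 0.2)], p=0.8  # color jitter
    ),
    transforms.RandomGrayscale(p=0.2),
    transforms.GaussianBlur(kernel_size=23, sigma=(0.1, 2.0)),  # Gaussian blur
    transforms.ToTensor(),
])
```

Calling ssl_augment(img) twice on the same image produces two different stochastic views, which is exactly how positive pairs are built next.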
Minibatch Construction
Sample a minibatch of N images {x_k}, k = 1, …, N. From each image x_k we generate two stochastic augmentations, x̃_{2k-1} and x̃_{2k}. This yields 2N samples in total, and each original image contributes exactly one positive pair: (x̃_{2k-1}, x̃_{2k}). For a given augmented sample, the remaining 2N − 2 samples in the batch are treated as negatives.
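Assuming the ssl_augment pipeline defined earlier, a small wrapper (TwoCropsTransform is a name of our choosing) produces both views, and a standard DataLoader assembles the 2N-sample batch; CIFAR-10 here is just a stand-in dataset:

```python
import torch
from torch.utils.data import DataLoader
from torchvision.datasets import CIFAR10

class TwoCropsTransform:
    """Apply the same stochastic transform twice, yielding two views."""
    def __init__(self, base_transform):
        self.base_transform = base_transform

    def __call__(self, x):
        return self.base_transform(x), self.base_transform(x)

dataset = CIFAR10(root="./data", train=True, download=True,
                  transform=TwoCropsTransform(ssl_augment))
loader = DataLoader(dataset, batch_size=256, shuffle=True, drop_last=True)

for (view1, view2), _labels in loader:        # labels exist but are ignored
    batch = torch.cat([view1, view2], dim=0)  # shape (2N, C, H, W)
    # With this layout the two views of image k sit at rows k and k+N
    # (rather than 2k-1 and 2k); the loss implementation below uses this.
    break
```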
If we pass these through an encoder f_θ and (optionally) a projection head g_φ, we get embeddings:
z_i = g_φ(f_θ(x̃_i)) ∈ ℝ^d
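A sketch of f_θ and g_φ in PyTorch: a ResNet-50 backbone whose classifier is replaced by an identity, followed by a two-layer MLP projection head (the 128-dimensional output and the class name are our choices):

```python
import torch.nn as nn
from torchvision.models import resnet50

class SimCLRModel(nn.Module):
    """Encoder f_θ plus projection head g_φ (sketch)."""
    def __init__(self, proj_dim=128):
        super().__init__()
        backbone = resnet50(weights=None)       # train from scratch
        feat_dim = backbone.fc.in_features      # 2048 for ResNet-50
        backbone.fc = nn.Identity()             # strip the supervised classifier
        self.f = backbone                       # encoder f_θ
        self.g = nn.Sequential(                 # projection head g_φ
            nn.Linear(feat_dim, feat_dim),
            nn.ReLU(inplace=True),
            nn.Linear(feat_dim, proj_dim),
        )

    def forward(self, x):
        h = self.f(x)   # representation h, reused for downstream tasks
        z = self.g(h)   # embedding z, used only for the contrastive loss
        return z
```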
Contrastive Loss (NT-Xent Loss, as in SimCLR)
The canonical contrastive loss (NT-Xent) is:
ℓ_{i,j} = -log [ exp(sim(z_i, z_j) / τ) / sum_{k=1}^{2N} 1[k ≠ i] exp(sim(z_i, z_k) / τ) ]
where sim(u, v) = uᵀv / (‖u‖ ‖v‖) is the cosine similarity between embeddings, τ is a temperature hyperparameter, and 1[k ≠ i] is an indicator that excludes the anchor itself from the denominator. The total loss averages ℓ_{i,j} over all 2N positive pairs (i, j) in the batch.
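Below is a compact NT-Xent implementation matching the batch layout above, where the two views of image k sit at rows k and k+N; the function name and the default τ are our own choices:

```python
import torch
import torch.nn.functional as F

def nt_xent_loss(z, tau=0.5):
    """NT-Xent over 2N embeddings; rows k and k+N are positive pairs."""
    z = F.normalize(z, dim=1)        # unit norm, so dot product = cosine similarity
    n = z.shape[0] // 2
    sim = z @ z.t() / tau            # (2N, 2N) temperature-scaled similarity matrix
    # Mask the k = i self-similarity terms out of the denominator.
    self_mask = torch.eye(2 * n, dtype=torch.bool, device=z.device)
    sim = sim.masked_fill(self_mask, float("-inf"))
    # The positive of row k is row k+N, and vice versa.
    targets = torch.cat([torch.arange(n, 2 * n), torch.arange(0, n)]).to(z.device)
    # Cross-entropy with the positive's index as the target is exactly
    # -log( exp(sim_pos) / Σ_{k ≠ i} exp(sim_k) ), averaged over all 2N rows.
    return F.cross_entropy(sim, targets)
```

With the pieces above, one training step reduces to z = model(batch) followed by loss = nt_xent_loss(z).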