Self-supervised learning (SSL) in computer vision has rapidly transformed how we think about representation learning. Instead of relying on massive amounts of labeled data, SSL methods mine supervisory signals hidden in the data itself, augmenting images, forming proxy “pretext” tasks, and learning embeddings that transfer to downstream tasks like classification, detection, and segmentation. In this blog, we’ll walk through four landmark SSL frameworks (SimCLR, MoCo, BYOL, and DINO), explaining the core intuition and showing end-to-end PyTorch code snippets.
Although the concepts may seem abstract at first, we’ll keep this as code-heavy and hands-on as possible: think of it as a lab notebook that you can clone, tweak, and reproduce.
In supervised learning, we typically train a network f_θ(x) to minimize a cross-entropy loss against ground-truth labels. In SSL, we pretend that labels don’t exist. Instead, we artificially generate pairs (x̃_i, x̃_j) from the same image x via aggressive augmentations (random crop, color jitter, Gaussian blur). The goal is to learn an encoder f_θ whose representations of x̃_i and x̃_j are ‘close’ in some embedding space, while pushing away the representations of augmentations from different images.
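To make this concrete, here is a minimal sketch of such an augmentation pipeline built with torchvision; the crop size, jitter strengths, and blur kernel below are illustrative choices of ours, not the exact settings from any one paper:

```python
from torchvision import transforms

# Sketch of an aggressive SSL augmentation pipeline.
# All magnitudes here are illustrative, not canonical settings.
ssl_augment = transforms.Compose([
    transforms.RandomResizedCrop(224, scale=(0.2, 1.0)),   # random crop + resize
    transforms.RandomHorizontalFlip(),
    transforms.RandomApply(
        [transforms.ColorJitter(0.8, 0.8, 0.8, 0.2)], p=0.8  # color jitter
    ),
    transforms.RandomGrayscale(p=0.2),
    transforms.GaussianBlur(kernel_size=23, sigma=(0.1, 2.0)),  # Gaussian blur
    transforms.ToTensor(),
])
```

Calling ssl_augment(img) twice on the same image produces two different stochastic views, which is exactly how positive pairs are built next.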
Minibatch Construction
Sample a minibatch of N images {x_k}, k = 1, …, N. From each image x_k we generate two stochastic augmentations, x̃_{2k-1} and x̃_{2k}. This yields 2N samples in total, and each original image contributes exactly one positive pair: (x̃_{2k-1}, x̃_{2k}). For a given augmented sample, the remaining 2N − 2 samples in the batch are treated as negatives.
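Assuming the ssl_augment pipeline defined earlier, a small wrapper (TwoCropsTransform is a name of our choosing) produces both views, and a standard DataLoader assembles the 2N-sample batch; CIFAR-10 here is just a stand-in dataset:

```python
import torch
from torch.utils.data import DataLoader
from torchvision.datasets import CIFAR10

class TwoCropsTransform:
    """Apply the same stochastic transform twice, yielding two views."""
    def __init__(self, base_transform):
        self.base_transform = base_transform

    def __call__(self, x):
        return self.base_transform(x), self.base_transform(x)

dataset = CIFAR10(root="./data", train=True, download=True,
                  transform=TwoCropsTransform(ssl_augment))
loader = DataLoader(dataset, batch_size=256, shuffle=True, drop_last=True)

for (view1, view2), _labels in loader:        # labels exist but are ignored
    batch = torch.cat([view1, view2], dim=0)  # shape (2N, C, H, W)
    # With this layout the two views of image k sit at rows k and k+N
    # (rather than 2k-1 and 2k); the loss implementation below uses this.
    break
```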
If we pass these through an encoder f_θ and (optionally) a projection head g_φ, we get embeddings:
z_i = g_φ(f_θ(x̃_i)) ∈ ℝ^d
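A sketch of f_θ and g_φ in PyTorch: a ResNet-50 backbone whose classifier is replaced by an identity, followed by a two-layer MLP projection head (the 128-dimensional output and the class name are our choices):

```python
import torch.nn as nn
from torchvision.models import resnet50

class SimCLRModel(nn.Module):
    """Encoder f_θ plus projection head g_φ (sketch)."""
    def __init__(self, proj_dim=128):
        super().__init__()
        backbone = resnet50(weights=None)       # train from scratch
        feat_dim = backbone.fc.in_features      # 2048 for ResNet-50
        backbone.fc = nn.Identity()             # strip the supervised classifier
        self.f = backbone                       # encoder f_θ
        self.g = nn.Sequential(                 # projection head g_φ
            nn.Linear(feat_dim, feat_dim),
            nn.ReLU(inplace=True),
            nn.Linear(feat_dim, proj_dim),
        )

    def forward(self, x):
        h = self.f(x)   # representation h, reused for downstream tasks
        z = self.g(h)   # embedding z, used only for the contrastive loss
        return z
```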
Contrastive Loss (NT-Xent Loss, as in SimCLR)
The canonical contrastive loss (NT-Xent) is:
ℓ_{i,j} = -log [ exp(sim(z_i, z_j) / τ) / sum_{k=1}^{2N} 1[k ≠ i] exp(sim(z_i, z_k) / τ) ]
where sim(u, v) = uᵀv / (‖u‖ ‖v‖) is the cosine similarity between embeddings, τ is a temperature hyperparameter, and 1[k ≠ i] is an indicator that excludes the anchor itself from the denominator. The total loss averages ℓ_{i,j} over all 2N positive pairs (i, j) in the batch.
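Below is a compact NT-Xent implementation matching the batch layout above, where the two views of image k sit at rows k and k+N; the function name and the default τ are our own choices:

```python
import torch
import torch.nn.functional as F

def nt_xent_loss(z, tau=0.5):
    """NT-Xent over 2N embeddings; rows k and k+N are positive pairs."""
    z = F.normalize(z, dim=1)        # unit norm, so dot product = cosine similarity
    n = z.shape[0] // 2
    sim = z @ z.t() / tau            # (2N, 2N) temperature-scaled similarity matrix
    # Mask the k = i self-similarity terms out of the denominator.
    self_mask = torch.eye(2 * n, dtype=torch.bool, device=z.device)
    sim = sim.masked_fill(self_mask, float("-inf"))
    # The positive of row k is row k+N, and vice versa.
    targets = torch.cat([torch.arange(n, 2 * n), torch.arange(0, n)]).to(z.device)
    # Cross-entropy with the positive's index as the target is exactly
    # -log( exp(sim_pos) / Σ_{k ≠ i} exp(sim_k) ), averaged over all 2N rows.
    return F.cross_entropy(sim, targets)
```

With the pieces above, one training step reduces to z = model(batch) followed by loss = nt_xent_loss(z).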