A Simple Framework for Contrastive Learning of Visual Representations

SimCLR (Chen et al., 2020) learns representations by maximizing agreement between differently augmented views of the same data example via a contrastive loss in the latent space.

A simple framework for contrastive learning of visual representations. Two separate data augmentation operators are sampled from the same family of augmentations ($t \sim \mathcal{T}$ and $t' \sim \mathcal{T}$) and applied to each data example to obtain two correlated views. A base encoder network $f(\cdot)$ and a projection head $g(\cdot)$ are trained to maximize agreement using a contrastive loss. After training is completed, we throw away the projection head $g(\cdot)$ and use encoder $f(\cdot)$ and representation $\boldsymbol{h}$ for downstream tasks.
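The whole pipeline can be sketched end to end. This is a minimal numpy sketch, not the paper's implementation: the real $f(\cdot)$ is a ResNet and $\mathcal{T}$ is a rich augmentation family, so the linear maps and the toy flip augmentation below are stand-ins just to show the data flow.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical stand-ins: f is the base encoder, g the projection head.
# SimCLR uses a ResNet for f; plain linear maps keep this sketch runnable.
W_f = rng.normal(scale=0.01, size=(192, 64))   # "encoder" weights
W_g = rng.normal(scale=0.01, size=(64, 32))    # "projection head" weights

def f(x):                       # base encoder: images -> representations h
    return x.reshape(x.shape[0], -1) @ W_f

def g(h):                       # projection head: h -> z (discarded after training)
    return h @ W_g

def t(x):                       # t ~ T: a toy augmentation (random horizontal flip)
    return x[:, :, ::-1] if rng.random() < 0.5 else x

x = rng.random((4, 8, 8, 3))    # a batch of images
h_i, h_j = f(t(x)), f(t(x))     # two correlated views -> representations
z_i, z_j = g(h_i), g(h_j)       # projections fed to the contrastive loss
```

After training, only `f` and the representations `h_i` would be kept for downstream tasks.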

Code

They randomly sample a minibatch of $N$ examples and apply two different augmentations to each image, resulting in $2N$ data points; each image has one positive sample and $2(N-1)$ negative samples. Their contrastive loss for a positive pair of examples $(i, j)$ is $\displaystyle \ell_{i, j}=-\log \frac{\exp \left(\operatorname{sim}\left(\boldsymbol{z}_{i}, \boldsymbol{z}_{j}\right) / \tau\right)}{\sum_{k=1}^{2 N} \mathbb{1}_{[k \neq i]} \exp \left(\operatorname{sim}\left(\boldsymbol{z}_{i}, \boldsymbol{z}_{k}\right) / \tau\right)}$ (similar to MoCo), where $\operatorname{sim}(\boldsymbol{u}, \boldsymbol{v})=\boldsymbol{u}^{\top} \boldsymbol{v} /(\|\boldsymbol{u}\|\|\boldsymbol{v}\\|)$ is cosine similarity and $\tau$ is a temperature parameter.
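The loss above can be sketched in a few lines of numpy. The pairing convention here (rows $2i$ and $2i{+}1$ of `z` are the two views of example $i$) is an assumption of this sketch, not something fixed by the formula:

```python
import numpy as np

def nt_xent_loss(z, tau=0.5):
    """NT-Xent loss over 2N projections z of shape (2N, d).

    Assumes rows 2i and 2i+1 are the two augmented views of example i.
    """
    z = z / np.linalg.norm(z, axis=1, keepdims=True)   # so z @ z.T is cosine sim(u, v)
    sim = z @ z.T / tau                                # (2N, 2N) scaled similarities
    np.fill_diagonal(sim, -np.inf)                     # enforce the 1[k != i] mask
    n = z.shape[0]
    pos = np.arange(n) ^ 1                             # positive partner: 0<->1, 2<->3, ...
    # l(i, j) = -log softmax over the 2N - 1 candidates, evaluated at the positive
    log_prob = sim[np.arange(n), pos] - np.log(np.exp(sim).sum(axis=1))
    return -log_prob.mean()
```

When the two views of each example project to identical vectors, the positive term dominates the denominator and the loss approaches zero, which matches the "maximize agreement" objective.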

SimCLR mainly shows that

  1. composition of data augmentations plays a critical role in defining effective predictive tasks.

    Linear evaluation (ImageNet top-1 accuracy) under individual or composition of data augmentations, applied only to one branch. For all columns but the last, diagonal entries correspond to single transformation, and off-diagonals correspond to composition of two transformations (applied sequentially). The last column reflects the average over the row.

  2. introducing a learnable nonlinear transformation (MLP with one hidden layer as projection head) between the representation and the contrastive loss substantially improves the quality of the learned representations.

    Linear evaluation of representations with different projection heads $g(\cdot)$ and various dimensions of $\boldsymbol{z} = g(\boldsymbol{h})$. The representation $\boldsymbol{h}$ (before projection) is 2048-dimensional here.

  3. contrastive learning benefits from larger batch sizes and more training steps compared to supervised learning.
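Point 1 above, composing two transformations sampled from the same family and applying them sequentially, can be sketched as follows. The augmentation family here (crop-and-resize, color jitter, flip implemented as plain array ops) is a toy stand-in for the paper's actual torchvision-style transforms:

```python
import numpy as np

rng = np.random.default_rng(0)

def random_crop_resize(img):
    """Crop a random half-size window, then nearest-neighbour resize back."""
    h, w, _ = img.shape
    ch, cw = h // 2, w // 2
    y = rng.integers(0, h - ch + 1)
    x = rng.integers(0, w - cw + 1)
    crop = img[y:y + ch, x:x + cw]
    yy = np.arange(h) * ch // h
    xx = np.arange(w) * cw // w
    return crop[yy][:, xx]

def color_jitter(img):
    """Random brightness/contrast shift, clipped back to [0, 1]."""
    return np.clip(img * rng.uniform(0.6, 1.4) + rng.uniform(-0.1, 0.1), 0.0, 1.0)

def horizontal_flip(img):
    return img[:, ::-1]

family = [random_crop_resize, color_jitter, horizontal_flip]

def sample_view(img):
    # Draw two operators from the family and apply them sequentially,
    # i.e. a composition of two transformations as in the ablation.
    t1, t2 = rng.choice(len(family), size=2)
    return family[t2](family[t1](img))

img = rng.random((8, 8, 3))
view_a, view_b = sample_view(img), sample_view(img)   # two correlated views
```

The ablation's finding is that such compositions (notably random crop + color distortion) define a much harder and more useful predictive task than any single transformation alone.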
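The nonlinear projection head from point 2 is just a one-hidden-layer MLP with a ReLU between the representation and the contrastive loss. A minimal numpy forward pass, using the paper's 2048 → 2048 → 128 shapes but otherwise an illustrative sketch rather than training code:

```python
import numpy as np

rng = np.random.default_rng(0)

def projection_head(h, w1, w2):
    """g(h): one-hidden-layer MLP mapping representation h to projection z."""
    return np.maximum(h @ w1, 0) @ w2    # ReLU nonlinearity between the two layers

d_h, d_hidden, d_z = 2048, 2048, 128     # dimensions used in the paper
w1 = rng.normal(scale=0.01, size=(d_h, d_hidden))
w2 = rng.normal(scale=0.01, size=(d_hidden, d_z))

h = rng.normal(size=(4, d_h))            # batch of representations from f(.)
z = projection_head(h, w1, w2)           # z is used only by the contrastive loss
```

The point of the ablation is that the loss is computed on `z`, but `h` (taken before the head) is the better representation for downstream tasks, which is why the head is discarded after training.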

Interesting behavior of projection head