A Simple Framework for Contrastive Learning of Visual Representations

SimCLR (Chen et al., 2020) learns representations by maximizing agreement between differently augmented views of the same data example via a contrastive loss in the latent space.

A simple framework for contrastive learning of visual representations. Two separate data augmentation operators are sampled from the same family of augmentations ($t \sim \mathcal{T}$ and $t' \sim \mathcal{T}$) and applied to each data example to obtain two correlated views. A base encoder network $f(\cdot)$ and a projection head $g(\cdot)$ are trained to maximize agreement using a contrastive loss. After training is completed, we throw away the projection head $g(\cdot)$ and use encoder $f(\cdot)$ and representation $\boldsymbol{h}$ for downstream tasks.
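The whole pipeline can be sketched end to end. This is a minimal numpy sketch, not the paper's implementation: the real $f(\cdot)$ is a ResNet and $\mathcal{T}$ is a rich augmentation family, so the linear maps and the toy flip augmentation below are stand-ins just to show the data flow.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical stand-ins: f is the base encoder, g the projection head.
# SimCLR uses a ResNet for f; plain linear maps keep this sketch runnable.
W_f = rng.normal(scale=0.01, size=(192, 64))   # "encoder" weights
W_g = rng.normal(scale=0.01, size=(64, 32))    # "projection head" weights

def f(x):                       # base encoder: images -> representations h
    return x.reshape(x.shape[0], -1) @ W_f

def g(h):                       # projection head: h -> z (discarded after training)
    return h @ W_g

def t(x):                       # t ~ T: a toy augmentation (random horizontal flip)
    return x[:, :, ::-1] if rng.random() < 0.5 else x

x = rng.random((4, 8, 8, 3))    # a batch of images
h_i, h_j = f(t(x)), f(t(x))     # two correlated views -> representations
z_i, z_j = g(h_i), g(h_j)       # projections fed to the contrastive loss
```

After training, only `f` and the representations `h_i` would be kept for downstream tasks.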

Code

They randomly sample a minibatch of $N$ examples and apply two different augmentations to each image, resulting in $2N$ data points; each image has one positive sample and $2(N-1)$ negative samples. Their contrastive loss for a positive pair of examples $(i, j)$ is $\displaystyle \ell_{i, j}=-\log \frac{\exp \left(\operatorname{sim}\left(\boldsymbol{z}_{i}, \boldsymbol{z}_{j}\right) / \tau\right)}{\sum_{k=1}^{2 N} \mathbb{1}_{[k \neq i]} \exp \left(\operatorname{sim}\left(\boldsymbol{z}_{i}, \boldsymbol{z}_{k}\right) / \tau\right)}$ (similar to MoCo), where $\operatorname{sim}(\boldsymbol{u}, \boldsymbol{v})=\boldsymbol{u}^{\top} \boldsymbol{v} /(\|\boldsymbol{u}\|\|\boldsymbol{v}\\|)$ is cosine similarity and $\tau$ is a temperature parameter.
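The loss above can be sketched in a few lines of numpy. The pairing convention here (rows $2i$ and $2i{+}1$ of `z` are the two views of example $i$) is an assumption of this sketch, not something fixed by the formula:

```python
import numpy as np

def nt_xent_loss(z, tau=0.5):
    """NT-Xent loss over 2N projections z of shape (2N, d).

    Assumes rows 2i and 2i+1 are the two augmented views of example i.
    """
    z = z / np.linalg.norm(z, axis=1, keepdims=True)   # so z @ z.T is cosine sim(u, v)
    sim = z @ z.T / tau                                # (2N, 2N) scaled similarities
    np.fill_diagonal(sim, -np.inf)                     # enforce the 1[k != i] mask
    n = z.shape[0]
    pos = np.arange(n) ^ 1                             # positive partner: 0<->1, 2<->3, ...
    # l(i, j) = -log softmax over the 2N - 1 candidates, evaluated at the positive
    log_prob = sim[np.arange(n), pos] - np.log(np.exp(sim).sum(axis=1))
    return -log_prob.mean()
```

When the two views of each example project to identical vectors, the positive term dominates the denominator and the loss approaches zero, which matches the "maximize agreement" objective.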

SimCLR mainly shows that

  1. composition of data augmentations plays a critical role in defining effective predictive tasks.

    Linear evaluation (ImageNet top-1 accuracy) under individual or composition of data augmentations, applied only to one branch. For all columns but the last, diagonal entries correspond to single transformation, and off-diagonals correspond to composition of two transformations (applied sequentially). The last column reflects the average over the row.

  2. introducing a learnable nonlinear transformation (MLP with one hidden layer as projection head) between the representation and the contrastive loss substantially improves the quality of the learned representations.

    Linear evaluation of representations with different projection heads $g(\cdot)$ and various dimensions of $\boldsymbol{z} = g(\boldsymbol{h})$. The representation $\boldsymbol{h}$ (before projection) is 2048-dimensional here.

  3. contrastive learning benefits from larger batch sizes and more training steps compared to supervised learning.
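Point 1 above, composing two transformations sampled from the same family and applying them sequentially, can be sketched as follows. The augmentation family here (crop-and-resize, color jitter, flip implemented as plain array ops) is a toy stand-in for the paper's actual torchvision-style transforms:

```python
import numpy as np

rng = np.random.default_rng(0)

def random_crop_resize(img):
    """Crop a random half-size window, then nearest-neighbour resize back."""
    h, w, _ = img.shape
    ch, cw = h // 2, w // 2
    y = rng.integers(0, h - ch + 1)
    x = rng.integers(0, w - cw + 1)
    crop = img[y:y + ch, x:x + cw]
    yy = np.arange(h) * ch // h
    xx = np.arange(w) * cw // w
    return crop[yy][:, xx]

def color_jitter(img):
    """Random brightness/contrast shift, clipped back to [0, 1]."""
    return np.clip(img * rng.uniform(0.6, 1.4) + rng.uniform(-0.1, 0.1), 0.0, 1.0)

def horizontal_flip(img):
    return img[:, ::-1]

family = [random_crop_resize, color_jitter, horizontal_flip]

def sample_view(img):
    # Draw two operators from the family and apply them sequentially,
    # i.e. a composition of two transformations as in the ablation.
    t1, t2 = rng.choice(len(family), size=2)
    return family[t2](family[t1](img))

img = rng.random((8, 8, 3))
view_a, view_b = sample_view(img), sample_view(img)   # two correlated views
```

The ablation's finding is that such compositions (notably random crop + color distortion) define a much harder and more useful predictive task than any single transformation alone.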
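The nonlinear projection head from point 2 is just a one-hidden-layer MLP with a ReLU between the representation and the contrastive loss. A minimal numpy forward pass, using the paper's 2048 → 2048 → 128 shapes but otherwise an illustrative sketch rather than training code:

```python
import numpy as np

rng = np.random.default_rng(0)

def projection_head(h, w1, w2):
    """g(h): one-hidden-layer MLP mapping representation h to projection z."""
    return np.maximum(h @ w1, 0) @ w2    # ReLU nonlinearity between the two layers

d_h, d_hidden, d_z = 2048, 2048, 128     # dimensions used in the paper
w1 = rng.normal(scale=0.01, size=(d_h, d_hidden))
w2 = rng.normal(scale=0.01, size=(d_hidden, d_z))

h = rng.normal(size=(4, d_h))            # batch of representations from f(.)
z = projection_head(h, w1, w2)           # z is used only by the contrastive loss
```

The point of the ablation is that the loss is computed on `z`, but `h` (taken before the head) is the better representation for downstream tasks, which is why the head is discarded after training.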

Interesting behavior of projection head