Unsupervised Learning of Visual Features by Contrasting Cluster Assignments



SwAV (Caron et al., 2020) proposes a "swapped" prediction mechanism that predicts the cluster assignment of one view from the representation of another view. The method is based on a relaxed version of instance discrimination, namely clustering-based discrimination: instead of pushing every instance apart from every other, instances are first clustered together and the clusters are pushed apart. The authors also propose a new data augmentation strategy, multi-crop, which uses a mix of views with different resolutions in place of two full-resolution views, without increasing the memory or compute requirements. They argue that their method is more memory efficient and can scale to unlimited amounts of data.
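Multi-crop can be sketched in a few lines. Below is a minimal NumPy illustration (not the paper's pipeline): it samples a few "global" crops at full resolution plus several cheaper low-resolution crops, with nearest-neighbor resizing standing in for a real image-resize; the crop sizes and counts are illustrative only.

```python
import numpy as np

def random_resized_crop(img, crop_size, out_size, rng):
    """Take a random square crop of side `crop_size` and resize it to
    `out_size` x `out_size` with nearest-neighbor sampling."""
    h, w, _ = img.shape
    top = rng.integers(0, h - crop_size + 1)
    left = rng.integers(0, w - crop_size + 1)
    crop = img[top:top + crop_size, left:left + crop_size]
    idx = (np.arange(out_size) * crop_size / out_size).astype(int)
    return crop[np.ix_(idx, idx)]  # index rows and columns, keep channels

def multi_crop(img, rng, n_global=2, n_local=4):
    """SwAV-style multi-crop: a few full-resolution views plus several
    cheaper low-resolution views of the same image (sizes are illustrative)."""
    views = [random_resized_crop(img, 160, 224, rng) for _ in range(n_global)]
    views += [random_resized_crop(img, 64, 96, rng) for _ in range(n_local)]
    return views

rng = np.random.default_rng(0)
img = rng.random((256, 256, 3))      # toy image
views = multi_crop(img, rng)         # 2 views at 224x224, 4 views at 96x96
```

The low-resolution views are what keep the extra views nearly free: the encoder cost scales with pixel count, so four 96x96 crops cost far less than two additional 224x224 ones.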

"Swapped" prediction mechanism

Forward pass: each image $\mathbf{x}_n$ is transformed into an augmented view $\mathbf{x}_{nt}$ by applying a transformation $t$ sampled from the set $\mathcal{T}$ of image transformations. The augmented view is mapped to a vector representation by an encoder network $f_\theta$, and the feature is then normalized, i.e. $\mathbf{z}_{nt}=f_{\theta}\left(\mathbf{x}_{nt}\right) /\left\|f_{\theta}\left(\mathbf{x}_{nt}\right)\right\|_{2}$. SwAV then computes a code $\mathbf{q}_{nt}$ from this feature by mapping $\mathbf{z}_{nt}$ to a set of $K$ trainable prototype vectors, $\{\mathbf{c}_1,\ldots,\mathbf{c}_K\}$, collected as the columns of a matrix $\mathbf{C}$.
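The forward pass up to the prototype similarities can be sketched as follows. This is a toy NumPy version: a fixed random linear map stands in for the encoder $f_\theta$ (in SwAV it is a ResNet plus projection head), and the dimensions are made up for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
D, K = 128, 10                       # feature dimension, number of prototypes

# Stand-in for the encoder f_theta: a fixed random linear map over raw pixels.
W = rng.normal(size=(D, 3 * 32 * 32))
C = rng.normal(size=(D, K))          # prototype vectors c_1..c_K as columns

def encode(x):
    """Map an augmented view x_{nt} to an L2-normalized feature z_{nt}."""
    feat = W @ x.ravel()
    return feat / np.linalg.norm(feat)

x_nt = rng.random((3, 32, 32))       # one augmented view of image x_n
z_nt = encode(x_nt)
scores = z_nt @ C                    # similarities z_{nt}^T c_k, one per prototype
```

The vector `scores` is the input both to the softmax prediction $\mathbf{p}_t$ and, across the batch, to the online code computation described below.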

Swapped prediction

SwAV defines the contrastive loss as $L(\mathbf{z}_{t}, \mathbf{z}_{s})=\ell(\mathbf{z}_{t}, \mathbf{q}_{s})+\ell(\mathbf{z}_{s}, \mathbf{q}_{t})$, where the function $\ell(\mathbf{z},\mathbf{q})$ measures the fit between a feature $\mathbf{z}$ and a code $\mathbf{q}$, formally defined as

$$ \ell\left(\mathbf{z}_{t}, \mathbf{q}_{s}\right)=-\sum_{k} \mathbf{q}_{s}^{(k)} \log \mathbf{p}_{t}^{(k)}, \quad \text { where } \quad \mathbf{p}_{t}^{(k)}=\frac{\exp \left(\frac{1}{\tau} \mathbf{z}_{t}^{\top} \mathbf{c}_{k}\right)}{\sum_{k^{\prime}} \exp \left(\frac{1}{\tau} \mathbf{z}_{t}^{\top} \mathbf{c}_{k^{\prime}}\right)}. $$

where $\tau$ is a temperature parameter. Taking this loss over all images and all pairs of data augmentations leads to the following loss function for the swapped prediction problem:

$$ -\frac{1}{N} \sum_{n=1}^{N} \sum_{s, t \sim \mathcal{T}}\left[\frac{1}{\tau} \mathbf{z}_{nt}^{\top} \mathbf{C} \mathbf{q}_{ns}+\frac{1}{\tau} \mathbf{z}_{ns}^{\top} \mathbf{C} \mathbf{q}_{nt}-\log \sum_{k=1}^{K} \exp \left(\frac{\mathbf{z}_{nt}^{\top} \mathbf{c}_{k}}{\tau}\right)-\log \sum_{k=1}^{K} \exp \left(\frac{\mathbf{z}_{ns}^{\top} \mathbf{c}_{k}}{\tau}\right) \right] $$

Intuitively, the method compares the features $\mathbf{z}_t$ and $\mathbf{z}_s$ using intermediate codes $\mathbf{q}_t$ and $\mathbf{q}_s$.
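The swapped loss for a single pair of views is straightforward to write down. Here is a small NumPy sketch; for simplicity the toy codes are plain softmax assignments (in SwAV they come from the online code computation of the next section), and all names and dimensions are illustrative.

```python
import numpy as np

def softmax(u):
    """Numerically stable softmax over a 1-D score vector."""
    u = u - u.max()
    e = np.exp(u)
    return e / e.sum()

def swapped_loss(z_t, z_s, q_t, q_s, C, tau=0.1):
    """L(z_t, z_s) = l(z_t, q_s) + l(z_s, q_t), where
    l(z, q) = -sum_k q^(k) log p^(k) and p = softmax(z^T C / tau)."""
    p_t = softmax(z_t @ C / tau)
    p_s = softmax(z_s @ C / tau)
    return -(q_s * np.log(p_t)).sum() - (q_t * np.log(p_s)).sum()

rng = np.random.default_rng(0)
D, K = 16, 8
C = rng.normal(size=(D, K))
z_t = rng.normal(size=D); z_t /= np.linalg.norm(z_t)
z_s = rng.normal(size=D); z_s /= np.linalg.norm(z_s)
# Toy codes: soft assignments summing to 1 (stand-ins for Sinkhorn codes).
q_t = softmax(z_t @ C)
q_s = softmax(z_s @ C)
loss = swapped_loss(z_t, z_s, q_t, q_s, C)
```

Each term is a cross-entropy between the code of one view and the softmax prediction of the other, which is exactly the "swap": the representation of view $t$ must predict the assignment of view $s$, and vice versa.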

Computing codes online

In order to make the method online, they compute the codes using only the image features within a batch. SwAV computes codes using the prototypes $\bf C$ such that all the examples in a batch are equally partitioned by the prototypes. This equipartition constraint ensures that the codes for different images in a batch are distinct, thus preventing the trivial solution where every image has the same code.

Given $B$ feature vectors $\mathbf{Z} = [\mathbf{z}_1, \ldots, \mathbf{z}_{B}]$, we are interested in mapping them to the prototypes $\mathbf{C} = [\mathbf{c}_1, \ldots, \mathbf{c}_K]$. We denote this mapping, or codes, by $\mathbf{Q} = [\mathbf{q}_1, \ldots, \mathbf{q}_B]$, and optimize $\mathbf{Q}$ to maximize the similarity between the features and the prototypes, i.e.

$$ \max _{\mathbf{Q} \in \mathcal{Q}} \operatorname{Tr}\left(\mathbf{Q}^{\top} \mathbf{C}^{\top} \mathbf{Z}\right)+\varepsilon H(\mathbf{Q}) $$

where $H$ is the entropy function, $H(\mathbf{Q})=-\sum_{ij} \mathbf{Q}_{ij} \log \mathbf{Q}_{ij}$, and $\varepsilon$ is a parameter that controls the smoothness of the mapping. The codes $\mathbf{Q}$ are constrained to lie in the transportation polytope, which enforces the equipartition constraint:

$$ \mathcal{Q}=\left\{\mathbf{Q} \in \mathbb{R}_{+}^{K \times B} \mid \mathbf{Q} \mathbf{1}_{B}=\frac{1}{K} \mathbf{1}_{K}, \; \mathbf{Q}^{\top} \mathbf{1}_{K}=\frac{1}{B} \mathbf{1}_{B}\right\} $$
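This constrained maximization can be solved approximately with a few Sinkhorn-Knopp iterations: exponentiate the scores and alternately rescale rows and columns to match the target marginals $\frac{1}{K}$ and $\frac{1}{B}$. A minimal NumPy sketch, with illustrative dimensions (SwAV itself uses only around 3 iterations per batch; more are used below just to make the marginals tight):

```python
import numpy as np

def sinkhorn(scores, eps=0.5, n_iters=3):
    """Project a K x B score matrix C^T Z onto the transportation polytope
    by alternately rescaling rows (to sum to 1/K) and columns (to 1/B)."""
    Q = np.exp(scores / eps)
    Q /= Q.sum()
    K, B = Q.shape
    for _ in range(n_iters):
        Q /= Q.sum(axis=1, keepdims=True)   # rows sum to 1 ...
        Q /= K                              # ... then to 1/K
        Q /= Q.sum(axis=0, keepdims=True)   # columns sum to 1 ...
        Q /= B                              # ... then to 1/B
    return Q

rng = np.random.default_rng(0)
K, B = 8, 32
Z = rng.normal(size=(16, B)); Z /= np.linalg.norm(Z, axis=0)  # unit features
C = rng.normal(size=(16, K)); C /= np.linalg.norm(C, axis=0)  # unit prototypes
Q = sinkhorn(C.T @ Z, n_iters=100)
# Columns of Q are soft codes q_b; rows are (approximately) equipartitioned.
```

The entropy term in the objective is what makes this soft: small $\varepsilon$ gives near-one-hot codes but risks collapse in practice, so a moderate value is used.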