SwAV (Caron et al., 2020) proposes a "swapped" prediction mechanism that predicts the cluster assignment of one view from the representation of another view. The method is based on a relaxed version of instance discrimination, namely clustering-based discrimination: instead of pushing each instance apart from every other instance, instances are first clustered together and the clusters are pushed apart. The authors also propose a new data augmentation strategy, multi-crop, which uses a mix of views at different resolutions in place of two full-resolution views, without increasing the memory or compute requirements. They argue that their method is more memory efficient and scales to unlimited amounts of data.
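The multi-crop idea can be illustrated with a minimal NumPy sketch. The crop counts and sizes below are hypothetical hyper-parameters chosen for illustration, and a real pipeline would also resize and further augment each crop:

```python
import numpy as np

rng = np.random.default_rng(0)

def multi_crop(image, num_global=2, num_local=4,
               global_size=224, local_size=96):
    """Sample a mix of large and small random crops from one image.

    Illustrative sketch only: counts and sizes are made-up defaults,
    and real pipelines resize/augment each crop after sampling.
    """
    h, w, _ = image.shape
    crops = []
    for size in [global_size] * num_global + [local_size] * num_local:
        top = rng.integers(0, h - size + 1)
        left = rng.integers(0, w - size + 1)
        crops.append(image[top:top + size, left:left + size])
    return crops

image = rng.random((256, 256, 3))
views = multi_crop(image)  # 2 global views + 4 low-resolution views
```

Because the extra views are low resolution, adding them increases the per-image cost only marginally compared with processing additional full-resolution views.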

Forward pass: each image $\mathbf{x}_n$ is transformed into an augmented view $\mathbf{x}_{nt}$ by applying a transformation $t$ sampled from the set $\mathcal{T}$ of image transformations. The augmented view is mapped to a vector representation by an encoder network $f_\theta$, and the feature is then $\ell_2$-normalized, i.e. $\mathbf{z}_{nt}=f_{\theta}\left(\mathbf{x}_{nt}\right) /\left\|f_{\theta}\left(\mathbf{x}_{nt}\right)\right\|_{2}$. SwAV then computes a code $\mathbf{q}_{nt}$ from this feature by mapping $\mathbf{z}_{nt}$ to a set of $K$ trainable prototype vectors, $\{\mathbf{c}_1,\dots,\mathbf{c}_K\}$, denoted by $\mathbf{C}$ (whose columns are the prototype vectors).
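The forward pass up to the prototype scores can be sketched in NumPy. This is a hedged illustration, not the paper's implementation: the random features stand in for encoder outputs, and the dimensions $B$, $D$, $K$ are made up:

```python
import numpy as np

rng = np.random.default_rng(0)

B, D, K = 4, 16, 8                        # batch, feature dim, prototypes (illustrative)
feats = rng.normal(size=(B, D))           # stand-in for encoder outputs f_theta(x_nt)
Z = feats / np.linalg.norm(feats, axis=1, keepdims=True)  # l2-normalized z_nt
C = rng.normal(size=(D, K))               # trainable prototype vectors as columns

scores = Z @ C                            # z_nt^T c_k: input to both codes and loss
```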
SwAV defines the contrastive loss as $L(\mathbf{z}_{t}, \mathbf{z}_{s})=\ell(\mathbf{z}_{t}, \mathbf{q}_{s})+\ell(\mathbf{z}_{s}, \mathbf{q}_{t})$, where the function $\ell(\mathbf{z},\mathbf{q})$ measures the fit between a feature $\mathbf{z}$ and a code $\mathbf{q}$, formally defined as
$$ \ell\left(\mathbf{z}_{t}, \mathbf{q}_{s}\right)=-\sum_{k} \mathbf{q}_{s}^{(k)} \log \mathbf{p}_{t}^{(k)}, \quad \text { where } \ \mathbf{p}_{t}^{(k)}=\frac{\exp \left(\frac{1}{\tau} \mathbf{z}_{t}^{\top} \mathbf{c}_{k}\right)}{\sum_{k^{\prime}} \exp \left(\frac{1}{\tau} \mathbf{z}_{t}^{\top} \mathbf{c}_{k^{\prime}}\right)}. $$
where $\tau$ is a temperature parameter. Summing this loss over all the images and pairs of data augmentations leads to the following loss function for the swapped prediction problem:
$$ -\frac{1}{N} \sum_{n=1}^{N} \sum_{s, t \sim \mathcal{T}}\left[\frac{1}{\tau} \mathbf{z}_{n t}^{\top} \mathbf{C} \mathbf{q}_{n s}+\frac{1}{\tau} \mathbf{z}_{n s}^{\top} \mathbf{C} \mathbf{q}_{n t}-\log \sum_{k=1}^{K} \exp \left(\frac{\mathbf{z}_{n t}^{\top} \mathbf{c}_{k}}{\tau}\right)-\log \sum_{k=1}^{K} \exp \left(\frac{\mathbf{z}_{n s}^{\top} \mathbf{c}_{k}}{\tau}\right) \right] $$
Intuitively, the method compares the features $\mathbf{z}_t$ and $\mathbf{z}_s$ using intermediate codes $\mathbf{q}_t$ and $\mathbf{q}_s$.
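The swapped loss can be sketched in NumPy as follows. All names and dimensions are illustrative; the random soft codes here are stand-ins for the Sinkhorn-derived codes discussed next:

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(logits):
    logits = logits - logits.max(axis=1, keepdims=True)  # numerical stability
    e = np.exp(logits)
    return e / e.sum(axis=1, keepdims=True)

def swapped_loss(z_t, z_s, q_t, q_s, C, tau=0.1):
    """L(z_t, z_s) = l(z_t, q_s) + l(z_s, q_t), with
    l(z, q) = -sum_k q^(k) log p^(k) and p = softmax(z^T C / tau)."""
    p_t = softmax(z_t @ C / tau)
    p_s = softmax(z_s @ C / tau)
    loss_ts = -(q_s * np.log(p_t)).sum(axis=1).mean()  # l(z_t, q_s)
    loss_st = -(q_t * np.log(p_s)).sum(axis=1).mean()  # l(z_s, q_t)
    return loss_ts + loss_st

B, D, K = 4, 16, 8
normalize = lambda x: x / np.linalg.norm(x, axis=1, keepdims=True)
z_t = normalize(rng.normal(size=(B, D)))
z_s = normalize(rng.normal(size=(B, D)))
C = rng.normal(size=(D, K))
q_t, q_s = softmax(z_t @ C), softmax(z_s @ C)  # stand-ins for Sinkhorn codes
loss = swapped_loss(z_t, z_s, q_t, q_s, C)
```

Note the swap: the code from view $s$ supervises the prediction from view $t$, and vice versa.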
In order to make the method online, they compute the codes using only the image features within a batch. SwAV computes codes using the prototypes $\bf C$ such that all the examples in a batch are equally partitioned by the prototypes. This equipartition constraint ensures that the codes for different images in a batch are distinct, thus preventing the trivial solution where every image has the same code.
Given $B$ feature vectors $\mathbf Z = [\mathbf z_1, \ldots, \mathbf z_{B}]$, we are interested in mapping them to the prototypes $\mathbf C = [\mathbf c_1, \ldots, \mathbf c_K]$. We denote this mapping, or codes, by $\mathbf Q = [\mathbf q_1, \ldots, \mathbf q_B]$, and optimize $\mathbf Q$ to maximize the similarity between the features and the prototypes, i.e.
$$ \max _{\mathbf{Q} \in \mathcal{Q}} \operatorname{Tr}\left(\mathbf{Q}^{\top} \mathbf{C}^{\top} \mathbf{Z}\right)+\varepsilon H(\mathbf{Q}) $$
where $H$ is the entropy function, $H(\mathbf{Q})=-\sum_{ij} \mathbf{Q}_{ij} \log \mathbf{Q}_{ij}$, and $\varepsilon$ is a parameter that controls the smoothness of the mapping. The codes are restricted to the transportation polytope
$$ \mathcal{Q}=\left\{\mathbf{Q} \in \mathbb{R}_{+}^{K \times B} \,\middle|\, \mathbf{Q} \mathbf{1}_{B}=\frac{1}{K} \mathbf{1}_{K},\ \mathbf{Q}^{\top} \mathbf{1}_{K}=\frac{1}{B} \mathbf{1}_{B}\right\}, $$
whose row and column constraints enforce the equipartition: each prototype receives $1/K$ of the total mass and each image contributes $1/B$.
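This entropy-regularized problem can be solved approximately with a few Sinkhorn-Knopp iterations, which alternately rescale the rows and columns of $\exp(\mathbf{C}^\top \mathbf{Z}/\varepsilon)$. Below is a hedged NumPy sketch; the shapes, $\varepsilon$, and iteration count are illustrative choices, not the paper's exact settings:

```python
import numpy as np

rng = np.random.default_rng(0)

def sinkhorn(scores, eps=0.05, n_iters=3):
    """Approximate argmax_Q Tr(Q^T C^T Z) + eps * H(Q) over the
    transportation polytope by alternately normalizing rows
    (prototypes, sum 1/K) and columns (batch samples, sum 1/B)."""
    logits = scores / eps
    Q = np.exp(logits - logits.max())            # shift for numerical stability
    K, B = Q.shape
    for _ in range(n_iters):
        Q /= Q.sum(axis=1, keepdims=True) * K    # rows sum to 1/K
        Q /= Q.sum(axis=0, keepdims=True) * B    # columns sum to 1/B
    return Q

K, B = 8, 4
scores = rng.normal(scale=0.1, size=(K, B))      # stand-in for C^T Z on one batch
Q = sinkhorn(scores)
```

The alternating normalization reflects the known form of the optimum, $\mathbf{Q}^* = \operatorname{Diag}(\mathbf{u}) \exp(\mathbf{C}^\top \mathbf{Z}/\varepsilon) \operatorname{Diag}(\mathbf{v})$ for renormalization vectors $\mathbf{u}$ and $\mathbf{v}$, and since only the current batch's features are needed, the whole procedure runs online.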