Bootstrap Your Own Latent: A New Approach to Self-Supervised Learning

BYOL (Grill et al., 2020) introduces a self-supervised method that achieves high performance without the negative pairs that contrastive methods rely on. It uses an online network and a target network, a design reminiscent of target networks in deep RL.

BYOL defines $f_\theta$ as the online representation encoder, $g_\theta$ as the online projection head, and $q_\theta$ as the online prediction head; on the target side, $f_\xi$ is the target representation encoder and $g_\xi$ the target projection head. The target parameters are updated as an exponential moving average $\xi \leftarrow \tau \xi + (1-\tau)\theta$, and a stop-gradient $\mathrm{sg}$ prevents gradients from flowing into the target network. The predictor $q_\theta$ tries to match the output of the online network to the output of the target network.
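As a minimal sketch of the exponential-moving-average update (parameters represented as plain arrays; names are illustrative, not the authors' code):

```python
import numpy as np

def ema_update(target_params, online_params, tau=0.99):
    """Target update xi <- tau * xi + (1 - tau) * theta.

    The target network receives no gradients; it only tracks a
    slow-moving average of the online network's parameters.
    """
    return [tau * xi + (1.0 - tau) * theta
            for xi, theta in zip(target_params, online_params)]

# Toy usage: one target weight slowly tracking one online weight.
target = [np.array([0.0])]
online = [np.array([1.0])]
target = ema_update(target, online, tau=0.9)
print(target[0])  # moved 10% of the way toward the online weight
```

With $\tau$ close to 1 the target changes slowly, which is what stabilizes the bootstrap.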

BYOL samples two image augmentations $t \sim \mathcal{T}$ and $t' \sim \mathcal{T}'$ from two different distributions and produces two augmented views $v \triangleq t(x)$ and $v' \triangleq t'(x)$. Applying the encoders and projection heads gives $z_\theta \triangleq g_\theta(f_\theta(v))$ on the online side and $z'_\xi \triangleq g_\xi(f_\xi(v'))$ on the target side. The prediction head then matches $q_\theta(z_\theta)$ to $z'_\xi$ with an MSE loss; before applying the loss, both $q_\theta(z_\theta)$ and $z'_\xi$ are $\ell_2$-normalized. The loss is defined in mean-squared-error style as
$$\mathcal{L}_{\theta}^{\mathrm{BYOL}} \triangleq \left\|\overline{q_{\theta}}(z_{\theta})-\bar{z}'_{\xi}\right\|_{2}^{2} = 2 - 2 \cdot \frac{\langle q_{\theta}(z_{\theta}),\, z'_{\xi}\rangle}{\left\|q_{\theta}(z_{\theta})\right\|_{2}\left\|z'_{\xi}\right\|_{2}}.$$
BYOL also symmetrizes the loss by separately feeding $v'$ to the online network and $v$ to the target network to compute $\tilde{\mathcal{L}}_{\theta}^{\mathrm{BYOL}}$, and optimizes $\mathcal{L}_{\theta}^{\mathrm{BYOL}} + \tilde{\mathcal{L}}_{\theta}^{\mathrm{BYOL}}$ with respect to $\theta$ only.
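The loss above reduces to 2 minus twice the cosine similarity between the online prediction and the target projection. A minimal NumPy sketch of the symmetrized loss (arrays stand in for network outputs; names are illustrative):

```python
import numpy as np

def l2_normalize(x, eps=1e-12):
    # Normalize each row to unit L2 norm.
    return x / (np.linalg.norm(x, axis=-1, keepdims=True) + eps)

def byol_loss(q_online, z_target):
    """Normalized MSE: ||q_bar - z_bar||^2 = 2 - 2<q,z>/(||q|| ||z||), batch-averaged."""
    q = l2_normalize(q_online)
    z = l2_normalize(z_target)  # stop-gradient would apply here in a real framework
    return np.mean(np.sum((q - z) ** 2, axis=-1))

# Toy stand-ins for q_theta(z_theta) and z'_xi (batch of 4, dim 8).
rng = np.random.default_rng(0)
q, z = rng.normal(size=(4, 8)), rng.normal(size=(4, 8))

# Symmetrized total: the second term corresponds to feeding v' to the
# online network and v to the target network (again, illustrative arrays).
q2, z2 = rng.normal(size=(4, 8)), rng.normal(size=(4, 8))
total = byol_loss(q, z) + byol_loss(q2, z2)
print(total)
```

Note that only the online parameters $\theta$ would be trained on this loss; the target side is updated solely through the moving average.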

One interesting finding is that even with a fixed, randomly initialized target network, the online network is still able to learn a representation from it: under the linear evaluation protocol it substantially outperforms the randomly initialized encoder itself. The ablation study shows that BYOL is more resilient than contrastive methods such as SimCLR to changes in hyper-parameters (e.g. batch size) and in the choice of image augmentations. Removing negative pairs, however, appears viable only for bootstrapping procedures like BYOL; purely contrastive methods degrade without them.