Siamese networks
Contrastive learning
Clustering
BYOL
two randomly augmented views $x_1$ and $x_2$ from an image $x$
an encoder network $f$ consisting of a backbone (e.g., ResNet) and a projection MLP head
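a prediction MLP head $h$: transforms the output of one view and matches it to the other view. With $p_1 \triangleq h(f(x_1))$ and $z_2 \triangleq f(x_2)$, the negative cosine similarity is minimized: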
$$ \mathcal{D}(p_1, z_2) = -\frac{p_1}{\lVert p_1 \rVert_2} \cdot \frac{z_2}{\lVert z_2 \rVert_2} \tag{1} $$
a symmetrized loss
$$ \mathcal{L} = \frac{1}{2}\mathcal{D}(p_1, z_2) + \frac{1}{2}\mathcal{D}(p_2, z_1). \tag{2} $$
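Equations (1) and (2) can be sketched in a few lines of plain Python. This is only an illustration on Python lists, not the training code of the method; the function names `neg_cosine` and `symmetrized_loss` are made up for this example.

```python
import math

def neg_cosine(p, z):
    """Negative cosine similarity D(p, z) from Eq. (1)."""
    dot = sum(a * b for a, b in zip(p, z))
    norm_p = math.sqrt(sum(a * a for a in p))
    norm_z = math.sqrt(sum(b * b for b in z))
    return -dot / (norm_p * norm_z)

def symmetrized_loss(p1, z1, p2, z2):
    """Symmetrized loss L from Eq. (2): each prediction is matched
    against the projection of the *other* view."""
    return 0.5 * neg_cosine(p1, z2) + 0.5 * neg_cosine(p2, z1)

# Parallel vectors give the minimum value -1, orthogonal vectors give 0.
aligned = neg_cosine([1.0, 0.0], [2.0, 0.0])
orthogonal = neg_cosine([1.0, 0.0], [0.0, 1.0])
```

Because both vectors are $\ell_2$-normalized, $\mathcal{D}$ depends only on the angle between them, so the loss is bounded below by $-1$.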
stop-gradient ($\texttt{stopgrad}$): $z_2$ is treated as a constant in this term, so no gradient flows through it
$$ \mathcal{D}(p_1, \texttt{stopgrad}(z_2)). \tag{3} $$
$$ \mathcal{L} = \frac{1}{2}\mathcal{D}(p_1, \texttt{stopgrad}(z_2)) + \frac{1}{2}\mathcal{D}(p_2, \texttt{stopgrad}(z_1)). \tag{4} $$
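The effect of the stop-gradient in Eq. (3) can be illustrated without an autodiff library. The sketch below assumes a hypothetical toy encoder (element-wise scaling by weights `w`) and an identity predictor, and estimates gradients by central finite differences. With `stopgrad=True`, $z_2$ is frozen at its current value before the weights are perturbed, which is what treating it as a constant means. In a real framework this is a single call such as detaching the tensor.

```python
import math

def neg_cosine(p, z):
    """Negative cosine similarity D(p, z) from Eq. (1)."""
    dot = sum(a * b for a, b in zip(p, z))
    return -dot / (math.sqrt(sum(a * a for a in p)) *
                   math.sqrt(sum(b * b for b in z)))

def encoder(w, x):
    # Hypothetical toy encoder: element-wise scaling (assumption for illustration).
    return [wi * xi for wi, xi in zip(w, x)]

def loss(w, x1, x2, z2_const=None):
    p1 = encoder(w, x1)  # predictor h is the identity in this toy setup
    # If z2_const is given, z2 does not depend on the perturbed weights.
    z2 = z2_const if z2_const is not None else encoder(w, x2)
    return neg_cosine(p1, z2)

def grad(w, x1, x2, stopgrad, eps=1e-6):
    """Central finite-difference gradient of D(p1, z2) with respect to w."""
    z2_const = encoder(w, x2) if stopgrad else None
    g = []
    for i in range(len(w)):
        w_plus, w_minus = list(w), list(w)
        w_plus[i] += eps
        w_minus[i] -= eps
        g.append((loss(w_plus, x1, x2, z2_const) -
                  loss(w_minus, x1, x2, z2_const)) / (2 * eps))
    return g

w, x1, x2 = [1.0, 1.0], [1.0, 2.0], [2.0, 1.0]
g_stop = grad(w, x1, x2, stopgrad=True)   # gradient through p1 only
g_full = grad(w, x1, x2, stopgrad=False)  # gradient through both p1 and z2
```

The loss value is identical in both cases; only the gradient changes. For these particular inputs `g_stop` is nonzero while `g_full` is numerically zero, which shows that the two objectives give different updates even though they evaluate to the same number.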
Baseline settings