1. Introduction

2. Related Work

Siamese networks

Contrastive learning

Clustering

BYOL

- Input : two randomly augmented views $x_1$ and $x_2$ of an image $x$
- Encoder network ($f$) : a backbone (e.g., ResNet) followed by a projection MLP head
- Prediction MLP head ($h$) : transforms the output of one view and matches it to the other view

$$ \mathcal{D}(p_1, z_2) = -\frac{p_1}{||p_1||_2} \cdot \frac{z_2}{||z_2||_2} \tag{1} $$
- $p_1 \triangleq h(f(x_1))$, $z_2 \triangleq f(x_2)$
- $||\cdot||_2$ : $l_2$-norm
a symmetrized loss

$$ \mathcal{L} = \frac{1}{2}\mathcal{D}(p_1, z_2) + \frac{1}{2}\mathcal{D}(p_2, z_1). \tag{2} $$
- minimum possible value : -1
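
As a quick check on Eqs. (1) and (2), here is a minimal pure-Python sketch. The function names and the toy 2-d vectors are illustrative assumptions, not taken from the paper; real inputs would be the 2048-d outputs of $f$ and $h$.

```python
import math

def neg_cosine(p, z):
    """Eq. (1): negative cosine similarity of two equal-length vectors."""
    dot = sum(a * b for a, b in zip(p, z))
    norm_p = math.sqrt(sum(a * a for a in p))
    norm_z = math.sqrt(sum(b * b for b in z))
    return -dot / (norm_p * norm_z)

def symmetrized_loss(p1, z1, p2, z2):
    """Eq. (2): average of the two cross-view terms."""
    return 0.5 * neg_cosine(p1, z2) + 0.5 * neg_cosine(p2, z1)

# Toy vectors (hypothetical values, chosen only to hit the extremes).
# All four vectors point the same way, so the loss reaches its minimum.
aligned = symmetrized_loss([1.0, 0.0], [2.0, 0.0], [3.0, 0.0], [4.0, 0.0])
# Each p is orthogonal to the z of the other view, so both terms vanish.
orthogonal = symmetrized_loss([1.0, 0.0], [1.0, 0.0], [0.0, 1.0], [0.0, 1.0])
```

Because both vectors are $l_2$-normalized, vector length does not matter: only the angle between $p$ and $z$ changes the loss.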
stop-gradient ($\tt{stopgrad}$)
- Asymmetric
$$ \mathcal{D}(p_1, \tt{stopgrad}(z_2)). \tag{3} $$
- Symmetric
$$ \mathcal{L} = \frac{1}{2}\mathcal{D}(p_1, \tt{stopgrad}(z_2)) + \frac{1}{2}\mathcal{D}(p_2, \tt{stopgrad}(z_1)). \tag{4} $$
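
One way to express Eqs. (3) and (4) is with PyTorch's `Tensor.detach()`, which returns a tensor cut off from the autograd graph. The sketch below rests on several assumptions: the helper names are made up, the random leaf tensors stand in for the real network outputs, and the loss is averaged over the batch (the notes do not state this).

```python
import torch
import torch.nn.functional as F

def D(p, z):
    """Eq. (3): negative cosine similarity with stop-gradient on z."""
    z = z.detach()  # stopgrad: z is treated as a constant target
    return -F.cosine_similarity(p, z, dim=1).mean()  # batch mean (assumed)

def symmetric_loss(p1, p2, z1, z2):
    """Eq. (4): both terms use a stop-gradient on the z side."""
    return 0.5 * D(p1, z2) + 0.5 * D(p2, z1)

torch.manual_seed(0)
# Stand-ins for network outputs: batch of 4, 8-d (hypothetical sizes).
z1 = torch.randn(4, 8, requires_grad=True)
z2 = torch.randn(4, 8, requires_grad=True)
p1 = torch.randn(4, 8, requires_grad=True)
p2 = torch.randn(4, 8, requires_grad=True)

loss = symmetric_loss(p1, p2, z1, z2)
loss.backward()

# Gradients reach the p side only; the z side receives none.
print(p1.grad is not None, z1.grad is None)
```

In a real model $z$ and $p$ come from the same encoder, so the encoder still receives gradients through the $p$ path of each term; only the path through $z$ is blocked.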

Baseline settings

Optimizer
- SGD with 0.9 momentum for pretraining
- learning rate : $lr \times$ BatchSize / 256
  - base $lr$ = 0.05
- cosine decay schedule
- weight decay : 0.0001
- batch size : 512 (8-GPU implementations)
  - other batch sizes also work well
- batch normalization (BN) synchronized across devices
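
The learning-rate rule and the cosine schedule above can be sketched in plain Python. The function names are hypothetical, and decaying to zero with no warmup is an assumption, since the notes only say "cosine decay".

```python
import math

def scaled_lr(base_lr=0.05, batch_size=512):
    """Linear scaling rule: lr x BatchSize / 256."""
    return base_lr * batch_size / 256

def cosine_lr(step, total_steps, peak_lr):
    """Half-cosine decay from peak_lr down to 0 (end value assumed)."""
    return 0.5 * peak_lr * (1.0 + math.cos(math.pi * step / total_steps))

peak = scaled_lr()                    # 0.05 * 512 / 256 = 0.1
start = cosine_lr(0, 100, peak)       # full learning rate at step 0
middle = cosine_lr(50, 100, peak)     # half the peak at the midpoint
end = cosine_lr(100, 100, peak)       # decays to (numerically) zero
```

With the default batch size of 512 the effective learning rate is 0.1; halving the batch size to 256 would give 0.05 under the same rule.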
Projection MLP (in $f$)
- FC (fully-connected) + BN
  - 3 layers, 2048-d each
Prediction MLP ($h$)
- FC + BN (no BN on the output FC)
  - 2 layers
  - input and output : 2048-d / hidden layer : 512-d
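
The two heads described above could be built in PyTorch roughly as follows. This is a sketch under assumptions: the ReLU placement (after each hidden BN, none on the output layers) is not stated in these notes, the function names are made up, and the random input stands in for the 2048-d pooled features of a ResNet-50 backbone.

```python
import torch
import torch.nn as nn

def projection_mlp(in_dim=2048, dim=2048):
    """3-layer projection head: FC + BN on every layer."""
    return nn.Sequential(
        nn.Linear(in_dim, dim), nn.BatchNorm1d(dim), nn.ReLU(inplace=True),
        nn.Linear(dim, dim), nn.BatchNorm1d(dim), nn.ReLU(inplace=True),
        nn.Linear(dim, dim), nn.BatchNorm1d(dim),  # output: BN, no ReLU (assumed)
    )

def prediction_mlp(dim=2048, hidden=512):
    """2-layer bottleneck head: 2048 -> 512 -> 2048."""
    return nn.Sequential(
        nn.Linear(dim, hidden), nn.BatchNorm1d(hidden), nn.ReLU(inplace=True),
        nn.Linear(hidden, dim),  # output FC: no BN
    )

x = torch.randn(4, 2048)   # stand-in for backbone features (batch of 4)
z = projection_mlp()(x)    # projection output
p = prediction_mlp()(z)    # prediction output, same width as z
```

The 512-d hidden layer makes $h$ a bottleneck: its input and output stay at 2048-d so that $p$ and $z$ can be compared directly in Eq. (1).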
Backbone : ResNet-50, pre-trained for 100 epochs