Abstract

Not necessary
- negative sample pairs
- large batches
- momentum encoders
Stop-gradient is important
Proof-of-concept
- k-means
Competitive results

Method

Distance is defined as negative cosine similarity

$$ \mathcal{D}\left(p_{1}, z_{2}\right)=-\frac{p_{1}}{\left\|p_{1}\right\|_{2}} \cdot \frac{z_{2}}{\left\|z_{2}\right\|_{2}} $$

Loss is symmetric

$$ \mathcal{L}=\frac{1}{2} \mathcal{D}\left(p_{1}, z_{2}\right)+\frac{1}{2} \mathcal{D}\left(p_{2}, z_{1}\right) $$
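
A minimal sketch of the two formulas above in plain Python, to make the arithmetic concrete. The function names `neg_cosine` and `symmetric_loss` and the toy vectors are illustrative assumptions, not taken from the notes. Plain lists carry no gradients, so stop-gradient is not modelled here.

```python
import math

def neg_cosine(p, z):
    """D(p, z): negative cosine similarity of two equal-length vectors."""
    dot = sum(a * b for a, b in zip(p, z))
    norm_p = math.sqrt(sum(a * a for a in p))  # L2 norm of p
    norm_z = math.sqrt(sum(b * b for b in z))  # L2 norm of z
    return -dot / (norm_p * norm_z)

def symmetric_loss(p1, z1, p2, z2):
    """L = 1/2 * D(p1, z2) + 1/2 * D(p2, z1)."""
    return 0.5 * neg_cosine(p1, z2) + 0.5 * neg_cosine(p2, z1)

# Toy vectors (hypothetical values, chosen so the result is easy to check).
p1, z2 = [1.0, 0.0], [2.0, 0.0]  # same direction, so D(p1, z2) = -1
p2, z1 = [0.0, 1.0], [1.0, 0.0]  # orthogonal, so D(p2, z1) = 0
loss = symmetric_loss(p1, z1, p2, z2)
print(loss)
```

Each term lies in [-1, 1], so the loss is bounded below by -1, which is reached when both pairs are perfectly aligned.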

Empirical Study

Stop-gradient
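
With stop-gradient, the symmetric loss from the Method section is usually written with each $z$ term held constant during backpropagation. The form below is a sketch under that assumption; the $\operatorname{stopgrad}$ notation does not appear in these notes.

$$ \mathcal{L}=\frac{1}{2} \mathcal{D}\left(p_{1}, \operatorname{stopgrad}\left(z_{2}\right)\right)+\frac{1}{2} \mathcal{D}\left(p_{2}, \operatorname{stopgrad}\left(z_{1}\right)\right) $$

The value of the loss is unchanged. Only the gradient differs, since it flows through the $p$ terms alone.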

Predictor

b) Not because of a collapsing solution

Batch Size