Momentum Contrast for Unsupervised Visual Representation Learning

Momentum Contrast (MoCo) trains a visual representation encoder by matching an encoded query q to a dictionary of encoded keys using a contrastive loss. The dictionary keys { k 0 , k 1 , k 2 , ... } are defined on-the-fly by a set of data samples. The dictionary is built as a queue, with the current mini-batch enqueued and the oldest mini-batch dequeued, decoupling it from the mini-batch size. The keys are encoded by a slowly progressing encoder, driven by a momentum update with the query encoder. This method enables a large and consistent dictionary for learning visual representations.

Code

MoCo (He et al., 2019) follows a simple instance discrimination task: a query matches a key if they are encoded views (e.g., different augmentations) of the same image. Viewed from another perspective, this is a dictionary look-up problem: an encoded query should be similar to its matching key and dissimilar to all others (i.e., negative samples). Learning is formulated as minimizing a contrastive loss.

The contrastive loss is $\displaystyle \mathcal{L}_{q}=-\log \frac{\exp \left(q \cdot k_{+} / \tau\right)}{\sum_{i=0}^{K} \exp \left(q \cdot k_{i} / \tau\right)}$. The sum is over one positive and $K$ negative samples. The loss is the log loss of a $(K+1)$-way softmax-based classifier that tries to classify $q$ as $k_+$.
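The loss above (often called InfoNCE) can be sketched in NumPy as follows; the function name, shapes, and the temperature default are illustrative assumptions, not the paper's exact code:

```python
import numpy as np

def contrastive_loss(q, k_pos, queue, tau=0.07):
    """InfoNCE sketch (hypothetical names/shapes).

    q:     (N, C) encoded queries
    k_pos: (N, C) matching (positive) keys
    queue: (C, K) dictionary of K negative keys, assumed L2-normalized
    """
    # L2-normalize so dot products are cosine similarities
    q = q / np.linalg.norm(q, axis=1, keepdims=True)
    k_pos = k_pos / np.linalg.norm(k_pos, axis=1, keepdims=True)

    l_pos = np.sum(q * k_pos, axis=1, keepdims=True)      # (N, 1) positive logits
    l_neg = q @ queue                                      # (N, K) negative logits
    logits = np.concatenate([l_pos, l_neg], axis=1) / tau  # (N, K+1)

    # Log loss of a (K+1)-way softmax classifier; the positive sits at index 0
    logits -= logits.max(axis=1, keepdims=True)            # numerical stability
    log_prob = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -log_prob[:, 0].mean()
```

When the positive key is close to the query (high cosine similarity), the index-0 logit dominates the softmax and the loss approaches zero, as expected for a $(K+1)$-way classifier that should pick $k_+$.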

MoCo argues that it is desirable to build dictionaries that are 1) large and 2) consistent as they evolve during training. To this end, MoCo proposes:

  1. a FIFO queue that lets the dictionary be large (decoupled from the mini-batch size) and keeps keys consistent by dequeuing the oldest mini-batch, whose keys are the most outdated;
  2. a momentum update ($\theta_k \gets m\theta_k+(1-m)\theta_q$) for the key encoder, so that keys are encoded by a slowly evolving encoder and remain consistent across mini-batches.
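The two mechanisms above can be sketched together; the class name and parameter layout below are illustrative assumptions (a dict of NumPy arrays stands in for encoder parameters):

```python
import numpy as np
from collections import deque

class KeyDictionary:
    """Sketch of MoCo's FIFO key queue and momentum update (hypothetical names)."""

    def __init__(self, max_size, m=0.999):
        # deque with maxlen gives FIFO behavior: appending past capacity
        # automatically drops the oldest key
        self.queue = deque(maxlen=max_size)
        self.m = m  # momentum coefficient (paper uses m close to 1, e.g. 0.999)

    def momentum_update(self, theta_q, theta_k):
        """theta_k <- m * theta_k + (1 - m) * theta_q, applied per parameter."""
        return {name: self.m * theta_k[name] + (1 - self.m) * theta_q[name]
                for name in theta_k}

    def enqueue(self, keys):
        """Add the current mini-batch of encoded keys to the dictionary."""
        for k in keys:
            self.queue.append(k)
```

With $m$ close to 1, the key encoder changes very slowly, so keys already in the queue were produced by nearly the same encoder as fresh keys, which is what keeps the large dictionary consistent.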

Comparison to previous contrastive methods

Performance comparison

Momentum ablation study
