A. Radford et al., "Learning Transferable Visual Models From Natural Language Supervision,"
arXiv preprint arXiv:2103.00020, 2021. [Online].
Available: <https://github.com/OpenAI/CLIP>
This work addresses a gap in computer vision: state-of-the-art (SOTA) models are primarily trained to predict a fixed set of predetermined object categories. That approach requires large, specialized, crowd-labeled datasets like ImageNet and limits the generality and usability of the models, because specifying any new visual concept requires gathering additional labeled data.
In contrast, breakthroughs in NLP (such as GPT-3) leverage scalable pre-training on raw text, enabling models to perform a wide variety of tasks and transfer zero-shot to downstream tasks. The paper asks whether similarly scalable pre-training that learns directly from natural language paired with images on the web could yield a comparable breakthrough in CV, since prior attempts at natural language supervision performed far below the contemporary state of the art on common benchmarks.
The model, CLIP (Contrastive Language-Image Pre-training), is trained on a dataset of 400 million (image, text) pairs collected from the internet (WIT, WebImageText) combined with a contrastive learning objective.
Step-by-Step Approach:

Contrastive Pre-training - CLIP jointly trains an Image Encoder and a Text Encoder to produce embeddings in a shared multi-modal embedding space.
Given a batch of N paired image and text examples, the model is trained to predict which of the $N \times N$ possible pairings across the batch are the correct real pairs.
The objective is to maximize the cosine similarity between the embeddings of the N correct pairs while minimizing the cosine similarity of the $N^2 - N$ incorrect pairings (negatives).
This is optimized using a symmetric cross-entropy loss over the similarity scores.
The core calculation for the scaled pairwise cosine similarities is `logits = np.dot(I_e, T_e.T) * np.exp(t)`, where `I_e` and `T_e` are the L2-normalized image and text embeddings and `t` is a learned temperature parameter (see the sketch below).
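A minimal numpy sketch of this symmetric objective, in the spirit of the paper's pseudocode, is given below. The random `I_e`/`T_e` arrays, the `l2_normalize` and `cross_entropy` helpers, and the batch size are placeholders standing in for the real encoders and framework loss, not the released implementation.

```python
import numpy as np

def l2_normalize(x, axis=1):
    return x / np.linalg.norm(x, axis=axis, keepdims=True)

def cross_entropy(logits, labels, axis):
    # Softmax over the given axis, then negative log-likelihood of the
    # correct (diagonal) pairings.
    logits = logits - logits.max(axis=axis, keepdims=True)
    log_probs = logits - np.log(np.exp(logits).sum(axis=axis, keepdims=True))
    if axis == 0:
        picked = log_probs[labels, np.arange(len(labels))]
    else:
        picked = log_probs[np.arange(len(labels)), labels]
    return -picked.mean()

n, d_e = 8, 64                     # batch size, embedding dimension (placeholders)
rng = np.random.default_rng(0)
I_e = l2_normalize(rng.normal(size=(n, d_e)))   # stand-in image embeddings [n, d_e]
T_e = l2_normalize(rng.normal(size=(n, d_e)))   # stand-in text embeddings  [n, d_e]
t = np.log(1 / 0.07)               # learned temperature, initialized to 0.07 per the paper

# scaled pairwise cosine similarities [n, n]
logits = np.dot(I_e, T_e.T) * np.exp(t)

# symmetric cross-entropy: the i-th image should match the i-th text
labels = np.arange(n)
loss_i = cross_entropy(logits, labels, axis=1)  # image -> text direction
loss_t = cross_entropy(logits, labels, axis=0)  # text -> image direction
loss = (loss_i + loss_t) / 2
print(loss)
```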

The contrastive training objective was found to be 4x more efficient at zero-shot ImageNet transfer than a bag-of-words prediction objective, making it key to scaling the approach.

Zero-Shot Classifier Synthesis - At test time, natural language is used to define the visual concepts: the names (or descriptions) of the target classes are embedded by the text encoder to synthesize a zero-shot classifier.
Zero-Shot Prediction - An input image is encoded into an embedding ($I_1$). The model computes the cosine similarity between this image embedding and all pre-computed class text embeddings, and the image is classified as the category whose text has the highest similarity score, as sketched below.
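The sketch below illustrates these two steps. The `image_encoder` and `text_encoder` stubs, the class names, and the file name are hypothetical placeholders that return random normalized vectors so the example runs on its own; they stand in for CLIP's actual encoders.

```python
import numpy as np

rng = np.random.default_rng(0)
d_e = 64

def l2_normalize(x):
    return x / np.linalg.norm(x, axis=-1, keepdims=True)

# --- stand-ins for CLIP's encoders (assumptions, not the real models) ---
def text_encoder(prompts):
    return l2_normalize(rng.normal(size=(len(prompts), d_e)))

def image_encoder(image):
    return l2_normalize(rng.normal(size=(d_e,)))

# 1. Synthesize the zero-shot classifier from natural-language class names.
class_names = ["dog", "cat", "airplane"]
prompts = [f"A photo of a {c}." for c in class_names]
T_e = text_encoder(prompts)               # [num_classes, d_e], computed once per dataset

# 2. Classify an image by cosine similarity against the class embeddings.
I_1 = image_encoder("some_image.png")     # [d_e]
similarities = T_e @ I_1                  # cosine similarities (both sides L2-normalized)
prediction = class_names[int(np.argmax(similarities))]
print(prediction)
```

Because the class text embeddings are computed once and reused for every image, the marginal cost of zero-shot prediction is a single image encoding plus a matrix-vector product.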
CLIP was evaluated extensively on zero-shot transfer, representation learning (linear probing), and robustness to distribution shift.
Training - All models were trained for 32 epochs using the Adam optimizer with a large minibatch size of 32,768.
Evaluation - Benchmarking was performed on a comprehensive suite of over 30 datasets, including a standardized 12-dataset suite and a broader 27-dataset suite.

Zero-Shot Optimization - The zero-shot approach also included prompt engineering and ensembling. For instance, the default prompt template "A photo of a {label}." was typically used instead of the raw class name. Together, these techniques provided a performance boost comparable to a 4x increase in model compute over the contextless (class-name-only) baseline; a sketch of prompt ensembling follows.
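One common way to realize prompt ensembling is to embed each class under several templates and average the resulting text embeddings before classification. The sketch below reuses the placeholder `text_encoder` and `l2_normalize` from the previous example; the two templates shown are illustrative, not the paper's full set (the paper ensembles 80 context prompts for ImageNet).

```python
import numpy as np

rng = np.random.default_rng(0)
d_e = 64

def l2_normalize(x):
    return x / np.linalg.norm(x, axis=-1, keepdims=True)

def text_encoder(prompts):                 # placeholder encoder, as above
    return l2_normalize(rng.normal(size=(len(prompts), d_e)))

templates = ["A photo of a {}.", "A photo of the large {}."]   # illustrative subset
class_names = ["dog", "cat", "airplane"]

classifier = []
for name in class_names:
    prompts = [t.format(name) for t in templates]
    emb = text_encoder(prompts).mean(axis=0)   # average over prompt templates
    classifier.append(l2_normalize(emb))       # re-normalize the ensembled embedding
classifier = np.stack(classifier)              # [num_classes, d_e]
```

The ensembled `classifier` matrix is then used exactly like the per-class text embeddings in the zero-shot prediction step above.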
