A. Radford et al., "Learning Transferable Visual Models From Natural Language Supervision,"
arXiv preprint arXiv:2103.00020, 2021. [Online].
Available: <https://github.com/OpenAI/CLIP>
This work addresses a gap in computer vision: state-of-the-art (SOTA) models are primarily trained to predict a fixed set of predetermined object categories. That approach requires large, specialized, crowd-labeled datasets like ImageNet and limits the generality and usability of the models, because specifying any new visual concept requires gathering additional labeled data.
In contrast, breakthroughs in NLP (such as GPT-3) leverage scalable pre-training on raw text, enabling models to perform a wide variety of tasks and transfer zero-shot to downstream tasks. The paper asks whether similarly scalable pre-training that learns directly from natural language paired with images on the web could yield a comparable breakthrough in CV, since prior attempts at natural language supervision performed far below the contemporary state of the art on common benchmarks.
The model, CLIP (Contrastive Language-Image Pre-training), is trained on a dataset of 400 million (image, text) pairs collected from the internet (WIT, WebImageText) combined with a contrastive learning objective.
Step-by-Step Approach:

Contrastive Pre-training - CLIP jointly trains an Image Encoder and a Text Encoder to produce embeddings in a shared multi-modal embedding space.
Given a batch of N paired image and text examples, the model is trained to predict which of the $N \times N$ possible pairings across the batch are the correct real pairs.
The objective is to maximize the cosine similarity between the embeddings of the N correct pairs while minimizing the cosine similarity of the $N^2 - N$ incorrect pairings (negatives).
This is optimized using a symmetric cross-entropy loss over the similarity scores.
The core calculation for the scaled pairwise cosine similarities is `logits = np.dot(I_e, T_e.T) * np.exp(t)`, where `I_e` and `T_e` are the L2-normalized image and text embeddings and `t` is a learned temperature parameter (see the sketch below).
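A minimal numpy sketch of this symmetric objective, in the spirit of the paper's pseudocode, is given below. The random `I_e`/`T_e` arrays, the `l2_normalize` and `cross_entropy` helpers, and the batch size are placeholders standing in for the real encoders and framework loss, not the released implementation.

```python
import numpy as np

def l2_normalize(x, axis=1):
    return x / np.linalg.norm(x, axis=axis, keepdims=True)

def cross_entropy(logits, labels, axis):
    # Softmax over the given axis, then negative log-likelihood of the
    # correct (diagonal) pairings.
    logits = logits - logits.max(axis=axis, keepdims=True)
    log_probs = logits - np.log(np.exp(logits).sum(axis=axis, keepdims=True))
    if axis == 0:
        picked = log_probs[labels, np.arange(len(labels))]
    else:
        picked = log_probs[np.arange(len(labels)), labels]
    return -picked.mean()

n, d_e = 8, 64                     # batch size, embedding dimension (placeholders)
rng = np.random.default_rng(0)
I_e = l2_normalize(rng.normal(size=(n, d_e)))   # stand-in image embeddings [n, d_e]
T_e = l2_normalize(rng.normal(size=(n, d_e)))   # stand-in text embeddings  [n, d_e]
t = np.log(1 / 0.07)               # learned temperature, initialized to 0.07 per the paper

# scaled pairwise cosine similarities [n, n]
logits = np.dot(I_e, T_e.T) * np.exp(t)

# symmetric cross-entropy: the i-th image should match the i-th text
labels = np.arange(n)
loss_i = cross_entropy(logits, labels, axis=1)  # image -> text direction
loss_t = cross_entropy(logits, labels, axis=0)  # text -> image direction
loss = (loss_i + loss_t) / 2
print(loss)
```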

The contrastive training objective was found to be 4x more efficient at zero-shot ImageNet transfer than a bag-of-words prediction objective, making it key to scaling the approach.

Zero-Shot Classifier Synthesis - At test time, natural language is used to define the visual concepts: the names (or descriptions) of the target classes are embedded by the text encoder to synthesize a zero-shot classifier.
Zero-Shot Prediction - An input image is encoded into an embedding ($I_1$). The model computes the cosine similarity between this image embedding and all pre-computed class text embeddings, and the image is classified as the category whose text has the highest similarity score, as sketched below.
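The sketch below illustrates these two steps. The `image_encoder` and `text_encoder` stubs, the class names, and the file name are hypothetical placeholders that return random normalized vectors so the example runs on its own; they stand in for CLIP's actual encoders.

```python
import numpy as np

rng = np.random.default_rng(0)
d_e = 64

def l2_normalize(x):
    return x / np.linalg.norm(x, axis=-1, keepdims=True)

# --- stand-ins for CLIP's encoders (assumptions, not the real models) ---
def text_encoder(prompts):
    return l2_normalize(rng.normal(size=(len(prompts), d_e)))

def image_encoder(image):
    return l2_normalize(rng.normal(size=(d_e,)))

# 1. Synthesize the zero-shot classifier from natural-language class names.
class_names = ["dog", "cat", "airplane"]
prompts = [f"A photo of a {c}." for c in class_names]
T_e = text_encoder(prompts)               # [num_classes, d_e], computed once per dataset

# 2. Classify an image by cosine similarity against the class embeddings.
I_1 = image_encoder("some_image.png")     # [d_e]
similarities = T_e @ I_1                  # cosine similarities (both sides L2-normalized)
prediction = class_names[int(np.argmax(similarities))]
print(prediction)
```

Because the class text embeddings are computed once and reused for every image, the marginal cost of zero-shot prediction is a single image encoding plus a matrix-vector product.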
CLIP was evaluated extensively on zero-shot transfer, representation learning (linear probing), and robustness to distribution shift.
Training - All models were trained for 32 epochs using the Adam optimizer with a large minibatch size of 32,768.
Evaluation - Benchmarking was performed on a comprehensive suite of over 30 datasets, including a standardized 12-dataset suite and a broader 27-dataset suite.

Zero-Shot Optimization - The zero-shot approach also included prompt engineering and ensembling. For instance, the default prompt template "A photo of a {label}." was typically used instead of the raw class name. Together, these techniques provided a performance boost comparable to a 4x increase in model compute over the contextless (class-name-only) baseline; a sketch of prompt ensembling follows.
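One common way to realize prompt ensembling is to embed each class under several templates and average the resulting text embeddings before classification. The sketch below reuses the placeholder `text_encoder` and `l2_normalize` from the previous example; the two templates shown are illustrative, not the paper's full set (the paper ensembles 80 context prompts for ImageNet).

```python
import numpy as np

rng = np.random.default_rng(0)
d_e = 64

def l2_normalize(x):
    return x / np.linalg.norm(x, axis=-1, keepdims=True)

def text_encoder(prompts):                 # placeholder encoder, as above
    return l2_normalize(rng.normal(size=(len(prompts), d_e)))

templates = ["A photo of a {}.", "A photo of the large {}."]   # illustrative subset
class_names = ["dog", "cat", "airplane"]

classifier = []
for name in class_names:
    prompts = [t.format(name) for t in templates]
    emb = text_encoder(prompts).mean(axis=0)   # average over prompt templates
    classifier.append(l2_normalize(emb))       # re-normalize the ensembled embedding
classifier = np.stack(classifier)              # [num_classes, d_e]
```

The ensembled `classifier` matrix is then used exactly like the per-class text embeddings in the zero-shot prediction step above.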
