A. Radford et al., "Learning Transferable Visual Models From Natural Language Supervision," arXiv preprint arXiv:2103.00020, 2021. [Online]. Available: <https://github.com/OpenAI/CLIP>

Problem Statement

This work addresses a core limitation of computer vision: state-of-the-art models are typically trained to predict a fixed set of predetermined object categories. This restricted form of supervision requires large, specialized, crowd-labeled datasets such as ImageNet, and it limits the generality and usability of the resulting models, because specifying any new visual concept requires gathering additional labeled data.

In contrast, breakthroughs in NLP (such as GPT-3) leverage scalable pre-training on raw text, enabling models to transfer zero-shot to a wide variety of downstream tasks. The paper asks whether similarly scalable pre-training that learns directly from web text could produce a comparable breakthrough in computer vision, noting that prior attempts at natural-language supervision performed far below the state of the art on common benchmarks.

Core Methodology

The model, CLIP (Contrastive Language-Image Pre-training), is trained on a new dataset of 400 million (image, text) pairs collected from the internet, called WebImageText (WIT), using a contrastive learning objective.
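
Each training batch pairs N images with their N captions; the correct pairings are the positives and every other combination is a negative. The paper presents numpy-like pseudocode for this objective; below is a minimal PyTorch sketch of the same symmetric cross-entropy loss, with the caveat that CLIP learns the temperature as a parameter rather than fixing it as done here, and the random inputs only stand in for encoder outputs.

```python
import torch
import torch.nn.functional as F


def clip_contrastive_loss(image_features, text_features, temperature=0.07):
    """Symmetric cross-entropy over cosine similarities for a batch of N pairs."""
    # L2-normalize so dot products are cosine similarities.
    image_features = F.normalize(image_features, dim=-1)
    text_features = F.normalize(text_features, dim=-1)

    # [N, N] logits; entry (i, j) compares image i with text j.
    logits = image_features @ text_features.t() / temperature

    # The matching text for each image (and vice versa) sits on the diagonal.
    labels = torch.arange(logits.size(0), device=logits.device)

    # Cross-entropy in both directions (image -> text and text -> image), averaged.
    loss_images = F.cross_entropy(logits, labels)
    loss_texts = F.cross_entropy(logits.t(), labels)
    return (loss_images + loss_texts) / 2


# Toy usage: random vectors stand in for the image- and text-encoder outputs.
image_emb = torch.randn(8, 512)
text_emb = torch.randn(8, 512)
print(clip_contrastive_loss(image_emb, text_emb).item())
```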

Step-by-Step Approach:

1. Jointly train an image encoder and a text encoder on batches of (image, text) pairs.
2. For a batch of N pairs, compute the cosine similarity between every image embedding and every text embedding (an N x N matrix), then maximize the similarity of the N correct pairings and minimize the similarity of the N^2 - N incorrect pairings with a symmetric cross-entropy loss (see the sketch above).
3. At test time, turn the target dataset's class names into text prompts (e.g., "a photo of a {label}"), embed them with the text encoder, and predict the class whose text embedding is most similar to the image embedding, yielding a zero-shot classifier (see the sketch below).
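
As a concrete illustration of step 3, the sketch below uses the open-source `clip` package from the repository linked above. The model variant, class names, prompt template, and image path are illustrative choices, not the paper's benchmark setup.

```python
import torch
import clip
from PIL import Image

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)  # one of the released variants

# Illustrative class names turned into natural-language prompts.
class_names = ["dog", "cat", "airplane"]
text_tokens = clip.tokenize([f"a photo of a {c}" for c in class_names]).to(device)

# "example.jpg" is a placeholder path for an image to classify.
image = preprocess(Image.open("example.jpg")).unsqueeze(0).to(device)

with torch.no_grad():
    image_features = model.encode_image(image)
    text_features = model.encode_text(text_tokens)

# Cosine similarity followed by a softmax gives zero-shot class probabilities.
image_features = image_features / image_features.norm(dim=-1, keepdim=True)
text_features = text_features / text_features.norm(dim=-1, keepdim=True)
probs = (100.0 * image_features @ text_features.T).softmax(dim=-1)

print({name: round(float(p), 3) for name, p in zip(class_names, probs[0])})
```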

Model Architecture & Scaling

CLIP pairs a Transformer text encoder with one of two image-encoder families: a series of modified ResNets (RN50, RN101, and the EfficientNet-style scale-ups RN50x4, RN50x16, RN50x64) and a series of Vision Transformers (ViT-B/32, ViT-B/16, ViT-L/14). Training this span of model sizes lets the authors study how performance scales with compute; the best results come from ViT-L/14 fine-tuned at 336-pixel resolution.
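
The released checkpoints can be listed and roughly compared through the same open-source package; the snippet below only inspects parameter counts and assumes the package's `available_models` and `load` helpers, which download weights on first use.

```python
import clip

# Names of the released ResNet- and ViT-based checkpoints (the exact list
# depends on the installed version of the package).
print(clip.available_models())

# Rough scale comparison between two variants via parameter counts.
for name in ["RN50", "ViT-B/32"]:
    model, _ = clip.load(name, device="cpu")
    n_params = sum(p.numel() for p in model.parameters())
    print(f"{name}: {n_params / 1e6:.1f}M parameters")
```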

Experiments

CLIP was evaluated extensively on zero-shot transfer (benchmarked on over 30 existing computer vision datasets), representation learning via linear probing of frozen features, and robustness to natural distribution shift.
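
For the linear-probe protocol, the paper fits a logistic regression classifier on frozen CLIP features. The sketch below follows that recipe on CIFAR-100 via torchvision; the model variant, batch size, and regularization strength C are illustrative and would normally be tuned on a validation split.

```python
import numpy as np
import torch
import clip
from sklearn.linear_model import LogisticRegression
from torch.utils.data import DataLoader
from torchvision.datasets import CIFAR100

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)

# CIFAR-100 is one of the downstream datasets evaluated in the paper.
train_set = CIFAR100(root="./data", train=True, download=True, transform=preprocess)
test_set = CIFAR100(root="./data", train=False, download=True, transform=preprocess)


def extract_features(dataset):
    """Encode a dataset with the frozen CLIP image encoder."""
    features, labels = [], []
    with torch.no_grad():
        for images, targets in DataLoader(dataset, batch_size=256):
            features.append(model.encode_image(images.to(device)).cpu().numpy())
            labels.append(targets.numpy())
    return np.concatenate(features), np.concatenate(labels)


train_x, train_y = extract_features(train_set)
test_x, test_y = extract_features(test_set)

# Logistic regression (L-BFGS solver) on frozen features, as in the linear-probe evaluations.
classifier = LogisticRegression(C=0.316, max_iter=1000)
classifier.fit(train_x, train_y)
print("linear-probe accuracy:", classifier.score(test_x, test_y))
```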