For years, the AI community has chased a moonshot: creating open-source models that rival the reasoning power of giants like OpenAI. Today, that moonshot just landed. DeepSeek-R1, a new open-source language model released under the MIT license, not only matches OpenAI’s cutting-edge “o1” models in reasoning benchmarks — it does so at a fraction of the cost. Let’s unpack why this matters and how DeepSeek pulled it off.

The DeepSeek Breakthrough: AI That Thinks Step-by-Step

DeepSeek-R1 is part of a new class of “thinking models” that mimic human-like reasoning. Unlike traditional language models that generate answers in a single pass, DeepSeek-R1 breaks problems down, debates alternatives, and self-corrects — all visible in its “Chain of Thought” outputs. For example, when asked “How many Rs are in ‘strawberry’?”, the model writes:

“First, I’ll spell it out: S-T-R-A-W-B-E-R-R-Y. Now I’ll count: positions 3 (R), 8 (R), and 9 (R). Wait, is that right? Let me check again… Yes, three R’s.”
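With the open-weights R1 models, this reasoning arrives as ordinary text in the response, wrapped in <think> tags by the released chat template, so it can be separated from the final answer with a few lines of string handling. The helper below is a minimal illustrative sketch assuming that tag convention:

```python
import re

def split_reasoning(raw_output: str) -> tuple[str, str]:
    """Split a model response into (chain_of_thought, final_answer),
    assuming the reasoning is wrapped in <think>...</think> tags."""
    match = re.search(r"<think>(.*?)</think>", raw_output, re.DOTALL)
    reasoning = match.group(1).strip() if match else ""
    answer = re.sub(r"<think>.*?</think>", "", raw_output, flags=re.DOTALL).strip()
    return reasoning, answer

raw = "<think>S-T-R-A-W-B-E-R-R-Y... positions 3, 8, 9. Three R's.</think>There are 3 Rs in 'strawberry'."
thoughts, answer = split_reasoning(raw)
print(answer)  # There are 3 Rs in 'strawberry'.
```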

This isn’t just a parlor trick. On benchmarks like AIME 2024 (a math competition), DeepSeek-R1 edges out OpenAI o1, and it’s neck-and-neck on coding tasks (Codeforces) and real-world problem-solving (SWE-bench). Even more impressive? It does this while costing a small fraction of OpenAI’s API prices (roughly $0.14 vs. $15 per million input tokens).

How They Built a “Thinking Machine”

The team tackled a critical problem: How do you teach an AI to reason without massive human feedback? Traditional methods rely on supervised fine-tuning (SFT), where humans manually craft examples. DeepSeek’s answer? Reinforcement Learning (RL) on steroids.

DeepSeek-R1-Zero: The AlphaGo of Language Models

The first model, R1-Zero, learned purely through trial and error using a technique called Group Relative Policy Optimization (GRPO). Here’s the twist: there was no supervised fine-tuning at all. The only reward signals were simple rule-based checks (did the model reach the correct answer, and did it wrap its reasoning in the expected format?). From that signal alone, behaviors like self-verification, reflection, and progressively longer chains of thought emerged on their own.
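To make that concrete, here is a minimal sketch of what such rule-based rewards can look like. The <think> tag convention matches the released models, but the specific checks and the equal weighting are illustrative assumptions, not DeepSeek’s exact reward code:

```python
import re

def format_reward(output: str) -> float:
    """1.0 if the model wrapped its reasoning in <think>...</think> and then
    produced a final answer, else 0.0 (tag convention assumed for illustration)."""
    return 1.0 if re.search(r"<think>.+?</think>\s*\S", output, re.DOTALL) else 0.0

def accuracy_reward(output: str, ground_truth: str) -> float:
    """1.0 if the text left after removing the reasoning block matches the known answer."""
    answer = re.sub(r"<think>.*?</think>", "", output, flags=re.DOTALL).strip()
    return 1.0 if answer == ground_truth.strip() else 0.0

def total_reward(output: str, ground_truth: str) -> float:
    # Equal weighting is an assumption; the point is that no human grader
    # or learned reward model is needed for verifiable problems.
    return accuracy_reward(output, ground_truth) + format_reward(output)
```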

DeepSeek-R1: Fixing the Quirks

R1-Zero had flaws: its outputs were messy (mixing languages like English and Chinese) and hard to read. The team fixed this with a “cold start” phase: before running RL, they fine-tuned the base model on thousands of carefully curated, human-readable long chain-of-thought examples, then added a language-consistency reward during RL so the model stops drifting between languages mid-thought. Further rounds of supervised fine-tuning and RL polished it for general-purpose use.
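For instance, the language-mixing fix can be backed by a reward that simply measures how much of the chain of thought stays in the target language. The ASCII-letter heuristic below is a rough stand-in for a real language identifier and is purely illustrative:

```python
def language_consistency_reward(chain_of_thought: str) -> float:
    """Fraction of whitespace-separated tokens that look like English words.
    'Looks English' is approximated here by 'all characters are ASCII', a crude
    illustrative proxy for a proper language-identification model."""
    tokens = chain_of_thought.split()
    if not tokens:
        return 0.0
    english_like = sum(1 for tok in tokens if tok.isascii())
    return english_like / len(tokens)
```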

The result? A model that thinks clearly, stays on-task, and even outperforms GPT-4o on coding benchmarks like LiveCodeBench.

The Secret Sauce: Technical Innovations

Group Relative Policy Optimization (GRPO)

Instead of relying on a separate “critic” model the way PPO (the RLHF algorithm developed at OpenAI) does, GRPO samples a group of responses for each prompt and scores each one relative to the rest of the group. Analogy: imagine a class of students solving the same math problem. The teacher grades each student against the class average rather than an absolute rubric. This pushes the model to keep producing answers that beat its own typical attempt.
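A minimal sketch of that group-relative scoring step is below; the four example rewards are made up, and only the normalize-against-the-group idea reflects GRPO itself:

```python
import statistics

def group_relative_advantages(rewards: list[float]) -> list[float]:
    """Normalize each response's reward against its group's mean and standard
    deviation, so the advantage says how much better (or worse) a response did
    than its peers sampled for the same prompt."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards) or 1.0  # guard against identical rewards
    return [(r - mean) / std for r in rewards]

# Example: rule-based rewards for four sampled answers to one prompt.
rewards = [1.0, 0.0, 1.0, 0.0]             # correct, wrong, correct, wrong
print(group_relative_advantages(rewards))  # [1.0, -1.0, 1.0, -1.0]
```

The policy update then reinforces the responses that scored above their group’s average, which is what removes the need for a separate critic network.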

Reasoning-Oriented Rewards