Paper: https://arxiv.org/pdf/2506.10910
Topic: Post-Training for Reasoning
1. What’s Cool?
The dominant paradigm for training reasoning models today is SFT followed by RL: first teach reasoning by imitating traces from humans or stronger models, then refine with RL.
HOWEVER, Magistral Medium breaks with this paradigm.
- No SFT on reasoning traces.
- Reasoning emerges purely from RL with verifiable rewards.
- Rewards cover (1) correctness, (2) formatting, (3) generation length (test-time compute), and (4) language consistency.
Key insights:
- Reasoning can be discovered, not imitated.
- RL on text-only reasoning data improves multimodal performance.
- Language consistency is explicitly optimized for UX.
- Avoiding third-party traces preserves data sovereignty and independence.
2. Setup Overview
Three components: (1) RL — GRPO + reward design, (2) Data Filtering, (3) Optional SFT — used for Magistral Small
2.1 Reinforcement Learning (RL)
Algorithm (GRPO modifications)
- Remove KL divergence → cheaper, more exploration
- Normalize loss & advantages → stable training
- Relax trust regions → larger policy updates
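The three modifications above can be sketched in a few lines. This is a minimal illustration, not the paper's implementation: the clip values `eps_low`/`eps_high` and the normalization details are assumptions.

```python
import numpy as np

def grpo_advantages(rewards: np.ndarray) -> np.ndarray:
    """Group-relative advantages: each group's rewards are normalized
    by the group's own mean and std (no learned value function)."""
    return (rewards - rewards.mean()) / (rewards.std() + 1e-8)

def grpo_loss(logp_new, logp_old, advantages, eps_low=0.2, eps_high=0.3):
    """Clipped policy-gradient loss, illustrating the modifications:
    - no KL penalty term (removed -> cheaper, more exploration)
    - asymmetric clipping with eps_high > eps_low (relaxed trust region
      allows larger upward policy updates)
    - loss averaged over all tokens for stable scaling
    Clip values are illustrative assumptions, not the paper's settings."""
    ratio = np.exp(logp_new - logp_old)
    unclipped = ratio * advantages
    clipped = np.clip(ratio, 1 - eps_low, 1 + eps_high) * advantages
    return -np.minimum(unclipped, clipped).mean()
```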
Reward Design
Focuses on verifiable correctness, factors that scale reasoning, and language.
- Formatting Reward
    - Requires:
        - reasoning wrapped in <think> tags
        - final answer in \boxed{}
        - markdown and code blocks when relevant
    - Reward: +0.1 (iff all satisfied)
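A formatting check along these lines could look as follows. This is a sketch under assumptions: the exact checks (and the markdown/code-block condition, which is task-dependent and omitted here) may differ from the paper's.

```python
def formatting_reward(response: str) -> float:
    """+0.1 iff all formatting constraints hold (illustrative sketch)."""
    # exactly one <think>...</think> block wrapping the reasoning
    has_think = (response.count("<think>") == 1
                 and response.count("</think>") == 1)
    # the final answer must appear in \boxed{} *after* the reasoning
    answer_part = response.split("</think>")[-1] if has_think else ""
    has_boxed = r"\boxed" in answer_part
    return 0.1 if (has_think and has_boxed) else 0.0
```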
- Correctness Reward
    - Verifiable: the final answer is checked against ground truth (e.g., math answer matching, or executing tests for code)
- Length Reward
- Encourages sufficient internal reasoning
- Acts as proxy for allocating test-time compute
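One way such a length shaping could work is a soft penalty that only kicks in near the context limit, leaving long reasoning unpenalized but discouraging truncation. The schedule below and the `l_max`/`l_cache` values are assumptions for illustration, not the paper's exact formula.

```python
def length_reward(n_tokens: int, l_max: int = 32000, l_cache: int = 8000) -> float:
    """Soft length penalty (illustrative sketch; parameters are assumptions).
    - no penalty while generation stays below l_max - l_cache
    - a linear penalty ramps from 0 to -0.1 as length approaches l_max,
      nudging the model to finish before hitting the hard cutoff
    """
    soft_limit = l_max - l_cache
    if n_tokens <= soft_limit:
        return 0.0
    over = min(n_tokens - soft_limit, l_cache)
    return -0.1 * over / l_cache
```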