Paper: https://arxiv.org/pdf/2506.10910
Topic: Post-Training for Reasoning
1. What’s Cool?
The dominant paradigm for training reasoning models today is SFT followed by RL: first teach reasoning by imitating traces from humans or stronger models, then refine with RL.
HOWEVER, Magistral Medium breaks with this paradigm.
- No SFT on reasoning traces.
- Reasoning emerges purely from RL with verifiable rewards.
- Rewards cover (1) correctness, (2) formatting, (3) generation length (test-time compute), and (4) language consistency.
Key insights:
- Reasoning can be discovered, not imitated.
- RL on text-only reasoning data improves multimodal performance.
- Language consistency is explicitly optimized for UX.
- Avoiding third-party traces preserves data sovereignty and independence.
2. Setup Overview
Three components: (1) RL — GRPO + reward design, (2) Data Filtering, (3) Optional SFT — used for Magistral Small
2.1 Reinforcement Learning (RL)
Algorithm (GRPO modifications)
- Remove KL divergence → cheaper, more exploration
- Normalize loss & advantages → stable training
- Relax trust regions → larger policy updates
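The three modifications above can be sketched in a few lines. This is a minimal illustration, not the paper's implementation: the clip values `eps_low`/`eps_high` and the normalization details are assumptions.

```python
import numpy as np

def grpo_advantages(rewards: np.ndarray) -> np.ndarray:
    """Group-relative advantages: each group's rewards are normalized
    by the group's own mean and std (no learned value function)."""
    return (rewards - rewards.mean()) / (rewards.std() + 1e-8)

def grpo_loss(logp_new, logp_old, advantages, eps_low=0.2, eps_high=0.3):
    """Clipped policy-gradient loss, illustrating the modifications:
    - no KL penalty term (removed -> cheaper, more exploration)
    - asymmetric clipping with eps_high > eps_low (relaxed trust region
      allows larger upward policy updates)
    - loss averaged over all tokens for stable scaling
    Clip values are illustrative assumptions, not the paper's settings."""
    ratio = np.exp(logp_new - logp_old)
    unclipped = ratio * advantages
    clipped = np.clip(ratio, 1 - eps_low, 1 + eps_high) * advantages
    return -np.minimum(unclipped, clipped).mean()
```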
Reward Design
Focuses on verifiable correctness, factors that scale reasoning, and language.
- Formatting Reward
    - Requires:
        - reasoning wrapped in <think> tags
        - final answer in \boxed{}
        - markdown and code blocks when relevant
    - Reward: +0.1 (iff all satisfied)
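A formatting check along these lines could look as follows. This is a sketch under assumptions: the exact checks (and the markdown/code-block condition, which is task-dependent and omitted here) may differ from the paper's.

```python
def formatting_reward(response: str) -> float:
    """+0.1 iff all formatting constraints hold (illustrative sketch)."""
    # exactly one <think>...</think> block wrapping the reasoning
    has_think = (response.count("<think>") == 1
                 and response.count("</think>") == 1)
    # the final answer must appear in \boxed{} *after* the reasoning
    answer_part = response.split("</think>")[-1] if has_think else ""
    has_boxed = r"\boxed" in answer_part
    return 0.1 if (has_think and has_boxed) else 0.0
```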
- Correctness Reward
    - Verifiable: the final answer is checked against ground truth (e.g., math answer matching, or executing tests for code)
- Length Reward
- Encourages sufficient internal reasoning
- Acts as proxy for allocating test-time compute
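One way such a length shaping could work is a soft penalty that only kicks in near the context limit, leaving long reasoning unpenalized but discouraging truncation. The schedule below and the `l_max`/`l_cache` values are assumptions for illustration, not the paper's exact formula.

```python
def length_reward(n_tokens: int, l_max: int = 32000, l_cache: int = 8000) -> float:
    """Soft length penalty (illustrative sketch; parameters are assumptions).
    - no penalty while generation stays below l_max - l_cache
    - a linear penalty ramps from 0 to -0.1 as length approaches l_max,
      nudging the model to finish before hitting the hard cutoff
    """
    soft_limit = l_max - l_cache
    if n_tokens <= soft_limit:
        return 0.0
    over = min(n_tokens - soft_limit, l_cache)
    return -0.1 * over / l_cache
```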