Paper: https://arxiv.org/pdf/2506.10910

Topic: Post-Training for Reasoning


1. What’s Cool?

The dominant paradigm for training reasoning models today is SFT → RL, where the SFT stage teaches reasoning by imitating traces from humans or stronger models.

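However, Magistral Medium breaks this pattern: it is trained with RL alone on top of Mistral Medium 3, without SFT on reasoning traces distilled from other models.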

Key insights:


2. Setup Overview

Three components: (1) RL (GRPO + reward design), (2) Data Filtering, (3) Optional SFT (used for Magistral Small)

2.1 Reinforcement Learning (RL)

Algorithm (GRPO modifications)

Reward Design

The reward focuses on verifiable correctness, factors that scale reasoning, and language consistency.

  1. Formatting Reward
  2. Correctness Reward
  3. Length Reward
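
A minimal sketch of how the three reward terms listed above could be combined into a single scalar, to make the structure concrete. This is not the paper's implementation. The function names, the `<think>...</think>` format check, the exact-match correctness check, and every constant (0.1, 0.9, `max_len`, `cache`) are illustrative assumptions for these notes.

```python
def formatting_ok(response: str) -> bool:
    """Assumed format rule: exactly one <think>...</think> block at the start,
    followed by a non-empty final answer."""
    text = response.strip()
    if not text.startswith("<think>"):
        return False
    if text.count("<think>") != 1 or text.count("</think>") != 1:
        return False
    answer = text.split("</think>", 1)[1]
    return bool(answer.strip())


def length_penalty(n_tokens: int, max_len: int, cache: int,
                   max_penalty: float = 0.1) -> float:
    """Assumed soft length penalty: zero up to (max_len - cache), then a
    linear ramp down to -max_penalty at max_len, and flat beyond that."""
    if n_tokens <= max_len - cache:
        return 0.0
    if n_tokens >= max_len:
        return -max_penalty
    return -max_penalty * (n_tokens - (max_len - cache)) / cache


def total_reward(response: str, reference: str, n_tokens: int,
                 max_len: int = 1000, cache: int = 200) -> float:
    """Combine the three terms. Formatting acts as a gate: a badly
    formatted response scores 0 and is not graded further."""
    if not formatting_ok(response):
        return 0.0
    reward = 0.1  # assumed small bonus for passing the format check
    answer = response.strip().split("</think>", 1)[1].strip()
    if answer == reference.strip():  # stand-in for a real verifier
        reward += 0.9  # assumed correctness bonus
    return reward + length_penalty(n_tokens, max_len, cache)
```

For example, `total_reward("<think>2+2</think>4", "4", n_tokens=10)` gives the full score, while the same answer without the thinking block scores 0 because it fails the format gate. In practice the exact-match check would be replaced by a proper verifier (a math answer checker or unit tests for code).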