The Pretraining


Introduction

Kimi K2 by Moonshot AI is a new state-of-the-art model that likely represents a new wave of frontier models. What sets it apart is not only its performance and benchmark results, but also its architecture, especially the Muon optimizer and its more stable variant, the MuonClip optimizer. Throughout the training process, Kimi K2 applies various strategies that enhance both computational efficiency and performance, from data synthesis pipelines to preconditioners. The model is built around agentic capabilities and is specifically trained on corresponding data. With a context window extending up to 128,000 tokens, a leading 53.7% score on LiveCodeBench v6 (Pass@1), and strong support for long-context tool use, Kimi K2 pushes the Pareto frontier of large language models.
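Since MuonClip is central to K2's training stability, here is a minimal sketch of the core idea: Muon orthogonalizes each weight matrix's momentum with a Newton-Schulz iteration, and MuonClip adds a QK-clip step that rescales query/key projection weights whenever attention logits grow too large. The coefficients, scaling factor, threshold, and function names below are illustrative assumptions drawn from open-source Muon descriptions, not Moonshot's exact implementation.

```python
import torch

def newton_schulz_orthogonalize(M: torch.Tensor, steps: int = 5) -> torch.Tensor:
    """Approximately map M to the nearest semi-orthogonal matrix.

    Uses the quintic Newton-Schulz iteration from the open-source Muon work;
    the (a, b, c) coefficients are the commonly published ones.
    """
    a, b, c = 3.4445, -4.7750, 2.0315
    X = M / (M.norm() + 1e-7)              # normalize so the iteration converges
    transposed = X.shape[0] > X.shape[1]
    if transposed:                          # iterate on the "wide" orientation
        X = X.T
    for _ in range(steps):
        A = X @ X.T
        X = a * X + (b * A + c * A @ A) @ X
    return X.T if transposed else X

def muon_step(weight, momentum, grad, lr=0.02, beta=0.95):
    """One Muon-style update for a 2-D weight matrix (sketch, not the real optimizer)."""
    momentum.mul_(beta).add_(grad)          # heavy-ball momentum
    update = newton_schulz_orthogonalize(momentum)
    # Scale so the update RMS is roughly shape-independent (assumed 0.2 * sqrt(max dim)).
    scale = 0.2 * max(weight.shape) ** 0.5
    weight.add_(update, alpha=-lr * scale)

def qk_clip(w_q, w_k, max_logit, tau=100.0):
    """QK-clip sketch: if the largest observed attention logit exceeds tau,
    shrink W_q and W_k by sqrt(tau / max_logit) so logits are capped while
    the attention pattern's direction is preserved."""
    if max_logit > tau:
        gamma = tau / max_logit
        w_q.mul_(gamma ** 0.5)
        w_k.mul_(gamma ** 0.5)
```

The key design point is that the clip acts on the projection weights themselves rather than on the logits at runtime, so the fix persists in the checkpoint instead of being reapplied every forward pass.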

Following pretraining, Kimi K2 undergoes a long post-training process to make it more interactive and proficient. This involves fine-tuning the model on diverse instructions, teaching it user preferences, and applying reinforcement learning not only from human feedback but also from verifiable rewards (RLVR) and from the model critiquing its own outputs. A large emphasis is placed on training the model to use tools efficiently in multi-step tasks, particularly those demanding reasoning across longer interactions. Kimi K2 is trained on simulated tool-use scenarios and is rewarded for generating high-quality output, using scores provided by its own internal judging systems. All of this helps the model transcend merely answering questions: it becomes a more dynamic, agent-like assistant that can solve problems, use tools, and learn within complicated environments.
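As a rough illustration of how verifiable rewards and self-critique can be blended in such an RL loop, the hypothetical sketch below scores a rollout with a programmatic checker (unit tests, answer matching) where one exists and falls back to a rubric-based critic score otherwise. The names, weights, and placeholder critic are assumptions for illustration, not Kimi K2's actual reward code.

```python
from dataclasses import dataclass
from typing import Callable, Optional

@dataclass
class Rollout:
    prompt: str
    response: str
    # Programmatic checker (unit tests, exact-match answers); None if unavailable.
    verifier: Optional[Callable[[str], bool]] = None

def critic_score(prompt: str, response: str) -> float:
    """Stand-in for a rubric-based self-critic model returning a score in [0, 1]."""
    return 0.5  # replace with a call to the critic / judge model

def reward(rollout: Rollout, critic_weight: float = 0.3) -> float:
    """Blend verifiable and critic-based rewards for one rollout (illustrative weights)."""
    if rollout.verifier is not None:
        # Verifiable tasks (code, math): the programmatic check dominates the signal.
        verified = 1.0 if rollout.verifier(rollout.response) else 0.0
        return (1 - critic_weight) * verified + critic_weight * critic_score(
            rollout.prompt, rollout.response)
    # Open-ended tasks: fall back entirely to the self-critic rubric.
    return critic_score(rollout.prompt, rollout.response)

# Example: a math rollout checked by exact answer matching.
r = Rollout(prompt="2 + 2 = ?", response="4", verifier=lambda ans: ans.strip() == "4")
print(reward(r))  # 0.85 with the placeholder critic score
```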

With 1 trillion total parameters and only 32B active parameters per forward pass, running and hosting this model requires approximately 10–12× H100 or 16× A100 GPUs. For 4-bit quantized variants, you would need roughly 4× H100, 5× A100, or 10+ RTX 4090 GPUs.
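To see where the multi-GPU requirement comes from, the back-of-the-envelope sketch below estimates just the raw weight footprint at a few precisions. It deliberately ignores KV cache, activations, and parallelism overhead, which is why real deployments need headroom beyond the weight bytes alone.

```python
def weight_footprint_gb(total_params: float, bits_per_param: int) -> float:
    """Raw memory needed to hold the weights alone, in gigabytes."""
    return total_params * bits_per_param / 8 / 1e9

TOTAL_PARAMS = 1.0e12  # Kimi K2: ~1T total parameters (32B active per token)

for label, bits in [("bf16", 16), ("int8", 8), ("int4", 4)]:
    gb = weight_footprint_gb(TOTAL_PARAMS, bits)
    print(f"{label:>5}: ~{gb:,.0f} GB of weights")
    # bf16 ≈ 2,000 GB, int8 ≈ 1,000 GB, int4 ≈ 500 GB -- serving also needs
    # room for KV cache and activations, so multi-GPU sharding is unavoidable.
```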


| Benchmark | Area | Kimi K2 Score | Beaten Model | Their Score |
|---|---|---|---|---|
| SWE-bench (Verified) | Software Engineering | 65.8% | GPT-4.1 | 54.6% |
| | | | Claude Sonnet | ~58–60% |
| LiveCodeBench v6 | Code Generation | 53.7% | GPT-4.1 | 44.7% |
| | | | Claude Sonnet | 42.9% |
| OJBench | Competitive Coding | 27.1% | GPT-4.1 | 19.5% |
| | | | Claude Opus | 19.6% |
| MATH (500 questions) | Advanced Math | 97.4% | GPT-4.1 | 92.4% |
| | | | Claude Sonnet | ~90% |
| AIME 2025 | Olympiad Math | 49.5% | GPT-4.1 | 37.0% |
| | | | Claude Opus | ~33–34% |
| GPQA (Diamond) | Physics/QA Reasoning | 75.1% | GPT-4.1 | 66.3% |
| | | | Claude Sonnet | ~65% |
| Tau-2 | Tool Use + Planning | ~70.6% | GPT-4.1 | ~54.3% |
| | | | Claude Sonnet | ~65% |
| MMLU | General Knowledge | 89.5% | Claude Sonnet | ~85% |
| | | | GPT-4.1 | ~86–87% |

Data

The Kimi K2 pre-training corpus comprises 15.5 trillion tokens of curated, high-quality data spanning four main domains: Web Text, Code, Mathematics, and Knowledge. The goal of pre-training was to build a strong prior that generalizes despite the limited supply of high-quality data, elevating token efficiency (the learning signal extracted per token) to a critical scaling coefficient.