my notes (polished with claude)


Table of Contents


Part 1: Inference Basics

1.1 Prefill vs Decode

When you send a message to an LLM - system prompt + user message - two phases happen: prefill and decode.

Prefill

The model processes all input tokens at once, with causal masking (each token attends to everything before it, nothing after). Two outputs come out:

  1. The KV cache - computed Key and Value vectors for every input token, at every layer. This is the important one from a serving perspective - it’s what makes decode fast.
  2. The first output token - sampled from the probability distribution at the final token position.
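A toy single-layer, single-head sketch of those two outputs, with random stand-in weights instead of a trained model (all sizes here - `d`, `vocab`, the 5-token prompt - are made up for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
d, vocab, T = 8, 50, 5  # toy head dim, vocab size, prompt length

# stand-ins for trained projection weights
Wq, Wk, Wv = (rng.standard_normal((d, d)) * 0.1 for _ in range(3))
W_out = rng.standard_normal((d, vocab)) * 0.1

x = rng.standard_normal((T, d))  # prompt token embeddings

# prefill: all T tokens go through each projection in one matmul
Q, K, V = x @ Wq, x @ Wk, x @ Wv

# causal mask: position i may attend only to positions <= i
scores = Q @ K.T / np.sqrt(d)
scores[np.triu(np.ones((T, T), dtype=bool), k=1)] = -np.inf
attn = np.exp(scores - scores.max(axis=-1, keepdims=True))
attn /= attn.sum(axis=-1, keepdims=True)
out = attn @ V

# output 1: the KV cache - K and V for every prompt token, kept for decode
kv_cache = (K, V)

# output 2: the first generated token, sampled (greedily here)
# from the distribution at the final position
first_token = int(np.argmax(out[-1] @ W_out))
```

In a real model this happens per layer and per head, and the cache is what gets handed off to the decode phase.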

Prefill is compute-bound. All tokens processed at once = matrix-matrix multiply = high arithmetic intensity. GPU compute cores are the bottleneck, not memory bandwidth.

Decode

Prefill outputs the first token. Then decode begins - the model takes that token, generates the next one, then the next, one token at a time, until a stop condition is hit (an end-of-sequence token or a max-length limit).

At each decode step, attention needs K and V vectors for all previous tokens - but it loads them from the KV cache instead of recomputing them. Without the cache, every step would have to reprocess the entire sequence from scratch; with it, each step only does the work for one new token.