my notes (polished with claude)
When you send a message to an LLM - system prompt + user message - two stages run: prefill and decode.
The model processes all input tokens at once, in a causally-masked way (each token attends to everything before it, nothing after). Two outputs come out: a KV cache entry (the K and V vectors) for every input token, and the logits that determine the first generated token.
Prefill is compute-bound. All tokens processed at once = matrix-matrix multiply = high arithmetic intensity. GPU compute cores are the bottleneck, not memory bandwidth.
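A back-of-the-envelope sketch of why: arithmetic intensity is FLOPs performed per byte moved from memory. The numbers below are for a single hypothetical `d x d` weight matrix (sizes are illustrative, not from any specific model) - with many tokens the weight read is amortized, with one token it isn't.

```python
def matmul_arith_intensity(n_tokens, d_model, bytes_per_el=2):
    """FLOPs per byte moved for X (n_tokens x d) @ W (d x d), fp16 weights."""
    flops = 2 * n_tokens * d_model * d_model  # one multiply-add per (token, in, out) triple
    bytes_moved = (n_tokens * d_model         # read X
                   + d_model * d_model        # read W
                   + n_tokens * d_model       # write output
                   ) * bytes_per_el
    return flops / bytes_moved

# prefill: a whole prompt at once -> hundreds of FLOPs per byte (compute-bound)
print(matmul_arith_intensity(2048, 4096))
# decode: one token -> about 1 FLOP per byte (memory-bandwidth-bound)
print(matmul_arith_intensity(1, 4096))
```

The single-token case moves the entire weight matrix to do a vector-matrix multiply, which is why decode throughput tracks memory bandwidth rather than peak FLOPs.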
Prefill yields the first token. Then decode begins - the model feeds that token back in, generates the next one, then the next, one token per step, until a stop condition is hit (an end-of-sequence token or a length limit).
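The decode loop itself is tiny. A minimal sketch with a stand-in `step_fn` in place of a real model forward pass (the function and its counting behavior are made up for illustration):

```python
def greedy_decode(step_fn, first_token, eos_id, max_new_tokens=32):
    """Generate one token at a time: feed the last token, append the
    next, stop on EOS or the length limit."""
    tokens = [first_token]
    for _ in range(max_new_tokens):
        nxt = step_fn(tokens[-1])  # stand-in for model forward + argmax
        tokens.append(nxt)
        if nxt == eos_id:          # stop condition: end-of-sequence token
            break
    return tokens

# toy step function: counts up, then emits EOS (id 99)
out = greedy_decode(lambda t: t + 1 if t < 4 else 99, first_token=0, eos_id=99)
print(out)  # [0, 1, 2, 3, 4, 99]
```

A real loop samples from the logits instead of counting, but the shape is the same: strictly sequential, each step depending on the last.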
At each decode step, attention needs K and V vectors for all previous tokens - but it loads them from the KV cache instead of recomputing. That’s what makes decode feasible.
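A minimal single-head sketch of that cache in NumPy (toy dimensions, no batching or multi-head - just the append-then-attend pattern):

```python
import numpy as np

def attend_with_cache(q, new_k, new_v, k_cache, v_cache):
    """One decode step: append this token's K/V to the cache, then
    attend the new query over every cached position."""
    k_cache.append(new_k)
    v_cache.append(new_v)
    K = np.stack(k_cache)              # (t, d) - past keys loaded, not recomputed
    V = np.stack(v_cache)              # (t, d)
    scores = K @ q / np.sqrt(len(q))   # (t,) scaled dot-product scores
    w = np.exp(scores - scores.max())  # numerically stable softmax
    w /= w.sum()
    return w @ V                       # weighted sum of cached values

d = 8
k_cache, v_cache = [], []
rng = np.random.default_rng(0)
for _ in range(5):                     # five decode steps
    q, k, v = rng.standard_normal((3, d))
    out = attend_with_cache(q, k, v, k_cache, v_cache)
print(len(k_cache))  # 5 - the cache grows by one K/V pair per step
```

Without the cache, each step would recompute K and V for the whole prefix, making step t cost O(t) extra forward work; with it, each step only computes K/V for the single new token.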