my notes, polished with claude
When latency becomes a problem in an LLM pipeline, the instinct is to reach for infrastructure: quantization, better GPUs, inference engines, batching. All valid, all take time and effort. I brought latency down by ~60% on a production pipeline without touching any of that - just by changing how I structured my prompts.
When an LLM processes your request, two things happen:
Prefill - the model processes your entire input (system prompt + user message) in parallel and builds a KV cache: Key and Value vectors for every input token, at every layer. This is expensive and scales with your input length. Longer prompt = more prefill cost.
Decode - the model generates output one token at a time. Each step is a separate operation. More output tokens = more steps = more time.
Fewer input tokens → cheaper prefill → faster time to first token. Fewer output tokens → fewer decode steps → lower total latency.
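A back-of-envelope model makes the two costs concrete. The per-token numbers here are made-up assumptions for illustration, not measurements from any particular engine - the point is the shape: prefill is cheap per token because it runs in parallel, decode is expensive because it's sequential.

```python
# Toy latency model: prefill = one parallel pass over the input,
# decode = one sequential step per output token.
# Both per-token costs are illustrative assumptions, not benchmarks.
PREFILL_MS_PER_TOKEN = 0.1   # input processed in parallel -> cheap per token
DECODE_MS_PER_TOKEN = 20.0   # each output token is a full forward pass

def estimated_latency_ms(input_tokens: int, output_tokens: int) -> float:
    prefill = input_tokens * PREFILL_MS_PER_TOKEN   # time to first token
    decode = output_tokens * DECODE_MS_PER_TOKEN    # sequential generation
    return prefill + decode

print(estimated_latency_ms(2000, 500))  # baseline: 200 + 10000 = 10200 ms
print(estimated_latency_ms(1000, 500))  # halve the input:  10100 ms
print(estimated_latency_ms(2000, 250))  # halve the output:  5200 ms
```

Under these assumed costs, trimming output tokens buys far more than trimming input - which is why both levers matter, but for different parts of the latency budget.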
Modern inference engines and APIs implement prefix caching: if the start of your prompt is identical to a previous request, the KV cache for that part is already computed and stored. The engine just reuses it - no prefill needed for those tokens.
Only a prefix gets cached, not some random middle chunk. Cache matching works from the start of the prompt. The moment your prompt diverges from a cached version, everything after that gets recomputed from scratch.
So prompt order matters a lot. Static content - rules, instructions, examples - needs to go first. Dynamic content - the actual input that changes per request - needs to go last. If your dynamic input is sitting at the top, your cache hit rate is basically zero no matter how much static content you have below it.
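To see why order matters, here's a toy simulation of prefix matching - a sketch of the idea, not any engine's actual implementation. The cache only saves work up to the first token where the new prompt diverges from a cached one:

```python
def cached_prefix_len(cached_tokens: list[str], new_tokens: list[str]) -> int:
    """Leading tokens reusable from the cache: matching stops at the
    first divergence; everything after it must be re-prefilled."""
    n = 0
    for a, b in zip(cached_tokens, new_tokens):
        if a != b:
            break
        n += 1
    return n

static = ["<rule>"] * 500          # stand-in for 500 static tokens of rules/examples
dyn_a, dyn_b = ["input-A"], ["input-B"]

# Dynamic content first: prompts diverge at token 1, zero cache reuse.
print(cached_prefix_len(dyn_a + static, dyn_b + static))   # 0

# Static content first: the entire 500-token prefix is reused.
print(cached_prefix_len(static + dyn_a, static + dyn_b))   # 500
```

Same 500 tokens of static content in both cases - the only difference is where the dynamic part sits.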
(If you want to go deep on how KV caching and prefill/decode actually work - I wrote a separate piece that covers all of it.)
My prompt looked something like this - dynamic input at the top, then a big block of rules and few-shot examples below it:
[dynamic input - changes every request]
[500 tokens of rules and examples]
Every request looked different from token 1. Cache hit rate: zero. The entire prompt was being prefilled every single time.
I flipped it. Moved all the static stuff into the system prompt. Kept only the dynamic input at the end of the user message:
System prompt: [500 tokens of rules and examples] ← cached after first request
User message: [dynamic input] ← only this part hits prefill
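In OpenAI-style chat terms, the restructured request looks roughly like this. The client shape and placeholder prompt text are illustrative, not my actual production prompt:

```python
# Hypothetical sketch of the restructured prompt, OpenAI-style chat format.
# RULES_AND_EXAMPLES stands in for the ~500 tokens of static content.
RULES_AND_EXAMPLES = """You are a classifier. Follow these rules:
1. ...
Few-shot examples:
...
"""

def build_messages(dynamic_input: str) -> list[dict]:
    return [
        # Static block first: byte-identical across requests, so the
        # engine can reuse its KV cache for this prefix.
        {"role": "system", "content": RULES_AND_EXAMPLES},
        # Only the part that changes per request goes last.
        {"role": "user", "content": dynamic_input},
    ]
```

The key property: `build_messages("doc A")` and `build_messages("doc B")` produce identical content up to the final user message, so every request after the first hits the cached prefix.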