This document focuses on reducing LLM context size in interactions that can grow arbitrarily long due to their open-ended nature, such as Claude, ChatGPT, or agentic flows.
If you have a fixed workflow, you can decompose it into a set of predetermined steps that avoids long-context issues altogether. Even so, some of the strategies described here apply to fixed scenarios as well.
Today's LLM APIs are stateless. This means that for each turn of the conversation, the entire conversation history, including the system prompt, output schema, tool definitions, and tool calls and results, must be posted back to the LLM API to generate the next message. As the conversation progresses, this history keeps growing, so each subsequent request is larger than the last: the context sent at turn N is the sum of all tokens from every previous turn.
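For illustration, here is a minimal sketch of such a loop using the OpenAI Python SDK (any stateless chat-completion API behaves the same way); the model name and system prompt are placeholders:

```python
# Minimal sketch of a stateless chat loop: every turn re-sends the full
# history (system prompt + all prior messages) to the API.
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment
messages = [{"role": "system", "content": "You are a helpful assistant."}]  # placeholder system prompt

def send_user_turn(user_text: str) -> str:
    messages.append({"role": "user", "content": user_text})
    # The ENTIRE conversation so far is posted on every call,
    # so the request payload grows with each turn.
    response = client.chat.completions.create(
        model="gpt-4o-mini",  # placeholder model name
        messages=messages,
    )
    reply = response.choices[0].message.content
    messages.append({"role": "assistant", "content": reply})
    return reply
```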
Large contexts create several challenges: each request costs more, latency increases because more tokens must be processed, and model quality can degrade as relevant information gets buried in a long history. Because of these issues, most production AI apps must manage their LLM context size carefully.
Several strategies exist, each targeting a different contributor to context growth.
Modern LLM APIs can cache shared context prefixes. This reduces latency because the model's internal state for the cached prefix is stored and reused, so it can skip re-processing the prefix and resume from that point forward.
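As a concrete illustration, Anthropic's Messages API exposes this explicitly via `cache_control` markers on stable content blocks such as a long system prompt, while some providers (e.g., OpenAI) apply prefix caching automatically for sufficiently long prompts. A minimal sketch, with the model name and prompt as placeholders:

```python
# Minimal sketch of explicit prefix caching with the Anthropic Messages API.
# The long, stable system prompt is marked cacheable so later calls that share
# this exact prefix reuse the provider-side cache instead of re-processing it.
import anthropic

LONG_SYSTEM_PROMPT = "..."  # placeholder: a large, stable system prompt

client = anthropic.Anthropic()  # assumes ANTHROPIC_API_KEY is set in the environment

def ask(messages: list[dict]) -> str:
    response = client.messages.create(
        model="claude-sonnet-4-20250514",  # placeholder model name
        max_tokens=1024,
        system=[
            {
                "type": "text",
                "text": LONG_SYSTEM_PROMPT,
                # Marks everything up to and including this block as a cacheable prefix.
                "cache_control": {"type": "ephemeral"},
            }
        ],
        messages=messages,
    )
    return response.content[0].text
```

Note that caching only helps when the prefix is byte-for-byte identical across calls, which is why stable content (system prompt, tool definitions) should come first in the request.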
Pros:
- Lower latency, since the cached prefix does not need to be re-processed.
- Lower cost, since providers typically bill cached input tokens at a discount.
Cons: