This document focuses on reducing LLM context size in interactions that can grow arbitrarily long because of their open-ended nature, for example chat assistants such as Claude or ChatGPT, or agentic flows.

If you have a fixed workflow, you can decompose it into a set of predetermined steps and avoid long-context issues altogether. Even so, some of the strategies described here also apply to fixed scenarios.

Background

Today's LLM APIs are stateless: for each turn of the conversation, the entire history, including the system prompt, output schema, tool definitions, and tool calls/results, must be posted back to the API to generate the next message. As the conversation progresses, this history grows, so each request is larger than the last. The request at turn N carries all tokens from the previous turns, which means the total number of tokens processed over a conversation grows roughly quadratically with its length.
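
To make the statelessness concrete, here is a minimal sketch of a chat loop using the OpenAI Python SDK (the model name is a placeholder). Note that the full history list is re-posted on every call; the API has no memory of earlier requests.

    from openai import OpenAI

    client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

    history = [{"role": "system", "content": "You are a helpful assistant."}]

    def send(user_message: str) -> str:
        # Each turn appends to the history and re-posts ALL of it.
        history.append({"role": "user", "content": user_message})
        response = client.chat.completions.create(
            model="gpt-4o",        # placeholder model name
            messages=history,      # the entire conversation so far
        )
        reply = response.choices[0].message.content
        history.append({"role": "assistant", "content": reply})
        return reply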

Large contexts create several challenges:

  1. Latency. The longer the input context, the longer LLMs take to generate an output.
  2. Cost. Processing tokens costs money, and repeatedly re-sending an ever-growing payload compounds token usage and therefore cost (see the short calculation after this list).
  3. Context size limits. LLMs have a maximum context size (~100–200K tokens, with Gemini notably supporting up to 1M). It's possible to hit these limits during a conversation.
  4. Accuracy. Longer contexts can degrade output quality; models are more likely to miss or confuse details buried in a long history.
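
A rough illustration of how the cost compounds (the per-turn token count below is a made-up assumption): if each turn adds about t tokens, the request at turn N carries about N*t input tokens, and the total input processed across N turns is about t*N*(N+1)/2.

    # Rough illustration: assume each turn adds ~500 tokens of new content.
    tokens_per_turn = 500

    total_processed = 0
    for turn in range(1, 21):                  # a 20-turn conversation
        request_size = turn * tokens_per_turn  # each request re-sends every prior turn
        total_processed += request_size

    print(request_size)     # 10_000 tokens in the final request
    print(total_processed)  # 105_000 input tokens processed overall (~quadratic growth)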

Because of these issues, most production AI apps must manage their LLM context size carefully.

Several strategies exist, each targeting different elements of context size growth.

Caching

Modern LLM APIs can cache shared context prefixes. This reduces latency, and typically cost as well since cached input tokens are billed at a discount, because the model's internal state for the prefix is reused and processing resumes from the first new token.
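
As one concrete example, Anthropic's Messages API exposes this through a cache_control marker on content blocks. The sketch below assumes the anthropic Python SDK; the model name and prompt contents are placeholders, and exact field names, minimum cacheable sizes, and cache lifetimes are provider-specific.

    import anthropic

    client = anthropic.Anthropic()  # assumes ANTHROPIC_API_KEY is set

    # Large, stable prefix: system prompt, output schema, tool documentation, etc.
    long_stable_prefix = "You are a support agent. ... (schema and tool docs here) ..."

    conversation_history = [
        {"role": "user", "content": "Where is my order?"},
    ]

    response = client.messages.create(
        model="claude-3-5-sonnet-latest",  # placeholder model name
        max_tokens=1024,
        system=[
            {
                "type": "text",
                "text": long_stable_prefix,
                # Marks the end of the cacheable prefix; later requests sharing
                # this exact prefix can reuse the provider-side cache.
                "cache_control": {"type": "ephemeral"},
            }
        ],
        messages=conversation_history,
    )
    print(response.content[0].text)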

Pros:

Cons: