For readers who want the underlying mental model: every Claude request, regardless of which interface produced it, is a JSON payload structured in layers. Understanding those layers makes every recommendation in the chapters feel obvious rather than arbitrary.
Layer 1: the API envelope. The outer JSON wrapper. model, max_tokens, stream, system, messages, tools. Not billed as content tokens, but governs the call.
Layer 2: the system prompt. Sent before any messages, on every turn, regardless of how short your question is. In the consumer chat interface, contains product identity, safety rules, formatting rules, tool rules. In API applications, the developer writes it.
Layer 3: user preferences (chat only). Appended to the system prompt as a <userPreferences> block on every message in every conversation.
Layer 4: user memories (chat only). A separate <userMemories> block, auto-generated by Anthropic from your conversation history, injected fresh into every conversation.
Layer 5: tool definitions, skills, MCP schemas. Every available tool is sent as a full JSON schema on every turn, regardless of whether the query needs any tool. Includes built-in tools (web search, weather, etc.), the listing of your installed skills, and the full schemas of every connected MCP app. The single largest variable in many configurations.
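To make the per-turn overhead concrete, here is what a single tool definition looks like in Messages API form. The tool name and schema are hypothetical; the shape (name, description, input_schema as JSON Schema) is the standard one.

```python
import json

# One hypothetical Layer-5 tool definition. Every connected tool's full
# schema is serialized into every request, whether or not the turn uses it.
weather_tool = {
    "name": "get_weather",
    "description": "Get the current weather for a location.",
    "input_schema": {
        "type": "object",
        "properties": {
            "location": {"type": "string", "description": "City name"},
            "unit": {"type": "string", "enum": ["celsius", "fahrenheit"]},
        },
        "required": ["location"],
    },
}

# Even a small tool adds a fixed per-turn cost; a dozen MCP apps with
# rich schemas multiply this many times over.
schema_chars = len(json.dumps(weather_tool))
```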
Layer 6: conversation history. Every prior user message, every prior assistant response, every prior tool result. In full. Resent on every turn. The cost grows linearly with conversation length.
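A consequence worth spelling out: if per-turn input grows linearly, the cumulative input billed over a whole conversation grows quadratically. The numbers below are hypothetical, purely to show the arithmetic.

```python
# Illustrative arithmetic only. Assume a fixed prompt prefix and a
# roughly constant amount of new history added per turn.
fixed_prefix = 8000    # hypothetical: system prompt + tools + preferences
per_turn_growth = 600  # hypothetical: one user message + one response

def input_tokens_at_turn(n: int) -> int:
    """Input tokens sent on turn n (1-indexed): prefix plus all prior turns."""
    return fixed_prefix + (n - 1) * per_turn_growth

# Over 20 turns, the cumulative input is 274,000 tokens, even though
# no single turn exceeds 19,400.
total = sum(input_tokens_at_turn(n) for n in range(1, 21))
```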
Layer 7: the user message. The thing you actually typed, plus any attachments. Images are tokenized by resolution; the formula for Sonnet 4.6 and earlier models is approximately (width × height) / 750 tokens. PDFs are tokenized page by page. Uploaded code and text files are converted to text and included in full. Multimodal content persists in history exactly like text.
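The image rule of thumb from the text is easy to apply directly:

```python
# Approximate image tokenization for Sonnet 4.6 and earlier models,
# per the formula above: tokens ≈ (width × height) / 750.
def image_tokens(width: int, height: int) -> int:
    return (width * height) // 750

# A 1024×1024 screenshot costs roughly 1,398 tokens, and like any
# other history content it is resent in full on every later turn.
print(image_tokens(1024, 1024))  # → 1398
```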
Layer 8: output tokens. What comes back from the model. Billed at 5x the input rate on every current Claude model. Re-enters the conversation history on the next turn at the input rate. Output is therefore expensive twice: once when generated, again on each subsequent turn.
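The "expensive twice" point can be shown with placeholder rates that keep the 5:1 output-to-input ratio described above:

```python
# Illustrative cost units only; the 5:1 ratio is from the text,
# the absolute rates and token counts are hypothetical.
input_rate = 1.0    # cost units per input token
output_rate = 5.0   # output billed at 5x the input rate

response_tokens = 400
generation_cost = response_tokens * output_rate   # billed once, as output

# The same 400 tokens re-enter history and are billed as input on
# each subsequent turn — here, ten more turns of conversation.
later_turns = 10
history_cost = response_tokens * input_rate * later_turns
```

In this sketch the history re-billing (4,000 units) already exceeds the original generation cost (2,000 units) by turn ten, which is why verbose responses compound.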
For one mid-conversation question in a typical Claude.ai session, the input might break down as: API envelope (~15 tokens), Anthropic system prompt (several thousand), tool definitions (several thousand), preferences (~1,000), memories (~500), skills listing (~2,000), MCP schemas (10,000+ if you have several apps connected), conversation history (varies wildly), the actual question (~10). The question is typically a fraction of one percent of the payload. Everything else is infrastructure and accumulated state.
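Summing that breakdown makes the proportion vivid. History is set to a modest 20,000 tokens for the example; the other figures are the rough estimates given above.

```python
# The illustrative mid-conversation payload from the text, in tokens.
payload = {
    "api_envelope": 15,
    "system_prompt": 3000,
    "tool_definitions": 3000,
    "preferences": 1000,
    "memories": 500,
    "skills_listing": 2000,
    "mcp_schemas": 10000,
    "conversation_history": 20000,  # varies wildly in practice
    "question": 10,
}

total = sum(payload.values())              # 39,525 tokens
share = payload["question"] / total * 100  # the question's share: ~0.025%
```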
defer_loading: true: an API parameter that excludes a tool's schema from the prompt prefix until Claude searches for it on demand. Up to 85% reduction in tool-schema token usage in Anthropic's testing.

/compact: a Claude Code CLI command that replaces verbatim history with a model-generated summary. The most valuable single token-management lever in the CLI.
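The text names defer_loading as a per-tool parameter but does not show the request shape; the sketch below assumes the flag sits on each tool definition, with a hypothetical tool name.

```python
# Hypothetical sketch of deferred tool loading. The tool name and
# schema are invented; only the defer_loading flag itself comes from
# the text, and its exact placement here is an assumption.
tools = [
    {
        "name": "crm_lookup",
        "description": "Look up a customer record in the CRM.",
        "input_schema": {"type": "object", "properties": {}},
        "defer_loading": True,   # schema excluded from the prompt prefix
    },
]

# Tools flagged this way would not count against the prefix until
# Claude searches for them on demand.
deferred = [t["name"] for t in tools if t.get("defer_loading")]
```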