Technical Whitepaper - v0.3 February 2026
Current large language models (LLMs) based on the Transformer architecture suffer from quadratic attention complexity $O(n^{2})$, which severely limits their ability to process long contexts efficiently. We propose Hippocampus, a lightweight auxiliary State Space Model (SSM) that acts as a learned KV-cache filter for any frozen pretrained Transformer. Inspired by the hippocampal-neocortical memory system of the human brain, the controller dynamically decides which segments of the host model's key-value cache are relevant and should be retained, and which are irrelevant and should be evicted to cold storage. Unlike static heuristics (H2O, ScissorHands), filtering decisions are made by a learned model conditioned on both the user's original prompt and the host model's hidden states. Eviction is reversible: offloaded segments can be recalled if the context shifts. The architecture is model-agnostic: the host Transformer remains frozen, making Hippocampus a plug-in module compatible with any open-weight LLM. By focusing on binary filtering (retain or evict) rather than compression, this proposal minimizes implementation complexity while isolating the core research question: can a small learned model outperform static heuristics at selecting which context to keep?
Keywords: attention efficiency, KV-cache management, state space models, long-context inference, learned cache eviction, context filtering
The Transformer architecture has become the dominant paradigm for language modeling. However, the self-attention mechanism computes pairwise interactions between all tokens in a sequence, resulting in $O(n^{2})$ time and memory complexity with respect to sequence length $n$. This quadratic scaling is the primary bottleneck for extending context windows beyond current limits (typically 8K-128K tokens in production models).
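The quadratic cost is easy to see by counting entries in the attention score matrix, which is $n \times n$ per head. The sketch below tallies this memory for one layer; the head count and fp16 storage are illustrative defaults, not figures for any specific model.

```python
# Illustrative: memory for one layer's n x n attention score matrices.
# num_heads and fp16 (2 bytes) are assumed defaults, not model-specific facts.
def attention_score_bytes(n: int, num_heads: int = 32, dtype_bytes: int = 2) -> int:
    """Bytes needed to materialize the per-head n x n score matrices."""
    return num_heads * n * n * dtype_bytes

for n in (8_192, 32_768, 131_072):
    gib = attention_score_bytes(n) / 2**30
    print(f"n = {n:>7}: {gib:10.1f} GiB per layer")
```

Doubling $n$ quadruples this term, which is why cache and attention management, rather than raw compute, dominates long-context serving cost.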
Existing approaches fall into three categories: (1) architectural modifications (Longformer, BigBird, sliding window attention) requiring full retraining; (2) alternative architectures like State Space Models (Mamba) that achieve linear complexity but sacrifice some expressiveness; and (3) KV-cache management heuristics (H2O, ScissorHands, FastGen) that prune the key-value cache at inference time using fixed rules.
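To make category (3) concrete, the sketch below shows an H2O-style rule: always keep a recency window and retain the "heavy hitter" tokens with the largest accumulated attention scores. The function name, parameters, and thresholds are illustrative, not H2O's actual API.

```python
# Sketch of an H2O-style static heuristic (illustrative names, not H2O's API):
# retain a trailing recency window plus the highest accumulated-attention tokens.
import numpy as np

def heavy_hitter_mask(acc_attn: np.ndarray, recent: int, budget: int) -> np.ndarray:
    """Return a boolean retain-mask over the KV cache.

    acc_attn: per-token attention scores accumulated over decoding steps.
    recent:   number of trailing tokens always kept (local window).
    budget:   total number of tokens to retain (assumed > recent).
    """
    n = len(acc_attn)
    mask = np.zeros(n, dtype=bool)
    mask[-recent:] = True                     # always keep the recency window
    older = acc_attn[:-recent]                # candidates for heavy hitters
    top = np.argsort(older)[-(budget - recent):]
    mask[top] = True                          # keep highest accumulated scores
    return mask

rng = np.random.default_rng(0)
mask = heavy_hitter_mask(rng.random(1000), recent=64, budget=128)
print(mask.sum())  # 128 tokens retained out of 1000
```

Note that the rule is fixed: once the accumulated scores are computed, the same tokens are kept regardless of what the prompt actually asks about.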
Category (3) is the most practical, since it requires no retraining and applies post hoc to any Transformer, but it is fundamentally limited by static heuristics that cannot adapt to the semantic content of the sequence or to the host model's evolving informational needs.
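A learned filter replaces the fixed rule with a model that scores each cache segment from the host's hidden states. The following is a minimal sketch of such an interface under stated assumptions: the class, its linear stand-in for the trained SSM controller, and all names are hypothetical illustrations of the idea, not a published implementation. Segments scoring below a threshold move to cold storage and can be recalled, making eviction reversible.

```python
# Hypothetical interface for a learned, reversible KV-cache filter.
# All names and the scoring rule are illustrative sketches, not a real API.
import numpy as np

class LearnedCacheFilter:
    def __init__(self, d_model: int, threshold: float = 0.5, seed: int = 0):
        rng = np.random.default_rng(seed)
        # Stand-in for a trained SSM controller: a random linear probe
        # over each segment's mean hidden state.
        self.w = rng.normal(size=d_model) / np.sqrt(d_model)
        self.threshold = threshold
        self.cold_storage: dict[int, np.ndarray] = {}

    def retain_scores(self, segment_states: np.ndarray) -> np.ndarray:
        """segment_states: (num_segments, d_model) mean hidden states."""
        return 1.0 / (1.0 + np.exp(-segment_states @ self.w))  # sigmoid

    def filter(self, segments: dict[int, np.ndarray],
               segment_states: np.ndarray) -> dict[int, np.ndarray]:
        """Evict low-scoring segments to cold storage; return the hot set."""
        scores = self.retain_scores(segment_states)
        hot: dict[int, np.ndarray] = {}
        for (seg_id, kv), s in zip(segments.items(), scores):
            (hot if s >= self.threshold else self.cold_storage)[seg_id] = kv
        return hot

    def recall(self, seg_id: int) -> np.ndarray:
        """Reversible eviction: bring an offloaded segment back."""
        return self.cold_storage.pop(seg_id)
```

In a real system the probe would be replaced by the SSM controller conditioned on the prompt as well as the hidden states; the point of the interface is that every retain/evict decision is a prediction, not a fixed rule, and no decision is irrecoverable.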