Hippocampus

A Learned KV-Cache Filter for Efficient Long-Context Transformer Inference

Technical Whitepaper - v0.3 February 2026

Abstract

Current large language models (LLMs) based on the Transformer architecture suffer from quadratic attention complexity $O\left(n^{2}\right)$, severely limiting their ability to process long contexts efficiently. We propose Hippocampus, a lightweight auxiliary State Space Model (SSM) that acts as a learned KV-cache filter for any frozen pretrained Transformer. Inspired by the hippocampal-neocortical memory system in the human brain, the controller dynamically decides which segments of the host model's key-value cache to retain as relevant and which to evict to cold storage as irrelevant. Unlike static heuristics (H2O, ScissorHands), filtering decisions are made by a learned model conditioned on both the user's original prompt and the host model's hidden states. Eviction is reversible - offloaded segments can be recalled if the context shifts. The architecture is model-agnostic: the host Transformer remains frozen, making Hippocampus a plug-in module compatible with any open-weight LLM. By focusing on binary filtering (retain or evict) rather than compression, this proposal minimizes implementation complexity while isolating the core research question: can a small learned model outperform static heuristics at selecting which context to keep?

Keywords: attention efficiency, KV-cache management, state space models, long-context inference, learned cache eviction, context filtering

Table of Contents

  1. Introduction
  2. Related Work
  3. Proposed Architecture
     3.1 System Overview
     3.2 The Controller (Hippocampus)
     3.3 Semantic Segmentation
     3.4 Context-Aware Routing Signals
     3.5 Two-Tier Memory: Retain or Evict
     3.6 Cold Storage and Recall
     3.7 Periodic Re-Scoring
     3.8 Interface with Host Model
  4. Training Methodology
  5. Complexity Analysis and Expected Benefits
  6. Proposed Experimental Validation
  7. Limitations and Open Questions
  8. Future Directions
  9. Conclusion

1. Introduction

The Transformer architecture has become the dominant paradigm for language modeling. However, the self-attention mechanism computes pairwise interactions between all tokens in a sequence, resulting in $O\left(n^{2}\right)$ time and memory complexity with respect to sequence length $n$. This quadratic scaling is the primary bottleneck for extending context windows beyond current limits (typically 8K-128K tokens in production models).
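The scaling argument above can be made concrete with a back-of-envelope calculation. The sketch below uses an illustrative model shape (32 layers, 32 heads, head dimension 128, fp16 cache values) rather than any specific production model; the exact numbers are assumptions, but the scaling behavior they exhibit is the point:

```python
# Back-of-envelope cost of self-attention vs. sequence length n.
# Model shape (layers, heads, head_dim, fp16) is illustrative, not taken
# from any particular production LLM.

def attention_score_entries(n: int) -> int:
    """Entries in the n x n attention score matrix (per head, per layer)."""
    return n * n

def kv_cache_bytes(n: int, layers: int = 32, heads: int = 32,
                   head_dim: int = 128, bytes_per_val: int = 2) -> int:
    """KV-cache size: 2 tensors (K and V) per layer, each n x heads x head_dim."""
    return 2 * layers * n * heads * head_dim * bytes_per_val

# Doubling the context quadruples the attention score matrix (quadratic)
# but only doubles the KV cache (linear).
assert attention_score_entries(2 * 4096) == 4 * attention_score_entries(4096)
assert kv_cache_bytes(2 * 4096) == 2 * kv_cache_bytes(4096)

print(kv_cache_bytes(128_000) / 2**30)  # prints 62.5 (GiB at a 128K context)
```

Even though the KV cache grows only linearly, at 128K tokens it already dominates GPU memory for this hypothetical model shape, which is why cache eviction (rather than attention approximation alone) is the lever this proposal targets.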

Existing approaches fall into three categories: (1) architectural modifications (Longformer, BigBird, sliding window attention) requiring full retraining; (2) alternative architectures like State Space Models (Mamba) that achieve linear complexity but sacrifice some expressiveness; and (3) KV-cache management heuristics (H2O, ScissorHands, FastGen) that prune the key-value cache at inference time using fixed rules.

Category (3) is the most practical - it requires no retraining and applies post-hoc to any Transformer - but it is fundamentally limited: its fixed rules cannot adapt to the semantic content of the sequence or to the host model's evolving informational needs.
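To make the limitation concrete, consider a minimal sketch of a static eviction rule in the spirit of H2O: keep a recency window unconditionally, plus the "heavy hitter" tokens with the highest accumulated attention mass. The function name, signature, and shapes below are illustrative, not drawn from any published implementation:

```python
# Simplified sketch of a static KV-cache eviction heuristic (H2O-style):
# retain recent tokens plus the top-k older tokens by accumulated attention.
# All names here are illustrative; this is not the actual H2O code.
import heapq

def static_evict(cum_attn, budget, recency_window):
    """Return the set of cached token indices to retain.

    cum_attn: accumulated attention score per cached token.
    budget: total number of tokens to keep.
    recency_window: most recent tokens kept unconditionally.
    """
    n = len(cum_attn)
    recent = set(range(max(0, n - recency_window), n))
    k = budget - len(recent)
    # Heavy hitters: top-k by accumulated score among the older tokens.
    older = [(score, i) for i, score in enumerate(cum_attn) if i not in recent]
    heavy = {i for _, i in heapq.nlargest(max(0, k), older)}
    return recent | heavy

keep = static_evict([0.9, 0.1, 0.05, 0.8, 0.02, 0.3],
                    budget=4, recency_window=2)
# Tokens 4 and 5 are recent; tokens 0 and 3 have the largest accumulated scores.
```

Note that the retained set is a fixed function of past attention statistics: nothing in the rule can respond to the prompt's intent or to what the host model will need next, which is exactly the gap a learned controller is meant to close.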