Based on a conversation with ChatGPT.
Ecco compresses LLM weights, activations, and KV-cache data at the L2↔HBM boundary using quantization plus Huffman coding, then decompresses back to FP16 at L2, so SMs and tensor cores run unchanged FP16 kernels. It delivers effective bandwidth/capacity gains without modifying compute. Our direction: keep data compressed all the way to the compute boundary, decode into INT8/INT4 (or BF16) fragments for tensor cores, and replace Huffman with ANS, plus additional system and format upgrades.
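To make the "replace Huffman with ANS" direction concrete, here is a minimal rANS (range ANS) round-trip sketch in Python. The 12-bit probability table, byte-wise renormalization, and naive frequency scaler are illustrative choices for this sketch, not Ecco's format or a hardware design.

```python
# Minimal rANS (range ANS) codec sketch -- illustrative choices, not Ecco's format.
# 12-bit quantized probabilities, 8-bit renormalization, single non-interleaved stream.
from collections import Counter

PROB_BITS = 12
PROB_SCALE = 1 << PROB_BITS   # quantized frequencies sum to this
RANS_L = 1 << 16              # lower bound of the normalized state interval

def build_freqs(data):
    """Scale symbol counts so they sum to PROB_SCALE (naive scaler)."""
    counts = Counter(data)
    freqs = {s: max(1, c * PROB_SCALE // len(data)) for s, c in counts.items()}
    top = max(freqs, key=freqs.get)
    freqs[top] += PROB_SCALE - sum(freqs.values())  # absorb rounding slack
    cum, acc = {}, 0
    for s in sorted(freqs):                         # cumulative frequencies
        cum[s] = acc
        acc += freqs[s]
    return freqs, cum

def rans_encode(data, freqs, cum):
    x, out = RANS_L, bytearray()
    for sym in reversed(data):                      # rANS encodes in reverse order
        f = freqs[sym]
        x_max = ((RANS_L >> PROB_BITS) << 8) * f
        while x >= x_max:                           # renormalize: shift out low bytes
            out.append(x & 0xFF)
            x >>= 8
        x = ((x // f) << PROB_BITS) + (x % f) + cum[sym]
    return x, bytes(reversed(out))                  # decoder reads the stream forward

def rans_decode(x, stream, freqs, cum, n):
    slot2sym = bytearray(PROB_SCALE)                # inverse cumulative-frequency table
    for s in freqs:
        for i in range(cum[s], cum[s] + freqs[s]):
            slot2sym[i] = s
    pos, out = 0, bytearray()
    for _ in range(n):
        slot = x & (PROB_SCALE - 1)
        s = slot2sym[slot]
        x = freqs[s] * (x >> PROB_BITS) + slot - cum[s]
        while x < RANS_L and pos < len(stream):     # refill state from the stream
            x = (x << 8) | stream[pos]
            pos += 1
        out.append(s)
    return bytes(out)

data = b"abracadabra" * 64
freqs, cum = build_freqs(data)
state, stream = rans_encode(data, freqs, cum)
assert rans_decode(state, stream, freqs, cum, len(data)) == data
```

Unlike Huffman, the decoder's inner loop is a table lookup plus a multiply and shift, with no bit-by-bit tree walk, which is part of the appeal for wide hardware decoders.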
Problem. LLM inference is often memory-bound; KV caches dominate memory footprint and traffic.
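To ground the claim that KV caches dominate footprint, a back-of-envelope sizing sketch; the model shape below is an assumed Llama-2-7B-like configuration, not taken from the note.

```python
# Back-of-envelope KV-cache sizing for decoder-only transformer inference.
def kv_cache_bytes(layers, kv_heads, head_dim, seq_len, batch, bytes_per_elem=2):
    # Factor of 2 covers both keys and values; bytes_per_elem=2 is FP16.
    return 2 * layers * kv_heads * head_dim * seq_len * batch * bytes_per_elem

# Assumed Llama-2-7B-like shape: 32 layers, 32 KV heads, head_dim 128; 4K context, FP16.
size = kv_cache_bytes(layers=32, kv_heads=32, head_dim=128, seq_len=4096, batch=1)
print(size / 2**30)  # → 2.0 (GiB per sequence, before any compression)
```

At batch sizes typical for serving, this per-sequence cost multiplies directly, which is why KV traffic and capacity dominate.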
Positioning. Add domain-aware (de)compression in GPU L2 to cut HBM traffic while requiring minimal changes above L2.
Core method. Quantize weights/activations/KV, then entropy-code with Huffman; blocks are (de)compressed transparently on the HBM path.
Where it lives. Compressor/decompressor blocks are integrated at the L2 cache interface; data is re-expanded to FP16 before the SMs see it.
Key outcomes (as reported). Large effective bandwidth/capacity gains; speedups over fused dequant+GEMM baselines while preserving accuracy at common configs (e.g., W4/A8/KV4).
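A minimal sketch of the W4-style quantization path in NumPy: per-group symmetric INT4 quantization with dequantization back to FP16, mirroring the "re-expand before compute" step. The group size of 128 and the symmetric scheme are assumptions for illustration, not Ecco's exact format.

```python
import numpy as np

GROUP = 128  # group size is an assumption, not taken from the paper

def quantize_int4(w_fp16, group=GROUP):
    """Per-group symmetric quantization of FP16 weights to INT4 codes."""
    w = w_fp16.astype(np.float32).reshape(-1, group)
    scale = np.abs(w).max(axis=1, keepdims=True) / 7.0  # map each group to [-7, 7]
    scale[scale == 0] = 1.0                             # avoid division by zero
    q = np.round(w / scale).astype(np.int8)             # 4-bit codes stored in int8
    return q, scale.astype(np.float16)

def dequantize_fp16(q, scale):
    """Expand INT4 codes back to FP16, as an L2-side decompressor would."""
    w = q.astype(np.float32) * scale.astype(np.float32)
    return w.astype(np.float16).reshape(-1)

w = np.random.default_rng(0).standard_normal(1024).astype(np.float16)
q, s = quantize_int4(w)
w_hat = dequantize_fp16(q, s)
# Rounding error per element is bounded by about half a quantization step.
err = np.abs(w.astype(np.float32) - w_hat.astype(np.float32)).max()
assert err <= 0.6 * s.astype(np.float32).max()
```

In our direction, the dequantize step would be skipped: the INT4 codes themselves (after entropy decoding) would be fed to INT4/INT8 tensor-core fragments, with scales applied in the epilogue.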