Based on a conversation with ChatGPT.
Ecco compresses LLM weights, activations, and KV-cache data at the L2↔HBM boundary using quantization plus Huffman coding, then decompresses back to FP16 at L2, so SMs and tensor cores run unchanged FP16 kernels. It delivers effective bandwidth/capacity gains without modifying compute. Our direction: keep data compressed all the way to the compute boundary, decode into INT8/INT4 (or BF16) fragments for tensor cores, and replace Huffman with ANS, plus additional system and format upgrades.
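To make the "replace Huffman with ANS" direction concrete, here is a minimal rANS (range ANS) round-trip sketch in Python. The 12-bit probability table, byte-wise renormalization, and naive frequency scaler are illustrative choices for this sketch, not Ecco's format or a hardware design.

```python
# Minimal rANS (range ANS) codec sketch -- illustrative choices, not Ecco's format.
# 12-bit quantized probabilities, 8-bit renormalization, single non-interleaved stream.
from collections import Counter

PROB_BITS = 12
PROB_SCALE = 1 << PROB_BITS   # quantized frequencies sum to this
RANS_L = 1 << 16              # lower bound of the normalized state interval

def build_freqs(data):
    """Scale symbol counts so they sum to PROB_SCALE (naive scaler)."""
    counts = Counter(data)
    freqs = {s: max(1, c * PROB_SCALE // len(data)) for s, c in counts.items()}
    top = max(freqs, key=freqs.get)
    freqs[top] += PROB_SCALE - sum(freqs.values())  # absorb rounding slack
    cum, acc = {}, 0
    for s in sorted(freqs):                         # cumulative frequencies
        cum[s] = acc
        acc += freqs[s]
    return freqs, cum

def rans_encode(data, freqs, cum):
    x, out = RANS_L, bytearray()
    for sym in reversed(data):                      # rANS encodes in reverse order
        f = freqs[sym]
        x_max = ((RANS_L >> PROB_BITS) << 8) * f
        while x >= x_max:                           # renormalize: shift out low bytes
            out.append(x & 0xFF)
            x >>= 8
        x = ((x // f) << PROB_BITS) + (x % f) + cum[sym]
    return x, bytes(reversed(out))                  # decoder reads the stream forward

def rans_decode(x, stream, freqs, cum, n):
    slot2sym = bytearray(PROB_SCALE)                # inverse cumulative-frequency table
    for s in freqs:
        for i in range(cum[s], cum[s] + freqs[s]):
            slot2sym[i] = s
    pos, out = 0, bytearray()
    for _ in range(n):
        slot = x & (PROB_SCALE - 1)
        s = slot2sym[slot]
        x = freqs[s] * (x >> PROB_BITS) + slot - cum[s]
        while x < RANS_L and pos < len(stream):     # refill state from the stream
            x = (x << 8) | stream[pos]
            pos += 1
        out.append(s)
    return bytes(out)

data = b"abracadabra" * 64
freqs, cum = build_freqs(data)
state, stream = rans_encode(data, freqs, cum)
assert rans_decode(state, stream, freqs, cum, len(data)) == data
```

Unlike Huffman, the decoder's inner loop is a table lookup plus a multiply and shift, with no bit-by-bit tree walk, which is part of the appeal for wide hardware decoders.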
Problem. LLM inference is often memory-bound; KV caches dominate memory footprint and traffic.
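To ground the claim that KV caches dominate footprint, a back-of-envelope sizing sketch; the model shape below is an assumed Llama-2-7B-like configuration, not taken from the note.

```python
# Back-of-envelope KV-cache sizing for decoder-only transformer inference.
def kv_cache_bytes(layers, kv_heads, head_dim, seq_len, batch, bytes_per_elem=2):
    # Factor of 2 covers both keys and values; bytes_per_elem=2 is FP16.
    return 2 * layers * kv_heads * head_dim * seq_len * batch * bytes_per_elem

# Assumed Llama-2-7B-like shape: 32 layers, 32 KV heads, head_dim 128; 4K context, FP16.
size = kv_cache_bytes(layers=32, kv_heads=32, head_dim=128, seq_len=4096, batch=1)
print(size / 2**30)  # → 2.0 (GiB per sequence, before any compression)
```

At batch sizes typical for serving, this per-sequence cost multiplies directly, which is why KV traffic and capacity dominate.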
Positioning. Add domain-aware (de)compression in GPU L2 to cut HBM traffic while requiring minimal changes above L2.
Core method. Quantize weights/activations/KV, then entropy-code with Huffman; blocks are (de)compressed transparently on the HBM path.
Where it lives. Compressor/decompressor blocks are integrated at the L2 cache interface; data is re-expanded to FP16 before the SMs see it.
Key outcomes (as reported). Large effective bandwidth/capacity gains; speedups over fused dequant+GEMM baselines while preserving accuracy at common configs (e.g., W4/A8/KV4).
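A minimal sketch of the W4-style quantization path in NumPy: per-group symmetric INT4 quantization with dequantization back to FP16, mirroring the "re-expand before compute" step. The group size of 128 and the symmetric scheme are assumptions for illustration, not Ecco's exact format.

```python
import numpy as np

GROUP = 128  # group size is an assumption, not taken from the paper

def quantize_int4(w_fp16, group=GROUP):
    """Per-group symmetric quantization of FP16 weights to INT4 codes."""
    w = w_fp16.astype(np.float32).reshape(-1, group)
    scale = np.abs(w).max(axis=1, keepdims=True) / 7.0  # map each group to [-7, 7]
    scale[scale == 0] = 1.0                             # avoid division by zero
    q = np.round(w / scale).astype(np.int8)             # 4-bit codes stored in int8
    return q, scale.astype(np.float16)

def dequantize_fp16(q, scale):
    """Expand INT4 codes back to FP16, as an L2-side decompressor would."""
    w = q.astype(np.float32) * scale.astype(np.float32)
    return w.astype(np.float16).reshape(-1)

w = np.random.default_rng(0).standard_normal(1024).astype(np.float16)
q, s = quantize_int4(w)
w_hat = dequantize_fp16(q, s)
# Rounding error per element is bounded by about half a quantization step.
err = np.abs(w.astype(np.float32) - w_hat.astype(np.float32)).max()
assert err <= 0.6 * s.astype(np.float32).max()
```

In our direction, the dequantize step would be skipped: the INT4 codes themselves (after entropy decoding) would be fed to INT4/INT8 tensor-core fragments, with scales applied in the epilogue.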