Based on the conversation with ChatGPT
(Summary, Q&A, and Focus Notes)
Goal (rephrased)
Make the paper easy to reuse: understand its core idea and results, answer the specific questions you raised (registers, GPU mapping, compression menu, and storage-only entropy coding), and finish with a concrete checklist you can execute.
1) Paper at a glance
- Problem. With LLM weights stored compressed (low-bit + unstructured sparsity), modern CPUs with in-core GeMM (e.g., AMX/TMUL) become decompression-bound: vector units unpack/dequant/expand tiles slower than memory can supply compressed data or the matrix engine can consume dense tiles.
- Model. A 3-D “Roof-Surface” generalizes roofline: performance is jointly bounded by (i) memory bandwidth on compressed tensors, (ii) vector-stage decompression throughput, and (iii) matrix-engine throughput.
- Architecture. DECA, a per-core decompression engine placed near L2, performs dequantization + sparse expansion and hands off dense tiles to the matrix path.
- ISA/Runtime. TEPL (Tile External Preprocess & Load) overlaps DECA work with GeMM; double-buffering avoids fences and hides latency.
- Results (CPU+HBM sim). Up to 4× speedup over tuned software decompression; 1.6–2.6× lower next-token latency (and 2.5–5× vs uncompressed). Area cost is tiny relative to the die.
2) Core idea (one line)
Treat decompression as a first-class pipeline stage. Right-size it (via the 3-D model) and move it into a tiny near-core engine that streams ready-to-use dense tiles into the matrix path, overlapped via TEPL.
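To see why the overlap matters, here is a rough two-stage timing model. It is only a back-of-the-envelope sketch, not the paper's simulator: the function names and the per-tile costs `decomp` and `gemm` (in arbitrary time units) are assumptions made up for illustration.

```python
def serial_time(n_tiles, decomp, gemm):
    """No overlap: each tile is decompressed, then consumed, one after another."""
    return n_tiles * (decomp + gemm)


def overlapped_time(n_tiles, decomp, gemm):
    """Double-buffered: tile i+1 is decompressed while tile i is being consumed.

    The first decompression and the last GeMM cannot be hidden; in between,
    the slower of the two stages sets the pace.
    """
    if n_tiles == 0:
        return 0
    return decomp + (n_tiles - 1) * max(decomp, gemm) + gemm


if __name__ == "__main__":
    n, d, g = 4, 3, 2  # hypothetical numbers: decompression slower than GeMM
    print("serial:", serial_time(n, d, g))
    print("overlapped:", overlapped_time(n, d, g))
```

In this toy model the overlapped schedule is still paced by the slower stage, which is why the notes say to right-size the decompressor and not just overlap it.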
3) The 3-D Roof-Surface (why 2-D roofline fails)
- Axes:
  - Mem axis = HBM/L2→core rate on compressed data.
  - Vector axis = unpack/dequant/expand rate to dense tiles.
  - Matrix axis = AMX/TMUL consumption rate for dense tiles.
- Achieved perf = min of the three (after normalizing units). Many CPU cases land in the vector-bound region: software decompression is the limiter.
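The "min of the three" rule can be written out as a small calculator. This is a minimal sketch under stated assumptions: the function name, the parameter names, and the choice to normalize everything to FLOP/s through a `flops_per_dense_byte` factor are mine, not the paper's notation, and the example numbers are invented.

```python
def roof_surface(mem_bw_gbs, compression_ratio, vec_decomp_gbs,
                 matrix_tflops, flops_per_dense_byte):
    """Return (attainable FLOP/s, name of the limiting axis).

    mem_bw_gbs           : memory bandwidth on compressed data, GB/s
    compression_ratio    : dense bytes produced per compressed byte
    vec_decomp_gbs       : decompression output rate, dense GB/s
    matrix_tflops        : peak matrix-engine throughput, TFLOP/s
    flops_per_dense_byte : FLOPs performed per dense weight byte
    """
    bounds = {
        # Compressed bytes arrive at mem_bw; each expands by compression_ratio.
        "memory": mem_bw_gbs * 1e9 * compression_ratio * flops_per_dense_byte,
        # Dense bytes leave the decompression stage at vec_decomp_gbs.
        "vector": vec_decomp_gbs * 1e9 * flops_per_dense_byte,
        # The matrix engine cannot exceed its own peak.
        "matrix": matrix_tflops * 1e12,
    }
    limiter = min(bounds, key=bounds.get)
    return bounds[limiter], limiter


if __name__ == "__main__":
    # Hypothetical machine: plenty of bandwidth and matrix peak, slow decompression.
    perf, limiter = roof_surface(100, 4, 50, 10, 8)
    print(f"{perf:.3g} FLOP/s, bound by {limiter}")
```

With these made-up inputs the vector axis is the smallest bound, which mirrors the vector-bound case described above.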
4) DECA + TEPL (what they build)
- DECA: per-core micro-engine with small LUTs and a pipelined datapath (bit-unpack → scale/zero-point → optional sparse expand). Outputs land in tile-out registers local to DECA or (with TEPL) directly into AMX tiles.
- TEPL: an instruction that triggers DECA and retires once the specified AMX tile register is populated; this allows out-of-order overlap with compute.
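The three datapath stages (bit-unpack → scale/zero-point → optional sparse expand) can be mimicked in software to make the data flow concrete. This is an illustrative sketch only: the 4-bit width, the low-nibble-first packing, the single per-tile scale and zero-point, and the bitmask sparsity encoding are all assumptions chosen for simplicity, not the formats or hardware design from the paper.

```python
def unpack_4bit(packed):
    """Stage 1: split each byte into two 4-bit codes (low nibble first, assumed)."""
    codes = []
    for byte in packed:
        codes.append(byte & 0x0F)
        codes.append(byte >> 4)
    return codes


def dequantize(codes, scale, zero_point):
    """Stage 2: map integer codes to real values with a scale and zero-point."""
    return [(c - zero_point) * scale for c in codes]


def sparse_expand(values, bitmask):
    """Stage 3: place nonzero values where the mask is 1, zeros elsewhere."""
    it = iter(values)
    return [next(it) if bit else 0.0 for bit in bitmask]


def decompress_tile(packed, scale, zero_point, bitmask):
    """Run all three stages and return one dense row of weights."""
    nnz = sum(bitmask)                      # number of stored nonzeros
    codes = unpack_4bit(packed)[:nnz]       # drop any padding nibble
    return sparse_expand(dequantize(codes, scale, zero_point), bitmask)


if __name__ == "__main__":
    dense = decompress_tile(bytes([0x21, 0x43]), scale=0.5, zero_point=2,
                            bitmask=[1, 0, 1, 1, 0, 1])
    print(dense)
```

On a CPU without DECA, these stages are what the vector units have to execute for every tile before the matrix engine can start.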
5) Compression formats explicitly treated in the paper