DECA: A Near-Core LLM Decompression Accelerator Grounded on a 3D Roofline Model

Based on a conversation with ChatGPT

(Summary, Q&A, and Focus Notes)

Goal (rephrased)

Make the paper easy to reuse: understand its core idea and results, answer the specific questions you raised (registers, GPU mapping, compression menu, and storage-only entropy coding), and finish with a concrete checklist you can execute.

1) Paper at a glance

Problem. With LLM weights stored compressed (low-bit + unstructured sparsity), modern CPUs with in-core GeMM (e.g., AMX/TMUL) become decompression-bound: vector units unpack/dequant/expand tiles slower than memory can supply compressed data or the matrix engine can consume dense tiles.
Model. A 3-D “Roof-Surface” generalizes roofline: performance is jointly bounded by (i) memory bandwidth on compressed tensors, (ii) vector-stage decompression throughput, and (iii) matrix-engine throughput.
Architecture. DECA is a near-core decompressor, one per core, placed near L2. It performs dequantization + expansion and hands off dense tiles.
ISA/Runtime. TEPL (Tile External Preprocess & Load) overlaps DECA work with GeMM; double-buffering avoids fences and hides latency.
Results (CPU+HBM sim). Up to 4× speedup over tuned software decompression; 1.6–2.6× lower next-token latency (and 2.5–5× vs uncompressed). Area cost is tiny relative to the die.

2) Core idea (one line)

Treat decompression as a first-class pipeline stage. Right-size it (via the 3-D model) and move it into a tiny near-core engine that streams ready-to-use dense tiles into the matrix path, overlapped via TEPL.

3) The 3-D Roof-Surface (why 2-D roofline fails)

Axes:

Mem axis = HBM/L2→core rate on compressed data.

Vector axis = unpack/dequant/expand rate to dense tiles.

Matrix axis = AMX/TMUL consumption rate for dense tiles.
Achieved perf = min of the three bounds (after normalizing units). Many CPU cases land in the vector-bound region, where software decompression is the limiter.
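
The min-of-three rule above can be sketched in a few lines. This is a simplified reading of the model, not the paper's exact formulation: it assumes all three bounds are expressed as dense gigabytes per second delivered to the matrix engine, and that the memory bound scales with a single compression ratio. The function name, parameters, and numbers are hypothetical and chosen only for illustration.

```python
def roof_surface(mem_gbs, compression_ratio, vector_gbs, matrix_gbs):
    """Return (achieved dense GB/s, name of the limiting stage).

    Assumed normalization (not taken from the paper):
      mem_gbs           compressed bytes fetched per second
      compression_ratio dense bytes produced per compressed byte
      vector_gbs        dense bytes the decompression stage can emit per second
      matrix_gbs        dense bytes the matrix engine can consume per second
    """
    bounds = {
        "memory": mem_gbs * compression_ratio,
        "vector": vector_gbs,
        "matrix": matrix_gbs,
    }
    limiter = min(bounds, key=bounds.get)
    return bounds[limiter], limiter


# Illustrative numbers only: software decompression is the limiter.
print(roof_surface(800.0, 4.0, 1000.0, 4000.0))
# Raising the decompression rate moves the limiter to memory.
print(roof_surface(800.0, 4.0, 5000.0, 4000.0))
```

Under these assumptions, a 2-D roofline that only sees the memory and matrix bounds would predict 3200 GB/s in the first case, while the vector stage caps it at 1000 GB/s.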

4) DECA + TEPL (what they build)

DECA: per-core micro-engine with small LUTs and a pipelined datapath (bit-unpack → scale/zero-point → optional sparse expand). Outputs land in tile-out registers local to DECA or (with TEPL) directly into AMX tiles.
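
The three datapath stages (bit-unpack, scale/zero-point, sparse expand) can be mimicked in software to make the data flow concrete. This is a minimal sketch under stated assumptions, not DECA's actual format: it assumes 4-bit codes packed two per byte with the low nibble first, one scale and zero-point per tile, and a bitmask marking nonzero positions. All function names are hypothetical, and the hardware may use LUTs where this sketch uses arithmetic.

```python
def unpack_nibbles(packed):
    """Bit-unpack: split each byte into two 4-bit codes, low nibble first (assumed order)."""
    codes = []
    for b in packed:
        codes.append(b & 0x0F)
        codes.append(b >> 4)
    return codes


def dequantize(codes, scale, zero_point):
    """Scale/zero-point: map integer codes to real values."""
    return [scale * (c - zero_point) for c in codes]


def sparse_expand(values, bitmask):
    """Sparse expand: place the nonzero values where the bitmask is 1, zeros elsewhere."""
    it = iter(values)
    return [next(it) if bit else 0.0 for bit in bitmask]


def decompress_tile(packed, scale, zero_point, bitmask):
    """Run the three stages in order and return a dense row of values."""
    nnz = sum(bitmask)
    codes = unpack_nibbles(packed)[:nnz]  # drop padding nibbles, if any
    return sparse_expand(dequantize(codes, scale, zero_point), bitmask)


# Two packed bytes hold the codes 1, 2, 3, 4; the bitmask spreads them over 8 slots.
dense = decompress_tile(bytes([0x21, 0x43]), scale=0.5, zero_point=2,
                        bitmask=[1, 0, 0, 1, 1, 0, 1, 0])
print(dense)  # [-0.5, 0.0, 0.0, 0.0, 0.5, 0.0, 1.0, 0.0]
```

On a CPU without DECA, the vector units perform this work per tile, which is the stage the Roof-Surface model identifies as the bottleneck.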
TEPL: an instruction that kicks DECA and retires when a specified AMX tile register is populated; allows out-of-order overlap with compute.
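
To see why overlap matters, here is a small timing model of a decompress-then-GeMM pipeline over a sequence of tiles. It is a sketch under assumptions, not a description of TEPL's real semantics: each tile takes a fixed decompression time and a fixed GeMM time, a tile buffer can be refilled only after the GeMM that read it has finished, and time units are arbitrary. The function name and parameters are hypothetical.

```python
def pipeline_time(n_tiles, t_decomp, t_gemm, n_buffers=2):
    """Total time to decompress and multiply n_tiles tiles.

    n_buffers=1 forces decompression and GeMM to alternate (no overlap);
    n_buffers=2 models double-buffering, where tile i+1 is decompressed
    while the GeMM on tile i is still running.
    """
    decomp_done = []  # finish time of each tile's decompression
    gemm_done = []    # finish time of each tile's GeMM
    for i in range(n_tiles):
        prev_decomp = decomp_done[i - 1] if i >= 1 else 0.0
        # The target buffer is free once the GeMM n_buffers tiles back is done.
        buffer_free = gemm_done[i - n_buffers] if i >= n_buffers else 0.0
        decomp_done.append(max(prev_decomp, buffer_free) + t_decomp)
        prev_gemm = gemm_done[i - 1] if i >= 1 else 0.0
        gemm_done.append(max(decomp_done[i], prev_gemm) + t_gemm)
    return gemm_done[-1] if gemm_done else 0.0


# Illustrative numbers: 8 tiles, decompression 1 unit, GeMM 3 units per tile.
print(pipeline_time(8, 1.0, 3.0, n_buffers=1))  # 32.0
print(pipeline_time(8, 1.0, 3.0, n_buffers=2))  # 25.0
```

With two buffers the steady-state cost per tile in this model is the larger of the two stage times rather than their sum, so decompression latency is hidden whenever it is the shorter stage.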

5) Compression formats explicitly treated in the paper