https://arxiv.org/pdf/2501.09251
Summary
Built a library for efficient SpMM
- data affinity-based reordering (→ increases the share of useful, nonzero data in each TC tile)
- memory efficient compressed format (→ little overhead for decompression)
- high-throughput pipeline (prefetching dense B)
- adaptive sparsity-aware load balancing
Workload Analysis
- memory-access-intensive computation due to irregular memory access patterns
  - (dense B) irregular, non-contiguous access → cache misses, high access-request overhead → latency / stall dominated.
  - low BW utilization (low effective BW)
This prevents full use of the memory bandwidth and results in low TCU pipeline utilization (dense B's global-to-register loads).
- poor ILP efficiency: memory access and computation cannot be overlapped enough, which leads to low memory bandwidth utilization or many pipeline bubbles
  - computation has to wait for the dense B matrix.
  - TC sits idle while B tiles are loaded (global memory to registers).
In practice, the computation time is much shorter than the data loading time.
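To make the irregular access to dense B concrete, here is a minimal pure-Python CSR SpMM sketch. It is only an illustration, not the paper's kernel: the function name `spmm_csr` and the `b_rows_touched` trace are hypothetical, and the CSR layout (`indptr`, `indices`, `data`) is assumed as the input format. The point is that the column index of each nonzero in A decides which row of B gets read, so the read order of B follows the sparsity pattern rather than memory order.

```python
def spmm_csr(indptr, indices, data, B):
    """Compute C = A @ B, with A in CSR form and B a dense list of rows.

    Also returns the order in which rows of B were read (illustration only).
    """
    n_rows = len(indptr) - 1
    n_cols = len(B[0])
    C = [[0.0] * n_cols for _ in range(n_rows)]
    b_rows_touched = []
    for i in range(n_rows):
        for p in range(indptr[i], indptr[i + 1]):
            k = indices[p]            # column of A selects the row of B
            b_rows_touched.append(k)  # data-dependent, non-contiguous read
            a = data[p]
            row_b = B[k]
            for j in range(n_cols):
                C[i][j] += a * row_b[j]
    return C, b_rows_touched


# A = [[1, 0, 0, 2],
#      [0, 0, 3, 0],
#      [4, 0, 0, 5]]
indptr = [0, 2, 3, 5]
indices = [0, 3, 2, 0, 3]
data = [1.0, 2.0, 3.0, 4.0, 5.0]
B = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0], [7.0, 8.0]]

C, trace = spmm_csr(indptr, indices, data, B)
print(C)      # product A @ B
print(trace)  # rows of B in the order they were read
```

In this toy case the rows of B are read in the order 0, 3, 2, 0, 3: the reads jump around, and rows 0 and 3 are fetched twice. On a GPU the same pattern shows up as scattered global-memory requests for B tiles.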
Proposed Works
Data affinity-based reordering

This increases TC utilization and the cache hit rate.
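A rough sketch of why reordering helps, under stated assumptions: this is NOT the paper's reordering algorithm, only a greedy Jaccard-similarity stand-in, and the names `greedy_reorder` and `window_columns` are hypothetical. Rows that share column indices are placed next to each other, so each window of consecutive rows (one TC tile height, here `window`) touches fewer distinct columns. Fewer distinct columns per window means denser tiles and more reuse of the same B rows.

```python
def jaccard(a, b):
    """Jaccard similarity of two column-index sets."""
    union = a | b
    return len(a & b) / len(union) if union else 1.0


def greedy_reorder(row_cols):
    """Greedy stand-in: repeatedly append the remaining row whose column
    set is most similar to the last placed row (ties go to the lower index)."""
    if not row_cols:
        return []
    remaining = set(range(1, len(row_cols)))
    order = [0]
    while remaining:
        last = row_cols[order[-1]]
        nxt = max(sorted(remaining), key=lambda r: jaccard(last, row_cols[r]))
        order.append(nxt)
        remaining.discard(nxt)
    return order


def window_columns(row_cols, order, window):
    """Sum over row windows of the number of distinct columns touched."""
    total = 0
    for start in range(0, len(order), window):
        cols = set()
        for r in order[start:start + window]:
            cols |= row_cols[r]
        total += len(cols)
    return total


# Column indices of the nonzeros in each row of a toy sparse matrix.
row_cols = [{0, 1}, {4, 5}, {0, 1, 2}, {4, 5, 6}]
window = 2  # rows per tile (assumed value, for illustration only)

order = greedy_reorder(row_cols)
before = window_columns(row_cols, list(range(len(row_cols))), window)
after = window_columns(row_cols, order, window)
print(order, before, after)
```

In the toy matrix, the original order mixes the two column groups in every window, while the reordered rows keep columns {0, 1, 2} and {4, 5, 6} in separate windows, so the total distinct-column count drops from 10 to 6.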
Memory-efficient compressed format
