Summary

Built a library for efficient SpMM (sparse-dense matrix multiplication).

Workload Analysis

SpMM is a memory-access-intensive computation because of its irregular memory access patterns:
- (dense B) irregular, non-contiguous access → cache misses and high access-request overhead → latency/stall dominated.
- low bandwidth utilization (low effective BW)
This prevents full use of the memory bandwidth and results in low TCU pipeline utilization (dense B's global-memory-to-register access).
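
As a rough illustration of why scattered access lowers effective bandwidth, the sketch below counts how many fixed-size memory sectors a set of element addresses touches and compares the bytes actually needed with the bytes fetched. The function name `sector_efficiency`, the 32-byte sector size, the 4-byte element size, and the example strides are all assumptions chosen for illustration; they are not taken from the library.

```python
def sector_efficiency(addresses, elem_bytes=4, sector_bytes=32):
    """Fraction of fetched bytes that are actually used.

    addresses: byte addresses of the elements being read (hypothetical).
    Memory is assumed to be fetched in whole sectors of `sector_bytes`.
    """
    sectors = set()
    for addr in addresses:
        first = addr // sector_bytes
        last = (addr + elem_bytes - 1) // sector_bytes
        sectors.update(range(first, last + 1))
    fetched = len(sectors) * sector_bytes
    useful = len(set(addresses)) * elem_bytes
    return useful / fetched if fetched else 0.0


# 32 contiguous 4-byte elements: every fetched byte is used.
contiguous = [4 * i for i in range(32)]
# 32 elements spread 4096 bytes apart: one element per sector.
scattered = [4096 * i for i in range(32)]

eff_contiguous = sector_efficiency(contiguous)
eff_scattered = sector_efficiency(scattered)
print(eff_contiguous, eff_scattered)
```

Under these assumptions the scattered pattern uses only one eighth of the bytes it fetches, so the effective bandwidth is at most one eighth of what the same hardware delivers for contiguous reads.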
ILP is poor: memory access and computation cannot be overlapped enough, which leads to low memory bandwidth utilization and many pipeline bubbles:
- compute has to wait for the dense B matrix.
- TCs sit idle while B tiles are loaded (global memory to registers).
In practice, the computation time is much shorter than the data loading time.
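
The effect of the load/compute imbalance can be sketched with a simple timing model. This is a back-of-the-envelope model, not a measurement: the function names, the tile count, and the per-tile load and compute times are hypothetical, and the overlapped case assumes ideal double buffering in which the load of the next B tile runs while the current tile is computed.

```python
def serial_time(n_tiles, t_load, t_compute):
    """Load a tile, then compute on it; the TCs idle during every load."""
    return n_tiles * (t_load + t_compute)


def overlapped_time(n_tiles, t_load, t_compute):
    """Ideal double buffering: load tile i+1 while computing tile i."""
    if n_tiles == 0:
        return 0.0
    # First load is exposed, last compute is exposed, and each step in
    # between costs the slower of the two stages.
    return t_load + (n_tiles - 1) * max(t_load, t_compute) + t_compute


def tc_utilization(n_tiles, t_compute, total_time):
    """Fraction of the total time the TCs spend computing."""
    return n_tiles * t_compute / total_time if total_time else 0.0


# Hypothetical numbers: loading a B tile takes 4x longer than computing on it.
n, t_load, t_compute = 8, 4.0, 1.0
t_serial = serial_time(n, t_load, t_compute)
t_overlap = overlapped_time(n, t_load, t_compute)
print(tc_utilization(n, t_compute, t_serial))
print(tc_utilization(n, t_compute, t_overlap))
```

With these numbers, overlapping raises TC utilization only from 20% to about 24%, because the load time still dominates each step. In this model, overlap alone is not enough when loading is much slower than computing; the time spent loading B also has to shrink.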

This will increase TC utilization and the cache hit rate.