Summary

To move beyond SpMV-based GNN acceleration, the authors propose a general-purpose SpMM kernel with a customized CSR-based design.

Naively expanding the SpMV-based scheme to SpMM creates two pain points: an uncoalesced access pattern across threads, and weak data reuse (i.e., redundant data loading).
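A minimal Python sketch (CPU-side pseudocode of my own, not the paper's CUDA kernel; the function and counter names are assumptions) of the naive expansion: each (row, column) output element independently walks the sparse row, the way one thread per output element would. The `loads` counter shows that every nonzero is re-read once per output column, which is the weak-reuse / redundant-loading problem.

```python
def naive_csr_spmm(indptr, indices, data, B, N):
    """Compute C = A @ B with A in CSR form, one 'thread' per C[r][n].

    Hypothetical illustration of the naive SpMV-style expansion: the
    sparse row (indptr/indices/data) is traversed again for every
    output column n, so each nonzero is loaded N times.
    """
    M = len(indptr) - 1
    C = [[0.0] * N for _ in range(M)]
    loads = 0  # count of sparse-matrix element reads (proxy for traffic)
    for r in range(M):
        for n in range(N):  # conceptually, one independent thread each
            acc = 0.0
            for p in range(indptr[r], indptr[r + 1]):
                acc += data[p] * B[indices[p]][n]
                loads += 1  # same nonzero reloaded for every n
            C[r][n] = acc
    return C, loads
```

For A = [[1, 0], [2, 3]] (3 nonzeros) and N = 2, `loads` comes out as nnz × N = 6: the redundancy grows linearly with the feature dimension, which is exactly what a reuse mechanism would remove.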

To address these problems, they introduce two schemes.

Target

GNN aggregation

If the reduction is sum: standard SpMM

If the reduction is max-pooling: SpMM-like (this paper's target; cuSPARSE does not support it)
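A small Python sketch (my own illustration, not the paper's kernel) of what "SpMM-like" means here: the sparsity traversal is identical to SpMM, but the row-wise reduction is a max over the products rather than a sum, matching max-pooling aggregation in GNNs.

```python
def csr_spmm_max(indptr, indices, data, B, N):
    """SpMM-like kernel: same CSR traversal as SpMM, but the per-row
    reduction is max(data[p] * B[col][n]) instead of a sum.

    Hypothetical sketch of max-pooling aggregation; rows with no
    nonzeros are left at -inf.
    """
    M = len(indptr) - 1
    C = [[float("-inf")] * N for _ in range(M)]
    for r in range(M):
        for p in range(indptr[r], indptr[r + 1]):
            col, val = indices[p], data[p]
            for n in range(N):
                prod = val * B[col][n]
                if prod > C[r][n]:  # max-reduce instead of accumulate
                    C[r][n] = prod
    return C
```

Swapping the `+=` of SpMM for this comparison is the only change, which is why the paper can treat both under one kernel structure while cuSPARSE, which hard-codes the sum reduction, cannot.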

Workload Analysis


They argue that memory bandwidth saturates at high N (the column dimension of the output matrix, i.e., the feature dimension), yet much of that traffic is redundant loading, so the bandwidth is not being used efficiently. Hence, they conclude that a data-reuse mechanism is necessary.

They also claim that SpMV is bounded by low bandwidth utilization. However, these results were measured on an RTX 2080, which is a fairly old machine.

Proposed Work