Summary 1 : general

workload

Tensor Core - GNN

background


The baseline TC-GNN SpMM algorithm condenses the sparse matrix A by packing its non-zero columns to the left, then gathers the corresponding rows of dense B so both operands fit into dense Tensor Core tiles.
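A minimal NumPy sketch of this condensing idea (my own illustration, not TC-GNN's actual kernel): keep only the columns of a sparse row window of A that contain non-zeros, and gather the matching rows of B, so the product is computed over a smaller, denser tile.

```python
import numpy as np

def condense_window(A_window, B):
    # Columns of this row window of A that hold at least one non-zero.
    nz_cols = np.flatnonzero(A_window.any(axis=0))
    A_cond = A_window[:, nz_cols]   # non-zero columns packed to the left
    B_gathered = B[nz_cols, :]      # rows of B matching the kept columns
    return A_cond, B_gathered, nz_cols

A_window = np.array([[0., 2., 0., 1.],
                     [0., 0., 0., 3.]])
B = np.arange(8.).reshape(4, 2)
A_cond, B_g, cols = condense_window(A_window, B)
# A_cond @ B_g equals A_window @ B, but over fewer (denser) columns,
# which is what lets the tile feed a dense Tensor Core MMA.
```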


But even with this condensing, the Tensor Cores can still suffer heavy underutilization depending on the sparsity of A. Even so, Tensor Cores deliver much higher raw throughput than CUDA cores.
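To make the underutilization concrete, a toy estimate (assumed numbers, not measured): a 16×8 Tensor Core tile performs all 128 MACs regardless of how many entries are actually non-zero, so useful work is just the tile's density.

```python
def tc_utilization(nnz_in_tile, tile_elems=16 * 8):
    # Fraction of the tile's MACs that touch a non-zero of A.
    # The Tensor Core executes all tile_elems MACs either way.
    return nnz_in_tile / tile_elems

print(tc_utilization(16))  # 0.125 -> only 12.5% of the MACs are useful
```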


So CUDA cores can skip all the zeros, but must pay heavy index-calculation overhead. Tensor Cores avoid that overhead, but may do unnecessary computation, including multiplies by zero.

1) CUDA core
CUDA Core time ∝ NNZ × (compute + index overhead)

2) Tensor core
Tensor Core time ∝ NumTCBlocks × (TC operation)
D_frag[16×8] = A_frag[16×8] × B_frag[8×8] + C_frag[16×8]
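The two cost models above can be sketched as simple functions (illustrative constants, not measured numbers): CUDA-core time grows with NNZ plus a per-element index overhead, while Tensor-core time grows with the number of 16×8×8 TC blocks launched, zeros included.

```python
def cuda_core_time(nnz, t_compute=1.0, t_index=1.0):
    # Per non-zero: one MAC plus index arithmetic to locate it.
    return nnz * (t_compute + t_index)

def tensor_core_time(num_tc_blocks, t_mma=8.0):
    # Per TC block: one fixed-cost dense MMA, zeros computed anyway.
    return num_tc_blocks * t_mma

# A 16x8 A-fragment at 25% density: 32 non-zeros on CUDA cores,
# versus a single dense m16n8k8 MMA on Tensor Cores.
nnz = int(16 * 8 * 0.25)
print(cuda_core_time(nnz))      # 64.0
print(tensor_core_time(1))      # 8.0
```

With these (assumed) constants the dense MMA wins even at 25% density, which matches the note that Tensor Cores give much higher throughput despite wasted work; the crossover point depends entirely on the real per-op costs.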

problems

(observation 2) Low density of TC blocks