Summary 1 : general

workload

Tensor Core - GNN

background


The baseline TC-GNN SpMM algorithm condenses the sparse matrix A by packing its non-zero columns to the left, then gathers the corresponding rows of dense B so both operands fit into dense Tensor Core tiles.
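A minimal NumPy sketch of this condensing idea (my own illustration, not TC-GNN's actual kernel): keep only the columns of a sparse row window of A that contain non-zeros, and gather the matching rows of B, so the product is computed over a smaller, denser tile.

```python
import numpy as np

def condense_window(A_window, B):
    # Columns of this row window of A that hold at least one non-zero.
    nz_cols = np.flatnonzero(A_window.any(axis=0))
    A_cond = A_window[:, nz_cols]   # non-zero columns packed to the left
    B_gathered = B[nz_cols, :]      # rows of B matching the kept columns
    return A_cond, B_gathered, nz_cols

A_window = np.array([[0., 2., 0., 1.],
                     [0., 0., 0., 3.]])
B = np.arange(8.).reshape(4, 2)
A_cond, B_g, cols = condense_window(A_window, B)
# A_cond @ B_g equals A_window @ B, but over fewer (denser) columns,
# which is what lets the tile feed a dense Tensor Core MMA.
```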


But even with this condensing, the Tensor Cores can still suffer heavy underutilization depending on the sparsity of A. Even so, Tensor Cores deliver much higher raw throughput than CUDA cores.
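To make the underutilization concrete, a toy estimate (assumed numbers, not measured): a 16×8 Tensor Core tile performs all 128 MACs regardless of how many entries are actually non-zero, so useful work is just the tile's density.

```python
def tc_utilization(nnz_in_tile, tile_elems=16 * 8):
    # Fraction of the tile's MACs that touch a non-zero of A.
    # The Tensor Core executes all tile_elems MACs either way.
    return nnz_in_tile / tile_elems

print(tc_utilization(16))  # 0.125 -> only 12.5% of the MACs are useful
```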


So CUDA cores can skip all the zeros, but must pay heavy index-calculation overhead. Tensor Cores avoid that overhead, but may do unnecessary computation, including multiplies by zero.

1) CUDA core
CUDA Core time ∝ NNZ × (compute + index overhead)

2) Tensor core
Tensor Core time ∝ NumTCBlocks × (TC operation)
D_frag[16×8] = A_frag[16×8] × B_frag[8×8] + C_frag[16×8]
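The two cost models above can be sketched as simple functions (illustrative constants, not measured numbers): CUDA-core time grows with NNZ plus a per-element index overhead, while Tensor-core time grows with the number of 16×8×8 TC blocks launched, zeros included.

```python
def cuda_core_time(nnz, t_compute=1.0, t_index=1.0):
    # Per non-zero: one MAC plus index arithmetic to locate it.
    return nnz * (t_compute + t_index)

def tensor_core_time(num_tc_blocks, t_mma=8.0):
    # Per TC block: one fixed-cost dense MMA, zeros computed anyway.
    return num_tc_blocks * t_mma

# A 16x8 A-fragment at 25% density: 32 non-zeros on CUDA cores,
# versus a single dense m16n8k8 MMA on Tensor Cores.
nnz = int(16 * 8 * 0.25)
print(cuda_core_time(nnz))      # 64.0
print(tensor_core_time(1))      # 8.0
```

With these (assumed) constants the dense MMA wins even at 25% density, which matches the note that Tensor Cores give much higher throughput despite wasted work; the crossover point depends entirely on the real per-op costs.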

problems

(observation 2) Low density of TC blocks