| Previous work | This work's approach | Challenge | Solution | Scheme |
|---|---|---|---|---|
| Current Tensor Cores support only weight sparsity (cf. NVIDIA's 2:4 pruning, done offline) | Supports activation sparsity on Tensor Cores (first on a GPU) | Activation sparsity is unpredictable — it is only known at runtime, so it cannot be pruned offline like weight sparsity | Bitmap-based SpGEMM | Outer-product-based operation: dense multiplication plus gather/scatter-based accumulation |
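To make the "bitmap-based" part of the solution column concrete, here is a minimal sketch (my own illustration, not the paper's format) of encoding a sparse vector as a bitmap plus its condensed nonzeros; a real kernel would pack the bitmap into machine words rather than a boolean array:

```python
import numpy as np

def bitmap_encode(v):
    """Encode a sparse vector as (bitmap, condensed nonzeros).
    The bitmap records which lanes are nonzero; the condensed array
    holds only the nonzero values, back to back."""
    bitmap = v != 0
    return bitmap, v[bitmap]

def bitmap_decode(bitmap, nz):
    """Scatter the condensed nonzeros back into a dense vector
    using the positions recorded in the bitmap."""
    out = np.zeros(bitmap.shape, dtype=nz.dtype)
    out[bitmap] = nz
    return out
```

Unlike 2:4 structured sparsity, this metadata is cheap to produce at runtime, which is what makes it usable for unpredictable activation sparsity.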
cuDNN applies im2col to run CONV on Tensor Cores, and it uses *implicit* im2col so as not to expand the memory footprint. (Implicit im2col: the original feature map stays in global memory, and its elements are gathered into on-chip memory via address calculation.)
The im2col transform is mainly applied to activations (GUESS: weights can be processed offline). So far, schemes that exploit weight sparsity still run im2col on a dense format.
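A toy analogue of the implicit-im2col idea (illustrative only; the function name and shapes are my assumptions, not cuDNN's API): each im2col column is gathered on the fly by index arithmetic instead of materializing the full im2col matrix in memory.

```python
import numpy as np

def implicit_im2col_matmul(fmap, weights, kh, kw):
    """Multiply the im2col expansion of `fmap` by flattened filters
    WITHOUT storing the full im2col matrix: each column is gathered
    from the original feature map via address calculation.
    fmap:    (C, H, W) input feature map
    weights: (K, C*kh*kw) filters, one flattened filter per row
    returns: (K, OH*OW) valid (no-padding, stride-1) convolution"""
    C, H, W = fmap.shape
    OH, OW = H - kh + 1, W - kw + 1
    out = np.zeros((weights.shape[0], OH * OW))
    for oy in range(OH):
        for ox in range(OW):
            # Gather one im2col column on the fly (the "implicit" part).
            col = fmap[:, oy:oy + kh, ox:ox + kw].reshape(-1)
            out[:, oy * OW + ox] = weights @ col
    return out
```

The result matches multiplying by an explicitly materialized im2col matrix; only the storage strategy differs.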

Under the current Tensor Core design, sparsity in B (the activation matrix) causes under-utilization of the Tensor Core ('Not used' in Fig 3(c)): zero operands still occupy multiplier lanes, which damages the parallelism of the dot products.
Prior ASIC papers proposed several methods to exploit such sparsity, but the cost of their peripheral hardware would be a considerable overhead on Tensor Cores (it scales with the already large TC die size).

If the sparse vectors are condensed, the vector-vector outer product operates only on nonzeros and runs at full utilization (Fig 4(c)); the unnecessary zero-operand computations are skipped entirely.
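The condensed outer-product dataflow can be sketched as follows (a NumPy illustration of the idea in the notes, not the paper's actual kernel): for each k, the column of A and row of B are condensed to their nonzeros, the dense outer product of the condensed vectors is computed at full utilization, and indices recovered from the bitmaps scatter-accumulate the partial products into C.

```python
import numpy as np

def condensed_outer_spgemm(A, B):
    """Bitmap-based outer-product SpGEMM sketch: C = A @ B, computed as
    a sum of condensed outer products with gather/scatter accumulation."""
    M, K = A.shape
    _, N = B.shape
    C = np.zeros((M, N))
    for k in range(K):
        # Bitmaps mark the nonzero lanes of this column/row pair.
        row_bm = A[:, k] != 0
        col_bm = B[k, :] != 0
        if not row_bm.any() or not col_bm.any():
            continue                      # whole outer product skipped
        # Condense: keep only nonzeros, so every multiplier lane is used.
        a_nz = A[row_bm, k]
        b_nz = B[k, col_bm]
        partial = np.outer(a_nz, b_nz)    # dense multiply at full util
        # Scatter-accumulate using indices recovered from the bitmaps.
        rows = np.flatnonzero(row_bm)
        cols = np.flatnonzero(col_bm)
        C[np.ix_(rows, cols)] += partial
    return C
```

The multiply stage stays dense (Tensor Core friendly), and all irregularity is pushed into the gather/scatter accumulation, matching the "scheme" column of the table above.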