Summary

Previous work: current Tensor Cores support only weight sparsity (cf. NVIDIA's 2:4 pruning, which is applied offline).
This work's approach: supports activation sparsity on the Tensor Core (first in a GPU).
Challenges:
  1. Activation sparsity is unpredictable, so it must be handled online (we prefer to do it online).
  2. Cost of im2col: dominated by shared-memory accesses (we don't want to use the register file).
Solution scheme: bitmap-based SpGEMM — an outer-product-based operation (dense multiplication plus gather/scatter-based accumulation).
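To make the offline/online contrast concrete, here is a minimal sketch of NVIDIA-style 2:4 structured weight pruning (keep the 2 largest-magnitude weights in every group of 4). This magnitude-sorting step is cheap offline for weights, but activations only appear at run time, which is why this note's scheme must handle sparsity online. The function name and list-based representation are illustrative, not from the paper.

```python
def prune_2to4(weights):
    """Offline 2:4 structured pruning sketch: in every group of 4
    consecutive weights, keep the 2 with the largest magnitude and
    zero out the other 2."""
    pruned = []
    for i in range(0, len(weights), 4):
        group = weights[i:i + 4]
        # indices of the two largest-magnitude entries in this group
        keep = sorted(range(len(group)),
                      key=lambda j: abs(group[j]), reverse=True)[:2]
        pruned.extend(w if j in keep else 0.0 for j, w in enumerate(group))
    return pruned

print(prune_2to4([0.1, -0.9, 0.05, 0.4]))  # → [0.0, -0.9, 0.0, 0.4]
```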

Background

SpCONV

cuDNN applies im2col for CONV on Tensor Cores. cuDNN uses *implicit* im2col so as not to expand the memory footprint. (Implicit im2col: the original feature map stays in global memory, and elements are gathered into on-chip memory via address calculation.)

im2col is mainly applied to activations (guess: weights can be processed offline). So far, works that exploit weight sparsity still perform im2col on the dense format.
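For reference, a minimal *explicit* im2col for a single-channel 2-D feature map (stride 1, no padding). cuDNN's implicit variant never materializes this matrix; it computes the same gather addresses on the fly. This toy version is my own sketch, not cuDNN's implementation.

```python
def im2col(fmap, kh, kw):
    """Explicit im2col sketch: each output row is one flattened kh*kw
    receptive field, so convolution becomes a dense matrix multiply
    with the flattened kernel. Stride 1, no padding."""
    h, w = len(fmap), len(fmap[0])
    cols = []
    for y in range(h - kh + 1):
        for x in range(w - kw + 1):
            # gather one patch into a flat row (implicit im2col would
            # compute these (y+dy, x+dx) addresses on the fly instead)
            cols.append([fmap[y + dy][x + dx]
                         for dy in range(kh) for dx in range(kw)])
    return cols

# 3x3 feature map, 2x2 kernel → four 2x2 patches as rows
print(im2col([[1, 2, 3], [4, 5, 6], [7, 8, 9]], 2, 2))
# → [[1, 2, 4, 5], [2, 3, 5, 6], [4, 5, 7, 8], [5, 6, 8, 9]]
```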

Proposed Work

BITMAP-based SpGEMM

(figure: Fig. 3)

Under the current Tensor Core design, sparsity in B (the activation matrix) causes under-utilization of the Tensor Core ("Not used" in Fig. 3(c)), which damages the parallelism of the dot products.

Prior ASIC papers proposed several methods, but the cost of their peripheral hardware is a considerable overhead for Tensor Cores (proportional to the large TC die size).

(figure: Fig. 4)

If the sparse vectors are condensed, the vector-vector outer products become dense (full utilization, Fig. 4(c)), and the unnecessary computations on zeros are skipped.
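The condense-then-scatter idea can be sketched as follows: each column of A and row of B is stored as a bitmap of nonzero positions plus a packed value array; the multiplication is dense on the condensed values (full utilization), and the bitmaps drive the gather/scatter accumulation back into dense C. This is my own simplified reconstruction of the scheme, not the paper's kernel.

```python
def to_bitmap(vec):
    """Condense a sparse vector: bitmap of nonzero positions + packed values."""
    bitmap = [v != 0 for v in vec]
    values = [v for v in vec if v != 0]
    return bitmap, values

def spgemm_outer(A_cols, B_rows, m, n):
    """Bitmap-based outer-product SpGEMM sketch: C += a_k ⊗ b_k over k,
    where a_k (k-th column of A, length m) and b_k (k-th row of B,
    length n) are condensed via their bitmaps before multiplying."""
    C = [[0.0] * n for _ in range(m)]
    for a_col, b_row in zip(A_cols, B_rows):
        a_bm, a_val = to_bitmap(a_col)
        b_bm, b_val = to_bitmap(b_row)
        a_idx = [i for i, bit in enumerate(a_bm) if bit]
        b_idx = [j for j, bit in enumerate(b_bm) if bit]
        # dense outer product on the condensed values only,
        # scattered into C at the positions the bitmaps encode
        for ci, i in enumerate(a_idx):
            for cj, j in enumerate(b_idx):
                C[i][j] += a_val[ci] * b_val[cj]
    return C

# A (3x2) given as columns, B (2x3) given as rows; C = A @ B
print(spgemm_outer([[1, 0, 0], [0, 2, 0]], [[3, 0, 4], [0, 5, 0]], 3, 3))
# → [[3.0, 0.0, 4.0], [0.0, 10.0, 0.0], [0.0, 0.0, 0.0]]
```

Note how zero entries of a_k and b_k never enter the inner loops, which is exactly the "unnecessary computations skipped" point above.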