The WGMMA instruction, introduced in the Hopper architecture, is executed by a warp group (4 consecutive warps) and operates on larger operands for better data reuse. It can also read matrix B (and sometimes matrix A) directly from SMEM instead of keeping both operands in the register file, although the accumulator C must still be held in registers. Moreover, the tensor core accesses SMEM asynchronously, buffering operand data in a FIFO queue so that SMEM accesses overlap with tensor core execution and the SMEM access latency is hidden.
Hopper also adds a dedicated copy engine, the TMA (Tensor Memory Accelerator), whose asynchronous bulk-copy instructions move operand tiles from GMEM to SMEM. With this approach, GMEM-to-SMEM data movement can be overlapped with computation.
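The benefit of overlapping asynchronous copies with computation can be illustrated with a simple timing model. This is a hedged sketch: the stage count and the per-tile copy and compute times below are illustrative assumptions, not measured values.

```python
# Simple timing model for overlapping async GMEM->SMEM copies with compute.
# All numbers (stages, t_copy, t_compute) are illustrative assumptions.

def serial_time(stages, t_copy, t_compute):
    # No overlap: each stage first copies its tile, then computes on it.
    return stages * (t_copy + t_compute)

def pipelined_time(stages, t_copy, t_compute):
    # Double-buffered: after the first copy, the copy for stage i+1 runs
    # while stage i computes, so the longer of the two dominates each step.
    return t_copy + stages * max(t_copy, t_compute)

print(serial_time(8, 3, 5))     # 8 * (3 + 5) = 64
print(pipelined_time(8, 3, 5))  # 3 + 8 * 5   = 43
```

When the copy time is fully hidden under compute, the pipeline pays the copy latency only once, for the first tile.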
The tensor core is constrained by register file size and bandwidth, so large matrix operations must be decomposed into many fine-grained instructions. This incurs high instruction-management overhead, I-cache pressure, and address-generation energy consumption.
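The instruction-count gap between fine-grained and warp-group MMA can be shown with simple tiling arithmetic. This is a hedged sketch: m16n8k16 (mma.sync) and m64n256k16 (wgmma) are real PTX tile shapes, but the 64x256x16 working tile is an illustrative assumption.

```python
# Tiling arithmetic: how many fixed-shape MMA instructions cover one tile.
# Shapes m16n8k16 and m64n256k16 are real PTX shapes; the 64x256x16
# working tile is an illustrative assumption.

def mma_count(M, N, K, m, n, k):
    """Number of (m, n, k)-shaped MMA instructions needed to cover an
    (M, N, K) tile, assuming exact divisibility."""
    assert M % m == 0 and N % n == 0 and K % k == 0
    return (M // m) * (N // n) * (K // k)

fine = mma_count(64, 256, 16, 16, 8, 16)      # per-warp mma.sync.m16n8k16
coarse = mma_count(64, 256, 16, 64, 256, 16)  # warp-group wgmma.m64n256k16

print(fine, coarse)  # 128 1
```

One warp-group instruction replaces 128 fine-grained ones for this tile, which is where the instruction-management and I-cache savings come from.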
There is also no cross-core data sharing among tensor cores: register files are private to each core's tensor core, so data held in one core's registers cannot be reused by another. As a result, a large matrix operation is split into small fragments across cores with no data reuse between them.
Instead of many SIMT-coupled tensor cores, Virgo builds one large matrix multiplication unit driven by MIMD-style instructions. Because each instruction operates on a larger operand, the total instruction count drops. The unit loads its operands directly from SMEM, so it needs no register file, and avoiding register-file accesses saves the energy of toggling those structures.
Also, because the large unit operates on bigger operands, the reuse opportunity grows, eliminating the cross-core no-reuse problem of SIMT-coupled tensor core designs.
Based on these factors, Virgo saves 67.3% and 24.2% of energy relative to the Ampere-style and Hopper-style architectures, respectively.
Because the large matrix unit is decoupled from the SIMT cores, the two can operate independently. This heterogeneity creates an opportunity to overlap the execution of different kernels, such as softmax and matmul, which also boosts performance.
Virgo adds one cluster-level matrix unit, a small accumulator SRAM, and SMEM interconnect support, while removing the SIMT-coupled tensor cores and their control units. One might expect the total area to grow, but the authors report a -0.1% area change relative to the Volta-style architecture, i.e., essentially area-neutral.