Dynamic Warp Formation and Scheduling for Efficient GPU Control Flow
Yang-gon Kim, 251023
The typical applications that GPUs had targeted (e.g., texturing) do not need branch instructions, but this paper targets non-graphics applications.
The benefit of SIMD compared to MIMD is that SIMD amortizes the data-independent logic (instruction fetch, decode, and control) across many lanes, saving a lot of area.
There are two different ways of hiding latency:

| | Traditional microprocessor (CPU) | GPU |
|---|---|---|
| Latency-hiding approach | Out-of-order execution | Interleaved thread execution |
By the nature of SIMD cores, instructions need to read their operands in parallel from a highly banked register file.
A barrel processor hides memory-access latency by interleaving different thread groups: if one thread group stalls on a memory access, another thread group takes over the SIMD core to maintain throughput.
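The effect of this interleaving can be illustrated with a toy simulation (a sketch only; the warp count, memory latency, and instruction counts below are made-up illustrative numbers, not the paper's parameters):

```python
# Toy model of barrel-style latency hiding: warps take turns issuing;
# a warp stalled on memory is skipped, so the core can stay busy.

def simulate(num_warps, mem_latency, instructions_per_warp):
    """Fraction of cycles the core issues work under round-robin interleaving."""
    # Simplification: every instruction triggers a memory access that stalls
    # its warp for mem_latency cycles before it may issue again.
    ready_at = [0] * num_warps
    remaining = [instructions_per_warp] * num_warps
    cycle, busy = 0, 0
    while any(r > 0 for r in remaining):
        for w in range(num_warps):              # round-robin scan for a ready warp
            if remaining[w] > 0 and ready_at[w] <= cycle:
                remaining[w] -= 1
                ready_at[w] = cycle + mem_latency  # this warp now stalls on memory
                busy += 1
                break
        cycle += 1
    return busy / cycle

print(simulate(num_warps=1, mem_latency=8, instructions_per_warp=10))  # mostly idle
print(simulate(num_warps=8, mem_latency=8, instructions_per_warp=10))  # 1.0
```

With a single thread group the core idles during every memory stall; with enough groups to cover the latency, some group is always ready and utilization reaches 100%.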
With statically formed warps, each thread only ever accesses one specific bank of the register file.
Some applications have been successfully mapped onto GPU HW, but they have not improved much because control flow is difficult to handle. The authors found that naive handling of control divergence (which never reconverges diverged threads) damages the utilization of the SIMD cores.
Also, the authors' assumption in this paper is that most general-purpose applications tend to have far more diverse control flow than most existing graphics rendering routines. But I don't really understand how the authors persuaded the committee with this assumption about non-existing workloads. The paper was published in 2007, when there may have been no workloads with much divergence. Today the authors' assumption might be true, but is it really persuasive to appeal to futuristic workloads to emphasize the impact of the paper?
A naive approach to handling branch divergence is to serialize thread execution when a warp encounters a divergent branch, but this leads to very low SIMD core utilization.
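The utilization loss from serialization can be quantified with a small sketch (the warp width and path lengths are illustrative assumptions, not the paper's numbers):

```python
# Serializing a divergent if/else: the taken and not-taken paths of one warp
# execute one after the other, each with only part of the lanes active.

WARP_WIDTH = 32

def utilization(taken, then_len, else_len):
    """Average lane utilization across a serialized if/else region.
    `taken` = number of lanes that took the branch."""
    cycles = active = 0
    if taken:                           # then-path runs only if some lane took it
        cycles += then_len
        active += taken * then_len
    if taken < WARP_WIDTH:              # else-path runs only if some lane fell through
        cycles += else_len
        active += (WARP_WIDTH - taken) * else_len
    return active / (WARP_WIDTH * cycles)

# Half the warp takes each path, both paths equally long: 50% utilization.
print(utilization(16, then_len=4, else_len=4))   # 0.5
# No divergence at all: full utilization.
print(utilization(32, then_len=4, else_len=4))   # 1.0
```

Note that the diverged warp spends twice the cycles of a uniform warp on the same region while keeping only half the lanes busy at any time.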
Also, re-convergence after divergence has a cost: the hardware must know which instruction is the re-convergence point in order to recover the full utilization lost after divergence.
For the re-convergence point, the authors use the immediate post-dominator of the branch. These immediate post-dominators are found at compile time as part of control-flow analysis.
To fully leverage the re-convergence point, the authors describe a hardware mechanism, Dynamic Warp Formation (DWF), to improve performance. DWF creates new warps by combining diverged sub-warps from different warps in the same thread block. This increases HW utilization, which boosts IPC.
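The core regrouping idea can be sketched as follows (this is an illustration of the concept, not the paper's exact hardware algorithm; the warp width of 4 and the thread/PC assignments are made up):

```python
# Dynamic Warp Formation in miniature: after a branch, threads from
# different warps that arrived at the same PC are repacked into fuller warps.
from collections import defaultdict

WARP_WIDTH = 4

def form_warps(threads):
    """threads: list of (thread_id, next_pc). Returns (pc, lanes) warps."""
    by_pc = defaultdict(list)
    for tid, pc in threads:
        by_pc[pc].append(tid)           # pool threads by where they branched to
    warps = []
    for pc, tids in by_pc.items():
        # Pack threads at the same PC into warps of up to WARP_WIDTH lanes.
        for i in range(0, len(tids), WARP_WIDTH):
            warps.append((pc, tids[i:i + WARP_WIDTH]))
    return warps

# Two half-diverged static warps: in each, half went to PC 100, half to 200.
threads = [(0, 100), (1, 100), (2, 200), (3, 200),
           (4, 100), (5, 100), (6, 200), (7, 200)]
print(form_warps(threads))
# Instead of four half-empty issue slots, DWF yields two full warps:
# [(100, [0, 1, 4, 5]), (200, [2, 3, 6, 7])]
```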
The most important part of DWF is the mapping between the register file and each thread's lane. If a dynamically placed thread can access any bank of the register file, a crossbar is needed. Even with the crossbar, bank conflicts can occur when several threads in a dynamically formed warp access the same bank.
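The conflict problem can be made concrete with a sketch (the bank count of 4 and the thread IDs are illustrative assumptions): with static warps, thread i sits in lane i and reads bank i, so a warp-wide read touches each bank once; a dynamically formed warp can place two threads whose registers live in the same bank, serializing the accesses even with a crossbar.

```python
# Count the register-read serialization cost of a warp: accesses to the same
# bank must be serialized, so the cost is the max number of threads whose
# home bank coincides.
from collections import Counter

NUM_BANKS = 4

def read_cycles(warp):
    """Cycles needed to read one operand for every thread in the warp."""
    home_banks = Counter(tid % NUM_BANKS for tid in warp)  # static home bank
    return max(home_banks.values())

print(read_cycles([0, 1, 2, 3]))   # static warp: banks 0,1,2,3 -> 1 cycle
print(read_cycles([0, 4, 1, 2]))   # DWF warp: threads 0 and 4 share bank 0 -> 2
```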