Notes based on a conversation with ChatGPT.

TL;DR. WASP co-designs hardware and an automatic compiler pass to turn alternating “memory/compute phases” in GPU kernels into concurrent pipeline stages (e.g., dedicated memory-stage warps feeding compute-stage warps). It adds stage-aware scheduling/mapping, per-stage register allocation, fast register-file queues (RFQs), and a small TMA-like offload for streaming/gather. Reported gains are ~1.47× geomean with modest HW cost, mainly by overlapping memory and compute and cutting software queue overheads.



1) Problem the paper tackles

Modern GPUs can’t always hide memory latency: many kernels execute in lockstep phases (“load tile → compute → load tile …”), so all resident warps often stall together on memory or synchronization. Manual warp specialization exists but is brittle and invisible to the hardware.
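The cost of lockstep phases can be put in simple numbers. Below is a toy cost model (illustrative numbers, not from the paper): with `n` tiles, lockstep execution pays memory time plus compute time for every tile, while a two-stage pipeline pays only the slower stage per tile once it reaches steady state.

```python
def lockstep_time(n_tiles, t_mem, t_comp):
    """All warps load, then all warps compute: the phases serialize."""
    return n_tiles * (t_mem + t_comp)

def pipelined_time(n_tiles, t_mem, t_comp):
    """Memory-stage warps feed compute-stage warps; the stages overlap.

    One tile's load fills the pipeline, then each subsequent tile costs
    only the bottleneck stage, and the last tile's compute drains it.
    """
    bottleneck = max(t_mem, t_comp)
    return t_mem + (n_tiles - 1) * bottleneck + t_comp

# Made-up cycle counts, just to show the shape of the win:
print(lockstep_time(8, t_mem=100, t_comp=60))   # 1280
print(pipelined_time(8, t_mem=100, t_comp=60))  # 860
```

The overlap buys nothing per tile beyond hiding the cheaper stage behind the more expensive one, which is exactly why the split only pays off when both phases have meaningful cost.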

[image.png — Typical way of hiding the memory latency using the other warps]

[image.png]


Even though current GPUs can hide memory-access latency with other resident warps, this is only applicable after manually splitting the original warps into stage 0 and stage 1.

Fig. 1(a) can be turned into Fig. 1(b) on today’s GPUs by manually doing warp specialization: you program one warp as a producer (global→SMEM tile moves) and another as a consumer (compute from SMEM), coordinate them with arrive/wait barriers, and double-buffer the SMEM tiles so the baseline scheduler can interleave the two warps. This split is a known, manual technique on current GPUs, not WASP itself.
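The arrive/wait coordination with double-buffered SMEM tiles described above can be modeled on the CPU with two semaphores standing in for the barriers. This is a Python sketch of the pattern, not GPU code; the tile contents, buffer count, and names (`smem`, `N_BUFS`) are made up for illustration.

```python
import threading

N_TILES = 4
N_BUFS = 2                            # double buffering: producer runs one tile ahead
empty = threading.Semaphore(N_BUFS)   # "arrive": a buffer slot is free to fill
full = threading.Semaphore(0)         # "arrive": a tile is ready to consume
smem = [None] * N_BUFS                # stand-in for the shared-memory tile buffers
results = []

def producer():
    """Memory-stage warp: global -> SMEM tile moves."""
    for t in range(N_TILES):
        empty.acquire()                           # wait for a free buffer
        smem[t % N_BUFS] = list(range(t, t + 4))  # fake "tile" of data
        full.release()                            # signal: tile t is ready

def consumer():
    """Compute-stage warp: compute from SMEM."""
    for t in range(N_TILES):
        full.acquire()                            # wait for tile t
        results.append(sum(smem[t % N_BUFS]))     # fake "compute" on the tile
        empty.release()                           # signal: buffer is free again

p = threading.Thread(target=producer)
c = threading.Thread(target=consumer)
p.start(); c.start(); p.join(); c.join()
print(results)  # [6, 10, 14, 18]
```

Because `N_BUFS` is 2, the producer can prefetch tile t+1 while the consumer still works on tile t, which is the whole point of the double buffer; with one buffer the two stages would serialize exactly like the lockstep phases.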

2) Core idea in one sentence

Make warp specialization first-class: automatically split kernels into pipeline stages (often memory producer and compute consumer) and give the SM stage awareness (IDs, scheduling, mapping, registers, queues, and a data-movement offload) so memory and compute overlap reliably.


3) Main mechanisms

3.1 Compiler (who does the split)