What I do know

Without wgmma/TMA, Tensor Cores (TC) can only reach 63% of peak performance.

wgmma is important for hiding SMEM↔register memory access and for extending the memory space available to operands by using SMEM.

TMA is important for hiding GMEM↔SMEM memory access.

What I don’t know

How TMA works and overlaps with compute (who decodes it, when it is executed)

How wgmma async works and overlaps with other work (e.g., softmax)

How to program it (their kernels)