https://hazyresearch.stanford.edu/blog/2024-05-12-tk

What I do know

Without wgmma/TMA, the tensor cores (TC) reach only about 63% of peak performance.
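To put the 63% figure in absolute terms, here is a small arithmetic sketch. The peak value is an assumption: roughly 989 TFLOPS is the dense BF16 tensor core peak NVIDIA lists for the H100 SXM, and the helper name `achieved_tflops` is made up for illustration.

```python
# Assumption: approximate dense BF16 tensor core peak of an H100 SXM, in TFLOPS.
H100_BF16_DENSE_PEAK_TFLOPS = 989.0


def achieved_tflops(peak_tflops: float, utilization: float) -> float:
    """Throughput reached at a given fraction of peak (hypothetical helper)."""
    return peak_tflops * utilization


# 63% utilization is the ceiling quoted above for kernels without wgmma/TMA.
print(round(achieved_tflops(H100_BF16_DENSE_PEAK_TFLOPS, 0.63)))  # prints 623
```

So under that assumed peak, roughly 366 TFLOPS would be left unused.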

wgmma is important because it hides SMEM↔register memory access and lets the tensor cores read operands directly from SMEM, which expands the usable memory space beyond the registers.

TMA is important because it hides GMEM↔SMEM memory access.
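What "hiding" a memory access means can be sketched with a toy timing model. This is not how TMA or wgmma are programmed. It only compares a loop that loads and then computes each tile in sequence against a double-buffered loop where the load of the next tile overlaps the compute of the current one. The function names and the per-tile costs are assumptions for illustration.

```python
def serialized_time(n_tiles: int, load: int, compute: int) -> int:
    """Every tile is loaded, then computed, with no overlap."""
    return n_tiles * (load + compute)


def pipelined_time(n_tiles: int, load: int, compute: int) -> int:
    """Double buffering: load of tile i+1 overlaps compute of tile i.

    The first load and the last compute cannot be overlapped; each step
    in between costs whichever of the two is slower.
    """
    if n_tiles == 0:
        return 0
    return load + (n_tiles - 1) * max(load, compute) + compute


# Hypothetical costs: 2 time units to load a tile, 3 to compute on it.
print(serialized_time(4, 2, 3))  # prints 20
print(pipelined_time(4, 2, 3))   # prints 14
```

When compute is the slower stage, the load time disappears from the steady state, which is the sense in which the memory access is hidden.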

What I don’t know

How TMA works and overlaps with compute (who decodes it, and when it is executed)

How async wgmma works and overlaps with other work (e.g. softmax)

How to program it (their kernels)

New Knowledge

Shared Memory

Flash Attention Kernel