So, this is not really a blog post, just a quick illustration of why [B, H, T, Hd] is preferred over [B, T, H, Hd] in the attention mechanism.
B = 1 (batch_size)
H = 2 (number of heads)
T = 2 (tokens→ “cat”, “ate”)
Hd = 2 (dimensions per head)
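With a row-major (C-order) layout, which NumPy and PyTorch use by default, the flat offset of element (b, t, h, d) follows directly from the shape. A minimal sketch with the toy sizes above; the helper names `offset_bthd` and `offset_bhtd` are mine, not from any library:

```python
B, H, T, Hd = 1, 2, 2, 2

def offset_bthd(b, t, h, d):
    # Row-major flat offset for shape [B, T, H, Hd]
    return ((b * T + t) * H + h) * Hd + d

def offset_bhtd(b, h, t, d):
    # Row-major flat offset for shape [B, H, T, Hd]
    return ((b * H + h) * T + t) * Hd + d

# head0 in [B, T, H, Hd] lands at scattered offsets:
print([offset_bthd(0, t, 0, d) for t in range(T) for d in range(Hd)])  # [0, 1, 4, 5]
# head0 in [B, H, T, Hd] is one contiguous run:
print([offset_bhtd(0, 0, t, d) for t in range(T) for d in range(Hd)])  # [0, 1, 2, 3]
```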
Shorthand: head0_dim0 → h0_d0, head1_dim1 → h1_d1, and so on.
Positions are flat offsets (0 through 7) into the contiguous buffer.
With [B, T, H, Hd] (token-major), memory is laid out:
cat_h0_d0 cat_h0_d1 cat_h1_d0 cat_h1_d1 ate_h0_d0 ate_h0_d1 ate_h1_d0 ate_h1_d1
To gather head0 you must read the scattered positions [0, 1, 4, 5];
head1 lives at [2, 3, 6, 7]. Both are strided, non-contiguous accesses.
With [B, H, T, Hd] (head-major), the same values reorder so each head is contiguous, head0 at positions [0, 1, 2, 3] and head1 at [4, 5, 6, 7]:
cat_h0_d0 cat_h0_d1 ate_h0_d0 ate_h0_d1 cat_h1_d0 cat_h1_d1 ate_h1_d0 ate_h1_d1
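The whole walkthrough can be checked in NumPy. A sketch using string labels instead of real activations, so the memory order is visible directly:

```python
import numpy as np

# Toy tensor from the walkthrough: B=1, T=2 ("cat", "ate"), H=2, Hd=2.
bthd = np.array([  # shape (B, T, H, Hd)
    [[["cat_h0_d0", "cat_h0_d1"], ["cat_h1_d0", "cat_h1_d1"]],
     [["ate_h0_d0", "ate_h0_d1"], ["ate_h1_d0", "ate_h1_d1"]]]
])

flat = list(bthd.ravel())
print(flat)
# ['cat_h0_d0', 'cat_h0_d1', 'cat_h1_d0', 'cat_h1_d1',
#  'ate_h0_d0', 'ate_h0_d1', 'ate_h1_d0', 'ate_h1_d1']

# Slicing out head0 touches scattered flat positions:
head0 = bthd[0, :, 0, :].ravel()
print([flat.index(v) for v in head0])  # [0, 1, 4, 5]

# Transposing to [B, H, T, Hd] is only a view; no data moves yet:
bhtd = bthd.transpose(0, 2, 1, 3)
print(bhtd.flags["C_CONTIGUOUS"])  # False

# Materializing the transpose lays each head out contiguously:
packed = np.ascontiguousarray(bhtd)
print(list(packed.ravel()))
# ['cat_h0_d0', 'cat_h0_d1', 'ate_h0_d0', 'ate_h0_d1',
#  'cat_h1_d0', 'cat_h1_d1', 'ate_h1_d0', 'ate_h1_d1']
```

This is why attention code typically pays for one transpose-plus-copy up front (e.g. `.transpose(...).contiguous()` in PyTorch): every subsequent per-head matmul then reads a contiguous block.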