So, this is not really a blog post, just a quick illustration of why [B, H, T, Hd] is preferred over [B, T, H, Hd] in the attention mechanism.
B = 1 (batch_size)
H = 2 (number of heads)
T = 2 (tokens→ “cat”, “ate”)
Hd = 2 (dimensions per head)
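With a row-major (C-order) layout, which NumPy and PyTorch use by default, the flat offset of element (b, t, h, d) follows directly from the shape. A minimal sketch with the toy sizes above; the helper names `offset_bthd` and `offset_bhtd` are mine, not from any library:

```python
B, H, T, Hd = 1, 2, 2, 2

def offset_bthd(b, t, h, d):
    # Row-major flat offset for shape [B, T, H, Hd]
    return ((b * T + t) * H + h) * Hd + d

def offset_bhtd(b, h, t, d):
    # Row-major flat offset for shape [B, H, T, Hd]
    return ((b * H + h) * T + t) * Hd + d

# head0 in [B, T, H, Hd] lands at scattered offsets:
print([offset_bthd(0, t, 0, d) for t in range(T) for d in range(Hd)])  # [0, 1, 4, 5]
# head0 in [B, H, T, Hd] is one contiguous run:
print([offset_bhtd(0, 0, t, d) for t in range(T) for d in range(Hd)])  # [0, 1, 2, 3]
```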
Shorthand: head0_dim0 → h0_d0, head1_dim1 → h1_d1, and so on.
Positions are flat offsets (0 through 7) into the contiguous buffer.
With [B, T, H, Hd] (token-major), memory is laid out:
cat_h0_d0 cat_h0_d1 cat_h1_d0 cat_h1_d1 ate_h0_d0 ate_h0_d1 ate_h1_d0 ate_h1_d1
To gather head0 you must read the scattered positions [0, 1, 4, 5];
head1 lives at [2, 3, 6, 7]. Both are strided, non-contiguous accesses.
With [B, H, T, Hd] (head-major), the same values reorder so each head is contiguous, head0 at positions [0, 1, 2, 3] and head1 at [4, 5, 6, 7]:
cat_h0_d0 cat_h0_d1 ate_h0_d0 ate_h0_d1 cat_h1_d0 cat_h1_d1 ate_h1_d0 ate_h1_d1
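The whole walkthrough can be checked in NumPy. A sketch using string labels instead of real activations, so the memory order is visible directly:

```python
import numpy as np

# Toy tensor from the walkthrough: B=1, T=2 ("cat", "ate"), H=2, Hd=2.
bthd = np.array([  # shape (B, T, H, Hd)
    [[["cat_h0_d0", "cat_h0_d1"], ["cat_h1_d0", "cat_h1_d1"]],
     [["ate_h0_d0", "ate_h0_d1"], ["ate_h1_d0", "ate_h1_d1"]]]
])

flat = list(bthd.ravel())
print(flat)
# ['cat_h0_d0', 'cat_h0_d1', 'cat_h1_d0', 'cat_h1_d1',
#  'ate_h0_d0', 'ate_h0_d1', 'ate_h1_d0', 'ate_h1_d1']

# Slicing out head0 touches scattered flat positions:
head0 = bthd[0, :, 0, :].ravel()
print([flat.index(v) for v in head0])  # [0, 1, 4, 5]

# Transposing to [B, H, T, Hd] is only a view; no data moves yet:
bhtd = bthd.transpose(0, 2, 1, 3)
print(bhtd.flags["C_CONTIGUOUS"])  # False

# Materializing the transpose lays each head out contiguously:
packed = np.ascontiguousarray(bhtd)
print(list(packed.ravel()))
# ['cat_h0_d0', 'cat_h0_d1', 'ate_h0_d0', 'ate_h0_d1',
#  'cat_h1_d0', 'cat_h1_d1', 'ate_h1_d0', 'ate_h1_d1']
```

This is why attention code typically pays for one transpose-plus-copy up front (e.g. `.transpose(...).contiguous()` in PyTorch): every subsequent per-head matmul then reads a contiguous block.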