**Prerequisites:**
Paper 🔗: https://arxiv.org/pdf/2104.09864
<aside> 🤔 Why do we need Rotary Positional Encoding when the above techniques already exist? </aside>
<aside> 🔍
Core Problem Statement:
Instead of just adding positional embeddings, can't we inject/encode the position information into the attention mechanism itself?
What the equation says → attention should behave the same way for two tokens that are the same distance apart, regardless of where they appear in the sequence.
Explanation of the equation:
Left side: $⟨f_q(x_m, m), f_k(x_n, n)⟩$
The inner product ⟨·, ·⟩ between the query and key vectors should depend only on the relative position difference (m - n), where m and n are the positions of the two tokens, not on their absolute positions. $f_q$ takes a word embedding and a position and returns the query vector; $f_k$ does the same for the key. So this is the attention score between positions m and n.
Key vector (in standard absolute encoding): $W_k(x_m + \text{posEmbed}[m])$, where m is the position and $W_k$ is a learnable weight matrix.
Right side: $g(x_m, x_n, m - n)$
The Mathematical Challenge:
We want to find functions $f_q(x_m, m)$ and $f_k(x_n, n)$ that take a word embedding plus a position,
such that their inner product depends only on the relative position difference (a small numeric check follows this note).
</aside>
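To make the problem concrete, here is a rough toy check (my own sketch, not from the paper; the weights, word embedding, and positions are all made up): with standard additive absolute position embeddings, the query/key score for two tokens at distance 2 changes when the pair is shifted along the sequence, so the score is not a function of (m - n) alone.

```python
# Toy check: with additive absolute position embeddings, q.k is NOT a pure function of (m - n).
import numpy as np

rng = np.random.default_rng(0)
d = 8
W_q = rng.normal(size=(d, d))       # made-up learnable projections
W_k = rng.normal(size=(d, d))
x = rng.normal(size=d)              # the same word embedding reused at every position

def pos_embed(pos):
    # a sinusoidal-style absolute position embedding
    i = np.arange(d // 2)
    freqs = pos / 10000 ** (2 * i / d)
    return np.concatenate([np.sin(freqs), np.cos(freqs)])

def score(m, n):
    q = W_q @ (x + pos_embed(m))
    k = W_k @ (x + pos_embed(n))
    return q @ k

# Same relative distance (2), different absolute positions -> different scores.
print(score(0, 2), score(10, 12))
```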
<aside> 📌 But HOW?
2-D Case Core Intuition
By using complex number representations,
$e^{im\theta}$ → a complex exponential (rotation by angle $m\theta$)
θ is a non-zero constant; if it were zero, no relative distance would ever be captured.
Core Idea → inject positional information by rotating the query and key vectors by an angle proportional to their position index.
Affine transformation: before attention, we apply linear projections (learned matrices) to create the query & key vectors (scroll to top for what a key vector is!)
Let's focus on the proposed form $f_q(x_m, m) = (W_q x_m)\,e^{im\theta}$ and $f_k(x_n, n) = (W_k x_n)\,e^{in\theta}$, and how this is derived 👇
When you multiply complex exponentials: $e^{im\theta} \cdot e^{-in\theta} = e^{i(m-n)\theta}$
The $e^{i(m-n)\theta}$ factor comes from taking the complex conjugate in the dot product: plain multiplication is not a valid inner product for complex vectors, so the dot product conjugates the second vector.
$⟨f_q(x_m, m), f_k(x_n, n)⟩$
$= f_q(x_m, m) \cdot [f_k(x_n, n)]^*$
Taking the conjugate of the 2nd vector:
$[f_k(x_n, n)]^* = (W_k x_n)^*\, e^{-in\theta}$
$⟨f_q, f_k⟩ = [(W_q x_m)\, e^{im\theta}] \cdot [(W_k x_n)^*\, e^{-in\theta}]$
$= (W_q x_m)(W_k x_n)^*\, e^{im\theta}\, e^{-in\theta}$
$= (W_q x_m)(W_k x_n)^*\, e^{i(m-n)\theta}$
Complex Conjugate
z = 3 + 4i → z* = 3 - 4i;  z = 2 - 5i → z* = 2 + 5i
$e^{im\theta} = \cos(m\theta) + i\sin(m\theta)$ → Euler's formula
Now this actually makes sense 👇
In practice (real matrices):
Here, the effect of the conjugate comes from the transpose of the rotation matrix.
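A quick toy check of that last point (my own numbers, nothing from the paper): for real 2×2 rotation matrices, rotating q by mθ and k by nθ before the dot product is the same as rotating only one of them by the relative angle, because $R(m\theta)^\top R(n\theta) = R((n-m)\theta)$, so the transpose plays exactly the role the conjugate played above.

```python
# Toy check: in the real 2-D case, the transpose of the rotation matrix acts like the conjugate.
import numpy as np

def R(angle):
    # standard 2-D rotation matrix
    return np.array([[np.cos(angle), -np.sin(angle)],
                     [np.sin(angle),  np.cos(angle)]])

theta = 0.3
m, n = 7, 3
q = np.array([1.0, 2.0])
k = np.array([0.5, -1.0])

lhs = (R(m * theta) @ q) @ (R(n * theta) @ k)   # rotate both, then take the dot product
rhs = q @ (R((n - m) * theta) @ k)              # rotate only k by the relative angle
print(np.isclose(lhs, rhs))                     # True -> the score depends only on (m - n)
```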
Okay! Fine, but where does this θ come from?
So, isn't this something we already saw in sinusoidal encodings? Yes!
In RoPE, the rotation angle for dimension pair $i$ is $pos \cdot \theta_i$ with $\theta_i = 1/10000^{2i/d}$, and the rotations are applied using sin and cos.
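For intuition, the per-pair angles can be printed directly (a tiny sketch with a made-up head dimension d = 8):

```python
# Per-pair frequencies theta_i = 1 / 10000**(2i/d): the same schedule as sinusoidal encodings.
import numpy as np

d = 8                                   # head dimension (must be even)
i = np.arange(d // 2)                   # each pair of dims (2i, 2i+1) shares one theta
theta = 1.0 / 10000 ** (2 * i / d)
print(theta)                            # -> 1.0, 0.1, 0.01, 0.001
print(5 * theta)                        # rotation angles for the token at position m = 5
```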
Now,
$⟨f_q(x_m, m), f_k(x_n, n)⟩ = \mathrm{Re}[(W_q x_m)(W_k x_n)^*\, e^{i(m-n)\theta}]$ → the inner product depends only on $(m-n)\theta$, i.e., the relative position.
And that’s what we want!
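To convince myself, here is a tiny numeric check of the 2-D complex form (toy numbers of my own): shifting both positions by the same offset leaves the score unchanged, so only (m - n) matters.

```python
# Toy check: treat the 2-D query/key as complex numbers, rotate by position, and verify
# that Re[f_q * conj(f_k)] depends only on the relative position m - n.
import numpy as np

theta = 0.3
q = 1.0 + 2.0j          # stands in for W_q x_m in the 2-D case
k = 0.5 - 1.0j          # stands in for W_k x_n

def rotated_score(m, n):
    fq = q * np.exp(1j * m * theta)
    fk = k * np.exp(1j * n * theta)
    return (fq * np.conj(fk)).real      # Re[f_q * conj(f_k)]

# Relative distance is -3 in both cases, so the two scores are identical.
print(rotated_score(2, 5), rotated_score(12, 15))
```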
</aside>
<aside> 💡 Some Key Insights from the paper
As the distance between two tokens increases, the maximum possible value (upper bound) of the dot product of the rotated vectors decreases.
The farther apart the two tokens are, the less sharply RoPE can preserve their interactions.
This makes sense because words that are closer together have a higher chance of a strong semantic relationship, so nearby positions end up with similar rotations while distant positions end up with very different ones.
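A crude way to see this trend numerically (my own simplification, not the paper's exact bound): as the relative distance r grows, the rotation terms $e^{ir\theta_j}$ across dimensions fall out of phase, so their summed magnitude shrinks on average.

```python
# Rough illustration of long-term decay: dephasing across dimensions as distance grows.
import numpy as np

d = 64
j = np.arange(d // 2)
theta = 1.0 / 10000 ** (2 * j / d)

for r in [1, 10, 100, 1000]:
    magnitude = abs(np.exp(1j * r * theta).sum())
    print(r, round(magnitude, 2))       # the magnitude trends downward as r grows
```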
Something to note as well 👇
The angle by which a dimension rotates depends on both the position and the index $i$. For lower indices, the angle grows quickly as the position increases, whereas for higher indices, increasing the position barely changes the angle!
Can we relate this to sinusoidal encodings again? Yes, same case there! Lower indices oscillate fast across positions, higher indices oscillate slowly.
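A quick look at the numbers (toy values, head dimension d = 64): the lowest index rotates by a full radian per position, while the highest index barely moves.

```python
# How fast the angle m * theta_i grows with position, for a low vs a high index.
import numpy as np

d = 64
theta_low  = 1.0 / 10000 ** (2 * 0 / d)     # i = 0  -> theta = 1.0       (rotates fast)
theta_high = 1.0 / 10000 ** (2 * 31 / d)    # i = 31 -> theta ~ 1.3e-4    (rotates slowly)

for m in range(5):
    print(m, round(m * theta_low, 4), round(m * theta_high, 6))
# low index:  the angle jumps by 1 radian per position -> sin/cos oscillate fast
# high index: the angle barely changes -> sin/cos oscillate slowly, just like sinusoidal encodings
```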
</aside>
<aside> 💡 After digesting the math, let's move on to "how is it actually implemented?"
This is how RoPE injects position information into attention using rotation.
That gave a clear picture!!
</aside>
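For reference, here is a minimal numpy sketch of the rotation step (my own simplification under the usual RoPE conventions; the linked repo below has the actual implementation). Dimensions are grouped into (even, odd) pairs, and pair i of the vector at position m is rotated by the angle m·θ_i.

```python
# Minimal RoPE sketch: rotate each (even, odd) dimension pair of q/k by m * theta_i.
import numpy as np

def rope_rotate(x: np.ndarray, base: float = 10000.0) -> np.ndarray:
    """x: (seq_len, d) query or key matrix with d even; returns the rotated matrix."""
    seq_len, d = x.shape
    i = np.arange(d // 2)
    theta = 1.0 / base ** (2 * i / d)             # (d/2,) per-pair frequencies
    m = np.arange(seq_len)[:, None]               # (seq_len, 1) positions
    angles = m * theta                            # (seq_len, d/2) rotation angles
    cos, sin = np.cos(angles), np.sin(angles)

    x_even, x_odd = x[:, 0::2], x[:, 1::2]        # split into the 2-D pairs
    out = np.empty_like(x)
    out[:, 0::2] = x_even * cos - x_odd * sin     # 2-D rotation of each pair
    out[:, 1::2] = x_even * sin + x_odd * cos
    return out

# Usage: rotate q and k before computing attention scores; values are left untouched.
q = np.random.randn(6, 8)                         # (seq_len=6, head_dim=8)
k = np.random.randn(6, 8)
scores = rope_rotate(q) @ rope_rotate(k).T        # position-aware attention logits (pre-softmax)
```

Shifting every position by the same offset leaves these scores unchanged, which is exactly the relative-position property derived above.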
🔗 https://github.com/Khan-Ramsha/FinetuneX/blob/main/modules/positional_encoding.py