**Prerequisites:**
Paper 🔗: https://arxiv.org/pdf/2104.09864
<aside> 🤔 Why do we need Rotary Positional Encoding when the above techniques already exist? </aside>
<aside> 🔍
Core Problem Statement:
Instead of just adding positional embeddings, can't we inject/encode the position information into the attention mechanism itself?
What the equation says → attention should behave the same way for two tokens that are the same distance apart, regardless of where they appear in the sequence.
Explanation of the equation:
Left side: $⟨f_q(x_m, m), f_k(x_n, n)⟩$
The inner product ⟨·, ·⟩ between the query and key vectors should depend only on the relative position difference (m - n), where m and n are the positions of the two tokens, not on their absolute positions. $f_q$ takes a word embedding and a position and returns the query vector; $f_k$ does the same for the key. So this is the attention score between positions m and n.
Key vector (in standard absolute encoding): $W_k(x_m + \text{posEmbed}[m])$, where m is the position and $W_k$ is a learnable weight matrix.
Right side: $g(x_m, x_n, m - n)$
The Mathematical Challenge:
We want to find functions $f_q(x_m, m)$ and $f_k(x_n, n)$ that take a word embedding plus a position,
such that their inner product depends only on the relative position difference (a small numeric check follows this note).
</aside>
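To make the problem concrete, here is a rough toy check (my own sketch, not from the paper; the weights, word embedding, and positions are all made up): with standard additive absolute position embeddings, the query/key score for two tokens at distance 2 changes when the pair is shifted along the sequence, so the score is not a function of (m - n) alone.

```python
# Toy check: with additive absolute position embeddings, q.k is NOT a pure function of (m - n).
import numpy as np

rng = np.random.default_rng(0)
d = 8
W_q = rng.normal(size=(d, d))       # made-up learnable projections
W_k = rng.normal(size=(d, d))
x = rng.normal(size=d)              # the same word embedding reused at every position

def pos_embed(pos):
    # a sinusoidal-style absolute position embedding
    i = np.arange(d // 2)
    freqs = pos / 10000 ** (2 * i / d)
    return np.concatenate([np.sin(freqs), np.cos(freqs)])

def score(m, n):
    q = W_q @ (x + pos_embed(m))
    k = W_k @ (x + pos_embed(n))
    return q @ k

# Same relative distance (2), different absolute positions -> different scores.
print(score(0, 2), score(10, 12))
```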
<aside> 📌 But HOW?
2-D Case Core Intuition
By using complex number representations,
$e^{im\theta}$ → a complex exponential (rotation by angle $m\theta$)
θ is a non-zero constant; if it were zero, no relative distance would ever be captured.
Core Idea → inject positional information by rotating the query and key vectors by an angle proportional to their position index.
Affine transformation: before attention, we apply linear projections (learned matrices) to create the query & key vectors (scroll to top for what a key vector is!)
Let's focus on the proposed form $f_q(x_m, m) = (W_q x_m)\,e^{im\theta}$ and $f_k(x_n, n) = (W_k x_n)\,e^{in\theta}$, and how this is derived 👇
When you multiply complex exponentials: $e^{im\theta} \cdot e^{-in\theta} = e^{i(m-n)\theta}$
The $e^{i(m-n)\theta}$ factor comes from taking the complex conjugate in the dot product: plain multiplication is not a valid inner product for complex vectors, so the dot product conjugates the second vector.
$⟨f_q(x_m, m), f_k(x_n, n)⟩$
$= f_q(x_m, m) \cdot [f_k(x_n, n)]^*$
Taking the conjugate of the 2nd vector:
$[f_k(x_n, n)]^* = (W_k x_n)^*\, e^{-in\theta}$
$⟨f_q, f_k⟩ = [(W_q x_m)\, e^{im\theta}] \cdot [(W_k x_n)^*\, e^{-in\theta}]$
$= (W_q x_m)(W_k x_n)^*\, e^{im\theta}\, e^{-in\theta}$
$= (W_q x_m)(W_k x_n)^*\, e^{i(m-n)\theta}$
Complex Conjugate
z = 3 + 4i → z* = 3 - 4i;  z = 2 - 5i → z* = 2 + 5i
$e^{im\theta} = \cos(m\theta) + i\sin(m\theta)$ → Euler's formula
Now this actually makes sense 👇
In practice (real matrices):
Here, the effect of the conjugate comes from the transpose of the rotation matrix.
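A quick toy check of that last point (my own numbers, nothing from the paper): for real 2×2 rotation matrices, rotating q by mθ and k by nθ before the dot product is the same as rotating only one of them by the relative angle, because $R(m\theta)^\top R(n\theta) = R((n-m)\theta)$, so the transpose plays exactly the role the conjugate played above.

```python
# Toy check: in the real 2-D case, the transpose of the rotation matrix acts like the conjugate.
import numpy as np

def R(angle):
    # standard 2-D rotation matrix
    return np.array([[np.cos(angle), -np.sin(angle)],
                     [np.sin(angle),  np.cos(angle)]])

theta = 0.3
m, n = 7, 3
q = np.array([1.0, 2.0])
k = np.array([0.5, -1.0])

lhs = (R(m * theta) @ q) @ (R(n * theta) @ k)   # rotate both, then take the dot product
rhs = q @ (R((n - m) * theta) @ k)              # rotate only k by the relative angle
print(np.isclose(lhs, rhs))                     # True -> the score depends only on (m - n)
```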
Okay! Fine, but where does this θ come from?
So, isn't this something we already saw in sinusoidal encodings? Yes!
In RoPE, the rotation angle for dimension pair $i$ is $pos \cdot \theta_i$ with $\theta_i = 1/10000^{2i/d}$, and the rotations are applied using sin and cos.
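For intuition, the per-pair angles can be printed directly (a tiny sketch with a made-up head dimension d = 8):

```python
# Per-pair frequencies theta_i = 1 / 10000**(2i/d): the same schedule as sinusoidal encodings.
import numpy as np

d = 8                                   # head dimension (must be even)
i = np.arange(d // 2)                   # each pair of dims (2i, 2i+1) shares one theta
theta = 1.0 / 10000 ** (2 * i / d)
print(theta)                            # -> 1.0, 0.1, 0.01, 0.001
print(5 * theta)                        # rotation angles for the token at position m = 5
```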
Now,
$⟨f_q(x_m, m), f_k(x_n, n)⟩ = \mathrm{Re}[(W_q x_m)(W_k x_n)^*\, e^{i(m-n)\theta}]$ → the inner product depends only on $(m-n)\theta$, i.e., the relative position.
And that’s what we want!
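To convince myself, here is a tiny numeric check of the 2-D complex form (toy numbers of my own): shifting both positions by the same offset leaves the score unchanged, so only (m - n) matters.

```python
# Toy check: treat the 2-D query/key as complex numbers, rotate by position, and verify
# that Re[f_q * conj(f_k)] depends only on the relative position m - n.
import numpy as np

theta = 0.3
q = 1.0 + 2.0j          # stands in for W_q x_m in the 2-D case
k = 0.5 - 1.0j          # stands in for W_k x_n

def rotated_score(m, n):
    fq = q * np.exp(1j * m * theta)
    fk = k * np.exp(1j * n * theta)
    return (fq * np.conj(fk)).real      # Re[f_q * conj(f_k)]

# Relative distance is -3 in both cases, so the two scores are identical.
print(rotated_score(2, 5), rotated_score(12, 15))
```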
</aside>
<aside> 💡 Some Key Insights from the paper
As the distance between two tokens increases, the maximum possible value (upper bound) of the dot product of the rotated vectors decreases.
The farther apart the two tokens are, the less sharply RoPE can preserve their interactions.
This makes sense because words that are closer together have a higher chance of a strong semantic relationship, so nearby positions end up with similar rotations while distant positions end up with very different ones.
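A crude way to see this trend numerically (my own simplification, not the paper's exact bound): as the relative distance r grows, the rotation terms $e^{ir\theta_j}$ across dimensions fall out of phase, so their summed magnitude shrinks on average.

```python
# Rough illustration of long-term decay: dephasing across dimensions as distance grows.
import numpy as np

d = 64
j = np.arange(d // 2)
theta = 1.0 / 10000 ** (2 * j / d)

for r in [1, 10, 100, 1000]:
    magnitude = abs(np.exp(1j * r * theta).sum())
    print(r, round(magnitude, 2))       # the magnitude trends downward as r grows
```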
Something to note as well 👇
The angle by which a dimension rotates depends on both the position and the index $i$. For lower indices, the angle grows quickly as the position increases, whereas for higher indices, increasing the position barely changes the angle!
Can we relate this to sinusoidal encodings again? Yes, same case there! Lower indices oscillate fast across positions, higher indices oscillate slowly.
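A quick look at the numbers (toy values, head dimension d = 64): the lowest index rotates by a full radian per position, while the highest index barely moves.

```python
# How fast the angle m * theta_i grows with position, for a low vs a high index.
import numpy as np

d = 64
theta_low  = 1.0 / 10000 ** (2 * 0 / d)     # i = 0  -> theta = 1.0       (rotates fast)
theta_high = 1.0 / 10000 ** (2 * 31 / d)    # i = 31 -> theta ~ 1.3e-4    (rotates slowly)

for m in range(5):
    print(m, round(m * theta_low, 4), round(m * theta_high, 6))
# low index:  the angle jumps by 1 radian per position -> sin/cos oscillate fast
# high index: the angle barely changes -> sin/cos oscillate slowly, just like sinusoidal encodings
```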
</aside>
<aside> 💡 After digesting the math, let's move on to "how is it actually implemented?"
This is how RoPE injects position information into attention using rotation.
That gave a clear picture!!
</aside>
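For reference, here is a minimal numpy sketch of the rotation step (my own simplification under the usual RoPE conventions; the linked repo below has the actual implementation). Dimensions are grouped into (even, odd) pairs, and pair i of the vector at position m is rotated by the angle m·θ_i.

```python
# Minimal RoPE sketch: rotate each (even, odd) dimension pair of q/k by m * theta_i.
import numpy as np

def rope_rotate(x: np.ndarray, base: float = 10000.0) -> np.ndarray:
    """x: (seq_len, d) query or key matrix with d even; returns the rotated matrix."""
    seq_len, d = x.shape
    i = np.arange(d // 2)
    theta = 1.0 / base ** (2 * i / d)             # (d/2,) per-pair frequencies
    m = np.arange(seq_len)[:, None]               # (seq_len, 1) positions
    angles = m * theta                            # (seq_len, d/2) rotation angles
    cos, sin = np.cos(angles), np.sin(angles)

    x_even, x_odd = x[:, 0::2], x[:, 1::2]        # split into the 2-D pairs
    out = np.empty_like(x)
    out[:, 0::2] = x_even * cos - x_odd * sin     # 2-D rotation of each pair
    out[:, 1::2] = x_even * sin + x_odd * cos
    return out

# Usage: rotate q and k before computing attention scores; values are left untouched.
q = np.random.randn(6, 8)                         # (seq_len=6, head_dim=8)
k = np.random.randn(6, 8)
scores = rope_rotate(q) @ rope_rotate(k).T        # position-aware attention logits (pre-softmax)
```

Shifting every position by the same offset leaves these scores unchanged, which is exactly the relative-position property derived above.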
🔗 https://github.com/Khan-Ramsha/FinetuneX/blob/main/modules/positional_encoding.py