**Prerequisites:**

Paper 🔗: https://arxiv.org/pdf/2104.09864

<aside> 🤔 Why do we need Rotary Positional Encoding when the above techniques already exist?

<aside> 🔍

Core Problem Statement:

Instead of just adding positional embeddings, can't we inject/encode the position information in the attention mechanism itself?

$⟨f_q(x_m, m), f_k(x_n, n)⟩ = g(x_m, x_n, m - n)$

What the above equation says → attention should work the same way for tokens that are the same distance apart, regardless of where they appear in the sequence.

Explanation of above equation:

Left side: $⟨f_q(x_m, m), f_k(x_n, n)⟩$

The inner product ⟨·, ·⟩ between the query and key vectors should depend only on the relative position difference (m − n), where m and n are the positions of the two tokens, and not on their absolute positions. $f_q$ takes a word embedding and a position and returns the query vector, and $f_k$ does the same for keys, so this inner product is the attention score between positions m and n.

Key vector (in standard absolute encoding): $W_k @ (x_m + posEmbed[m])$, where m is the position, @ is matrix multiplication, and $W_k$ is a learnable weight matrix.

Right side: $g(x_m, x_n, m - n)$

The Mathematical Challenge:

We want to find functions $f_q(x_m, m)$ and $f_k(x_n, n)$ that take a word embedding plus its position, such that their inner product depends only on the relative difference $m - n$.

And is this even possible??

</aside>

Yes, RoPE made it possible!

<aside> 📌 But HOW?

2-D Case Core Intuition

image.png

By using complex number representations,

$e^{im\theta}$ → complex exponential (rotation by angle $m\theta$)

θ is a non-zero constant, because if it were zero every position would get the same rotation and no relative distance would be captured.
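For reference, the 2-D formulation from the paper can be written out like this (restated here so the derivation below has something concrete to point at):

$f_q(x_m, m) = (W_q x_m)\, e^{im\theta}$

$f_k(x_n, n) = (W_k x_n)\, e^{in\theta}$

$g(x_m, x_n, m - n) = \mathrm{Re}\big[(W_q x_m)(W_k x_n)^*\, e^{i(m-n)\theta}\big]$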

image.png

Core Idea → injecting positional information by rotating the query and key vectors by an angle proportional to their position index.

Affine transformation: before attention, we apply linear projections (learned matrices) to create the query & key vectors (scroll to top for what a key vector is!)

Let’s Focus on

image.png

And How this is derived 👇

image.png

image.png

When you multiply complex numbers: $e^{im\theta} \cdot e^{-in\theta} = e^{i(m-n)\theta}$

The $e^{i(m-n)\theta}$ term comes from taking the complex conjugate in the dot product. Since plain multiplication does not give a valid inner product for complex vectors, the dot product needs the complex conjugate of the second argument.

$⟨f_q(x_m, m), f_k(x_n, n)⟩ = f_q(x_m, m) \cdot [f_k(x_n, n)]^*$

Taking the conjugate of the 2nd vector:

$[f_k(x_n, n)]^* = (W_k x_n)^* \, e^{-in\theta}$

$⟨f_q, f_k⟩ = [(W_q x_m)\, e^{im\theta}] \cdot [(W_k x_n)^* \, e^{-in\theta}]$

$\qquad\;\, = (W_q x_m) \cdot (W_k x_n)^* \, e^{im\theta} e^{-in\theta}$

$\qquad\;\, = (W_q x_m) \cdot (W_k x_n)^* \, e^{i(m-n)\theta}$

Complex Conjugate

z = 3 + 4i → z* = 3 − 4i, and z = 2 − 5i → z* = 2 + 5i
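A quick numerical check of the derivation above (a minimal NumPy sketch; the weights, vectors, and θ here are arbitrary values chosen just for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
d = 2  # the 2-D case

W_q = rng.normal(size=(d, d))   # learned query projection (random stand-in)
W_k = rng.normal(size=(d, d))   # learned key projection (random stand-in)
x_m = rng.normal(size=d)        # word embedding at position m
x_n = rng.normal(size=d)        # word embedding at position n
theta = 0.5                     # any non-zero constant

def as_complex(v):
    # view a 2-D real vector (a, b) as the complex number a + ib
    return v[0] + 1j * v[1]

def score(m, n):
    q = as_complex(W_q @ x_m) * np.exp(1j * m * theta)   # f_q(x_m, m)
    k = as_complex(W_k @ x_n) * np.exp(1j * n * theta)   # f_k(x_n, n)
    return np.real(q * np.conj(k))                       # Re[q · k*]

print(score(m=5, n=2), score(m=105, n=102))  # equal: both offsets are m - n = 3
print(score(m=5, n=4))                       # different offset -> different score
```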

When RoPE uses $e^{im\theta}$, it's actually using 👇

$e^{im\theta} = \cos(m\theta) + i\sin(m\theta)$ → Euler's formula

Now, this actually made sense 👇

image.png

In practice (real matrices):

image.png

Here, the conjugate effect comes from the matrix transpose.
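A small real-matrix sketch of that point (the vectors and θ below are arbitrary): the transpose of a rotation matrix undoes the rotation, so the dot product of the rotated query and key again depends only on the offset.

```python
import numpy as np

def rot(angle):
    # standard 2x2 rotation matrix
    c, s = np.cos(angle), np.sin(angle)
    return np.array([[c, -s],
                     [s,  c]])

theta, m, n = 0.5, 7, 3
q = np.array([1.0, 2.0])    # stand-in for W_q x_m
k = np.array([0.5, -1.0])   # stand-in for W_k x_n

score = (rot(m * theta) @ q) @ (rot(n * theta) @ k)
shifted = (rot((m + 10) * theta) @ q) @ (rot((n + 10) * theta) @ k)
print(np.isclose(score, shifted))  # True: shifting both positions changes nothing

# ...because R(m*theta).T @ R(n*theta) == R((n - m) * theta)
print(np.allclose(rot(m * theta).T @ rot(n * theta), rot((n - m) * theta)))
```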

Okay! Fine, but where does this θ come from?

image.png

So, isn't this something we saw in sinusoidal encoding? Yes!

image.png

In RoPE, the rotation angle for dimension pair i is $pos \cdot \theta_i$ with $\theta_i = 1 / 10000^{2i/d}$, and the rotations are applied using sin and cos.
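As a tiny sketch, here's what those $\theta_i$ values look like for a small example dimension (d = 8 is just an illustrative choice):

```python
import numpy as np

d = 8                                  # small example head dimension
i = np.arange(d // 2)                  # one frequency per dimension pair
theta = 1.0 / (10000 ** (2 * i / d))   # the same schedule as sinusoidal encoding
print(theta)                           # [1.0, 0.1, 0.01, 0.001]

pos = 5
print(pos * theta)                     # rotation angle for every pair at position 5
```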

Now,

$⟨f_q(x_m, m), f_k(x_n, n)⟩ = \mathrm{Re}\big[(W_q x_m)(W_k x_n)^* e^{i(m-n)\theta}\big]$ → the inner product depends on $(m-n)\theta$, i.e. the relative position.

And that’s what we want!

</aside>


<aside> 💡 Some Key Insights from the paper

image.png

image.png

As the distance between two tokens increases, the maximum possible value (upper bound) of the dot product of the rotated vectors decreases.

The farther apart the two tokens are, the less sharply RoPE can preserve their interactions.

That lines up with intuition: words that are closer together have a higher chance of strong semantic relationships, so nearby positions get similar rotations while positions farther apart get increasingly different ones.
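A small sketch that reproduces this trend numerically: it computes the distance-dependent factor from the paper's upper bound, $\frac{1}{d/2}\sum_j |S_j|$ with $S_j = \sum_{k<j} e^{i(m-n)\theta_k}$, and shows it shrinking as the relative distance grows (d = 128 and base 10000 are the usual defaults, picked here just for illustration):

```python
import numpy as np

def relative_upper_bound(rel_dist, d=128, base=10000.0):
    # theta_k = base^(-2k/d), one frequency per dimension pair
    theta = base ** (-2 * np.arange(d // 2) / d)
    # partial sums S_j of the unit phases e^{i * (m-n) * theta_k}
    partial_sums = np.cumsum(np.exp(1j * rel_dist * theta))
    # averaged magnitude of the partial sums: the part that depends on m - n
    return np.mean(np.abs(partial_sums))

for dist in [1, 2, 4, 8, 16, 32, 64, 128, 256]:
    print(f"relative distance {dist:4d} -> bound {relative_upper_bound(dist):6.2f}")
```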

Something to note as well 👇

image.png

The angle of rotation depends on both the position and the index i. For lower indices, the rotation angle increases quickly as the position increases, whereas for higher indices it grows much more slowly.

Can we relate this to sinusoidal encoding again? Yes, it's the same there: lower indices oscillate fast across positions, higher indices oscillate slowly.
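A quick sketch of that contrast (d = 64 and the positions below are arbitrary choices):

```python
import numpy as np

d = 64
theta = 1.0 / (10000 ** (2 * np.arange(d // 2) / d))

for m in [1, 10, 100, 1000]:
    # compare the lowest-index (fastest) and highest-index (slowest) pair
    print(f"pos={m:4d}  low-index angle={m * theta[0]:8.1f} rad  "
          f"high-index angle={m * theta[-1]:.4f} rad")
```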

</aside>


<aside> 💡 After digesting the math, let's move on to: how is it actually implemented?

This is how RoPE injects position information into attention using rotation.

image.png

Gave a clear view!!
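To make the picture concrete, here's a minimal NumPy sketch of the rotation step (this is only an illustrative sketch, not the code from the repo linked below; `apply_rope` and its arguments are names introduced here):

```python
import numpy as np

def apply_rope(x, pos, base=10000.0):
    """Rotate consecutive pairs (x0, x1), (x2, x3), ... of a query/key vector
    `x` by the angles pos * theta_i. Illustrative sketch only."""
    d = x.shape[-1]
    theta = 1.0 / (base ** (2 * np.arange(d // 2) / d))   # one frequency per pair
    angles = pos * theta
    cos, sin = np.cos(angles), np.sin(angles)

    x_even, x_odd = x[..., 0::2], x[..., 1::2]            # the two members of each pair
    out = np.empty_like(x)
    out[..., 0::2] = x_even * cos - x_odd * sin           # 2-D rotation of every pair
    out[..., 1::2] = x_even * sin + x_odd * cos
    return out

# The score between a rotated query (position m) and rotated key (position n)
# depends only on m - n:
rng = np.random.default_rng(0)
q, k = rng.normal(size=8), rng.normal(size=8)
print(np.isclose(apply_rope(q, 5) @ apply_rope(k, 2),
                 apply_rope(q, 105) @ apply_rope(k, 102)))   # True
```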

</aside>

Let's Code it!!


image.png

🔗 https://github.com/Khan-Ramsha/FinetuneX/blob/main/modules/positional_encoding.py