
Motivation

Because transformers process tokens in parallel (rather than sequentially), they need a way to understand the ordering of tokens. Positional encoding provides this mechanism. Without it, a transformer can’t distinguish “the cat sat on the mat” from “the mat sat on the cat”. When transformers were introduced in Attention Is All You Need, the authors added absolute (sinusoidal) positional encodings to the token embeddings. Many positional encoding techniques have been proposed since then, but rotary position embedding (RoPE) has become the most widely used.

<aside>

⚠️ The terms positional encoding and positional embeddings are often used interchangeably.

</aside>

RoPE

Instead of adding positional embeddings to the token embeddings, RoPE is multiplicative: it injects positional information by rotating the query and key vectors.

<aside>

Recall that in transformers, the query and key each have shape $(B, H, T, d)$ where:

- $B$ is the batch size
- $H$ is the number of attention heads
- $T$ is the sequence length
- $d$ is the head dimension

</aside>

As per Equation (14) in the paper, RoPE is applied to the query and key:

$$ f_{q,k}(x_m, m) = R^d_{\Theta,m} W_{q,k} x_m $$

where

$$ R^d_{\Theta,m} = \begin{pmatrix}\cos m\theta_1 & -\sin m\theta_1 & 0 & 0 & \cdots & 0 & 0 \\\sin m\theta_1 & \cos m\theta_1 & 0 & 0 & \cdots & 0 & 0 \\0 & 0 & \cos m\theta_2 & -\sin m\theta_2 & \cdots & 0 & 0 \\0 & 0 & \sin m\theta_2 & \cos m\theta_2 & \cdots & 0 & 0 \\\vdots & \vdots & \vdots & \ddots & \ddots & \vdots & \vdots \\\vdots & \vdots & \vdots & \ddots & \ddots & \vdots & \vdots \\0 & 0 & 0 & 0 & \cdots & \cos m\theta_{d/2} & -\sin m\theta_{d/2} \\0 & 0 & 0 & 0 & \cdots & \sin m\theta_{d/2} & \cos m\theta_{d/2}\end{pmatrix} $$

Each $2 \times 2$ block rotates a feature pair (identified by index $i$) by the angle $m\theta_i$, where

$$ \Theta = \{\theta_i = 10000^{-2(i-1)/d}, \quad i \in [1, 2, \ldots, d/2]\} $$
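To make the rotation concrete, here is a minimal NumPy sketch (the function name `rope`, the single-head `(T, d)` layout, and the `base` argument are illustrative choices, not from the paper). Rather than materializing the sparse matrix $R^d_{\Theta,m}$, it rotates each feature pair $(x_{2i-1}, x_{2i})$ at position $m$ by the angle $m\theta_i$, which is equivalent.

```python
import numpy as np

def rope(x: np.ndarray, base: float = 10000.0) -> np.ndarray:
    """Apply rotary position embedding to x of shape (T, d), with d even.

    Row m is treated as the query/key vector at position m. Each adjacent
    feature pair (x_{2i-1}, x_{2i}) is rotated by the angle m * theta_i,
    matching the block-diagonal matrix R^d_{Theta,m} defined above.
    """
    T, d = x.shape
    assert d % 2 == 0, "RoPE requires an even head dimension"

    # theta_i = base^(-2(i-1)/d) for i = 1 .. d/2
    i = np.arange(1, d // 2 + 1)
    theta = base ** (-2 * (i - 1) / d)        # shape (d/2,)

    # Angle m * theta_i for every position m and every feature pair i
    m = np.arange(T)[:, None]                 # shape (T, 1)
    angles = m * theta[None, :]               # shape (T, d/2)
    cos, sin = np.cos(angles), np.sin(angles)

    # First and second components of each 2x2 block (adjacent-pair convention)
    x1, x2 = x[:, 0::2], x[:, 1::2]           # each (T, d/2)
    out = np.empty_like(x)
    out[:, 0::2] = x1 * cos - x2 * sin
    out[:, 1::2] = x1 * sin + x2 * cos
    return out

# Example: rotate a random query matrix; position 0 is unchanged (all angles are 0)
q = np.random.randn(8, 64)
q_rot = rope(q)
assert np.allclose(q_rot[0], q[0])
```

In a real model, the same element-wise form would be applied to the projected queries and keys of every head before the attention scores are computed; position $0$ is left unchanged because all of its rotation angles are zero.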