
Motivation

Because transformers process tokens in parallel (rather than sequentially), they need a way to understand the ordering of tokens. Positional encoding provides this mechanism. Without it, a transformer can’t distinguish “the cat sat on the mat” from “the mat sat on the cat”. When transformers were introduced in Attention Is All You Need, the authors added absolute (sinusoidal) positional encodings to the token embeddings. Many positional encoding techniques have been proposed since then, but rotary position embedding (RoPE) has become the most widely used.

<aside>

⚠️ The terms positional encoding and positional embeddings are often used interchangeably.

</aside>

RoPE

Instead of adding positional embeddings to the token embeddings, RoPE is multiplicative: it injects positional information by rotating the query and key vectors.

<aside>

Recall that in transformers, the query and key each have shape $(B, H, T, d)$ where:

- $B$ is the batch size
- $H$ is the number of attention heads
- $T$ is the sequence length
- $d$ is the head dimension

</aside>

As per Equation (14) in the paper, RoPE is applied to the query and key:

$$ f_{q,k}(x_m, m) = R^d_{\Theta,m} W_{q,k} x_m $$

where

$$ R^d_{\Theta,m} = \begin{pmatrix}\cos m\theta_1 & -\sin m\theta_1 & 0 & 0 & \cdots & 0 & 0 \\\sin m\theta_1 & \cos m\theta_1 & 0 & 0 & \cdots & 0 & 0 \\0 & 0 & \cos m\theta_2 & -\sin m\theta_2 & \cdots & 0 & 0 \\0 & 0 & \sin m\theta_2 & \cos m\theta_2 & \cdots & 0 & 0 \\\vdots & \vdots & \vdots & \ddots & \ddots & \vdots & \vdots \\\vdots & \vdots & \vdots & \ddots & \ddots & \vdots & \vdots \\0 & 0 & 0 & 0 & \cdots & \cos m\theta_{d/2} & -\sin m\theta_{d/2} \\0 & 0 & 0 & 0 & \cdots & \sin m\theta_{d/2} & \cos m\theta_{d/2}\end{pmatrix} $$

Each $2 \times 2$ block rotates a feature pair (identified by index $i$) by the angle $m\theta_i$, where

$$ \Theta = \{\theta_i = 10000^{-2(i-1)/d}, \quad i \in [1, 2, \ldots, d/2]\} $$
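To make the rotation concrete, here is a minimal NumPy sketch (the function name `rope`, the single-head `(T, d)` layout, and the `base` argument are illustrative choices, not from the paper). Rather than materializing the sparse matrix $R^d_{\Theta,m}$, it rotates each feature pair $(x_{2i-1}, x_{2i})$ at position $m$ by the angle $m\theta_i$, which is equivalent.

```python
import numpy as np

def rope(x: np.ndarray, base: float = 10000.0) -> np.ndarray:
    """Apply rotary position embedding to x of shape (T, d), with d even.

    Row m is treated as the query/key vector at position m. Each adjacent
    feature pair (x_{2i-1}, x_{2i}) is rotated by the angle m * theta_i,
    matching the block-diagonal matrix R^d_{Theta,m} defined above.
    """
    T, d = x.shape
    assert d % 2 == 0, "RoPE requires an even head dimension"

    # theta_i = base^(-2(i-1)/d) for i = 1 .. d/2
    i = np.arange(1, d // 2 + 1)
    theta = base ** (-2 * (i - 1) / d)        # shape (d/2,)

    # Angle m * theta_i for every position m and every feature pair i
    m = np.arange(T)[:, None]                 # shape (T, 1)
    angles = m * theta[None, :]               # shape (T, d/2)
    cos, sin = np.cos(angles), np.sin(angles)

    # First and second components of each 2x2 block (adjacent-pair convention)
    x1, x2 = x[:, 0::2], x[:, 1::2]           # each (T, d/2)
    out = np.empty_like(x)
    out[:, 0::2] = x1 * cos - x2 * sin
    out[:, 1::2] = x1 * sin + x2 * cos
    return out

# Example: rotate a random query matrix; position 0 is unchanged (all angles are 0)
q = np.random.randn(8, 64)
q_rot = rope(q)
assert np.allclose(q_rot[0], q[0])
```

In a real model, the same element-wise form would be applied to the projected queries and keys of every head before the attention scores are computed; position $0$ is left unchanged because all of its rotation angles are zero.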