Transformers process all tokens in parallel (unlike RNNs, which process them sequentially). This is great for speed, but creates a problem: the model has no built-in notion of the ORDER of words.
Without positional information, two sentences like:

    "the dog bit the man"
    "the man bit the dog"

would look identical to the model! Both contain the same tokens, just in different positions.
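A quick sketch of why: an embedding layer is a pure per-token lookup, so reordering the input just reorders the same output rows. The vocabulary size and token ids below are made up for illustration.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
tok_emb = nn.Embedding(10, 4)  # tiny vocab of 10 tokens, 4-dim embeddings

a = torch.tensor([1, 2, 3])    # one ordering of three tokens
b = torch.tensor([3, 2, 1])    # the same tokens, reversed

# Embedding a reversed sequence yields exactly the reversed rows:
# any order-insensitive computation over them can't tell a from b.
print(torch.equal(tok_emb(a), tok_emb(b).flip(0)))  # True
```

Attention itself is permutation-equivariant, so without positional information the model genuinely cannot distinguish the two orderings.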
In __init__:

    self.pos_emb = nn.Embedding(config["context_length"], config["emb_dim"])
This creates a learnable lookup table. Think of it as a matrix of shape [context_length, emb_dim], where row i holds the embedding vector for position i, and every entry is a trainable parameter updated by gradient descent.
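The lookup-table view can be checked directly. A minimal sketch with tiny dimensions (context_length = 5, emb_dim = 3, both chosen here just for illustration):

```python
import torch
import torch.nn as nn

pos_emb = nn.Embedding(5, 3)   # 5 positions, 3-dim embeddings

# The whole table is one weight matrix of shape [context_length, emb_dim]:
print(pos_emb.weight.shape)    # torch.Size([5, 3])

# Indexing with position 2 simply returns row 2 of that matrix:
print(torch.equal(pos_emb(torch.tensor(2)), pos_emb.weight[2]))  # True
```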
Example: if context_length = 512 and emb_dim = 768, pos_emb is a [512 × 768] learnable parameter matrix (512 × 768 = 393,216 parameters).
In forward, the model looks up the embedding for each position 0 .. seq_len − 1 and adds it to the token embeddings, so the same token receives a different final vector at different positions.
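The forward-pass usage can be sketched as follows, assuming the config keys from the text; the vocab size, token ids, and variable names are illustrative:

```python
import torch
import torch.nn as nn

config = {"vocab_size": 10, "context_length": 512, "emb_dim": 768}
tok_emb = nn.Embedding(config["vocab_size"], config["emb_dim"])
pos_emb = nn.Embedding(config["context_length"], config["emb_dim"])

token_ids = torch.tensor([[5, 1, 7, 2]])     # shape [batch=1, seq_len=4]
seq_len = token_ids.shape[1]

tok = tok_emb(token_ids)                     # [1, 4, 768]
pos = pos_emb(torch.arange(seq_len))         # [4, 768]; broadcasts over batch

x = tok + pos                                # position-aware input embeddings
print(x.shape)  # torch.Size([1, 4, 768])
```

Note that only the first seq_len rows of the position table are used, which is why a sequence can never be longer than context_length.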