The Problem They Solve

Transformers process all tokens in parallel (unlike RNNs, which process them one at a time). This is great for speed, but it creates a problem: attention itself is order-blind, so the model has no idea about the ORDER of words.

Without positional information, two sentences like these:

"the dog bit the man"
"the man bit the dog"

would look identical to the model! Both contain the same tokens, just in different positions.
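You can see this order-blindness directly: if you aggregate token embeddings in any order-insensitive way (here, a sum), both orderings produce exactly the same result. This is a minimal sketch with a made-up tiny vocabulary and embedding size, not the model's real config.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
tok_emb = nn.Embedding(10, 4)  # toy vocab of 10 tokens, 4-dim embeddings (illustrative only)

a = torch.tensor([1, 2, 3])  # e.g. "dog bites man"
b = torch.tensor([3, 2, 1])  # e.g. "man bites dog" -- same tokens, reversed order

# Each lookup returns the same three rows, just permuted,
# so any order-blind aggregation (like a sum) cannot tell them apart.
assert torch.allclose(tok_emb(a).sum(dim=0), tok_emb(b).sum(dim=0))
print("identical without positional info")
```

Positional embeddings break this symmetry by giving each slot in the sequence its own learned vector.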

How Positional Embeddings Work

In __init__:

self.pos_emb = nn.Embedding(config["context_length"], config["emb_dim"])

This creates a learnable lookup table. Think of it as a matrix of shape [context_length, emb_dim], where:

- each row corresponds to one position in the sequence (row 0 is position 0, row 1 is position 1, and so on)
- each row is an emb_dim-dimensional vector whose values are learned during training, just like token embeddings

Example with small numbers:

If context_length = 512 and emb_dim = 768, pos_emb is a [512 × 768] learnable parameter matrix: 393,216 trainable values, one 768-dim vector per position.
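Putting the numbers above into code: a short sketch (using the same hypothetical config values from the example) showing the shape of the lookup table and what indexing it with position IDs returns.

```python
import torch
import torch.nn as nn

# Hypothetical config matching the example above
config = {"context_length": 512, "emb_dim": 768}

pos_emb = nn.Embedding(config["context_length"], config["emb_dim"])

# The underlying weight is the learnable [context_length x emb_dim] matrix
print(pos_emb.weight.shape)  # torch.Size([512, 768])

# Indexing with position IDs 0..seq_len-1 returns one row per position
seq_len = 4
positions = torch.arange(seq_len)  # tensor([0, 1, 2, 3])
vectors = pos_emb(positions)
print(vectors.shape)  # torch.Size([4, 768])
```

Each input sequence of length seq_len pulls out the first seq_len rows, so every slot gets a distinct, trainable position vector.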

In forward: