Transformers process all tokens in parallel (unlike RNNs, which process them sequentially). This is great for speed, but creates a problem: the model has no built-in notion of the ORDER of words.
Without positional information, two sentences like:

    "the dog bit the man"
    "the man bit the dog"

would look identical to the model! Both contain the same tokens, just in different positions.
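A quick sketch of why: an embedding layer is a pure per-token lookup, so reordering the input just reorders the same output rows. The vocabulary size and token ids below are made up for illustration.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
tok_emb = nn.Embedding(10, 4)  # tiny vocab of 10 tokens, 4-dim embeddings

a = torch.tensor([1, 2, 3])    # one ordering of three tokens
b = torch.tensor([3, 2, 1])    # the same tokens, reversed

# Embedding a reversed sequence yields exactly the reversed rows:
# any order-insensitive computation over them can't tell a from b.
print(torch.equal(tok_emb(a), tok_emb(b).flip(0)))  # True
```

Attention itself is permutation-equivariant, so without positional information the model genuinely cannot distinguish the two orderings.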
In __init__:

    self.pos_emb = nn.Embedding(config["context_length"], config["emb_dim"])
This creates a learnable lookup table. Think of it as a matrix of shape [context_length, emb_dim], where row i holds the embedding vector for position i, and every entry is a trainable parameter updated by gradient descent.
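The lookup-table view can be checked directly. A minimal sketch with tiny dimensions (context_length = 5, emb_dim = 3, both chosen here just for illustration):

```python
import torch
import torch.nn as nn

pos_emb = nn.Embedding(5, 3)   # 5 positions, 3-dim embeddings

# The whole table is one weight matrix of shape [context_length, emb_dim]:
print(pos_emb.weight.shape)    # torch.Size([5, 3])

# Indexing with position 2 simply returns row 2 of that matrix:
print(torch.equal(pos_emb(torch.tensor(2)), pos_emb.weight[2]))  # True
```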
Example: if context_length = 512 and emb_dim = 768, pos_emb is a [512 × 768] learnable parameter matrix (512 × 768 = 393,216 parameters).
In forward, the model looks up the embedding for each position 0 .. seq_len − 1 and adds it to the token embeddings, so the same token receives a different final vector at different positions.
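The forward-pass usage can be sketched as follows, assuming the config keys from the text; the vocab size, token ids, and variable names are illustrative:

```python
import torch
import torch.nn as nn

config = {"vocab_size": 10, "context_length": 512, "emb_dim": 768}
tok_emb = nn.Embedding(config["vocab_size"], config["emb_dim"])
pos_emb = nn.Embedding(config["context_length"], config["emb_dim"])

token_ids = torch.tensor([[5, 1, 7, 2]])     # shape [batch=1, seq_len=4]
seq_len = token_ids.shape[1]

tok = tok_emb(token_ids)                     # [1, 4, 768]
pos = pos_emb(torch.arange(seq_len))         # [4, 768]; broadcasts over batch

x = tok + pos                                # position-aware input embeddings
print(x.shape)  # torch.Size([1, 4, 768])
```

Note that only the first seq_len rows of the position table are used, which is why a sequence can never be longer than context_length.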