Inspired by CSC420, a computer vision course I took in university

Transformers power modern large language models (LLMs) like GPT, BERT, and LLaMA. But behind the buzzwords, they’re built from surprisingly elegant building blocks. In this guide, we’ll build a minimal Transformer from scratch in PyTorch.

Why Transformers?

Before 2017, sequence models were dominated by RNNs and LSTMs. The Transformer, introduced in Attention Is All You Need (Vaswani et al., 2017), replaced recurrence with self-attention, enabling:

  - Parallel computation over all positions in a sequence during training, rather than step-by-step recurrence
  - Direct connections between any two tokens, which makes long-range dependencies easier to learn
  - Architectures that scale well with data and compute

At its core, a Transformer consists of three pieces, combined in the short sketch after this list:

  1. Token embeddings + positional encodings
  2. Repeated blocks of multi-head self-attention and feed-forward networks
  3. A final projection for tasks (e.g., language modeling, classification)
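
As a roadmap, here is a rough, self-contained sketch of how those three pieces fit together. The class name MiniTransformerLM is made up for illustration, and it leans on PyTorch's built-in nn.TransformerEncoderLayer plus a learned position embedding as stand-ins for the components we build by hand in the rest of this guide.

import torch
import torch.nn as nn

class MiniTransformerLM(nn.Module):
    def __init__(self, vocab_size, d_model=128, n_heads=4, n_layers=2, max_len=512):
        super().__init__()
        self.token_embed = nn.Embedding(vocab_size, d_model)   # 1. token embeddings
        self.pos_embed = nn.Embedding(max_len, d_model)        #    learned positions (stand-in for sinusoidal encodings below)
        self.blocks = nn.ModuleList([                          # 2. attention + feed-forward blocks (stand-in layers)
            nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
            for _ in range(n_layers)
        ])
        self.lm_head = nn.Linear(d_model, vocab_size)          # 3. projection to vocabulary logits

    def forward(self, tokens):
        # tokens: (batch, seq_len) of token ids
        positions = torch.arange(tokens.size(1), device=tokens.device)
        x = self.token_embed(tokens) + self.pos_embed(positions)  # positions broadcast over the batch
        for block in self.blocks:
            x = block(x)
        return self.lm_head(x)  # (batch, seq_len, vocab_size)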

1. Token & Positional Embeddings

Self-attention on its own is order-agnostic: permuting the input tokens simply permutes the output. To give the model a sense of position, we add positional encodings to the token embeddings; the original paper uses fixed sinusoidal encodings.

import torch
import torch.nn as nn
import math

class PositionalEncoding(nn.Module):
    def __init__(self, d_model, max_len=5000):
        super().__init__()
        pe = torch.zeros(max_len, d_model)
        position = torch.arange(0, max_len).unsqueeze(1)  # (max_len, 1)
        # Geometric progression of frequencies across dimensions
        div_term = torch.exp(
            torch.arange(0, d_model, 2) * -(math.log(10000.0) / d_model)
        )
        pe[:, 0::2] = torch.sin(position * div_term)  # even dimensions
        pe[:, 1::2] = torch.cos(position * div_term)  # odd dimensions
        pe = pe.unsqueeze(0)  # shape: (1, max_len, d_model)
        # Buffer: saved with the module and moved across devices, but not a trainable parameter
        self.register_buffer("pe", pe)

    def forward(self, x):
        # x: (batch, seq_len, d_model); add the encodings for the first seq_len positions
        return x + self.pe[:, :x.size(1)]
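
To see the module in action, here is a tiny check with made-up sizes: embed a batch of token ids, then add positional information on top.

# Toy usage (arbitrary sizes): token embeddings + positional encodings
vocab_size, d_model = 1000, 64
embed = nn.Embedding(vocab_size, d_model)
pos_enc = PositionalEncoding(d_model)

tokens = torch.randint(0, vocab_size, (2, 10))  # (batch=2, seq_len=10) token ids
x = pos_enc(embed(tokens))                      # (2, 10, 64): embeddings with positions added
print(x.shape)                                  # torch.Size([2, 10, 64])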

2. Scaled Dot-Product Attention

The heart of the Transformer:

$$ \text{Attention}(Q, K, V) = \text{softmax}\left(\frac{QK^T}{\sqrt{d_k}}\right)V $$

Here d_k is the dimensionality of the keys; scaling by 1/sqrt(d_k) keeps the dot products from growing so large that the softmax saturates.

def scaled_dot_product_attention(Q, K, V, mask=None):
    # Q, K, V: (..., seq_len, d_k); mask: broadcastable to (..., seq_len, seq_len)
    d_k = Q.size(-1)
    scores = Q @ K.transpose(-2, -1) / math.sqrt(d_k)  # (..., seq_len, seq_len)
    if mask is not None:
        # Positions where mask == 0 get a large negative score, so softmax gives them ~0 weight
        scores = scores.masked_fill(mask == 0, -1e9)
    attn = scores.softmax(dim=-1)  # attention weights; each row sums to 1
    return attn @ V, attn          # weighted sum of values, plus the weights themselves
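
As a quick sanity check (sizes are arbitrary), we can run self-attention on a random tensor with a lower-triangular mask. This is the causal masking used in language models, where position i may only attend to positions at or before i.

# Toy self-attention: the same tensor plays the roles of Q, K, and V
batch, seq_len, d_k = 2, 5, 16
x = torch.randn(batch, seq_len, d_k)

# Causal mask: 1 where attention is allowed, 0 where it is blocked
causal_mask = torch.tril(torch.ones(seq_len, seq_len))

out, attn = scaled_dot_product_attention(x, x, x, mask=causal_mask)
print(out.shape, attn.shape)  # torch.Size([2, 5, 16]) torch.Size([2, 5, 5])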

3. Multi-Head Attention