Inspired by CSC420, a computer vision course I took in university | By Chris Oh
Transformers power modern large language models (LLMs) like GPT, BERT, and LLaMA. But behind the buzzwords, they are assembled from a handful of surprisingly simple components. In this guide, we'll build a minimal Transformer from scratch in PyTorch.
Before 2017, sequence models were dominated by RNNs and LSTMs. The Transformer, introduced in Attention Is All You Need (Vaswani et al., 2017), replaced recurrence with self-attention, enabling fully parallel training across sequence positions and direct modeling of long-range dependencies.
At its core, a Transformer is a stack of a few simple components: token embeddings plus positional encodings, multi-head self-attention, position-wise feed-forward networks, and residual connections with layer normalization. A preview of how these fit together is sketched below.
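Here is a rough sketch of a single encoder block using PyTorch's built-in `nn.MultiheadAttention` and `nn.LayerNorm` as stand-ins; the class name and hyperparameters are illustrative placeholders, and the rest of this guide builds the attention machinery from scratch instead.

```python
import torch
import torch.nn as nn

class EncoderBlockPreview(nn.Module):
    """Illustrative encoder block: attention -> add & norm -> feed-forward -> add & norm."""
    def __init__(self, d_model=512, num_heads=8, d_ff=2048, dropout=0.1):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, num_heads, dropout=dropout, batch_first=True)
        self.ff = nn.Sequential(
            nn.Linear(d_model, d_ff),
            nn.ReLU(),
            nn.Linear(d_ff, d_model),
        )
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)
        self.dropout = nn.Dropout(dropout)

    def forward(self, x):
        # Self-attention with a residual connection and layer norm
        attn_out, _ = self.attn(x, x, x)
        x = self.norm1(x + self.dropout(attn_out))
        # Position-wise feed-forward with a residual connection and layer norm
        x = self.norm2(x + self.dropout(self.ff(x)))
        return x

x = torch.randn(2, 16, 512)            # (batch, seq_len, d_model)
print(EncoderBlockPreview()(x).shape)  # torch.Size([2, 16, 512])
```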
Self-attention is permutation-invariant, so Transformers don't inherently understand token order. We inject order by adding positional encodings to the input embeddings.
```python
import torch
import torch.nn as nn
import math


class PositionalEncoding(nn.Module):
    """Adds fixed sinusoidal position information to token embeddings."""

    def __init__(self, d_model, max_len=5000):
        super().__init__()
        pe = torch.zeros(max_len, d_model)
        position = torch.arange(0, max_len).unsqueeze(1)  # (max_len, 1)
        div_term = torch.exp(
            torch.arange(0, d_model, 2) * -(math.log(10000.0) / d_model)
        )
        pe[:, 0::2] = torch.sin(position * div_term)  # even dimensions
        pe[:, 1::2] = torch.cos(position * div_term)  # odd dimensions
        pe = pe.unsqueeze(0)  # shape: (1, max_len, d_model)
        self.register_buffer("pe", pe)  # stored with the module, not a trainable parameter

    def forward(self, x):
        # x: (batch, seq_len, d_model); add the encodings for the first seq_len positions
        return x + self.pe[:, :x.size(1)]
```
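As a quick sanity check, here is a hypothetical usage sketch (batch size, sequence length, and `d_model` are arbitrary): the module should leave the tensor shape unchanged while adding the sinusoidal pattern.

```python
pos_enc = PositionalEncoding(d_model=64)
x = torch.zeros(2, 10, 64)         # (batch=2, seq_len=10, d_model=64)
out = pos_enc(x)

print(out.shape)                   # torch.Size([2, 10, 64]) -- shape is preserved
print(out[0, 0, 0], out[0, 1, 0])  # tensor(0.) tensor(0.8415) -- sin(0) and sin(1) in dimension 0
```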
The heart of the Transformer is scaled dot-product attention: each query is compared against every key, and the resulting softmax weights decide how much of each value flows into the output.
$$ \text{Attention}(Q, K, V) = \text{softmax}\!\left(\frac{QK^T}{\sqrt{d_k}}\right)V $$
```python
def scaled_dot_product_attention(Q, K, V, mask=None):
    # Q, K, V: (..., seq_len, d_k); mask is 0 where attention is not allowed
    d_k = Q.size(-1)
    scores = Q @ K.transpose(-2, -1) / math.sqrt(d_k)  # (..., seq_len, seq_len)
    if mask is not None:
        scores = scores.masked_fill(mask == 0, -1e9)   # block masked positions
    attn = scores.softmax(dim=-1)                      # weights sum to 1 per query
    return attn @ V, attn
```
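To see it in action, here is a small sketch (the batch size, sequence length, and head dimension are arbitrary) that applies the function with a causal mask built from `torch.tril`, so each position can only attend to itself and earlier positions.

```python
batch, seq_len, d_k = 2, 5, 16
Q = torch.randn(batch, seq_len, d_k)
K = torch.randn(batch, seq_len, d_k)
V = torch.randn(batch, seq_len, d_k)

# Lower-triangular causal mask: position i may attend to positions <= i
mask = torch.tril(torch.ones(seq_len, seq_len)).unsqueeze(0)  # (1, seq_len, seq_len)

out, attn = scaled_dot_product_attention(Q, K, V, mask=mask)
print(out.shape)         # torch.Size([2, 5, 16])
print(attn[0, 0])        # first query attends only to position 0
print(attn.sum(dim=-1))  # each row of attention weights sums to 1
```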