Inspired by CSC420, a computer vision course I took in university
Transformers power modern large language models (LLMs) like GPT, BERT, and LLaMA. But behind the buzzwords, they’re built from surprisingly elegant building blocks. In this guide, we’ll build a minimal Transformer from scratch in PyTorch.
Before 2017, sequence modeling was dominated by RNNs and LSTMs. The Transformer, introduced in Attention Is All You Need (Vaswani et al., 2017), replaced recurrence with self-attention, enabling:

- parallel processing of all tokens during training, instead of one step at a time;
- direct connections between any two positions, which makes long-range dependencies easier to learn;
- better scaling to large models and datasets.
At its core, a Transformer is a stack of identical blocks built from a handful of components:

- token embeddings combined with positional encodings at the input,
- multi-head self-attention,
- a position-wise feed-forward network,
- residual connections and layer normalization around each sub-layer (see the sketch after this list).
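As a rough preview, here is a minimal sketch of how one encoder-style block composes these pieces. The class name `TransformerBlock`, the default hyperparameters, and the use of PyTorch's built-in `nn.MultiheadAttention` as a stand-in for the attention we implement from scratch below are all illustrative assumptions, not the final code of this guide.

import torch
import torch.nn as nn

class TransformerBlock(nn.Module):
    """One encoder-style block: self-attention + feed-forward,
    each wrapped in a residual connection and layer norm (illustrative sketch)."""
    def __init__(self, d_model=512, n_heads=8, d_ff=2048, dropout=0.1):
        super().__init__()
        # Built-in attention used here only as a placeholder for the version we build below
        self.attn = nn.MultiheadAttention(d_model, n_heads, dropout=dropout, batch_first=True)
        self.ff = nn.Sequential(
            nn.Linear(d_model, d_ff),
            nn.ReLU(),
            nn.Linear(d_ff, d_model),
        )
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)
        self.dropout = nn.Dropout(dropout)

    def forward(self, x):
        # Self-attention sub-layer with residual connection and layer norm
        attn_out, _ = self.attn(x, x, x)
        x = self.norm1(x + self.dropout(attn_out))
        # Position-wise feed-forward sub-layer with residual connection and layer norm
        x = self.norm2(x + self.dropout(self.ff(x)))
        return x

block = TransformerBlock()
y = block(torch.randn(2, 10, 512))  # (batch, seq_len, d_model) in and out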
Self-attention is permutation-invariant: a Transformer has no inherent sense of token order, so we add positional encodings to the input embeddings.
import torch
import torch.nn as nn
import math

class PositionalEncoding(nn.Module):
    """Adds fixed sinusoidal position information to token embeddings."""

    def __init__(self, d_model, max_len=5000):
        super().__init__()
        pe = torch.zeros(max_len, d_model)
        position = torch.arange(0, max_len).unsqueeze(1)
        # Geometric progression of frequencies across the embedding dimensions
        div_term = torch.exp(
            torch.arange(0, d_model, 2) * -(math.log(10000.0) / d_model)
        )
        pe[:, 0::2] = torch.sin(position * div_term)  # even dimensions
        pe[:, 1::2] = torch.cos(position * div_term)  # odd dimensions
        pe = pe.unsqueeze(0)  # shape: (1, max_len, d_model)
        # Registered as a buffer: saved with the module, but not a learnable parameter
        self.register_buffer("pe", pe)

    def forward(self, x):
        # x: (batch, seq_len, d_model); add the first seq_len positional vectors
        return x + self.pe[:, :x.size(1)]
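A quick usage check, where the vocabulary size, dimensions, and batch shape are arbitrary choices for illustration:

# Hypothetical toy sizes, just to confirm the shapes line up
vocab_size, d_model, batch, seq_len = 1000, 64, 2, 10

embedding = nn.Embedding(vocab_size, d_model)
pos_enc = PositionalEncoding(d_model)

tokens = torch.randint(0, vocab_size, (batch, seq_len))  # (batch, seq_len)
x = pos_enc(embedding(tokens))                           # (batch, seq_len, d_model)
print(x.shape)  # torch.Size([2, 10, 64])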
Scaled dot-product attention is the heart of the Transformer:

$$ \text{Attention}(Q, K, V) = \text{softmax}\!\left(\frac{QK^T}{\sqrt{d_k}}\right)V $$

where $d_k$ is the dimension of the keys. Dividing by $\sqrt{d_k}$ keeps the dot products from growing too large and saturating the softmax.
def scaled_dot_product_attention(Q, K, V, mask=None):
    # Q, K, V: (..., seq_len, d_k); positions where mask == 0 are blocked
    d_k = Q.size(-1)
    # Similarity between every query and every key, scaled by sqrt(d_k)
    scores = Q @ K.transpose(-2, -1) / math.sqrt(d_k)
    if mask is not None:
        # Large negative value so softmax assigns blocked positions ~0 weight
        scores = scores.masked_fill(mask == 0, -1e9)
    attn = scores.softmax(dim=-1)  # attention weights over the keys
    return attn @ V, attn
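To sanity-check the function, we can feed it random tensors together with a causal mask; the shapes below are arbitrary. A lower-triangular mask means position i can only attend to positions up to i.

batch, seq_len, d_k = 2, 5, 16
Q = torch.randn(batch, seq_len, d_k)
K = torch.randn(batch, seq_len, d_k)
V = torch.randn(batch, seq_len, d_k)

# Causal (lower-triangular) mask: zeros above the diagonal block future positions
causal_mask = torch.tril(torch.ones(seq_len, seq_len))

out, attn = scaled_dot_product_attention(Q, K, V, mask=causal_mask)
print(out.shape)   # torch.Size([2, 5, 16])
print(attn[0, 0])  # first query row: all weight on position 0

Each row of attn sums to 1, and the zeros above the diagonal confirm that no query attends to future keys.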