Inspired by CSC420, a computer vision course I took in university

Transformers power modern large language models (LLMs) like GPT, BERT, and LLaMA. But behind the buzzwords, they’re built from surprisingly elegant building blocks. In this guide, we’ll build a minimal Transformer from scratch in PyTorch.

Why Transformers?

Before 2017, sequence models were dominated by RNNs and LSTMs. The Transformer, introduced in Attention Is All You Need (Vaswani et al., 2017), replaced recurrence with self-attention, enabling:

  - Parallel computation over all positions in a sequence during training, rather than step-by-step recurrence
  - Direct connections between any two tokens, which makes long-range dependencies easier to learn
  - Architectures that scale well with data and compute

At its core, a Transformer consists of three pieces, combined in the short sketch after this list:

  1. Token embeddings + positional encodings
  2. Repeated blocks of multi-head self-attention and feed-forward networks
  3. A final projection for tasks (e.g., language modeling, classification)
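
As a roadmap, here is a rough, self-contained sketch of how those three pieces fit together. The class name MiniTransformerLM is made up for illustration, and it leans on PyTorch's built-in nn.TransformerEncoderLayer plus a learned position embedding as stand-ins for the components we build by hand in the rest of this guide.

import torch
import torch.nn as nn

class MiniTransformerLM(nn.Module):
    def __init__(self, vocab_size, d_model=128, n_heads=4, n_layers=2, max_len=512):
        super().__init__()
        self.token_embed = nn.Embedding(vocab_size, d_model)   # 1. token embeddings
        self.pos_embed = nn.Embedding(max_len, d_model)        #    learned positions (stand-in for sinusoidal encodings below)
        self.blocks = nn.ModuleList([                          # 2. attention + feed-forward blocks (stand-in layers)
            nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
            for _ in range(n_layers)
        ])
        self.lm_head = nn.Linear(d_model, vocab_size)          # 3. projection to vocabulary logits

    def forward(self, tokens):
        # tokens: (batch, seq_len) of token ids
        positions = torch.arange(tokens.size(1), device=tokens.device)
        x = self.token_embed(tokens) + self.pos_embed(positions)  # positions broadcast over the batch
        for block in self.blocks:
            x = block(x)
        return self.lm_head(x)  # (batch, seq_len, vocab_size)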

1. Token & Positional Embeddings

Self-attention on its own is order-agnostic: permuting the input tokens simply permutes the output. To give the model a sense of position, we add positional encodings to the token embeddings; the original paper uses fixed sinusoidal encodings.

import torch
import torch.nn as nn
import math

class PositionalEncoding(nn.Module):
    def __init__(self, d_model, max_len=5000):
        super().__init__()
        pe = torch.zeros(max_len, d_model)
        position = torch.arange(0, max_len).unsqueeze(1)  # (max_len, 1)
        # Geometric progression of frequencies across dimensions
        div_term = torch.exp(
            torch.arange(0, d_model, 2) * -(math.log(10000.0) / d_model)
        )
        pe[:, 0::2] = torch.sin(position * div_term)  # even dimensions
        pe[:, 1::2] = torch.cos(position * div_term)  # odd dimensions
        pe = pe.unsqueeze(0)  # shape: (1, max_len, d_model)
        # Buffer: saved with the module and moved across devices, but not a trainable parameter
        self.register_buffer("pe", pe)

    def forward(self, x):
        # x: (batch, seq_len, d_model); add the encodings for the first seq_len positions
        return x + self.pe[:, :x.size(1)]
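
To see the module in action, here is a tiny check with made-up sizes: embed a batch of token ids, then add positional information on top.

# Toy usage (arbitrary sizes): token embeddings + positional encodings
vocab_size, d_model = 1000, 64
embed = nn.Embedding(vocab_size, d_model)
pos_enc = PositionalEncoding(d_model)

tokens = torch.randint(0, vocab_size, (2, 10))  # (batch=2, seq_len=10) token ids
x = pos_enc(embed(tokens))                      # (2, 10, 64): embeddings with positions added
print(x.shape)                                  # torch.Size([2, 10, 64])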

2. Scaled Dot-Product Attention

The heart of the Transformer:

$$ \text{Attention}(Q, K, V) = \text{softmax}\left(\frac{QK^T}{\sqrt{d_k}}\right)V $$

Here d_k is the dimensionality of the keys; scaling by 1/sqrt(d_k) keeps the dot products from growing so large that the softmax saturates.

def scaled_dot_product_attention(Q, K, V, mask=None):
    # Q, K, V: (..., seq_len, d_k); mask: broadcastable to (..., seq_len, seq_len)
    d_k = Q.size(-1)
    scores = Q @ K.transpose(-2, -1) / math.sqrt(d_k)  # (..., seq_len, seq_len)
    if mask is not None:
        # Positions where mask == 0 get a large negative score, so softmax gives them ~0 weight
        scores = scores.masked_fill(mask == 0, -1e9)
    attn = scores.softmax(dim=-1)  # attention weights; each row sums to 1
    return attn @ V, attn          # weighted sum of values, plus the weights themselves
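
As a quick sanity check (sizes are arbitrary), we can run self-attention on a random tensor with a lower-triangular mask. This is the causal masking used in language models, where position i may only attend to positions at or before i.

# Toy self-attention: the same tensor plays the roles of Q, K, and V
batch, seq_len, d_k = 2, 5, 16
x = torch.randn(batch, seq_len, d_k)

# Causal mask: 1 where attention is allowed, 0 where it is blocked
causal_mask = torch.tril(torch.ones(seq_len, seq_len))

out, attn = scaled_dot_product_attention(x, x, x, mask=causal_mask)
print(out.shape, attn.shape)  # torch.Size([2, 5, 16]) torch.Size([2, 5, 5])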

3. Multi-Head Attention