In this post, we'll implement a GPT from scratch in just 60 lines of numpy. We'll then load the trained GPT-2 model weights released by OpenAI into our implementation and generate some text.

Note:

EDIT (Feb 9th, 2023): Added a "What's Next" section and updated the intro with some notes.

EDIT (Feb 28th, 2023): Added some additional sections to "What's Next".

What is a GPT?

GPT stands for Generative Pre-trained Transformer. It's a type of neural network architecture based on the Transformer. Jay Alammar's How GPT3 Works is an excellent introduction to GPTs at a high level, but here's the tl;dr:

Large Language Models (LLMs) like OpenAI's GPT-3, Google's LaMDA, and Cohere's Command XLarge are just GPTs under the hood. What makes them special is they happen to be 1) very big (billions of parameters) and 2) trained on lots of data (hundreds of gigabytes of text).

Fundamentally, a GPT generates text given a prompt. Even with this very simple API (input = text, output = text), a well-trained GPT can do some pretty awesome stuff like write your emails, summarize a book, give you Instagram caption ideas, explain black holes to a 5-year-old, code in SQL, and even write your will.
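
As a rough illustration of that text-in, text-out interface, here's a minimal sketch. The complete_text function and its canned reply are invented placeholders for this post, not part of the actual implementation; a real GPT would produce the continuation itself, one token at a time:

def complete_text(prompt: str) -> str:
    # placeholder: a real GPT would predict the continuation token by token
    return "Dear Alex, unfortunately I won't be able to make it on Friday..."

print(complete_text("Write a short email declining a meeting invite:"))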

So that's a high-level overview of GPTs and their capabilities. Let's dig into some more specifics.

Input / Output

The function signature for a GPT looks roughly like this:

def gpt(inputs: list[int]) -> list[list[float]]:
    # inputs has shape [n_seq]
    # output has shape [n_seq, n_vocab]
    output = ...  # beep boop neural network magic
    return output
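
To make that signature concrete, here is a minimal sketch of how such a function might be called to predict a next token. The gpt stub below and its tiny vocabulary are made up purely for illustration (the real network is what the rest of this post builds):

import numpy as np

def gpt(inputs: list[int]) -> list[list[float]]:
    # stand-in for the real model: random scores over a toy 5-token vocabulary
    n_vocab = 5
    return np.random.rand(len(inputs), n_vocab).tolist()

inputs = [2, 0, 3]                          # token ids for some encoded prompt
output = gpt(inputs)                        # shape [n_seq, n_vocab]
next_token_id = int(np.argmax(output[-1]))  # highest-scoring token for the next position
print(next_token_id)

The takeaway is that each row of the output scores every token in the vocabulary for the position that follows it, so the last row is the one you'd use to pick the next token when generating text.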