paper: Language Models are Unsupervised Multitask Learners (GPT-2)

- GPT-2 weights are public, and Llama weights too

we will code the architecture of the GPT-2 model

GPT_CONFIG_124M = {
    "vocab_size": 50257,    # Vocabulary size
    "context_length": 1024, # Context length
    "emb_dim": 768,         # Embedding dimension
    "n_heads": 12,          # Number of attention heads
    "n_layers": 12,         # Number of layers
    "drop_rate": 0.1,       # Dropout rate
    "qkv_bias": False       # Query-Key-Value bias
}
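
a quick sanity check on the config: a few sizes that follow directly from it. This is only a sketch; derived_sizes is a hypothetical helper (not part of any library), and it assumes the embedding dimension is split evenly across heads and that positions use a learned table with one row per context position.

```python
GPT_CONFIG_124M = {
    "vocab_size": 50257,    # Vocabulary size
    "context_length": 1024, # Context length
    "emb_dim": 768,         # Embedding dimension
    "n_heads": 12,          # Number of attention heads
    "n_layers": 12,         # Number of layers
    "drop_rate": 0.1,       # Dropout rate
    "qkv_bias": False       # Query-Key-Value bias
}


def derived_sizes(cfg):
    """Hypothetical helper: sizes implied by the config (assumptions in the notes above)."""
    emb_dim, n_heads = cfg["emb_dim"], cfg["n_heads"]
    if emb_dim % n_heads != 0:
        raise ValueError("emb_dim must be divisible by n_heads")
    return {
        # each attention head works on an equal slice of the embedding
        "head_dim": emb_dim // n_heads,
        # token embedding table: one row per vocabulary entry
        "tok_emb_params": cfg["vocab_size"] * emb_dim,
        # positional embedding table: one row per position (assumed learned)
        "pos_emb_params": cfg["context_length"] * emb_dim,
    }


sizes = derived_sizes(GPT_CONFIG_124M)
print(sizes)
```

so each of the 12 heads sees a 64-dimensional slice, and the token embedding table alone holds about 38.6M of the parameters.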

layer normalization -

layer norm improves the stability and efficiency of neural network training

main idea of layer norm: adjust the outputs (activations) of a layer to have mean 0 and variance 1, computed per token across the embedding dimension
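
the idea above can be sketched in plain Python for a single activation vector. This is a minimal sketch, not the exact implementation of any library: eps (to avoid division by zero) and the optional scale and shift vectors are assumptions taken from the usual formulation of layer norm, and the variance divides by n (biased estimate).

```python
import math


def layer_norm(x, scale=None, shift=None, eps=1e-5):
    """Normalize one activation vector to mean 0 and variance 1.

    scale and shift stand in for trainable parameters (assumed here);
    by default they leave the normalized values unchanged.
    """
    n = len(x)
    mean = sum(x) / n
    var = sum((v - mean) ** 2 for v in x) / n  # biased variance (divide by n)
    normed = [(v - mean) / math.sqrt(var + eps) for v in x]
    if scale is None:
        scale = [1.0] * n
    if shift is None:
        shift = [0.0] * n
    return [s * v + b for s, v, b in zip(scale, normed, shift)]


activations = [0.2, 1.5, -0.7, 3.0, 0.0, 0.9]  # made-up example values
out = layer_norm(activations)

out_mean = sum(out) / len(out)
out_var = sum((v - out_mean) ** 2 for v in out) / len(out)
print(out_mean, out_var)  # mean close to 0, variance close to 1
```

in a real model the same computation runs independently for every token's embedding vector, which is why it does not depend on the batch size.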

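in GPT-2 it is applied before the multi-head attention module and before the feed-forward module within each transformer block (so both before and after attention), and once more before the final output layer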

  1. if a layer's output is too large or too small, gradient magnitudes can become too large or too small (exploding or vanishing gradients)

this affects training → layer normalization helps keep gradients stable

  2. as training proceeds, the distribution of the inputs to each layer can change (internal covariate shift)

this delays convergence → layer norm reduces this by re-normalizing the activations at every step
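
both points can be illustrated with a small experiment: if the activations drift to a much larger scale and offset, the normalized values stay (almost) the same, so the next layer keeps seeing inputs in the same range. This is only a sketch; the drift factors 100 and 50 and the example values are made up, and normalize is a bare layer norm without the trainable scale and shift.

```python
import math


def normalize(x, eps=1e-5):
    """Bare layer norm: subtract the mean, divide by the standard deviation."""
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [(v - mean) / math.sqrt(var + eps) for v in x]


base = [0.2, 1.5, -0.7, 3.0, 0.0, 0.9]         # made-up activations
drifted = [100.0 * v + 50.0 for v in base]     # hypothetical drift in scale and offset

norm_base = normalize(base)
norm_drifted = normalize(drifted)

# largest magnitude before and after normalization
raw_peak = max(abs(v) for v in drifted)
norm_peak = max(abs(v) for v in norm_drifted)

# largest element-wise difference between the two normalized vectors
max_gap = max(abs(a - b) for a, b in zip(norm_base, norm_drifted))
print(raw_peak, norm_peak, max_gap)
```

the raw values grow to the hundreds, but after normalization they stay within a few units and the two normalized vectors differ only by a tiny amount caused by eps.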