- Came out in Feb 2023

Pre-requisites:
- Structure of the Transformer model
- How attention mechanisms work
- Training and inference of the transformer model
Topics:
- Architectural differences between Vanilla Transformer and LLaMA
- RMS normalization
- Rotary positional embeddings
- KV cache
- Multi-query attention
- Grouped multi-query attention
- SwiGLU activation function
Differences between the Vanilla Transformer and LLaMA:
- LLaMA uses pre-normalization: each normalization layer is applied before its sub-block (attention or feed-forward), whereas the vanilla Transformer applies normalization after each sub-block (post-normalization)
- The vanilla Transformer is an encoder-decoder model, while LLaMA is decoder-only
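The pre-norm vs post-norm difference above can be sketched in a few lines. This is a minimal illustration, not the actual LLaMA code: the `sublayer` stands in for an attention or feed-forward block, and the function names are my own. It also uses RMSNorm (which LLaMA substitutes for LayerNorm) since it appears in the topics list.

```python
import numpy as np

def rms_norm(x, g, eps=1e-6):
    # RMSNorm: rescale by the reciprocal root-mean-square of the last axis.
    # Unlike LayerNorm, there is no mean subtraction and no bias term.
    rms = np.sqrt(np.mean(x * x, axis=-1, keepdims=True) + eps)
    return x / rms * g

def post_norm_block(x, sublayer, g):
    # Vanilla Transformer: normalize AFTER the residual addition
    return rms_norm(x + sublayer(x), g)

def pre_norm_block(x, sublayer, g):
    # LLaMA: normalize BEFORE the sublayer; the residual path stays un-normalized
    return x + sublayer(rms_norm(x, g))
```

Pre-normalization keeps the residual stream free of normalization, which tends to make training of deep stacks more stable; the original Transformer's post-norm placement normalizes the residual sum itself.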