- Came out in Feb 2023

Pre-requisites:
- Structure of the Transformer model
- How attention mechanisms work
- Training and inference of the transformer model
Topics:
- Architectural differences between Vanilla Transformer and LLaMA
- RMS normalization
- Rotary positional embeddings
- KV cache
- Multi-query attention
- Grouped multi-query attention
- SwiGLU activation function
Differences between the Vanilla Transformer and LLaMA:
- LLaMA uses pre-normalization: each normalization layer is applied before its sub-block (attention or feed-forward), whereas the vanilla Transformer applies normalization after each sub-block (post-normalization)
- The vanilla Transformer is an encoder-decoder model, while LLaMA is decoder-only
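The pre-norm vs post-norm difference above can be sketched in a few lines. This is a minimal illustration, not the actual LLaMA code: the `sublayer` stands in for an attention or feed-forward block, and the function names are my own. It also uses RMSNorm (which LLaMA substitutes for LayerNorm) since it appears in the topics list.

```python
import numpy as np

def rms_norm(x, g, eps=1e-6):
    # RMSNorm: rescale by the reciprocal root-mean-square of the last axis.
    # Unlike LayerNorm, there is no mean subtraction and no bias term.
    rms = np.sqrt(np.mean(x * x, axis=-1, keepdims=True) + eps)
    return x / rms * g

def post_norm_block(x, sublayer, g):
    # Vanilla Transformer: normalize AFTER the residual addition
    return rms_norm(x + sublayer(x), g)

def pre_norm_block(x, sublayer, g):
    # LLaMA: normalize BEFORE the sublayer; the residual path stays un-normalized
    return x + sublayer(rms_norm(x, g))
```

Pre-normalization keeps the residual stream free of normalization, which tends to make training of deep stacks more stable; the original Transformer's post-norm placement normalizes the residual sum itself.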