Pre-requisites:

  1. Structure of the Transformer model
  2. How attention mechanisms work
  3. Training and inference of the Transformer model

Topics:

Difference between Transformer and LLaMA

  1. Normalization is applied before each sub-block (pre-normalization), whereas the original Transformer applies normalization after each sub-block (post-normalization). LLaMA also uses RMSNorm in place of LayerNorm.
  2. The Transformer is an encoder-decoder model, while LLaMA is decoder-only.
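
The pre-norm vs. post-norm difference can be sketched as follows. This is an illustrative toy in NumPy, not the real LLaMA code: `sublayer` stands in for attention or the feed-forward network, and plain LayerNorm is used here even though LLaMA actually uses RMSNorm.

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    # Normalize the last dimension to zero mean, unit variance.
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mean) / np.sqrt(var + eps)

def post_norm_block(x, sublayer):
    # Original Transformer: apply the sublayer, add the residual,
    # THEN normalize the result.
    return layer_norm(x + sublayer(x))

def pre_norm_block(x, sublayer):
    # LLaMA-style: normalize the input first, apply the sublayer,
    # then add the residual. The residual stream itself is never
    # normalized, which tends to make deep stacks easier to train.
    return x + sublayer(layer_norm(x))

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    x = rng.normal(size=(2, 4, 8))       # (batch, seq, hidden)
    ff = lambda h: np.tanh(h)            # toy stand-in sublayer
    print(post_norm_block(x, ff).shape)  # same shape in, same shape out
    print(pre_norm_block(x, ff).shape)
```

Note the observable difference: the post-norm output always has (near-)zero mean per position because normalization is the last step, while the pre-norm output does not, since the raw residual is added after the sublayer.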