https://blog.eleuther.ai/transformer-math/

https://kipp.ly/transformer-inference-arithmetic/

Positional Embeddings

Skip Connections

Why we take the log of the probabilities in the forward pass and when calculating the loss

Difference between Gradient Descent and Gradient Ascent

Why large weights are bad for a model and make it highly sensitive to small input changes

Numerical Stability with Smaller Weights

Adam Optimizer

AdamW and Weight Decay
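
For the last two topics, a minimal sketch of a single AdamW update on one scalar parameter may help show where weight decay enters. The function name adamw_step and its default hyperparameters are illustrative assumptions, not taken from any particular library. The decay is applied directly to the weight ("decoupled") rather than being added to the gradient as an L2 penalty, which is the difference usually drawn between AdamW and Adam with L2 regularization.

```python
import math

def adamw_step(w, g, m, v, t, lr=1e-3, beta1=0.9, beta2=0.999,
               eps=1e-8, weight_decay=1e-2):
    """One AdamW update for a single scalar parameter (illustrative sketch).

    w: current weight, g: gradient of the loss w.r.t. w,
    m, v: running first and second moment estimates, t: step count (from 1).
    """
    # Exponential moving averages of the gradient and the squared gradient.
    m = beta1 * m + (1 - beta1) * g
    v = beta2 * v + (1 - beta2) * g * g

    # Bias correction: m and v start at zero, so early estimates are too small.
    m_hat = m / (1 - beta1 ** t)
    v_hat = v / (1 - beta2 ** t)

    # Decoupled weight decay: shrink the weight directly, independent of g.
    w = w - lr * weight_decay * w

    # Adam step: gradient direction scaled by the running gradient magnitude.
    w = w - lr * m_hat / (math.sqrt(v_hat) + eps)
    return w, m, v
```

With a zero gradient the Adam term vanishes and only the decay acts, so the weight shrinks by a factor of (1 - lr * weight_decay) per step. With weight_decay=0 the function reduces to a plain Adam step.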