https://blog.eleuther.ai/transformer-math/
https://kipp.ly/transformer-inference-arithmetic/
Why we take the log of probabilities in the forward pass and when calculating the loss
Difference between Gradient Descent and Gradient Ascent
Why large weights are not good for a model and make it highly sensitive to small input changes
Numerical Stability with Smaller Weights
Evolution of Attention Mechanisms
LLM Inference Engines and Optimization