https://blog.eleuther.ai/transformer-math/
https://kipp.ly/transformer-inference-arithmetic/
Why do we take the log of the probabilities in the forward pass when calculating the loss?
Difference between Gradient Descent and Gradient Ascent
Why large weights are bad for a model and make it highly sensitive to small changes in its input
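
A minimal sketch for poking at the three questions above, using only the Python standard library. All function names here are hypothetical and chosen for illustration; none of this is taken from the linked posts. It assumes three toy settings: a long sequence of small probabilities (to compare multiplying them against summing their logs), the function f(x) = x^2 (to compare a descent step against an ascent step), and a single-weight linear model y = w * x (to see how the weight scales the effect of an input change).

```python
import math


def product_of_probs(probs):
    """Multiply probabilities directly; can underflow to 0.0 for long sequences."""
    result = 1.0
    for p in probs:
        result *= p
    return result


def sum_of_log_probs(probs):
    """Sum the log-probabilities instead; stays finite where the product underflows."""
    return sum(math.log(p) for p in probs)


def gradient_step(x, grad_fn, lr, ascent=False):
    """One update step: descent subtracts lr * gradient, ascent adds it."""
    sign = 1.0 if ascent else -1.0
    return x + sign * lr * grad_fn(x)


def linear_output_change(weight, input_change):
    """For y = w * x, an input change dx moves the output by w * dx."""
    return weight * input_change


if __name__ == "__main__":
    # 1. Log-probabilities: 400 tokens, each with probability 0.1 (toy values).
    probs = [0.1] * 400
    print("product:", product_of_probs(probs))
    print("sum of logs:", sum_of_log_probs(probs))

    # 2. Descent vs. ascent on f(x) = x^2, whose gradient is 2x.
    grad = lambda x: 2.0 * x
    print("descent step from 1.0:", gradient_step(1.0, grad, lr=0.1))
    print("ascent step from 1.0:", gradient_step(1.0, grad, lr=0.1, ascent=True))

    # 3. Same small input change, small weight vs. large weight.
    print("w = 0.5:", linear_output_change(0.5, 0.01))
    print("w = 100:", linear_output_change(100.0, 0.01))
```

In this toy setup the direct product collapses to zero in double precision while the sum of logs stays finite, the descent step moves toward the minimum of x^2 while the ascent step moves away from it, and the larger weight amplifies the same input change by a proportionally larger amount.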