These are two foundational questions that touch on the practical engineering and mathematical theory of Deep Learning.
Here is the breakdown of why we do it this way.
You are correct that it is "easier," but the main reason is survival against the limits of computer arithmetic.
Probabilities are always between 0 and 1. When you train a model on a sequence of text, you are essentially computing the probability of the entire sequence. To do this, you multiply the probability of each word together:
$$P(\text{sequence}) = p_1 \times p_2 \times p_3 \times \dots \times p_{100}$$
Computers have a hard limit on how small a number they can store (floating-point underflow). Multiply enough tiny numbers together and the result collapses to exactly zero, taking the gradient signal with it.
The Log Fix:
Logarithms map these tiny numbers to manageable negative numbers.
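Because the log of a product is the sum of the logs, the sequence probability above becomes:

$$\log P(\text{sequence}) = \log p_1 + \log p_2 + \dots + \log p_{100}$$

A sum of ordinary negative numbers never underflows the way a product of tiny numbers does.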
For example, a probability of $10^{-100}$ has a natural log of roughly $-230.25$. A computer stores $-230.25$ easily. It cannot handle 0.000...001.
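Here is a minimal sketch of the problem and the fix. The sequence length (2000) and the per-token probability (0.5) are illustrative values chosen to force underflow, not numbers from any real model:

```python
import math

# A toy "sequence" of 2000 tokens, each assigned probability 0.5.
# (Illustrative values, not from a real language model.)
probs = [0.5] * 2000

# Naive product: drops below float64's smallest representable
# value (~5e-324) and collapses to exactly 0.0.
product = 1.0
for p in probs:
    product *= p
print(product)  # 0.0

# Log-space sum: the same quantity, stored as an ordinary
# negative number that float64 handles with no trouble.
log_prob = sum(math.log(p) for p in probs)
print(log_prob)  # roughly -1386.29, i.e. 2000 * ln(0.5)
```

The log-space version loses no information: you can always compare two sequences by their log-probabilities directly, since log is monotonic.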