Before we start, we need a few definitions from information theory, beginning with entropy.
Surprisal = $-\log p(x)$
Entropy = $-\sum p(x) \log p(x)$
Cross Entropy = $-\sum p(x) \log q(x)$
KL-Divergence = $-\sum p(x) \log q(x) - \left(-\sum p(x) \log p(x)\right) = \sum p(x) \log \frac{p(x)}{q(x)}$
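The definition $I(x) = -\log p(x)$ originates from two core requirements: monotonicity (the less likely an event, the more surprising it is) and additivity (the surprise of two independent events is the sum of their individual surprises).
For independent events $A$ and $B$, $p(A \cap B) = p(A) \cdot p(B)$, and since $\log(p(A) \cdot p(B)) = \log p(A) + \log p(B)$, the logarithm gives exactly $I(A \cap B) = I(A) + I(B)$.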
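To make the two requirements concrete, here is a minimal sketch that checks them numerically. The helper name `surprisal`, the base-2 logarithm (bits), and the probabilities 0.5 and 0.25 are choices made for illustration, not something fixed by the definitions above.

```python
import math

def surprisal(p: float) -> float:
    """Surprisal in bits of an event with probability p (assumes 0 < p <= 1)."""
    return -math.log2(p)

p_a = 0.5    # illustrative: a fair coin lands heads
p_b = 0.25   # illustrative: an independent event with probability 1/4

# Additivity: for independent events, p(A and B) = p(A) * p(B),
# so the surprise of the joint event is the sum of the two surprises.
joint = surprisal(p_a * p_b)
separate = surprisal(p_a) + surprisal(p_b)
print(joint, separate)

# Monotonicity: the rarer event is the more surprising one.
print(surprisal(p_b) > surprisal(p_a))
```

Both printed values are 3 bits (1 bit for the coin plus 2 bits for the second event), and the rarer event carries the larger surprisal.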
$H(x) = \sum p(x) \underbrace{[-\log p(x)]}_{\text{Surprise}}$
Shannon Entropy is simply the average amount of surprise.
It quantifies the expected value of information we get from observing a random variable. In generative modeling, managing this "average surprise" is the key to balancing reconstruction accuracy and latent space continuity.
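As a rough illustration of "average surprise", the sketch below computes the entropy of three coins. The function name `entropy`, the use of bits, and the example distributions are assumptions made for this example; terms with $p(x) = 0$ are skipped, following the usual convention $0 \log 0 = 0$.

```python
import math

def entropy(p):
    """Shannon entropy in bits: the probability-weighted average surprisal.

    Outcomes with p(x) = 0 are skipped (convention: 0 * log 0 = 0).
    """
    return sum(-px * math.log2(px) for px in p if px > 0)

fair = [0.5, 0.5]      # maximally uncertain coin
biased = [0.9, 0.1]    # mostly predictable coin
certain = [1.0, 0.0]   # no uncertainty at all

for name, dist in [("fair", fair), ("biased", biased), ("certain", certain)]:
    print(name, entropy(dist))
```

The fair coin yields 1 bit, the biased coin a little under half a bit, and the certain coin 0 bits: the more predictable the variable, the lower the average surprise.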
$H(p, q) = \mathbb{E}_{x \sim p} [-\log q(x)] = -\sum p(x) \log q(x)$
Cross Entropy: Explaining $p$ through $q$
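Cross Entropy measures the average surprise encountered when we use a model $q$ to describe data that actually follows distribution $p$. If our model is inaccurate, we experience "extra" surprise on top of the entropy of $p$, and that extra amount is exactly the KL-Divergence: $H(p, q) = H(p) + D_{KL}(p \| q)$. Since $H(p)$ does not depend on $q$, minimizing cross entropy over $q$ is equivalent to minimizing the KL-Divergence, which pushes the model's distribution toward the true data distribution.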
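The relationship between the three quantities can be checked on a small example. This is only a sketch: the function names, the base-2 logarithm, and the distributions `p` and `q` are illustrative, and the code assumes $q(x) > 0$ wherever $p(x) > 0$.

```python
import math

def entropy(p):
    """H(p) in bits; outcomes with p(x) = 0 are skipped."""
    return sum(-px * math.log2(px) for px in p if px > 0)

def cross_entropy(p, q):
    """H(p, q) in bits: average surprise under model q when data follows p."""
    return sum(-px * math.log2(qx) for px, qx in zip(p, q) if px > 0)

def kl_divergence(p, q):
    """D_KL(p || q) in bits: the extra surprise caused by using q instead of p."""
    return sum(px * math.log2(px / qx) for px, qx in zip(p, q) if px > 0)

p = [0.5, 0.25, 0.25]   # illustrative "true" distribution
q = [0.25, 0.25, 0.5]   # illustrative model that misplaces probability mass

h_p = entropy(p)
h_pq = cross_entropy(p, q)
kl = kl_divergence(p, q)
print(h_p, h_pq, kl)

# Cross entropy = entropy + KL divergence.
print(math.isclose(h_pq, h_p + kl))
# A perfect model (q = p) incurs no extra surprise.
print(math.isclose(cross_entropy(p, p), h_p))
```

Here $H(p) = 1.5$ bits, $H(p, q) = 1.75$ bits, and the gap of $0.25$ bits is the KL-Divergence. Setting $q = p$ closes the gap, which is why driving cross entropy down moves the model toward the data distribution.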