Before we start, we need a few definitions from information theory, beginning with entropy.
Surprisal = $-\log p(x)$
Entropy = $-\sum p(x)\log p(x)$
Cross Entropy = $-\sum p(x) \log q(x)$
KL Divergence = $-\sum p(x) \log q(x)-\left(-\sum p(x)\log p(x)\right)=\sum p(x)\log \frac{p(x)}{q(x)}$
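
To make the four definitions concrete, here is a minimal Python sketch that evaluates them for two small discrete distributions. The function names and the example distributions `p` and `q` are illustrative assumptions, not part of the original notes, and natural logarithms are used, so the results are in nats.

```python
import math

def surprisal(p_x):
    """Surprisal of a single outcome with probability p_x."""
    return -math.log(p_x)

def entropy(p):
    """Entropy: the probability-weighted average surprisal under p."""
    return sum(p_x * surprisal(p_x) for p_x in p if p_x > 0)

def cross_entropy(p, q):
    """Cross entropy: average surprisal under q, weighted by the true p."""
    return -sum(p_x * math.log(q_x) for p_x, q_x in zip(p, q) if p_x > 0)

def kl_divergence(p, q):
    """KL divergence: sum of p(x) * log(p(x) / q(x))."""
    return sum(p_x * math.log(p_x / q_x) for p_x, q_x in zip(p, q) if p_x > 0)

# Hypothetical example distributions over three outcomes.
p = [0.5, 0.25, 0.25]
q = [0.4, 0.4, 0.2]

h_p = entropy(p)
h_pq = cross_entropy(p, q)
kl = kl_divergence(p, q)

# KL divergence equals cross entropy minus entropy, up to rounding error.
print(h_p, h_pq, kl, h_pq - h_p)
```

The last two printed values should agree, which is exactly the identity on the KL Divergence line: cross entropy minus entropy.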
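The definition $I(x) = -\log p(x)$ originates from two core requirements: Monotonicity (the less likely an event, the more surprising it is) and Additivity (the surprise of two independent events is the sum of their individual surprises).
For independent events $A$ and $B$, $p(A \cap B) = p(A) \cdot p(B)$, and since $\log(p(A) \cdot p(B)) = \log p(A) + \log p(B)$, it follows that $I(A \cap B) = I(A) + I(B)$.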
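
Both requirements can be checked numerically. The sketch below assumes two independent events chosen purely for illustration (a fair coin landing heads and a fair die showing six) and uses base-2 logarithms, so surprisal is measured in bits.

```python
import math

def surprisal_bits(p_x):
    """Surprisal in bits: -log2 of the event's probability."""
    return -math.log2(p_x)

# Hypothetical independent events.
p_heads = 1 / 2   # fair coin lands heads
p_six = 1 / 6     # fair die shows six

# Independence: the joint probability is the product.
p_joint = p_heads * p_six

# Additivity: the surprisal of the joint event is the sum of the parts.
additive = math.isclose(
    surprisal_bits(p_joint),
    surprisal_bits(p_heads) + surprisal_bits(p_six),
)

# Monotonicity: the rarer event carries more surprisal.
monotone = surprisal_bits(p_six) > surprisal_bits(p_heads)

print(additive, monotone)
```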
$H(x) = \sum p(x) \underbrace{[-\log p(x)]}_{\text{Surprise}}$
Shannon entropy is simply the average amount of surprise.
It quantifies the expected amount of information we gain from observing a random variable. In generative modeling, managing this "average surprise" is the key to balancing reconstruction accuracy and latent space continuity.
$H(p, q) = \mathbb{E}_{x \sim p} [-\log q(x)] = -\sum p(x) \log q(x)$
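
The expectation form suggests a direct way to estimate cross entropy: draw samples from $p$ and average $-\log q(x)$ over them. The following is a minimal sketch of that idea, assuming the same kind of small discrete distributions as before; the distributions, the sample size, and the seed are all illustrative choices.

```python
import math
import random

# Hypothetical true distribution p and model distribution q.
p = [0.5, 0.25, 0.25]
q = [0.4, 0.4, 0.2]

# Exact value from the sum form: -sum p(x) log q(x).
exact = -sum(p_x * math.log(q_x) for p_x, q_x in zip(p, q))

# Monte Carlo estimate from the expectation form:
# sample x ~ p, then average the surprisal assigned by q.
rng = random.Random(0)  # fixed seed so the run is repeatable
n_samples = 100_000
samples = rng.choices(range(len(p)), weights=p, k=n_samples)
estimate = sum(-math.log(q[x]) for x in samples) / n_samples

print(exact, estimate)
```

With enough samples the estimate should approach the exact value. This sampled average is the form typically used in practice, where $p$ is only available through data.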
Cross Entropy: Explaining $p$ through $q$