Before we start, we need a few definitions from information theory, beginning with entropy.

Surprisal = $-\log p(x)$

Entropy = $-\sum p(x)\log p(x)$

Cross Entropy = $-\sum p(x) \log q(x)$

KL-Divergence = $-\sum p(x) \log q(x) - \left(-\sum p(x)\log p(x)\right) = \sum p(x)\log \frac{p(x)}{q(x)}$

Surprisal

The definition $I(x) = -\log p(x)$ originates from two core requirements: Monotonicity and Additivity.

For two independent events $A$ and $B$: $p(A \cap B) = p(A) \cdot p(B)$, and $\log(p(A) \cdot p(B)) = \log p(A) + \log p(B)$

  1. Monotonicity: Surprise must decrease as probability increases; the rarer the event, the higher the information content.
  2. Additivity: When two independent events occur, their joint probability is multiplicative $(p(A)p(B))$, but their total surprise should be additive. Up to the choice of base, the logarithm is the only continuous function that turns this product into a sum, and the leading minus sign makes the result non-negative and decreasing in $p$.
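Both requirements can be checked numerically. The following is a minimal sketch, assuming base-2 logarithms (so surprisal is measured in bits); the function name `surprisal` and the example probabilities are illustrative choices, not part of the definitions above.

```python
import math

def surprisal(p: float) -> float:
    """Surprisal -log2(p), in bits, of an event with probability p in (0, 1]."""
    if not 0.0 < p <= 1.0:
        raise ValueError("p must lie in (0, 1]")
    return -math.log2(p)

# Monotonicity: the rarer event carries more surprise.
rare, common = 0.1, 0.9
print(surprisal(rare) > surprisal(common))

# Additivity: for independent events the joint probability is the product,
# and the surprisal of the joint event is the sum of the individual surprisals.
p_a, p_b = 0.5, 0.25  # hypothetical probabilities of two independent events
joint = p_a * p_b
print(surprisal(joint), surprisal(p_a) + surprisal(p_b))
```

Choosing a different base only rescales every value by a constant factor, so both properties hold for natural logarithms as well.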

Shannon’s Entropy (from here on, we treat $p(x)$ as a probability distribution)

$H(x) = \sum p(x) \underbrace{[-\log p(x)]}_{\text{Surprise}}$

Shannon entropy is simply the average amount of surprise.

It quantifies the expected value of information we get from observing a random variable. In generative modeling, managing this "average surprise" is the key to balancing reconstruction accuracy and latent space continuity.
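To make "average surprise" concrete, here is a small sketch that evaluates the entropy formula for a few discrete distributions. It assumes base-2 logarithms and uses the usual convention that outcomes with $p(x) = 0$ contribute nothing to the sum; the function name `entropy` and the example distributions are illustrative.

```python
import math

def entropy(p):
    """Shannon entropy in bits: the probability-weighted average surprisal.

    Outcomes with zero probability are skipped (convention: 0 * log 0 = 0).
    """
    return sum(-px * math.log2(px) for px in p if px > 0)

fair_coin = [0.5, 0.5]      # maximally uncertain two-outcome variable
biased_coin = [0.9, 0.1]    # mostly predictable
certain = [1.0, 0.0]        # no uncertainty at all

for name, dist in [("fair", fair_coin), ("biased", biased_coin), ("certain", certain)]:
    print(name, entropy(dist))
```

The fair coin has the highest entropy of the three, the biased coin less, and the deterministic outcome none, which matches the reading of entropy as expected surprise.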

Cross Entropy

$H(p, q) = \mathbb{E}_{x \sim p} [-\log q(x)] = -\sum p(x) \log q(x)$

Cross Entropy: Explaining $p$ through $q$

Cross Entropy measures the average surprise encountered when we use a model q to describe data that actually follows distribution p. If our model is inaccurate, we experience "extra" surprise. Therefore, minimizing cross entropy is equivalent to forcing our model to align its expectations with the physical reality of the data.
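The "extra" surprise can be seen directly by comparing $H(p, q)$ for models of different quality. This is a sketch under assumptions: the distributions `p`, `q_good`, and `q_bad` are made-up examples, logarithms are base 2, and a model that assigns zero probability to an outcome that actually occurs is treated as infinitely surprised.

```python
import math

def cross_entropy(p, q):
    """Cross entropy H(p, q) in bits: average surprisal under model q
    when the data actually follow p."""
    total = 0.0
    for px, qx in zip(p, q):
        if px == 0:
            continue          # outcomes that never occur contribute nothing
        if qx == 0:
            return math.inf   # model rules out something that does happen
        total -= px * math.log2(qx)
    return total

p = [0.7, 0.2, 0.1]        # hypothetical true distribution
q_good = [0.6, 0.3, 0.1]   # model close to p
q_bad = [0.1, 0.2, 0.7]    # model far from p

baseline = cross_entropy(p, p)  # equals the entropy of p
for name, q in [("good", q_good), ("bad", q_bad)]:
    h = cross_entropy(p, q)
    print(name, h, "extra surprise:", h - baseline)
```

The gap `h - baseline` is never negative and shrinks to zero only when `q` matches `p`; it is the difference $H(p, q) - H(p)$ that the opening formulas call the KL-Divergence.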

KL-Divergence