Limitations of the Empirical Fisher Approximation for Natural Gradient Descent

What?

This paper disambiguates the Fisher information matrix from the empirical Fisher information (which in turn differs from the “observed (Fisher) information” described on Wikipedia).

In particular, it shows that the empirical Fisher information generally does not capture second-order information.

Using the empirical Fisher information instead of the Fisher information for natural gradient descent leads to pathologies. The empirical Fisher information does approximate the Fisher information under certain conditions, but these conditions are “unlikely to be met in practice.”

The paper provides another explanation for why the empirical Fisher information can be useful: it is an empirical estimate of the “gradient’s non-central second moment,” which follows directly from its definition. The argument is interesting nonetheless.

Why?

Different papers and sources use different definitions of the Fisher information matrix or the empirical Fisher information. This causes confusion.

(For example, and this example is mine rather than the paper's: Murphy’s “Machine Learning - A Probabilistic Perspective” directly defines the Fisher information as the Hessian of the negative log-likelihood. Note that the second edition will fix this.)

Works cited in the paper sometimes use the empirical Fisher in place of the Fisher and vice versa, including when implementing baselines.

So there seems to be confusion about the two, which is obstructing progress.

How?

Fisher information vs empirical Fisher information

For a simple generative model $p_\theta(z)$, the Fisher information is canonically defined as:

$$ \mathrm{F}(\theta) := \mathbb{E}_{p_{\theta}(z)}\left[\nabla_{\theta} \log p_{\theta}(z) \, \nabla_{\theta} \log p_{\theta}(z)^{T}\right] = \mathbb{E}_{p_{\theta}(z)}\left[-\nabla_{\theta}^{2} \log p_{\theta}(z)\right]. $$
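
As a quick sanity check of the identity above, here is a minimal sketch for a Bernoulli model $p_\theta(z) = \theta^z (1-\theta)^{1-z}$. The model and the function names (`score`, `neg_hessian`, `fisher_outer_product`, `fisher_neg_hessian`) are my own illustrative choices and not from the paper. Because $z$ only takes two values, both expectations can be computed exactly by summing over $z \in \{0, 1\}$.

```python
def pmf(theta, z):
    # Bernoulli probability p_theta(z) for z in {0, 1}.
    return theta**z * (1 - theta) ** (1 - z)

def score(theta, z):
    # d/dtheta log p_theta(z)
    return z / theta - (1 - z) / (1 - theta)

def neg_hessian(theta, z):
    # -d^2/dtheta^2 log p_theta(z)
    return z / theta**2 + (1 - z) / (1 - theta) ** 2

def fisher_outer_product(theta):
    # E_{z ~ p_theta}[score^2], the scalar case of the outer-product form.
    return sum(pmf(theta, z) * score(theta, z) ** 2 for z in (0, 1))

def fisher_neg_hessian(theta):
    # E_{z ~ p_theta}[-Hessian of log p_theta(z)].
    return sum(pmf(theta, z) * neg_hessian(theta, z) for z in (0, 1))

theta = 0.3
f_score = fisher_outer_product(theta)
f_hess = fisher_neg_hessian(theta)
f_closed = 1.0 / (theta * (1.0 - theta))  # known closed form for the Bernoulli
print(f_score, f_hess, f_closed)
```

All three numbers should agree up to floating-point error. Note that both expectations are taken under the model distribution $p_\theta(z)$, not under observed data.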

It gets interesting when we only have a discriminative model $p_\theta(y \mid x)$, so that the joint distribution is $p_\theta(x,y) = p(x) \, p_\theta(y \mid x)$ and only the conditional depends on $\theta$. For $N$ i.i.d. samples, the Fisher information of the joint is:

$$ \mathrm{F}_{\prod_{n} p_{\theta}(x, y)}(\theta) = N \, \mathbb{E}_{x, y \sim p(x) p_{\theta}(y \mid x)}\left[\nabla_{\theta} \log p_{\theta}(y \mid x) \, \nabla_{\theta} \log p_{\theta}(y \mid x)^{T}\right]. $$

Now, ambiguities arise because we usually do not have access to $p(x)$ and hence replace the expectation over $x$ with the empirical samples $x_1, \ldots, x_N$. The expectation over $y$ is still taken under the model. We shall also call this the Fisher information (not an empirical Fisher!):

$$ \mathrm{F}_{\prod_{n} p_{\theta}\left(y \mid x_{n}\right)}(\theta) = \sum_{n} \mathbb{E}_{y \sim p_{\theta}\left(y \mid x_{n}\right)}\left[\nabla_{\theta} \log p_{\theta}\left(y \mid x_{n}\right) \, \nabla_{\theta} \log p_{\theta}\left(y \mid x_{n}\right)^{T}\right]. $$
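
To make the last definition concrete, the following sketch evaluates it for binary logistic regression, where $p_\theta(y = 1 \mid x) = \sigma(x^T \theta)$. The choice of model, the random inputs, and the function name `fisher_logistic` are assumptions of mine for illustration and do not come from the paper. Since $y \in \{0, 1\}$, the inner expectation is an exact two-term sum.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def fisher_logistic(theta, X):
    """Sum over inputs x_n of E_{y ~ p_theta(y | x_n)}[g g^T],
    where g = grad_theta log p_theta(y | x_n)."""
    d = X.shape[1]
    F = np.zeros((d, d))
    for x in X:
        p = sigmoid(x @ theta)  # model probability of y = 1 given x
        # Exact expectation over the model's own label distribution.
        for y, prob in ((0.0, 1.0 - p), (1.0, p)):
            g = (y - p) * x  # gradient of log p_theta(y | x) for logistic regression
            F += prob * np.outer(g, g)
    return F

rng = np.random.default_rng(0)
X = rng.normal(size=(5, 3))   # hypothetical inputs x_1, ..., x_N
theta = rng.normal(size=3)    # hypothetical parameter vector

F = fisher_logistic(theta, X)

# Closed form for this model: sum_n p_n (1 - p_n) x_n x_n^T.
p = sigmoid(X @ theta)
F_closed = (X * (p * (1.0 - p))[:, None]).T @ X
print(np.allclose(F, F_closed))
```

The point to notice is that no observed labels appear anywhere: only the inputs $x_n$ are empirical, while $y$ is drawn from (here, summed over) the model's predictive distribution.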