Maximum a Posteriori (MAP) estimation

Introduction


When conditions are dire (computing the full distribution is out of reach), we settle for finding just one result!

In MAP estimation, we find a single x* that maximizes the probability:

$$ x^* = \arg\max_{x} P[X = x] $$
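As a tiny illustration (the states and numbers below are made up, not from these notes), for a discrete distribution the arg max is just the single most probable state:

```python
import numpy as np

# Hypothetical discrete distribution P[X = x] over four states (made-up numbers).
states = np.array(["a", "b", "c", "d"])
probs = np.array([0.1, 0.5, 0.3, 0.1])

# The MAP estimate x* is the single state with the highest probability.
x_star = states[np.argmax(probs)]
print(x_star)  # -> b
```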

This works well because we can sometimes solve for x* analytically. When exact inference is intractable, we can instead optimize the evidence lower bound (ELBO), L.
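For reference, the ELBO itself (a standard identity, written here in the latent/visible notation h, v used in the rest of these notes) is

$$ L(v, \theta, q) = \mathbb{E}_{h \sim q}\big[\log p(h, v; \theta)\big] + H(q) = \log p(v; \theta) - D_{\mathrm{KL}}\big(q(h \mid v)\,\|\,p(h \mid v; \theta)\big) $$

Since the KL term is nonnegative, L is a lower bound on log p(v; theta), and the bound is tight exactly when q matches the true posterior.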

Q: Why is this part of approximate inference then?

A: Well, in practice, even when we should not really be settling for a single x*, we still sometimes use MAP to aid the computation, such as in feature extraction. In the example below, it is approximate because it does not give us the optimal q, only the best q within a restricted family.

Assumptions


Specifically, we restrict the family of q to the Dirac (pronounced dai-RAK) distribution, i.e. q(h|v) puts all of its probability mass on a single point mu. For those of you coming from physics, this is the delta function used as a distribution. Since q is now completely determined by mu, we can drop every term that does not depend on mu:

$$ \mu^*= \arg\max_\mu\log p(h=\mu,v) $$
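To see where this comes from (a quick restatement of the standard argument, not spelled out in these notes): plugging a Dirac q, which concentrates all of its mass on mu, into the ELBO collapses the expectation onto that single point,

$$ L(v, \theta, q) = \mathbb{E}_{h \sim q}\big[\log p(h, v; \theta)\big] + H(q) = \log p(h = \mu, v; \theta) + H(q) $$

The entropy term does not depend on mu (for continuous h it diverges to minus infinity, which is exactly the loose-bound issue discussed next), so maximizing over q reduces to maximizing log p(h = mu, v) over mu.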

The learning procedure is thus similar to that of EM: we alternate between maximizing L with respect to mu (inferring h) and with respect to the model parameters (learning). Here, however, we made a strong restriction, forcing q to be a Dirac distribution, whose differential entropy is minus infinity (recall that the ELBO is L = H(q) + log p(h = mu, v) for a Dirac q), so the bound on log p(v) is infinitely loose.

Solution: Add noise to mu to make the bound meaningful again!!
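Here is a minimal code sketch of the EM-like alternation described above, assuming a toy linear-Gaussian model of my own choosing (h ~ N(0, I), v | h ~ N(Wh, sigma^2 I)), in which the MAP step happens to have a closed form; none of the names or numbers below come from these notes:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy linear-Gaussian model (illustrative choice, not from the notes):
#   h ~ N(0, I_k),   v | h ~ N(W h, sigma^2 I_d)
d, k, n, sigma2 = 10, 3, 500, 0.1

# Synthetic data from a "true" dictionary, just so the loop has something to fit.
W_true = rng.normal(size=(d, k))
V = W_true @ rng.normal(size=(k, n)) + np.sqrt(sigma2) * rng.normal(size=(d, n))

W = rng.normal(size=(d, k))  # initial parameters
for step in range(50):
    # Inference ("E-like") step: MAP estimate of the latent code, one mu per data point.
    # For this model, argmax_mu log p(h=mu, v) = (W^T W + sigma^2 I)^{-1} W^T v.
    M = np.linalg.solve(W.T @ W + sigma2 * np.eye(k), W.T @ V)  # shape (k, n)

    # Learning ("M-like") step: maximize the same objective over W,
    # which is a least-squares fit of V onto the inferred codes M.
    W = V @ M.T @ np.linalg.inv(M @ M.T + 1e-6 * np.eye(k))

print("mean squared reconstruction error:", np.mean((V - W @ M) ** 2))
```

In the sparse coding example below, the same loop appears, but the inference step no longer has a closed form and instead becomes an L1-regularized least-squares problem.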

Example: Sparse learning

MAP inference applied to sparse coding, viewed through the ELBO


Feature extraction is very useful for unlabeled data. Instead of having a human domain expert decide which features matter, we let the data itself decide what is important and representative: that is, we assume there are some independent latent features that jointly caused the data we see. Sparse learning is a perfect example of this use case.
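As a concrete, hedged illustration: under the usual sparse coding assumptions (a Laplace prior on the features h and a Gaussian likelihood for v given h), the MAP problem for h becomes an L1-regularized least-squares problem. Here is a minimal ISTA-style sketch with a made-up dictionary and observation; the function names, lambda value, and sizes are my own choices:

```python
import numpy as np

def soft_threshold(x, t):
    """Elementwise proximal operator of t * ||x||_1 (soft-thresholding)."""
    return np.sign(x) * np.maximum(np.abs(x) - t, 0.0)

def sparse_code_map(v, W, lam=0.1, n_iter=200):
    """MAP estimate of h under a Laplace prior and Gaussian likelihood:
    argmin_h 0.5 * ||v - W h||^2 + lam * ||h||_1, solved with ISTA.
    (Illustrative sketch; lam, n_iter, and the step size are arbitrary choices.)"""
    step = 1.0 / np.linalg.norm(W, ord=2) ** 2  # 1 / Lipschitz constant of the smooth part
    h = np.zeros(W.shape[1])
    for _ in range(n_iter):
        grad = W.T @ (W @ h - v)                # gradient of the Gaussian (reconstruction) term
        h = soft_threshold(h - step * grad, step * lam)
    return h

# Made-up dictionary and observation, just to show the call.
rng = np.random.default_rng(0)
W = rng.normal(size=(20, 50))                   # 50 candidate features for a 20-dim v
h_true = np.zeros(50)
h_true[[3, 17, 41]] = [1.5, -2.0, 0.8]
v = W @ h_true + 0.01 * rng.normal(size=20)

h_map = sparse_code_map(v, W, lam=0.05)
print("nonzero features:", np.flatnonzero(np.abs(h_map) > 1e-3))
```

The soft-thresholding step is what produces exact zeros in the MAP code, which is where the sparsity comes from.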

For the math in this example, you might prefer to read the book chapter or the 2007 paper (see resources) for a closer look. In summary (or to refresh your memory), we employ