Summaries, Math, and Definitions
Variational inference can be seen as an extension of E-M and MAP estimation. Recall that the ELBO can be written as
$$ L(\theta, q, \vec{v}) = \log p(\vec{v}) - D_{KL}(q(\theta) \,\|\, p(\theta|\vec{v})) $$
where $D_{KL}$ refers to the Kullback-Leibler divergence, which variational inference methods aim to minimize, so that the ELBO is as close to $\log p(\vec{v})$ as possible.
*Note that this KL-divergence direction is the opposite of the one used in maximum likelihood, but it has good computational properties. As a result, our approximation encourages q to have low probability wherever p has low probability.
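To make the decomposition concrete, here is a tiny discrete example (the joint table and q below are made-up numbers) verifying that the two standard forms of the ELBO agree, and that the ELBO lower-bounds $\log p(\vec{v})$:

```python
import math

# Toy discrete model: one binary latent theta, observed v held fixed.
# The joint table p(v, theta) below is a made-up illustration.
p_joint = {0: 0.3, 1: 0.1}                       # p(v, theta)
p_v = sum(p_joint.values())                      # evidence p(v)
p_post = {t: p_joint[t] / p_v for t in p_joint}  # posterior p(theta | v)

q = {0: 0.6, 1: 0.4}  # an arbitrary variational distribution

# Form 1: ELBO = E_q[log p(v, theta)] - E_q[log q(theta)]
elbo = sum(q[t] * (math.log(p_joint[t]) - math.log(q[t])) for t in q)

# Form 2: ELBO = log p(v) - KL(q || p(theta | v))
kl = sum(q[t] * math.log(q[t] / p_post[t]) for t in q)

assert abs(elbo - (math.log(p_v) - kl)) < 1e-12  # the two forms agree
assert elbo <= math.log(p_v)                     # ELBO is a lower bound
```

Making q equal to the true posterior would drive the KL term to zero and the ELBO up to $\log p(\vec{v})$ exactly.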
Use the Mean Field Approximation, meaning that q is assumed to be fully factorized. The individual factors need not be given an explicit parametric form, and because
$$ q(\theta) = \prod^D_{i=1} q_i(\theta_i) $$
we can also go further to structured variational inference, which retains some dependencies between the factors (see paper in resources).
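As a minimal illustration of the factorization (the factor tables are made-up numbers), a fully factorized q over D binary variables is described by O(D) numbers, yet still normalizes correctly over all $2^D$ joint states:

```python
# Mean-field q over D = 3 binary latents: each factor q_i(theta_i) is a
# tiny lookup table (the numbers are illustrative).
factors = [[0.7, 0.3], [0.5, 0.5], [0.9, 0.1]]

def q(theta):
    """q(theta) = prod_i q_i(theta_i) under the mean-field assumption."""
    p = 1.0
    for q_i, t_i in zip(factors, theta):
        p *= q_i[t_i]
    return p

# Each factor normalizes on its own, so the product automatically
# normalizes over all 2^D joint configurations:
total = sum(q((a, b, c)) for a in (0, 1) for b in (0, 1) for c in (0, 1))
assert abs(total - 1.0) < 1e-12
```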
*I strongly recommend reading from slide 21 onwards, which goes up to sampling methods:
The beauty of the variational approach is that we do not need to specify a specific parametric form for q. We specify how it should factorize, but then the optimization problem determines the optimal probability distribution within those factorization constraints. For discrete latent variables, this just means that we use traditional optimization techniques to optimize a finite number of variables describing the q distribution. For continuous latent variables, this means that we use a branch of mathematics called calculus of variations to perform optimization over a space of functions and actually determine which function should be used to represent q.
Define q as a lookup table over discrete states, and then optimize for q's parameters.
Use a fixed-point equation to keep the optimizer fast, effectively solving for $\hat{h}$, the lookup-table values, in
$$ {\partial \over \partial \hat{h}_i} \mathbb{L} = 0 $$
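For a discrete mean-field q, setting this derivative to zero gives the standard coordinate-ascent fixed-point update $q_i(\theta_i) \propto \exp(\mathbb{E}_{q_{-i}}[\log p(\theta, \vec{v})])$. A minimal sketch with two binary latents (the joint table is a made-up illustration):

```python
import math

# Joint log p(theta1, theta2, v) for fixed v, as a made-up 2x2 table.
log_p = [[math.log(0.40), math.log(0.10)],
         [math.log(0.10), math.log(0.40)]]

# Lookup-table values (the h-hats) for each factor of q.
q1, q2 = [0.5, 0.5], [0.9, 0.1]

def normalize(u):
    z = sum(u)
    return [x / z for x in u]

# Iterate the fixed-point equations q_i(t) ∝ exp(E_{q_-i}[log p])
# until the tables stop changing.
for _ in range(100):
    q1 = normalize([math.exp(sum(q2[b] * log_p[a][b] for b in (0, 1)))
                    for a in (0, 1)])
    q2 = normalize([math.exp(sum(q1[a] * log_p[a][b] for a in (0, 1)))
                    for b in (0, 1)])
```

Each such update does not decrease the ELBO, so iterating to a fixed point is a valid optimizer over the lookup-table values.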
Binary Sparse Coding Model walkthrough; see original paper here:
Apologies if this part wasn't very clear in the talk: the math eventually leads to $\mathbb{L}$ being arithmetically computable. In the end, we can see sparse coding as an iterative autoencoder.
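To see the "iterative autoencoder" view concretely, here is a minimal ISTA-style sketch (the dictionary W, step size, and sparsity weight are all made-up numbers): each iteration "decodes" the current code h into a reconstruction W h, then "encodes" the residual back into an updated sparse code.

```python
# Hypothetical dictionary W (3 observed dims, 2 code dims), input v,
# sparsity weight lam, and gradient step size -- all made-up numbers.
W = [[1.0, 0.0],
     [0.0, 1.0],
     [1.0, 1.0]]
v = [1.0, 0.5, 1.5]
lam, step = 0.1, 0.2
h = [0.0, 0.0]  # sparse code, refined iteratively

def soft(x, t):
    """Soft-thresholding (shrinkage) operator."""
    return (abs(x) - t) * (1.0 if x > 0 else -1.0) if abs(x) > t else 0.0

for _ in range(200):
    # "Decode": reconstruct v from the current code.
    recon = [sum(W[i][j] * h[j] for j in range(2)) for i in range(3)]
    resid = [recon[i] - v[i] for i in range(3)]
    # "Encode": gradient step on the reconstruction error, then shrink
    # toward zero to enforce sparsity.
    grad = [sum(W[i][j] * resid[i] for i in range(3)) for j in range(2)]
    h = [soft(h[j] - step * grad[j], step * lam) for j in range(2)]
```

Each pass is one encoder refinement; unrolling a fixed number of passes gives the feed-forward "iterative autoencoder" reading of sparse coding.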