Exam Pattern: VI and ELBO Derivation
This note consolidates the standard exam workflow for Variational Inference (VI): given a model and variational family, derive the ELBO and optimize it.
Setup
Given:
- Model: joint $p(x, z \mid \theta)$ with observed $x$ and latent $z$
- Variational family: $q(z \mid \phi)$ (e.g., a mean-field family)
- Goal: approximate the intractable posterior $p(z \mid x)$
Step 1: ELBO Derivation
Starting from the log marginal likelihood:
$$\log p(x) = \log \int p(x, z)\,dz$$

Introduce $q(z)$ via importance weighting:
$$\log p(x) = \log \int \frac{p(x, z)}{q(z)}\,q(z)\,dz \geq \int q(z)\log\frac{p(x,z)}{q(z)}\,dz$$

The inequality is Jensen’s inequality (since $\log$ is concave). The right-hand side is the ELBO:
$$\mathcal{L}(q) = \mathbb{E}_{q(z)}\left[\log p(x, z) - \log q(z)\right]$$

Step 2: ELBO = log-evidence minus KL
$$\log p(x) = \mathcal{L}(q) + D_{\text{KL}}(q(z) \| p(z \mid x))$$

Since $D_{\text{KL}} \geq 0$:
- $\mathcal{L}(q) \leq \log p(x)$ (ELBO is a lower bound)
- Maximizing $\mathcal{L}$ is equivalent to minimizing $D_{\text{KL}}(q \| p(z \mid x))$
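The lower-bound property can be checked numerically. The sketch below uses an assumed toy conjugate model (not from the note) where the evidence is available in closed form: $p(z) = \mathcal{N}(0,1)$, $p(x \mid z) = \mathcal{N}(z, 1)$, so $p(x) = \mathcal{N}(0, 2)$. A deliberately mismatched $q$ leaves a visible KL gap:

```python
import numpy as np

# Toy conjugate model (assumed for illustration):
#   p(z) = N(0, 1),  p(x|z) = N(z, 1)  =>  p(x) = N(0, 2).
# With a mismatched q(z) = N(0, 1), the Monte Carlo ELBO estimate
# should sit below log p(x) by roughly KL(q || p(z|x)).
rng = np.random.default_rng(0)
x = 1.0

def log_normal(v, mean, var):
    return -0.5 * np.log(2 * np.pi * var) - 0.5 * (v - mean) ** 2 / var

log_px = log_normal(x, 0.0, 2.0)           # exact log evidence

z = rng.normal(0.0, 1.0, size=100_000)     # samples z ~ q
log_joint = log_normal(x, z, 1.0) + log_normal(z, 0.0, 1.0)
log_q = log_normal(z, 0.0, 1.0)
elbo = np.mean(log_joint - log_q)          # E_q[log p(x,z) - log q(z)]

gap = log_px - elbo                        # estimates KL(q || p(z|x)) >= 0
print(f"log p(x) = {log_px:.4f}, ELBO = {elbo:.4f}, gap = {gap:.4f}")
```

Here the true posterior is $\mathcal{N}(x/2, 1/2)$, and the printed gap matches the closed-form $D_{\text{KL}}(\mathcal{N}(0,1) \| \mathcal{N}(0.5, 0.5)) \approx 0.40$.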
Step 3: Decompose ELBO
$$\mathcal{L}(q) = \mathbb{E}_{q(z)}[\log p(x \mid z)] - D_{\text{KL}}(q(z) \| p(z))$$

- First term: expected log-likelihood (reconstruction / fit to data)
- Second term: KL from variational posterior to prior (regularization / complexity penalty)
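The two ELBO forms (joint-minus-$q$ and reconstruction-minus-KL) are algebraically identical, which a quick Monte Carlo check makes concrete. The model below is an assumed toy example: unit Gaussian prior, Gaussian likelihood, and $q(z) = \mathcal{N}(\mu, \sigma^2)$ with the closed-form Gaussian KL:

```python
import numpy as np

# Verify numerically that
#   E_q[log p(x,z) - log q(z)]  ==  E_q[log p(x|z)] - KL(q || p).
# Assumed toy model: p(z) = N(0,1), p(x|z) = N(z,1), q(z) = N(mu, sigma^2).
rng = np.random.default_rng(1)
x, mu, sigma = 1.0, 0.4, 0.8

def log_normal(v, mean, var):
    return -0.5 * np.log(2 * np.pi * var) - 0.5 * (v - mean) ** 2 / var

z = mu + sigma * rng.normal(size=200_000)            # z ~ q

# Form 1: expected log-joint minus expected log q.
elbo_joint = np.mean(log_normal(x, z, 1.0) + log_normal(z, 0.0, 1.0)
                     - log_normal(z, mu, sigma**2))

# Form 2: reconstruction term minus closed-form KL(q || N(0,1)).
kl = 0.5 * (sigma**2 + mu**2 - 1.0 - np.log(sigma**2))
elbo_decomposed = np.mean(log_normal(x, z, 1.0)) - kl

print(elbo_joint, elbo_decomposed)
```

The two estimates differ only by Monte Carlo error (the analytic KL replaces a sampled one).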
Step 4: Optimize
For mean-field $q(z) = \prod_d q_d(z_d)$, the optimal update for each factor is:
$$\log q_d^*(z_d) = \mathbb{E}_{q_{-d}}[\log p(x, z)] + \text{const}$$

For parametric families (e.g., a diagonal Gaussian), optimize $\phi$ by gradient ascent on $\mathcal{L}$, using the reparameterization trick to estimate gradients.
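The reparameterization trick can be sketched end to end on an assumed toy model (not from the note): $p(z) = \mathcal{N}(0,1)$, $p(x \mid z) = \mathcal{N}(z, 1)$, whose exact posterior is $\mathcal{N}(x/2, 1/2)$. Writing $z = \mu + \sigma\epsilon$ with $\epsilon \sim \mathcal{N}(0,1)$ lets gradients flow through the sampling step:

```python
import numpy as np

# Stochastic gradient ascent on the ELBO for q(z) = N(mu, sigma^2),
# assumed toy model p(z) = N(0,1), p(x|z) = N(z,1).
# True posterior: N(x/2, 1/2), so we expect mu -> x/2, sigma -> sqrt(1/2).
rng = np.random.default_rng(2)
x = 2.0
mu, log_sigma = 0.0, 0.0                     # variational parameters
lr, n_samples = 0.05, 256

for step in range(2000):
    sigma = np.exp(log_sigma)
    eps = rng.normal(size=n_samples)
    z = mu + sigma * eps                     # reparameterized sample

    # Pathwise gradients of the reconstruction term:
    # d/dz log N(x; z, 1) = (x - z), then chain rule through z = mu + sigma*eps.
    dmu_recon = np.mean(x - z)
    dsig_recon = np.mean((x - z) * eps)

    # Gradients of the closed-form KL(q || N(0,1)) term.
    dmu_kl = mu
    dsig_kl = sigma - 1.0 / sigma

    mu += lr * (dmu_recon - dmu_kl)
    log_sigma += lr * (dsig_recon - dsig_kl) * sigma   # chain rule to log_sigma

print(mu, np.exp(log_sigma))   # should approach x/2 = 1.0 and sqrt(1/2) ~ 0.707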
Common Exam Variations
Variation 1: Derive ELBO for a specific model
Given, e.g., a Gaussian prior $p(z) = \mathcal{N}(0, I)$ and Gaussian variational family $q(z) = \mathcal{N}(\mu, \text{diag}(\sigma^2))$:
$$D_{\text{KL}}(q \| p) = \frac{1}{2}\sum_{d=1}^D \left(\sigma_d^2 + \mu_d^2 - 1 - \log \sigma_d^2\right)$$Variation 2: Why not minimize $D_{\text{KL}}(p \| q)$ instead?
$D_{\text{KL}}(p(z|x) \| q(z))$ requires computing $p(z|x)$ which is intractable. $D_{\text{KL}}(q \| p)$ can be optimized via the ELBO which only requires sampling from $q$. See I-projection vs M-projection.
Variation 3: EM as special-case VI
When $q(z) = p(z \mid x, \theta^{(t)})$ (exact posterior), the ELBO equals the expected complete-data log-likelihood $Q(\theta \mid \theta^{(t)})$ up to a constant. Maximizing over $\theta$ gives the M-step.
Checklist for Exam
- Write $\log p(x) = \mathcal{L}(q) + D_{\text{KL}}(q \| p(z|x))$
- Derive ELBO via Jensen or direct manipulation
- Decompose into reconstruction + KL terms
- If mean-field: write the coordinate ascent update
- If parametric: mention reparameterization trick for gradients
- State that VI finds a local optimum of $\mathcal{L}$, not global