Exam Pattern: VI and ELBO Derivation
This note consolidates the standard exam workflow for Variational Inference (VI): given a model and variational family, derive the ELBO and optimize it.
Setup
Given:
- Model: joint $p(x, z \mid \theta)$ with observed $x$ and latent $z$
- Variational family: $q(z \mid \phi)$ (e.g., a mean-field family)
- Goal: approximate the intractable posterior $p(z \mid x)$
Step 1: ELBO Derivation
Starting from the log marginal likelihood:
$$\log p(x) = \log \int p(x, z)\,dz$$

Introduce $q(z)$ via importance weighting:
$$\log p(x) = \log \int \frac{p(x, z)}{q(z)}\,q(z)\,dz \geq \int q(z)\log\frac{p(x,z)}{q(z)}\,dz$$

The inequality is Jensen’s inequality (since $\log$ is concave). The right-hand side is the ELBO:
$$\mathcal{L}(q) = \mathbb{E}_{q(z)}\left[\log p(x, z) - \log q(z)\right]$$

Step 2: ELBO = log-evidence minus KL
$$\log p(x) = \mathcal{L}(q) + D_{\text{KL}}(q(z) \| p(z \mid x))$$

Since $D_{\text{KL}} \geq 0$:
- $\mathcal{L}(q) \leq \log p(x)$ (ELBO is a lower bound)
- Maximizing $\mathcal{L}$ is equivalent to minimizing $D_{\text{KL}}(q \| p(z \mid x))$
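The lower-bound property can be checked numerically. The sketch below uses an assumed toy conjugate model (not from the note) where the evidence is available in closed form: $p(z) = \mathcal{N}(0,1)$, $p(x \mid z) = \mathcal{N}(z, 1)$, so $p(x) = \mathcal{N}(0, 2)$. A deliberately mismatched $q$ leaves a visible KL gap:

```python
import numpy as np

# Toy conjugate model (assumed for illustration):
#   p(z) = N(0, 1),  p(x|z) = N(z, 1)  =>  p(x) = N(0, 2).
# With a mismatched q(z) = N(0, 1), the Monte Carlo ELBO estimate
# should sit below log p(x) by roughly KL(q || p(z|x)).
rng = np.random.default_rng(0)
x = 1.0

def log_normal(v, mean, var):
    return -0.5 * np.log(2 * np.pi * var) - 0.5 * (v - mean) ** 2 / var

log_px = log_normal(x, 0.0, 2.0)           # exact log evidence

z = rng.normal(0.0, 1.0, size=100_000)     # samples z ~ q
log_joint = log_normal(x, z, 1.0) + log_normal(z, 0.0, 1.0)
log_q = log_normal(z, 0.0, 1.0)
elbo = np.mean(log_joint - log_q)          # E_q[log p(x,z) - log q(z)]

gap = log_px - elbo                        # estimates KL(q || p(z|x)) >= 0
print(f"log p(x) = {log_px:.4f}, ELBO = {elbo:.4f}, gap = {gap:.4f}")
```

Here the true posterior is $\mathcal{N}(x/2, 1/2)$, and the printed gap matches the closed-form $D_{\text{KL}}(\mathcal{N}(0,1) \| \mathcal{N}(0.5, 0.5)) \approx 0.40$.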
Step 3: Decompose ELBO
$$\mathcal{L}(q) = \mathbb{E}_{q(z)}[\log p(x \mid z)] - D_{\text{KL}}(q(z) \| p(z))$$

- First term: expected log-likelihood (reconstruction / fit to data)
- Second term: KL from variational posterior to prior (regularization / complexity penalty)
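The two ELBO forms (joint-minus-$q$ and reconstruction-minus-KL) are algebraically identical, which a quick Monte Carlo check makes concrete. The model below is an assumed toy example: unit Gaussian prior, Gaussian likelihood, and $q(z) = \mathcal{N}(\mu, \sigma^2)$ with the closed-form Gaussian KL:

```python
import numpy as np

# Verify numerically that
#   E_q[log p(x,z) - log q(z)]  ==  E_q[log p(x|z)] - KL(q || p).
# Assumed toy model: p(z) = N(0,1), p(x|z) = N(z,1), q(z) = N(mu, sigma^2).
rng = np.random.default_rng(1)
x, mu, sigma = 1.0, 0.4, 0.8

def log_normal(v, mean, var):
    return -0.5 * np.log(2 * np.pi * var) - 0.5 * (v - mean) ** 2 / var

z = mu + sigma * rng.normal(size=200_000)            # z ~ q

# Form 1: expected log-joint minus expected log q.
elbo_joint = np.mean(log_normal(x, z, 1.0) + log_normal(z, 0.0, 1.0)
                     - log_normal(z, mu, sigma**2))

# Form 2: reconstruction term minus closed-form KL(q || N(0,1)).
kl = 0.5 * (sigma**2 + mu**2 - 1.0 - np.log(sigma**2))
elbo_decomposed = np.mean(log_normal(x, z, 1.0)) - kl

print(elbo_joint, elbo_decomposed)
```

The two estimates differ only by Monte Carlo error (the analytic KL replaces a sampled one).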
Step 4: Optimize
For mean-field $q(z) = \prod_d q_d(z_d)$, the optimal update for each factor is:
$$\log q_d^*(z_d) = \mathbb{E}_{q_{-d}}[\log p(x, z)] + \text{const}$$

For parametric families (e.g., a diagonal Gaussian), optimize $\phi$ by gradient ascent on $\mathcal{L}$, using the reparameterization trick to estimate gradients.
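The reparameterization trick can be sketched end to end on an assumed toy model (not from the note): $p(z) = \mathcal{N}(0,1)$, $p(x \mid z) = \mathcal{N}(z, 1)$, whose exact posterior is $\mathcal{N}(x/2, 1/2)$. Writing $z = \mu + \sigma\epsilon$ with $\epsilon \sim \mathcal{N}(0,1)$ lets gradients flow through the sampling step:

```python
import numpy as np

# Stochastic gradient ascent on the ELBO for q(z) = N(mu, sigma^2),
# assumed toy model p(z) = N(0,1), p(x|z) = N(z,1).
# True posterior: N(x/2, 1/2), so we expect mu -> x/2, sigma -> sqrt(1/2).
rng = np.random.default_rng(2)
x = 2.0
mu, log_sigma = 0.0, 0.0                     # variational parameters
lr, n_samples = 0.05, 256

for step in range(2000):
    sigma = np.exp(log_sigma)
    eps = rng.normal(size=n_samples)
    z = mu + sigma * eps                     # reparameterized sample

    # Pathwise gradients of the reconstruction term:
    # d/dz log N(x; z, 1) = (x - z), then chain rule through z = mu + sigma*eps.
    dmu_recon = np.mean(x - z)
    dsig_recon = np.mean((x - z) * eps)

    # Gradients of the closed-form KL(q || N(0,1)) term.
    dmu_kl = mu
    dsig_kl = sigma - 1.0 / sigma

    mu += lr * (dmu_recon - dmu_kl)
    log_sigma += lr * (dsig_recon - dsig_kl) * sigma   # chain rule to log_sigma

print(mu, np.exp(log_sigma))   # should approach x/2 = 1.0 and sqrt(1/2) ~ 0.707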
Common Exam Variations
Variation 1: Derive ELBO for a specific model
Given, e.g., a Gaussian prior $p(z) = \mathcal{N}(0, I)$ and Gaussian variational family $q(z) = \mathcal{N}(\mu, \text{diag}(\sigma^2))$:
$$D_{\text{KL}}(q \| p) = \frac{1}{2}\sum_{d=1}^D \left(\sigma_d^2 + \mu_d^2 - 1 - \log \sigma_d^2\right)$$Variation 2: Why not minimize $D_{\text{KL}}(p \| q)$ instead?
$D_{\text{KL}}(p(z|x) \| q(z))$ requires computing $p(z|x)$ which is intractable. $D_{\text{KL}}(q \| p)$ can be optimized via the ELBO which only requires sampling from $q$. See I-projection vs M-projection.
Variation 3: EM as special-case VI
When $q(z) = p(z \mid x, \theta^{(t)})$ (exact posterior), the ELBO equals the expected complete-data log-likelihood $Q(\theta \mid \theta^{(t)})$ up to a constant. Maximizing over $\theta$ gives the M-step.
Checklist for Exam
- Write $\log p(x) = \mathcal{L}(q) + D_{\text{KL}}(q \| p(z|x))$
- Derive ELBO via Jensen or direct manipulation
- Decompose into reconstruction + KL terms
- If mean-field: write the coordinate ascent update
- If parametric: mention reparameterization trick for gradients
- State that VI finds a local optimum of $\mathcal{L}$, not global