Problem A | Concept and proof: tractability vs. accuracy

Let $p(X, Z \mid \theta)$ be the joint distribution of observed data $X$ and latent variables $Z$, and let $p(Z \mid X, \theta)$ be the true posterior. In Variational Inference (VI), we seek an approximate distribution $q(Z)$ by maximizing the Evidence Lower Bound (ELBO):

$$\mathcal{L}(q, \theta) = \mathbb{E}_{q}[\log p(X, Z \mid \theta)] - \mathbb{E}_{q}[\log q(Z)]$$

(a) Show that maximizing the ELBO with respect to $q(Z)$ is mathematically equivalent to minimizing the Kullback–Leibler (KL) divergence between $q(Z)$ and the true posterior $p(Z \mid X, \theta)$. You must explicitly write out the relationship equation.

(b) Now, suppose we restrict our search space to the mean-field family, defined as $\mathcal{Q}_{\text{MF}} = \{q(Z) : q(Z) = \prod_{j=1}^M q_j(z_j)\}$. Let $q^* = \arg\max_{q \in \mathcal{Q}_{\text{MF}}} \mathcal{L}(q, \theta)$. Using the result from (a), mathematically explain why $q^*$ may fail to perfectly recover the true posterior $p(Z \mid X, \theta)$. Under what specific condition would $q^*(Z) = p(Z \mid X, \theta)$ hold true?
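The decomposition requested in (a) can be sanity-checked numerically on a toy discrete model. All numbers below (the joint table and the choice of $q$) are made up purely for illustration:

```python
import numpy as np

# Toy model: one fixed observation x and one binary latent z.
# joint[k] = p(x, z=k); the entries are arbitrary illustrative values.
joint = np.array([0.3, 0.1])
evidence = joint.sum()                 # p(x)
posterior = joint / evidence           # p(z | x)

q = np.array([0.6, 0.4])               # an arbitrary variational distribution

elbo = np.sum(q * np.log(joint)) - np.sum(q * np.log(q))
kl = np.sum(q * np.log(q / posterior))

# The decomposition from (a): log p(x) = ELBO + KL(q || p(z|x)).
assert np.isclose(np.log(evidence), elbo + kl)
assert kl > 0                          # q != posterior, so the bound is not tight
```

Since $\log p(x)$ does not depend on $q$, increasing the ELBO must decrease the KL term by exactly the same amount, which is the equivalence part (a) asks for.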


Problem B | Core derivation: single-factor coordinate update

Assume we are applying Mean-Field Variational Inference to a latent variable model with $M$ latent variables $Z = \{z_1, z_2, \dots, z_M\}$. We choose the fully factorized family:

$$q(Z) = \prod_{i=1}^M q_i(z_i)$$

(a) By substituting this factorized $q(Z)$ into the general ELBO definition $\mathcal{L}(q) = \mathbb{E}_{q}[\log p(X, Z)] - \mathbb{E}_{q}[\log q(Z)]$, isolate the terms that depend only on a specific factor $q_j(z_j)$.

(b) Derive the optimal update equation for $q_j(z_j)$ while keeping all other factors $q_{-j} = \prod_{i \neq j} q_i(z_i)$ fixed. Show that the optimal $q_j^*(z_j)$ satisfies:

$$\log q_j^*(z_j) = \mathbb{E}_{q_{-j}}[\log p(X, Z)] + \text{constant}$$

(Hint: You may use the fact that maximizing a functional of the form $\int q(x)f(x)\,dx - \int q(x)\log q(x)\,dx$ subject to $\int q(x)\,dx = 1$ yields $q^*(x) \propto \exp(f(x))$.)
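The update in (b) can be verified numerically on a toy joint over two binary latents; the probability table and the initial factors below are invented for illustration:

```python
import numpy as np

# Toy joint p(x, z1, z2) for a fixed x; rows index z1, columns index z2.
# The table entries are arbitrary illustrative values.
logp = np.log(np.array([[0.20, 0.05],
                        [0.10, 0.15]]))

def elbo(q1, q2):
    q = np.outer(q1, q2)               # mean-field: q(z1, z2) = q1(z1) q2(z2)
    return np.sum(q * logp) - np.sum(q * np.log(q))

q1 = np.array([0.5, 0.5])
q2 = np.array([0.7, 0.3])

# Coordinate update: log q1*(z1) = E_{q2}[log p(x, z1, z2)] + const
log_q1 = logp @ q2
q1_star = np.exp(log_q1 - log_q1.max())
q1_star /= q1_star.sum()

# With q2 held fixed, q1* should beat every other distribution on z1.
for t in np.linspace(0.01, 0.99, 99):
    assert elbo(q1_star, q2) >= elbo(np.array([t, 1.0 - t]), q2) - 1e-9
```

Subtracting the maximum before exponentiating is only for numerical stability; the normalization absorbs the additive constant in the update equation.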


Problem C | Structure and limitations: applying MF to a simple model

Consider a probabilistic model with a single observation $x$ and three latent variables $z_1, z_2, z_3$. The true generative process is defined by the following joint distribution:

$$p(x, z_1, z_2, z_3) = p(z_1) p(z_2) p(z_3 \mid z_1, z_2) p(x \mid z_3)$$

(a) Write down the corresponding Mean-Field variational family $q(Z)$ for this specific set of latent variables.

(b) Explain one major limitation of using this specific Mean-Field family to approximate the true posterior $p(z_1, z_2, z_3 \mid x)$. What specific dependency structure present in the true posterior is forcibly broken by your $q(Z)$ in (a)? Give a brief conceptual explanation.
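The broken dependency in (b) can be made concrete with an all-binary instance of the model; the conditional probability tables below are invented for illustration:

```python
import numpy as np

# All-binary instance of p(z1) p(z2) p(z3 | z1, z2) p(x | z3); made-up CPTs.
p_z1 = np.array([0.5, 0.5])
p_z2 = np.array([0.5, 0.5])
p_z3_is_1 = np.array([[0.10, 0.80],    # p(z3 = 1 | z1, z2), indexed [z1, z2]
                      [0.80, 0.95]])
p_x1_given_z3 = np.array([0.2, 0.9])   # p(x = 1 | z3)

# Posterior p(z1, z2 | x = 1), marginalizing out z3.
post = np.zeros((2, 2))
for a in range(2):
    for b in range(2):
        pz3 = np.array([1 - p_z3_is_1[a, b], p_z3_is_1[a, b]])
        post[a, b] = p_z1[a] * p_z2[b] * pz3 @ p_x1_given_z3
post /= post.sum()

# z1 and z2 are independent a priori, but conditioning on x (a descendant of
# the collider z3) couples them: the posterior is NOT a product of its
# marginals -- exactly the structure a fully factorized q must discard.
product_of_marginals = np.outer(post.sum(axis=1), post.sum(axis=0))
assert not np.allclose(post, product_of_marginals)
```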


Problem D | EM vs. VI: the “special case” connection

The Expectation–Maximization (EM) algorithm and Variational Inference (VI) both utilize the Evidence Lower Bound (ELBO) to handle models with latent variables $Z$.

(a) In the E-step of the standard EM algorithm, what is the exact choice for the variational distribution $q(Z)$ given the current parameters $\theta^{old}$? What is the value of the KL divergence $\mathrm{KL}(q(Z) \,\|\, p(Z \mid X, \theta^{old}))$ immediately after this step?

(b) Suppose the true posterior is computationally intractable, and we instead restrict $q(Z)$ to a Mean-Field family $\mathcal{Q}_{\text{MF}}$. How does this change the E-step (now called the Variational E-step)? Does the KL divergence $\mathrm{KL}(q(Z) \,\|\, p(Z \mid X, \theta^{old}))$ still evaluate to the same value as in (a)? Briefly explain why.
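The claim in (a), that the exact E-step makes the KL vanish and the bound tight, can be checked on a toy discrete model (the joint values are invented):

```python
import numpy as np

# Toy model at theta_old: fixed x, binary z; joint[k] = p(x, z=k | theta_old).
joint = np.array([0.24, 0.06])          # made-up values
posterior = joint / joint.sum()

# Exact E-step: q(z) = p(z | x, theta_old).
q = posterior

kl = np.sum(q * np.log(q / posterior))  # KL(q || p(z | x, theta_old))
elbo = np.sum(q * np.log(joint)) - np.sum(q * np.log(q))

assert np.isclose(kl, 0.0)                     # the KL vanishes...
assert np.isclose(elbo, np.log(joint.sum()))   # ...so the ELBO equals log p(x)
```

In the variational E-step of (b), $q$ would instead be the best member of $\mathcal{Q}_{\text{MF}}$, and the first assertion would generally fail.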


Problem E | Modern connection: Mean-Field in VAEs

In a standard Variational Autoencoder (VAE) with a $D$-dimensional continuous latent space ($z \in \mathbb{R}^D$), the approximate posterior is parameterized by an inference network (the encoder). Typically, this distribution is chosen to be a multivariate Gaussian with a diagonal covariance matrix:

$$q_\phi(z \mid x) = \mathcal{N}(z \,;\, \mu_\phi(x), \mathrm{diag}(\sigma_\phi^2(x)))$$

where $\mu_\phi(x)$ and $\sigma_\phi^2(x)$ are $D$-dimensional vectors output by a neural network.

(a) Prove or explain why this specific parametric form of $q_\phi(z \mid x)$ represents a Mean-Field approximation over the latent dimensions $z_1, \dots, z_D$.

(b) Because of this Mean-Field assumption, what kind of relationship between the latent dimensions (e.g., $z_1$ and $z_2$) can the encoder not capture, even if the true posterior $p_\theta(z \mid x)$ exhibits it?
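The limitation in (b) can be illustrated with a correlated 2-D Gaussian standing in for the true posterior. The covariance below is made up; the closed-form diagonal fit uses the standard mean-field Gaussian result that, for reverse $\mathrm{KL}(q \,\|\, p)$ with a same-mean diagonal $q$, the optimal variances are $1/\Lambda_{ii}$ where $\Lambda$ is the precision matrix:

```python
import numpy as np

# A correlated 2-D "true posterior" covariance (made-up numbers).
Sigma = np.array([[1.0, 0.8],
                  [0.8, 1.0]])
Lambda = np.linalg.inv(Sigma)          # precision matrix

# Optimal diagonal-Gaussian variances under reverse KL: sigma_i^2 = 1 / Lambda_ii
# (obtained by setting the gradient of the KL to zero).
var_q = 1.0 / np.diag(Lambda)

# The fit has zero off-diagonal covariance by construction, and it also
# underestimates each marginal variance whenever the dimensions are correlated.
assert np.all(var_q < np.diag(Sigma))
```

So a diagonal-covariance encoder cannot represent any posterior correlation between $z_1$ and $z_2$, and the reverse-KL objective additionally makes the factorized fit overconfident in each dimension.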