Problem 1 | EM $\times$ ELBO $\times$ Latent Variable Model

Fix the notation first.

  • The dataset is $X = \{x_1, \dots, x_N\}$, where each $x_n \in \mathbb{R}$ is a scalar
  • The latent variables are $Z = \{z_1, \dots, z_N\}$, where each $z_n \in \{0, 1\}$ is a scalar binary indicator, not a one-hot vector
  • The parameter is $\theta = (\rho, \mu_0, \mu_1)$
  • $\sigma^2 > 0$ is a known constant, not a parameter to estimate
  • $\rho = P(z_n = 1)$, so $P(z_n = 0) = 1 - \rho$
  • The conditional distributions are
$$z_n \sim \text{Bernoulli}(\rho), \quad x_n \mid z_n = 0 \sim \mathcal{N}(\mu_0, \sigma^2), \quad x_n \mid z_n = 1 \sim \mathcal{N}(\mu_1, \sigma^2)$$
  • The samples are independent across $n = 1, \dots, N$
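The generative process in the bullets above can be sketched in a few lines of NumPy; the parameter values below are made up for illustration and are not part of the problem:

```python
import numpy as np

# Illustrative parameters (not given in the problem)
rng = np.random.default_rng(0)
rho, mu0, mu1, sigma = 0.3, -2.0, 2.0, 1.0
N = 1000

# z_n ~ Bernoulli(rho); x_n | z_n ~ N(mu_{z_n}, sigma^2), independently over n
z = rng.binomial(1, rho, size=N)
x = rng.normal(np.where(z == 1, mu1, mu0), sigma)
```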

(a)
Write down the joint distribution $p(X, Z \mid \theta)$, and then write the complete-data log-likelihood

$$\log p(X, Z \mid \theta).$$

(b)
Let the current parameter be $\theta^{old} = (\rho^{old}, \mu_0^{old}, \mu_1^{old})$. Define the responsibility

$$r_n := P(z_n = 1 \mid x_n, \theta^{old}).$$

Derive an explicit formula for $r_n$.
Your final answer must be written in the form

$$r_n = \frac{\text{something}}{\text{something} + \text{something}}$$

not just up to proportionality.

(c)
Now consider the variational distribution

$$q(Z) = \prod_{n=1}^N q_n(z_n), \quad q_n(z_n = 1) = r_n, \quad q_n(z_n = 0) = 1 - r_n.$$

Write the ELBO

$$L(q, \theta) = \mathbb{E}_{q(Z)}[\log p(X, Z \mid \theta) - \log q(Z)].$$

You do not need to fully expand every constant, but you must rewrite it as a sum over $n$, and make the following three types of terms explicit:

  1. terms involving $r_n$ and $\rho$
  2. terms involving $r_n$ and the Gaussian means $\mu_0, \mu_1$
  3. the entropy term coming from $-\log q(Z)$

(d)
Now fix $q$, equivalently fix all $r_n$. Perform the M-step by maximizing the ELBO with respect to $\rho, \mu_0, \mu_1$, and derive the update formulas.
Your final answers must be simplified to

$$\rho^{new} = \dots, \quad \mu_0^{new} = \dots, \quad \mu_1^{new} = \dots$$

(e)
Explain why the following statement is correct:
“EM for this model can be viewed as a special case of VI.”
You are only allowed to organize your answer using these three keywords:

$$q(Z), \quad p(Z \mid X, \theta^{old}), \quad KL$$

Do not give a vague conceptual answer.


Solution

(a)

$$P(X, Z \mid \theta) = \prod_{n=1}^N P(x_n, z_n \mid \theta) = \prod_{n=1}^N \rho^{z_n} (1-\rho)^{1-z_n} \left(\mathcal{N}(x_n \mid \mu_1, \sigma^2)\right)^{z_n} \left(\mathcal{N}(x_n \mid \mu_0, \sigma^2)\right)^{1-z_n}$$

Expanding the Gaussian densities and collecting terms,

$$\log(P(X, Z \mid \theta)) = \sum_{n=1}^N \left[ z_n \log \rho + (1-z_n)\log(1-\rho) \right] - \frac{N}{2}\log(2\pi\sigma^2) - \frac{1}{2\sigma^2} \sum_{n=1}^N \left[ z_n(x_n-\mu_1)^2 + (1-z_n)(x_n-\mu_0)^2 \right]$$
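As a numerical sanity check, the complete-data log-likelihood above can be evaluated term by term; the data points and parameter values below are made up for illustration:

```python
import numpy as np

def complete_data_loglik(x, z, rho, mu0, mu1, sigma2):
    """log p(X, Z | theta) for the two-component model with known, shared sigma^2."""
    bern = z * np.log(rho) + (1 - z) * np.log(1 - rho)          # Bernoulli terms
    quad = z * (x - mu1) ** 2 + (1 - z) * (x - mu0) ** 2        # squared residuals
    return bern.sum() - 0.5 * len(x) * np.log(2 * np.pi * sigma2) - quad.sum() / (2 * sigma2)

x = np.array([-1.9, 2.2, -2.1])
z = np.array([0, 1, 0])
ll = complete_data_loglik(x, z, rho=0.3, mu0=-2.0, mu1=2.0, sigma2=1.0)
```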

(b) Note on an earlier mistake: I had added a product $\prod_{n=1}^N$, misreading the question — it asks about a single $z_n$, not the full posterior over $Z$.

Fix arbitrary $n$. For $r_n$,

$$r_n = P(z_n=1 \mid x_n, \theta^{old}) = \frac{P(x_n \mid z_n=1, \theta^{old})\, P(z_n=1 \mid \theta^{old})}{P(x_n \mid \theta^{old})} \quad \leftarrow \text{Bayes' rule}$$

$\leftarrow$ The denominator in Bayes' rule is the marginal distribution $P(x_n \mid \theta^{old})$,
$\leftarrow$ which acts as the normalization constant.

$$= \frac{\rho^{old}\,\mathcal{N}(x_n \mid \mu_1^{old}, \sigma^2)}{\sum_{z_n \in \{0,1\}} P(x_n, z_n \mid \theta^{old})} = \frac{\rho^{old}\,\mathcal{N}(x_n \mid \mu_1^{old}, \sigma^2)}{(1-\rho^{old})\,\mathcal{N}(x_n \mid \mu_0^{old}, \sigma^2) + \rho^{old}\,\mathcal{N}(x_n \mid \mu_1^{old}, \sigma^2)}$$
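The closed-form ratio can be computed directly; the sketch below works in log space so that extreme $x_n$ cannot underflow either Gaussian density (parameter values are illustrative):

```python
import numpy as np

def responsibilities(x, rho, mu0, mu1, sigma2):
    """r_n = P(z_n = 1 | x_n, theta), computed stably via log-densities."""
    log_n1 = -0.5 * np.log(2 * np.pi * sigma2) - (x - mu1) ** 2 / (2 * sigma2)
    log_n0 = -0.5 * np.log(2 * np.pi * sigma2) - (x - mu0) ** 2 / (2 * sigma2)
    a = np.log(rho) + log_n1        # log of the numerator
    b = np.log(1 - rho) + log_n0    # log of the other mixture term
    # r_n = exp(a) / (exp(a) + exp(b)) = sigmoid(a - b)
    return 1.0 / (1.0 + np.exp(b - a))

x = np.array([-2.0, 0.0, 2.0])
r = responsibilities(x, rho=0.5, mu0=-2.0, mu1=2.0, sigma2=1.0)
```

With $\rho = 0.5$ and means symmetric about the origin, the midpoint $x_n = 0$ gets responsibility exactly $1/2$, while points near each mean are assigned almost entirely to that component.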

(c)
With $L(q, \theta) = \mathbb{E}_{q(Z)}[\log P(X,Z \mid \theta) - \log q(Z)]$ and $q(Z) = \prod_{n=1}^N q_n(z_n)$:

By (a),

$$\log(P(X, Z \mid \theta)) = \sum_{n=1}^N \left[ z_n \log \rho + (1-z_n)\log(1-\rho) \right] - \frac{N}{2}\log(2\pi\sigma^2) - \frac{1}{2\sigma^2} \sum_{n=1}^N \left[ z_n(x_n-\mu_1)^2 + (1-z_n)(x_n-\mu_0)^2 \right]$$

$$\log q(Z) = \sum_{n=1}^N \log q_n(z_n), \qquad L(q, \theta) = \sum_{Z \in \{0,1\}^N} q(Z) \left[ \log P(X,Z \mid \theta) - \log q(Z) \right]$$

Since $\log P(X,Z \mid \theta)$ is linear in each $z_n$, taking the expectation under $q$ just substitutes $\mathbb{E}_q[z_n] = r_n$ and $\mathbb{E}_q[1-z_n] = 1-r_n$:

$$L(q, \theta) = \sum_{n=1}^N \left[ r_n \log \rho + (1-r_n)\log(1-\rho) \right] - \underbrace{\frac{N}{2}\log(2\pi\sigma^2)}_{\text{constant}} - \frac{1}{2\sigma^2} \sum_{n=1}^N \left[ r_n(x_n-\mu_1)^2 + (1-r_n)(x_n-\mu_0)^2 \right] - \sum_{n=1}^N \left[ r_n \log r_n + (1-r_n)\log(1-r_n) \right]$$
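The three groups of terms translate directly to code. One useful check: when every $r_n$ equals the exact posterior from (b), the ELBO coincides with the marginal log-likelihood $\sum_n \log p(x_n \mid \theta)$. A sketch, with illustrative values:

```python
import numpy as np

def elbo(x, r, rho, mu0, mu1, sigma2):
    """ELBO L(q, theta) with q_n(z_n = 1) = r_n, split into the three term groups."""
    eps = 1e-12  # guards log(0) when some r_n hits exactly 0 or 1
    prior = np.sum(r * np.log(rho) + (1 - r) * np.log(1 - rho))
    gauss = (-0.5 * len(x) * np.log(2 * np.pi * sigma2)
             - np.sum(r * (x - mu1) ** 2 + (1 - r) * (x - mu0) ** 2) / (2 * sigma2))
    entropy = -np.sum(r * np.log(r + eps) + (1 - r) * np.log(1 - r + eps))
    return prior + gauss + entropy

x = np.array([-1.0, 0.5, 2.0])
r = np.array([0.1, 0.6, 0.95])
L = elbo(x, r, rho=0.4, mu0=-2.0, mu1=2.0, sigma2=1.0)
```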

(d) M-step: with all $r_n$ fixed, the ELBO becomes an ordinary function of $\theta$, and

$$\theta^{new} = \arg\max_\theta L(q^{old}, \theta) = \arg\max_\theta L(\{ r_1, \dots, r_N \}; \theta)$$

$$L = \sum_{n=1}^N \left[ r_n \log \rho + (1-r_n)\log(1-\rho) \right] - \underbrace{\frac{N}{2}\log(2\pi\sigma^2)}_{\text{constant}} - \frac{1}{2\sigma^2} \sum_{n=1}^N \left[ r_n(x_n-\mu_1)^2 + (1-r_n)(x_n-\mu_0)^2 \right] - \underbrace{\sum_{n=1}^N \left[ r_n \log r_n + (1-r_n)\log(1-r_n) \right]}_{\text{constant for fixed } r_n}$$

Goal: Maximize $\sum_{n=1}^N \left[ r_n \log \rho + (1-r_n)\log(1-\rho) \right] - \frac{1}{2\sigma^2} \sum_{n=1}^N \left[ r_n(x_n-\mu_1)^2 + (1-r_n)(x_n-\mu_0)^2 \right]$

For $\rho$, let $f'(\rho) = \sum_{n=1}^N \left[ \frac{r_n}{\rho} - \frac{1-r_n}{1-\rho} \right] = 0$

$$\Rightarrow \frac{1}{\rho}\sum_{n=1}^N r_n = \frac{1}{1-\rho}\sum_{n=1}^N (1-r_n) = \frac{N}{1-\rho} - \frac{1}{1-\rho}\sum_{n=1}^N r_n$$

Cross-multiplying, $(1-\rho)\sum_n r_n = \rho\left(N - \sum_n r_n\right)$, so

$$\rho^{new} = \frac{1}{N} \sum_{n=1}^N r_n$$

For $\mu_1$: only the term $-\frac{1}{2\sigma^2}\sum_{n=1}^N r_n(x_n-\mu_1)^2$ depends on $\mu_1$, so set

$$\frac{\partial}{\partial \mu_1}\sum_{n=1}^N r_n(x_n-\mu_1)^2 = \sum_{n=1}^N \left( -2r_n(x_n-\mu_1) \right) = 0$$

$$\Rightarrow \mu_1 \sum_{n=1}^N r_n = \sum_{n=1}^N r_n x_n \quad \Rightarrow \quad \mu_1^{new} = \frac{\sum_{n=1}^N r_n x_n}{\sum_{n=1}^N r_n}$$

Similarly for $\mu_0$: only $-\frac{1}{2\sigma^2}\sum_{n=1}^N (1-r_n)(x_n-\mu_0)^2$ depends on $\mu_0$, so set

$$\frac{\partial}{\partial \mu_0}\sum_{n=1}^N (1-r_n)(x_n-\mu_0)^2 = \sum_{n=1}^N \left( -2(1-r_n)(x_n-\mu_0) \right) = 0$$

$$\Rightarrow \mu_0 \sum_{n=1}^N (1-r_n) = \sum_{n=1}^N (1-r_n) x_n \quad \Rightarrow \quad \mu_0^{new} = \frac{\sum_{n=1}^N (1-r_n) x_n}{\sum_{n=1}^N (1-r_n)}$$
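Combining the E-step from (b) with the M-step updates from (d) gives one full EM iteration; a minimal sketch on synthetic data (all parameter values and the budget of 50 iterations are illustrative):

```python
import numpy as np

def em_step(x, rho, mu0, mu1, sigma2):
    """One EM iteration: E-step responsibilities, then the closed-form M-step."""
    n1 = np.exp(-((x - mu1) ** 2) / (2 * sigma2))   # Gaussian kernels; the shared
    n0 = np.exp(-((x - mu0) ** 2) / (2 * sigma2))   # normalizer cancels in r_n
    r = rho * n1 / (rho * n1 + (1 - rho) * n0)      # E-step: responsibilities (part b)
    rho_new = r.mean()                              # rho^new = (1/N) sum_n r_n
    mu1_new = (r * x).sum() / r.sum()               # weighted mean, weights r_n
    mu0_new = ((1 - r) * x).sum() / (1 - r).sum()   # weighted mean, weights 1 - r_n
    return rho_new, mu0_new, mu1_new

# Synthetic data from the true model (rho = 0.3, means -2 and 2, sigma = 1)
rng = np.random.default_rng(1)
z_true = rng.binomial(1, 0.3, size=2000)
x = rng.normal(np.where(z_true == 1, 2.0, -2.0), 1.0)

theta = (0.5, -1.0, 1.0)  # deliberately poor initialization
for _ in range(50):
    theta = em_step(x, *theta, sigma2=1.0)
```

Because $\sigma^2$ is known and shared, the normalizing constants of the two Gaussians cancel in the responsibility ratio, which is why `n1` and `n0` can be unnormalized kernels.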