Problem 1 | EM $\times$ ELBO $\times$ Latent Variable Model
Fix the notation first.
- The dataset is $X = \{x_1, \dots, x_N\}$, where each $x_n \in \mathbb{R}$ is a scalar
- The latent variables are $Z = \{z_1, \dots, z_N\}$, where each $z_n \in \{0, 1\}$ is a scalar binary indicator, not a one-hot vector
- The parameter is $\theta = (\rho, \mu_0, \mu_1)$
- $\sigma^2 > 0$ is a known constant, not a parameter to estimate
- $\rho = P(z_n = 1)$, so $P(z_n = 0) = 1 - \rho$
- The conditional distributions are $x_n \mid z_n = 1 \sim \mathcal{N}(\mu_1, \sigma^2)$ and $x_n \mid z_n = 0 \sim \mathcal{N}(\mu_0, \sigma^2)$
- The samples are independent across $n = 1, \dots, N$
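As a concrete illustration of this generative model (not part of the problem statement), here is a minimal NumPy sketch that samples $z_n \sim \mathrm{Bernoulli}(\rho)$ and then $x_n \mid z_n \sim \mathcal{N}(\mu_{z_n}, \sigma^2)$; the parameter values are arbitrary assumptions chosen for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical parameter values, for illustration only.
rho, mu0, mu1, sigma2 = 0.3, -2.0, 2.0, 1.0
N = 1000

# z_n ~ Bernoulli(rho); x_n | z_n ~ N(mu_{z_n}, sigma^2), independent across n.
z = rng.random(N) < rho
x = np.where(z, mu1, mu0) + np.sqrt(sigma2) * rng.standard_normal(N)
```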
(a)
Write down the joint distribution $p(X, Z \mid \theta)$, and then write the complete-data log-likelihood $\log p(X, Z \mid \theta)$.
(b)
Let the current parameter be $\theta^{old} = (\rho^{old}, \mu_0^{old}, \mu_1^{old})$. Define the responsibility
$$r_n = P(z_n = 1 \mid x_n, \theta^{old}).$$
Derive an explicit formula for $r_n$.
Your final answer must be written as a fully normalized expression, not just up to proportionality.
(c)
Now consider the factorized variational distribution $q(Z) = \prod_{n=1}^N q_n(z_n)$ with $q_n(z_n = 1) = r_n$.
Write the ELBO
$$L(q, \theta) = \mathbb{E}_{q(Z)}[\log p(X, Z \mid \theta) - \log q(Z)].$$
You do not need to fully expand every constant, but you must rewrite it as a sum over $n$ and make the following three types of terms explicit:
- terms involving $r_n$ and $\rho$
- terms involving $r_n$ and the Gaussian means $\mu_0, \mu_1$
- the entropy term coming from $-\log q(Z)$
(d)
Now fix $q$, equivalently fix all $r_n$. Perform the M-step by maximizing the ELBO with respect to $\rho, \mu_0, \mu_1$, and derive the update formulas.
Your final answers must be simplified to closed-form expressions for $\rho^{new}$, $\mu_0^{new}$, and $\mu_1^{new}$.
(e)
Explain why the following statement is correct:
“EM for this model can be viewed as a special case of VI.”
You must organize your answer using only these three keywords:
Do not give a vague conceptual answer.
Solution
(a)
$$P(X, Z \mid \theta) = \prod_{n=1}^N P(x_n, z_n \mid \theta) \quad \text{(by independence across } n\text{)}$$
$$= \prod_{n=1}^N \rho^{z_n} (1-\rho)^{1-z_n} \, \mathcal{N}(x_n \mid \mu_1, \sigma^2)^{z_n} \, \mathcal{N}(x_n \mid \mu_0, \sigma^2)^{1-z_n}$$
Expanding the Gaussians and collecting terms, the complete-data log-likelihood is
$$\log P(X, Z \mid \theta) = \sum_{n=1}^N \left[ z_n \log \rho + (1-z_n)\log(1-\rho) \right] - \frac{N}{2}\log(2\pi\sigma^2) - \frac{1}{2\sigma^2} \sum_{n=1}^N \left[ z_n(x_n-\mu_1)^2 + (1-z_n)(x_n-\mu_0)^2 \right]$$
(b) Note on an earlier mistake: I had wrongly included a product $\prod_{n=1}^N$, misreading the question, which asks about a single variable $z_n$.
Fix arbitrary $n$. For $r_n$,
$$r_n = P(z_n=1 \mid x_n, \theta^{old}) \quad \leftarrow \text{Bayes' posterior}$$
$$= \frac{P(x_n \mid z_n=1, \theta^{old})\, P(z_n=1 \mid \theta^{old})}{P(x_n \mid \theta^{old})}$$
The denominator in Bayes' rule is the marginal distribution, i.e. the normalization constant. Expanding both numerator and denominator gives the explicit, fully normalized answer:
$$r_n = \frac{\rho^{old}\, \mathcal{N}(x_n \mid \mu_1^{old}, \sigma^2)}{\rho^{old}\, \mathcal{N}(x_n \mid \mu_1^{old}, \sigma^2) + (1-\rho^{old})\, \mathcal{N}(x_n \mid \mu_0^{old}, \sigma^2)}$$
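The normalized responsibility can be sanity-checked numerically. Below is a minimal NumPy sketch with made-up values for $\theta^{old}$ (all parameter values are illustrative assumptions); with symmetric means $\pm 2$ and $\rho = 0.5$, the responsibility at $x_n = 0$ should be exactly $1/2$:

```python
import numpy as np

def normal_pdf(x, mu, sigma2):
    # Univariate Gaussian density N(x | mu, sigma^2).
    return np.exp(-(x - mu) ** 2 / (2 * sigma2)) / np.sqrt(2 * np.pi * sigma2)

def responsibility(x, rho, mu0, mu1, sigma2):
    # r_n = P(z_n = 1 | x_n, theta^old): Bernoulli prior times likelihood,
    # divided by the marginal P(x_n | theta^old) (the normalization constant).
    num = rho * normal_pdf(x, mu1, sigma2)
    den = num + (1 - rho) * normal_pdf(x, mu0, sigma2)
    return num / den

# Hypothetical theta^old, for illustration only.
x = np.array([-3.0, 0.0, 3.0])
r = responsibility(x, rho=0.5, mu0=-2.0, mu1=2.0, sigma2=1.0)
```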
(c)
The ELBO is $L(q, \theta) = \mathbb{E}_{q(Z)}[\log P(X,Z \mid \theta) - \log q(Z)]$, with the factorized variational distribution $q(Z) = \prod_{n=1}^N q_n(z_n)$ and $q_n(z_n = 1) = r_n$.
By (a),
$$\log P(X, Z \mid \theta) = \sum_{n=1}^N \left[ z_n \log \rho + (1-z_n)\log(1-\rho) \right] - \frac{N}{2}\log(2\pi\sigma^2) - \frac{1}{2\sigma^2} \sum_{n=1}^N \left[ z_n(x_n-\mu_1)^2 + (1-z_n)(x_n-\mu_0)^2 \right]$$
$$\log q(Z) = \sum_{n=1}^N \log q_n(z_n), \qquad L(q, \theta) = \sum_{Z \in \{0,1\}^N} q(Z) \left[ \log P(X,Z \mid \theta) - \log q(Z) \right]$$
Substituting $\mathbb{E}_q[z_n] = r_n$ and $\mathbb{E}_q[1-z_n] = 1-r_n$ into $L(q, \theta)$:
$$L(q, \theta) = \sum_{n=1}^N \left[ r_n \log \rho + (1-r_n)\log(1-\rho) \right] - \underbrace{\frac{N}{2}\log(2\pi\sigma^2)}_{\text{constant}} - \frac{1}{2\sigma^2} \sum_{n=1}^N \left[ r_n(x_n-\mu_1)^2 + (1-r_n)(x_n-\mu_0)^2 \right] - \sum_{n=1}^N \left[ r_n \log r_n + (1-r_n)\log(1-r_n) \right]$$
(d) M-step: $\theta^{new} = \arg\max_\theta L(q^{old}, \theta)$
$$= \arg\max_\theta L(\{ r_1, r_2, \dots, r_N \}; \theta) \quad \leftarrow q \text{ parametrized by the fixed } r_n$$
$$L = \sum_{n=1}^N \left[ r_n \log \rho + (1-r_n)\log(1-\rho) \right] - \underbrace{\frac{N}{2}\log(2\pi\sigma^2)}_{\text{constant}} - \frac{1}{2\sigma^2} \sum_{n=1}^N \left[ r_n(x_n-\mu_1)^2 + (1-r_n)(x_n-\mu_0)^2 \right] - \underbrace{\sum_{n=1}^N \left[ r_n \log r_n + (1-r_n)\log(1-r_n) \right]}_{\text{constant for fixed } r_n}$$
Goal: maximize
$$\sum_{n=1}^N \left[ r_n \log \rho + (1-r_n)\log(1-\rho) \right] - \frac{1}{2\sigma^2} \sum_{n=1}^N \left[ r_n(x_n-\mu_1)^2 + (1-r_n)(x_n-\mu_0)^2 \right]$$
For $\rho$, set the derivative of the $\rho$-dependent part to zero: $\sum_{n=1}^N \left[ \frac{r_n}{\rho} - \frac{1-r_n}{1-\rho} \right] = 0$
$$\Rightarrow \sum_{n=1}^N \frac{r_n}{\rho} = \sum_{n=1}^N \frac{1-r_n}{1-\rho}$$
$$\frac{1}{\rho}\sum_{n=1}^N r_n = \frac{N}{1-\rho} - \frac{1}{1-\rho}\sum_{n=1}^N r_n$$
Cross-multiplying, $(1-\rho)\sum_n r_n = \rho\left(N - \sum_n r_n\right)$, i.e. $\sum_n r_n = \rho N$, so
$$\rho^{new} = \frac{1}{N} \sum_{n=1}^N r_n$$
For $\mu_1$, set $\partial \left( \sum_{n=1}^N r_n(x_n-\mu_1)^2 + \text{rest of the terms} \right) / \partial \mu_1 = 0$ (the factor $-\frac{1}{2\sigma^2}$ does not affect the stationary point)
$$\sum_{n=1}^N \left( -2r_n(x_n-\mu_1) \right) = 0$$
$$\sum_n \left[ -2r_n x_n + 2r_n \mu_1 \right] = 0$$
$$2\mu_1 \sum_n r_n = 2\sum_n r_n x_n \;\Rightarrow\; \mu_1^{new} = \frac{\sum_{n=1}^N r_n x_n}{\sum_{n=1}^N r_n}$$
For $\mu_0$, only the term $\sum_{n=1}^N (1-r_n)(x_n-\mu_0)^2$ depends on $\mu_0$, so set $\partial \left( \sum_{n=1}^N (1-r_n)(x_n-\mu_0)^2 \right) / \partial \mu_0 = 0$. By the same calculation as for $\mu_1$,
$$\mu_0^{new} = \frac{\sum_{n=1}^N (1-r_n) x_n}{\sum_{n=1}^N (1-r_n)}$$
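The E-step responsibility from (b) and the M-step updates from (d) can be combined into a full EM loop. The sketch below is an illustrative implementation with synthetic data from assumed ground-truth parameters (all values hypothetical); it also tracks the marginal log-likelihood $\log p(X \mid \theta)$, which EM never decreases:

```python
import numpy as np

def normal_pdf(x, mu, sigma2):
    # Univariate Gaussian density N(x | mu, sigma^2).
    return np.exp(-(x - mu) ** 2 / (2 * sigma2)) / np.sqrt(2 * np.pi * sigma2)

def em_step(x, rho, mu0, mu1, sigma2):
    # E-step: responsibilities r_n = P(z_n = 1 | x_n, theta) from part (b).
    num = rho * normal_pdf(x, mu1, sigma2)
    r = num / (num + (1 - rho) * normal_pdf(x, mu0, sigma2))
    # M-step: closed-form updates from part (d).
    rho_new = r.mean()
    mu1_new = (r * x).sum() / r.sum()
    mu0_new = ((1 - r) * x).sum() / (1 - r).sum()
    return rho_new, mu0_new, mu1_new

def log_likelihood(x, rho, mu0, mu1, sigma2):
    # Marginal log-likelihood log p(X | theta), summing z_n out per sample.
    return np.log(rho * normal_pdf(x, mu1, sigma2)
                  + (1 - rho) * normal_pdf(x, mu0, sigma2)).sum()

# Synthetic data from assumed ground truth (rho=0.3, mu0=-2, mu1=2, sigma2=1).
rng = np.random.default_rng(1)
z = rng.random(2000) < 0.3
x = np.where(z, 2.0, -2.0) + rng.standard_normal(2000)

theta = (0.5, -1.0, 1.0)  # crude initialization of (rho, mu0, mu1)
lls = []
for _ in range(50):
    lls.append(log_likelihood(x, *theta, 1.0))
    theta = em_step(x, *theta, 1.0)
```

With well-separated components, the estimates recover the ground truth closely, and the recorded log-likelihoods form a non-decreasing sequence, illustrating EM's monotonicity guarantee.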