Exam Pattern: EM on Gaussian Mixture
This note consolidates the full EM procedure for a Gaussian Mixture Model (GMM) — the most likely long-answer exam question for STA414.
Setup
Given: $N$ observations $\{x_1, \dots, x_N\}$, $K$ mixture components, parameters $\theta = \{\pi_k, \mu_k, \Sigma_k\}_{k=1}^K$.
Log-likelihood:
$$\ell(\theta) = \sum_{n=1}^N \log \left(\sum_{k=1}^K \pi_k \,\mathcal{N}_m(x_n \mid \mu_k, \Sigma_k)\right)$$
E-Step: Compute Responsibilities
$$r_{nk} = p(z_n = k \mid x_n, \theta^{(t)}) = \frac{\pi_k^{(t)} \,\mathcal{N}_m(x_n \mid \mu_k^{(t)}, \Sigma_k^{(t)})}{\sum_{j=1}^K \pi_j^{(t)} \,\mathcal{N}_m(x_n \mid \mu_j^{(t)}, \Sigma_j^{(t)})}$$
Define the effective counts: $N_k = \sum_{n=1}^N r_{nk}$.
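The E-step maps directly onto array operations. A minimal NumPy sketch (function names are my own, not from the course):

```python
import numpy as np

def gaussian_pdf(X, mu, Sigma):
    """Density of N(mu, Sigma) evaluated at each row of X (shape (N, m))."""
    m = mu.shape[0]
    diff = X - mu
    quad = np.einsum("ni,ij,nj->n", diff, np.linalg.inv(Sigma), diff)
    norm = np.sqrt((2 * np.pi) ** m * np.linalg.det(Sigma))
    return np.exp(-0.5 * quad) / norm

def e_step(X, pis, mus, Sigmas):
    """Responsibilities r[n, k] = pi_k N(x_n | mu_k, Sigma_k) / normaliser."""
    dens = np.column_stack([gaussian_pdf(X, mus[k], Sigmas[k])
                            for k in range(len(pis))])
    weighted = dens * pis                        # broadcasts pi_k over rows
    return weighted / weighted.sum(axis=1, keepdims=True)
```

Each row of the returned matrix sums to 1, and the effective counts are simply `r.sum(axis=0)`.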
M-Step: Update Parameters
Means:
$$\mu_k^{(t+1)} = \frac{1}{N_k}\sum_{n=1}^N r_{nk}\,x_n$$
Covariances:
$$\Sigma_k^{(t+1)} = \frac{1}{N_k}\sum_{n=1}^N r_{nk}\,(x_n - \mu_k^{(t+1)})(x_n - \mu_k^{(t+1)})^\top$$
Mixing weights:
$$\pi_k^{(t+1)} = \frac{N_k}{N}$$
Common Exam Variations
Variation 1: Derive the M-step for $\mu_k$
Set $\frac{\partial}{\partial \mu_k}Q(\theta \mid \theta^{(t)}) = 0$ where $Q$ is the expected complete-data log-likelihood:
$$Q = \sum_{n=1}^N \sum_{k=1}^K r_{nk}\left[\log \pi_k - \frac{m}{2}\log(2\pi) - \frac{1}{2}\log|\Sigma_k| - \frac{1}{2}(x_n - \mu_k)^\top \Sigma_k^{-1}(x_n - \mu_k)\right]$$
Differentiating w.r.t. $\mu_k$:
$$\sum_{n=1}^N r_{nk}\,\Sigma_k^{-1}(x_n - \mu_k) = 0 \implies \mu_k = \frac{\sum_n r_{nk}\,x_n}{\sum_n r_{nk}}$$
Variation 2: What happens when $K = 1$?
With a single component, every responsibility is $r_{n1} = 1$ and $\pi_1 = 1$, so one M-step gives the sample mean and sample covariance — no iteration needed.
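A quick numerical check of this reduction, on synthetic data (a sketch; the $1/N$-normalised covariance matches `np.cov` with `bias=True`):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))

# With K = 1 every responsibility is 1, so a single M-step yields
# the sample mean and the (1/N) sample covariance.
r = np.ones((len(X), 1))
Nk = r.sum()
mu = (r.T @ X).ravel() / Nk
diff = X - mu
Sigma = (r[:, 0, None] * diff).T @ diff / Nk

assert np.allclose(mu, X.mean(axis=0))
assert np.allclose(Sigma, np.cov(X.T, bias=True))
```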
Variation 3: Connection to K-Means
When $\Sigma_k = \varepsilon I$ is held fixed (with equal mixing weights) and $\varepsilon \to 0$, the responsibilities become hard (0/1) assignments to the nearest mean, and EM reduces to the K-means algorithm.
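With $\Sigma_k = \varepsilon I$ and equal weights, the responsibilities are a softmax of $-\lVert x - \mu_k\rVert^2 / (2\varepsilon)$, which hardens as $\varepsilon$ shrinks. A small illustrative sketch (names are my own):

```python
import numpy as np

def responsibilities_isotropic(x, mus, eps):
    """Responsibilities for Sigma_k = eps * I and equal mixing weights:
    a softmax of the negative scaled squared distances."""
    logits = -np.sum((mus - x) ** 2, axis=1) / (2 * eps)
    logits -= logits.max()              # stabilise the softmax numerically
    w = np.exp(logits)
    return w / w.sum()

mus = np.array([[0.0, 0.0], [3.0, 0.0]])
x = np.array([1.0, 0.0])                # closer to the first centre

print(responsibilities_isotropic(x, mus, 1.0))    # soft assignment
print(responsibilities_isotropic(x, mus, 0.01))   # essentially one-hot
```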
Checklist for Exam
- Write the generative model (latent $z$, then $x \mid z$)
- Write the log-likelihood with the sum inside the log
- Write the responsibility formula (E-step)
- Derive or state the M-step updates for $\mu_k$, $\Sigma_k$, $\pi_k$
- State monotonicity: $\ell(\theta^{(t+1)}) \geq \ell(\theta^{(t)})$
- Mention convergence to local optimum, not global
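For reference, the checklist can be exercised end-to-end. A minimal NumPy sketch (my own code, not the course's reference implementation; the small ridge added to each $\Sigma_k$ is a numerical safeguard and slightly perturbs exact monotonicity):

```python
import numpy as np

def log_gauss(X, mu, Sigma):
    """log N(x | mu, Sigma) for each row of X."""
    m = mu.shape[0]
    diff = X - mu
    quad = np.einsum("ni,ij,nj->n", diff, np.linalg.inv(Sigma), diff)
    _, logdet = np.linalg.slogdet(Sigma)
    return -0.5 * (m * np.log(2 * np.pi) + logdet + quad)

def em_gmm(X, K, iters=50, seed=0):
    rng = np.random.default_rng(seed)
    N, m = X.shape
    pis = np.full(K, 1.0 / K)
    mus = X[rng.choice(N, K, replace=False)]          # init at random points
    Sigmas = np.array([np.cov(X.T) + 1e-6 * np.eye(m) for _ in range(K)])
    lls = []
    for _ in range(iters):
        # E-step in the log domain for numerical stability
        logw = np.column_stack([np.log(pis[k]) + log_gauss(X, mus[k], Sigmas[k])
                                for k in range(K)])
        mx = logw.max(axis=1, keepdims=True)
        lse = mx + np.log(np.exp(logw - mx).sum(axis=1, keepdims=True))
        lls.append(lse.sum())                         # log-likelihood at theta^(t)
        r = np.exp(logw - lse)
        # M-step: closed-form updates from the note
        Nk = r.sum(axis=0)
        pis = Nk / N
        mus = (r.T @ X) / Nk[:, None]
        for k in range(K):
            d = X - mus[k]
            Sigmas[k] = (r[:, k, None] * d).T @ d / Nk[k] + 1e-6 * np.eye(m)
    return pis, mus, Sigmas, lls
```

Running this on well-separated synthetic clusters lets you verify the monotonicity property from the checklist: the recorded log-likelihoods should be (near-)non-decreasing.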