K-Means Algorithm
K-Means is a clustering algorithm that partitions $N$ data points into $K$ clusters by alternating between assignment and refitting steps.
1. Objective
Minimize the distortion (sum of squared distances to cluster centers):
$$J = \sum_{n=1}^N \sum_{k=1}^K r_{nk} \|x_n - \mu_k\|^2$$where $r_{nk} \in \{0,1\}$ is a hard assignment indicator and $\mu_k$ is the center of cluster $k$.
2. Algorithm
Initialize cluster centers $\mu_1, \dots, \mu_K$. Repeat until convergence:
Assignment step: Assign each point to the nearest center:
$$r_{nk} = \begin{cases} 1 & \text{if } k = \arg\min_j \|x_n - \mu_j\|^2 \\ 0 & \text{otherwise} \end{cases}$$Refitting step: Update each center to the mean of its assigned points:
$$\mu_k = \frac{\sum_{n=1}^N r_{nk}\,x_n}{\sum_{n=1}^N r_{nk}}$$3. Properties
- Each step decreases or maintains $J$, guaranteeing convergence to a local minimum
- The solution depends on initialization
- Common initialization: K-means++ (choose initial centers spread apart)
4. Relation to EM for GMMs
K-Means is the limiting case of Expectation-Maximization (EM) applied to a Gaussian Mixture Model (GMM) where:
- All covariances are $\Sigma_k = \varepsilon I$
- $\varepsilon \to 0$
In this limit:
- Soft responsibilities $r_{nk} \in [0,1]$ become hard assignments $r_{nk} \in \{0,1\}$
- The E-step becomes the assignment step
- The M-step becomes the refitting step