K-Means Algorithm

K-Means is a clustering algorithm that partitions $N$ data points into $K$ clusters by alternating between assignment and refitting steps.

1. Objective

Minimize the distortion (sum of squared distances to cluster centers):

$$J = \sum_{n=1}^N \sum_{k=1}^K r_{nk} \|x_n - \mu_k\|^2$$

where $r_{nk} \in \{0,1\}$ is a hard assignment indicator and $\mu_k$ is the center of cluster $k$.

2. Algorithm

Initialize cluster centers $\mu_1, \dots, \mu_K$. Repeat until convergence:

Assignment step: Assign each point to the nearest center:

$$r_{nk} = \begin{cases} 1 & \text{if } k = \arg\min_j \|x_n - \mu_j\|^2 \\ 0 & \text{otherwise} \end{cases}$$

Refitting step: Update each center to the mean of its assigned points:

$$\mu_k = \frac{\sum_{n=1}^N r_{nk}\,x_n}{\sum_{n=1}^N r_{nk}}$$

3. Properties

  • Each step decreases or maintains $J$, guaranteeing convergence to a local minimum
  • The solution depends on initialization
  • Common initialization: K-means++ (choose initial centers spread apart)

4. Relation to EM for GMMs

K-Means is the limiting case of Expectation-Maximization (EM) applied to a Gaussian Mixture Model (GMM) where:

  • All covariances are $\Sigma_k = \varepsilon I$
  • $\varepsilon \to 0$

In this limit:

  • Soft responsibilities $r_{nk} \in [0,1]$ become hard assignments $r_{nk} \in \{0,1\}$
  • The E-step becomes the assignment step
  • The M-step becomes the refitting step