Maximum Likelihood Estimation (MLE)
Maximum Likelihood Estimation (MLE) estimates an unknown parameter by choosing the value that makes the observed data most probable under the model.
1. Definition
Given i.i.d. observations $x_1,\dots,x_n$ from a model $p(x|\theta)$, the likelihood function is
$$L(\theta) = \prod_{i=1}^n p(x_i|\theta)$$

The MLE is

$$\hat{\theta}_{\text{MLE}} = \arg\max_\theta L(\theta)$$

Since $\log(\cdot)$ is strictly increasing, this is equivalent to maximizing the log-likelihood

$$\ell(\theta) = \log L(\theta) = \sum_{i=1}^n \log p(x_i|\theta)$$

2. Standard Procedure
For differentiable models, MLE is typically obtained by solving
$$\frac{\partial \ell(\theta)}{\partial \theta} = 0$$

and checking that the solution gives a maximum (e.g. via the second derivative).
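As a concrete instance of this procedure, the sketch below fits a Bernoulli parameter by numerically minimizing the negative log-likelihood and compares the result to the closed-form solution $\hat{p} = \bar{x}$ (obtained by setting $\partial\ell/\partial p = 0$). The synthetic data and all variable names are illustrative, assuming NumPy and SciPy are available.

```python
import numpy as np
from scipy.optimize import minimize_scalar

rng = np.random.default_rng(0)
x = rng.binomial(1, 0.3, size=1000)  # synthetic i.i.d. Bernoulli(0.3) sample

def neg_log_lik(p):
    # negative log-likelihood of i.i.d. Bernoulli data:
    # -ell(p) = -[ (#ones) log p + (#zeros) log(1 - p) ]
    k = x.sum()
    return -(k * np.log(p) + (len(x) - k) * np.log(1.0 - p))

# maximize ell(p) by minimizing -ell(p) over the open interval (0, 1)
res = minimize_scalar(neg_log_lik, bounds=(1e-6, 1 - 1e-6), method="bounded")

print(res.x, x.mean())  # numerical MLE agrees with the closed form p_hat = mean(x)
```

Minimizing the negative log-likelihood (rather than maximizing the likelihood directly) is the standard numerical formulation; it avoids underflow from multiplying many small probabilities.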
3. Interpretation
MLE selects the parameter under which the observed sample is most likely to have been generated.
4. Core Properties
- Consistency: Under regularity conditions, $\hat{\theta}_{\text{MLE}} \to \theta_0$ as $n \to \infty$.
- Asymptotic Normality: For large $n$,
$$\sqrt{n}\,(\hat{\theta}_{\text{MLE}} - \theta_0) \xrightarrow{d} \mathcal{N}\!\left(0,\, I(\theta_0)^{-1}\right),$$
where $I(\theta)$ is the Fisher information.
- Invariance: If $\hat{\theta}_{\text{MLE}}$ is the MLE of $\theta$, then the MLE of $g(\theta)$ is $g(\hat{\theta}_{\text{MLE}})$.
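The invariance property can be illustrated with the Gaussian model, where both MLEs have closed forms: the sample mean for $\mu$ and the (biased) sample variance for $\sigma^2$. A minimal sketch, assuming NumPy and a synthetic $\mathcal{N}(5, 2^2)$ sample; all names are illustrative:

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.normal(loc=5.0, scale=2.0, size=10_000)  # synthetic N(5, 2^2) sample

# Gaussian MLEs (closed form): mu_hat = sample mean,
# var_hat = biased sample variance (divisor n, not n - 1)
mu_hat = x.mean()
var_hat = np.mean((x - mu_hat) ** 2)

# Invariance: the MLE of g(theta) is g(theta_hat),
# so the MLE of sigma = sqrt(variance) is sqrt(var_hat)
sigma_hat = np.sqrt(var_hat)

print(mu_hat, sigma_hat)  # close to the true parameters 5.0 and 2.0
```

With $n = 10{,}000$ draws, both estimates land near the true values, which also illustrates consistency informally.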
5. Limitation
For latent-variable models, direct maximization of $\ell(\theta)$ is often difficult, which motivates methods such as Expectation-Maximization (EM) and Variational Inference (VI).
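To make the difficulty concrete: with a latent variable $z$, the log-likelihood involves a logarithm of a sum (or integral) over $z$,

$$\ell(\theta) = \sum_{i=1}^n \log \sum_{z} p(x_i, z \mid \theta),$$

and the log no longer distributes over the inner sum, so the convenient sum-of-logs form from Section 1 is lost and the stationarity equation generally has no closed-form solution.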