Factor Analysis Model
Factor analysis assumes that the “observed” random vector $X = \begin{pmatrix} X_1 \\ \vdots \\ X_p \end{pmatrix}$ is generated from the “latent” random vector $Z = \begin{pmatrix} Z_1 \\ \vdots \\ Z_r \end{pmatrix}$ via the linear model
$$ \begin{aligned} X_1 - \mu_1 &= \ell_{11} Z_1 + \dots + \ell_{1r} Z_r + \varepsilon_1 \\ &\vdots \\ X_p - \mu_p &= \ell_{p1} Z_1 + \dots + \ell_{pr} Z_r + \varepsilon_p \end{aligned} \qquad = \qquad \underset{\substack{\text{common} \\ \text{part}}}{\boxed{\begin{matrix} C_1 \\ \vdots \\ C_p \end{matrix}}} + \underset{\substack{\text{unique} \\ \text{part}}}{\boxed{\begin{matrix} \varepsilon_1 \\ \vdots \\ \varepsilon_p \end{matrix}}} $$
Comment: we assume that $\mu = \begin{pmatrix} \mu_1 \\ \vdots \\ \mu_p \end{pmatrix} = 0$; otherwise set $X := X - \mu$.
Example
- $X_1 \dots X_p$ are test scores on various topics
- $Z_1$ verbal intelligence
- $Z_2$ mathematical intelligence
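A minimal numpy sketch of this generative story, using a two-factor version of the example (all sizes and loading values below are invented for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
n, p, r = 500, 6, 2  # hypothetical sizes: 6 test scores, 2 latent factors

# Hypothetical loadings: first 3 tests load on "verbal", last 3 on "math".
L = np.array([[0.9, 0.1],
              [0.8, 0.2],
              [0.7, 0.0],
              [0.1, 0.9],
              [0.0, 0.8],
              [0.2, 0.7]])
Psi = np.diag(rng.uniform(0.2, 0.5, size=p))      # unique variances

Z = rng.standard_normal((n, r))                   # latent factors, var(Z) = I_r
eps = rng.standard_normal((n, p)) @ np.sqrt(Psi)  # errors, var(eps) = Psi
X = Z @ L.T + eps                                 # observed scores (mu = 0)
```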
Graphical representation of FA
- LATENT FACTORS: $Z_1, \dots, Z_r$
- LOADINGS: $\ell_{ij}$
- OBSERVED (DATA): $X_1, \dots, X_p$
- ERRORS: $\varepsilon_1, \dots, \varepsilon_p$
$$ X = \overset{C}{\boxed{\color{orange}{L} \color{green}{Z}}} + \varepsilon $$
$$ X = \begin{pmatrix} X_1 \\ \vdots \\ X_p \end{pmatrix} \quad C = \begin{pmatrix} C_1 \\ \vdots \\ C_p \end{pmatrix} \quad Z = \begin{pmatrix} \color{green}{Z_1} \\ \vdots \\ \color{green}{Z_r} \end{pmatrix} $$
$$ \varepsilon = \begin{pmatrix} \varepsilon_1 \\ \vdots \\ \varepsilon_p \end{pmatrix} \quad L = \begin{pmatrix} \color{orange}{\ell_{11}} & \dots & \color{orange}{\ell_{1r}} \\ \vdots & & \vdots \\ \color{orange}{\ell_{p1}} & \dots & \color{orange}{\ell_{pr}} \end{pmatrix} $$
Orthogonal Factor model
$$ X = L Z + \varepsilon $$
- $E(Z) = 0$, $var(Z) = I_r$: latent factors $Z_1, \dots, Z_r$ are uncorrelated
- $E(\varepsilon) = 0$, $var(\varepsilon) = \Psi = diag(\Psi_1, \dots, \Psi_p)$: errors $\varepsilon_1, \dots, \varepsilon_p$ are uncorrelated
- $cov(Z, \varepsilon) = 0$: $Z_i$ and $\varepsilon_j$ are uncorrelated

It follows that
- $var(X) = \Sigma = LL^T + \Psi$
- $cov(X, Z) = L$
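The second identity follows directly from the assumptions:
$$ cov(X, Z) = cov(LZ + \varepsilon, Z) = L \, var(Z) + cov(\varepsilon, Z) = L I_r + 0 = L $$
(the first is derived in the MLE section below).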
Unknown parameters: $L, \Psi$
If we knew $\Sigma$, the goal would be to decompose it as $\Sigma = LL^T + \Psi$
Estimation (via PCA)
- If $\Psi$ is known, then $L$ can be found via eigendecomposition of $\Sigma^* = \Sigma - \Psi$
- If we know $L$, then $\Psi = \begin{pmatrix} \Sigma_{11}-h_1 & & \\ & \ddots & \\ & & \Sigma_{pp}-h_p \end{pmatrix}$ where $h_i = \sum_{j=1}^r \ell_{ij}^2$ is the communality for $X_i$
Algorithm
Input: $\Sigma \in \mathbb{R}^{p \times p}$, rank $r$
Step 1: $\Sigma = V \Lambda V^T \Rightarrow L = (\sqrt{\lambda_1}V_1 \dots \sqrt{\lambda_r}V_r)$
Step 2: $h_i = \sum_{j=1}^r \ell_{ij}^2 \Rightarrow \Psi = \text{diag}(\Sigma_{11}-h_1, \dots, \Sigma_{pp}-h_p)$
Output: loadings $L$, variances $\Psi$.
In practice:
- we estimate $\Sigma$ from data (by $S$)
- we do not know $\Psi$, so at step 1 we take $\Psi = 0$ (i.e. $\Sigma^* = \Sigma$), then update $\Psi$ at step 2
- $\Sigma^*$ is not guaranteed to be PSD or to have rank $r$, so step 1 yields only an approximation $\Sigma^* \simeq LL^T$ (see the sketch below)
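A minimal numpy sketch of this procedure (the function name `fa_pca` is mine; it takes the estimated covariance matrix `S` and the desired rank `r`, and continues from the simulation above):

```python
import numpy as np

def fa_pca(S, r):
    """PCA-style factor extraction: approximate S by L L^T + Psi."""
    lam, V = np.linalg.eigh(S)                        # eigenvalues in ascending order
    idx = np.argsort(lam)[::-1][:r]                   # indices of the top-r eigenpairs
    L = V[:, idx] * np.sqrt(np.maximum(lam[idx], 0))  # L = (sqrt(lam_1) v_1, ..., sqrt(lam_r) v_r)
    h = np.sum(L**2, axis=1)                          # communalities h_i
    Psi = np.diag(np.diag(S) - h)                     # unique variances Sigma_ii - h_i
    return L, Psi

S = np.cov(X, rowvar=False)       # estimate Sigma from the simulated data above
L_hat, Psi_hat = fa_pca(S, r=2)
```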
Properties
- The solution is not unique
- Estimation can be obtained from the correlation matrix $R$ instead
Estimation via MLE
Assume that $Z \sim N_r(0, I_r)$ and $\varepsilon \sim N_p(0, \Psi)$ then $X \sim N_p(0, LL^T + \Psi)$
$$ \begin{aligned} X &= LZ + \varepsilon \\ E[X] &= L E[Z] + E[\varepsilon] = 0 \\ var(X) &= L \, var(Z) L^T + var(\varepsilon) \\ &= L I_r L^T + \Psi \\ &= LL^T + \Psi \end{aligned} $$
Suppose we observe $x_1 \dots x_n \sim N_p(0, LL^T + \Psi)$.
Log-likelihood is $\ell(x_1 \dots x_n; L, \Psi) = -\frac{n}{2} [\log \det (LL^T + \Psi) + tr((LL^T + \Psi)^{-1} S)] + \dots$
$$ \begin{aligned} L(x_1 \dots x_n; L, \Psi) &= \prod_{i=1}^n f(x_i) = \prod_{i=1}^n \frac{1}{(2\pi)^{p/2} (\det(LL^T+\Psi))^{1/2}} e^{-\frac{1}{2}x_i^T(LL^T+\Psi)^{-1}x_i} \\ \ell(x_1 \dots x_n; L, \Psi) &= \log L(x_1 \dots x_n; L, \Psi) \\ &= -\frac{np}{2} \log(2\pi) - \frac{n}{2} \log \det(LL^T+\Psi) - \frac{1}{2} \sum_{i=1}^n x_i^T (LL^T+\Psi)^{-1}x_i \\ \sum_{i=1}^n x_i^T (LL^T+\Psi)^{-1}x_i &= \sum_{i=1}^n tr((LL^T+\Psi)^{-1} x_i x_i^T) = n \, tr((LL^T+\Psi)^{-1} S) \end{aligned} $$
where $S = \frac{1}{n} \sum_{i=1}^n x_i x_i^T$. The optimal $L$ and $\Psi$ are then solutions to
$$ \underset{L \in \mathbb{R}^{p \times r}, \; \Psi \text{ diagonal}}{\text{minimize}} \; \log \det (LL^T + \Psi) + tr((LL^T + \Psi)^{-1}S) $$
This can be solved with an alternating algorithm (e.g. EM), as sketched below.
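A sketch of the classical EM updates for this objective, written in terms of the sample covariance $S = \frac{1}{n}\sum_i x_i x_i^T$ (the posterior-moment formulas are the standard ones for the Gaussian factor model; function and variable names are mine):

```python
import numpy as np

def fa_em(S, r, n_iter=500):
    """EM for the Gaussian factor model: decreases
    log det(L L^T + Psi) + tr((L L^T + Psi)^{-1} S)."""
    p = S.shape[0]
    rng = np.random.default_rng(0)
    L = rng.standard_normal((p, r)) * 0.1            # arbitrary starting point
    psi = np.diag(S).copy()                          # start with Psi = diag(S)
    for _ in range(n_iter):
        # E-step: posterior moments of Z given X under the current (L, Psi)
        iPsi_L = L / psi[:, None]                    # Psi^{-1} L  (Psi is diagonal)
        G = np.linalg.inv(np.eye(r) + L.T @ iPsi_L)  # var(Z | X)
        B = G @ iPsi_L.T                             # E[Z | X = x] = B x
        # M-step: closed-form updates expressed through S
        SBt = S @ B.T                                # (1/n) sum_i x_i E[z_i]^T
        EZZ = G + B @ SBt                            # (1/n) sum_i E[z_i z_i^T]
        L = SBt @ np.linalg.inv(EZZ)
        psi = np.diag(S - L @ B @ S)                 # keep only the diagonal of Psi
    return L, np.diag(psi)
```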
Variance explained
$$ \begin{aligned} \Sigma_{11} &= \ell_{11}^2 + \dots + \ell_{1r}^2 + \Psi_1 \\ \Sigma_{22} &= \ell_{21}^2 + \dots + \ell_{2r}^2 + \Psi_2 \\ &\vdots \\ \Sigma_{pp} &= \ell_{p1}^2 + \dots + \ell_{pr}^2 + \Psi_p \end{aligned} $$
Total variance (each $var(X_i) = \Sigma_{ii}$):
$$ \sum_{i=1}^p \Sigma_{ii} = tr(\Sigma) $$
Communality of $X_i$:
$$ h_i = \sum_{j=1}^r \ell_{ij}^2 $$
Contribution to the total variance by the $j$-th factor:
$$ \sum_{i=1}^p \ell_{ij}^2 $$
and by all $r$ factors:
$$ \sum_{j=1}^r \sum_{i=1}^p \ell_{ij}^2 = \sum_{i=1}^p h_i $$
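Continuing with `L_hat` and `S` from the PCA sketch above, these quantities take a few lines:

```python
import numpy as np

h = np.sum(L_hat**2, axis=1)           # communalities h_i
per_factor = np.sum(L_hat**2, axis=0)  # contribution of each factor j
total = np.trace(S)                    # total variance tr(Sigma)
print(per_factor / total)              # proportion of variance explained per factor
print(h.sum() / total)                 # proportion explained by all r factors
```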
Selecting number of factors
- Pick the maximum $r$ such that $\lambda_1, \dots, \lambda_r > 0$ (see the sketch below)
- Use the PCA scree plot
- Evaluate how well we approximate $\Sigma$, e.g. $\|\Sigma - (LL^T+\Psi)\|_F^2$, or use a goodness-of-fit test
- Use the proportion of variance explained by each factor and construct a scree plot
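The eigenvalue-based heuristics amount to a few lines (again with `S` from above):

```python
import numpy as np

lam = np.linalg.eigvalsh(S)[::-1]      # eigenvalues of S, descending
print(np.sum(lam > 1e-10))             # maximal r with (numerically) positive eigenvalues
print(np.cumsum(lam) / lam.sum())      # cumulative proportion of variance (scree plot values)
```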
Identifiability
The factor model is not identifiable:
$$ \begin{aligned} \text{Let } Q &\text{ be orthogonal } (QQ^T=I) \\ X &= LZ + \varepsilon \\ &= L Q Q^T Z + \varepsilon \\ &= (LQ) (Q^T Z) + \varepsilon \\ &= L^* Z^* + \varepsilon \\ Var(Z^*) &= Q^T Var(Z) Q = Q^T I Q = I \end{aligned} $$
so we need an additional constraint.
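A quick numerical check of this invariance (a random orthogonal $Q$ is generated via QR):

```python
import numpy as np

rng = np.random.default_rng(1)
p, r = 6, 2
L = rng.standard_normal((p, r))
Q, _ = np.linalg.qr(rng.standard_normal((r, r)))   # random orthogonal Q
print(np.allclose(L @ L.T, (L @ Q) @ (L @ Q).T))   # True: L and LQ give the same Sigma
```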
Diagonalization: find $Q$ such that $\tilde{L}^T \Psi^{-1} \tilde{L}$ is diagonal, where $\tilde{L} = LQ$:
$$ \tilde{L}^T \Psi^{-1} \tilde{L} = \Delta \quad (\text{diagonal matrix}) $$
Varimax method: find a transformation $Q$ that makes $\tilde{L}$ “sparse”, i.e. many values close to 0 and a few large loadings.
Varimax Criterion
- Compute $\tilde{L} = LQ$
- Compute communalities $\tilde{h}_i = \sum_{j=1}^r \tilde{\ell}_{ij}^2$
- Compute scaled squared loadings $PVE_{ij} = \frac{\tilde{\ell}_{ij}^2}{\tilde{h}_i}$
$$ \begin{aligned} V &= \frac{1}{p} \sum_{j=1}^r \left( \sum_{i=1}^p PVE_{ij}^2 - \frac{1}{p} \left( \sum_{i=1}^p PVE_{ij} \right)^2 \right) \\ &\propto \sum_{j=1}^r \left( \substack{\text{Variance of PVE} \\ \text{for } j\text{th factor}} \right) \end{aligned} $$
Varimax picks the $Q$ that maximizes $V$.
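A sketch of the classical varimax iteration in its raw (unnormalized) form; the criterion above scales by communalities, which would correspond to first dividing each row of $L$ by $\sqrt{h_i}$ and rescaling back after rotation:

```python
import numpy as np

def varimax(L, n_iter=100, tol=1e-8):
    """Rotate loadings L by an orthogonal Q to maximize the varimax criterion."""
    p, r = L.shape
    Q = np.eye(r)
    crit_old = 0.0
    for _ in range(n_iter):
        LQ = L @ Q
        # standard varimax update: SVD of the criterion's gradient
        U, s, Vt = np.linalg.svd(
            L.T @ (LQ**3 - LQ @ np.diag(np.sum(LQ**2, axis=0)) / p)
        )
        Q = U @ Vt
        crit_new = s.sum()
        if crit_new - crit_old < tol:
            break
        crit_old = crit_new
    return L @ Q, Q
```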
Scores
Our fitted (MLE) model is $X \simeq LZ + \varepsilon$ where $Z \sim N_r(0, I)$, $\varepsilon \sim N_p(0, \Psi)$.
Given $x_1 \dots x_n \in \mathbb{R}^p$, find scores $z_1 \dots z_n \in \mathbb{R}^r$ that follow the model.
- We can use weighted least squares regression of $x_i$ on $L$ (Bartlett): $\hat{z}_i = (L^T \Psi^{-1} L)^{-1} L^T \Psi^{-1} x_i$ (see the sketch after this list)
- Comment: if we fit FA via PCA, we can use unweighted regression, i.e. set $\Psi = I$ in the formula above
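A sketch of both estimators (`bartlett_scores` is my name; with `Psi = I` it reduces to the unweighted version):

```python
import numpy as np

def bartlett_scores(X, L, Psi):
    """WLS factor scores: z_i = (L^T Psi^{-1} L)^{-1} L^T Psi^{-1} x_i."""
    iPsi_L = L / np.diag(Psi)[:, None]            # Psi^{-1} L  (Psi is diagonal)
    W = np.linalg.solve(L.T @ iPsi_L, iPsi_L.T)   # (L^T Psi^{-1} L)^{-1} L^T Psi^{-1}
    return X @ W.T                                # one r-dimensional score per row of X

# Unweighted version (Psi = I), appropriate for the PCA fit from above:
Z_hat = bartlett_scores(X, L_hat, np.eye(X.shape[1]))
```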