Factor Analysis Model

Factor analysis assumes that the “observed” random vector $X = \begin{pmatrix} X_1 \\ \vdots \\ X_p \end{pmatrix}$ is generated from the “latent” random vector $Z = \begin{pmatrix} Z_1 \\ \vdots \\ Z_r \end{pmatrix}$ via

$$ \begin{aligned} X_1 - \mu_1 &= \ell_{11} Z_1 + \dots + \ell_{1r} Z_r + \varepsilon_1 \\ \vdots \quad & \qquad \qquad \vdots \\ X_p - \mu_p &= \ell_{p1} Z_1 + \dots + \ell_{pr} Z_r + \varepsilon_p \end{aligned} \quad = \quad \underset{\substack{\text{common} \\ \text{part}}}{\boxed{\begin{matrix} C_1 \\ \vdots \\ C_p \end{matrix}}} + \underset{\substack{\text{unique} \\ \text{part}}}{\boxed{\begin{matrix} \varepsilon_1 \\ \vdots \\ \varepsilon_p \end{matrix}}} $$

Comment: we assume that $\mu = \begin{pmatrix} \mu_1 \\ \vdots \\ \mu_p \end{pmatrix} = 0$; otherwise set $X := X - \mu$.

Example

  • $X_1, \dots, X_p$ are test scores on various topics
  • $Z_1$ is verbal intelligence (VI)
  • $Z_2$ is mathematical intelligence (MI)
$$ \underset{\substack{\text{Score} \\ \text{on ith} \\ \text{test}}}{X_i} = \color{orange}{\ell_{i1}} \underset{\color{green}{\text{VI}}}{Z_1} + \color{orange}{\ell_{i2}} \underset{\color{green}{\text{MI}}}{Z_2} + \underset{\substack{\text{Variation} \\ \text{in the score} \\ \text{across students}}}{\varepsilon_i} $$

Graphical representation of FA

  • LATENT FACTORS: $Z_1, \dots, Z_r$
  • LOADINGS: $\ell_{ij}$
  • OBSERVED (DATA): $X_1, \dots, X_p$
  • ERRORS: $\varepsilon_1, \dots, \varepsilon_p$
$$ \begin{aligned} X_1 &= \ell_{11} Z_1 + \dots + \ell_{1r} Z_r + \varepsilon_1 \\ \vdots \\ X_p &= \ell_{p1} Z_1 + \dots + \ell_{pr} Z_r + \varepsilon_p \end{aligned} $$
$$ X = \overset{C}{\boxed{\color{orange}{L} \color{green}{Z}}} + \varepsilon $$
$$ X = \begin{pmatrix} X_1 \\ \vdots \\ X_p \end{pmatrix} \quad C = \begin{pmatrix} C_1 \\ \vdots \\ C_p \end{pmatrix} \quad Z = \begin{pmatrix} \color{green}{Z_1} \\ \vdots \\ \color{green}{Z_r} \end{pmatrix} $$
$$ \varepsilon = \begin{pmatrix} \varepsilon_1 \\ \vdots \\ \varepsilon_p \end{pmatrix} \quad L = \begin{pmatrix} \color{orange}{\ell_{11}} & \dots & \color{orange}{\ell_{1r}} \\ \vdots & \ddots & \vdots \\ \color{orange}{\ell_{p1}} & \dots & \color{orange}{\ell_{pr}} \end{pmatrix} $$

Orthogonal Factor model

$$ X = L Z + \varepsilon $$
  • $E(Z) = 0, \qquad var(Z) = I_r$: the latent factors $Z_1, \dots, Z_r$ are uncorrelated

  • $E(\varepsilon) = 0, \qquad var(\varepsilon) = \Psi = diag(\Psi_1, \dots, \Psi_p)$: the errors $\varepsilon_1, \dots, \varepsilon_p$ are uncorrelated

  • $cov(Z, \varepsilon) = 0$: each $Z_i$ is uncorrelated with each $\varepsilon_j$

  • $var(X) = \Sigma = LL^T + \Psi$

$$ \begin{aligned} \Sigma &= E[(X-\mu)(X-\mu)^T] \\ &= E[(LZ+\varepsilon)(LZ+\varepsilon)^T] \\ &= E[LZ Z^T L^T + LZ\varepsilon^T + \varepsilon Z^T L^T + \varepsilon \varepsilon^T] \\ &= L \underbrace{E[ZZ^T]}_{I} L^T + L \underbrace{E[Z\varepsilon^T]}_{0} + \underbrace{E[\varepsilon Z^T]}_{0} L^T + \underbrace{E[\varepsilon \varepsilon^T]}_{\Psi} \\ &= LL^T + \Psi \end{aligned} $$
  • $cov(X, Z) = L$
$$ \begin{aligned} cov(X, Z) &= E[(X-\mu)Z^T] \\ &= E[(LZ+\varepsilon)Z^T] \\ &= L \underbrace{E[ZZ^T]}_{I} + \underbrace{E[\varepsilon Z^T]}_{0} \\ &= L \end{aligned} $$
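
These two identities are easy to check numerically. Below is a minimal Monte Carlo sketch in Python; the values of $p$, $r$, $L$, and $\Psi$ are arbitrary assumptions for illustration, not from the notes.

```python
import numpy as np

rng = np.random.default_rng(0)
p, r, n = 5, 2, 200_000                        # assumed dimensions and sample size

L = rng.normal(size=(p, r))                    # loadings
Psi = np.diag(rng.uniform(0.5, 1.5, size=p))   # diagonal error variances

Z = rng.normal(size=(n, r))                              # Z ~ N(0, I_r)
eps = rng.normal(size=(n, p)) * np.sqrt(np.diag(Psi))    # eps ~ N(0, Psi)
X = Z @ L.T + eps                              # rows of X follow X = L Z + eps

Sigma_hat = np.cov(X, rowvar=False)
print(np.allclose(Sigma_hat, L @ L.T + Psi, atol=0.05))  # var(X) ≈ L L^T + Psi
print(np.allclose(X.T @ Z / n, L, atol=0.05))            # cov(X, Z) ≈ L
```

Both checks hold up to Monte Carlo error for large $n$.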

Unknown parameters: $L, \Psi$

If we know $\Sigma$, the goal would be to decompose it as $\Sigma = LL^T + \Psi$

Estimation (via PCA)

  • If $\Psi$ is known, then $L$ can be found via eigendecomposition of $\Sigma^* = \Sigma - \Psi$
$$ \begin{aligned} \Sigma^* &= V \Lambda V^T \quad (\text{Spectral Decomposition}) \\ &\approx V_r \Lambda_r V_r^T \\ &= (V_r \Lambda_r^{1/2}) (V_r \Lambda_r^{1/2})^T \\ \Rightarrow L &= V_r \Lambda_r^{1/2} = (\sqrt{\lambda_1}v_1, \dots, \sqrt{\lambda_r}v_r) \end{aligned} $$
  • If we know $L$, then $\Psi = \begin{pmatrix} \Sigma_{11}-h_1 & & \\ & \ddots & \\ & & \Sigma_{pp}-h_p \end{pmatrix}$ where $h_i = \sum_{j=1}^r \ell_{ij}^2$ is the communality for $X_i$
$$ \begin{aligned} \Sigma &= LL^T + \Psi \\ \text{diag}(\Sigma) &= \text{diag}(LL^T) + \text{diag}(\Psi) \\ \Sigma_{ii} &= (LL^T)_{ii} + \Psi_{ii} \\ \Sigma_{ii} &= \sum_{j=1}^r \ell_{ij}^2 + \Psi_{ii} \\ \Rightarrow \Psi_{ii} &= \Sigma_{ii} - h_i \end{aligned} $$

Algorithm

Input: $\Sigma \in \mathbb{R}^{p \times p}$, rank $r$

Step 1: $\Sigma = V \Lambda V^T \Rightarrow L = (\sqrt{\lambda_1}v_1, \dots, \sqrt{\lambda_r}v_r)$

Step 2: $h_i = \sum_{j=1}^r \ell_{ij}^2 \Rightarrow \Psi = \text{diag}(\Sigma_{11}-h_1, \dots, \Sigma_{pp}-h_p)$

Output: loadings $L$, variances $\Psi$.


In practice:

  • we estimate $\Sigma$ from data by the sample covariance matrix $S$
  • we do not know $\Psi$, so at Step 1 we start with $\Psi = 0$ (i.e. $\Sigma^* = \Sigma$) and then update $\Psi$ at Step 2
  • $\Sigma^*$ is not guaranteed to be PSD or to have rank $r$, so Step 1 only gives an approximation $\Sigma^* \approx LL^T$

Properties

  • The solution is not unique
$$ \begin{aligned} \text{Let } Q &\text{ be an orthogonal matrix } (QQ^T = I) \\ \Sigma &= L L^T + \Psi \\ &= L I L^T + \Psi \\ &= L (Q Q^T) L^T + \Psi \\ &= (LQ) (LQ)^T + \Psi \\ \Rightarrow L^* &= LQ \quad (\text{New valid loadings}) \end{aligned} $$
  • Estimation can be obtained from the correlation matrix $R$ instead
$$ \begin{aligned} R &= D^{-1/2} \Sigma D^{-1/2} \quad \text{where } D = \text{diag}(\Sigma_{11}, \dots, \Sigma_{pp}) \\ R &= D^{-1/2} (LL^T + \Psi) D^{-1/2} \\ R &= (D^{-1/2}L)(D^{-1/2}L)^T + D^{-1/2}\Psi D^{-1/2} \\ \Rightarrow L_R &= D^{-1/2}L \\ \Rightarrow \Psi_R &= D^{-1}\Psi \end{aligned} $$
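
Both properties are easy to check numerically, reusing `fa_via_pca`, `S`, `L_hat`, and `Psi_hat` from the sketch above (the random rotation $Q$ is an arbitrary assumption):

```python
import numpy as np

# Non-uniqueness: an orthogonal rotation Q of the loadings leaves L L^T (and hence Sigma) unchanged.
Q, _ = np.linalg.qr(np.random.default_rng(2).normal(size=(2, 2)))
print(np.allclose(L_hat @ L_hat.T, (L_hat @ Q) @ (L_hat @ Q).T))   # True

# Estimation from the correlation matrix: standardize S, then run the same algorithm.
D_inv_sqrt = np.diag(1.0 / np.sqrt(np.diag(S)))
R = D_inv_sqrt @ S @ D_inv_sqrt
L_R, Psi_R = fa_via_pca(R, r=2)
```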

Estimation via MLE

Assume that $Z \sim N_r(0, I_r)$ and $\varepsilon \sim N_p(0, \Psi)$; then $X \sim N_p(0, LL^T + \Psi)$:

$$ \begin{aligned} X &= LZ + \varepsilon \\ E[X] &= L E[Z] + E[\varepsilon] = 0 \\ Var(X) &= L Var(Z) L^T + Var(\varepsilon) \\ &= L I_r L^T + \Psi \\ &= LL^T + \Psi \end{aligned} $$

Suppose we observe $x_1 \dots x_n \sim N_p(0, LL^T + \Psi)$

The log-likelihood is $\ell(x_1, \dots, x_n; L, \Psi) = -\frac{n}{2} [\log \det (LL^T + \Psi) + tr((LL^T + \Psi)^{-1} S)] + \dots$

$$ \begin{aligned} L(x_1, \dots, x_n; L, \Psi) &= \prod_{i=1}^n f(x_i) = \prod_{i=1}^n \frac{1}{(2\pi)^{p/2} (\det(LL^T+\Psi))^{1/2}} e^{-\frac{1}{2}x_i^T(LL^T+\Psi)^{-1}x_i} \\ \ell(x_1, \dots, x_n; L, \Psi) &= \log L(x_1, \dots, x_n; L, \Psi) \\ &= -\frac{np}{2} \log(2\pi) - \frac{n}{2} \log \det(LL^T+\Psi) - \frac{1}{2} \sum_{i=1}^n x_i^T (LL^T+\Psi)^{-1}x_i \\ \sum_{i=1}^n x_i^T (LL^T+\Psi)^{-1}x_i &= \sum_{i=1}^n tr((LL^T+\Psi)^{-1} x_i x_i^T) = n \, tr((LL^T+\Psi)^{-1} S) \end{aligned} $$

Then the optimal $L$ and $\Psi$ are solutions to $\underset{L \in \mathbb{R}^{p \times r}, \, \Psi \text{ diagonal}}{\text{minimize}} \;\; \log \det (LL^T + \Psi) + tr((LL^T + \Psi)^{-1}S)$

This can be solved via an alternating algorithm (e.g. EM).
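
In practice this Gaussian MLE can be fit with, for example, scikit-learn's `FactorAnalysis`, which maximizes the same likelihood with an EM-style procedure (the data below is a placeholder):

```python
import numpy as np
from sklearn.decomposition import FactorAnalysis

X = np.random.default_rng(3).normal(size=(500, 6))   # placeholder n x p data (assumed centered)
fa = FactorAnalysis(n_components=2).fit(X)

L_mle = fa.components_.T        # p x r loading matrix L
Psi_mle = fa.noise_variance_    # diagonal entries of Psi (length p)
```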

Variance explained

$$ \begin{aligned} \Sigma_{11} &= \ell_{11}^2 + \dots + \ell_{1r}^2 + \Psi_1 \\ \Sigma_{22} &= \ell_{21}^2 + \dots + \ell_{2r}^2 + \Psi_2 \\ &\vdots \\ \Sigma_{pp} &= \ell_{p1}^2 + \dots + \ell_{pr}^2 + \Psi_p \end{aligned} $$
  • Total variance

    $$\sum_{i=1}^p \Sigma_{ii} = tr(\Sigma)$$
  • Variance of each variable: $var(X_i) = \Sigma_{ii}$

  • Communality of each $X_i$

    $$h_i = \sum_{j=1}^r \ell_{ij}^2$$
  • Contribution to total variance by $j$-th factor

    $$\sum_{i=1}^p \ell_{ij}^2$$
  • and by all factors

    $$\sum_{j=1}^r \sum_{i=1}^p \ell_{ij}^2 = \sum_{i=1}^p h_i$$
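
This bookkeeping is straightforward to compute from an estimated loading matrix; a short sketch reusing `L_hat` and `S` from the PCA-based fit above:

```python
import numpy as np

h = (L_hat ** 2).sum(axis=1)            # communality h_i of each variable
total_var = np.trace(S)                 # total variance sum_i Sigma_ii
by_factor = (L_hat ** 2).sum(axis=0)    # contribution of the j-th factor: sum_i l_ij^2
print(by_factor / total_var)            # proportion of total variance per factor
print(h.sum() / total_var)              # proportion explained by all factors together
```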

Selecting number of factors

  1. Pick the maximum $r$ such that $\lambda_1, \dots, \lambda_r > 0$

  2. Use a PCA scree plot

  3. Evaluate how well we approximate $\Sigma$, e.g. via $\|\Sigma - (LL^T+\Psi)\|_F^2$, or use a goodness-of-fit test.

  4. Use the proportion of variance explained by each factor and construct a scree plot (see the sketch after this list).
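
A rough sketch of criteria 1, 2 and 4 using the eigenvalues of `S` from the earlier fit (matplotlib is assumed for the scree plot):

```python
import numpy as np
import matplotlib.pyplot as plt

lam = np.linalg.eigvalsh(S)[::-1]       # eigenvalues of S, descending
r_max = int((lam > 0).sum())            # criterion 1: largest r with positive eigenvalues
pve = lam / lam.sum()                   # criterion 4: proportion of variance per component

plt.plot(np.arange(1, len(lam) + 1), lam, "o-")   # criterion 2: scree plot
plt.xlabel("component")
plt.ylabel("eigenvalue")
plt.show()
```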


Identifiability

The factor model is not identifiable:

$$ \begin{aligned} \text{Let } Q &\text{ be orthogonal } (QQ^T=I) \\ X &= LZ + \varepsilon \\ &= L Q Q^T Z + \varepsilon \\ &= (LQ) (Q^T Z) + \varepsilon \\ &= L^* Z^* + \varepsilon \\ Var(Z^*) &= Q^T Var(Z) Q = Q^T I Q = I \end{aligned} $$

so we need an additional constraint.

Diagonalization: find $Q$ such that the rotated loadings $\tilde{L} = LQ$ satisfy

$$ \tilde{L}^T \Psi^{-1} \tilde{L} = \Delta \quad (\text{Diagonal Matrix}) $$

Varimax method: find an orthogonal transformation $Q$ that makes $\tilde{L} = LQ$ “sparse”, i.e. many loadings close to 0 and a few large loadings.


Varimax Criterion

Varimax chooses the rotation $Q$ maximizing the criterion $V$ below. For a candidate $Q$:

  • Compute $\tilde{L} = LQ$
  • Compute communalities $\tilde{h}_i = \sum_{j=1}^r \tilde{\ell}_{ij}^2$
  • Compute scaled squared loadings $PVE_{ij} = \frac{\tilde{\ell}_{ij}^2}{\tilde{h}_i}$

$$ \begin{aligned} V &= \frac{1}{p} \sum_{j=1}^r \left( \sum_{i=1}^p PVE_{ij}^2 - \frac{1}{p} \left( \sum_{i=1}^p PVE_{ij} \right)^2 \right) \\ &\propto \sum_{j=1}^r \left( \substack{\text{Variance of PVE} \\ \text{for } j\text{th factor}} \right) \end{aligned} $$
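
A compact sketch of the standard varimax iteration (SVD-based updates of $Q$); note that, unlike the criterion above, this raw version skips the normalization by communalities for brevity:

```python
import numpy as np

def varimax(L, tol=1e-6, max_iter=100):
    """Rotate loadings L (p x r) towards a varimax solution; returns (L @ Q, Q)."""
    p, r = L.shape
    Q = np.eye(r)
    d_old = 0.0
    for _ in range(max_iter):
        Lr = L @ Q
        # SVD-based update of Q derived from the (raw) varimax objective
        B = L.T @ (Lr ** 3 - Lr @ np.diag((Lr ** 2).sum(axis=0)) / p)
        U, s, Vt = np.linalg.svd(B)
        Q = U @ Vt
        d_new = s.sum()
        if d_new < d_old * (1 + tol):   # objective stopped increasing
            break
        d_old = d_new
    return L @ Q, Q

L_rot, Q_rot = varimax(L_hat)           # rotated version of the earlier PCA-based loadings
```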

Scores

Our fitted (MLE) model is $X \simeq LZ + \varepsilon$ where $Z \sim N_r(0, I)$, $\varepsilon \sim N_p(0, \Psi)$

Given $x_1, \dots, x_n \in \mathbb{R}^p$, find scores $z_1, \dots, z_n \in \mathbb{R}^r$ that follow the model.

  • We can use weighted regression (Bartlett)
$$ \begin{aligned} \hat{z} &= \underset{z}{\text{argmin }} (x - Lz)^T \Psi^{-1} (x - Lz) \\ \frac{\partial}{\partial z} (\dots) &= -2L^T \Psi^{-1} (x - Lz) = 0 \\ L^T \Psi^{-1} L z &= L^T \Psi^{-1} x \\ \Rightarrow \hat{z} &= (L^T \Psi^{-1} L)^{-1} L^T \Psi^{-1} x \end{aligned} $$
  • Comment: if we fit FA via PCA we can use unweighted regression ($\Psi = I$)
$$ \begin{aligned} \hat{z} &= \underset{z}{\text{argmin }} ||x - Lz||^2 \\ \Rightarrow \hat{z} &= (L^T L)^{-1} L^T x \end{aligned} $$
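
Both score formulas can be applied to a whole (centered) data matrix at once; a sketch reusing `X`, `L_hat`, and `Psi_hat` from the earlier sketches:

```python
import numpy as np

Psi_inv = np.diag(1.0 / np.diag(Psi_hat))

# Bartlett (weighted) scores: z_hat = (L^T Psi^{-1} L)^{-1} L^T Psi^{-1} x, one row per observation
Z_bartlett = X @ Psi_inv @ L_hat @ np.linalg.inv(L_hat.T @ Psi_inv @ L_hat)

# Unweighted regression scores (Psi = I): z_hat = (L^T L)^{-1} L^T x
Z_ols = X @ L_hat @ np.linalg.inv(L_hat.T @ L_hat)
```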