OVERVIEW: The PCA Workflow: From $X$ to $Z$
Principal component analysis (PCA) assumes that the data manifold is an $r$-dimensional affine subspace (a plane), where such a plane is defined by a point $a$ and an orthogonal basis $v_1, \dots, v_r$.
This outlines the complete process of Principal Component Analysis, from the raw data $X$ to the new, low-dimensional data $Z$.
1. Preparation Stage
Goal: To calculate the sample covariance matrix $S$.
Variables Used:
- $X$: The $n \times p$ raw data matrix (n samples, p features).
- $n$: Number of samples.
- $p$: Number of original features (dimensions).
- $\bar{x}$: The $p \times 1$ sample mean vector.
- $S$: The $p \times p$ sample covariance matrix.
Formulas & Actions:
Step 1.1: Compute the Mean Vector $\bar{x}$ First, we compute the mean of each feature:
$$ \bar{x} = \frac{1}{n} \sum_{i=1}^n x_i $$
Step 1.2: Center the Data. This is a critical step. We subtract the mean vector from every sample.
$$ X_{\text{centered}} \leftarrow X - \mathbf{1} \bar{x}^T $$(where $\mathbf{1}$ is an $n \times 1$ vector of ones. From now on, we assume $X$ refers to this centered matrix). After this step, the mean of $X$ is the zero vector.
Step 1.3: Compute the Covariance Matrix $S$ Because $X$ is now centered ($\bar{x}=0$), we can use the simplified formula:
$$ S = \frac{1}{n} X^T X $$(Note: Some implementations use $1/(n-1)$ for an unbiased estimate, but $1/n$ is the MLE and is common for PCA).
2. Core Calculation Stage
Goal: To find the principal components (directions of maximum variance).
Variables Used:
- $S$: The $p \times p$ covariance matrix from Stage 1.
- $v$: A $p \times 1$ vector, representing an eigenvector (a principal component).
- $\lambda$: A scalar, representing an eigenvalue (the variance along that component).
Formulas & Actions:
Step 2.1: Find Eigenvectors and Eigenvalues We solve the “eigen-decomposition” equation for the covariance matrix $S$:
$$ S v = \lambda v $$- This is not a simple calculation; it’s a matrix decomposition.
- The solution will yield $p$ eigenvectors ($v_1, \dots, v_p$) and $p$ corresponding eigenvalues ($\lambda_1, \dots, \lambda_p$).
Step 2.2: Sort the Components This is the key to PCA. We sort the eigenvectors $v$ in descending order based on their corresponding eigenvalues $\lambda$.
$$ (\lambda_1, v_1), (\lambda_2, v_2), \dots, (\lambda_p, v_p) $$…such that $\lambda_1 \ge \lambda_2 \ge \dots \ge \lambda_p$.
- $v_1$ is now the First Principal Component (PC1). It is the direction of maximum variance ($\lambda_1$).
- $v_2$ is the Second Principal Component (PC2), and so on.
3. Final Result Stage (Projection)
Goal: To create the new, low-dimensional data $Z$ by projecting $X$ onto the chosen components.
Variables Used:
- $X$: The $n \times p$ centered data from Stage 1.
- $v_1, \dots, v_p$: The sorted principal components from Stage 2.
- $r$: The new number of dimensions we want (where $r < p$).
- $W$: The $p \times r$ projection matrix.
- $Z$: The $n \times r$ final, new data (our desired result).
Formulas & Actions:
Step 3.1: Choose the New Dimensionality $r$
- How to choose? We look at the eigenvalues. The “Proportion of Variance Explained” by the first $r$ components is: $$ \text{PVE}_r = \frac{\lambda_1 + \lambda_2 + \dots + \lambda_r}{\lambda_1 + \lambda_2 + \dots + \lambda_p} $$
- We typically choose $r$ such that $\text{PVE}_r \ge 95\%$ (or some other high threshold).
Step 3.2: Construct the Projection Matrix $W$ We create $W$ by stacking the first $r$ principal components (our sorted $v$ vectors) side-by-side as columns:
$$ W = \begin{bmatrix} | & | & & | \\ v_1 & v_2 & \dots & v_r \\ | & | & & | \end{bmatrix} $$
Step 3.3: Project $X$ to get $Z$. This is the final transformation. We multiply our $n \times p$ data $X$ by our $p \times r$ projection matrix $W$:
$$ Z = X W $$- Result: $Z$ is our new $n \times r$ data matrix.
- The first column of $Z$ is the set of 1D coordinates for all $n$ samples on PC1.
- The second column of $Z$ is the set of 1D coordinates for all $n$ samples on PC2, and so on.
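The whole workflow fits in a few lines of NumPy. Below is a minimal sketch of the three stages on synthetic data (the shapes, seed, and variable names are illustrative only):

```python
import numpy as np

rng = np.random.default_rng(0)
X_raw = rng.normal(size=(100, 5))      # n = 100 samples, p = 5 features (toy data)

# Stage 1: center the data and form the covariance matrix S (1/n convention)
x_bar = X_raw.mean(axis=0)             # p-dimensional mean vector
X = X_raw - x_bar                      # centered data ("X" from now on)
S = X.T @ X / X.shape[0]               # p x p sample covariance matrix

# Stage 2: eigen-decompose S and sort components by decreasing eigenvalue
eigvals, eigvecs = np.linalg.eigh(S)   # eigh: S is symmetric; eigenvalues come out ascending
order = np.argsort(eigvals)[::-1]
lambdas = eigvals[order]               # lambda_1 >= ... >= lambda_p
V = eigvecs[:, order]                  # columns v_1, ..., v_p (principal directions)

# Stage 3: keep r components and project
r = 2
W = V[:, :r]                           # p x r projection matrix
Z = X @ W                              # n x r scores (the new low-dimensional data)
print(Z.shape, lambdas[:r])
```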
Detailed Theories
PCA: Maximum Variance View
Assume that the $n \times p$ data matrix $X$ is column-centered:
$$ X = \begin{pmatrix} x_1^T \\ \vdots \\ x_n^T \end{pmatrix} \in \mathbb{R}^{n \times p} $$This means the sample mean $\bar{x}$ is the zero vector:
$$ \bar{x} = \frac{1}{n} \sum_{i=1}^n x_i = 0 $$With this assumption, the sample covariance matrix $S$ (using $1/n$ as the denominator) simplifies to:
$$ S = \frac{1}{n} X^T X $$
Note: Why does $S = \frac{1}{n} X^T X$?
This formula is a simplified special case. It holds only if the data matrix $X$ has been column-centered, meaning the sample mean $\bar{x}$ is the zero vector ($\bar{x}=0$).
The derivation is as follows:
The general definition of the sample covariance matrix (with a $1/n$ denominator) is:
$$ S = \frac{1}{n} \sum_{i=1}^n (x_i - \bar{x})(x_i - \bar{x})^T $$We substitute our assumption that $\bar{x} = 0$:
$$ S = \frac{1}{n} \sum_{i=1}^n (x_i - 0)(x_i - 0)^T = \frac{1}{n} \sum_{i=1}^n x_i x_i^T $$The “sum of outer products” $\sum_{i=1}^n x_i x_i^T$ is precisely the definition of the matrix multiplication $X^T X$, where $X^T$ is the $p \times n$ matrix whose columns are $x_i$:
$$ X^T X = \begin{pmatrix} | & & | \\ x_1 & \dots & x_n \\ | & & | \end{pmatrix} \begin{pmatrix} \text{---} & x_1^T & \text{---} \\ & \vdots & \\ \text{---} & x_n^T & \text{---} \end{pmatrix} = \sum_{i=1}^n x_i x_i^T $$Therefore, the formula simplifies to:
$$ S = \frac{1}{n} X^T X $$
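As a quick numerical check of this identity (a sketch with NumPy; the toy data is arbitrary), the sum of outer products and the matrix product give the same $S$:

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(50, 3))
X = X - X.mean(axis=0)                                   # column-centering, so x_bar = 0

S_matrix = X.T @ X / X.shape[0]                          # (1/n) X^T X
S_outer = sum(np.outer(x, x) for x in X) / X.shape[0]    # (1/n) sum_i x_i x_i^T
print(np.allclose(S_matrix, S_outer))                    # True: the two formulas agree
```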
Projection from p-dimensions to 1-dimension
We want to find a 1-dimensional subspace (a line through the origin), defined by a unit direction vector $v$, onto which we project the data.
- $x_i$ is the original data point (a $p \times 1$ vector).
- $\hat{x}_i$ is the projection of $x_i$ onto the line defined by $v$.
- $z_i$ is the new coordinate of that projected point on the 1-D line (a scalar).
The relationship is:
$$ \hat{x}_i = z_i \cdot v $$The new coordinate $z_i$ is computed by the dot product of $x_i$ and $v$:
$$ z_i = \langle x_i, v \rangle = x_i^T v $$
Projection Operator
The $p \times p$ projection matrix $P$ is $P = vv^T$. The projection $\hat{x}_i$ is:
$$ \hat{x}_i = P x_i = (v v^T) x_i = v (v^T x_i) = v \langle x_i, v \rangle = z_i v $$(This is consistent with the note’s $z_i \cdot v$)
The New 1D Coordinate Vector $z$
We can stack all $n$ new coordinates into an $n \times 1$ vector $z$:
$$ z = \begin{pmatrix} z_1 \\ \vdots \\ z_n \end{pmatrix} $$This can be computed efficiently as a matrix-vector product:
$$ z = X v $$Expanding this, we see:
$$ z = \begin{pmatrix} x_1^T \\ \vdots \\ x_n^T \end{pmatrix} v = \begin{pmatrix} x_1^T v \\ \vdots \\ x_n^T v \end{pmatrix} = \begin{pmatrix} \langle x_1, v \rangle \\ \vdots \\ \langle x_n, v \rangle \end{pmatrix} $$
PC1: Finding the “best” $v$
A “good” $v$ will maximize the spread of the projections; “spread” is statistically defined as variance.
Goal: Find $v$ that maximizes the variance of $z$, denoted $S_z^2$.
Deriving the New Variance $S_z^2$
Let’s compute the mean $\bar{z}$ and variance $S_z^2$ of the new $z$ vector.
1. Mean $\bar{z}$: Since $X$ is centered ($\bar{x} = 0$), the mean of $z$ will also be 0:
$$ \bar{z} = \frac{1}{n} \mathbf{1}^T z = \frac{1}{n} \mathbf{1}^T (Xv) $$$$ \bar{z} = \left( \frac{1}{n} \mathbf{1}^T X \right) v = \bar{x}^T v = 0^T v = 0 $$2. Variance $S_z^2$: Since $\bar{z}=0$, the variance $S_z^2$ is the mean of the squared values:
$$ S_z^2 = \frac{1}{n} z^T z $$Substitute $z = Xv$:
$$ S_z^2 = \frac{1}{n} (Xv)^T (Xv) = \frac{1}{n} (v^T X^T) (Xv) = \frac{1}{n} v^T X^T X v $$Regroup the terms:
$$ S_z^2 = v^T \left( \frac{1}{n} X^T X \right) v $$We know $S = \frac{1}{n} X^T X$ (the original covariance matrix), so:
$$ S_z^2 = v^T S v $$This key formula connects the projected 1D variance ($S_z^2$) to the original $p \times p$ covariance matrix $S$ and the direction $v$.
The Final Optimization Problem
The first principal component (PC1) is the vector $v$ that solves this optimization problem:
Maximize ( $v \in \mathbb{R}^p$ ):
$$ v^T S v $$Subject to:
$$ \|v\| = 1 $$(The constraint $\|v\|=1$ is required because we only care about the direction of $v$. Without it, we could make $v^T S v$ arbitrarily large by simply increasing the length of $v$.)
Personal note: the constraint $\|v\| = 1$ is chosen for convenience and standardization; fixing the length lets us find the unique “unit direction vector” that serves as the principal component. Without fixing the length to control this degree of freedom, the problem may have no solution (the objective can be made arbitrarily large).
Detour: Method of Lagrange Multipliers
Consider $x \in \mathbb{R}^p$ and two functions $f: \mathbb{R}^p \to \mathbb{R}$ and $g: \mathbb{R}^p \to \mathbb{R}$.
We want to maximize $f(x)$ subject to $g(x) = 0$.
Form Lagrangian $\mathcal{L}(x, \lambda)$:
$$ \mathcal{L}(x, \lambda) = f(x) + \lambda g(x) $$where $\lambda$ is the Lagrange multiplier.
Find critical points of $\mathcal{L}$: We solve for $\nabla \mathcal{L}(x, \lambda) = 0$.
Consider all critical points $(x^*, \lambda^*)$ of $\mathcal{L}$. Find the $x^*$ that results in the largest $f(x^*)$.
For multiple constraints
Maximize $f(x)$ subject to:
$$ \begin{cases} g_1(x) = 0 \\ \vdots \\ g_k(x) = 0 \end{cases} $$Form Lagrangian:
$$ \mathcal{L}(x, \lambda_1, \dots, \lambda_k) = f(x) + \lambda_1 g_1(x) + \dots + \lambda_k g_k(x) $$Solve $\nabla \mathcal{L}(x, \lambda_1, \dots, \lambda_k) = 0$ and find all critical points $x^*, \lambda_1^*, \dots, \lambda_k^*$.
Use the $x^*$ with the largest value of $f(x^*)$.
PC1: Solution via Lagrange Multipliers
The conclusion from the optimization is: PC1 (which is $v$) is the top eigenvector of $S$. The corresponding variance $S_z^2$ (which is $v^T S v$) is $\lambda$, the top eigenvalue of $S$.
Derivation
We start with the final optimization problem:
$$ \text{maximize } v^T S v \quad \text{subject to } \|v\|^2 = 1 $$(where $v \in \mathbb{R}^p$)
We use the Method of Lagrange Multipliers. Let $f(v) = v^T S v$ and $g(v) = \|v\|^2 - 1 = v^T v - 1 = 0$.
1. Form the Lagrangian:
$$ \mathcal{L}(v, \lambda) = f(v) - \lambda g(v) $$$$ \mathcal{L}(v, \lambda) = v^T S v - \lambda(v^T v - 1) $$(Here a minus sign is used in front of the multiplier; this only flips the sign of $\lambda$ relative to the $+\lambda g$ convention above and does not change the solution.)
2. Find the gradient w.r.t. $v$ and set to zero:
$$ \nabla_v \mathcal{L}(v, \lambda) = 0 $$Using matrix calculus rules ($\nabla_v(v^T A v) = 2Av$ for symmetric $A$, and $\nabla_v(v^T v) = 2v$):
$$ \nabla_v \mathcal{L} = 2Sv - 2\lambda v $$Setting the gradient to zero:
$$ 2Sv - 2\lambda v = 0 \quad \implies \quad Sv = \lambda v $$3. Interpret the result: The equation $Sv = \lambda v$ is the definition of an eigenvector. This means the optimal vector $v$ must be an eigenvector of the covariance matrix $S$.
4. Find which eigenvector is optimal: Our original goal was to maximize $f(v) = v^T S v$. We can now substitute the result $Sv = \lambda v$ into this objective function:
$$ \text{var}(z) = v^T S v = v^T (Sv) $$$$ = v^T (\lambda v) = \lambda (v^T v) $$From our constraint, we know that $\|v\|^2 = v^T v = 1$.
$$ \text{var}(z) = \lambda \cdot 1 = \lambda $$Conclusion: To maximize the variance ($v^T S v$), we must maximize the eigenvalue ($\lambda$).
Therefore, the first principal component (PC1) is the eigenvector $v$ that corresponds to the maximum eigenvalue of $S$.
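As a sanity check of this conclusion, the sketch below (NumPy, toy data) verifies that the top eigenvector attains $v^T S v = \lambda_1$ and that no random unit direction does better:

```python
import numpy as np

rng = np.random.default_rng(2)
X = rng.normal(size=(200, 4)) @ rng.normal(size=(4, 4))  # correlated toy data
X = X - X.mean(axis=0)
S = X.T @ X / X.shape[0]

eigvals, eigvecs = np.linalg.eigh(S)
v1, lambda1 = eigvecs[:, -1], eigvals[-1]                # top eigenpair (eigh sorts ascending)

# v1 attains v^T S v = lambda_1 ...
print(np.isclose(v1 @ S @ v1, lambda1))                  # True
# ... and no random unit direction exceeds it
vs = rng.normal(size=(1000, 4))
vs /= np.linalg.norm(vs, axis=1, keepdims=True)          # random unit vectors (rows)
print(np.max(np.einsum('ij,jk,ik->i', vs, S, vs)) <= lambda1 + 1e-12)   # True
```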
Terminology
$v_1 \in \mathbb{R}^p$ is the PC direction
$z_{(1)} = \begin{pmatrix} z_1 \\ \vdots \\ z_n \end{pmatrix} = X v_1 \in \mathbb{R}^n$ is the Score Vector
$\lambda_1 = v_1^T S v_1 \ge 0$ is the variance explained
$\hat{X}_1 = \begin{pmatrix} \hat{x}_1^T \\ \vdots \\ \hat{x}_n^T \end{pmatrix} = X v_1 v_1^T \in \mathbb{R}^{n \times p}$ are the projections onto the PC1 space
PC2 (Second Principal Component)
Goal: Find new direction $v$ with $\|v\|=1$ that maximizes $S_z^2$ and is perpendicular to $v_1$.
The optimization problem is:
maximize $v^T S v$ subject to $\|v\| = 1$ and $v \perp v_1$
PC2: Finding the Second Principal Component
From PC1, we know the goal is to find a direction $v$ that maximizes the projected variance $S_z^2 = v^T S v$. For PC2, we add a new constraint.
The Optimization Problem (PC2):
Find the new direction $v$ with $\|v\|=1$ that maximizes $S_z^2$ and is perpendicular to $v_1$.
maximize $v^T S v$ subject to $\|v\| = 1$ and $v \perp v_1$ (Note: $v \perp v_1$ can be written as $v^T v_1 = 0$)
Solution via Lagrange Multipliers
The conclusion is: PC2 ($v_2$) is the eigenvector of $S$ corresponding to the second largest eigenvalue. The corresponding variance $S_z^2 = \lambda_2$ is that second largest eigenvalue.
Derivation
We have one objective $f(v) = v^T S v$ and two constraints:
- $g_1(v) = \|v\|^2 - 1 = v^T v - 1 = 0$
- $g_2(v) = v^T v_1 = 0$
1. Form the Lagrangian: We introduce two multipliers, $\lambda$ and $\mu$.
$$ \mathcal{L}(v, \lambda, \mu) = v^T S v - \lambda(v^T v - 1) - \mu(v^T v_1) $$2. Find the gradient w.r.t. $v$ and set to zero:
$$ \nabla_v \mathcal{L} = 2Sv - 2\lambda v - \mu v_1 = 0 $$3. Solve for the multipliers: (The note shows this derivation): Left-multiply the gradient equation by $v_1^T$:
$$ v_1^T (2Sv - 2\lambda v - \mu v_1) = v_1^T 0 = 0 $$$$ 2 v_1^T S v - 2\lambda (v_1^T v) - \mu (v_1^T v_1) = 0 $$Using the constraints ($v_1^T v = 0$, $v_1^T v_1 = 1$) and the fact that $v_1$ is an eigenvector ($v_1^T S = \lambda_1 v_1^T$), this simplifies to $\mu = 0$.
4. Interpret the result: Since $\mu=0$, the gradient equation becomes:
$$ 2Sv - 2\lambda v = 0 \quad \implies \quad Sv = \lambda v $$This means PC2, like PC1, must be an eigenvector of S.
5. Find which eigenvector:
Our goal is to maximize $\text{var}(z) = v^T S v = \lambda$.
We are constrained by $v^T v_1 = 0$.
A property of symmetric matrices (like $S$) is that all their eigenvectors are mutually orthogonal. This means $v_2, v_3, \dots, v_p$ all automatically satisfy the constraint $v^T v_1 = 0$.
Our problem is now: “From the set of allowed eigenvectors $\{v_2, \dots, v_p\}$, which one has the largest eigenvalue $\lambda$?” Since we sorted them by $\lambda_1 \ge \lambda_2 \ge \dots$, the answer is $v_2$.
Conclusion:
$$ v \text{ corresponds to the second largest e.value} \implies v = v_2 $$The second PC has direction $v_2$, score $z_{(2)} = X v_2$, and explains $\lambda_2 = v_2^T S v_2$ variance of the data.
PCr: The $r$-th Principal Component
The logic for finding PC2 can be generalized to find any $r$-th component.
The Optimization Problem (PCr): Find the new direction $v$ that maximizes variance, given that $v$ is orthogonal to all previously found components.
maximize $v^T S v$ subject to $\|v\| = 1$ and $v \perp v_1, \dots, v \perp v_{r-1}$
The Solution: The solution $v$ is $v_r$, the eigenvector of $S$ corresponding to the $r$-th largest eigenvalue. The corresponding variance $S_z^2$ is $\lambda_r$, the $r$-th largest eigenvalue of $S$.
The $(PC_1, \dots, PC_r)$-space
Once we have computed the first $r$ principal components, we can define the $r$-dimensional PC-space.
Directions $V_{(r)}$: This is the projection matrix, formed by stacking the first $r$ eigenvectors (PC directions) as columns:
$$ V_{(r)} = \begin{bmatrix} | & | & & | \\ v_1 & v_2 & \dots & v_r \\ | & | & & | \end{bmatrix} \in \mathbb{R}^{p \times r} $$Scores $Z_{(r)}$: This is the final, new, low-dimensional data. It is computed by projecting the centered data $X$ onto the new $r$-dimensional space.
$$ Z_{(r)} = \begin{pmatrix} z_{(1)} & z_{(2)} & \dots & z_{(r)} \end{pmatrix} = X \cdot V_{(r)} \in \mathbb{R}^{n \times r} $$- This $n \times r$ matrix is our new dataset. Each row is a sample, and each column is a new (PC) feature.
Projections $\hat{X}_{(r)}$: This is the reconstructed data, or the “shadow” of the original data in the $p$-dimensional space.
$$ \hat{X}_{(r)} = \begin{pmatrix} \hat{x}_1^T \\ \vdots \\ \hat{x}_n^T \end{pmatrix} = X V_{(r)} V_{(r)}^T \in \mathbb{R}^{n \times p} $$- This $n \times p$ matrix represents how much of the original data is “explained” by the first $r$ components.
Example: The $(PC_1, PC_2)$-space ($r=2$)
This is the most common case, used for 2D visualization.
- Directions: $V_{(2)} = (v_1, v_2) \in \mathbb{R}^{p \times 2}$
- Scores: $Z_{(2)} = (z_{(1)}, z_{(2)}) = X V_{(2)} \in \mathbb{R}^{n \times 2}$
- Projections: $\hat{X}_{(2)} = X V_{(2)} V_{(2)}^T \in \mathbb{R}^{n \times p}$
Practical aspects of PCA
1. PCA is sensitive to scaling
PCA is a variance-maximizing algorithm, so it is highly sensitive to the scale of the original features.
Problem: If features have different units (e.g., meters vs. kilograms), their variances will be vastly different. PCA will incorrectly assign more importance to features with larger numerical variance.
Example:
- weight (kg) vs height (m): the variances might be comparable.
- weight (kg) vs height (cm): converting meters to centimeters (e.g., 1.8 to 180) multiplies the values by 100, which inflates the variance of height by a factor of $100^2 = 10{,}000$. PCA will now be dominated by the height feature.
Solution: When necessary, standardize data columns.
- Before running PCA, center each feature (column $f_i$) and divide by its standard deviation $s_i$.
- $$ X = (f_1 \dots f_p) \quad \longrightarrow \quad X_{\text{standardized}} = \left( \frac{f_1 - \bar{f}_1}{s_1} \dots \frac{f_p - \bar{f}_p}{s_p} \right) $$
- This ensures all features have a variance of 1 and are treated “fairly”.
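The sketch below illustrates both the problem and the fix on synthetic height/weight data (the means and standard deviations are made up for illustration): changing the units of one feature changes which feature dominates PC1, while standardizing first makes the result unit-independent.

```python
import numpy as np

rng = np.random.default_rng(3)
n = 500
height_m = rng.normal(1.75, 0.1, n)            # height in metres
weight_kg = rng.normal(70.0, 3.0, n)           # weight in kilograms

def pc1(X):
    X = X - X.mean(axis=0)
    S = X.T @ X / len(X)
    return np.linalg.eigh(S)[1][:, -1]         # top eigenvector of S

X_m  = np.column_stack([height_m, weight_kg])
X_cm = np.column_stack([height_m * 100, weight_kg])     # metres -> centimetres

print(pc1(X_m))    # weight dominates PC1 (variance 9 vs 0.01)
print(pc1(X_cm))   # now height dominates PC1 (variance 100 vs 9)

def standardize(X):
    X = X - X.mean(axis=0)
    return X / X.std(axis=0)                   # each column now has variance 1

# After standardization the units no longer matter
print(np.allclose(np.abs(pc1(standardize(X_m))), np.abs(pc1(standardize(X_cm)))))   # True
```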
2. How to pick $r$?
The choice of $r$ (the number of components to keep) depends on your goal.
For visualization:
- Pick $r=2$ (for a 2D scatter plot) or $r=3$ (for a 3D scatter plot).
For data compression / de-noising:
- Compute the Proportion of Variance Explained (PVE).
Total variance in the data is the sum of all eigenvalues:
$$ \text{Total Variance} = \lambda_1 + \lambda_2 + \dots + \lambda_p $$Variance Explained (VE) by each PC is its eigenvalue:
- VE by $PC_1 = \lambda_1$
- …
- VE by $PC_p = \lambda_p$
Proportion of Variance Explained (PVE) by each PC is:
- PVE by $PC_1 = \frac{\lambda_1}{\lambda_1 + \dots + \lambda_p}$
- …
- PVE by $PC_p = \frac{\lambda_p}{\lambda_1 + \dots + \lambda_p}$
To pick $r$, we compute the cumulative PVE and choose the smallest $r$ such that:
$$ \frac{\lambda_1 + \dots + \lambda_r}{\lambda_1 + \dots + \lambda_p} \ge \text{Threshold (e.g., 0.95 or 0.99)} $$
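A small helper implementing this rule, given the eigenvalues (a sketch; the example eigenvalues and threshold are illustrative):

```python
import numpy as np

def choose_r(eigenvalues, threshold=0.95):
    """Smallest r whose cumulative PVE reaches the threshold."""
    lambdas = np.sort(eigenvalues)[::-1]            # lambda_1 >= ... >= lambda_p
    pve = np.cumsum(lambdas) / np.sum(lambdas)      # cumulative proportion of variance explained
    return int(np.argmax(pve >= threshold)) + 1     # 1-based r

print(choose_r([5.0, 3.0, 1.5, 0.4, 0.1]))          # -> 3, since (5 + 3 + 1.5) / 10 = 0.95
```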
Chapter 2: Probabilistic View of PCA
Here, we consider $X$ as a $p \times 1$ random vector, not a data matrix.
$$ X = \begin{pmatrix} x_1 \\ \vdots \\ x_p \end{pmatrix} \text{is a random vector} $$(We assume $\mathbb{E}[X] = 0$, so $var(X) = \mathbb{E}[XX^T] = \Sigma$)
PC1
Find direction $v \in \mathbb{R}^p$ with $\|v\|=1$ such that $z = X^T v$ has maximum variance.
This is the optimization problem:
$$ \underset{v \in \mathbb{R}^p}{\text{maximize}} \quad \text{var}(X^T v) \quad \text{subject to} \quad \|v\|=1 $$By the properties of variance of random vectors:
$$ \text{var}(z) = \text{var}(X^T v) = \text{var}(v^T X) = v^T \text{var}(X) v = v^T \Sigma v $$The problem becomes:
$$ \text{maximize} \quad v^T \Sigma v \quad \text{subject to} \quad \|v\|=1 $$The solution $v$ is the top eigenvector of $\Sigma$, i.e., $v = v_1$. The variance is:
$$ \text{var}(z) = v_1^T \Sigma v_1 = \lambda_1 $$(The note uses $d_1$ for the eigenvalue $\lambda_1$)
PCr
Find direction $v \in \mathbb{R}^p$ with $\|v\|=1$ such that $z = X^T v$ has maximum variance… …and $v \perp v_1, \dots, v \perp v_{r-1}$ (Note: The image contains a typo $v_r$, which should be $v_{r-1}$).
The solution is the eigenvector of $\Sigma$ corresponding to the $r$-th largest eigenvalue.
PCA: Minimum Reconstruction Error View
This is an alternative, but equivalent, formulation of PCA.
Goal: Find an $r$-dimensional plane that approximates “well” the data points $x_1, \dots, x_n$. “Well” means minimizing the total squared distance from the points to the plane.
1. Defining the $r$-dimensional Plane
To define this plane, we need:
- A point $a \in \mathbb{R}^p$ (which acts as the “center” or “anchor” of the plane).
- An orthogonal basis $v_1, \dots, v_r \in \mathbb{R}^p$ (the $r$ directions that define the plane).
Any point $\hat{x}$ on this plane can be represented as:
$$ \hat{x} = a + z_1 v_1 + \dots + z_r v_r $$where $z_1, \dots, z_r \in \mathbb{R}$ are the coordinates of $\hat{x}$ within the plane.
2. The Reconstruction Error
In particular, we want to find the “shadow” or projection $\hat{x}_i$ for each original data point $x_i$. This $\hat{x}_i$ is the reconstruction of $x_i$ using only $r$ dimensions.
The formula for the $i$-th reconstruction is:
$$ \hat{x}_i = a + z_{i1} v_1 + \dots + z_{ir} v_r $$We can write this more compactly using matrix notation:
- Let $V = (v_1, \dots, v_r) \in \mathbb{R}^{p \times r}$ be the matrix of basis vectors.
- Let the constraint be that $V$ is orthogonal: $V^T V = I$.
- Let $\beta_i = \begin{pmatrix} z_{i1} \\ \vdots \\ z_{ir} \end{pmatrix} \in \mathbb{R}^r$ be the new $r$-dimensional score vector for the $i$-th point.
The reconstruction formula becomes:
$$ \hat{x}_i = a + V \beta_i $$3. The Optimization Problem
The goal of $(PC_1, \dots, PC_r)$ is to minimize the reconstruction error.
The Reconstruction Error (RE) is the total sum of squared distances between each original point $x_i$ and its reconstruction $\hat{x}_i$:
$$ RE(a, V, \beta_1, \dots, \beta_n) = \sum_{i=1}^n \|x_i - \hat{x}_i\|^2 $$Substituting the formula for $\hat{x}_i$, the PCA optimization problem becomes:
$$ \underset{a, V, \beta_i}{\text{minimize}} \quad \sum_{i=1}^n \|x_i - a - V \beta_i\|^2 $$subject to $V^T V = I$.
(It can be proven that the solution to this problem is $a = \bar{x}$, $V$ is the matrix of the top $r$ eigenvectors of $S$, and $\beta_i = V^T (x_i - \bar{x})$)
Solution to the Minimum Reconstruction Error Problem
The optimization problem was:
$$ \underset{a, V, \beta_i}{\text{minimize}} \quad \sum_{i=1}^n \|x_i - a - V \beta_i\|^2 \quad \text{subject to} \quad V^T V = I $$This problem can be solved in steps, which reveals its connection to the “Maximum Variance” view.
Step 1: Find Optimal $a$ and $\beta_i$ (given $V$)
It can be proven that the optimal $a$ and $\beta_i$ are:
Optimal $a = \bar{x}$ (the sample mean) (The note states this without proof, as it’s a standard result).
Optimal $\beta_i = V^T (x_i - \bar{x}) = V^T \tilde{x}_i$ (This is the score vector for the $i$-th point, found by projecting the centered data $\tilde{x}_i$ onto the $V$ basis. As the note mentions, this is the solution to a regression problem: $\beta_i = (V^T V)^{-1} V^T \tilde{x}_i = V^T \tilde{x}_i$ since $V^T V = I$).
Step 2: Rewrite the Reconstruction Error (RE)
Now we substitute $a = \bar{x}$ and $\beta_i = V^T \tilde{x}_i$ back into the RE formula. (Note: $\hat{x}_i = a + V\beta_i = \bar{x} + V V^T \tilde{x}_i$. So $x_i - \hat{x}_i = \tilde{x}_i - V V^T \tilde{x}_i = (I - VV^T)\tilde{x}_i$). We will now use $x_i$ to denote the centered data $\tilde{x}_i$ (as the note does for simplicity).
The RE becomes:
$$ RE = \sum_{i=1}^n \|x_i - V\beta_i\|^2 = \sum_{i=1}^n \|x_i - VV^T x_i\|^2 = \sum_{i=1}^n \|(I - VV^T)x_i\|^2 $$Using the properties of trace (i.e., $\|z\|^2 = z^T z = \text{tr}(zz^T)$ and $\text{tr}(ABC) = \text{tr}(CAB)$):
$$ \|(I - VV^T)x_i\|^2 = x_i^T (I - VV^T)^T (I - VV^T) x_i $$(Since $(I - VV^T)$ is a projection matrix, it is symmetric and idempotent, so $(I-P)^T(I-P) = (I-P)$)
$$ = x_i^T (I - VV^T) x_i = \text{tr}\left( x_i^T (I - VV^T) x_i \right) $$$$ = \text{tr}\left( (I - VV^T) x_i x_i^T \right) $$Now sum over $i$ and move the sum inside the trace:
$$ RE = \sum_{i=1}^n \text{tr}\left( (I - VV^T) x_i x_i^T \right) = \text{tr}\left( (I - VV^T) \sum_{i=1}^n x_i x_i^T \right) $$Since $S = \frac{1}{n} \sum x_i x_i^T$ (for centered data), $\sum x_i x_i^T = nS$.
$$ RE = \text{tr}\left( (I - VV^T) (nS) \right) = n \cdot \text{tr}\left( (I - VV^T) S \right) $$Step 3: Simplify and Connect to Maximum Variance
This is the final step. We use the linearity of trace ($\text{tr}(A-B) = \text{tr}(A) - \text{tr}(B)$):
$$ RE = n \cdot \text{tr}(S - VV^T S) = n \cdot \text{tr}(S) - n \cdot \text{tr}(VV^T S) $$Using the cyclic property of trace ($\text{tr}(ABC) = \text{tr}(CAB)$):
$$ \text{tr}(VV^T S) = \text{tr}( (V^T) (S) (V) ) = \text{tr}(V^T S V) $$So the RE is:
$$ RE = n \cdot \text{tr}(S) - n \cdot \text{tr}(V^T S V) $$Our goal is to minimize $RE$.
- $n \cdot \text{tr}(S)$ is the total variance of the data, which is a constant.
- Therefore, minimizing $RE$ is equivalent to maximizing the term being subtracted: $$ \text{maximize} \quad \text{tr}(V^T S V) $$ Expanding $V = (v_1, \dots, v_r)$, we get: $$ \text{maximize} \quad \text{tr}(V^T S V) = v_1^T S v_1 + \dots + v_r^T S v_r $$ This is the exact same optimization problem as the “Maximum Variance” view.
Conclusion: The optimal $v_1, \dots, v_r$ are the eigenvectors of $S$ corresponding to the $r$ largest eigenvalues.
Final Approximation
The PCA reconstruction of a point $x$ is $\hat{x}$, which is its projection onto the $r$-dimensional plane.
$$ x \approx \hat{x} = \bar{x} + z_1 v_1 + \dots + z_r v_r $$
PCA via SVD
This provides an alternative and often more stable computational method for PCA.
Assume that the $n \times p$ data matrix $X$ is centered.
We can compute the Singular Value Decomposition (SVD) of $X$:
$$ X = U D V^T $$(Where $U$ is $n \times p$, $D$ is $p \times p$, and $V^T$ is $p \times p$)
Now, let’s look at the covariance matrix $S$:
$$ S = \frac{1}{n} X^T X = \frac{1}{n} (U D V^T)^T (U D V^T) $$$$ S = \frac{1}{n} (V D^T U^T) (U D V^T) $$Since $U$ is column-orthogonal, $U^T U = I$:
$$ S = \frac{1}{n} V (D^T D) V^T = V \left( \frac{D^2}{n} \right) V^T $$We also know the Eigendecomposition (EVD) of $S$ is $S = V \Lambda V^T$. By comparing these two forms, we find the key relationships:
The Principal Component Directions $V_{(r)}$ are the top $r$ right singular vectors of $X$. (The eigenvectors of $S$ are the right singular vectors of $X$).
Variance explained by each component is $\lambda_i = \frac{d_i^2}{n}$, where $d_i$ is the $i$-th singular value of $X$.
PCA Products via SVD
This gives us a very direct way to compute the PCA results:
The Scores $Z_{(r)}$: The scores $Z_{(r)} = X V_{(r)}$. We can substitute the SVD of $X$:
$$ Z_{(r)} = X V_{(r)} = (U D V^T) V_{(r)} $$The note shows the derivation of this, which simplifies (because $V^T V_{(r)}$ is an identity matrix $I$ in the top $r$ block):
$$ Z_{(r)} = U_{(r)} D_{(r)} $$(Where $U_{(r)}$ and $D_{(r)}$ are the first $r$ columns/diagonal entries of $U$ and $D$). This means the scores are simply the scaled left singular vectors.
The Projections $\hat{X}_{(r)}$ (Reconstruction): The $r$-rank approximation of $X$ (which is the PCA projection) is:
$$ \hat{X}_{(r)} = U_{(r)} D_{(r)} V_{(r)}^T $$
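The sketch below checks these relationships numerically on toy data: the eigendecomposition route and the SVD route give the same eigenvalues ($\lambda_i = d_i^2/n$) and the same scores up to sign (a NumPy sketch):

```python
import numpy as np

rng = np.random.default_rng(4)
X = rng.normal(size=(100, 6))
X = X - X.mean(axis=0)
n = X.shape[0]

# Eigendecomposition route: S = (1/n) X^T X
S = X.T @ X / n
eigvals, eigvecs = np.linalg.eigh(S)
lambdas = eigvals[::-1]                              # descending eigenvalues

# SVD route: X = U D V^T (thin SVD)
U, d, Vt = np.linalg.svd(X, full_matrices=False)

r = 3
Z_eig = X @ eigvecs[:, ::-1][:, :r]                  # scores via eigenvectors of S
Z_svd = U[:, :r] * d[:r]                             # scores via U_(r) D_(r)

print(np.allclose(lambdas, d**2 / n))                # True: lambda_i = d_i^2 / n
print(np.allclose(np.abs(Z_eig), np.abs(Z_svd)))     # True: same scores up to sign
```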
Interpretation
The full SVD of $X$ is a sum of $p$ components:
$$ X = d_1 u_1 v_1^T + \dots + d_r u_r v_r^T + d_{r+1} u_{r+1} v_{r+1}^T + \dots + d_p u_p v_p^T $$PCA via SVD states that $X$ is approximately its reconstruction $\hat{X}_{(r)}$:
$$ X = \underbrace{\hat{X}_{(r)}}_{\text{signal}} + \underbrace{\text{"noise"}}_{\text{remaining components}} $$
PCA: Low-Rank Matrix Approximation View
This fourth view of PCA connects all previous concepts (variance, reconstruction, SVD) through the language of linear algebra.
1. “r-dim Plane” is “Rank-r Matrix”
First, let’s establish that a set of points lying on an $r$-dimensional plane is equivalent to a low-rank matrix.
- Assume a set of $n$ points $\hat{x}_1, \dots, \hat{x}_n$ belong to an $r$-dimensional plane (that passes through the origin).
- This means each point $\hat{x}_i$ is a linear combination of $r$ orthogonal basis vectors $v_1, \dots, v_r$: $$ \hat{x}_i = z_{i1} v_1 + \dots + z_{ir} v_r $$
- We can write the transpose of this: $$ \hat{x}_i^T = (z_{i1} \dots z_{ir}) \begin{pmatrix} \text{---} & v_1^T & \text{---} \\ & \vdots & \\ \text{---} & v_r^T & \text{---} \end{pmatrix} $$
- If we stack all $n$ points into the $n \times p$ reconstruction matrix $\hat{X}$: $$ \hat{X} = \begin{pmatrix} \hat{x}_1^T \\ \vdots \\ \hat{x}_n^T \end{pmatrix} = \underbrace{\begin{pmatrix} z_{11} & \dots & z_{1r} \\ \vdots & & \vdots \\ z_{n1} & \dots & z_{nr} \end{pmatrix}}_{Z \text{ ($n \times r$)}} \underbrace{\begin{pmatrix} \text{---} & v_1^T & \text{---} \\ & \vdots & \\ \text{---} & v_r^T & \text{---} \end{pmatrix}}_{V^T \text{ ($r \times p$)}} $$
- This shows that $\hat{X} = Z V^T$. By the properties of linear algebra, the rank of a matrix product is at most the smallest rank of the inputs.
- Therefore, we have a key insight: $$ \text{rank}(\hat{X}) \le r $$ A set of points lying on an $r$-dim plane is a rank-$r$ matrix.
2. The Optimization Problem
The goal of $(PC_1, \dots, PC_r)$ is to find a low-rank $\hat{X}$ that approximates $X$.
This is formulated as the following optimization problem:
$$ \underset{\hat{X} \in \mathbb{R}^{n \times p}}{\text{minimize}} \quad \|X - \hat{X}\|_F^2 \quad \text{subject to} \quad \text{rank}(\hat{X}) \le r $$This means: “Find the best rank-$r$ matrix approximation $\hat{X}$ for the data $X$.”
- The objective function $\|X - \hat{X}\|_F^2$ is the reconstruction error, measured by the Frobenius norm squared.
- The Frobenius norm is the sum of squared distances of all elements: $$ \|X - \hat{X}\|_F^2 = \sum_{i=1}^n \sum_{j=1}^p (x_{ij} - \hat{x}_{ij})^2 $$
- This is mathematically equivalent to the sum of squared Euclidean distances for each sample (row), which is exactly the reconstruction error we saw earlier: $$ = \sum_{i=1}^n \|x_i - \hat{x}_i\|^2 $$
(The Eckart-Young theorem proves that the solution $\hat{X}$ to this problem is the $\hat{X}_{(r)}$ obtained from the SVD of $X$, thus unifying all views of PCA.)
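The sketch below illustrates the Eckart–Young statement on toy data: the truncated SVD's squared Frobenius error equals the sum of the discarded squared singular values, and it is no larger than the error of some other rank-$r$ projection (here, a random one):

```python
import numpy as np

rng = np.random.default_rng(5)
X = rng.normal(size=(60, 8))
X = X - X.mean(axis=0)
r = 2

# Best rank-r approximation via truncated SVD (Eckart-Young)
U, d, Vt = np.linalg.svd(X, full_matrices=False)
X_hat_svd = U[:, :r] * d[:r] @ Vt[:r, :]
err_svd = np.linalg.norm(X - X_hat_svd, 'fro')**2

# Projection onto a random r-dimensional subspace is also rank r, but worse
Q, _ = np.linalg.qr(rng.normal(size=(8, r)))         # random orthonormal basis
X_hat_rand = X @ Q @ Q.T
err_rand = np.linalg.norm(X - X_hat_rand, 'fro')**2

print(err_svd <= err_rand)                           # True
print(np.isclose(err_svd, np.sum(d[r:]**2)))         # True: error = sum of discarded d_i^2
```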
PCA for Matrix Completion: The Hard-Impute Algorithm
This topic applies the low-rank approximation view of PCA to solve the problem of missing values in a dataset.
1. Problem Setup: Missing Values
Assume $X \in \mathbb{R}^{n \times p}$ has missing values (e.g., NA).
- We denote $\Omega \subseteq \{1...n\} \times \{1...p\}$ as the set of observed entries in $X$.
- We define $W \in \mathbb{R}^{n \times p}$ as the mask matrix. $$ W_{ij} = \begin{cases} 1 & \text{if } (i,j) \in \Omega \\ 0 & \text{if } (i,j) \notin \Omega \end{cases} $$
Example:
$$ X = \begin{pmatrix} \text{NA} & 1 \\ 0 & \text{NA} \\ \text{NA} & 2 \end{pmatrix} $$The set of observed entries is:
$$ \Omega = \{ (1,2), (2,1), (3,2) \} $$The corresponding mask matrix $W$ is:
$$ W = \begin{pmatrix} 0 & 1 \\ 1 & 0 \\ 0 & 1 \end{pmatrix} $$
2. Reconstruction Error with Missing Data
We can only measure the reconstruction error on the entries we observed ($\Omega$).
The error is the sum of squared differences only over $\Omega$:
$$ \sum_{(i,j) \in \Omega} (X_{ij} - \hat{X}_{ij})^2 $$This can be expressed using the mask matrix $W$ and the Frobenius norm, where * denotes the element-wise product (Hadamard product):
The error sum is equivalent to:
$$ \sum_{(i,j) \in \Omega} (X_{ij} - \hat{X}_{ij})^2 = \| W * (X - \hat{X}) \|_F^2 $$This is because:
$$ \sum_{i=1}^n \sum_{j=1}^p W_{ij} \cdot (X_{ij} - \hat{X}_{ij})^2 = \sum_{i=1}^n \sum_{j=1}^p ( (W * (X - \hat{X}))_{ij} )^2 $$The $W_{ij}=0$ terms “zero out” the error for missing entries, leaving only the error from $\Omega$.
3. Hard-Impute Algorithm
This algorithm iteratively solves the matrix completion problem. The goal is to find a low-rank matrix $\hat{X}$ that minimizes the observed reconstruction error.
Optimization Problem:
$$ \underset{\hat{X}}{\text{minimize}} \quad \|W * (X - \hat{X})\|_F^2 \quad \text{subject to} \quad \text{rank}(\hat{X}) = r $$
Algorithm:
- Input: Matrix $X \in \mathbb{R}^{n \times p}$ (with NAs), rank $r$.
- Initialization: A starting guess for $\hat{X} \in \mathbb{R}^{n \times p}$.
Repeat until convergence:
Step 1 (Impute): Fill in the missing values of $X$ with the current guess from $\hat{X}$.
$$ Y = W * X + (1-W) * \hat{X} $$(This creates a “completed” matrix $Y$ by taking observed values from $X$ and imputed values from $\hat{X}$)
Step 2 (Project): Find the best rank-$r$ approximation of the completed matrix $Y$.
$$ \text{Compute SVD of Y:} \quad Y = U D V^T $$$$ \text{Update } \hat{X}: \quad \hat{X} = U_{(r)} D_{(r)} V_{(r)}^T $$(This new $\hat{X}$ is the best rank-r approximation of $Y$, and it becomes the guess for the next iteration).
- Output: The final converged matrix $\hat{X}$.
Hard-impute steps (in words):
- Impute missing values in $X$ with the elements from the current estimate $\hat{X}$.
- Make the resulting completed matrix low-rank: replace $\hat{X}$ with its best rank-$r$ SVD approximation (see the sketch below).
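A minimal NumPy implementation of Hard-Impute along these lines (the iteration cap and convergence tolerance are illustrative choices, not from the original notes):

```python
import numpy as np

def hard_impute(X, r, n_iter=100, tol=1e-6):
    """Iteratively impute missing entries (np.nan) and truncate to rank r."""
    W = ~np.isnan(X)                          # mask: True where observed
    X_obs = np.where(W, X, 0.0)               # observed values, 0 elsewhere
    X_hat = np.zeros_like(X_obs)              # initial guess

    for _ in range(n_iter):
        # Step 1 (impute): observed entries from X, missing entries from the current X_hat
        Y = np.where(W, X_obs, X_hat)
        # Step 2 (project): best rank-r approximation of Y via truncated SVD
        U, d, Vt = np.linalg.svd(Y, full_matrices=False)
        X_hat_new = U[:, :r] * d[:r] @ Vt[:r, :]
        if np.linalg.norm(X_hat_new - X_hat) < tol:
            return X_hat_new
        X_hat = X_hat_new
    return X_hat

# The 3 x 2 example from the text, completed with a rank-1 fit
X = np.array([[np.nan, 1.0],
              [0.0, np.nan],
              [np.nan, 2.0]])
print(hard_impute(X, r=1))
```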
4. Example of One Iteration
Let’s trace one loop of the algorithm.
$$ X = \begin{pmatrix} \text{NA} & 1 \\ 0 & \text{NA} \\ \text{NA} & 2 \end{pmatrix}, \quad W = \begin{pmatrix} 0 & 1 \\ 1 & 0 \\ 0 & 1 \end{pmatrix}, \quad \text{Current } \hat{X} = \begin{pmatrix} a & b \\ c & d \\ e & f \end{pmatrix} $$
Step 1: $Y = W * X + (1-W) * \hat{X}$
- We need $(1-W)$: $$ (1-W) = \begin{pmatrix} 1 & 0 \\ 0 & 1 \\ 1 & 0 \end{pmatrix} $$
- Now compute $Y$:
$$
Y = \underbrace{\begin{pmatrix} 0 & 1 \\ 0 & 0 \\ 0 & 2 \end{pmatrix}}_{W * X} + \underbrace{\begin{pmatrix} a & 0 \\ 0 & d \\ e & 0 \end{pmatrix}}_{(1-W) * \hat{X}} = \begin{pmatrix} a & 1 \\ 0 & d \\ e & 2 \end{pmatrix}
$$
(Note: the observed entry $X_{21} = 0$ is kept, so the first column of $Y$ is $(a, 0, e)^T$; only the missing entries are filled in from $\hat{X}$.)
Step 2: $\hat{X} = U_{(r)} D_{(r)} V_{(r)}^T$
- We take the matrix $Y = \begin{pmatrix} a & 1 \\ 0 & d \\ e & 2 \end{pmatrix}$
- We compute its SVD: $Y = U D V^T$
- We find its best rank-$r$ approximation (e.g., $r=1$): $$ \hat{X}_{\text{new}} = U_{(1)} D_{(1)} V_{(1)}^T = \begin{pmatrix} a' & b' \\ c' & d' \\ e' & f' \end{pmatrix} $$
- This $\hat{X}_{\text{new}}$ is now fed back into Step 1 for the next iteration.
Chapter 3: Non-Linear Methods
PCA is a Linear Method
PCA is considered a linear method because the new “score” $z$ (the coordinate on a principal component $v$) is a linear combination of the original $p$ features ($f_1, \dots, f_p$, or $x_1, \dots, x_p$).
If $x = (f_1, \dots, f_p)^T$ is a single data point and $v = (v_1, \dots, v_p)^T$ is a principal component:
$$ z = v^T x = v_1 f_1 + v_2 f_2 + \dots + v_p f_p $$
Pros:
- easy to implement (relies on standard EVD or SVD)
- easy to interpret (the weights $v_j$ in the vector $v$ tell you how much original feature $f_j$ contributes to that component)
Cons:
- has limited flexibility… (it assumes the data lies on a flat $r$-dimensional plane, and fails on “curved” manifolds)
Modifications and Non-linear Methods
Modifications:
- sparse PCA
- kernel PCA (this is the non-linear extension we will look at)
Non-linear methods (different algorithms):
- t-SNE
- UMAP
Kernel PCA
This note introduces a non-linear extension to PCA, for cases where linear PCA fails.
- The Problem: Standard PCA finds a linear projection (PC1). On data that is not linearly separable (like two separable clusters or concentric circles), PC1 fails to capture the structure and mixes the classes.
- Top-Left Diagram: Two clusters, PC1 projects them onto the same line.
- Bottom-Left Diagram: Concentric circles, PC1 projects them onto the same line.
Idea: Transform Feature Space
The core idea is to transform the data into a higher-dimensional space where it becomes linearly separable.
$$ f_1, \dots, f_p \longrightarrow \Phi_1(f_1, \dots, f_p), \dots, \Phi_q(f_1, \dots, f_p) \quad (\text{where } q > p) $$$$ \Leftrightarrow \quad x_i \in \mathbb{R}^p \longrightarrow \Phi(x_i) \in \mathbb{R}^q $$…and then run PCA on the transformed data $\Phi(x_1), \dots, \Phi(x_n)$.
Example: The “Magic” of Feature Transformation
This example shows how the transformation works for the concentric circles data.
$$ \text{Example:} \quad (f_1, f_2) \to (f_1, f_2, f_1^2 + f_2^2) $$- Left Diagram: The original 2D data (concentric circles).
- Right Diagram: The new 3D data. The new 3rd dimension ($f_1^2 + f_2^2$) is the squared radius. The inner (blue) circle has a low radius, the outer (red) circle has a high radius. The data is now linearly separable by a 2D plane.
PCA on $\Phi(x_1), \dots, \Phi(x_n)$ will compute scores. For the $i$-th component:
$$ Z_{(i)} = v_{i1}f_1 + v_{i2}f_2 + v_{i3}(f_1^2+f_2^2) $$This is non-linear in the original features $f_1, f_2$.
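The sketch below reproduces this example numerically: generate two concentric circles, append the squared radius as a third feature, and run ordinary (linear) PCA on the transformed data (the radii and noise level are arbitrary choices):

```python
import numpy as np

rng = np.random.default_rng(6)

def circle(radius, n=200):
    theta = rng.uniform(0, 2 * np.pi, n)
    points = radius * np.column_stack([np.cos(theta), np.sin(theta)])
    return points + rng.normal(0, 0.05, (n, 2))        # small noise

X2d = np.vstack([circle(1.0), circle(3.0)])             # inner circle, then outer circle

# Explicit feature map: (f1, f2) -> (f1, f2, f1^2 + f2^2)
Phi = np.column_stack([X2d, (X2d**2).sum(axis=1)])

# Ordinary (linear) PCA on the transformed data
Phi_c = Phi - Phi.mean(axis=0)
S = Phi_c.T @ Phi_c / len(Phi_c)
v1 = np.linalg.eigh(S)[1][:, -1]                        # top principal direction
z1 = Phi_c @ v1                                         # scores on the first component

print(z1[:200].mean(), z1[200:].mean())                 # the two circles get clearly separated scores
```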
The Kernel Trick
The Problem: The new feature space $q$ can be computationally expensive (or even infinitely large) to compute.
The Solution: The PCA algorithm can be reformulated to only use an $n \times n$ inner product matrix $K$ (or Gram matrix).
$$ \text{Denote the transformed data by } \Phi = \begin{pmatrix} \text{---} & \Phi(x_1)^T & \text{---} \\ & \vdots & \\ \text{---} & \Phi(x_n)^T & \text{---} \end{pmatrix} \in \mathbb{R}^{n \times q} $$$$ \text{Let } K = \Phi \Phi^T \in \mathbb{R}^{n \times n} \text{ be the inner product matrix.} $$$$ K_{ij} = \langle \Phi(x_i), \Phi(x_j) \rangle $$A kernel function $\mathcal{K}$ is a “shortcut” that computes this inner product cheaply in the original space:
$$ \mathcal{K}(x_i, x_j) = \langle \Phi(x_i), \Phi(x_j) \rangle $$Example: kernel function $\Rightarrow$ implicit feature mapping
Quadratic:
$$ \mathcal{K}(a, b) = (1 + \langle a, b \rangle)^2 = 1 + 2\sum_i a_i b_i + \sum_i \sum_j a_i a_j \, b_i b_j $$The simple calculation on the left is equivalent to an inner product in a very high-dimensional space $\Phi$:
$$ \Phi: (f_1, \dots, f_p) \to (1, f_1^2, \dots, f_p^2, \sqrt{2}f_1, \dots, \sqrt{2}f_p, \sqrt{2}f_1 f_2, \dots) $$Polynomial:
$$ \mathcal{K}(x, y) = (\langle x, y \rangle + d)^\gamma $$Radial (RBF):
$$ \mathcal{K}(x, y) = e^{-\gamma ||x-y||^2} $$(This implicitly maps to an infinite-dimensional space).
Kernel PCA: The Algorithm
The core insight of Kernel PCA is: you don’t need $\Phi$ to compute PCA scores, only $K$.
This works by running PCA in the $n$-dimensional “sample space” (using $K$) instead of the $q$-dimensional “feature space” (using $\Phi$).
The Algorithm
Input:
- Data $X \in \mathbb{R}^{n \times p}$
- Kernel function $\mathcal{K}(a, b)$
- Target dimension $r$
Step 1: Compute the Kernel Matrix $K$
- Compute the $n \times n$ matrix $K$ where: $$ K_{ij} = \mathcal{K}(x_i, x_j) $$
Step 2: Center the Kernel Matrix
- (This is the crucial step to ensure the (implicit) data $\Phi$ is centered).
- We must not use $K$, but a centered version $\tilde{K}$.
- Define the $n \times n$ centering matrix $C$: $$ C = I - \frac{1}{n}\mathbf{1}\mathbf{1}^T $$
- Compute the centered Gram matrix $\tilde{K}$: $$ \tilde{K} = C K C $$
Step 3: Compute Eigen-decomposition of $\tilde{K}$
- Find the eigenvectors and eigenvalues of the $n \times n$ matrix $\tilde{K}$. $$ \tilde{K} U = U \Lambda $$
- Keep the top $r$ eigenvectors $u_1, \dots, u_r$ (which form $U_{(r)}$) and their corresponding eigenvalues $\lambda_1, \dots, \lambda_r$ (which form $\Lambda_{(r)}$).
- (Note: The note’s $d_i^2$ corresponds to $\lambda_i$)
Output: PC Scores $Z_{(r)}$
- The new $r$-dimensional scores are the eigenvectors, scaled by the square root of their eigenvalues.
- Let $D_{(r)} = \sqrt{\Lambda_{(r)}}$ (a diagonal matrix of $d_i = \sqrt{\lambda_i}$).
- The $n \times r$ score matrix $Z_{(r)}$ is: $$ Z_{(r)} = U_{(r)} D_{(r)} $$
- (The $i$-th PC score vector is $z_{(i)} = \sqrt{\lambda_i} u_i = d_i u_i$)
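A direct NumPy sketch of this algorithm (the RBF kernel and its $\gamma$ value are just an illustrative choice of input):

```python
import numpy as np

def kernel_pca(X, kernel, r):
    """Kernel PCA scores following the steps above: K -> centered K -> EVD -> U_(r) D_(r)."""
    n = X.shape[0]
    # Step 1: kernel (Gram) matrix
    K = np.array([[kernel(xi, xj) for xj in X] for xi in X])
    # Step 2: double-center, K_tilde = C K C with C = I - (1/n) 1 1^T
    C = np.eye(n) - np.ones((n, n)) / n
    K_tilde = C @ K @ C
    # Step 3: eigen-decomposition of the centered kernel matrix
    eigvals, eigvecs = np.linalg.eigh(K_tilde)
    lambdas = np.maximum(eigvals[::-1][:r], 0.0)   # top r eigenvalues (clip tiny negatives)
    U_r = eigvecs[:, ::-1][:, :r]                  # top r eigenvectors
    # Output: scores Z_(r) = U_(r) D_(r), with D_(r) = sqrt(Lambda_(r))
    return U_r * np.sqrt(lambdas)

rbf = lambda a, b, gamma=1.0: np.exp(-gamma * np.sum((a - b)**2))   # radial kernel

rng = np.random.default_rng(7)
X = rng.normal(size=(30, 2))
Z = kernel_pca(X, rbf, r=2)
print(Z.shape)                                     # (30, 2)
```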
Mathematical Justification (Why this works)
This algorithm works by finding an equivalence between the SVD of $\Phi$ and the EVD of $K$.
- Assume $\Phi$ is centered. (Achieved by using $\tilde{K}$).
- Goal: We want the scores $Z = \Phi V$ (from standard PCA: data $\times$ directions).
- SVD of $\Phi$: $\Phi = U D V^T$
- As we saw in “PCA via SVD”, the scores are $Z = U D$.
- EVD of $K$: $$ K = \Phi \Phi^T = (U D V^T)(\Phi)^T $$ $$ = (U D V^T)(V D^T U^T) = U D (V^T V) D^T U^T = U D D^T U^T $$ $$ K = U D^2 U^T $$
- This is the key: $K = U (\Lambda) U^T$, where $\Lambda = D^2$.
- The eigenvectors $U$ of $K$ are the left singular vectors $U$ of $\Phi$.
- The eigenvalues $\Lambda$ of $K$ are the squared singular values $D^2$ of $\Phi$.
- Conclusion:
- By finding the eigenvectors $U$ and eigenvalues $\Lambda$ of $K$ (which we can compute), we have found…
- …the $U$ and $D = \sqrt{\Lambda}$ needed to compute…
- …the scores $Z = U D$.
Local Dimension Reduction: UMAP
UMAP (Uniform Manifold Approximation and Projection) is a powerful, modern non-linear dimension reduction method, often used for visualization. It is not an extension of PCA, but a different approach.
The UMAP Algorithm (Conceptual)
Given high-dimensional data $x_1, \dots, x_n \in \mathbb{R}^p$.
Construct High-Dimensional Graph:
- UMAP constructs a weighted K-neighbour graph in the original $p$-dimensional space.
- This graph represents the “topological structure” (i.e., “who is a neighbor to whom”) of the high-dim data.
- (Uses a technique called “fuzzy simplicial complex”).
Construct Low-Dimensional Graph:
- UMAP initializes a $r$-dimensional embedding (e.g., $r=2$) $z_1, \dots, z_n \in \mathbb{R}^r$.
- It constructs a similar weighted K-neighbour graph for these points in the low-dim space.
Optimize:
- UMAP iteratively adjusts the positions of $z_1, \dots, z_n$ to make the low-dimensional graph as “similar” as possible to the high-dimensional graph.
- (It uses cross-entropy as a cost function to measure this similarity).
Two Main Parameters:
- n-neighbors: The number of nearest neighbors to consider. This is the most important parameter.
  - A small value (e.g., 5) forces UMAP to focus on local structure.
  - A large value (e.g., 50) allows UMAP to preserve more global structure.
- min-dist: Controls how tightly UMAP packs neighbors together in the final embedding. This is mostly a cosmetic parameter for the visualization.
Compared to t-SNE, UMAP is faster and better at preserving global structure.
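In practice UMAP is usually run through the umap-learn package. A minimal usage sketch, assuming that package is installed (the parameter values shown are just the library defaults, used here for illustration):

```python
import numpy as np
import umap  # from the umap-learn package

rng = np.random.default_rng(8)
X = rng.normal(size=(500, 20))               # toy high-dimensional data

# n_neighbors sets the local/global trade-off; min_dist is mostly cosmetic
reducer = umap.UMAP(n_neighbors=15, min_dist=0.1, n_components=2, random_state=42)
Z = reducer.fit_transform(X)                 # 500 x 2 embedding, ready for a scatter plot
print(Z.shape)
```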
UMAP vs. PCA
Pros (+) of UMAP:
- Non-linear: Its primary advantage. It can find complex, curved manifolds that PCA (which is linear) would miss.
- Excellent for Visualization: Captures local neighborhood structure very well, revealing clusters.
Cons (-) of UMAP:
- Slower than PCA (but faster than t-SNE).
- Can struggle with noise.
- Less Interpretable (Critical Warning):
- There is no meaning to the UMAP coordinate axes (the X and Y axes are arbitrary).
- The size of a cluster in a UMAP plot is not interpretable.
- The distances between clusters are not interpretable.
- UMAP only preserves local neighbor relationships.
Practical Aspects
Sometimes it’s better to combine PCA & UMAP. This is a common and highly effective workflow:
Filter $\to$ Apply PCA $\to$ Apply UMAP
- Why? Use PCA first as a fast, linear de-noising step (e.g., to reduce 1000 dimensions to 50).
- Then, use UMAP to perform the “hard” non-linear reduction (e.g., from 50 dimensions to 2).
- This is often faster and produces cleaner results than using UMAP on the full, noisy, high-dimensional data.
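A sketch of this combined workflow using scikit-learn's PCA followed by umap-learn (the dimensions 1000 → 50 → 2 mirror the example above; the data here is synthetic):

```python
import numpy as np
import umap
from sklearn.decomposition import PCA

rng = np.random.default_rng(9)
X = rng.normal(size=(1000, 1000))                         # e.g. 1000 noisy features

Z_pca = PCA(n_components=50).fit_transform(X)             # fast linear de-noising: 1000 -> 50
Z_umap = umap.UMAP(n_components=2).fit_transform(Z_pca)   # non-linear reduction: 50 -> 2
print(Z_umap.shape)                                       # (1000, 2)
```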
Canonical Correlation Analysis (CCA)
Goal: Study the association between two sets of variables (two data matrices), $X$ and $Y$.
- $X = \begin{pmatrix} x_1^T \\ \vdots \\ x_n^T \end{pmatrix} \in \mathbb{R}^{n \times p}$ ($n$ samples, $p$ features)
- $Y = \begin{pmatrix} y_1^T \\ \vdots \\ y_n^T \end{pmatrix} \in \mathbb{R}^{n \times q}$ ($n$ samples, $q$ features)
The Problem: From 1D to Multi-dimensional
If $p=q=1$ (i.e., both $X$ and $Y$ are single vectors):
- Association is measured via the Sample Correlation $r_{xy}$.
- $y = ax + b$
- $a > 0 \implies r_{xy} = 1$
- $a < 0 \implies r_{xy} = -1$
- Unknown if linear relationship exists or not
- $r_{xy} = 0 \implies$ no linear association (a non-linear relationship may still exist)
If $p > 1$ and $q > 1$ (the multivariate case):
- We could compute a $p \times q$ cross-correlation matrix $R_{xy}$.
- Let $x = (f_1, \dots, f_p)$ and $y = (g_1, \dots, g_q)$ be the sets of features.
- The matrix $R_{xy}$ has entries $(R_{xy})_{ij} = r_{f_i, g_j}$.
- This matrix is large and difficult to interpret.
The motivating question for CCA:
$$ \text{How to aggregate information in } R_{xy}? $$(CCA aims to find a few “canonical correlations” that summarize the main relationships, instead of looking at all $p \times q$ individual correlations.)
Canonical Correlation Analysis (CCA)
CCA 1: The Problem Definition
Assume that $X \in \mathbb{R}^{n \times p}$ and $Y \in \mathbb{R}^{n \times q}$ are column-centered.
The goal of the first canonical correlation (CCA 1) is to: Find $u \in \mathbb{R}^p$ and $v \in \mathbb{R}^q$ such that $a = Xu$ and $b = Yv$ are maximally correlated.
We aim to:
$$ \underset{u \in \mathbb{R}^p, v \in \mathbb{R}^q}{\text{maximize}} \quad r_{ab} $$- $a = Xu = \begin{pmatrix} a_1 \\ \vdots \\ a_n \end{pmatrix}$ summarizes $X$ into a 1D score.
- $b = Yv = \begin{pmatrix} b_1 \\ \vdots \\ b_n \end{pmatrix}$ summarizes $Y$ into a 1D score.
- $r_{ab}$ summarizes the association between $X$ and $Y$.
CCA Solution: Transformation of the Problem
The correlation $r_{ab}$ can be written using the covariance matrices of $X$ ($S_x$), $Y$ ($S_y$), and the cross-covariance $S_{xy}$:
$$ r_{ab} = \frac{S_{ab}}{\sqrt{S_a^2 S_b^2}} = \frac{u^T S_{xy} v}{\sqrt{(u^T S_x u) (v^T S_y v)}} $$- $S_{ab} = u^T S_{xy} v$
- $S_a^2 = u^T S_x u$
- $S_b^2 = v^T S_y v$
The correlation $r_{ab}$ is invariant to the scale of $u$ and $v$ (e.g., if $\tilde{u} = \alpha u$ and $\tilde{v} = \beta v$ with $\alpha, \beta > 0$, then $r_{\tilde{a}\tilde{b}} = r_{ab}$).
This allows us to restate the CCA problem as a constrained optimization problem. We can fix the denominator to 1 and maximize the numerator:
$$ \underset{u \in \mathbb{R}^p, v \in \mathbb{R}^q}{\text{maximize}} \quad u^T S_{xy} v $$$$ \text{subject to} \quad u^T S_x u = 1 \quad \text{and} \quad v^T S_y v = 1 $$This is still hard to solve. We “whiten” the variables by setting:
$$ \tilde{u} = S_x^{1/2} u \quad \text{and} \quad \tilde{v} = S_y^{1/2} v $$This implies:
$$ u = S_x^{-1/2} \tilde{u} \quad \text{and} \quad v = S_y^{-1/2} \tilde{v} $$The constraints now become:
$$ u^T S_x u = (S_x^{-1/2} \tilde{u})^T S_x (S_x^{-1/2} \tilde{u}) = \tilde{u}^T S_x^{-1/2} S_x S_x^{-1/2} \tilde{u} = \tilde{u}^T I \tilde{u} = \|\tilde{u}\|^2 = 1 $$$$ v^T S_y v = \dots = \|\tilde{v}\|^2 = 1 $$The objective function becomes:
$$ u^T S_{xy} v = (S_x^{-1/2} \tilde{u})^T S_{xy} (S_y^{-1/2} \tilde{v}) = \tilde{u}^T \left( S_x^{-1/2} S_{xy} S_y^{-1/2} \right) \tilde{v} $$Let’s define $\tilde{S}_{xy} = S_x^{-1/2} S_{xy} S_y^{-1/2}$.
The equivalent problem is:
$$ \underset{\tilde{u}, \tilde{v}}{\text{maximize}} \quad \tilde{u}^T \tilde{S}_{xy} \tilde{v} $$$$ \text{subject to} \quad \|\tilde{u}\|^2 = 1 \quad \text{and} \quad \|\tilde{v}\|^2 = 1 $$
Detour: SVD via Optimization (The Tool)
This note proves how to solve this exact form of problem. Given a matrix $A \in \mathbb{R}^{n \times p}$ with SVD $A = U D V^T$. The solution to the problem:
$$ \underset{u \in \mathbb{R}^n, v \in \mathbb{R}^p}{\text{maximize}} \quad u^T A v \quad \text{subject to} \quad \|u\| = 1, \|v\| = 1 $$is:
- $u = u_1$ (the top left singular vector of $A$)
- $v = v_1$ (the top right singular vector of $A$)
- The optimal $u^T A v = d_1$ (the top singular value of $A$).
(The note proves this using Lagrange multipliers, showing $v$ is an eigenvector of $A^T A$ (i.e., a right singular vector) and $u$ is an eigenvector of $A A^T$ (i.e., a left singular vector), and the max value is $d_1$ when $i=j=1$.)
Final Solution
We now apply this “tool” to our “transformed CCA problem”.
- Problem: maximize $\tilde{u}^T \tilde{S}_{xy} \tilde{v}$
- Tool: The solution is the top singular vectors of the matrix $\tilde{S}_{xy}$.
Therefore:
- Optimal $\tilde{u} = \tilde{u}_1$ and Optimal $\tilde{v} = \tilde{v}_1$ are the top left and right singular vectors of $\tilde{S}_{xy} = S_x^{-1/2} S_{xy} S_y^{-1/2}$.
These are the solutions for the transformed problem. To get the final $u$ and $v$ for our original problem, we “un-whiten” them:
- Optimal $u = S_x^{-1/2} \tilde{u}_1$
- Optimal $v = S_y^{-1/2} \tilde{v}_1$
CCA 2: Finding the Second Canonical Correlation
The Problem: We want to find a new pair of vectors $u \in \mathbb{R}^p$ and $v \in \mathbb{R}^q$ such that their scores $a = Xu$ and $b = Yv$ are maximally correlated, but with new constraints:
- $a$ is uncorrelated with $a_1 = X u_1$
- $b$ is uncorrelated with $b_1 = Y v_1$
This is equivalent to the optimization problem:
$$ \underset{u \in \mathbb{R}^p, v \in \mathbb{R}^q}{\text{maximize}} \quad r_{ab} \quad \text{subject to} \quad r_{a a_1} = 0 \text{ and } r_{b b_1} = 0 $$
CCAr: Finding the $r$-th Canonical Correlation
We can generalize this to find the $r$-th pair of vectors, $(u_r, v_r)$.
The Optimization Problem: Find $u$ and $v$ that…
$$ \underset{u \in \mathbb{R}^p, v \in \mathbb{R}^q}{\text{maximize}} \quad u^T S_{xy} v $$…subject to the standard constraints:
$$ u^T S_x u = 1 \quad \text{and} \quad v^T S_y v = 1 $$…and subject to the new orthogonality constraints:
$$ u^T S_x u_i = 0 \quad \text{for } i = 1, \dots, r-1 $$$$ v^T S_y v_i = 0 \quad \text{for } i = 1, \dots, r-1 $$The Solution via SVD
This problem seems complex, but the “whitening” transformation makes it simple.
1. Restate the problem (for CCA2):
The problem: maximize $u^T S_{xy} v$ with constraints $u^T S_x u = v^T S_y v = 1$ and $u^T S_x u_1 = v^T S_y v_1 = 0$…
2. Transform the problem:
This problem can be shown to be equivalent to:
$$ \underset{\tilde{u}, \tilde{v}}{\text{maximize}} \quad \tilde{u}^T \tilde{S}_{xy} \tilde{v} $$
$$ \text{subject to} \quad \|\tilde{u}\|^2 = 1, \ \|\tilde{v}\|^2 = 1 \quad \text{and} \quad \tilde{u}^T \tilde{u}_1 = 0, \ \tilde{v}^T \tilde{v}_1 = 0 $$
where $\tilde{S}_{xy} = S_x^{-1/2} S_{xy} S_y^{-1/2}$, $\tilde{u} = S_x^{1/2} u$, and $\tilde{u}_1 = S_x^{1/2} u_1$.
3. The SVD Magic:
We know the solution to maximize $\tilde{u}^T \tilde{S}_{xy} \tilde{v}$ is the top singular vectors ($\tilde{u}_1$, $\tilde{v}_1$).
The SVD naturally produces an orthogonal basis. The second singular vectors ($\tilde{u}_2$, $\tilde{v}_2$) are guaranteed to be orthogonal to the first (i.e., $\tilde{u}_2^T \tilde{u}_1 = 0$ and $\tilde{v}_2^T \tilde{v}_1 = 0$).
Therefore, the second singular vectors ($\tilde{u}_2$, $\tilde{v}_2$) are the optimal solution that automatically satisfies the new constraints.
4. General Solution:
Thus $\tilde{u} = \tilde{u}_r$, $\tilde{v} = \tilde{v}_r$ are the left and right singular vectors of $\tilde{S}_{xy}$ corresponding to its $r$-th largest singular value.
And the optimal @@u_r@@ and @@v_r@@ are found by “un-whitening”:
$$ u_r = S_x^{-1/2} \tilde{u}_r, \quad v_r = S_y^{-1/2} \tilde{v}_r $$
CCA Computations (The Algorithm for $CCA_1 \dots CCA_r$)
- Input: data $X \in \mathbb{R}^{n \times p}$ and $Y \in \mathbb{R}^{n \times q}$
- Step 1: Compute $S_x, S_y, S_{xy}$ (from centered data).
- Step 2: Compute the “whitened” cross-covariance matrix: $$ \tilde{S}_{xy} = S_x^{-1/2} S_{xy} S_y^{-1/2} $$
- Step 3: Find the SVD of $\tilde{S}_{xy}$ to get the top $r$ singular vectors: $$ \tilde{S}_{xy} \to (\tilde{u}_1 \dots \tilde{u}_r) \text{ and } (\tilde{v}_1 \dots \tilde{v}_r) $$
- Step 4: Transform back (un-whiten) to get the canonical directions: $$ u_i = S_x^{-1/2} \tilde{u}_i, \quad v_i = S_y^{-1/2} \tilde{v}_i \quad \text{for } i=1 \dots r $$
- Output: The pairs $(u_1, v_1), \dots, (u_r, v_r)$
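A NumPy sketch of this algorithm, with $S_x^{-1/2}$ and $S_y^{-1/2}$ computed via eigendecomposition (the toy data with a shared signal is made up so that the first canonical correlations come out near 1):

```python
import numpy as np

def cca(X, Y, r):
    """First r canonical directions and correlations via whitening + SVD."""
    n = X.shape[0]
    X = X - X.mean(axis=0)
    Y = Y - Y.mean(axis=0)
    Sx, Sy, Sxy = X.T @ X / n, Y.T @ Y / n, X.T @ Y / n

    def inv_sqrt(S):
        # symmetric inverse square root S^{-1/2} via eigendecomposition
        w, Q = np.linalg.eigh(S)
        return Q @ np.diag(1.0 / np.sqrt(w)) @ Q.T

    Sx_isqrt, Sy_isqrt = inv_sqrt(Sx), inv_sqrt(Sy)
    S_tilde = Sx_isqrt @ Sxy @ Sy_isqrt                  # whitened cross-covariance
    U, d, Vt = np.linalg.svd(S_tilde)

    Us = Sx_isqrt @ U[:, :r]                             # un-whitened u_1, ..., u_r (columns)
    Vs = Sy_isqrt @ Vt[:r, :].T                          # un-whitened v_1, ..., v_r (columns)
    return Us, Vs, d[:r]                                 # canonical correlations = singular values

rng = np.random.default_rng(10)
shared = rng.normal(size=(200, 2))                       # signal present in both X and Y
X = np.column_stack([shared, rng.normal(size=(200, 2))])                            # p = 4
Y = np.column_stack([shared @ rng.normal(size=(2, 2)), rng.normal(size=(200, 3))])  # q = 5
U_dirs, V_dirs, rho = cca(X, Y, r=2)
print(rho)                                               # first two canonical correlations near 1
```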
Terminology
- $(u_i, v_i)$ is called the $i$-th pair of canonical directions.
- $(a_i = X u_i, b_i = Y v_i)$ are called the canonical score vectors.
- $r_{a_i b_i} = u_i^T S_{xy} v_i$ is called the $i$-th canonical correlation. (This value is simply $d_i$, the $i$-th singular value of $\tilde{S}_{xy}$).