OVERVIEW: The PCA Workflow: From $X$ to $Z$

Principal component analysis (PCA) assumes that the underlying manifold is an $r$-dimensional affine subspace (a plane), where such a manifold is defined by a point $a$ and an orthogonal basis $v_1, \dots, v_r$.

This section outlines the complete Principal Component Analysis workflow, from the raw data $X$ to the new, low-dimensional data $Z$.


1. Preparation Stage

Goal: To calculate the sample covariance matrix $S$.

Variables Used:

  • $X$: The $n \times p$ raw data matrix (n samples, p features).
  • $n$: Number of samples.
  • $p$: Number of original features (dimensions).
  • $\bar{x}$: The $p \times 1$ sample mean vector.
  • $S$: The $p \times p$ sample covariance matrix.

Formulas & Actions:

Step 1.1: Compute the Mean Vector $\bar{x}$. First, we compute the mean of each feature:

$$ \bar{x} = \frac{1}{n} \sum_{i=1}^n x_i $$

Step 1.2: Center the Data. This is a critical step: we subtract the mean vector from every sample.

$$ X_{\text{centered}} \leftarrow X - \mathbf{1} \bar{x}^T $$

(where $\mathbf{1}$ is an $n \times 1$ vector of ones. From now on, we assume $X$ refers to this centered matrix). After this step, the mean of $X$ is the zero vector.

Step 1.3: Compute the Covariance Matrix $S$. Because $X$ is now centered ($\bar{x}=0$), we can use the simplified formula:

$$ S = \frac{1}{n} X^T X $$

(Note: Some implementations use $1/(n-1)$ for an unbiased estimate, but $1/n$ is the MLE and is common for PCA).
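
A minimal NumPy sketch of this preparation stage, assuming the raw data is already loaded into a hypothetical array `X_raw` of shape $(n, p)$:

```python
import numpy as np

# Hypothetical raw data: n = 100 samples, p = 5 features
rng = np.random.default_rng(0)
X_raw = rng.normal(size=(100, 5))

n, p = X_raw.shape

# Step 1.1: mean vector x_bar (shape (p,))
x_bar = X_raw.mean(axis=0)

# Step 1.2: center the data (subtract the mean from every row)
X = X_raw - x_bar                 # from here on, X denotes the centered matrix

# Step 1.3: sample covariance matrix with the 1/n convention
S = (X.T @ X) / n                 # equals np.cov(X_raw, rowvar=False, bias=True)
```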


2. Core Calculation Stage

Goal: To find the principal components (directions of maximum variance).

Variables Used:

  • $S$: The $p \times p$ covariance matrix from Stage 1.
  • $v$: A $p \times 1$ vector, representing an eigenvector (a principal component).
  • $\lambda$: A scalar, representing an eigenvalue (the variance along that component).

Formulas & Actions:

Step 2.1: Find Eigenvectors and Eigenvalues. We solve the “eigen-decomposition” equation for the covariance matrix $S$:

$$ S v = \lambda v $$
  • This is not a simple calculation; it’s a matrix decomposition.
  • The solution will yield $p$ eigenvectors ($v_1, \dots, v_p$) and $p$ corresponding eigenvalues ($\lambda_1, \dots, \lambda_p$).

Step 2.2: Sort the Components. This is the key to PCA: we sort the eigenvectors $v$ in descending order of their corresponding eigenvalues $\lambda$.

$$ (\lambda_1, v_1), (\lambda_2, v_2), \dots, (\lambda_p, v_p) $$

…such that $\lambda_1 \ge \lambda_2 \ge \dots \ge \lambda_p$.

  • $v_1$ is now the First Principal Component (PC1). It is the direction of maximum variance ($\lambda_1$).
  • $v_2$ is the Second Principal Component (PC2), and so on.
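
Continuing the NumPy sketch from Stage 1, this stage can be done with `np.linalg.eigh` (appropriate for the symmetric matrix $S$); the explicit sort is needed because `eigh` returns eigenvalues in ascending order:

```python
# Step 2.1: eigen-decomposition of the symmetric matrix S
eigvals, eigvecs = np.linalg.eigh(S)     # ascending eigenvalues, orthonormal eigenvectors

# Step 2.2: sort in descending order of eigenvalue
order = np.argsort(eigvals)[::-1]
lambdas = eigvals[order]                 # lambda_1 >= lambda_2 >= ... >= lambda_p
V = eigvecs[:, order]                    # V[:, 0] is PC1, V[:, 1] is PC2, ...
```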

3. Final Result Stage (Projection)

Goal: To create the new, low-dimensional data $Z$ by projecting $X$ onto the chosen components.

Variables Used:

  • $X$: The $n \times p$ centered data from Stage 1.
  • $v_1, \dots, v_p$: The sorted principal components from Stage 2.
  • $r$: The new number of dimensions we want (where $r < p$).
  • $W$: The $p \times r$ projection matrix.
  • $Z$: The $n \times r$ final, new data (our desired result).

Formulas & Actions:

Step 3.1: Choose the New Dimensionality $r$

  • How to choose? We look at the eigenvalues. The “Proportion of Variance Explained” by the first $r$ components is: $$ \text{PVE}_r = \frac{\lambda_1 + \lambda_2 + \dots + \lambda_r}{\lambda_1 + \lambda_2 + \dots + \lambda_p} $$
  • We typically choose $r$ such that $\text{PVE}_r \ge 95\%$ (or some other high threshold).

Step 3.2: Construct the Projection Matrix $W$. We create $W$ by stacking the first $r$ principal components (our sorted $v$ vectors) side by side as columns:

$$ W = \begin{bmatrix} | & | & & | \\ v_1 & v_2 & \dots & v_r \\ | & | & & | \end{bmatrix} $$

Step 3.3: Project $X$ to get $Z$. This is the final transformation: we multiply our $n \times p$ data $X$ by our $p \times r$ projection matrix $W$:

$$ Z = X W $$
  • Result: $Z$ is our new $n \times r$ data matrix.
    • The first column of $Z$ is the set of 1D coordinates for all $n$ samples on PC1.
    • The second column of $Z$ is the set of 1D coordinates for all $n$ samples on PC2, and so on.
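
Continuing the sketch, a hedged version of the projection stage (the 95% threshold is just one common choice):

```python
# Step 3.1: choose r so that the first r components explain at least 95% of the variance
cumulative_pve = np.cumsum(lambdas) / lambdas.sum()
r = int(np.searchsorted(cumulative_pve, 0.95) + 1)

# Step 3.2: projection matrix W (p x r): the first r sorted eigenvectors as columns
W = V[:, :r]

# Step 3.3: project the centered data onto the chosen components
Z = X @ W                                # shape (n, r); column j holds the PC(j+1) coordinates
```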

Detailed Theories


PCA: Maximum Variance View

Assume that the $n \times p$ data matrix $X$ is column-centered:

$$ X = \begin{pmatrix} x_1^T \\ \vdots \\ x_n^T \end{pmatrix} \in \mathbb{R}^{n \times p} $$

This means the sample mean $\bar{x}$ is the zero vector:

$$ \bar{x} = \frac{1}{n} \sum_{i=1}^n x_i = 0 $$

With this assumption, the sample covariance matrix $S$ (using $1/n$ as the denominator) simplifies to:

$$ S = \frac{1}{n} X^T X $$

Note: Why does $S = \frac{1}{n} X^T X$?

This formula is a simplified special case. It holds only if the data matrix $X$ has been column-centered, meaning the sample mean $\bar{x}$ is the zero vector ($\bar{x}=0$).

The derivation is as follows:

  1. The general definition of the sample covariance matrix (with a $1/n$ denominator) is:

    $$ S = \frac{1}{n} \sum_{i=1}^n (x_i - \bar{x})(x_i - \bar{x})^T $$
  2. We substitute our assumption that $\bar{x} = 0$:

    $$ S = \frac{1}{n} \sum_{i=1}^n (x_i - 0)(x_i - 0)^T = \frac{1}{n} \sum_{i=1}^n x_i x_i^T $$
  3. The “sum of outer products” $\sum_{i=1}^n x_i x_i^T$ is precisely the definition of the matrix multiplication $X^T X$, where $X^T$ is the $p \times n$ matrix whose columns are $x_i$:

    $$ X^T X = \begin{pmatrix} | & & | \\ x_1 & \dots & x_n \\ | & & | \end{pmatrix} \begin{pmatrix} \text{---} & x_1^T & \text{---} \\ & \vdots & \\ \text{---} & x_n^T & \text{---} \end{pmatrix} = \sum_{i=1}^n x_i x_i^T $$
  4. Therefore, the formula simplifies to:

    $$ S = \frac{1}{n} X^T X $$

Projection from p-dimensions to 1-dimension

We want to find a 1-dimensional subspace (a line through the origin), defined by a unit direction vector $v$, onto which we project the data.

  • $x_i$ is the original data point (a $p \times 1$ vector).
  • $\hat{x}_i$ is the projection of $x_i$ onto the line defined by $v$.
  • $z_i$ is the new coordinate of that projected point on the 1-D line (a scalar).

The relationship is:

$$ \hat{x}_i = z_i \cdot v $$

The new coordinate $z_i$ is computed by the dot product of $x_i$ and $v$:

$$ z_i = \langle x_i, v \rangle = x_i^T v $$

Projection Operator

The $p \times p$ projection matrix $P$ is $P = vv^T$. The projection $\hat{x}_i$ is:

$$ \hat{x}_i = P x_i = (v v^T) x_i = v (v^T x_i) = v \langle x_i, v \rangle = z_i v $$

(This is consistent with the note’s $z_i \cdot v$)


The New 1D Coordinate Vector $z$

We can stack all $n$ new coordinates into an $n \times 1$ vector $z$:

$$ z = \begin{pmatrix} z_1 \\ \vdots \\ z_n \end{pmatrix} $$

This can be computed efficiently as a matrix-vector product:

$$ z = X v $$

Expanding this, we see:

$$ z = \begin{pmatrix} x_1^T \\ \vdots \\ x_n^T \end{pmatrix} v = \begin{pmatrix} x_1^T v \\ \vdots \\ x_n^T v \end{pmatrix} = \begin{pmatrix} \langle x_1, v \rangle \\ \vdots \\ \langle x_n, v \rangle \end{pmatrix} $$

PC1: Finding the “best” $v$

A “good” $v$ maximizes the spread of the projections; “spread” is statistically defined as variance.

Goal: Find $v$ that maximizes the variance of $z$, denoted $S_z^2$.


Deriving the New Variance $S_z^2$

Let’s compute the mean $\bar{z}$ and variance $S_z^2$ of the new $z$ vector.

1. Mean $\bar{z}$: Since $X$ is centered ($\bar{x} = 0$), the mean of $z$ will also be 0:

$$ \bar{z} = \frac{1}{n} \mathbf{1}^T z = \frac{1}{n} \mathbf{1}^T (Xv) $$

$$ \bar{z} = \left( \frac{1}{n} \mathbf{1}^T X \right) v = \bar{x}^T v = 0^T v = 0 $$

2. Variance $S_z^2$: Since $\bar{z}=0$, the variance $S_z^2$ is the mean of the squared values:

$$ S_z^2 = \frac{1}{n} z^T z $$

Substitute $z = Xv$:

$$ S_z^2 = \frac{1}{n} (Xv)^T (Xv) = \frac{1}{n} (v^T X^T) (Xv) = \frac{1}{n} v^T X^T X v $$

Regroup the terms:

$$ S_z^2 = v^T \left( \frac{1}{n} X^T X \right) v $$

We know $S = \frac{1}{n} X^T X$ (the original covariance matrix), so:

$$ S_z^2 = v^T S v $$

This key formula connects the projected 1D variance ($S_z^2$) to the original $p \times p$ covariance matrix $S$ and the direction $v$.
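
As a quick numerical sanity check of this identity, continuing the NumPy sketch from the workflow above (`X`, `S`, `n`, `p`, `rng` as defined there):

```python
# Numerical check of S_z^2 = v^T S v for an arbitrary unit direction v
v = rng.normal(size=p)
v /= np.linalg.norm(v)                   # enforce ||v|| = 1

z = X @ v                                # projected 1D coordinates z = X v
var_z = (z @ z) / n                      # S_z^2 = (1/n) z^T z, since z has mean 0
assert np.isclose(var_z, v @ S @ v)      # matches v^T S v
```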


The Final Optimization Problem

The first principal component (PC1) is the vector $v$ that solves this optimization problem:

Maximize ( $v \in \mathbb{R}^p$ ):

$$ v^T S v $$

Subject to:

$$ \|v\| = 1 $$

(The constraint $\|v\|=1$ is required because we only care about the direction of $v$. Without it, we could make $v^T S v$ arbitrarily large by simply increasing the length of $v$.)

Personal note: the constraint $\|v\| = 1$ is chosen purely for convenience and standardization; by fixing the length we pick out the unique unit direction vector as the principal component. Without controlling the length in this way, the problem would be unbounded and have no solution.


Detour: Method of Lagrange Multipliers

Consider $x \in \mathbb{R}^p$ and two functions $f: \mathbb{R}^p \to \mathbb{R}$ and $g: \mathbb{R}^p \to \mathbb{R}$.

We want to maximize $f(x)$ subject to $g(x) = 0$.

  1. Form Lagrangian $\mathcal{L}(x, \lambda)$:

    $$ \mathcal{L}(x, \lambda) = f(x) + \lambda g(x) $$

    where $\lambda$ is the Lagrange multiplier.

  2. Find critical points of $\mathcal{L}$: We solve for $\nabla \mathcal{L}(x, \lambda) = 0$.

  3. Consider all critical points $(x^*, \lambda^*)$ of $\mathcal{L}$. Find the $x^*$ that results in the largest $f(x^*)$.


For multiple constraints

Maximize $f(x)$ subject to:

$$ \begin{cases} g_1(x) = 0 \\ \vdots \\ g_k(x) = 0 \end{cases} $$
  1. Form Lagrangian:

    $$ \mathcal{L}(x, \lambda_1, \dots, \lambda_k) = f(x) + \lambda_1 g_1(x) + \dots + \lambda_k g_k(x) $$
  2. Solve $\nabla \mathcal{L}(x, \lambda_1, \dots, \lambda_k) = 0$ and find all critical points $x^*, \lambda_1^*, \dots, \lambda_k^*$.

  3. Use the $x^*$ with the largest value of $f(x^*)$.


PC1: Solution via Lagrange Multipliers

The conclusion from the optimization is: PC1 (which is $v$) is the top eigenvector of $S$. The corresponding variance $S_z^2$ (which is $v^T S v$) is $\lambda$, the top eigenvalue of $S$.


Derivation

We start with the final optimization problem:

$$ \text{maximize } v^T S v \quad \text{subject to } \|v\|^2 = 1 $$

(where $v \in \mathbb{R}^p$)

We use the Method of Lagrange Multipliers. Let $f(v) = v^T S v$ and $g(v) = \|v\|^2 - 1 = v^T v - 1 = 0$.

1. Form the Lagrangian (here written with a minus sign on the multiplier, an equivalent convention to the $+\lambda g$ form in the detour above):

$$ \mathcal{L}(v, \lambda) = f(v) - \lambda g(v) $$

$$ \mathcal{L}(v, \lambda) = v^T S v - \lambda(v^T v - 1) $$

2. Find the gradient w.r.t. $v$ and set to zero:

$$ \nabla_v \mathcal{L}(v, \lambda) = 0 $$

Using matrix calculus rules ($\nabla_v(v^T A v) = 2Av$ for symmetric $A$, and $\nabla_v(v^T v) = 2v$):

$$ \nabla_v \mathcal{L} = 2Sv - 2\lambda v $$

Setting the gradient to zero:

$$ 2Sv - 2\lambda v = 0 \quad \implies \quad Sv = \lambda v $$

3. Interpret the result: The equation $Sv = \lambda v$ is the definition of an eigenvector. This means the optimal vector $v$ must be an eigenvector of the covariance matrix $S$.

4. Find which eigenvector is optimal: Our original goal was to maximize $f(v) = v^T S v$. We can now substitute the result $Sv = \lambda v$ into this objective function:

$$ \text{var}(z) = v^T S v = v^T (Sv) $$

$$ = v^T (\lambda v) = \lambda (v^T v) $$

From our constraint, we know that $\|v\|^2 = v^T v = 1$.

$$ \text{var}(z) = \lambda \cdot 1 = \lambda $$

Conclusion: To maximize the variance ($v^T S v$), we must maximize the eigenvalue ($\lambda$).

Therefore, the first principal component (PC1) is the eigenvector $v$ that corresponds to the maximum eigenvalue of $S$.


Terminology

  • $v_1 \in \mathbb{R}^p$ is the PC direction

  • $z_{(1)} = \begin{pmatrix} z_1 \\ \vdots \\ z_n \end{pmatrix} = X v_1 \in \mathbb{R}^n$ is the Score Vector

  • $\lambda_1 = v_1^T S v_1 \ge 0$ is the variance explained

  • $\hat{X}_1 = \begin{pmatrix} \hat{x}_1^T \\ \vdots \\ \hat{x}_n^T \end{pmatrix} = X v_1 v_1^T \in \mathbb{R}^{n \times p}$ are the projections onto the PC1 space


PC2 (Second Principal Component)

Goal: Find new direction $v$ with $\|v\|=1$ that maximizes $S_z^2$ and is perpendicular to $v_1$.

The optimization problem is:

maximize $v^T S v$ subject to $\|v\| = 1$ and $v \perp v_1$


PC2: Finding the Second Principal Component

From PC1, we know the goal is to find a direction $v$ that maximizes the projected variance $S_z^2 = v^T S v$. For PC2, we add a new constraint.

The Optimization Problem (PC2):

Find the new direction $v$ with $\|v\|=1$ that maximizes $S_z^2$ and is perpendicular to $v_1$.

maximize $v^T S v$ subject to $\|v\| = 1$ and $v \perp v_1$ (Note: $v \perp v_1$ can be written as $v^T v_1 = 0$)


Solution via Lagrange Multipliers

The conclusion is: PC2 ($v_2$) is the eigenvector of $S$ corresponding to the second-largest eigenvalue. The corresponding variance $S_z^2 = \lambda_2$ is the second-largest eigenvalue of $S$.


Derivation

We have one objective $f(v) = v^T S v$ and two constraints:

  1. $g_1(v) = \|v\|^2 - 1 = v^T v - 1 = 0$
  2. $g_2(v) = v^T v_1 = 0$

1. Form the Lagrangian: We introduce two multipliers, $\lambda$ and $\mu$.

$$ \mathcal{L}(v, \lambda, \mu) = v^T S v - \lambda(v^T v - 1) - \mu(v^T v_1) $$

2. Find the gradient w.r.t. $v$ and set to zero:

$$ \nabla_v \mathcal{L} = 2Sv - 2\lambda v - \mu v_1 = 0 $$

3. Solve for the multipliers: (The note shows this derivation): Left-multiply the gradient equation by $v_1^T$:

$$ v_1^T (2Sv - 2\lambda v - \mu v_1) = v_1^T 0 = 0 $$

$$ 2 v_1^T S v - 2\lambda (v_1^T v) - \mu (v_1^T v_1) = 0 $$

Using the constraints ($v_1^T v = 0$, $v_1^T v_1 = 1$) and the fact that $v_1$ is an eigenvector ($v_1^T S = \lambda_1 v_1^T$), this simplifies to $\mu = 0$.

4. Interpret the result: Since $\mu=0$, the gradient equation becomes:

$$ 2Sv - 2\lambda v = 0 \quad \implies \quad Sv = \lambda v $$

This means PC2, like PC1, must be an eigenvector of S.

5. Find which eigenvector: Our goal is to maximize $\text{var}(z) = v^T S v = \lambda$. We are constrained by $v^T v_1 = 0$.

A property of symmetric matrices (like $S$) is that their eigenvectors can be chosen to be mutually orthogonal. This means $v_2, v_3, \dots, v_p$ all automatically satisfy the constraint $v^T v_1 = 0$.

Our problem is now: “From the set of allowed eigenvectors $\{v_2, \dots, v_p\}$, which one has the largest eigenvalue $\lambda$?” Since we sorted them by $\lambda_1 \ge \lambda_2 \ge \dots$, the answer is $v_2$.

Conclusion:

$$ v \text{ corresponds to the second largest e.value} \implies v = v_2 $$

The second PC has direction $v_2$, score $z_{(2)} = X v_2$, and explains $\lambda_2 = v_2^T S v_2$ variance of the data.


PCr: The $r$-th Principal Component

The logic for finding PC2 can be generalized to find any $r$-th component.

The Optimization Problem (PCr): Find the new direction $v$ that maximizes variance, given that $v$ is orthogonal to all previously found components.

maximize $v^T S v$ subject to $\|v\| = 1$ and $v \perp v_1, \dots, v \perp v_{r-1}$

The Solution: The solution $v$ is $v_r$, the eigenvector of $S$ corresponding to the $r$-th largest eigenvalue. The corresponding variance $S_z^2$ is $\lambda_r$, the $r$-th largest eigenvalue of $S$.


The $(PC_1, \dots, PC_r)$-space

Once we have computed the first $r$ principal components, we can define the $r$-dimensional PC-space.

  • Directions $V_{(r)}$: This is the projection matrix, formed by stacking the first $r$ eigenvectors (PC directions) as columns:

    $$ V_{(r)} = \begin{bmatrix} | & | & & | \\ v_1 & v_2 & \dots & v_r \\ | & | & & | \end{bmatrix} \in \mathbb{R}^{p \times r} $$
  • Scores $Z_{(r)}$: This is the final, new, low-dimensional data. It is computed by projecting the centered data $X$ onto the new $r$-dimensional space.

    $$ Z_{(r)} = \begin{pmatrix} z_{(1)} & z_{(2)} & \dots & z_{(r)} \end{pmatrix} = X \cdot V_{(r)} \in \mathbb{R}^{n \times r} $$
    • This $n \times r$ matrix is our new dataset. Each row is a sample, and each column is a new (PC) feature.
  • Projections $\hat{X}_{(r)}$: This is the reconstructed data, or the “shadow” of the original data in the $p$-dimensional space.

    $$ \hat{X}_{(r)} = \begin{pmatrix} \hat{x}_1^T \\ \vdots \\ \hat{x}_n^T \end{pmatrix} = X V_{(r)} V_{(r)}^T \in \mathbb{R}^{n \times p} $$
    • This $n \times p$ matrix represents how much of the original data is “explained” by the first $r$ components.

Example: The $(PC_1, PC_2)$-space ($r=2$)

This is the most common case, used for 2D visualization.

  • Directions: $V_{(2)} = (v_1, v_2) \in \mathbb{R}^{p \times 2}$
  • Scores: $Z_{(2)} = (z_{(1)}, z_{(2)}) = X V_{(2)} \in \mathbb{R}^{n \times 2}$
  • Projections: $\hat{X}_{(2)} = X V_{(2)} V_{(2)}^T \in \mathbb{R}^{n \times p}$
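
A self-contained sketch of this $r=2$ case, using hypothetical random data in place of a real dataset:

```python
import numpy as np

rng = np.random.default_rng(1)
n, p, r = 200, 6, 2
X = rng.normal(size=(n, p))
X -= X.mean(axis=0)                      # column-center the data

S = (X.T @ X) / n
eigvals, eigvecs = np.linalg.eigh(S)
V_r = eigvecs[:, np.argsort(eigvals)[::-1][:r]]   # directions V_(2), shape (p, 2)

Z_r = X @ V_r                            # scores Z_(2), shape (n, 2): the 2D plotting coordinates
X_hat_r = X @ V_r @ V_r.T                # projections X_hat_(2), shape (n, p): rank-2 reconstruction
```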

Practical aspects of PCA

1. PCA is sensitive to scaling

PCA is a variance-maximizing algorithm, so it is highly sensitive to the scale of the original features.

  • Problem: If features have different units (e.g., meters vs. kilograms), their variances will be vastly different. PCA will incorrectly assign more importance to features with larger numerical variance.

  • Example:

    • weight (kg) vs. height (m): the variances might be comparable.
    • weight (kg) vs. height (cm): switching from meters to centimeters (e.g., 1.8 to 180) multiplies the values by 100, which inflates the variance of height by a factor of $100^2 = 10{,}000$. PCA will now be dominated by the height feature.
  • Solution: When necessary, standardize the data columns (a code sketch follows this list).

    • Before running PCA, center each feature (column $f_i$) and divide it by its standard deviation $s_i$.
    • $$ X = (f_1 \dots f_p) \quad \longrightarrow \quad X_{\text{standardized}} = \left( \frac{f_1}{s_1} \dots \frac{f_p}{s_p} \right) \quad \text{(with each column } f_i \text{ already centered)} $$
    • This ensures all features have a variance of 1 and are treated “fairly”.
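
A minimal sketch of this standardization step, assuming no feature has zero variance:

```python
import numpy as np

def standardize_columns(X_raw):
    """Center each column and divide by its standard deviation (a common pre-PCA step)."""
    X_centered = X_raw - X_raw.mean(axis=0)
    s = X_centered.std(axis=0)           # per-feature standard deviations (assumes none are zero)
    return X_centered / s                # every column now has mean 0 and variance 1

# Height in cm and height in m now yield the same standardized column,
# so the PCA result no longer depends on the choice of units.
```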

2. How to pick $r$?

The choice of $r$ (the number of components to keep) depends on your goal.

  • For visualization:

    • Pick $r=2$ (for a 2D scatter plot) or $r=3$ (for a 3D scatter plot).
  • For data compression / de-noising:

    • Compute the Proportion of Variance Explained (PVE).

Total variance in the data is the sum of all eigenvalues:

$$ \text{Total Variance} = \lambda_1 + \lambda_2 + \dots + \lambda_p $$

Variance Explained (VE) by each PC is its eigenvalue:

  • VE by $PC_1 = \lambda_1$
  • VE by $PC_p = \lambda_p$

Proportion of Variance Explained (PVE) by each PC is:

  • PVE by $PC_1 = \frac{\lambda_1}{\lambda_1 + \dots + \lambda_p}$
  • PVE by $PC_p = \frac{\lambda_p}{\lambda_1 + \dots + \lambda_p}$

To pick $r$, we compute the cumulative PVE and choose the smallest $r$ such that:

$$ \frac{\lambda_1 + \dots + \lambda_r}{\lambda_1 + \dots + \lambda_p} \ge \text{Threshold (e.g., 0.95 or 0.99)} $$
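
A small worked example with hypothetical eigenvalues, showing the per-component and cumulative PVE and the resulting choice of $r$:

```python
import numpy as np

# Hypothetical sorted eigenvalues of S
lambdas = np.array([5.0, 3.0, 1.2, 0.5, 0.3])

pve = lambdas / lambdas.sum()            # [0.50, 0.30, 0.12, 0.05, 0.03]
cumulative = np.cumsum(pve)              # [0.50, 0.80, 0.92, 0.97, 1.00]

threshold = 0.95
r = int(np.searchsorted(cumulative, threshold) + 1)   # smallest r with cumulative PVE >= 0.95, here r = 4
```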

Chapter 2: Probabilistic View at PCA


Probabilistic View at PCA

Here, we consider $X$ as a $p \times 1$ random vector, not a data matrix.

$$ X = \begin{pmatrix} x_1 \\ \vdots \\ x_p \end{pmatrix} \text{is a random vector} $$

(We assume $\mathbb{E}[X] = 0$, so $var(X) = \mathbb{E}[XX^T] = \Sigma$)


PC1

Find direction $v \in \mathbb{R}^p$ with $\|v\|=1$ such that $z = X^T v$ has maximum variance.

This is the optimization problem:

$$ \underset{v \in \mathbb{R}^p}{\text{maximize}} \quad \text{var}(X^T v) \quad \text{subject to} \quad \|v\|=1 $$

By the properties of variance of random vectors:

$$ \text{var}(z) = \text{var}(X^T v) = \text{var}(v^T X) = v^T \text{var}(X) v = v^T \Sigma v $$

The problem becomes:

$$ \text{maximize} \quad v^T \Sigma v \quad \text{subject to} \quad \|v\|=1 $$

The solution $v$ is the top eigenvector of $\Sigma$, i.e., $v = v_1$. The variance is:

$$ \text{var}(z) = v_1^T \Sigma v_1 = \lambda_1 $$

(The note uses $d_1$ for the eigenvalue $\lambda_1$)


PCr

Find direction $v \in \mathbb{R}^p$ with $\|v\|=1$ such that $z = X^T v$ has maximum variance and $v \perp v_1, \dots, v \perp v_{r-1}$. (Note: the source image writes $v_r$, which should be $v_{r-1}$.)

The solution is the eigenvector of $\Sigma$ corresponding to the $r$-th largest eigenvalue.


PCA: Minimum Reconstruction Error View

This is an alternative, but equivalent, formulation of PCA.

Goal: Find an $r$-dimensional plane that approximates “well” the data points $x_1, \dots, x_n$. “Well” means minimizing the total squared distance from the points to the plane.


1. Defining the $r$-dimensional Plane

To define this plane, we need:

  • A point $a \in \mathbb{R}^p$ (which acts as the “center” or “anchor” of the plane).
  • An orthogonal basis $v_1, \dots, v_r \in \mathbb{R}^p$ (the $r$ directions that define the plane).

Any point $\hat{x}$ on this plane can be represented as:

$$ \hat{x} = a + z_1 v_1 + \dots + z_r v_r $$

where $z_1, \dots, z_r \in \mathbb{R}$ are the coordinates of $\hat{x}$ within the plane.


2. The Reconstruction Error

In particular, we want to find the “shadow” or projection $\hat{x}_i$ for each original data point $x_i$. This $\hat{x}_i$ is the reconstruction of $x_i$ using only $r$ dimensions.

The formula for the $i$-th reconstruction is:

$$ \hat{x}_i = a + z_{i1} v_1 + \dots + z_{ir} v_r $$

We can write this more compactly using matrix notation:

  • Let $V = (v_1, \dots, v_r) \in \mathbb{R}^{p \times r}$ be the matrix of basis vectors.
  • Let the constraint be that $V$ is orthogonal: $V^T V = I$.
  • Let $\beta_i = \begin{pmatrix} z_{i1} \\ \vdots \\ z_{ir} \end{pmatrix} \in \mathbb{R}^r$ be the new $r$-dimensional score vector for the $i$-th point.

The reconstruction formula becomes:

$$ \hat{x}_i = a + V \beta_i $$

3. The Optimization Problem

The goal of $(PC_1, \dots, PC_r)$ is to minimize the reconstruction error.

The Reconstruction Error (RE) is the total sum of squared distances between each original point $x_i$ and its reconstruction $\hat{x}_i$:

$$ RE(a, V, \beta_1, \dots, \beta_n) = \sum_{i=1}^n \|x_i - \hat{x}_i\|^2 $$

Substituting the formula for $\hat{x}_i$, the PCA optimization problem becomes:

$$ \underset{a, V, \beta_i}{\text{minimize}} \quad \sum_{i=1}^n \|x_i - a - V \beta_i\|^2 $$

subject to $V^T V = I$.

(It can be proven that the solution to this problem is $a = \bar{x}$, $V$ is the matrix of the top $r$ eigenvectors of $S$, and $\beta_i = V^T (x_i - \bar{x})$)


Solution to the Minimum Reconstruction Error Problem

The optimization problem was:

$$ \underset{a, V, \beta_i}{\text{minimize}} \quad \sum_{i=1}^n \|x_i - a - V \beta_i\|^2 \quad \text{subject to} \quad V^T V = I $$

This problem can be solved in steps, which reveals its connection to the “Maximum Variance” view.


Step 1: Find Optimal $a$ and $\beta_i$ (given $V$)

It can be proven that the optimal $a$ and $\beta_i$ are:

  • Optimal $a = \bar{x}$ (the sample mean) (The note states this without proof, as it’s a standard result).

  • Optimal $\beta_i = V^T (x_i - \bar{x}) = V^T \tilde{x}_i$ (This is the score vector for the $i$-th point, found by projecting the centered data $\tilde{x}_i$ onto the $V$ basis. As the note mentions, this is the solution to a regression problem: $\beta_i = (V^T V)^{-1} V^T \tilde{x}_i = V^T \tilde{x}_i$ since $V^T V = I$).


Step 2: Rewrite the Reconstruction Error (RE)

Now we substitute $a = \bar{x}$ and $\beta_i = V^T \tilde{x}_i$ back into the RE formula. (Note: $\hat{x}_i = a + V\beta_i = \bar{x} + V V^T \tilde{x}_i$. So $x_i - \hat{x}_i = \tilde{x}_i - V V^T \tilde{x}_i = (I - VV^T)\tilde{x}_i$). We will now use $x_i$ to denote the centered data $\tilde{x}_i$ (as the note does for simplicity).

The RE becomes:

$$ RE = \sum_{i=1}^n \|x_i - V\beta_i\|^2 = \sum_{i=1}^n \|x_i - VV^T x_i\|^2 = \sum_{i=1}^n \|(I - VV^T)x_i\|^2 $$

Using the properties of trace (i.e., $\|z\|^2 = z^T z = \text{tr}(zz^T)$ and $\text{tr}(ABC) = \text{tr}(CAB)$):

$$ \|(I - VV^T)x_i\|^2 = x_i^T (I - VV^T)^T (I - VV^T) x_i $$

(Since $(I - VV^T)$ is a projection matrix, it is symmetric and idempotent, so $(I-P)^T(I-P) = (I-P)$)

$$ = x_i^T (I - VV^T) x_i = \text{tr}\left( x_i^T (I - VV^T) x_i \right) $$

$$ = \text{tr}\left( (I - VV^T) x_i x_i^T \right) $$

Now sum over $i$ and move the sum inside the trace:

$$ RE = \sum_{i=1}^n \text{tr}\left( (I - VV^T) x_i x_i^T \right) = \text{tr}\left( (I - VV^T) \sum_{i=1}^n x_i x_i^T \right) $$

Since $S = \frac{1}{n} \sum x_i x_i^T$ (for centered data), $\sum x_i x_i^T = nS$.

$$ RE = \text{tr}\left( (I - VV^T) (nS) \right) = n \cdot \text{tr}\left( (I - VV^T) S \right) $$

Step 3: Simplify and Connect to Maximum Variance

This is the final step. We use the linearity of trace ($\text{tr}(A-B) = \text{tr}(A) - \text{tr}(B)$):

$$ RE = n \cdot \text{tr}(S - VV^T S) = n \cdot \text{tr}(S) - n \cdot \text{tr}(VV^T S) $$

Using the cyclic property of trace ($\text{tr}(ABC) = \text{tr}(CAB)$):

$$ \text{tr}(VV^T S) = \text{tr}( (V^T) (S) (V) ) = \text{tr}(V^T S V) $$

So the RE is:

$$ RE = n \cdot \text{tr}(S) - n \cdot \text{tr}(V^T S V) $$

Our goal is to minimize $RE$.

  • $n \cdot \text{tr}(S)$ is the total variance of the data, which is a constant.
  • Therefore, minimizing $RE$ is equivalent to maximizing the term being subtracted: $$ \text{maximize} \quad \text{tr}(V^T S V) $$ Expanding $V = (v_1, \dots, v_r)$, we get: $$ \text{maximize} \quad \text{tr}(V^T S V) = v_1^T S v_1 + \dots + v_r^T S v_r $$ This is the exact same optimization problem as the “Maximum Variance” view.

Conclusion: The optimal $v_1, \dots, v_r$ are the eigenvectors of $S$ corresponding to the $r$ largest eigenvalues.


Final Approximation

The PCA reconstruction of a point $x$ is $\hat{x}$, which is its projection onto the $r$-dimensional plane.

$$ x \approx \hat{x} = \bar{x} + z_1 v_1 + \dots + z_r v_r $$
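
A short sketch of this reconstruction for a single point, using hypothetical random data and the top-$r$ eigenvectors of $S$:

```python
import numpy as np

rng = np.random.default_rng(2)
n, p, r = 150, 4, 2
X_raw = rng.normal(size=(n, p))

x_bar = X_raw.mean(axis=0)
Xc = X_raw - x_bar                       # centered data

S = (Xc.T @ Xc) / n
eigvals, eigvecs = np.linalg.eigh(S)
V = eigvecs[:, np.argsort(eigvals)[::-1][:r]]   # top-r eigenvectors of S (p x r)

x = X_raw[0]                             # one original point
z = V.T @ (x - x_bar)                    # its r scores (the beta_i of the text)
x_hat = x_bar + V @ z                    # reconstruction: x_bar + z_1 v_1 + ... + z_r v_r
squared_error = np.sum((x - x_hat) ** 2) # this point's contribution to the reconstruction error
```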

PCA via SVD

This provides an alternative and often more stable computational method for PCA.

Assume that the $n \times p$ data matrix $X$ is centered.

We can compute the Singular Value Decomposition (SVD) of $X$:

$$ X = U D V^T $$

(Where $U$ is $n \times p$, $D$ is $p \times p$, and $V^T$ is $p \times p$)

Now, let’s look at the covariance matrix $S$:

$$ S = \frac{1}{n} X^T X = \frac{1}{n} (U D V^T)^T (U D V^T) $$

$$ S = \frac{1}{n} (V D^T U^T) (U D V^T) $$

Since $U$ is column-orthogonal, $U^T U = I$:

$$ S = \frac{1}{n} V (D^T D) V^T = V \left( \frac{D^2}{n} \right) V^T $$

We also know the Eigendecomposition (EVD) of $S$ is $S = V \Lambda V^T$. By comparing these two forms, we find the key relationships:

  1. The Principal Component Directions $V_{(r)}$ are the top $r$ right singular vectors of $X$. (The eigenvectors of $S$ are the right singular vectors of $X$).

  2. Variance explained by each component is $\lambda_i = \frac{d_i^2}{n}$, where $d_i$ is the $i$-th singular value of $X$.


PCA Products via SVD

This gives us a very direct way to compute the PCA results:

  • The Scores $Z_{(r)}$: The scores $Z_{(r)} = X V_{(r)}$. We can substitute the SVD of $X$:

    $$ Z_{(r)} = X V_{(r)} = (U D V^T) V_{(r)} $$

    The note shows the derivation of this, which simplifies because $V^T V_{(r)} = \begin{pmatrix} I_r \\ 0 \end{pmatrix}$ (the first $r$ columns of the $p \times p$ identity):

    $$ Z_{(r)} = U_{(r)} D_{(r)} $$

    (Where $U_{(r)}$ and $D_{(r)}$ are the first $r$ columns/diagonal entries of $U$ and $D$). This means the scores are simply the scaled left singular vectors.

  • The Projections $\hat{X}_{(r)}$ (Reconstruction): The $r$-rank approximation of $X$ (which is the PCA projection) is:

    $$ \hat{X}_{(r)} = U_{(r)} D_{(r)} V_{(r)}^T $$

Interpretation

The full SVD of $X$ is a sum of $p$ components:

$$ X = d_1 u_1 v_1^T + \dots + d_r u_r v_r^T + d_{r+1} u_{r+1} v_{r+1}^T + \dots + d_p u_p v_p^T $$

PCA via SVD states that $X$ is approximately its reconstruction $\hat{X}_{(r)}$:

$$ X = \underbrace{\hat{X}_{(r)}}_{\text{signal}} + \underbrace{\text{"noise"}}_{\text{remaining components}} $$
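
A sketch verifying these relationships numerically with `np.linalg.svd` on hypothetical centered data: the right singular vectors give the PC directions, $d_i^2/n$ gives the explained variances, and the scores $U_{(r)} D_{(r)}$ coincide with $X V_{(r)}$:

```python
import numpy as np

rng = np.random.default_rng(3)
n, p, r = 300, 5, 2
X = rng.normal(size=(n, p))
X -= X.mean(axis=0)                                # centered data

U, d, Vt = np.linalg.svd(X, full_matrices=False)   # X = U diag(d) V^T, d in descending order

V_r = Vt[:r].T                                     # PC directions: top-r right singular vectors (p x r)
lambdas = d**2 / n                                 # variance explained by each component
Z_r = U[:, :r] * d[:r]                             # scores U_(r) D_(r) ...
assert np.allclose(Z_r, X @ V_r)                   # ... identical to projecting X onto V_(r)

X_hat_r = U[:, :r] @ np.diag(d[:r]) @ Vt[:r]       # rank-r reconstruction (the "signal" part of X)
```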

PCA: Low-Rank Matrix Approximation View

This fourth view of PCA connects all previous concepts (variance, reconstruction, SVD) through the language of linear algebra.


1. “r-dim Plane” is “Rank-r Matrix”

First, let’s establish that a set of points lying on an $r$-dimensional plane is equivalent to a low-rank matrix.

  • Assume a set of $n$ points $\hat{x}_1, \dots, \hat{x}_n$ belong to an $r$-dimensional plane (that passes through the origin).
  • This means each point $\hat{x}_i$ is a linear combination of $r$ orthogonal basis vectors $v_1, \dots, v_r$: $$ \hat{x}_i = z_{i1} v_1 + \dots + z_{ir} v_r $$
  • We can write the transpose of this: $$ \hat{x}_i^T = (z_{i1} \dots z_{ir}) \begin{pmatrix} \text{---} & v_1^T & \text{---} \\ & \vdots & \\ \text{---} & v_r^T & \text{---} \end{pmatrix} $$
  • If we stack all $n$ points into the $n \times p$ reconstruction matrix $\hat{X}$: $$ \hat{X} = \begin{pmatrix} \hat{x}_1^T \\ \vdots \\ \hat{x}_n^T \end{pmatrix} = \underbrace{\begin{pmatrix} z_{11} & \dots & z_{1r} \\ \vdots & & \vdots \\ z_{n1} & \dots & z_{nr} \end{pmatrix}}_{Z \text{ ($n \times r$)}} \underbrace{\begin{pmatrix} \text{---} & v_1^T & \text{---} \\ & \vdots & \\ \text{---} & v_r^T & \text{---} \end{pmatrix}}_{V^T \text{ ($r \times p$)}} $$
  • This shows that $\hat{X} = Z V^T$. By the properties of linear algebra, the rank of a matrix product is at most the smallest rank of the inputs.
  • Therefore, we have a key insight: $$ \text{rank}(\hat{X}) \le r $$ A set of points lying on an $r$-dim plane is a rank-$r$ matrix.

2. The Optimization Problem

The goal of $(PC_1, \dots, PC_r)$ is to find a low-rank $\hat{X}$ that approximates $X$.

This is formulated as the following optimization problem:

$$ \underset{\hat{X} \in \mathbb{R}^{n \times p}}{\text{minimize}} \quad \|X - \hat{X}\|_F^2 \quad \text{subject to} \quad \text{rank}(\hat{X}) \le r $$

This means: “Find the best rank-$r$ matrix approximation $\hat{X}$ for the data $X$.”

  • The objective function $\|X - \hat{X}\|_F^2$ is the reconstruction error, measured by the Frobenius norm squared.
  • The Frobenius norm is the sum of squared distances of all elements: $$ \|X - \hat{X}\|_F^2 = \sum_{i=1}^n \sum_{j=1}^p (x_{ij} - \hat{x}_{ij})^2 $$
  • This is mathematically equivalent to the sum of squared Euclidean distances for each sample (row), which is exactly the reconstruction error we saw earlier: $$ = \sum_{i=1}^n \|x_i - \hat{x}_i\|^2 $$

(The Eckart-Young theorem proves that the solution $\hat{X}$ to this problem is the $\hat{X}_{(r)}$ obtained from the SVD of $X$, thus unifying all views of PCA.)

PCA for Matrix Completion: The Hard-Impute Algorithm

This topic applies the low-rank approximation view of PCA to solve the problem of missing values in a dataset.


1. Problem Setup: Missing Values

Assume $X \in \mathbb{R}^{n \times p}$ has missing values (e.g., NA).

  • We denote $\Omega \subseteq \{1...n\} \times \{1...p\}$ as the set of observed entries in $X$.
  • We define $W \in \mathbb{R}^{n \times p}$ as the mask matrix. $$ W_{ij} = \begin{cases} 1 & \text{if } (i,j) \in \Omega \\ 0 & \text{if } (i,j) \notin \Omega \end{cases} $$

Example:

$$ X = \begin{pmatrix} \text{NA} & 1 \\ 0 & \text{NA} \\ \text{NA} & 2 \end{pmatrix} $$

The set of observed entries is:

$$ \Omega = \{ (1,2), (2,1), (3,2) \} $$

The corresponding mask matrix $W$ is:

$$ W = \begin{pmatrix} 0 & 1 \\ 1 & 0 \\ 0 & 1 \end{pmatrix} $$

2. Reconstruction Error with Missing Data

We can only measure the reconstruction error on the entries we observed ($\Omega$).

The error is the sum of squared differences only over $\Omega$:

$$ \sum_{(i,j) \in \Omega} (X_{ij} - \hat{X}_{ij})^2 $$

This can be expressed using the mask matrix $W$ and the Frobenius norm, where * denotes the element-wise product (Hadamard product):

$$ (A * B)_{ij} = A_{ij} \cdot B_{ij} $$

The error sum is equivalent to:

$$ \sum_{(i,j) \in \Omega} (X_{ij} - \hat{X}_{ij})^2 = \| W * (X - \hat{X}) \|_F^2 $$

This is because:

$$ \sum_{i=1}^n \sum_{j=1}^p W_{ij} \cdot (X_{ij} - \hat{X}_{ij})^2 = \sum_{i=1}^n \sum_{j=1}^p ( (W * (X - \hat{X}))_{ij} )^2 $$

The $W_{ij}=0$ terms “zero out” the error for missing entries, leaving only the error from $\Omega$.


3. Hard-Impute Algorithm

This algorithm iteratively solves the matrix completion problem. The goal is to find a low-rank matrix $\hat{X}$ that minimizes the observed reconstruction error.

Optimization Problem:

$$ \underset{\hat{X}}{\text{minimize}} \quad \|W * (X - \hat{X})\|_F^2 \quad \text{subject to} \quad \text{rank}(\hat{X}) = r $$

Algorithm:

  • Input: Matrix $X \in \mathbb{R}^{n \times p}$ (with NAs), rank $r$.
  • Initialization: A starting guess for $\hat{X} \in \mathbb{R}^{n \times p}$.

Repeat until convergence:

  1. Step 1 (Impute): Fill in the missing values of $X$ with the current guess from $\hat{X}$.

    $$ Y = W * X + (1-W) * \hat{X} $$

    (This creates a “completed” matrix $Y$ by taking observed values from $X$ and imputed values from $\hat{X}$)

  2. Step 2 (Project): Find the best rank-$r$ approximation of the completed matrix $Y$.

    $$ \text{Compute SVD of Y:} \quad Y = U D V^T $$

    $$ \text{Update } \hat{X}: \quad \hat{X} = U_{(r)} D_{(r)} V_{(r)}^T $$

    (This new $\hat{X}$ is the best rank-r approximation of $Y$, and it becomes the guess for the next iteration).

  • Output: The final converged matrix $\hat{X}$.

Hard-impute steps (in words):

  1. Impute missing values in $X$ with the elements from the current estimate $\hat{X}$.
  2. Make the resulting matrix $\hat{X}$ low-rank (by finding its best rank-r SVD approximation).
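
A minimal NumPy sketch of Hard-Impute under these steps; the function name `hard_impute`, the convergence tolerance, and the use of `np.nan` to mark missing entries are implementation choices, not part of the notes:

```python
import numpy as np

def hard_impute(X, r, n_iters=100, tol=1e-6):
    """Hard-Impute sketch: X uses np.nan for missing entries, r is the target rank."""
    W = ~np.isnan(X)                         # mask: True where observed
    X_obs = np.where(W, X, 0.0)              # observed values, with 0 in place of NaN
    X_hat = np.zeros_like(X_obs)             # initial guess

    for _ in range(n_iters):
        # Step 1 (Impute): observed entries from X, missing entries from the current X_hat
        Y = np.where(W, X_obs, X_hat)

        # Step 2 (Project): best rank-r approximation of Y via a truncated SVD
        U, d, Vt = np.linalg.svd(Y, full_matrices=False)
        X_hat_new = U[:, :r] @ np.diag(d[:r]) @ Vt[:r]

        converged = np.linalg.norm(X_hat_new - X_hat) < tol
        X_hat = X_hat_new
        if converged:
            break
    return X_hat

# Example from the text: a 3 x 2 matrix with three missing entries, completed with rank r = 1
X = np.array([[np.nan, 1.0],
              [0.0,    np.nan],
              [np.nan, 2.0]])
X_completed = hard_impute(X, r=1)
```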

4. Example of One Iteration

Let’s trace one loop of the algorithm.

$$ X = \begin{pmatrix} \text{NA} & 1 \\ 0 & \text{NA} \\ \text{NA} & 2 \end{pmatrix}, \quad W = \begin{pmatrix} 0 & 1 \\ 1 & 0 \\ 0 & 1 \end{pmatrix}, \quad \text{Current } \hat{X} = \begin{pmatrix} a & b \\ c & d \\ e & f \end{pmatrix} $$

Step 1: $Y = W * X + (1-W) * \hat{X}$

  • We need $(1-W)$: $$ (1-W) = \begin{pmatrix} 1 & 0 \\ 0 & 1 \\ 1 & 0 \end{pmatrix} $$
  • Now compute $Y$: $$ Y = \underbrace{\begin{pmatrix} 0 & 1 \\ 0 & 0 \\ 0 & 2 \end{pmatrix}}_{W * X} + \underbrace{\begin{pmatrix} a & 0 \\ 0 & d \\ e & 0 \end{pmatrix}}_{(1-W) * \hat{X}} = \begin{pmatrix} a & 1 \\ 0 & d \\ e & 2 \end{pmatrix} $$ The observed entries ($1$, $0$, $2$) are kept from $X$; the missing entries are filled in with $a$, $d$, $e$ from the current guess $\hat{X}$.

Step 2: $\hat{X} = U_{(r)} D_{(r)} V_{(r)}^T$

  • We take the matrix $Y = \begin{pmatrix} a & 1 \\ 0 & d \\ e & 2 \end{pmatrix}$
  • We compute its SVD: $Y = U D V^T$
  • We find its best rank-$r$ approximation (e.g., $r=1$): $$ \hat{X}_{\text{new}} = U_{(1)} D_{(1)} V_{(1)}^T = \begin{pmatrix} a' & b' \\ c' & d' \\ e' & f' \end{pmatrix} $$
  • This $\hat{X}_{\text{new}}$ is now fed back into Step 1 for the next iteration.

Chapter 3: Non-Linear Methods


PCA is a Linear Method

The reason PCA is considered a linear method is because the new “score” $z_i$ (the coordinate on a principal component $v$) is a linear combination of the original $p$ features ($f_1, \dots, f_p$, or $x_1, \dots, x_p$).

If $x = (f_1, \dots, f_p)^T$ is a single data point and $v = (v_1, \dots, v_p)^T$ is a principal component:

$$ z = v^T x = v_1 f_1 + v_2 f_2 + \dots + v_p f_p $$

Pros:

  • easy to implement (relies on standard EVD or SVD)
  • easy to interpret (the weights $v_j$ in the vector $v$ tell you how much original feature $f_j$ contributes to that component)

Cons:

  • has limited flexibility (it assumes the data lies on, or near, a flat $r$-dimensional plane, and fails on “curved” manifolds)

Modifications and Non-linear Methods

  • Modifications:

    • sparse PCA
    • kernel PCA (This is the non-linear extension we will look at)
  • Non-linear methods (different algorithms):

    • tSNE
    • UMAP

Kernel PCA

This note introduces a non-linear extension to PCA, for cases where linear PCA fails.

  • The Problem: Standard PCA finds a linear projection (PC1). On data that is not linearly separable (like two separable clusters or concentric circles), PC1 fails to capture the structure and mixes the classes.
  • Top-Left Diagram: Two clusters, PC1 projects them onto the same line.
  • Bottom-Left Diagram: Concentric circles, PC1 projects them onto the same line.

Idea: Transform Feature Space

The core idea is to transform the data into a higher-dimensional space where it becomes linearly separable.

$$ f_1, \dots, f_p \longrightarrow \Phi_1(f_1, \dots, f_p), \dots, \Phi_q(f_1, \dots, f_p) \quad (\text{where } q > p) $$

$$ \Leftrightarrow \quad x_i \in \mathbb{R}^p \longrightarrow \Phi(x_i) \in \mathbb{R}^q $$

…and then run PCA on the transformed data $\Phi(x_1), \dots, \Phi(x_n)$.


Example: The “Magic” of Feature Transformation

This example shows how the transformation works for the concentric circles data.

$$ \text{Example:} \quad (f_1, f_2) \to (f_1, f_2, f_1^2 + f_2^2) $$
  • Left Diagram: The original 2D data (concentric circles).
  • Right Diagram: The new 3D data. The new 3rd dimension ($f_1^2 + f_2^2$) is the squared radius. The inner (blue) circle has a low radius, the outer (red) circle has a high radius. The data is now linearly separable by a 2D plane.

PCA on $\Phi(x_1), \dots, \Phi(x_n)$ will compute scores. For the $i$-th component:

$$ Z_{(i)} = v_{i1}f_1 + v_{i2}f_2 + v_{i3}(f_1^2+f_2^2) $$

This is non-linear in the original features $f_1, f_2$.

The Kernel Trick

The Problem: The new feature space $q$ can be computationally expensive (or even infinitely large) to compute.

The Solution: The PCA algorithm can be reformulated to only use an $n \times n$ inner product matrix $K$ (or Gram matrix).

$$ \text{Denote the transformed data by } \Phi = \begin{pmatrix} \text{---} & \Phi(x_1)^T & \text{---} \\ & \vdots & \\ \text{---} & \Phi(x_n)^T & \text{---} \end{pmatrix} \in \mathbb{R}^{n \times q} $$

$$ \text{Let } K = \Phi \Phi^T \in \mathbb{R}^{n \times n} \text{ be the inner product matrix.} $$

$$ K_{ij} = \langle \Phi(x_i), \Phi(x_j) \rangle $$

A kernel function $\mathcal{K}$ is a “shortcut” that computes this inner product cheaply in the original space:

$$ \mathcal{K}(x_i, x_j) = \langle \Phi(x_i), \Phi(x_j) \rangle $$

Example: kernel function $\Rightarrow$ implicit feature mapping

  • Quadratic:

    $$ \mathcal{K}(a, b) = (1 + \langle a, b \rangle)^2 = 1 + 2\sum_i a_i b_i + \sum_i a_i^2 b_i^2 + 2\sum_{i<j} (a_i a_j)(b_i b_j) $$

    This simple calculation in the original space equals an inner product $\langle \Phi(a), \Phi(b) \rangle$ in a much higher-dimensional space, with the implicit feature map:

    $$ \Phi: (f_1, \dots, f_p) \to (1, f_1^2, \dots, f_p^2, \sqrt{2}f_1, \dots, \sqrt{2}f_p, \sqrt{2}f_1 f_2, \dots) $$
  • Polynomial:

    $$ \mathcal{K}(x, y) = (\langle x, y \rangle + d)^\gamma $$
  • Radial (RBF):

    $$ \mathcal{K}(x, y) = e^{-\gamma ||x-y||^2} $$

    (This implicitly maps to an infinite-dimensional space).

Kernel PCA: The Algorithm

The core insight of Kernel PCA is: you don’t need $\Phi$ to compute PCA scores, only $K$.

This works by running PCA in the $n$-dimensional “sample space” (using $K$) instead of the $q$-dimensional “feature space” (using $\Phi$).

The Algorithm

  • Input:

    • Data $X \in \mathbb{R}^{n \times p}$
    • Kernel function $\mathcal{K}(a, b)$
    • Target dimension $r$
  • Step 1: Compute the Kernel Matrix $K$

    • Compute the $n \times n$ matrix $K$ where: $$ K_{ij} = \mathcal{K}(x_i, x_j) $$
  • Step 2: Center the Kernel Matrix

    • (This is the crucial step to ensure the (implicit) data $\Phi$ is centered).
    • We must not use $K$, but a centered version $\tilde{K}$.
    • Define the $n \times n$ centering matrix $C$: $$ C = I - \frac{1}{n}\mathbf{1}\mathbf{1}^T $$
    • Compute the centered Gram matrix $\tilde{K}$: $$ \tilde{K} = C K C $$
  • Step 3: Compute Eigen-decomposition of $\tilde{K}$

    • Find the eigenvectors and eigenvalues of the $n \times n$ matrix $\tilde{K}$. $$ \tilde{K} U = U \Lambda $$
    • Keep the top $r$ eigenvectors $u_1, \dots, u_r$ (which form $U_{(r)}$) and their corresponding eigenvalues $\lambda_1, \dots, \lambda_r$ (which form $\Lambda_{(r)}$).
    • (Note: The note’s $d_i^2$ corresponds to $\lambda_i$)
  • Output: PC Scores $Z_{(r)}$

    • The new $r$-dimensional scores are the eigenvectors, scaled by the square root of their eigenvalues.
    • Let $D_{(r)} = \sqrt{\Lambda_{(r)}}$ (a diagonal matrix of $d_i = \sqrt{\lambda_i}$).
    • The $n \times r$ score matrix $Z_{(r)}$ is: $$ Z_{(r)} = U_{(r)} D_{(r)} $$
    • (The $i$-th PC score vector is $z_{(i)} = \sqrt{\lambda_i} u_i = d_i u_i$)
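
A minimal sketch of this algorithm using an RBF kernel (the kernel choice, `gamma`, and the concentric-circles example data are assumptions for illustration):

```python
import numpy as np

def rbf_kernel(A, B, gamma=1.0):
    """K(x, y) = exp(-gamma * ||x - y||^2) for all pairs of rows of A and B."""
    sq_dists = np.sum(A**2, axis=1)[:, None] + np.sum(B**2, axis=1)[None, :] - 2 * A @ B.T
    return np.exp(-gamma * sq_dists)

def kernel_pca(X, r, gamma=1.0):
    """Kernel PCA following the steps above, with an RBF kernel."""
    n = X.shape[0]

    # Step 1: kernel matrix K_ij = K(x_i, x_j)
    K = rbf_kernel(X, X, gamma)

    # Step 2: double-center the kernel matrix, K_tilde = C K C with C = I - (1/n) 1 1^T
    C = np.eye(n) - np.ones((n, n)) / n
    K_tilde = C @ K @ C

    # Step 3: eigen-decomposition of K_tilde, kept in descending order
    eigvals, eigvecs = np.linalg.eigh(K_tilde)
    order = np.argsort(eigvals)[::-1][:r]
    lambdas = np.clip(eigvals[order], 0.0, None)   # guard against tiny negative eigenvalues
    U_r = eigvecs[:, order]

    # Output: scores Z_(r) = U_(r) D_(r), where D_(r) = sqrt(Lambda_(r))
    return U_r * np.sqrt(lambdas)

# Example: two concentric circles become separable along the first kernel PC
theta = np.linspace(0, 2 * np.pi, 100)
X = np.vstack([np.c_[np.cos(theta), np.sin(theta)],            # inner circle (radius 1)
               np.c_[3 * np.cos(theta), 3 * np.sin(theta)]])   # outer circle (radius 3)
Z = kernel_pca(X, r=2, gamma=0.5)
```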

Mathematical Justification (Why this works)

This algorithm works by finding an equivalence between the SVD of $\Phi$ and the EVD of $K$.

  • Assume $\Phi$ is centered. (Achieved by using $\tilde{K}$).
  • Goal: We want the scores $Z = \Phi V$ (from standard PCA: data $\times$ directions).
  • SVD of $\Phi$: $\Phi = U D V^T$
    • As we saw in “PCA via SVD”, the scores are $Z = U D$.
  • EVD of $K$: $$ K = \Phi \Phi^T = (U D V^T)(\Phi)^T $$ $$ = (U D V^T)(V D^T U^T) = U D (V^T V) D^T U^T = U D D^T U^T $$ $$ K = U D^2 U^T $$
  • This is the key: $K = U (\Lambda) U^T$, where $\Lambda = D^2$.
    • The eigenvectors $U$ of $K$ are the left singular vectors $U$ of $\Phi$.
    • The eigenvalues $\Lambda$ of $K$ are the squared singular values $D^2$ of $\Phi$.
  • Conclusion:
    • By finding the eigenvectors $U$ and eigenvalues $\Lambda$ of $K$ (which we can compute), we have found…
    • …the $U$ and $D = \sqrt{\Lambda}$ needed to compute…
    • …the scores $Z = U D$.

Local Dimension Reduction: UMAP

UMAP (Uniform Manifold Approximation and Projection) is a powerful, modern non-linear dimension reduction method, often used for visualization. It is not an extension of PCA, but a different approach.

The UMAP Algorithm (Conceptual)

Given high-dimensional data $x_1, \dots, x_n \in \mathbb{R}^p$.

  1. Construct High-Dimensional Graph:

    • UMAP constructs a weighted K-neighbour graph in the original $p$-dimensional space.
    • This graph represents the “topological structure” (i.e., “who is a neighbor to whom”) of the high-dim data.
    • (Uses a technique called “fuzzy simplicial complex”).
  2. Construct Low-Dimensional Graph:

    • UMAP initializes a $r$-dimensional embedding (e.g., $r=2$) $z_1, \dots, z_n \in \mathbb{R}^r$.
    • It constructs a similar weighted K-neighbour graph for these points in the low-dim space.
  3. Optimize:

    • UMAP iteratively adjusts the positions of $z_1, \dots, z_n$ to make the low-dimensional graph as “similar” as possible to the high-dimensional graph.
    • (It uses cross-entropy as a cost function to measure this similarity).

Two Main Parameters:

  • n-neighbors: The number of nearest neighbors to consider. This is the most important parameter.
    • A small value (e.g., 5) forces UMAP to focus on local structure.
    • A large value (e.g., 50) allows UMAP to preserve more global structure.
  • min-dist: Controls how tightly UMAP packs neighbors together in the final embedding.
    • This is mostly a cosmetic parameter for the visualization.

Compared to t-SNE, UMAP is faster and better at preserving global structure.


UMAP vs. PCA

Pros (+) of UMAP:

  • Non-linear: Its primary advantage. It can find complex, curved manifolds that PCA (which is linear) would miss.
  • Excellent for Visualization: Captures local neighborhood structure very well, revealing clusters.

Cons (-) of UMAP:

  • Slower than PCA (but faster than t-SNE).
  • Can struggle with noise.
  • Less Interpretable (Critical Warning):
    • There is no meaning to the UMAP coordinate axes (the X and Y axes are arbitrary).
    • The size of a cluster in a UMAP plot is not interpretable.
    • The distances between clusters are not interpretable.
    • UMAP only preserves local neighbor relationships.

Practical Aspects

Sometimes it’s better to combine PCA & UMAP. This is a common and highly effective workflow:

Filter $\to$ Apply PCA $\to$ Apply UMAP

  • Why? Use PCA first as a fast, linear de-noising step (e.g., to reduce 1000 dimensions to 50).
  • Then, use UMAP to perform the “hard” non-linear reduction (e.g., from 50 dimensions to 2).
  • This is often faster and produces cleaner results than using UMAP on the full, noisy, high-dimensional data.
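
A sketch of this combined workflow, assuming the scikit-learn and umap-learn packages are available; the dimensions (1000 → 50 → 2) and parameter values are just illustrative:

```python
import numpy as np
from sklearn.decomposition import PCA
import umap   # provided by the umap-learn package

rng = np.random.default_rng(4)
X = rng.normal(size=(1000, 1000))    # hypothetical high-dimensional data (n = 1000, p = 1000)

# Step 1: fast linear de-noising with PCA (1000 -> 50 dimensions)
X_pca = PCA(n_components=50).fit_transform(X)

# Step 2: non-linear reduction with UMAP (50 -> 2 dimensions) for visualization
embedding = umap.UMAP(n_neighbors=15, min_dist=0.1, n_components=2).fit_transform(X_pca)
```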

Canonical Correlation Analysis (CCA)

Goal: Study the association between two sets of variables (two data matrices), $X$ and $Y$.

  • $X = \begin{pmatrix} x_1^T \\ \vdots \\ x_n^T \end{pmatrix} \in \mathbb{R}^{n \times p}$ ($n$ samples, $p$ features)
  • $Y = \begin{pmatrix} y_1^T \\ \vdots \\ y_n^T \end{pmatrix} \in \mathbb{R}^{n \times q}$ ($n$ samples, $q$ features)

The Problem: From 1D to Multi-dimensional

If $p=q=1$ (i.e., both $X$ and $Y$ are single vectors):

  • Association is measured via the sample correlation $r_{xy}$.
  • If $y = ax + b$ exactly (a perfect linear relationship):
    • $a > 0 \implies r_{xy} = 1$
    • $a < 0 \implies r_{xy} = -1$
  • $r_{xy} = 0 \implies$ no linear association (a non-linear relationship may still exist).

If $p > 1$ and $q > 1$ (the multivariate case):

  • We could compute a $p \times q$ cross-correlation matrix $R_{xy}$.
  • Let $x = (f_1, \dots, f_p)$ and $y = (g_1, \dots, g_q)$ be the sets of features.
  • The matrix $R_{xy}$ has entries $(R_{xy})_{ij} = r_{f_i, g_j}$.
  • This matrix is large and difficult to interpret.

The motivating question for CCA:

$$ \text{How to aggregate information in } R_{xy}? $$

(CCA aims to find a few “canonical correlations” that summarize the main relationships, instead of looking at all $p \times q$ individual correlations.)


Canonical Correlation Analysis (CCA)

CCA 1: The Problem Definition

Assume that $X \in \mathbb{R}^{n \times p}$ and $Y \in \mathbb{R}^{n \times q}$ are column-centered.

The goal of the first canonical correlation (CCA 1) is to: Find $u \in \mathbb{R}^p$ and $v \in \mathbb{R}^q$ such that $a = Xu$ and $b = Yv$ are maximally correlated.

We aim to:

$$ \underset{u \in \mathbb{R}^p, v \in \mathbb{R}^q}{\text{maximize}} \quad r_{ab} $$
  • $a = Xu = \begin{pmatrix} a_1 \\ \vdots \\ a_n \end{pmatrix}$ summarizes $X$ into a 1D score.
  • $b = Yv = \begin{pmatrix} b_1 \\ \vdots \\ b_n \end{pmatrix}$ summarizes $Y$ into a 1D score.
  • $r_{ab}$ summarizes the association between $X$ and $Y$.

CCA Solution: Transformation of the Problem

The correlation $r_{ab}$ can be written using the covariance matrices of $X$ ($S_x$), $Y$ ($S_y$), and the cross-covariance $S_{xy}$:

$$ r_{ab} = \frac{S_{ab}}{\sqrt{S_a^2 S_b^2}} = \frac{u^T S_{xy} v}{\sqrt{(u^T S_x u) (v^T S_y v)}} $$
  • $S_{ab} = u^T S_{xy} v$
  • $S_a^2 = u^T S_x u$
  • $S_b^2 = v^T S_y v$

The correlation $r_{ab}$ is invariant to the scale of $u$ and $v$. (e.g., if $\tilde{u} = \alpha u$ and $\tilde{v} = \beta v$, $r_{\tilde{a}\tilde{b}} = r_{ab}$).

This allows us to restate the CCA problem as a constrained optimization problem. We can fix the denominator to 1 and maximize the numerator:

$$ \underset{u \in \mathbb{R}^p, v \in \mathbb{R}^q}{\text{maximize}} \quad u^T S_{xy} v $$

$$ \text{subject to} \quad u^T S_x u = 1 \quad \text{and} \quad v^T S_y v = 1 $$

This is still hard to solve. We “whiten” the variables by setting:

$$ \tilde{u} = S_x^{1/2} u \quad \text{and} \quad \tilde{v} = S_y^{1/2} v $$

This implies:

$$ u = S_x^{-1/2} \tilde{u} \quad \text{and} \quad v = S_y^{-1/2} \tilde{v} $$

The constraints now become:

$$ u^T S_x u = (S_x^{-1/2} \tilde{u})^T S_x (S_x^{-1/2} \tilde{u}) = \tilde{u}^T S_x^{-1/2} S_x S_x^{-1/2} \tilde{u} = \tilde{u}^T I \tilde{u} = \|\tilde{u}\|^2 = 1 $$

$$ v^T S_y v = \dots = \|\tilde{v}\|^2 = 1 $$

The objective function becomes:

$$ u^T S_{xy} v = (S_x^{-1/2} \tilde{u})^T S_{xy} (S_y^{-1/2} \tilde{v}) = \tilde{u}^T \left( S_x^{-1/2} S_{xy} S_y^{-1/2} \right) \tilde{v} $$

Let’s define $\tilde{S}_{xy} = S_x^{-1/2} S_{xy} S_y^{-1/2}$.

The equivalent problem is:

$$ \underset{\tilde{u}, \tilde{v}}{\text{maximize}} \quad \tilde{u}^T \tilde{S}_{xy} \tilde{v} $$

$$ \text{subject to} \quad \|\tilde{u}\|^2 = 1 \quad \text{and} \quad \|\tilde{v}\|^2 = 1 $$

Detour: SVD via Optimization (The Tool)

This note proves how to solve exactly this form of problem. Given a matrix $A \in \mathbb{R}^{n \times p}$ with SVD $A = U D V^T$, the solution to the problem:

$$ \underset{u \in \mathbb{R}^n, v \in \mathbb{R}^p}{\text{maximize}} \quad u^T A v \quad \text{subject to} \quad \|u\| = 1, \|v\| = 1 $$

is:

  • $u = u_1$ (the top left singular vector of $A$)
  • $v = v_1$ (the top right singular vector of $A$)
  • The optimal $u^T A v = d_1$ (the top singular value of $A$).

(The note proves this using Lagrange multipliers, showing $v$ is an eigenvector of $A^T A$ (i.e., a right singular vector) and $u$ is an eigenvector of $A A^T$ (i.e., a left singular vector), and the max value is $d_1$ when $i=j=1$.)


Final Solution

We now apply this “tool” to our “transformed CCA problem”.

  • Problem: maximize $\tilde{u}^T \tilde{S}_{xy} \tilde{v}$
  • Tool: The solution is the top singular vectors of the matrix $\tilde{S}_{xy}$.

Therefore:

  • Optimal $\tilde{u} = \tilde{u}_1$ and Optimal $\tilde{v} = \tilde{v}_1$ are the top left and right singular vectors of $\tilde{S}_{xy} = S_x^{-1/2} S_{xy} S_y^{-1/2}$.

These are the solutions for the transformed problem. To get the final $u$ and $v$ for our original problem, we “un-whiten” them:

  • Optimal $u = S_x^{-1/2} \tilde{u}_1$
  • Optimal $v = S_y^{-1/2} \tilde{v}_1$

CCA 2: Finding the Second Canonical Correlation

The Problem: We want to find a new pair of vectors $u \in \mathbb{R}^p$ and $v \in \mathbb{R}^q$ such that their scores $a = Xu$ and $b = Yv$ are maximally correlated, but with new constraints:

  • $a$ is uncorrelated with $a_1 = X u_1$
  • $b$ is uncorrelated with $b_1 = Y v_1$

This is equivalent to the optimization problem:

$$ \underset{u \in \mathbb{R}^p, v \in \mathbb{R}^q}{\text{maximize}} \quad r_{ab} \quad \text{subject to} \quad r_{a a_1} = 0 \text{ and } r_{b b_1} = 0 $$

CCAr: Finding the $r$-th Canonical Correlation

We can generalize this to find the $r$-th pair of vectors, $(u_r, v_r)$.

The Optimization Problem: Find $u$ and $v$ that…

$$ \underset{u \in \mathbb{R}^p, v \in \mathbb{R}^q}{\text{maximize}} \quad u^T S_{xy} v $$

…subject to the standard constraints:

$$ u^T S_x u = 1 \quad \text{and} \quad v^T S_y v = 1 $$

…and subject to the new orthogonality constraints:

$$ u^T S_x u_i = 0 \quad \text{for } i = 1, \dots, r-1 $$

$$ v^T S_y v_i = 0 \quad \text{for } i = 1, \dots, r-1 $$

The Solution via SVD

This problem seems complex, but the “whitening” transformation makes it simple.

1. Restate the problem (for CCA2):

The problem is: maximize $u^T S_{xy} v$ subject to $u^T S_x u = v^T S_y v = 1$ and $u^T S_x u_1 = v^T S_y v_1 = 0$.

2. Transform the problem:

This problem can be shown to be equivalent to:

$$ \underset{\tilde{u}, \tilde{v}}{\text{maximize}} \quad \tilde{u}^T \tilde{S}_{xy} \tilde{v} $$

Subject to

$$ \|\tilde{u}\|^2 = 1, \quad \|\tilde{v}\|^2 = 1 \quad \text{and} \quad \tilde{u}^T \tilde{u}_1 = 0, \quad \tilde{v}^T \tilde{v}_1 = 0 $$

where $\tilde{S}_{xy} = S_x^{-1/2} S_{xy} S_y^{-1/2}$, $\tilde{u} = S_x^{1/2} u$, and $\tilde{u}_1 = S_x^{1/2} u_1$.

3. The SVD Magic:

  • We know the solution to maximize $\tilde{u}^T \tilde{S}_{xy} \tilde{v}$ is the pair of top singular vectors ($\tilde{u}_1$, $\tilde{v}_1$).

  • The SVD naturally produces an orthogonal basis. The second singular vectors ($\tilde{u}_2$, $\tilde{v}_2$) are guaranteed to be orthogonal to the first (i.e., $\tilde{u}_2^T \tilde{u}_1 = 0$ and $\tilde{v}_2^T \tilde{v}_1 = 0$).

  • Therefore, the second singular vectors ($\tilde{u}_2$, $\tilde{v}_2$) are the optimal solution that automatically satisfies the new constraints.

4. General Solution:

  • Thus $\tilde{u} = \tilde{u}_r$ and $\tilde{v} = \tilde{v}_r$, the left and right singular vectors corresponding to the $r$-th largest singular value of $\tilde{S}_{xy}$.

  • And the optimal @@u_r@@ and @@v_r@@ are found by “un-whitening”:

    $$ u_r = S_x^{-1/2} \tilde{u}_r, \quad v_r = S_y^{-1/2} \tilde{v}_r $$

CCA Computations (The Algorithm for $CCA_1 \dots CCA_r$)

  • Input: data $X \in \mathbb{R}^{n \times p}$ and $Y \in \mathbb{R}^{n \times q}$
  • Step 1: Compute $S_x, S_y, S_{xy}$ (from centered data).
  • Step 2: Compute the “whitened” cross-covariance matrix: $$ \tilde{S}_{xy} = S_x^{-1/2} S_{xy} S_y^{-1/2} $$
  • Step 3: Find the SVD of $\tilde{S}_{xy}$ to get the top $r$ singular vectors: $$ \tilde{S}_{xy} \to (\tilde{u}_1 \dots \tilde{u}_r) \text{ and } (\tilde{v}_1 \dots \tilde{v}_r) $$
  • Step 4: Transform back (un-whiten) to get the canonical directions: $$ u_i = S_x^{-1/2} \tilde{u}_i, \quad v_i = S_y^{-1/2} \tilde{v}_i \quad \text{for } i=1 \dots r $$
  • Output: The pairs $(u_1, v_1), \dots, (u_r, v_r)$
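
A minimal NumPy sketch of this algorithm; the inverse square roots are computed via eigen-decompositions, which assumes $S_x$ and $S_y$ are positive definite:

```python
import numpy as np

def cca(X, Y, r):
    """CCA sketch following the steps above; X is n x p, Y is n x q."""
    n = X.shape[0]
    X = X - X.mean(axis=0)                   # Step 0: column-center both matrices
    Y = Y - Y.mean(axis=0)

    # Step 1: covariance and cross-covariance matrices (1/n convention)
    Sx, Sy, Sxy = X.T @ X / n, Y.T @ Y / n, X.T @ Y / n

    def inv_sqrt(S):
        """S^{-1/2} via the eigen-decomposition of S (assumes S is positive definite)."""
        w, V = np.linalg.eigh(S)
        return V @ np.diag(1.0 / np.sqrt(w)) @ V.T

    # Step 2: whitened cross-covariance matrix
    Sxy_tilde = inv_sqrt(Sx) @ Sxy @ inv_sqrt(Sy)

    # Step 3: SVD of the whitened matrix
    U, d, Vt = np.linalg.svd(Sxy_tilde)

    # Step 4: un-whiten to obtain the canonical directions
    U_dirs = inv_sqrt(Sx) @ U[:, :r]         # columns u_1, ..., u_r
    V_dirs = inv_sqrt(Sy) @ Vt[:r].T         # columns v_1, ..., v_r
    return U_dirs, V_dirs, d[:r]             # d[:r] are the canonical correlations

# Example usage on hypothetical correlated data
rng = np.random.default_rng(5)
X = rng.normal(size=(200, 4))
Y = X @ rng.normal(size=(4, 3)) + 0.5 * rng.normal(size=(200, 3))
u_dirs, v_dirs, rho = cca(X, Y, r=2)
```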

Terminology

  • $(u_i, v_i)$ is called the $i$-th pair of canonical directions.
  • $(a_i = X u_i, b_i = Y v_i)$ are called the canonical score vectors.
  • $r_{a_i b_i} = u_i^T S_{xy} v_i$ is called the $i$-th canonical correlation. (This value is simply $d_i$, the $i$-th singular value of $\tilde{S}_{xy}$).