# Explanation of the geometric meaning of PC1 and how to find PC1 through the data matrix
First, conventions for the data matrix:
- The number of rows is the number of data points; each row is one observation.
- The number of columns is the dimension of the data; each column is one feature.
Take question (a) as an example.
1. Geometric intuition for data
Observation data matrix $X$:
$$ X = \begin{pmatrix} 0 & 0 \\ 1 & 1 \\ -1 & -1 \end{pmatrix} $$

The coordinates of these three sample points on the two-dimensional plane are:
- Point 1: $(0, 0)$
- Point 2: $(1, 1)$
- Point 3: $(-1, -1)$
A scatterplot of these points is a straight line: all three points fall exactly on the line $x_2 = x_1$ (the 45-degree line $y = x$).
**2. Why is this the first principal component?**
The goal of PCA is to find the direction with the largest variance (Variance Maximization).
- Along the line $x_2 = x_1$, the data are most spread out, so the variance is largest.
- Perpendicular to this line, the data do not vary at all (variance is 0).
Therefore, the first principal component points in the direction $(1, 1)$.
3. Normalization: PCA requires that the loading vector (i.e., the direction vector) be a unit vector, that is, its length (norm) is 1.
The current unnormalized direction vector is $u = \begin{pmatrix} 1 \\ 1 \end{pmatrix}$. Calculate its length $\|u\|$:
$$ \|u\| = \sqrt{1^2 + 1^2} = \sqrt{2} $$

Divide by the length to normalize:
$$ v_1 = \frac{u}{\|u\|} = \begin{pmatrix} \frac{1}{\sqrt{2}} \\ \frac{1}{\sqrt{2}} \end{pmatrix} $$

This is where the $v_1$ in the answer comes from.
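The normalization step can be checked numerically. A minimal sketch, assuming NumPy is available (the variable names are mine, not from the original answer):

```python
import numpy as np

# Data matrix from part (a): rows = observations, columns = features
X = np.array([[0.0, 0.0],
              [1.0, 1.0],
              [-1.0, -1.0]])

# Unnormalized direction along the line x2 = x1
u = np.array([1.0, 1.0])

# Divide by the norm to obtain the unit loading vector v1
v1 = u / np.linalg.norm(u)

print(v1)                  # approximately (1/sqrt(2), 1/sqrt(2))
print(np.linalg.norm(v1))  # 1 (up to floating point)
```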
Returning to the answer to (a): why is the variance largest along this line?
In short, PCA (Principal Component Analysis) asks one question: “From which angle should I look at this data so that it appears most spread out and scattered?”
The key idea: **variance = the degree of data spread (dispersion)**.
Intuition: understand it via projection onto the coordinate axes defined by the PCs.
1. The geometry of the data: they are a line
First look at your data points:
- Point 1: $(0, 0)$
- Point 2: $(1, 1)$
- Point 3: $(-1, -1)$
If drawn on paper, these three points line up perfectly, like candied haws on a skewer. The straight line they lie on is $y=x$ (that is, $x_2 = x_1$ in the figure), the 45-degree diagonal.
2. What are “projection” and “variance”?
Imagine you shine a flashlight on this string of “candied haws” and project their shadow (projection) onto the wall. PCA is rotating the flashlight to find the angle that makes the shadow longest.
Let’s compare the two directions:
Case A: Look along the line $y=x$ (that is, the direction of the first principal component)
If you project these three points onto the straight line $y=x$ (or in other words, you measure their distance along this line):
- $(0,0)$ is at the origin.
- $(1,1)$ is far from the origin (the distance is $\sqrt{2} \approx 1.41$).
- $(-1,-1)$ is also far away from the origin, at the other end.
Conclusion: the data points are stretched out along the line and very scattered. **Spread out = large variance.**
Case B: Look perpendicular to the $y=x$ line (i.e. $y=-x$ direction)
If you project these three points onto the perpendicular direction (that is, you “flatten” the line from the side):
- $(0,0)$ is still at the origin.
- $(1,1)$ projects onto the origin, since its displacement is entirely along the line itself.
- $(-1,-1)$ likewise projects onto the origin.
Conclusion: all points pile up at a single point (0). Not scattered at all. **No spread = zero variance.**
3. Why is the “maximum variance” the “first principal component”?
In data analysis, variance represents the amount of information.
- If the data are crowded together (small variance), all samples look the same: no differences, no information.
- If the data are widely separated (large variance), you can clearly see the differences between samples, and the amount of information is greatest.
Back to the figure: it says “Along the line $x_2=x_1$, the data is most spread out and has the largest variance.” What this means is: because these three points themselves form a line, describing them along this line retains all position information (you know who is in front of whom). This is the direction that retains the most information, so it is the First Principal Component (First PC).
Summary
- Maximum variance = The longest shadow = The widest point distribution.
- Because the points all lie on $y=x$, the spread must be widest along that direction.
- For any direction deviating from this 45-degree line, the distances between points “shrink” after projection, and the variance decreases.
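The “rotating flashlight” picture can be verified directly: sweep the projection direction through all angles and record the variance of the shadow. This is an illustrative sketch assuming NumPy (the angle grid is my own choice):

```python
import numpy as np

X = np.array([[0.0, 0.0],
              [1.0, 1.0],
              [-1.0, -1.0]])

# Sweep candidate unit directions in 1-degree steps and record
# the variance of the projected coordinates for each
angles = np.linspace(0, np.pi, 181)
variances = []
for theta in angles:
    v = np.array([np.cos(theta), np.sin(theta)])  # unit vector at angle theta
    z = X @ v                                     # projected coordinates (scores)
    variances.append(np.mean(z ** 2))             # population variance (mean is 0)

best = angles[int(np.argmax(variances))]
print(np.degrees(best))   # 45 degrees: variance is largest along y = x
print(max(variances))     # 4/3, matching part (b)
```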
(b) Calculating the Variance Explained by PC1
1. Projection: we project the original data $X$ onto the new coordinate axis $v_1$ to get the new coordinate values (the scores, denoted $z$). Mathematically, projection is a dot product:
$$ z_{i(1)} = x_i^\top v_1 $$

Let’s work through the calculation point by point:
- Point 1 $(0,0)$: $$0 \cdot \frac{1}{\sqrt{2}} + 0 \cdot \frac{1}{\sqrt{2}} = 0$$
- Point 2 $(1,1)$: $$1 \cdot \frac{1}{\sqrt{2}} + 1 \cdot \frac{1}{\sqrt{2}} = \frac{2}{\sqrt{2}} = \sqrt{2}$$
- Point 3 $(-1,-1)$: $$-1 \cdot \frac{1}{\sqrt{2}} + (-1) \cdot \frac{1}{\sqrt{2}} = -\frac{2}{\sqrt{2}} = -\sqrt{2}$$
Therefore, the projected data (Scores) are: $z_{(1)} = (0, \sqrt{2}, -\sqrt{2})^\top$.
2. Variance calculation: the question asks how much variance is explained, i.e., the variance of these $z$ values. First check the mean: $\bar{z} = \frac{0 + \sqrt{2} + (-\sqrt{2})}{3} = 0$. The mean is 0, which makes the calculation much easier.
According to the formula in the answer (note: the population variance is used here, dividing by $n=3$ rather than by $n-1$; some textbook treatments of PCA simplify the decomposition by defining the explained variances directly as the eigenvalues of $\frac{1}{n}X^\top X$):
$$ \lambda_1 = \text{Var}(z_{(1)}) = \frac{1}{3} \sum_i (z_i - \bar{z})^2 = \frac{1}{3} \left( 0^2 + (\sqrt{2})^2 + (-\sqrt{2})^2 \right) = \frac{1}{3} (0 + 2 + 2) = \frac{4}{3} $$

Conclusion: the variance explained by the first principal component is $\frac{4}{3}$.
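The score and variance computation in part (b) can be reproduced in a few lines. A sketch assuming NumPy:

```python
import numpy as np

X = np.array([[0.0, 0.0],
              [1.0, 1.0],
              [-1.0, -1.0]])
v1 = np.array([1.0, 1.0]) / np.sqrt(2)

z = X @ v1                           # scores on PC1: (0, sqrt(2), -sqrt(2))
lam1 = np.mean((z - z.mean()) ** 2)  # population variance, dividing by n = 3
print(z)     # approximately [0, 1.414, -1.414]
print(lam1)  # 4/3
```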
(c) Variance Explained by PC2
Having found PC1, we now find PC2.
Step 1: Find the direction $v_2$
Rule: the second principal component of PCA ($v_2$) must satisfy two conditions:
- Perpendicular to the first principal component ($v_1$).
- Length 1 (unit vector).
Procedure:
Review $v_1$: in part (a) we found the direction of $y=x$, so the unnormalized direction vector is $u_1 = \begin{pmatrix} 1 \\ 1 \end{pmatrix}$.
Find a perpendicular vector: on the two-dimensional plane, the simplest way to find a vector perpendicular to $\begin{pmatrix} a \\ b \end{pmatrix}$ is to swap the coordinates and negate one of them.
- $\begin{pmatrix} 1 \\ 1 \end{pmatrix} \xrightarrow{\text{swap and negate}} \begin{pmatrix} -1 \\ 1 \end{pmatrix}$.
- Verify with the dot product: $1 \cdot (-1) + 1 \cdot 1 = -1 + 1 = 0$. A dot product of 0 confirms perpendicularity.
- Normalize: The length of vector $\begin{pmatrix} -1 \\ 1 \end{pmatrix}$ is $\sqrt{(-1)^2 + 1^2} = \sqrt{2}$. Divide by the length to get $v_2$: $$ v_2 = \begin{pmatrix} -\frac{1}{\sqrt{2}} \\ \frac{1}{\sqrt{2}} \end{pmatrix} $$ This is where the $v_2$ in the answer comes from.
Step 2: Calculate the projection value $z$ (Calculate Projections)
Meaning: this step asks, “If I stand on the new coordinate axis $v_2$ and look at the original points, what readings do they give on my ruler?”
Calculation: take the dot product of each data point $x_i$ with the direction vector $v_2$.
$$ z_{i(2)} = x_i^\top v_2 = x_{i1} \cdot \left(-\frac{1}{\sqrt{2}}\right) + x_{i2} \cdot \frac{1}{\sqrt{2}} $$

Substituting the data points $X = \begin{pmatrix} 0 & 0 \\ 1 & 1 \\ -1 & -1 \end{pmatrix}$ one by one:
- The first point $(0,0)$: $$0 \cdot \left(-\frac{1}{\sqrt{2}}\right) + 0 \cdot \frac{1}{\sqrt{2}} = 0$$
- Second point $(1,1)$: $$1 \cdot (-\frac{1}{\sqrt{2}}) + 1 \cdot (\frac{1}{\sqrt{2}}) = -\frac{1}{\sqrt{2}} + \frac{1}{\sqrt{2}} = 0$$
- The third point $(-1,-1)$: $$-1 \cdot (-\frac{1}{\sqrt{2}}) + (-1) \cdot (\frac{1}{\sqrt{2}}) = \frac{1}{\sqrt{2}} - \frac{1}{\sqrt{2}} = 0$$
Intuitive explanation: all data points lie on the line $y=x$ (the first principal component direction), and $v_2$ is perpendicular to that line, so the data have no deviation at all in this perpendicular direction. It’s like walking a tightrope: your side-to-side sway is 0.
Step 3: Calculate Variance
Calculation: the new set of coordinate values (scores) is $\{0, 0, 0\}$. What is the variance of this set of numbers?
$$ \text{Variance} = \frac{1}{3} \sum_i (z_i - \bar{z})^2 $$

The mean is 0, and so is every value:
$$ \lambda_2 = \frac{1}{3} (0^2 + 0^2 + 0^2) = 0 $$

Summary
Part (c) actually verifies a geometric fact: **if 2D data lie perfectly on a straight line, then there is no information (variance 0) in the direction perpendicular to that line.**
How to do this step?
- Geometry (find the perpendicular direction): $(1,1) \to (-1,1)$.
- Algebra (project): all results are 0.
- Conclusion: the variance is 0.
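The three steps of part (c) can be confirmed numerically. A minimal sketch assuming NumPy:

```python
import numpy as np

X = np.array([[0.0, 0.0],
              [1.0, 1.0],
              [-1.0, -1.0]])
v2 = np.array([-1.0, 1.0]) / np.sqrt(2)  # unit vector perpendicular to (1, 1)

z2 = X @ v2                              # scores on PC2
lam2 = np.mean((z2 - z2.mean()) ** 2)    # population variance of the scores
print(z2)    # [0. 0. 0.] -- every point projects to the origin
print(lam2)  # 0.0
```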
How to find PCs in higher dimensions?
In higher-dimensional space, we cannot find perpendicular vectors with the simple 2D trick of swapping coordinates, because there are infinitely many perpendicular directions. Instead, we compute the eigenvectors of the covariance matrix $\frac{1}{n}X^\top X$ (with $X$ centered). Matrix algebra guarantees that this set of vectors not only gives the directions of largest variance but is also automatically mutually perpendicular.
Mathematical solution: eigenvectors of the covariance matrix. This is the “master key” for every dimension: whether the data are 3-dimensional or 1000-dimensional, the steps are exactly the same.
Step 1: Construct the covariance matrix (Covariance Matrix)
Compute $\Sigma = \frac{1}{n} X^\top X$ (assuming $X$ is centered). This matrix $\Sigma$ contains the relationships between all pairs of variables.
Step 2: Eigen-decomposition
We solve the characteristic equation of $\Sigma$:
$$\Sigma v = \lambda v$$

Step 3: Spectral Theorem
Because the covariance matrix $\Sigma$ is a real symmetric matrix (since $(X^\top X)^\top = X^\top X$), linear algebra provides a powerful result called the Spectral Theorem, which guarantees that all eigenvectors ($v_1, v_2, \dots, v_p$) are mutually orthogonal (perpendicular), and all eigenvalues ($\lambda_1, \lambda_2, \dots, \lambda_p$) are real and nonnegative.
Conclusion:
You don’t need to manually “find” perpendicular vectors at all. Once you compute all the eigenvectors of $\Sigma$, they are inherently perpendicular to each other. The eigenvector with the largest eigenvalue is PC1; the second largest gives PC2 (automatically perpendicular to PC1); the third largest gives PC3 (automatically perpendicular to both PC1 and PC2).
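The eigendecomposition route can be sketched for this 2D example (assuming NumPy); it recovers both eigenvalues and eigenvectors at once, matching parts (b) and (c):

```python
import numpy as np

X = np.array([[0.0, 0.0],
              [1.0, 1.0],
              [-1.0, -1.0]])      # already centered: column means are 0

Sigma = (X.T @ X) / len(X)        # covariance matrix, dividing by n = 3

# eigh is the symmetric-matrix eigensolver; it returns eigenvalues
# in ascending order, so reverse to list PC1 first
eigvals, eigvecs = np.linalg.eigh(Sigma)
order = np.argsort(eigvals)[::-1]
eigvals = eigvals[order]
eigvecs = eigvecs[:, order]

print(eigvals)        # lambda1 = 4/3, lambda2 = 0
print(eigvecs[:, 0])  # PC1 direction, proportional to (1, 1)/sqrt(2) up to sign
print(eigvecs[:, 0] @ eigvecs[:, 1])  # 0: the eigenvectors are orthogonal
```

Note the sign ambiguity: an eigensolver may return $-v_1$ instead of $v_1$; both describe the same axis.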