FINAL1-Q1
Here are the step-by-step solutions for Part A: Matrix Algebra & Spectral Analysis.
These exercises explore the mathematical foundations of the “Hat Matrix” ($H$) and the “Residual Maker” ($M$), which are central to understanding Linear Regression geometry.
Q1. Spectral Properties of the Residual Maker
(a) SVD and the Hat Matrix
Goal: Show that $H = UU^\top$ using the Singular Value Decomposition (SVD) of $X$.
Proof:
Recall the SVD: Let the “thin” SVD of $X$ be $X = UDV^\top$, where:
- $U \in \mathbb{R}^{n \times p}$ is semi-orthogonal ($U^\top U = I_p$).
- $D \in \mathbb{R}^{p \times p}$ is diagonal with positive singular values.
- $V \in \mathbb{R}^{p \times p}$ is orthogonal ($V^\top V = VV^\top = I_p$).
Substitute into the definition of $H$:
$$H = X(X^\top X)^{-1}X^\top$$$$H = (UDV^\top) \left[ (UDV^\top)^\top (UDV^\top) \right]^{-1} (UDV^\top)^\top$$Expand the transpose terms:
$$H = (UDV^\top) \left[ (V D U^\top) (UDV^\top) \right]^{-1} (V D U^\top)$$Simplify the inner term using $U^\top U = I_p$:
$$H = U D V^\top \left[ V D (I_p) D V^\top \right]^{-1} V D U^\top$$$$H = U D V^\top \left[ V D^2 V^\top \right]^{-1} V D U^\top$$Apply the inverse: Note that for invertible matrices $(ABC)^{-1} = C^{-1}B^{-1}A^{-1}$. Since $V$ is orthogonal, $V^{-1} = V^\top$.
$$\left[ V D^2 V^\top \right]^{-1} = (V^\top)^{-1} (D^2)^{-1} V^{-1} = V D^{-2} V^\top$$Substitute back and simplify:
$$H = U D V^\top \left( V D^{-2} V^\top \right) V D U^\top$$$$H = U D (V^\top V) D^{-2} (V^\top V) D U^\top$$Since $V^\top V = I_p$:
$$H = U D I_p D^{-2} I_p D U^\top$$$$H = U (D D^{-2} D) U^\top$$$$H = U (I_p) U^\top$$
Result:
$$H = UU^\top$$
(b) Eigenvalues of $M$ and PSD Property
Goal: Determine eigenvalues/multiplicities of $M$ and explain why it is Positive Semi-Definite (PSD).
Analysis:
Eigenvalues of $H$: From part (a), $H = UU^\top$. This is an orthogonal projection matrix onto the column space of $X$ (which has dimension/rank $p$).
- A projection matrix has eigenvalues of either 1 or 0.
- Since the rank is $p$, there are $p$ eigenvalues equal to 1.
- The remaining $n - p$ eigenvalues are 0.
Eigenvalues of $M$: We are given $M = I_n - H$. If $\lambda$ is an eigenvalue of $H$ with eigenvector $v$, then:
$$Mv = (I - H)v = v - Hv = v - \lambda v = (1 - \lambda)v$$The eigenvalues of $M$ are simply $1 - \lambda_H$.
- For the $p$ eigenvalues where $\lambda_H = 1$: $\lambda_M = 1 - 1 = \mathbf{0}$.
- For the $n-p$ eigenvalues where $\lambda_H = 0$: $\lambda_M = 1 - 0 = \mathbf{1}$.
Summary of Eigenvalues:
- Value 0: Multiplicity $p$
- Value 1: Multiplicity $n - p$
Why is $M$ PSD? A matrix is Positive Semi-Definite (PSD) if and only if all of its eigenvalues are non-negative ($\lambda \geq 0$). Since the eigenvalues of $M$ are exclusively 0 and 1 (both $\geq 0$), $M$ is a PSD matrix.
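As a quick numerical sanity check (not part of the original derivation), here is a minimal NumPy sketch using a hypothetical full-rank random design matrix; it confirms $H = UU^\top$ from (a) and the $0/1$ eigenvalue pattern of $M$ from (b).

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 8, 3
X = rng.standard_normal((n, p))                    # hypothetical full-rank design

H = X @ np.linalg.inv(X.T @ X) @ X.T               # hat matrix
M = np.eye(n) - H                                  # residual maker
U, d, Vt = np.linalg.svd(X, full_matrices=False)   # thin SVD: X = U diag(d) V^T

print(np.allclose(H, U @ U.T))                     # (a): H = U U^T
eig_M = np.sort(np.linalg.eigvalsh(M))             # (b): eigenvalues of the symmetric M
print(np.round(eig_M, 6))                          # p zeros followed by n - p ones
print(np.all(eig_M >= -1e-10))                     # all non-negative, so M is PSD
```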
(c) Trace of the Hat Matrix and Average Leverage
Goal: Show $\text{tr}(H) = p$ and find the average of diagonal elements $H_{ii}$.
1. Trace Calculation: Using the property $\text{tr}(AB) = \text{tr}(BA)$ (cyclic property) and the result from (a) $H = UU^\top$:
$$\text{tr}(H) = \text{tr}(UU^\top) = \text{tr}(U^\top U)$$Since $U$ has orthonormal columns ($U \in \mathbb{R}^{n \times p}$), $U^\top U = I_p$ (the $p \times p$ identity matrix).
$$\text{tr}(H) = \text{tr}(I_p) = p$$2. Average Leverage: The “leverage” scores are the diagonal elements of the Hat matrix, $H_{ii}$.
- The sum of the diagonal elements is the trace: $\sum_{i=1}^n H_{ii} = \text{tr}(H) = p$.
- The average value is the sum divided by the number of observations ($n$).
Answer:
$$\text{Average Leverage} = \frac{p}{n}$$
(d) Residual Sum of Squares Proof
Goal: Prove $\|r\|^2 = \text{tr}(y^\top M y)$.
Proof:
Definition of Norm: The squared Euclidean norm of the residuals is defined as the inner product of $r$ with itself:
$$\|r\|^2 = r^\top r$$Substitute $r = My$:
$$\|r\|^2 = (My)^\top (My) = y^\top M^\top M y$$Properties of $M$:
- Symmetry: Since $H$ is symmetric and $I$ is symmetric, $M = I - H$ is symmetric. Thus, $M^\top = M$.
- Idempotence: $M$ is a projection matrix (it projects onto the orthogonal complement of the column space of X). $$M^2 = (I - H)(I - H) = I - 2H + H^2$$ Since $H$ is idempotent ($H^2 = H$), $M^2 = I - 2H + H = I - H = M$.
- Therefore, $M^\top M = M \cdot M = M$.
Simplify:
$$y^\top M^\top M y = y^\top M y$$Using the Trace Hint: The term $y^\top M y$ results in a scalar (a $1 \times 1$ matrix). The trace of a scalar is just the scalar itself.
$$y^\top M y = \text{tr}(y^\top M y)$$(Alternatively, using the cyclic property strictly from step 1: $\|r\|^2 = \text{tr}(r^\top r) = \text{tr}(r r^\top)$. Substituting $r=My$ eventually yields the same result).
Conclusion:
$$\|r\|^2 = \text{tr}(y^\top M y)$$
FINAL1-Q2
Here are the step-by-step solutions for Part B: Data Transformation & SVD, along with the related mathematical concepts.
Q2. Total Variance and Principal Components
(a) Diagonalizing the Sample Covariance Matrix
Goal: Show that $S = \frac{1}{n}VD^2V^\top$.
Proof:
- Definition: The sample covariance matrix is defined as $S = \frac{1}{n}X^\top X$.
- Substitute SVD: We are given the Singular Value Decomposition $X = UDV^\top$. Substitute this into the expression for $S$: $$S = \frac{1}{n} (UDV^\top)^\top (UDV^\top)$$
- Transpose Property: Recall that $(ABC)^\top = C^\top B^\top A^\top$. Also, $D$ is a diagonal matrix, so $D^\top = D$. $$S = \frac{1}{n} (V D U^\top) (U D V^\top)$$
- Simplify: Since $U$ is semi-orthogonal (columns are orthonormal), $U^\top U = I_p$ (the identity matrix). $$S = \frac{1}{n} V D (I_p) D V^\top$$ $$S = \frac{1}{n} V D^2 V^\top$$
Result:
$$S = \frac{1}{n}VD^2V^\top$$(Note: This is the Eigendecomposition of $S$, where the eigenvalues are $\frac{d_i^2}{n}$ and the eigenvectors are the columns of $V$.)
(b) Total Variance and Frobenius Norm
Goal: Show that $\text{Total Variance} = \frac{1}{n} \sum_{i=1}^p d_i^2$.
Proof:
- Definition: Total Variance is defined as the trace of the covariance matrix, $\text{tr}(S)$.
- Substitute Result from (a): $$\text{Total Variance} = \text{tr}\left( \frac{1}{n} V D^2 V^\top \right) = \frac{1}{n} \text{tr}(V D^2 V^\top)$$
- Cyclic Property of Trace: For matrices $A, B, C$, $\text{tr}(ABC) = \text{tr}(BCA)$. Let $A=V$, $B=D^2$, $C=V^\top$. $$\text{tr}(V D^2 V^\top) = \text{tr}(D^2 V^\top V)$$
- Orthogonality: Since $V$ is orthogonal, $V^\top V = I_p$. $$\text{tr}(D^2 I_p) = \text{tr}(D^2)$$
- Compute Trace: $D$ is diagonal with entries $d_1, \dots, d_p$. Therefore, $D^2$ is diagonal with entries $d_1^2, \dots, d_p^2$. The trace is the sum of diagonal elements. $$\text{tr}(D^2) = \sum_{i=1}^p d_i^2$$
Final Result:
$$\text{Total Variance} = \frac{1}{n} \sum_{i=1}^p d_i^2$$
(c) Transformed Data and Correlation
Goal: Find the covariance of $Z = XV$ and determine if $Z$ is correlated.
1. Calculate Covariance of $Z$: Let $S_Z$ be the sample covariance of the transformed data.
$$S_Z = \frac{1}{n} Z^\top Z$$Substitute $Z = XV$:
$$S_Z = \frac{1}{n} (XV)^\top (XV) = \frac{1}{n} V^\top X^\top X V$$From part (a), we know $X^\top X = V D^2 V^\top$. Substitute this back:
$$S_Z = \frac{1}{n} V^\top (V D^2 V^\top) V$$$$S_Z = \frac{1}{n} (V^\top V) D^2 (V^\top V)$$Since $V^\top V = I$:
$$S_Z = \frac{1}{n} D^2$$2. Is $Z$ correlated? The matrix $S_Z = \frac{1}{n} \text{diag}(d_1^2, \dots, d_p^2)$ is a diagonal matrix.
- The diagonal elements represent the variance of the new variables.
- The off-diagonal elements are all zero.
Conclusion: Since the off-diagonal elements (covariances) are zero, the variables in $Z$ are uncorrelated. (Context: This is the core mechanism of Principal Component Analysis (PCA). $Z$ represents the Principal Component Scores, which are guaranteed to be orthogonal/uncorrelated).
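A small numeric check of (a)-(c), assuming column-centered data so that $S = \frac{1}{n}X^\top X$; the toy matrix and seed below are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(1)
n, p = 200, 4
X = rng.standard_normal((n, p))
X = X - X.mean(axis=0)                                  # center so S = X^T X / n

U, d, Vt = np.linalg.svd(X, full_matrices=False)
S = X.T @ X / n
print(np.allclose(S, Vt.T @ np.diag(d**2) @ Vt / n))    # (a): S = V D^2 V^T / n
print(np.isclose(np.trace(S), (d**2).sum() / n))        # (b): total variance = sum(d_i^2)/n

Z = X @ Vt.T                                            # (c): scores Z = XV
S_Z = Z.T @ Z / n
print(np.allclose(S_Z, np.diag(d**2) / n))              # diagonal, so the scores are uncorrelated
```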
(d) Gradient Application
Goal: Verify that the derivative of $f(A) = \|X - A\|_F^2$ is zero when $A = X$.
Proof:
Expand the Frobenius Norm:
$$f(A) = \text{tr}((X - A)^\top (X - A))$$$$f(A) = \text{tr}(X^\top X - X^\top A - A^\top X + A^\top A)$$Apply Trace Linearity:
$$f(A) = \text{tr}(X^\top X) - \text{tr}(X^\top A) - \text{tr}(A^\top X) + \text{tr}(A^\top A)$$Note that $\text{tr}(X^\top A) = \text{tr}((X^\top A)^\top) = \text{tr}(A^\top X)$.
$$f(A) = \text{tr}(X^\top X) - 2\text{tr}(A^\top X) + \text{tr}(A^\top A)$$Compute the Gradient $\nabla_A f(A)$: We use standard matrix calculus identities:
- $\nabla_A \text{tr}(A^\top A) = 2A$
- $\nabla_A \text{tr}(A^\top X) = X$
- $\nabla_A (\text{constant}) = 0$
Applying these to our equation:
$$\nabla_A f(A) = 0 - 2(X) + 2A$$$$\nabla_A f(A) = 2A - 2X$$Set to Zero: To find the critical point, set the gradient to zero:
$$2A - 2X = 0 \implies 2A = 2X \implies A = X$$
Conclusion: The derivative is zero exactly when $A = X$, verifying that the matrix closest to $X$ in the Frobenius norm is $X$ itself (which makes intuitive sense as the distance is zero).
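A finite-difference sketch (not part of the original solution) that checks the gradient $2A - 2X$ at a hypothetical random point:

```python
import numpy as np

rng = np.random.default_rng(2)
X = rng.standard_normal((3, 2))            # hypothetical target matrix
A = rng.standard_normal((3, 2))            # point at which to check the gradient

def f(A):
    return np.sum((X - A) ** 2)            # ||X - A||_F^2

grad_analytic = 2 * A - 2 * X              # identically zero when A = X
grad_numeric = np.zeros_like(A)
eps = 1e-6
for i in range(A.shape[0]):
    for j in range(A.shape[1]):
        E = np.zeros_like(A)
        E[i, j] = eps
        grad_numeric[i, j] = (f(A + E) - f(A - E)) / (2 * eps)

print(np.allclose(grad_analytic, grad_numeric, atol=1e-5))   # finite differences agree
```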
Related Concepts & Theorems
This problem set relies heavily on several key Linear Algebra concepts:
- Singular Value Decomposition (SVD): The factorization $X = UDV^\top$ is the “master key” of linear algebra. It reveals the intrinsic geometry of the data matrix $X$.
- Spectral Theorem: Since $S = \frac{1}{n}X^\top X$ is a symmetric matrix, the Spectral Theorem guarantees it can be diagonalized by an orthogonal matrix ($V$). The proof in Q2(a) explicitly constructs this diagonalization.
- Principal Component Analysis (PCA): Q2(c) is the mathematical derivation of PCA.
- $V$ contains the Principal Directions (eigenvectors of covariance).
- $Z = XV$ are the Principal Component Scores.
- $\frac{1}{n}d_i^2$ are the Explained Variances (eigenvalues of covariance).
- Matrix Calculus: The gradient derivation in Q2(d) is fundamental to machine learning optimization (e.g., deriving the Ordinary Least Squares solution).
- Trace & Frobenius Norm: The relationship $\|X\|_F^2 = \text{tr}(X^\top X) = \sum \sigma_i^2$ connects element-wise magnitude to spectral properties (singular values).
FINAL1-Q3
Here are the step-by-step solutions for Part C: Statistical Inference (Ridge Regression Context).
These exercises explore the properties of the Ridge Estimator, a biased estimator used to handle multicollinearity and reduce variance in regression models.
Q3. Distribution of the Ridge Estimator
Context:
- Model: $y \sim \mathcal{N}_n(X\beta, \sigma^2 I_n)$.
- Estimator: $\hat{\beta}_R = (X^\top X + \lambda I_p)^{-1}X^\top y$.
- Constant: $\lambda > 0$.
(a) Expectation and Bias
Goal: Calculate $\mathbb{E}[\hat{\beta}_R]$ and determine if it is unbiased.
Derivation:
Linearity of Expectation: Let $W = (X^\top X + \lambda I_p)^{-1}X^\top$. Then $\hat{\beta}_R = Wy$. Since $W$ is a constant matrix (deterministic $X$):
$$\mathbb{E}[\hat{\beta}_R] = \mathbb{E}[Wy] = W\mathbb{E}[y]$$Substitute Expected Value of $y$: We know $\mathbb{E}[y] = X\beta$.
$$\mathbb{E}[\hat{\beta}_R] = (X^\top X + \lambda I_p)^{-1}X^\top (X\beta)$$$$\mathbb{E}[\hat{\beta}_R] = (X^\top X + \lambda I_p)^{-1}(X^\top X)\beta$$Check for Unbiasedness: For an estimator to be unbiased, we require $\mathbb{E}[\hat{\beta}_R] = \beta$. This would require $(X^\top X + \lambda I_p)^{-1}(X^\top X) = I$.
- Note that $(X^\top X + \lambda I_p)^{-1}(X^\top X + \lambda I_p) = I$.
- Our term has $X^\top X$ instead of $(X^\top X + \lambda I_p)$.
- We can rewrite the expectation as: $$(X^\top X + \lambda I_p)^{-1}(X^\top X + \lambda I_p - \lambda I_p)\beta = [I - \lambda(X^\top X + \lambda I_p)^{-1}]\beta$$ $$= \beta - \lambda(X^\top X + \lambda I_p)^{-1}\beta$$
Conclusion: Since $\mathbb{E}[\hat{\beta}_R] \neq \beta$ (assuming $\lambda > 0$ and $\beta \neq 0$), $\hat{\beta}_R$ is a BIASED estimator of $\beta$.
(b) Covariance Matrix
Goal: Derive $\text{Var}(\hat{\beta}_R)$.
Derivation:
Property of Variance: For a linear transformation $Ay$, the covariance is $\text{Var}(Ay) = A\text{Var}(y)A^\top$. Here, $A = (X^\top X + \lambda I_p)^{-1}X^\top$.
Substitute $\text{Var}(y)$: $\text{Var}(y) = \sigma^2 I_n$.
$$\text{Var}(\hat{\beta}_R) = A (\sigma^2 I_n) A^\top = \sigma^2 A A^\top$$Compute $A A^\top$:
$$A = (X^\top X + \lambda I_p)^{-1}X^\top$$$$A^\top = \left[ (X^\top X + \lambda I_p)^{-1}X^\top \right]^\top = X \left[ (X^\top X + \lambda I_p)^{-1} \right]^\top$$- Since $(X^\top X + \lambda I_p)$ is symmetric, its inverse is also symmetric. $$A^\top = X (X^\top X + \lambda I_p)^{-1}$$
Combine Terms:
$$\text{Var}(\hat{\beta}_R) = \sigma^2 \left[ (X^\top X + \lambda I_p)^{-1}X^\top \right] \left[ X (X^\top X + \lambda I_p)^{-1} \right]$$$$\text{Var}(\hat{\beta}_R) = \sigma^2 (X^\top X + \lambda I_p)^{-1} (X^\top X) (X^\top X + \lambda I_p)^{-1}$$
Result:
$$\text{Var}(\hat{\beta}_R) = \sigma^2 (X^\top X + \lambda I_p)^{-1} (X^\top X) (X^\top X + \lambda I_p)^{-1}$$(Note: Unlike OLS, the terms do not cancel out to satisfy the simple $\sigma^2 (X^\top X)^{-1}$ form. This is the “sandwich” variance estimator form.)
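A Monte Carlo sketch (all sizes, the seed, and $\lambda$ are hypothetical choices) comparing the empirical mean and covariance of $\hat{\beta}_R$ with the closed forms from (a) and (b):

```python
import numpy as np

rng = np.random.default_rng(3)
n, p, lam, sigma = 50, 3, 2.0, 1.5
X = rng.standard_normal((n, p))
beta = np.array([1.0, -2.0, 0.5])
A_inv = np.linalg.inv(X.T @ X + lam * np.eye(p))
W = A_inv @ X.T                                       # beta_R = W y

R = 20_000
Y = X @ beta + sigma * rng.standard_normal((R, n))    # R independent draws of y
draws = Y @ W.T                                       # each row is one ridge estimate

mean_theory = A_inv @ (X.T @ X) @ beta                # biased away from beta
cov_theory = sigma**2 * A_inv @ (X.T @ X) @ A_inv     # "sandwich" covariance

print(np.round(draws.mean(axis=0) - mean_theory, 2))  # ~ 0
print(np.round(np.cov(draws.T) - cov_theory, 2))      # ~ 0
```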
(c) Ridge “Hat” Matrix
Goal: Determine if $A_\lambda = X(X^\top X + \lambda I_p)^{-1}X^\top$ is a projection matrix.
Analysis:
Definition: $\hat{y}_R = X \hat{\beta}_R = X(X^\top X + \lambda I_p)^{-1}X^\top y$. Thus, $A_\lambda = X(X^\top X + \lambda I_p)^{-1}X^\top$.
Condition for Projection Matrix: A projection matrix $P$ must be idempotent ($P^2 = P$). Let’s calculate $A_\lambda^2$:
$$A_\lambda^2 = [X(X^\top X + \lambda I)^{-1}X^\top] [X(X^\top X + \lambda I)^{-1}X^\top]$$$$A_\lambda^2 = X(X^\top X + \lambda I)^{-1} (X^\top X) (X^\top X + \lambda I)^{-1}X^\top$$Comparison: For $A_\lambda^2 = A_\lambda$, the middle term $(X^\top X)$ would effectively need to cancel one of the inverse terms. Specifically, we would need:
$$(X^\top X + \lambda I)^{-1} (X^\top X) = I$$$$X^\top X = X^\top X + \lambda I \implies \lambda I = 0$$Since $\lambda > 0$, this is impossible.
Conclusion: No, $A_\lambda$ is NOT a projection matrix. It does not project data orthogonally; instead, it “shrinks” the fitted values towards zero (the eigenvalues of $A_\lambda$ are strictly less than 1).
(d) Full Distribution
Goal: State the distribution $\hat{y}_{new} \sim \mathcal{N}(?, ?)$.
Reasoning:
- $\hat{y}_{new} = u^\top \hat{\beta}_R$ is a linear combination of the elements of $\hat{\beta}_R$.
- Since $y$ is Normally distributed, $\hat{\beta}_R$ (a linear transformation of $y$) is Normal.
- Therefore, $\hat{y}_{new}$ is also Normally distributed.
Parameters:
Mean:
$$\mu_{new} = \mathbb{E}[u^\top \hat{\beta}_R] = u^\top \mathbb{E}[\hat{\beta}_R]$$Using the result from (a):
$$\mu_{new} = u^\top (X^\top X + \lambda I_p)^{-1} X^\top X \beta$$Variance:
$$\sigma^2_{new} = \text{Var}(u^\top \hat{\beta}_R) = u^\top \text{Var}(\hat{\beta}_R) u$$Using the result from (b):
$$\sigma^2_{new} = \sigma^2 u^\top (X^\top X + \lambda I_p)^{-1} (X^\top X) (X^\top X + \lambda I_p)^{-1} u$$
Final Distribution:
$$\hat{y}_{new} \sim \mathcal{N} \left( u^\top (X^\top X + \lambda I)^{-1} X^\top X \beta, \quad \sigma^2 u^\top (X^\top X + \lambda I)^{-1} X^\top X (X^\top X + \lambda I)^{-1} u \right)$$
Related Concepts
- Bias-Variance Tradeoff: Ridge regression introduces a small bias (as proved in Q3a) to significantly reduce the variance of the estimator. This often results in a lower Mean Squared Error (MSE) than OLS, especially when $X$ is multicollinear.
- Sandwich Estimator: The variance form derived in Q3(b) $(A^{-1} B A^{-1})$ is a classic “sandwich” structure, often seen in robust statistics and generalized estimating equations.
- Shrinkage: The matrix $A_\lambda$ in Q3(c) scales the singular values of $X$ by a factor of $\frac{d_i^2}{d_i^2 + \lambda}$. Since this factor is $<1$, the predictions are “shrunk” toward zero, preventing overfitting.
FINAL2-Q1
Here are the step-by-step solutions for Part A: Advanced Inference in Linear Models (GLS & Constraints).
These exercises derive the Generalized Least Squares (GLS) estimator, which is the standard method for handling correlation or non-constant variance (heteroscedasticity) in error terms.
Q1. Generalized Least Squares (GLS)
Context:
- Model: $y = X\beta + \epsilon$
- Noise: $\epsilon \sim \mathcal{N}(0, \sigma^2 \Psi)$, where $\Psi$ is a known positive definite matrix.
(a) Transformation Matrix $L$
Goal: Find a matrix $L$ such that the transformed noise $L\epsilon$ is “white noise” ($\sim \mathcal{N}(0, \sigma^2 I)$).
Derivation:
Analyze the Covariance: We want the variance of the transformed errors to be $\sigma^2 I$. Let $\epsilon^* = L\epsilon$.
$$\text{Var}(\epsilon^*) = \text{Var}(L\epsilon) = L \text{Var}(\epsilon) L^\top = L (\sigma^2 \Psi) L^\top = \sigma^2 (L \Psi L^\top)$$We require $L \Psi L^\top = I$.
Construct $L$: Since $\Psi$ is positive definite, its inverse $\Psi^{-1}$ is also positive definite. We can use the Cholesky Decomposition (or eigen-decomposition) to find a matrix $L$ such that:
$$L^\top L = \Psi^{-1}$$(Usually, $L$ is chosen as the upper triangular Cholesky factor of $\Psi^{-1}$, or symmetric square root $\Psi^{-1/2}$).
Verify the condition: If $L^\top L = \Psi^{-1}$, then $\Psi = (L^\top L)^{-1} = L^{-1} (L^\top)^{-1}$. Substitute this back into the variance equation:
$$L \Psi L^\top = L [L^{-1} (L^\top)^{-1}] L^\top = (L L^{-1}) ((L^\top)^{-1} L^\top) = I \cdot I = I$$This satisfies the requirement.
Result: $L$ is any matrix satisfying $L^\top L = \Psi^{-1}$ (e.g., $L = \Psi^{-1/2}$).
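A minimal sketch of (a), assuming a hypothetical AR(1)-style $\Psi$: build $L$ from the Cholesky factor of $\Psi^{-1}$ and confirm that it whitens the noise covariance.

```python
import numpy as np

n, rho = 6, 0.7
Psi = rho ** np.abs(np.subtract.outer(np.arange(n), np.arange(n)))   # positive definite

C = np.linalg.cholesky(np.linalg.inv(Psi))     # lower triangular with C C^T = Psi^{-1}
L = C.T                                        # so that L^T L = Psi^{-1}

print(np.allclose(L.T @ L, np.linalg.inv(Psi)))    # L^T L = Psi^{-1}
print(np.allclose(L @ Psi @ L.T, np.eye(n)))       # L Psi L^T = I (white noise)
```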
(b) The GLS Estimator
Goal: Derive $\hat{\beta}_{GLS}$ using the transformation from (a).
Derivation:
Transform the Model: Multiply the original model $y = X\beta + \epsilon$ by $L$:
$$Ly = LX\beta + L\epsilon$$Let $y^* = Ly$, $X^* = LX$, and $\epsilon^* = L\epsilon$. The new model is $y^* = X^*\beta + \epsilon^*$, where $\epsilon^* \sim \mathcal{N}(0, \sigma^2 I)$.
Apply OLS: Since the new noise is “white” (homoscedastic and uncorrelated), the Ordinary Least Squares (OLS) estimator is the Best Linear Unbiased Estimator (BLUE).
$$\hat{\beta}_{GLS} = (X^{*\top} X^*)^{-1} X^{*\top} y^*$$Substitute Original Variables: Substitute $X^* = LX$ and $y^* = Ly$:
$$\hat{\beta}_{GLS} = ((LX)^\top (LX))^{-1} (LX)^\top (Ly)$$$$\hat{\beta}_{GLS} = (X^\top L^\top L X)^{-1} X^\top L^\top L y$$Simplify using $\Psi$: Recall from part (a) that $L^\top L = \Psi^{-1}$.
$$\hat{\beta}_{GLS} = (X^\top \Psi^{-1} X)^{-1} X^\top \Psi^{-1} y$$
Result:
$$\hat{\beta}_{GLS} = (X^\top \Psi^{-1} X)^{-1} X^\top \Psi^{-1} y$$
(c) Distribution of $\hat{\beta}_{GLS}$
Goal: State the mean and covariance matrix of the estimator.
Reasoning:
- $\hat{\beta}_{GLS}$ is a linear combination of the normally distributed vector $y$. Thus, it is also normally distributed.
- Mean: Since GLS is an unbiased estimator (derived from OLS on a valid model): $$\mathbb{E}[\hat{\beta}_{GLS}] = \beta$$
- Covariance: The variance of the OLS estimator on the transformed data is $\sigma^2 (X^{*\top} X^*)^{-1}$. Using the substitution $X^{*\top} X^* = X^\top \Psi^{-1} X$: $$\text{Var}(\hat{\beta}_{GLS}) = \sigma^2 (X^\top \Psi^{-1} X)^{-1}$$
Final Distribution:
$$\hat{\beta}_{GLS} \sim \mathcal{N}_p \left( \beta, \sigma^2 (X^\top \Psi^{-1} X)^{-1} \right)$$
(d) Hypothesis Testing (General Constraint)
Goal: Construct a $\chi^2$ test statistic for $H_0: A\beta = 0$ vs $H_1: A\beta \neq 0$.
Derivation:
Distribution of $A\hat{\beta}_{GLS}$: Since $\hat{\beta}_{GLS}$ is Normal, the linear transformation $A\hat{\beta}_{GLS}$ is also Normal.
- Mean under $H_0$: $\mathbb{E}[A\hat{\beta}_{GLS}] = A\beta = 0$.
- Variance: $\text{Var}(A\hat{\beta}_{GLS}) = A \text{Var}(\hat{\beta}_{GLS}) A^\top$. Substitute the variance from (c): $$V_{constrained} = \sigma^2 A (X^\top \Psi^{-1} X)^{-1} A^\top$$
Construct Quadratic Form (Wald Statistic): For a vector $z \sim \mathcal{N}(0, \Sigma)$, the quadratic form $z^\top \Sigma^{-1} z$ follows a $\chi^2$ distribution with degrees of freedom equal to the rank of $\Sigma$ (which is $r$, the number of rows in $A$). Let $z = A\hat{\beta}_{GLS}$.
$$T = (A\hat{\beta}_{GLS})^\top \left[ \text{Var}(A\hat{\beta}_{GLS}) \right]^{-1} (A\hat{\beta}_{GLS})$$Substitute Variance:
$$T = (A\hat{\beta}_{GLS})^\top \left[ \sigma^2 A (X^\top \Psi^{-1} X)^{-1} A^\top \right]^{-1} (A\hat{\beta}_{GLS})$$
Result:
$$T = \frac{1}{\sigma^2} (A\hat{\beta}_{GLS})^\top \left( A (X^\top \Psi^{-1} X)^{-1} A^\top \right)^{-1} (A\hat{\beta}_{GLS}) \sim \chi^2_r$$(Note: If $\sigma^2$ is unknown, we would typically replace it with an estimate $\hat{\sigma}^2$ and use an F-distribution, but the question specifically asks for a statistic following a $\chi^2$ distribution, implying $\sigma^2$ is treated as known or part of the scaling).
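A simulation sketch (the design, $\Psi$, $A$, and sizes are hypothetical) illustrating that under $H_0$ the statistic $T$ behaves like a $\chi^2_r$ variable, whose mean and variance are $r$ and $2r$:

```python
import numpy as np

rng = np.random.default_rng(5)
n, p, r, sigma, rho = 40, 4, 2, 1.0, 0.5
X = rng.standard_normal((n, p))
Psi = rho ** np.abs(np.subtract.outer(np.arange(n), np.arange(n)))
Psi_inv = np.linalg.inv(Psi)
A = np.array([[1.0, -1.0, 0.0, 0.0],
              [0.0, 0.0, 1.0, -1.0]])                   # r x p constraint matrix
beta = np.array([1.0, 1.0, -0.5, -0.5])                 # satisfies A beta = 0, so H0 holds

XtPX_inv = np.linalg.inv(X.T @ Psi_inv @ X)
W = XtPX_inv @ X.T @ Psi_inv                            # beta_GLS = W y
V_inv = np.linalg.inv(sigma**2 * A @ XtPX_inv @ A.T)    # inverse of Var(A beta_GLS)

R = 20_000
eps = sigma * rng.standard_normal((R, n)) @ np.linalg.cholesky(Psi).T
Z = ((X @ beta + eps) @ W.T) @ A.T                      # rows: A beta_GLS for each draw
T = np.sum((Z @ V_inv) * Z, axis=1)                     # Wald statistics

print(round(T.mean(), 2), round(T.var(), 2))            # close to r = 2 and 2r = 4
```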
Here are the step-by-step solutions for the three remaining questions: Part C (Kernel Methods), Part B (PCA Duality), and Part D (CCA Optimization).
FINAL2-Q2
This question explores how to perform PCA efficiently when $p \gg n$ (e.g., genetics) using the “Dual” approach (SVD on $X$ instead of eigen-decomposition of covariance).
(a) Eigenvalues
Goal: Express eigenvalues $\lambda_i$ of Sample Covariance $S$ in terms of singular values $d_i$ of $X$.
Derivation:
- The sample covariance is $S = \frac{1}{n} X^\top X$.
- Using SVD, $X = UDV^\top$. Thus, $X^\top X = V D^2 V^\top$.
- The eigenvalues of $X^\top X$ are the diagonal entries of $D^2$, which are $d_i^2$.
- Scaling by $\frac{1}{n}$: $$\lambda_i = \frac{d_i^2}{n}$$
(b) Scores vs. Loadings
Goal: Show Principal Component Scores $Z = XV$ can be computed as $UD$.
Proof:
- Definition: Scores are the projection of data $X$ onto the loadings $V$: $Z = XV$.
- Substitute SVD: Replace $X$ with $UDV^\top$: $$Z = (UDV^\top) V$$
- Simplify: Since $V$ is orthogonal, $V^\top V = I$. $$Z = U D (I) = UD$$
Significance: We can compute the scores directly from the left singular vectors $U$ and singular values $D$. If $p$ is huge (e.g., 20,000 genes), we avoid computing the massive $p \times p$ covariance matrix or finding the $p$-dimensional vector $V$. We only work with the $n \times n$ matrix $XX^\top$ to get $U$.
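A short sketch of this dual computation (the $p \gg n$ sizes and seed are hypothetical): the scores $Z = XV$ coincide with $UD$, and the eigenvalues of the small $n \times n$ Gram matrix $XX^\top$ are the $d_i^2$.

```python
import numpy as np

rng = np.random.default_rng(6)
n, p = 30, 500                                  # p >> n, as in the genetics setting
X = rng.standard_normal((n, p))
X = X - X.mean(axis=0)

U, d, Vt = np.linalg.svd(X, full_matrices=False)
print(np.allclose(X @ Vt.T, U * d))             # Z = XV equals U D, column by column

gram_evals = np.linalg.eigh(X @ X.T)[0]         # only an n x n eigenproblem is needed
print(np.allclose(np.sort(gram_evals), np.sort(d**2)))   # eigenvalues of XX^T are d_i^2
```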
(c) Eckart-Young Theorem (Reconstruction)
Goal: Write reconstruction $\hat{X}_k$ and error $\|\cdot\|_F^2$.
- Reconstruction: The best rank-$k$ approximation of $X$ is formed by keeping the top $k$ singular values/vectors: $$\hat{X}_k = \sum_{i=1}^k d_i u_i v_i^\top = U_k D_k V_k^\top$$
- Error Norm: The squared Frobenius norm of the error is the sum of the squared singular values of the discarded components (from $k+1$ to $p$): $$\|X - \hat{X}_k\|_F^2 = \sum_{i=k+1}^p d_i^2$$
(d) Geometry (Variance vs. Reconstruction)
Goal: Explain why Max Variance $\iff$ Min Reconstruction Error.
Explanation:
- Total “Energy”: The total squared Frobenius norm of $X$ is the sum of all squared singular values: $\|X\|_F^2 = \sum_{i=1}^p d_i^2$. This is proportional to the Total Variance.
- Decomposition: $$\|X\|_F^2 = \underbrace{\sum_{i=1}^k d_i^2}_{\text{Variance Explained}} + \underbrace{\sum_{i=k+1}^p d_i^2}_{\text{Reconstruction Error}}$$
- Equivalence: Since the Total Variance ($\|X\|_F^2$) is constant for a fixed dataset:
- Maximizing the first term (Variance Explained by top $k$ components)…
- …is mathematically identical to Minimizing the second term (Reconstruction Error).
FINAL2-Q3
This question bridges the gap between the theoretical definition of feature maps $\Phi(x)$ and the practical implementation of Kernel PCA using only the kernel matrix $K$.
(a) Distance in Feature Space
Goal: Express $\|\Phi(x_i) - \Phi(x_j)\|^2$ using only kernel values $K(\cdot, \cdot)$.
Derivation:
- Expand the squared norm: The squared Euclidean distance is the inner product of the difference vector with itself: $$d^2(\Phi(x_i), \Phi(x_j)) = \langle \Phi(x_i) - \Phi(x_j), \Phi(x_i) - \Phi(x_j) \rangle$$
- Distribute the terms: $$= \langle \Phi(x_i), \Phi(x_i) \rangle - \langle \Phi(x_i), \Phi(x_j) \rangle - \langle \Phi(x_j), \Phi(x_i) \rangle + \langle \Phi(x_j), \Phi(x_j) \rangle$$
- Apply the Kernel definition: By definition, the kernel function is the inner product in feature space: $K(a, b) = \langle \Phi(a), \Phi(b) \rangle$. Also, inner products are symmetric. $$= K(x_i, x_i) - 2K(x_i, x_j) + K(x_j, x_j)$$
Result:
$$\|\Phi(x_i) - \Phi(x_j)\|^2 = K(x_i, x_i) - 2K(x_i, x_j) + K(x_j, x_j)$$
(b) The Centering Matrix
Goal: Explain why $\tilde{K} = CKC$ centers the kernel matrix.
Explanation:
- Recall that $C = I_n - \frac{1}{n}1_n 1_n^\top$.
- Multiplying by $C$ on the right ($KC$): This subtracts the row means from every element. Specifically, $(KC)_{ij} = K_{ij} - \frac{1}{n}\sum_{k=1}^n K_{ik}$.
- Multiplying by $C$ on the left ($CK$): This subtracts the column means from every element.
- Multiplying on both sides ($CKC$): This performs the “double centering” operation. It subtracts the row mean, the column mean, and adds back the grand mean (to avoid double subtraction). Mathematically, this corresponds to centering the feature vectors $\tilde{\Phi}(x_i) = \Phi(x_i) - \frac{1}{n}\sum_j \Phi(x_j)$ purely in terms of the kernel matrix, without ever computing the explicit means in high-dimensional space.
(c) Effect on Eigenvalues
- Does it change eigenvalues? Yes. The uncentered kernel matrix $K$ captures the raw inner products (relative to the origin). The centered kernel matrix $\tilde{K}$ captures the covariance (inner products relative to the mean). The eigenvalues of $\tilde{K}$ represent the variance along the principal components.
- Why is it necessary? Standard PCA is defined on the covariance matrix (centered data). If data is not centered, the first “principal component” would largely just point from the origin to the center of the data cloud (the mean), rather than capturing the direction of maximum variance within the data. We need centering to perform valid PCA.
(d) Linear Kernel Check
Goal: Verify if $\tilde{K} = CKC$ works for the linear kernel $K = XX^\top$.
Derivation:
- Substitute $K = XX^\top$ into the equation: $$\tilde{K} = C (XX^\top) C$$
- Use the property that $C$ is symmetric ($C = C^\top$): $$\tilde{K} = (CX) (X^\top C^\top) = (CX) (CX)^\top$$
- Interpretation: We know that $CX$ is the matrix where the columns of $X$ are centered (mean subtracted). Therefore, $\tilde{K}$ is the inner product matrix of the centered data, which is exactly the definition of the centered kernel.
Answer: Yes, it effectively centers the original data $X$.
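A compact numeric check of (a) and (d) using a linear kernel, where the feature map is explicit; the data and seed are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(7)
n, p = 15, 3
X = rng.standard_normal((n, p)) + 2.0          # deliberately not centered
K = X @ X.T                                    # linear kernel: K(x_i, x_j) = <x_i, x_j>

i, j = 0, 1
d2_kernel = K[i, i] - 2 * K[i, j] + K[j, j]    # (a): distance from kernel values only
print(np.isclose(d2_kernel, np.sum((X[i] - X[j]) ** 2)))

C = np.eye(n) - np.ones((n, n)) / n            # centering matrix
Xc = X - X.mean(axis=0)                        # explicitly centered features
print(np.allclose(C @ K @ C, Xc @ Xc.T))       # (d): CKC = Gram matrix of centered data
```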
FINAL2-Q4
Canonical Correlation Analysis (CCA) finds linear combinations of two sets of variables ($X$ and $Y$) that are maximally correlated.
(a) The Objective
Goal: Write the optimization problem for CCA.
Problem Statement: Since correlation is scale-invariant, we fix the variance of the projections to 1 to make the solution unique.
- Maximize: $u^\top S_{XY} v$ (Covariance between transformed variables)
- Subject to:
- $u^\top S_{XX} u = 1$ (Variance of $X$-projection is 1)
- $v^\top S_{YY} v = 1$ (Variance of $Y$-projection is 1)
(Note: $S_{XY} = \frac{1}{n}X^\top Y$, $S_{XX} = \frac{1}{n}X^\top X$, etc.)
(b) Independence
Scenario: $X$ and $Y$ are statistically independent.
- Theoretical Implication: If independent, there is no linear relationship between them. The theoretical cross-covariance matrix $\Sigma_{XY}$ is the zero matrix.
- Solution:
- The objective function becomes maximizing $u^\top S_{XY} v$ where $S_{XY} \approx 0$ (exactly zero in the population).
- The maximum canonical correlation would be approximately 0. (In a finite sample, it will be a small non-zero value due to random noise/spurious correlation, but theoretically 0).
(c) Relation to Regression
Goal: Show CCA direction $u$ is proportional to OLS $\hat{\beta}$ when $Y$ is univariate ($y$).
Proof:
- CCA Objective: Maximize $\text{Corr}(Xu, y)$. This seeks the linear combination of the columns of $X$ (let’s call it $\hat{y}_{cca} = X u$) that has the highest correlation with $y$.
- OLS Property: Ordinary Least Squares finds $\hat{\beta}$ such that the fitted values $\hat{y}_{ols} = X \hat{\beta}$ minimize squared error.
- Geometrically, $\hat{y}_{ols}$ is the orthogonal projection of $y$ onto the column space of $X$.
- The projection of a vector $y$ onto a subspace (Col($X$)) is the vector in that subspace that forms the smallest angle with $y$.
- Smallest angle $\iff$ Maximum cosine similarity $\iff$ Maximum Correlation.
- Conclusion: Since OLS produces the vector in the column space of $X$ maximally correlated with $y$, the CCA direction $u$ must be proportional to the OLS coefficients $\hat{\beta}$. $u \propto (X^\top X)^{-1} X^\top y = \hat{\beta}_{OLS}$.
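A small numerical illustration (hypothetical data and seed) that no direction correlates with $y$ better than the fitted OLS combination, consistent with $u \propto \hat{\beta}_{OLS}$:

```python
import numpy as np

rng = np.random.default_rng(8)
n, p = 100, 3
X = rng.standard_normal((n, p))
y = X @ np.array([1.0, -0.5, 2.0]) + rng.standard_normal(n)
X = X - X.mean(axis=0)
y = y - y.mean()                                      # center so correlations are simple

def corr(a, b):
    return a @ b / np.sqrt((a @ a) * (b @ b))

beta_ols = np.linalg.solve(X.T @ X, X.T @ y)
best_random = max(corr(X @ rng.standard_normal(p), y) for _ in range(5000))
print(round(corr(X @ beta_ols, y), 4), round(best_random, 4))   # OLS direction wins
```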
FINAL3-Q1 Answer
(a) System of Equations
Goal: Write down the equations relating the loadings $l_1, l_2, l_3$ to the observed correlations $r_{ij}$.
Reasoning: In a standard one-factor model where variables are standardized, the correlation matrix $R$ (which equals the covariance matrix $\Sigma$) is modeled as:
$$R = \Sigma = LL^\top + \Psi$$For the off-diagonal elements (where $i \neq j$), the specific variance term $\Psi$ (which is diagonal) is zero. Therefore, the observed correlation between two variables is simply the product of their factor loadings:
$$r_{ij} = (LL^\top)_{ij} = l_i l_j$$Answer: Given the observed correlations $r_{12} = 0.6$, $r_{13} = 0.6$, and $r_{23} = 0.6$, the system of equations is:
- $$l_1 l_2 = 0.6$$
- $$l_1 l_3 = 0.6$$
- $$l_2 l_3 = 0.6$$
(b) The “Heywood Case” (Boundary Solution)
Goal: Solve for $l_1^2, l_2^2, l_3^2$, calculate the specific variance $\psi_1$, and determine if the model is valid.
Step 1: Solve for the squared loadings ($l_i^2$) We can solve for $l_1^2$ by combining the three equations derived in part (a).
$$l_1^2 = \frac{(l_1 l_2) \times (l_1 l_3)}{l_2 l_3}$$Substituting the observed values:
$$l_1^2 = \frac{0.6 \times 0.6}{0.6} = 0.6$$By symmetry (since all correlations are 0.6), the solutions for the others are identical:
$$l_2^2 = 0.6$$$$l_3^2 = 0.6$$Step 2: Calculate the specific variance $\psi_1$ In a standardized model, the total variance of each variable is 1. The model decomposes this variance into communality ($l_i^2$) and specific variance ($\psi_i$):
$$1 = l_i^2 + \psi_i$$$$\psi_1 = 1 - l_1^2$$Substituting our calculated value:
$$\psi_1 = 1 - 0.6 = \mathbf{0.4}$$Step 3: Is this a valid statistical model? Yes, this is a valid statistical model.
Explanation: A statistical model of variance is invalid if it produces a “Heywood case,” which is defined as a specific variance estimate that is negative ($\psi < 0$). Since variance cannot be negative, such a result would imply the model does not fit the data. In this specific calculation, we found $\psi_1 = 0.4$. Since $0.4 > 0$, the specific variance is positive and valid.
(Note: While the question title is “The Heywood Case,” this specific set of numbers ($r=0.6$) produces a valid solution. A true Heywood case arises when the implied communality exceeds 1; for example, if $r_{12} = r_{13} = 0.9$ and $r_{23} = 0.3$, then $l_1^2 = (0.9)(0.9)/0.3 = 2.7 > 1$ and $\psi_1 = 1 - 2.7 < 0$.)
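A small sketch solving the loadings from arbitrary pairwise correlations; the second correlation pattern is a hypothetical example that does produce a Heywood case.

```python
import numpy as np

def one_factor(r12, r13, r23):
    l1_sq = r12 * r13 / r23                      # l1^2 = (l1 l2)(l1 l3)/(l2 l3)
    l2_sq = r12 * r23 / r13
    l3_sq = r13 * r23 / r12
    l_sq = np.array([l1_sq, l2_sq, l3_sq])
    return l_sq, 1 - l_sq                        # communalities and specific variances

print(one_factor(0.6, 0.6, 0.6))                 # [0.6 0.6 0.6], [0.4 0.4 0.4]: valid
print(one_factor(0.9, 0.9, 0.3))                 # l1^2 = 2.7 > 1, psi1 = -1.7 < 0: Heywood
```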
(c) Rotation Invariance
Goal: Prove that the “Total Communality” $\sum_{i=1}^p h_i$ is invariant to rotation.
Definitions:
- Let $L$ be the $p \times r$ matrix of loadings.
- The communality of the $i$-th variable is $h_i = \sum_{j=1}^r l_{ij}^2$, which corresponds to the diagonal elements of the matrix product $LL^\top$.
- The “Total Communality” is the sum of these diagonal elements, which is the Trace of the matrix: $\text{Total Communality} = \text{tr}(LL^\top)$.
- Let $Q$ be an orthogonal rotation matrix, such that $Q Q^\top = Q^\top Q = I$.
- The rotated loadings are $L^* = LQ$.
Proof: We want to show that the Total Communality of $L^*$ is the same as that of $L$.
Write the Total Communality for the rotated loadings $L^*$:
$$\sum h_i^* = \text{tr}(L^* (L^*)^\top)$$Substitute $L^* = LQ$:
$$\text{tr}( (LQ)(LQ)^\top )$$Apply the transpose property $(AB)^\top = B^\top A^\top$:
$$\text{tr}( L Q Q^\top L^\top )$$Apply the property of orthogonal matrices ($Q Q^\top = I$):
$$\text{tr}( L I L^\top )$$$$\text{tr}( L L^\top )$$This equals the Total Communality of the original loadings $L$:
$$\text{tr}( L L^\top ) = \sum h_i$$
Conclusion: Since $\text{tr}(L^* (L^*)^\top) = \text{tr}(LL^\top)$, the Total Communality is invariant to orthogonal rotation.
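A quick check (hypothetical random $L$ and $Q$) that communalities survive an orthogonal rotation:

```python
import numpy as np

rng = np.random.default_rng(9)
p, r = 6, 2
L = rng.standard_normal((p, r))                        # hypothetical loadings
Q, _ = np.linalg.qr(rng.standard_normal((r, r)))       # random orthogonal rotation
L_star = L @ Q

print(np.allclose((L**2).sum(axis=1), (L_star**2).sum(axis=1)))    # each h_i is preserved
print(np.isclose(np.trace(L @ L.T), np.trace(L_star @ L_star.T)))  # so is the total communality
```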
FINAL3-Q2 Answer
(a) Covariance and PCA Rotation
Proposition: PCA applied to $S$ yields an arbitrary rotation (effectively the identity).
Proof: Let $S = (S_1, S_2)^\top$. The joint distribution has probability mass $1/4$ at each of the points $\{(0,1), (0,-1), (1,0), (-1,0)\}$.
- Mean Vector: $\mathbb{E}[S] = \frac{1}{4}(0,1) + \frac{1}{4}(0,-1) + \frac{1}{4}(1,0) + \frac{1}{4}(-1,0) = (0,0)^\top$.
- Covariance Matrix $\Sigma_S$:
- Variance: $\sigma_1^2 = \mathbb{E}[S_1^2] = \frac{1}{2}(0)^2 + \frac{1}{4}(1)^2 + \frac{1}{4}(-1)^2 = 0.5$. Similarly, $\sigma_2^2 = 0.5$.
- Covariance: $\mathbb{E}[S_1 S_2] = \sum p_i s_{1i} s_{2i} = \frac{1}{4}(0\cdot1 + 0\cdot(-1) + 1\cdot0 + (-1)\cdot0) = 0$.
- Thus, $\Sigma_S = \begin{pmatrix} 0.5 & 0 \\ 0 & 0.5 \end{pmatrix} = 0.5 I_2$.
- Eigen-decomposition: The characteristic equation is $\det(\Sigma_S - \lambda I) = (0.5 - \lambda)^2 = 0$. Eigenvalues are $\lambda_1 = \lambda_2 = 0.5$. Since eigenvalues are degenerate, any orthogonal basis in $\mathbb{R}^2$ serves as eigenvectors. PCA defaults to the canonical basis $I_2$. $\therefore$ No rotation occurs. $\blacksquare$
(b) Independence Check
Proposition: $S_1$ and $S_2$ are dependent.
Proof: We test the independence condition: $P(S_1 \in A, S_2 \in B) \stackrel{?}{=} P(S_1 \in A)P(S_2 \in B)$. Let $A = \{1\}$ and $B = \{1\}$.
- Joint Probability: $P(S_1=1, S_2=1) = 0$ (Point $(1,1)$ is not in the support).
- Marginal Probabilities: $P(S_1=1) = P((1,0)) = 1/4$. $P(S_2=1) = P((0,1)) = 1/4$.
- Comparison: $P(S_1=1)P(S_2=1) = \frac{1}{4} \cdot \frac{1}{4} = \frac{1}{16}$. Since $0 \neq \frac{1}{16}$, $S_1 \not\perp S_2$. $\blacksquare$
(c) Kurtosis Calculation
Proposition: $S_1$ is sub-Gaussian with $\mathcal{K}(S_1) = -1$.
Proof: From (a), $\mathbb{E}[S_1] = 0$ and $\mathbb{E}[S_1^2] = 0.5$.
- Fourth Moment: $\mathbb{E}[S_1^4] = \sum p_i s_{1i}^4 = \frac{1}{2}(0)^4 + \frac{1}{4}(1)^4 + \frac{1}{4}(-1)^4 = 0 + 0.25 + 0.25 = 0.5$.
- Excess Kurtosis: $\mathcal{K}(S_1) = \frac{\mathbb{E}[S_1^4]}{(\mathbb{E}[S_1^2])^2} - 3 = \frac{0.5}{(0.5)^2} - 3 = \frac{0.5}{0.25} - 3 = 2 - 3 = -1$. Since $\mathcal{K} < 0$, the distribution is platykurtic (sub-Gaussian). $\blacksquare$
(d) ICA Objective
Proposition: Maximizing non-Gaussianity recovers the sources.
Argument (via the CLT): Let $y = w^\top S$ be a linear mixture. By the Central Limit Theorem heuristic, a mixture of independent components $S_i$ is closer to Gaussian than the individual components. Since $\mathcal{K}_{Gauss} = 0$ while a source has $\mathcal{K}_{Source} = -1$, it follows that $|\mathcal{K}(y_{mixture})| < |\mathcal{K}(y_{source})|$. $\therefore \max_w |\mathcal{K}(w^\top S)| \iff w \text{ aligns with a source axis}$. $\blacksquare$
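A numerical companion to parts (a)-(c) (not part of the original solution): the moments of the four-point distribution computed directly from its probability mass function. The 45-degree projection anticipates the rotation discussed in the Mock Final explanations further below.

```python
import numpy as np

points = np.array([[0, 1], [0, -1], [1, 0], [-1, 0]], dtype=float)
probs = np.full(4, 0.25)

mean = probs @ points                                    # (0, 0)
cov = (points - mean).T @ np.diag(probs) @ (points - mean)
print(cov)                                               # 0.5 * I_2

def excess_kurtosis(vals):
    m2 = probs @ vals**2
    m4 = probs @ vals**4
    return m4 / m2**2 - 3

print(excess_kurtosis(points[:, 0]))                     # -1.0: S1 is sub-Gaussian

w45 = np.array([1.0, 1.0]) / np.sqrt(2)                  # 45-degree direction
print(excess_kurtosis(points @ w45))                     # -2.0: two-point projection
```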
FINAL3-Q3 Answer
(a) Proof of Scale-Invariance
Goal: Prove that $\mathcal{K}(wX) = \mathcal{K}(X)$ for any non-zero scalar $w$.
Definition: The excess kurtosis of a random variable $X$ is defined as:
$$\mathcal{K}(X) = \frac{E[(X - \mu_X)^4]}{(\sigma_X^2)^2} - 3$$Proof:
1. Determine the Mean and Variance of the Scaled Variable $wX$ First, let us define the properties of the scaled variable $wX$.
- Mean ($\mu_{wX}$): Using the linearity of expectation, $$\mu_{wX} = E[wX] = w E[X] = w \mu_X$$
- Variance ($\sigma_{wX}^2$): Using the property $Var(cX) = c^2 Var(X)$, $$\sigma_{wX}^2 = E[(wX - \mu_{wX})^2] = w^2 \sigma_X^2$$
2. Substitute into the Kurtosis Formula Now, substitute these terms into the definition of kurtosis for the variable $wX$:
$$\mathcal{K}(wX) = \frac{E[(wX - \mu_{wX})^4]}{(\sigma_{wX}^2)^2} - 3$$3. Expand and Simplify
Numerator: Substitute $\mu_{wX} = w \mu_X$
$$E[(wX - w\mu_X)^4] = E[(w(X - \mu_X))^4] = w^4 E[(X - \mu_X)^4]$$Denominator: Substitute $\sigma_{wX}^2 = w^2 \sigma_X^2$
$$(\sigma_{wX}^2)^2 = (w^2 \sigma_X^2)^2 = w^4 (\sigma_X^2)^2$$
4. Final Cancellation Substitute the simplified numerator and denominator back into the main equation:
$$\mathcal{K}(wX) = \frac{w^4 E[(X - \mu_X)^4]}{w^4 (\sigma_X^2)^2} - 3$$Since $w \neq 0$, the $w^4$ terms cancel out:
$$\mathcal{K}(wX) = \frac{E[(X - \mu_X)^4]}{(\sigma_X^2)^2} - 3$$By definition, the right-hand side is exactly $\mathcal{K}(X)$.
Conclusion:
$$\mathcal{K}(wX) = \mathcal{K}(X)$$(This confirms that kurtosis is scale-invariant; amplifying a signal does not change its non-Gaussianity.)
Part (b): Derivation of the Objective Function
Goal: Derive the expression for $\mathcal{K}(y)$ solely in terms of $\kappa, w_1$, and $w_2$.
Given:
- $y = w_1 z_1 + w_2 z_2$.
- $z_1, z_2$ have zero mean, unit variance ($Var(z_i)=1$), and identical kurtosis $\mathcal{K}(z_i) = \kappa$.
- Whitening constraint: $w_1^2 + w_2^2 = 1$.
- Formula given in the exercise (kurtosis of a sum of independent terms): $$\mathcal{K}(A+B) = \frac{\sigma_A^4 \mathcal{K}(A) + \sigma_B^4 \mathcal{K}(B)}{(\sigma_A^2 + \sigma_B^2)^2}$$
- Scale Invariance (from Part a): $\mathcal{K}(cX) = \mathcal{K}(X)$.
Step 1: Define Variables Let $A = w_1 z_1$ and $B = w_2 z_2$. Thus $y = A + B$.
Step 2: Calculate Variance and Kurtosis for A and B
- Variance ($\sigma^2$):
- $\sigma_A^2 = Var(w_1 z_1) = w_1^2 Var(z_1) = w_1^2 (1) = w_1^2$.
- $\sigma_B^2 = Var(w_2 z_2) = w_2^2 Var(z_2) = w_2^2 (1) = w_2^2$.
- Kurtosis ($\mathcal{K}$):
- Using the Scale Invariance property proved in part (a):
- $\mathcal{K}(A) = \mathcal{K}(w_1 z_1) = \mathcal{K}(z_1) = \kappa$.
- $\mathcal{K}(B) = \mathcal{K}(w_2 z_2) = \mathcal{K}(z_2) = \kappa$.
Step 3: Substitute into the Sum Formula
$$\mathcal{K}(y) = \frac{(w_1^2)^2 \cdot \kappa + (w_2^2)^2 \cdot \kappa}{(w_1^2 + w_2^2)^2}$$$$\mathcal{K}(y) = \frac{\kappa (w_1^4 + w_2^4)}{(w_1^2 + w_2^2)^2}$$Step 4: Apply the Whitening Constraint We are given that $w_1^2 + w_2^2 = 1$. The denominator becomes $1^2 = 1$.
Final Expression:
$$\mathcal{K}(y) = \kappa (w_1^4 + w_2^4)$$
(c) Optimization & Source Recovery
1. Mathematical Proof
Goal: Maximize $J(w) = |\mathcal{K}(y)| = |\kappa(w_1^4 + w_2^4)|$ subject to $w_1^2 + w_2^2 = 1$.
Since $\kappa$ is a constant and $w_1^4, w_2^4$ are non-negative, maximizing $|\mathcal{K}(y)|$ is equivalent to maximizing the sum $f(w_1, w_2) = w_1^4 + w_2^4$.
Substitution: Let $a = w_1^2$. From the constraint $w_1^2 + w_2^2 = 1$, we have $w_2^2 = 1 - a$. Since $w$ consists of real numbers, $w_1^2$ must be between 0 and 1. So, the domain is $a \in [0, 1]$.
Substitute these into our function to create a single-variable function $g(a)$:
$$g(a) = a^2 + (1-a)^2$$Expand the function:
$$g(a) = a^2 + (1 - 2a + a^2)$$$$g(a) = 2a^2 - 2a + 1$$Optimization: To find the extrema, take the derivative with respect to $a$:
$$g'(a) = 4a - 2$$Set the derivative to 0 to find the critical point:
$$4a - 2 = 0 \implies a = 0.5$$Now, evaluate $g(a)$ at the critical point and the boundaries (endpoints) of the interval $[0, 1]$:
- At $a = 0.5$: $g(0.5) = 2(0.5)^2 - 2(0.5) + 1 = 0.5 - 1 + 1 = 0.5$ (This is the minimum).
- At $a = 0$ (Boundary): $g(0) = 2(0)^2 - 2(0) + 1 = 1$ (This is a maximum).
- At $a = 1$ (Boundary): $g(1) = 2(1)^2 - 2(1) + 1 = 1$ (This is a maximum).
Conclusion: The maximum value occurs only at the boundaries $a=0$ and $a=1$.
- If $a = 1$, then $w_1^2 = 1$ and $w_2^2 = 0$. This corresponds to the solution $w = (1, 0)$.
- If $a = 0$, then $w_1^2 = 0$ and $w_2^2 = 1$. This corresponds to the solution $w = (0, 1)$.
Thus, the maximum occurs only at the endpoints.
2. Physical Meaning
Question: What does the solution $w = (1, 0)$ imply about the relationship between our recovered signal $y$ and the original sources $z_1, z_2$?
Explanation: If the optimization results in the weight vector $w = (1, 0)$, the recovered signal is:
$$y = 1 \cdot z_1 + 0 \cdot z_2$$$$y = z_1$$Meaning: This implies that we have successfully separated the mixed signal and perfectly recovered one of the original independent sources ($z_1$).
In the context of Independent Component Analysis (ICA): According to the Central Limit Theorem, a mixture of independent non-Gaussian variables will be more Gaussian (have lower absolute kurtosis) than the individual sources. Therefore, by maximizing non-Gaussianity (maximizing kurtosis), we drive the weights away from a “mixture” (like $a=0.5$) and towards the boundaries, allowing us to isolate the original, pure source signals.
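A Monte Carlo sketch of (b)-(c) under the assumption of two independent uniform sources (so $\kappa = -1.2$); the sources, seed, and angle grid are hypothetical. The empirical kurtosis of $y = w_1 z_1 + w_2 z_2$ tracks $\kappa(w_1^4 + w_2^4)$ and is largest in absolute value at the endpoints $w = (1,0)$ or $(0,1)$.

```python
import numpy as np

rng = np.random.default_rng(10)
N = 200_000
z = rng.uniform(-np.sqrt(3), np.sqrt(3), size=(2, N))    # zero-mean, unit-variance sources
kappa = -1.2                                             # excess kurtosis of a uniform

def excess_kurtosis(y):
    y = y - y.mean()
    return np.mean(y**4) / np.mean(y**2) ** 2 - 3

for theta in np.deg2rad([0, 15, 30, 45]):
    w = np.array([np.cos(theta), np.sin(theta)])         # w1^2 + w2^2 = 1
    emp = excess_kurtosis(w @ z)
    theory = kappa * (w[0]**4 + w[1]**4)
    print(f"theta={np.rad2deg(theta):4.1f}  empirical={emp:+.3f}  theory={theory:+.3f}")
# |kurtosis| is maximal at theta = 0, i.e. when y equals a single source.
```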
Mock Final Explanations
1. In F3Q2, where does the difference between PCA and ICA show up? What is the ICA algorithm?
This is a deep question, and it is the ultimate point this problem is testing.
To tell the two concepts apart in the context of this problem, we can use the analogy of **looking at a cloud**.
Imagine this data (the diamond-shaped distribution) as a cloud.
1. PCA's view: only "how spread out" it is (Variance)
PCA maximizes variance. Variance is the degree to which the data is spread out, its "fatness".
- PCA's logic: PCA takes out a ruler and tries to find the longest direction of the cloud. It declares the longest direction to be the most important "principal component".
- In this problem:
- PCA measures the horizontal direction ($S_1$): the variance is 0.5.
- PCA measures the vertical direction ($S_2$): the variance is 0.5.
- PCA tries the 45-degree diagonal: the variance still comes out to 0.5 (in its second-order statistics the diamond is isotropic, like a ball).
- PCA's confusion: PCA says, "This data is round! It is equally spread out in every direction, so I cannot find a 'longest' direction to rotate onto. Fine, I will leave it alone (and output the identity matrix)."
- Result: since PCA does not rotate, it hands the original data straight back to you. But the original $S_1$ and $S_2$ are not independent (for example, when $S_1 = 1$, $S_2$ must be 0). PCA does nothing to resolve this dependence, because it cannot "see" it; it only sees spread.
2. ICA's view: the "skeleton/shape" (Independence)
ICA pursues independence. Independence means: knowing the value of $x$ is of no help at all in guessing the value of $y$.
ICA's logic: ICA does not care which direction of the data is longest; it cares whether the shape of the data is "tidy" (i.e., whether it is built from independent source signals). It knows that if two signals are genuinely independent, their joint distribution usually looks like a square (or rectangle) with straight edges.
In this problem:
- Current state (the diamond): looking at $S_1$ and $S_2$, ICA notices: "Something is off. They are uncorrelated (covariance 0), yet they constrain each other. A diamond means that when $S_1$ becomes large, $S_2$ is forced down to 0. That is not independence!"
- Try rotating: ICA starts rotating the coordinate axes.
- Discovering the truth (the square): what happens if ICA rotates the axes by 45 degrees? The original diamond becomes an axis-aligned square.
ICA's delight: inside a square (corresponding to a uniform-type distribution), telling me the value of $x$ puts no restriction on the possible values of $y$. That is true independence!
How does it find this? The tool ICA uses is exactly the kurtosis you computed.
- Along the diamond's axes the kurtosis is -1.
- Along the square's axes (after the 45-degree rotation) the absolute kurtosis is larger (more non-Gaussian).
- ICA keeps moving in the direction where the absolute kurtosis grows, and eventually rotates the "diamond" into a "square" for you.
Summary comparison
In the context of this specific problem:
| Aspect | PCA (maximize variance) | ICA (pursue independence / non-Gaussianity) |
|---|---|---|
| Focus | Energy / magnitude | Information / structure |
| How it sees the data | It sees only a **"ball"** (isotropic). | It recognizes a **"diamond"** (with corners). |
| Behavior on this problem | Fails. Every direction has the same variance, so it cannot pick a rotation angle and simply does nothing. | Succeeds. It notices that, although the data looks like a "ball", its internal structure is not independent, and it rotates the axes until it finds the most "sharp-cornered" (independent) orientation. |
| Outcome | Achieves only Uncorrelated. | Achieves Independent. |
One-sentence summary: for the data in this problem, PCA is like a near-sighted eye that only sees a blurry round blob (equal variances) and decides nothing can be separated, while ICA is like a microscope that sees the sharp structure at the edges (non-Gaussianity) and, by rotating, finds the truly independent sources.
About the ICA algorithm:
In the context of this problem (and of the classic FastICA algorithm), the ICA procedure can be pictured as a **two-stage transformation**.
Taking the diamond data $(S_1, S_2)$ from this problem as the "observed signals" we have to process, the concrete ICA steps are as follows:
Stage 1: Preprocessing (putting the data into standard position)
The goal of this stage is not independence but simplifying the later computation. This step is exactly what PCA does.
1. Centering
- Action: move the center of mass of the data to the origin $(0,0)$.
- In this problem: as computed above, the mean is already $\mathbb{E}[S] = (0,0)^\top$.
- Result: nothing to do, skip.
2. Whitening / Sphering
- Action: remove the correlation between directions and normalize every variance to 1.
- In this problem:
- The covariance matrix is $\Sigma = 0.5I$.
- The data are uncorrelated, but the variance is 0.5, not 1.
- Operation: divide all the data by the standard deviation $\sqrt{0.5}$.
- Point $(0, 1) \rightarrow (0, \sqrt{2})$
- Point $(1, 0) \rightarrow (\sqrt{2}, 0)$
- Result: the covariance matrix of the data is now the identity $I$. Geometrically the data is still a diamond, just slightly enlarged.
- Key point: PCA's work ends here. Seeing that the covariance is already $I$, it considers the job done and stops.
Stage 2: ICA iteration (searching for the "corners")
This is where ICA really earns its keep. Its task is to rotate the whitened diamond until it finds the angle where non-Gaussianity is strongest.
3. Initialization
- The algorithm picks a random direction vector $w$ (say, pointing at 10 degrees).
4. Projection and scoring non-Gaussianity
This is the core loop of the algorithm:
- Project: project all 2-D points onto the vector $w$, giving one-dimensional data $y = w^\top S$.
- Score: compute the absolute kurtosis $|\mathcal{K}(y)|$ of this one-dimensional data.
Compare the scores at two key angles:
Suppose $w$ points at 0 degrees (an original axis):
- The projected values are $\{0, \sqrt{2}, -\sqrt{2}\}$ (the whitened values).
- As computed in the problem, the standardized kurtosis is -1.
- Score: $|-1| = \mathbf{1}$.
Suppose $w$ points at 45 degrees (the rotated axis):
- If we rotate the axes by 45 degrees, the diamond vertices $(0, \sqrt{2})$ and $(\sqrt{2}, 0)$ project onto the new axis as only two values, $+\alpha$ and $-\alpha$ (a two-point distribution).
- For a two-point distribution (like the two sides of a coin) the excess kurtosis is -2 (the smallest possible among all distributions, i.e., the least Gaussian).
- Score: $|-2| = \mathbf{2}$.
5. Update and convergence
- The algorithm compares the scores: 2 > 1.
- Action: the algorithm adjusts the angle of $w$, turning it gradually from 0 degrees toward 45 degrees.
- Principle: this is exactly the "maximize non-Gaussianity" stated in the problem. Like hill climbing, the algorithm rotates in the direction in which the absolute kurtosis increases.
Final result: from diamond to square
When the ICA algorithm converges at 45 degrees, it has found the new basis vectors.
- Geometric meaning:
- The original data form a diamond (edges not parallel to the axes).
- After ICA rotates by 45 degrees, drawing the new basis vectors shows the data points sitting at the four corners of a square.
- Statistical meaning:
- In the square configuration the data are not merely uncorrelated but independent.
- ICA has successfully turned the "uncorrelated but dependent" diamond signals back into "independent" source signals (typically a mixture of two independent uniform or two-point signals).
Summary of the ICA steps
- Subtract the mean (move back to the origin).
- Divide by the standard deviation (whitening, the PCA part, turning the data into a ball / upright diamond).
- Spin around (rotate the coordinate axes).
- Compute the kurtosis (see which angle looks most "peaked" or most "square").
- Fix the coordinates (stop at the angle with the largest absolute kurtosis, here 45 degrees).
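A minimal sketch of the whiten-then-rotate recipe above, applied to the four-point diamond distribution: scanning rotation angles shows the absolute kurtosis of the projection peaking at 45 degrees.

```python
import numpy as np

points = np.array([[0, 1], [0, -1], [1, 0], [-1, 0]], dtype=float)
probs = np.full(4, 0.25)
whitened = points / np.sqrt(0.5)               # covariance was 0.5*I, now the identity

def abs_kurtosis(theta):
    w = np.array([np.cos(theta), np.sin(theta)])
    y = whitened @ w                           # projection onto direction w
    m2 = probs @ y**2
    m4 = probs @ y**4
    return abs(m4 / m2**2 - 3)

for deg in (0, 15, 30, 45):
    print(deg, round(abs_kurtosis(np.deg2rad(deg)), 3))
# 0 -> 1.0 and 45 -> 2.0: the kurtosis criterion rotates the diamond onto a square.
```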
MF3Q1 - Factor Analysis Notes
Question 1: For the standardized orthogonal single-factor model, what do $R$, Epsilon, and Phi look like?
As requested, let us write out explicitly the matrix forms of $R$ (the correlation matrix) and of the quantities related to Epsilon ($\epsilon$) in this problem.
One distinction first: $\epsilon$ itself is a random vector, while in the matrix algebra of factor analysis the object we usually need is its covariance matrix, written $\Psi$ (Psi).
Below are their explicit forms, and how to express them using $r$ (the observed correlations) and $l$ (the factor loadings).
1. The matrix $R$ (Correlation Matrix)
$R$ is the correlation matrix of the observed variables. Because the problem assumes standardized variables, the diagonal entries (the variances) are all 1.
Form A: in terms of the observed correlations $r$
$$ R = \begin{bmatrix} 1 & r_{12} & r_{13} \\ r_{12} & 1 & r_{23} \\ r_{13} & r_{23} & 1 \end{bmatrix} $$(Plugging in the values from the problem:)
$$ R = \begin{bmatrix} 1 & 0.6 & 0.6 \\ 0.6 & 1 & 0.6 \\ 0.6 & 0.6 & 1 \end{bmatrix} $$Form B: in terms of the factor loadings $l$. By the structure of the single-factor model, the off-diagonal elements are $r_{ij} = l_i l_j$.
$$ R = \begin{bmatrix} 1 & l_1 l_2 & l_1 l_3 \\ l_1 l_2 & 1 & l_2 l_3 \\ l_1 l_3 & l_2 l_3 & 1 \end{bmatrix} $$2. About Epsilon ($\epsilon$)
There are two separate objects here: the error vector itself and its covariance matrix.
(1) The random vector $\epsilon$
This is the error vector in the model equation $X = L Z + \epsilon$:
$$ \epsilon = \begin{bmatrix} \epsilon_1 \\ \epsilon_2 \\ \epsilon_3 \end{bmatrix} $$(This is a vector of random variables; it has no fixed numerical values, only statistical properties.)
(2) The specific variance matrix $\Psi$
This is the matrix actually used in the calculations (i.e., $\text{Cov}(\epsilon) = \Psi$). By the assumptions of factor analysis the error terms are mutually uncorrelated, so it is a diagonal matrix.
Definition:
$$ \Psi = \begin{bmatrix} \psi_1 & 0 & 0 \\ 0 & \psi_2 & 0 \\ 0 & 0 & \psi_3 \end{bmatrix} $$In terms of the factor loadings $l$: in the standardized model the total variance of each variable is 1, and total variance = communality ($l_i^2$) + specific variance ($\psi_i$). Hence $\psi_i = 1 - l_i^2$.
$$ \Psi = \begin{bmatrix} 1 - l_1^2 & 0 & 0 \\ 0 & 1 - l_2^2 & 0 \\ 0 & 0 & 1 - l_3^2 \end{bmatrix} $$3. The core relationship between $R$ and $\Psi$
The heart of factor analysis is decomposing $R$ into a "common part" and a "specific part". Written out explicitly in matrices:
$$ R = L L^\top + \Psi $$Expanded, this reads:
$$ \underbrace{\begin{bmatrix} 1 & r_{12} & r_{13} \\ r_{12} & 1 & r_{23} \\ r_{13} & r_{23} & 1 \end{bmatrix}}_{R}=\underbrace{\begin{bmatrix} l_1^2 & l_1 l_2 & l_1 l_3 \\ l_1 l_2 & l_2^2 & l_2 l_3 \\ l_1 l_3 & l_2 l_3 & l_3^2 \end{bmatrix}}_{\text{common part } (LL^\top)}+\underbrace{\begin{bmatrix} 1 - l_1^2 & 0 & 0 \\ 0 & 1 - l_2^2 & 0 \\ 0 & 0 & 1 - l_3^2 \end{bmatrix}}_{\text{specific part } (\Psi)} $$Note the diagonal:
$$1 = l_i^2 + (1 - l_i^2)$$which confirms that the decomposition is consistent.
As for why $l_1 l_2 = \text{corr}(X_1, X_2)$ is exactly correct (in the orthogonal single-factor model), the reason is that in the factor model every "connection" between $X_1$ and $X_2$ is transmitted through the common factor $Z$.
Mathematical derivation (Proof)
We want the correlation between $X_1$ and $X_2$.
1. Model definition. From the problem, the single-factor model equations are:
$$X_1 = l_1 Z + \epsilon_1$$$$X_2 = l_2 Z + \epsilon_2$$2. Key assumptions. For the derivation to hold, factor analysis makes a few standard assumptions:
- Standardization: $X$ and $Z$ have mean 0 and variance 1; in particular $Var(Z)=1$.
- Independence A: the common factor $Z$ is uncorrelated with the error terms $\epsilon$.
- Independence B: the error terms $\epsilon_1$ and $\epsilon_2$ are uncorrelated with each other. This is the key point, because $\epsilon$ represents the "specific" variance that belongs only to its own variable.
3. Compute the correlation. Because the variables are standardized (variance 1), the correlation equals the covariance:
$$r_{12} = \text{Corr}(X_1, X_2) = \text{Cov}(X_1, X_2)$$Substitute the model equations:
$$\text{Cov}(X_1, X_2) = \text{Cov}(l_1 Z + \epsilon_1, \quad l_2 Z + \epsilon_2)$$Expand using the bilinearity of covariance (like the distributive law):
$$ \begin{aligned} &= \text{Cov}(l_1 Z, l_2 Z) && \text{(Term 1: factor with factor)} \\ &+ \text{Cov}(l_1 Z, \epsilon_2) && \text{(Term 2: factor with error, equals 0)} \\ &+ \text{Cov}(\epsilon_1, l_2 Z) && \text{(Term 3: error with factor, equals 0)} \\ &+ \text{Cov}(\epsilon_1, \epsilon_2) && \text{(Term 4: error with error, equals 0)} \end{aligned} $$4. Simplify.
By Independence A, Terms 2 and 3 are 0.
By Independence B, Term 4 is 0.
Only Term 1 remains:
$$\text{Cov}(l_1 Z, l_2 Z) = l_1 l_2 \times \text{Cov}(Z, Z)$$$$\text{Cov}(Z, Z) = \text{Var}(Z)$$By the standardization assumption, $Var(Z) = 1$.
5. Conclusion
$$r_{12} = l_1 l_2 \times 1 = l_1 l_2$$Intuition
Think of this as a path: $X_1$ and $X_2$ do not know each other directly.
- $X_1$ is connected to the common hub $Z$ with strength $l_1$.
- $X_2$ is connected to the common hub $Z$ with strength $l_2$.
Their "overlap" (the correlation) is the product of the strengths of these two paths.
Question 2: For the unstandardized orthogonal single-factor model, what do $R$, Epsilon, and Phi look like?
If the model is not standardized, things become slightly more involved, but the core logic survives at the level of covariances.
In the unstandardized case the loadings are usually written with the Greek letter $\lambda$ (lambda), to distinguish them from the standardized loadings $l$.
Here is a summary of the key changes in the unstandardized orthogonal single-factor model:
1. The core difference in one sentence
- Standardized model: product of loadings = correlation ($l_i l_j = r_{ij}$)
- Unstandardized model: product of loadings = covariance ($\lambda_i \lambda_j = \sigma_{ij}$)
2. The detailed mathematical relationships
Assume the model $X = \mu + \Lambda Z + \epsilon$ with $Var(Z) = 1$ (even without standardization, the factor variance is usually fixed at 1 to pin down the scale).
(1) Structure of the covariance matrix $\Sigma$
This is the nicest part of the unstandardized model: the form is still very clean.
$$ \Sigma = \Lambda \Lambda^\top + \Psi $$Written explicitly:
$$ \Sigma = \begin{bmatrix} \sigma_1^2 & \sigma_{12} & \sigma_{13} \\ \sigma_{21} & \sigma_2^2 & \sigma_{23} \\ \sigma_{31} & \sigma_{32} & \sigma_3^2 \end{bmatrix}=\begin{bmatrix} \lambda_1^2 + \psi_1 & \lambda_1 \lambda_2 & \lambda_1 \lambda_3 \\ \lambda_1 \lambda_2 & \lambda_2^2 + \psi_2 & \lambda_2 \lambda_3 \\ \lambda_1 \lambda_3 & \lambda_2 \lambda_3 & \lambda_3^2 + \psi_3 \end{bmatrix} $$- Diagonal (variances): $\text{Var}(X_i) = \sigma_i^2 = \lambda_i^2 + \psi_i$
- Note: these are no longer equal to 1; they equal the actual variances of the data.
- Off-diagonal (covariances): $\text{Cov}(X_i, X_j) = \sigma_{ij} = \lambda_i \lambda_j$
(2) Structure of the correlation matrix $R$ (now messier)
If you want the relationship between the unstandardized loadings $\lambda$ and the correlations $r$, the formula is no longer as simple and elegant.
By the definition of the correlation coefficient, $r_{ij} = \frac{\text{Cov}(X_i, X_j)}{\sqrt{\text{Var}(X_i)\text{Var}(X_j)}}$.
Substituting the results above:
$$ r_{ij} = \frac{\lambda_i \lambda_j}{\sqrt{(\lambda_i^2 + \psi_i)(\lambda_j^2 + \psi_j)}} $$or equivalently:
$$ r_{ij} = \frac{\lambda_i \lambda_j}{\sigma_i \sigma_j} $$3. Converting between standardized loadings ($l$) and unstandardized loadings ($\lambda$)
The two models can be converted into one another.
Given the unstandardized loading $\lambda_i$ and the total standard deviation $\sigma_i$ of the variable, the standardized loading $l_i$ that appears in the problem is obtained via:
$$ l_i = \frac{\lambda_i}{\sigma_i} = \frac{\text{unstandardized loading}}{\text{total standard deviation of the variable}} $$Substituting this back into the messier formula above returns you to the simple formula we started with:
$$ r_{ij} = \left( \frac{\lambda_i}{\sigma_i} \right) \left( \frac{\lambda_j}{\sigma_j} \right) = l_i l_j $$Summary table
| Property | Standardized model | Unstandardized model |
|---|---|---|
| Loading symbol | $l_i$ | $\lambda_i$ |
| Total variance of a variable | $1$ | $\sigma_i^2 = \lambda_i^2 + \psi_i$ |
| Specific variance ($\psi$) | $\psi_i = 1 - l_i^2$ | $\psi_i = \sigma_i^2 - \lambda_i^2$ |
| Relationship between two variables | $r_{ij} = l_i l_j$ (correlation) | $\sigma_{ij} = \lambda_i \lambda_j$ (covariance) |
So the reason the problem assumes "Standardized" variables is precisely to make $r_{ij} = l_i l_j$ hold, which lets you solve the system of equations directly. Without standardization, the products you solve for would be covariances.
Why, in the context of this problem, are the covariance matrix and the correlation matrix the same?
This is an excellent question; it touches the mathematical core of Factor Analysis.
- What is $R$?
In the context of statistics and factor analysis, what $R$ (the Correlation Matrix) "is" can be understood on three levels: the data level (what it looks like), the definition level (how it is computed), and the model level (its role in the equation).
Here are the concrete forms of $R$:
1. Data level: what does it look like? (matrix form)
This is the most direct description. $R$ is a symmetric matrix with 1s on the diagonal and the pairwise correlations off the diagonal.
For the 3-dimensional variable ($X_1, X_2, X_3$) in your problem, the explicit form of $R$ is:
$$ R = \begin{bmatrix} 1 & r_{12} & r_{13} \\ r_{21} & 1 & r_{23} \\ r_{31} & r_{32} & 1 \end{bmatrix} $$- Diagonal ($1$): because the variables are standardized, each variable's correlation with itself (i.e., its variance) is 1.
- Off-diagonal ($r_{ij}$): the correlation between variables $i$ and $j$.
- Symmetry: $r_{12} = r_{21}$, and so on.
Plugging in the value from your problem ($r = 0.6$):
$$ R = \begin{bmatrix} 1 & 0.6 & 0.6 \\ 0.6 & 1 & 0.6 \\ 0.6 & 0.6 & 1 \end{bmatrix} $$2. Definition level: where does it come from? (mathematical definition)
Tracing $R$ back to its mathematical source, it is the expectation defined by the standardized random vector $X$.
Let $X$ be a column vector that is already standardized (mean 0, variance 1):
$$X = \begin{bmatrix} X_1 \\ X_2 \\ X_3 \end{bmatrix}$$Then the strict mathematical definition of $R$ is the expectation of $X$ times its own transpose:
$$ R = E[X X^\top] $$Expanded:
$$ R = E \left( \begin{bmatrix} X_1 \\ X_2 \\ X_3 \end{bmatrix} \begin{bmatrix} X_1 & X_2 & X_3 \end{bmatrix} \right) = E \begin{bmatrix} X_1^2 & X_1 X_2 & X_1 X_3 \\ X_2 X_1 & X_2^2 & X_2 X_3 \\ X_3 X_1 & X_3 X_2 & X_3^2 \end{bmatrix} $$- $E[X_i^2] = 1$ (the variance after standardization)
- $E[X_i X_j] = r_{ij}$ (after standardization, the covariance is the correlation)
3. Model level: its role in the equation
Return to the equation you asked about:
$$LL^\top + \Psi = R$$Here, $R$ is the known quantity (the observed data).
- $R$ is the "outcome" / "target": it is the result computed from the data we collect in the real world.
- $LL^\top + \Psi$ is the "cause" / "explanation": it is the mathematical structure we try to construct to approximate this $R$.
In one sentence: $R$ itself is just a real matrix made of $1$s and $r_{ij}$s; it is the object that factor analysis sets out to decompose and reconstruct.
Next, a small correction of notation, to avoid confusion later. Based on the problem statement you provided and the standard definitions of factor analysis:
- What you wrote as $\Phi$ (Phi) corresponds in the problem to $\Psi$ (Psi), the diagonal matrix of the $\psi_i$, representing the specific variances.
- What you wrote as Epsilon is actually $\Sigma$ (capital Sigma) in the problem statement.
- The original wording is: "Assume standardized variables so that $\Sigma = R$".
- $\Sigma$ denotes the covariance matrix.
- $\epsilon$ (lowercase epsilon) usually denotes the error term itself, not a matrix.
So the equation you actually meant to ask about is:
$$LL^\top + \Psi = \Sigma = R$$Below is why this equation holds. It contains two layers of logic:
Layer 1: why is $\Sigma = R$?
Reason: standardization
This is a strong assumption stated in the problem: "Assume standardized variables".
- $\Sigma$ (covariance matrix): measures the absolute co-variation between variables; its diagonal holds the variances ($\sigma^2$).
- $R$ (correlation matrix): measures the relative linear relationship between variables; its diagonal is always 1.
Once the variables are standardized (each variable has its mean subtracted and is divided by its standard deviation):
- Every variable has mean 0.
- Every variable has variance 1.
- By the formula $\text{Corr}(X,Y) = \frac{\text{Cov}(X,Y)}{\sqrt{\text{Var}(X)}\sqrt{\text{Var}(Y)}}$, when both terms in the denominator are 1, the covariance equals the correlation.
So here $\Sigma$ and $R$ are the same object.
Layer 2: why is $\Sigma = LL^\top + \Psi$?
Reason: variance decomposition
This is the definition of the factor analysis model. We view the observed data $X$ as composed of two parts:
$$X = \underbrace{L Z}_{\text{common part}} + \underbrace{\epsilon}_{\text{specific part}}$$When we compute the covariance matrix $\Sigma$ of $X$, the property $\text{Var}(A+B) = \text{Var}(A) + \text{Var}(B)$ (valid when $A$ and $B$ are independent) lets us split the total variance:
$$ \begin{aligned} \Sigma &= \text{Cov}(X) \\ &= \text{Cov}(LZ + \epsilon) \\ &= \text{Cov}(LZ) + \text{Cov}(\epsilon) \quad \text{(since the common factor } Z \text{ and the specific factor } \epsilon \text{ are assumed independent)} \end{aligned} $$We look at the two terms separately:
Common part $\text{Cov}(LZ)$:
$$L \cdot \text{Cov}(Z) \cdot L^\top$$In the orthogonal factor model the factors $Z$ are assumed to have variance 1 and to be mutually uncorrelated, i.e., $\text{Cov}(Z) = I$ (the identity matrix). So this term becomes $LL^\top$.
Specific part $\text{Cov}(\epsilon)$: this is the covariance matrix of the specific factors $\epsilon$. They are assumed mutually uncorrelated, so it is a diagonal matrix, written $\Psi$.
Putting them together:
$$\Sigma = LL^\top + \Psi$$Summary: what the full equation means
The long equation $LL^\top + \Psi = \Sigma = R$ is really saying:
- $R$: the reality we observe (the correlations computed from the data).
- $LL^\top + \Psi$: the model we build (common influence + unique influence).
- The equals sign ($=$): our goal is for the model to reproduce reality exactly.
Intuition:
"If our model is correct, then every observed correlation ($R$) should be decomposable into the 'contribution of the common factor' ($LL^\top$) plus 'each variable's own noise' ($\Psi$)."