Key Reminders

  1. Matrix differentiation and properties of the trace: in the Regression and PCA derivations (especially Question 1c and Question 2a), the cyclic property of the trace ($tr(ABC)=tr(CAB)$) is the key to the solution.
  2. PCA vs. FA: exams love asking “Why is specific variance important?” or making you compute a communality. Remember that FA splits the variance into a common part (through $L$) and a unique part ($\Psi$), while PCA makes no such distinction.
  3. Extremes of kurtosis: the core idea of ICA is that the Central Limit Theorem tells us mixed signals look more Gaussian, so we maximize non-Gaussianity (kurtosis) to separate the signals. Be sure to review the proof for Question 5b on page 9 of the PDF.
  4. The MVN conditional distribution formula: $E[Y|X] = \mu_Y + \Sigma_{YX}\Sigma_{XX}^{-1}(X-\mu_X)$. If you do not memorize it, re-deriving it during the exam will be slow.

Question 4: Factor Analysis (FA) vs. PCA

Key topics: model assumptions, the Heywood case, rotation invariance

Focuses on the “Identifiability” and model checking (Heywood cases).

  • (a) The Model: Write down the orthogonal Factor Analysis model equation involving Loadings $L$, Factors $Z$, and Specific Variances $\Psi$. State the assumptions on the covariance of $Z$ and $\epsilon$.
  • (b) Heywood Case: Suppose you fit a 1-factor model and find that for one variable, the estimated loading squared $l_i^2$ is greater than the total variance of that variable (standardized variance = 1). This implies the specific variance $\psi_i = 1 - l_i^2$ is negative. Is this a valid statistical model? Explain why or why not.
  • (c) Rotation Invariance: Prove that the “Total Communality” (the total variance explained by the common factors, $\sum h_i$) is invariant to orthogonal rotation of the loadings matrix $L$. (Hint: Use the Trace property) .

(a) The Model

The orthogonal factor model used in Factor Analysis (FA) is defined as:

$$X = LZ + \epsilon$$

where:

  • $X \in \mathbb{R}^p$ is the vector of observed variables.
  • $L \in \mathbb{R}^{p \times r}$ is the loadings matrix.
  • $Z \in \mathbb{R}^r$ is the vector of latent factors, assumed to satisfy $Z \sim \mathcal{N}(0, I_r)$. This means the factors are mutually uncorrelated and standardized.
  • $\epsilon \in \mathbb{R}^p$ is the vector of specific errors (noise), assumed to satisfy $\epsilon \sim \mathcal{N}(0, \Psi)$, where $\Psi = \text{diag}(\psi_1, ..., \psi_p)$ is a diagonal matrix (the specific variances).
  • Key assumption: $Z$ and $\epsilon$ are independent, i.e., $Cov(Z, \epsilon) = 0$.

This yields the covariance structure of $X$:

$$Cov(X) = \Sigma = LL^\top + \Psi$$

(b) Heywood Case (Boundary Solutions)

Answer: no, this is not a valid statistical model. Explanation: under the assumption of standardized variables, the total variance is 1. According to the model, the variance of the $i$-th variable decomposes as:

$$Var(X_i) = \sum_{j=1}^r l_{ij}^2 + \psi_i = 1$$

If the estimated squared loading $l_i^2$ (in the 1-factor model) is greater than 1, then by $\psi_i = 1 - l_i^2$ the estimated specific variance $\psi_i$ is negative. A variance must be non-negative by definition ($\psi_i \ge 0$). This situation is called a Heywood case and usually indicates model misspecification (e.g., extracting too many factors) or unstable estimates caused by a small sample size.

(c) Rotation Invariance

Problem: show that the total communality $\sum h_i$ is invariant to rotation. Proof:

  1. Definition: the communality $h_i$ is the variance of the $i$-th variable explained by the common factors, i.e., the sum of squares of the $i$-th row of $L$. The total communality is the sum of all $h_i$: $$\text{Total Communality} = \sum_{i=1}^p h_i = \sum_{i=1}^p \sum_{j=1}^r l_{ij}^2 = ||L||_F^2 = tr(LL^\top)$$ (using the Frobenius norm and the properties of the trace).
  2. Rotation: let $Q$ be an orthogonal rotation matrix ($Q^\top Q = I$). The rotated loadings matrix is $L^* = LQ$.
  3. Total communality after rotation: $$\text{Total Comm}^* = tr(L^* (L^*)^\top) = tr((LQ)(LQ)^\top)$$ $$= tr(L Q Q^\top L^\top)$$ Since $Q$ is orthogonal, $Q Q^\top = I$: $$= tr(L I L^\top) = tr(LL^\top)$$
  4. Conclusion: $\text{Total Comm}^* = \text{Total Comm}$, so the total communality is invariant to rotation (see the numerical sketch below).
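A minimal numerical sketch of this invariance, using an arbitrary (hypothetical) loadings matrix and a random orthogonal $Q$; only numpy is assumed:

```python
import numpy as np

rng = np.random.default_rng(0)
L = rng.normal(size=(6, 2))                    # p = 6 variables, r = 2 factors (made up)
Q, _ = np.linalg.qr(rng.normal(size=(2, 2)))   # a random 2x2 orthogonal rotation

L_star = L @ Q                                 # rotated loadings
total_comm = np.trace(L @ L.T)                 # sum of communalities = tr(L L^T)
total_comm_rot = np.trace(L_star @ L_star.T)
print(np.isclose(total_comm, total_comm_rot))  # True: invariant under rotation
```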

Question 5: Independent Component Analysis (ICA)

ICA is distinct because of the Gaussianity constraint. This question targets the “Kurtosis” maximization proof.

  • (a) Kurtosis & Scale Invariance: Let $X$ be a random variable and $w$ be a scalar. Prove that the Excess Kurtosis is scale-invariant, i.e., $\mathcal{K}(wX) = \mathcal{K}(X)$.
  • (b) Maximizing Non-Gaussianity: We define the ICA objective as maximizing $| \mathcal{K}(y) |$ where $y = w_1 z_1 + w_2 z_2$ (a mixture of independent sources). Under the whitening constraint $w_1^2 + w_2^2 = 1$, show that the maximum occurs only at the boundaries (e.g., $w=(1,0)$). Explain what this implies physically about recovering the original sources.
  • (c) Why not PCA?: Consider a dataset where the variables are uncorrelated but dependent (e.g., uniformly distributed on a diamond shape). Explain why PCA cannot separate these signals (Hint: What is the rotation matrix for uncorrelated data?), whereas ICA can.

Question 5: Independent Component Analysis (ICA)

Key topics: properties of kurtosis, the ICA optimization objective

(a) Kurtosis Scale Invariance

Problem: prove that $\mathcal{K}(wX) = \mathcal{K}(X)$. Proof: excess kurtosis is defined as $\mathcal{K}(X) = \frac{E[(X-\mu)^4]}{(\sigma^2)^2} - 3$. Let $Y = wX$.

  1. Mean: $\mu_Y = w\mu_X$.
  2. Centered variable: $Y - \mu_Y = w(X - \mu_X)$.
  3. Variance: $\sigma_Y^2 = Var(wX) = w^2 \sigma_X^2$.
  4. Substitute into the definition: $$\mathcal{K}(wX) = \frac{E[(w(X-\mu_X))^4]}{(w^2 \sigma_X^2)^2} - 3$$ $$= \frac{w^4 E[(X-\mu_X)^4]}{w^4 (\sigma_X^2)^2} - 3$$ $$= \frac{E[(X-\mu_X)^4]}{(\sigma_X^2)^2} - 3 = \mathcal{K}(X)$$ Conclusion: the scalar $w$ cancels between numerator and denominator (as long as $w \neq 0$), so kurtosis is scale invariant (see the quick check below).
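A quick empirical check of the scale invariance, on a hypothetical exponential sample (scipy's `kurtosis` returns excess kurtosis):

```python
import numpy as np
from scipy.stats import kurtosis     # Fisher definition: excess kurtosis

rng = np.random.default_rng(1)
x = rng.exponential(size=100_000)    # any non-Gaussian sample will do
w = 5.7                              # arbitrary nonzero scale factor
print(kurtosis(x), kurtosis(w * x))  # the two values agree (up to floating point)
```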

(b) Maximizing Non-Gaussianity

Problem: under the constraint $w_1^2 + w_2^2 = 1$, show that $|\mathcal{K}(w_1 z_1 + w_2 z_2)|$ is maximized on the boundary. Proof: by the linear-combination property from the PDF (assuming the $z_i$ are independent and standardized):

$$\mathcal{K}(w_1 z_1 + w_2 z_2) = w_1^4 \mathcal{K}(z_1) + w_2^4 \mathcal{K}(z_2)$$

We need to maximize the objective $J(w) = |w_1^4 \mathcal{K}(z_1) + w_2^4 \mathcal{K}(z_2)|$. Using the triangle inequality and the constraint $w_1^2 + w_2^2 = 1$ (which implies $w_i^4 \le w_i^2$):

$$|\mathcal{K}(y)| \le w_1^4 |\mathcal{K}(z_1)| + w_2^4 |\mathcal{K}(z_2)| \le (w_1^4 + w_2^4) \max(|\mathcal{K}(z_1)|, |\mathcal{K}(z_2)|)$$

Since $w_1^4 + w_2^4 \le (w_1^2 + w_2^2)^2 = 1$, this sum attains its maximum value 1 only when one $w_i^2 = 1$ and the other is 0. The maximum can therefore only occur at boundary points such as $w=(1, 0)$ or $w=(0, 1)$.

Physical interpretation: the solution $w=(1, 0)$ means $y = 1 \cdot z_1 + 0 \cdot z_2 = z_1$. In other words, the “most non-Gaussian” direction is exactly the direction of one of the original, independent source signals. Non-Gaussianity is maximized only when a source is fully separated; any genuine mixture has a smaller absolute kurtosis (it is closer to Gaussian, the CLT effect). A small simulation illustrating this follows.
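A minimal simulation sketch of this boundary behaviour, assuming two independent unit-variance Laplace sources (chosen only because they have nonzero excess kurtosis):

```python
import numpy as np
from scipy.stats import kurtosis

rng = np.random.default_rng(2)
z = rng.laplace(size=(2, 200_000)) / np.sqrt(2)   # two unit-variance Laplace sources
for w1 in [1.0, 0.9, 1 / np.sqrt(2)]:             # boundary -> progressively more mixed
    w2 = np.sqrt(1 - w1**2)
    y = w1 * z[0] + w2 * z[1]
    print(round(w1, 3), abs(kurtosis(y)))         # |kurtosis| shrinks as mixing increases
```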

(c) Why not PCA?

Answer: PCA relies on diagonalizing the covariance matrix to remove correlation (decorrelation).

  • If the data are already uncorrelated ($Cov = 0$) but statistically dependent, e.g., the points $(0,1), (0,-1), (1,0), (-1,0)$ distributed uniformly on a diamond, as mentioned in the PDF.
  • For such data the covariance matrix is already proportional to the identity (diagonal). The $X^\top X$ that PCA sees carries no preferred direction, so PCA cannot single out any particular rotation; to PCA, every orthogonal rotation is equally valid.
  • ICA's advantage: ICA uses not only second moments (covariance) but also higher-order moments (the fourth-moment kurtosis). Even when the covariance matrix is diagonal, ICA can use the kurtosis of the projections to pick out the directions along which the components are independent and thereby recover the source structure.

Question 6: Multivariate Hypothesis Testing & Conditional Distributions

The “Calculation” heavy question involving partitioned matrices.

  • (a) Conditional Distribution: Let $X \sim N_p(\mu, \Sigma)$ be partitioned into $X_A$ and $X_B$. Write down the formula for the conditional mean $E[X_A | X_B = x_B]$ and conditional variance $Var(X_A | X_B = x_B)$ using Schur complements.
  • (b) Independence vs. Correlation: In the context of Multivariate Normal Distribution (MVN), prove or explain why zero covariance (uncorrelatedness) implies statistical independence. Does this hold for non-Gaussian distributions?
  • (c) Two-Sample $T^2$ Test: You have two samples with sizes $n$ and $m$. Write down the expression for the pooled covariance matrix $S_{pooled}$. Then, state the Hotelling’s $T^2$ statistic for testing $H_0: \mu_x = \mu_y$ and its distribution under the null.

Question 6: Multivariate Hypothesis Testing

Key topics: the MVN conditional distribution formula, the $T^2$ statistic

(a) Conditional Distribution

Partition $X$ into two parts $X_A$ and $X_B$, i.e., $X = \begin{pmatrix} X_A \\ X_B \end{pmatrix} \sim \mathcal{N}_p \left( \begin{pmatrix} \mu_A \\ \mu_B \end{pmatrix}, \begin{pmatrix} \Sigma_{AA} & \Sigma_{AB} \\ \Sigma_{BA} & \Sigma_{BB} \end{pmatrix} \right)$.

Given $X_B = x_B$, the conditional distribution of $X_A$ is again multivariate normal with parameters:

  • Conditional mean: $$E[X_A | X_B = x_B] = \mu_A + \Sigma_{AB}\Sigma_{BB}^{-1}(x_B - \mu_B)$$ This is exactly the fitted value from regressing $X_A$ on $X_B$.
  • Conditional variance: $$Var(X_A | X_B = x_B) = \Sigma_{AA} - \Sigma_{AB}\Sigma_{BB}^{-1}\Sigma_{BA}$$ This is the Schur complement. Note that the conditional variance is a constant matrix; it does not depend on the particular value of $x_B$. (A numerical sketch follows.)
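A minimal numerical sketch of these formulas, using a hypothetical 3-dimensional MVN with $X_A = X_1$ and $X_B = (X_2, X_3)$:

```python
import numpy as np

mu = np.array([1.0, 2.0, 3.0])
Sigma = np.array([[4.0, 1.0, 0.5],
                  [1.0, 3.0, 0.8],
                  [0.5, 0.8, 2.0]])
A, B = [0], [1, 2]                              # X_A = X1, X_B = (X2, X3)
S_AA, S_AB = Sigma[np.ix_(A, A)], Sigma[np.ix_(A, B)]
S_BA, S_BB = Sigma[np.ix_(B, A)], Sigma[np.ix_(B, B)]

x_B = np.array([2.5, 2.0])                      # an observed value of X_B
cond_mean = mu[A] + S_AB @ np.linalg.solve(S_BB, x_B - mu[B])
cond_var  = S_AA - S_AB @ np.linalg.solve(S_BB, S_BA)   # Schur complement
print(cond_mean, cond_var)
```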

(b) Independence vs. Correlation

Proof/explanation: for the multivariate normal distribution (MVN), $\text{Uncorrelated} \iff \text{Independent}$.

  • If $X_A$ and $X_B$ are uncorrelated, then $\Sigma_{AB} = 0$.
  • The quadratic form $(x-\mu)^\top \Sigma^{-1} (x-\mu)$ in the joint density then splits into two pieces: since $\Sigma$ is block diagonal ($\Sigma_{AB}=0$), $\Sigma^{-1}$ is block diagonal as well.
  • Hence the joint density factors into the product of the marginals, $f(x_A, x_B) = f(x_A)f(x_B)$, which is exactly independence.

Non-Gaussian case: the property does not hold. Two non-Gaussian variables can be uncorrelated ($Cov=0$) yet still dependent, e.g., the diamond-distribution example mentioned in Question 5(c).

(c) Two-Sample $T^2$ Test

Pooled covariance: suppose the two samples have sizes $n$ and $m$ with sample covariance matrices $S_x$ and $S_y$.

$$S_{pooled} = \frac{(n-1)S_x + (m-1)S_y}{n + m - 2}$$

(Note: the PDF writes $(n+m-2)S_{pooled} = nS_x + mS_y$, which is based on the biased definition of the sample covariance with denominator $n$. In the exam, if you use the unbiased estimators (denominator $n-1$), use the formula above; if you use the PDF's definition, copy the PDF exactly.) To be safe, the PDF's version is $S_{pooled} = \frac{nS_x + mS_y}{n+m-2}$ (where $S_x, S_y$ are defined with $1/n$).

Hotelling’s $T^2$ Statistic:

$$T^2 = \frac{nm}{n+m} (\bar{x} - \bar{y})^\top S_{pooled}^{-1} (\bar{x} - \bar{y})$$

The coefficient $\frac{nm}{n+m}$ comes from $\frac{1}{1/n + 1/m}$.

Distribution: under the null hypothesis $H_0: \mu_x = \mu_y$, the statistic follows

$$T^2 \sim T^2(p, n+m-2)$$

i.e., the Hotelling $T^2$ distribution with degrees of freedom $p$ and $n+m-2$. A worked numerical sketch follows.
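A minimal sketch of the two-sample computation on simulated (hypothetical) data, using the unbiased $n-1$ convention for $S_x$ and $S_y$:

```python
import numpy as np

rng = np.random.default_rng(3)
n, m, p = 30, 40, 3
X = rng.normal(size=(n, p))                     # sample 1
Y = rng.normal(size=(m, p))                     # sample 2 (same mean under H0)

Sx, Sy = np.cov(X, rowvar=False), np.cov(Y, rowvar=False)   # n-1 denominators
S_pooled = ((n - 1) * Sx + (m - 1) * Sy) / (n + m - 2)
diff = X.mean(axis=0) - Y.mean(axis=0)
T2 = (n * m) / (n + m) * diff @ np.linalg.solve(S_pooled, diff)
print(T2)                                       # compare with the T^2(p, n+m-2) distribution
```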


Question 1: Generalized Least Squares & Ridge Regression (Linear Models)

This question tests your understanding of estimator properties when assumptions are violated or modified.

  • (a) GLS Transformation: Consider the linear model $y = X\beta + \epsilon$ where $Var(\epsilon) = \sigma^2 \Psi$ and $\Psi$ is known but not Identity. Show how to find a matrix $L$ such that the transformed model satisfies standard OLS assumptions. Explicitly state the relationship between $L$ and $\Psi$.
  • (b) Ridge Bias: For the Ridge Regression estimator $\hat{\beta}_R = (X^\top X + \lambda I)^{-1}X^\top y$, prove that it is a biased estimator of $\beta$. Derive the specific expression for the bias $E[\hat{\beta}_R] - \beta$.
  • (c) Hat Matrix Trace: For a standard OLS model, the “Hat Matrix” is $H = X(X^\top X)^{-1}X^\top$. Prove that the trace of the Hat Matrix equals the number of predictors $p$. Explain what the diagonal elements $H_{ii}$ represent in terms of “leverage”.

Key topics: the GLS transformation matrix, the ridge bias derivation, the trace of the hat matrix

(a) GLS Transformation

Problem: find a matrix $L$ such that the transformed model $Ly = LX\beta + L\epsilon$ satisfies the standard OLS assumptions (white noise). Solution: we are given $\epsilon \sim \mathcal{N}(0, \sigma^2 \Psi)$, where $\Psi$ is a known positive definite matrix. For the new error term $\epsilon^* = L\epsilon$ to satisfy the OLS assumptions (covariance $\sigma^2 I$), we need:

$$Var(L\epsilon) = L Var(\epsilon) L^\top = L (\sigma^2 \Psi) L^\top = \sigma^2 (L \Psi L^\top)$$

We want $L \Psi L^\top = I$, which is equivalent to $L^\top L = \Psi^{-1}$. We can therefore take $L$ to be a Cholesky factor of $\Psi^{-1}$, or the inverse square root $\Psi^{-1/2}$ obtained from the eigendecomposition of $\Psi$. Concretely, with $\Psi^{-1} = L^\top L$ (Cholesky) or $L = \Psi^{-1/2}$, the transformed model

$$y^* = X^*\beta + \epsilon^*$$

where $y^* = Ly$ and $X^* = LX$, has errors satisfying $Var(\epsilon^*) = \sigma^2 I$ and can be estimated by OLS (see the sketch below).
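A minimal sketch of the transformation, assuming a known diagonal $\Psi$ (purely illustrative) and using $L = \Psi^{-1/2}$ built from the eigendecomposition:

```python
import numpy as np

rng = np.random.default_rng(4)
n, p = 200, 2
X = rng.normal(size=(n, p))
Psi = np.diag(rng.uniform(0.5, 3.0, size=n))      # known positive definite Psi
beta_true = np.array([1.0, -2.0])
y = X @ beta_true + rng.multivariate_normal(np.zeros(n), Psi)

w, V = np.linalg.eigh(Psi)
L = V @ np.diag(w ** -0.5) @ V.T                  # Psi^{-1/2}, so L Psi L^T = I
print(np.allclose(L @ Psi @ L.T, np.eye(n)))      # True: transformed noise is white

beta_gls, *_ = np.linalg.lstsq(L @ X, L @ y, rcond=None)  # OLS on the transformed model
print(beta_gls)                                   # close to beta_true
```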

(b) Ridge Estimator Bias

Problem: derive the expectation $E[\hat{\beta}_R]$ and show that it is biased. Solution: the ridge regression estimator is defined as

$$\hat{\beta}_R = (X^\top X + \lambda I)^{-1} X^\top y$$

Substituting the true model $y = X\beta + \epsilon$:

$$\hat{\beta}_R = (X^\top X + \lambda I)^{-1} X^\top (X\beta + \epsilon)$$

$$= (X^\top X + \lambda I)^{-1} (X^\top X)\beta + (X^\top X + \lambda I)^{-1} X^\top \epsilon$$

Taking expectations on both sides and using $E[\epsilon] = 0$:

$$E[\hat{\beta}_R] = (X^\top X + \lambda I)^{-1} (X^\top X)\beta$$

Note that if $\lambda = 0$ the leading factor reduces to the identity and we recover $\beta$ (unbiased). But for $\lambda > 0$, $(X^\top X + \lambda I)^{-1} (X^\top X) \neq I$, so $E[\hat{\beta}_R] \neq \beta$ and the ridge estimator is biased. The bias term is:

$$\text{Bias}(\hat{\beta}_R) = E[\hat{\beta}_R] - \beta = [(X^\top X + \lambda I)^{-1} (X^\top X) - I] \beta$$

(c) Trace of the Hat Matrix

Problem: show that $tr(H) = p$ and explain “leverage”. Solution: the hat matrix is defined as $H = X(X^\top X)^{-1}X^\top$. Using the cyclic property of the trace ($tr(ABC) = tr(CAB)$):

$$tr(H) = tr(X(X^\top X)^{-1}X^\top) = tr(X^\top X (X^\top X)^{-1})$$

Since $(X^\top X)(X^\top X)^{-1} = I_p$ (the $p \times p$ identity):

$$tr(H) = tr(I_p) = p$$

Interpretation: the diagonal element $H_{ii}$ is called the leverage score of observation $i$; it measures how strongly the $i$-th observation influences its own fitted value $\hat{y}_i$. Since $\sum_i H_{ii} = tr(H) = p$, the average leverage is $p/n$. A quick numerical check follows.
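A quick numpy check of the trace and leverage facts on a hypothetical design matrix:

```python
import numpy as np

rng = np.random.default_rng(5)
n, p = 20, 4
X = rng.normal(size=(n, p))
H = X @ np.linalg.solve(X.T @ X, X.T)   # hat matrix X (X^T X)^{-1} X^T
print(np.trace(H))                      # equals p = 4 (up to rounding)
print(np.diag(H).sum() / n)             # average leverage = p / n
```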


Question 2: Kernel PCA & Centering

Key topics: centering in feature space, the kernel trick, linear-kernel equivalence

Question 2: Kernel PCA & Centering (Feature Space)

This focuses on the “trickiest” part of Kernel PCA: data centering in the high-dimensional space.

  • (a) The Centering Problem: In Kernel PCA, we cannot explicitly compute the mean in feature space to center the data. We are given a kernel matrix $K_{ij} = \mathcal{K}(x_i, x_j)$. Let $C = I_n - \frac{1}{n}1_n 1_n^\top$ be the centering matrix. Prove (or explain the row/column operations) that the centered kernel matrix $\tilde{K}$ is computed as $\tilde{K} = CKC$.
  • (b) Eigenvalues: Does the centering operation $CKC$ change the eigenvalues compared to the uncentered $K$? Why is this step strictly necessary before performing the eigendecomposition to find Principal Components?.
  • (c) Linear Kernel Equivalence: Explain why performing Kernel PCA with the linear kernel $\mathcal{K}(a, b) = \langle a, b \rangle$ (and properly centering it) is mathematically equivalent to performing standard PCA on the original dataset.

(a) The Centering Matrix

Problem: show that the centered kernel matrix is $\tilde{K} = CKC$, where $C = I_n - \frac{1}{n}1_n 1_n^\top$. Solution: in feature space, let $\Phi$ be the uncentered matrix of feature vectors (one row per observation) and $\bar{\Phi}$ its centered version; we need to compute $\tilde{K} = \bar{\Phi}\bar{\Phi}^\top$. Centering is algebraically a left multiplication by the centering matrix, $\bar{\Phi} = C\Phi$ (just as centering an $n \times p$ data matrix $X$ is $CX$: the column means are removed). In terms of the kernel matrix itself, multiplying $K$ by $C$ on the left centers each column and multiplying on the right centers each row, so

$$\tilde{K} = C K C$$

Expanding:

$$\tilde{K} = (I - \frac{1}{n}11^\top) K (I - \frac{1}{n}11^\top) = K - \frac{1}{n}11^\top K - \frac{1}{n}K11^\top + \frac{1}{n^2}11^\top K 11^\top$$

This operation guarantees that, in the unknown feature space, the centroid of the data points is moved to the origin.

(b) Effect on Eigenvalues

Problem: does centering change the eigenvalues? Why is it necessary? Solution: yes, it changes them. The uncentered kernel matrix $K$ contains information about the distance from the origin to each data point, not just the relative variance structure among the points. Without centering, the first principal component may point toward the data mean (from the origin to the center of the data cloud) rather than along the direction of largest variance. PCA is defined as finding the directions of maximal variance; without subtracting the mean, $\frac{1}{n}X^\top X$ is only the second-moment matrix, not the covariance matrix. We therefore must use $\tilde{K} = CKC$ to ensure that we are decomposing the covariance structure.

(c) Linear Kernel Check

Problem: explain why linear-kernel Kernel PCA is equivalent to standard PCA. Solution: the linear kernel is $\mathcal{K}(x_i, x_j) = \langle x_i, x_j \rangle = x_i^\top x_j$, so the kernel matrix is $K = XX^\top$. After centering, $\tilde{K} = C(XX^\top)C = (CX)(CX)^\top$, and $CX$ is exactly the centered data matrix (call it $\tilde{X}$). Kernel PCA eigendecomposes $\tilde{K} = \tilde{X}\tilde{X}^\top$; standard PCA, in its dual form, eigendecomposes the same matrix, because $\tilde{X}^\top \tilde{X}$ and $\tilde{X}\tilde{X}^\top$ share their nonzero eigenvalues. Hence Kernel PCA with the linear kernel is mathematically equivalent to standard PCA (a small check follows).
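A small check of the centering formula and the linear-kernel equivalence on hypothetical data; the nonzero eigenvalues of the centered kernel match those of $\tilde{X}^\top \tilde{X}$:

```python
import numpy as np

rng = np.random.default_rng(6)
n, p = 8, 3
X = rng.normal(size=(n, p))
C = np.eye(n) - np.ones((n, n)) / n          # centering matrix
K = X @ X.T                                  # linear kernel
K_c = C @ K @ C                              # double-centered kernel
Xc = C @ X                                   # column-centered data
print(np.allclose(K_c, Xc @ Xc.T))           # True: CKC = (CX)(CX)^T

eig_K = np.sort(np.linalg.eigvalsh(K_c))[::-1][:p]
eig_S = np.sort(np.linalg.eigvalsh(Xc.T @ Xc))[::-1]
print(np.allclose(eig_K, eig_S))             # shared nonzero eigenvalues
```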


Question 3: Canonical Correlation Analysis (CCA) Setup

Key topics: the CCA objective, the whitening transformation, the connection to regression

Question 3: Canonical Correlation Analysis (CCA) Optimization

This tests the fundamental derivation of CCA using SVD, a core concept in the notes.

  • (a) The Objective: We want to find vectors $u, v$ to maximize $Corr(X^\top u, Y^\top v)$. Formulate this as a constrained optimization problem: “Maximize $u^\top S_{XY} v$ subject to…”? Explain why we set the variance constraints to 1.
  • (b) Whitening Step: The notes define a “transformed” or “whitened” cross-covariance matrix $\tilde{S}_{xy} = S_x^{-1/2}S_{xy}S_y^{-1/2}$. Show how the singular values of this specific matrix $\tilde{S}_{xy}$ relate to the canonical correlations.
  • (c) Relation to Regression: If $Y$ is univariate ($q=1$), show that the first canonical direction vector $u$ is proportional to the Ordinary Least Squares (OLS) regression coefficient $\hat{\beta}$.

(a) The Objective

Problem: write down the optimization problem for CCA. Solution: CCA seeks projection vectors $u$ (for $X$) and $v$ (for $Y$) that maximize the correlation between the projected variables. Because correlation is scale invariant, we fix the projected variances at 1 to make the solution unique. Optimization problem:

$$\text{Maximize } u^\top S_{XY} v$$

Subject to constraints:

$$u^\top S_X u = 1$$

$$v^\top S_Y v = 1$$

where $S_X, S_Y$ are the sample covariance matrices and $S_{XY}$ is the cross-covariance matrix.

(b) Whitening & Solution

Problem: how is the solution found using $\tilde{S}_{xy}$? Solution: to solve the constrained problem above we “whiten” via a change of variables. Define the transformed vectors $\tilde{u} = S_X^{1/2}u$ and $\tilde{v} = S_Y^{1/2}v$, so the constraints become Euclidean norm constraints $||\tilde{u}||^2 = 1$ and $||\tilde{v}||^2 = 1$. The objective becomes:

$$u^\top S_{XY} v = (S_X^{-1/2}\tilde{u})^\top S_{XY} (S_Y^{-1/2}\tilde{v}) = \tilde{u}^\top (S_X^{-1/2} S_{XY} S_Y^{-1/2}) \tilde{v}$$

Define the whitened matrix $\tilde{S}_{xy} = S_X^{-1/2} S_{XY} S_Y^{-1/2}$. The optimal $\tilde{u}, \tilde{v}$ are exactly the leading left and right singular vectors of $\tilde{S}_{xy}$, and the largest canonical correlation is its largest singular value (see the sketch below).
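A minimal sketch of this SVD recipe on simulated (hypothetical) data: the top singular value of the whitened cross-covariance equals the sample correlation of the projected variables.

```python
import numpy as np

def inv_sqrt(S):
    # symmetric inverse square root via the spectral decomposition
    w, V = np.linalg.eigh(S)
    return V @ np.diag(w ** -0.5) @ V.T

rng = np.random.default_rng(7)
n = 500
X = rng.normal(size=(n, 3))
Y = 0.5 * X[:, :2] + rng.normal(size=(n, 2))          # Y correlated with X

Xc, Yc = X - X.mean(0), Y - Y.mean(0)
Sx, Sy, Sxy = Xc.T @ Xc / n, Yc.T @ Yc / n, Xc.T @ Yc / n
M = inv_sqrt(Sx) @ Sxy @ inv_sqrt(Sy)                 # whitened cross-covariance
U, d, Vt = np.linalg.svd(M)
u, v = inv_sqrt(Sx) @ U[:, 0], inv_sqrt(Sy) @ Vt[0]   # first canonical directions
print(d[0], np.corrcoef(Xc @ u, Yc @ v)[0, 1])        # the two numbers agree
```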

(c) Relation to Regression

Problem: if $Y$ is univariate ($q=1$), show that $u$ is proportional to the OLS coefficient $\hat{\beta}$. Solution: when $Y$ is a single variable $y$, $S_Y$ is the scalar variance $s_y^2$ and $S_{XY}$ is the vector $s_{xy}$. CCA maximizes $Corr(X^\top u, y)$, which is equivalent to maximizing the coefficient of determination $R^2$, i.e., finding the linear combination of $X$ closest to $y$. That is exactly the goal of multiple linear regression, whose coefficient is $\hat{\beta} = S_X^{-1} s_{xy}$. In CCA we maximize $u^\top s_{xy}$ subject to $u^\top S_X u = 1$; by the Cauchy-Schwarz inequality (or Lagrange multipliers), the optimal $u$ is parallel to $S_X^{-1} s_{xy}$. Hence the CCA direction $u$ is proportional to the OLS regression coefficient $\hat{\beta}$.

Alex, no problem. Since this is an open-book exam and you want material you can print and bring into the exam room, this set of questions focuses entirely on the hard derivations and counter-examples that appeared in the “Note Practice” and “Midterm 1/2” materials.

This Round 3 set avoids the previous round's Kernel/CCA/FA topics and instead attacks matrix calculus, detailed MLE derivations, and the classic ICA counter-example. These are the inconspicuous parts of the notes that are most likely to show up as long-answer questions.


STA437 Final Exam - Round 3 Prediction

Focus: Derivations, Note Practice & Properties


Question 1: Spectral Properties of the “Residual Maker” (Linear Algebra)

[Source: Simulation Pt.1 Q1 & Midterm 1 Review] This question tests the geometric interpretation of Least Squares using SVD.

  • (a) SVD Representation of Hat Matrix: Let $H = X(X^\top X)^{-1}X^\top$ be the Hat matrix. Using the Thin Singular Value Decomposition $X = UDV^\top$ (where $U \in \mathbb{R}^{n \times p}$ is column-orthogonal, $U^\top U = I_p$), show that $H$ simplifies to $UU^\top$.
  • (b) Eigenvalues of the Residual Maker: Let $M = I_n - H$ be the matrix that generates residuals ($e = My$). Determine the eigenvalues of $M$ and their multiplicities. Based on this, explain why $M$ is Positive Semi-Definite (PSD).
  • (c) Trace of H: Show that the trace of the Hat matrix equals the rank of $X$ (assume full rank $p$).


Solution 1: Spectral Properties of the “Residual Maker”

(a) SVD Representation of $H$

  • Definition: $H = X(X^\top X)^{-1}X^\top$.
  • Substitute SVD: Let $X = UDV^\top$ (Thin SVD, $U \in \mathbb{R}^{n \times p}, D \in \mathbb{R}^{p \times p}, V \in \mathbb{R}^{p \times p}$). Note that $V$ is orthogonal ($V^\top V = VV^\top = I_p$) and $U$ is column-orthogonal ($U^\top U = I_p$).
  • Compute inner term: $X^\top X = (UDV^\top)^\top (UDV^\top) = V D^\top U^\top U D V^\top = V D I_p D V^\top = V D^2 V^\top$.
  • Compute Inverse: $(X^\top X)^{-1} = (V D^2 V^\top)^{-1} = (V^\top)^{-1} (D^2)^{-1} V^{-1} = V D^{-2} V^\top$.
  • Substitute back into H: $H = (UDV^\top) (V D^{-2} V^\top) (UDV^\top)^\top$ $H = U D (V^\top V) D^{-2} (V^\top V) D U^\top$ $H = U D I D^{-2} I D U^\top = U (D D^{-2} D) U^\top = U I U^\top = UU^\top$.
  • Result: $H = UU^\top$.

(b) Eigenvalues of $M$

  • Definition: $M = I_n - H = I_n - UU^\top$.
  • Eigenvalues of $H$: Since $H = UU^\top$ is a projection matrix onto a p-dimensional subspace (span of U), it has $p$ eigenvalues equal to 1, and $n-p$ eigenvalues equal to 0.
  • Eigenvalues of $M$: The eigenvalues of $I - H$ are simply $1 - \lambda_i(H)$.
    • For the $p$ eigenvalues where $\lambda(H)=1$: $\lambda(M) = 1 - 1 = 0$.
    • For the $n-p$ eigenvalues where $\lambda(H)=0$: $\lambda(M) = 1 - 0 = 1$.
  • Conclusion: The eigenvalues are 1 (with multiplicity $n-p$) and 0 (with multiplicity $p$).
  • PSD Property: Since all eigenvalues $\lambda_i \ge 0$, the matrix $M$ is Positive Semi-Definite (PSD).

(c) Trace of $H$

  • Using the property $H=UU^\top$: $tr(H) = tr(UU^\top)$.
  • Using the cyclic property of trace ($tr(AB) = tr(BA)$): $tr(H) = tr(U^\top U)$.
  • Since $U$ is column-orthogonal ($U^\top U = I_p$): $tr(H) = tr(I_p) = p$.

Question 2: Matrix Gradient & Optimization (PCA Foundation)

[Source: Simulation Pt.1 Q2(d), Exercise 13/14] This derivation is the mathematical foundation for “Minimum Reconstruction Error” in PCA.

  • (a) Gradient Derivation: Define the objective function $f(A) = ||X - A||_F^2$. Using matrix calculus properties (specifically $\nabla_A tr(A^\top A) = 2A$), derive the gradient $\nabla_A f(A)$.
  • (b) Optimal A: Show that setting the gradient to zero implies $A=X$.
  • (c) Rank Constraint Connection: If we impose a constraint that $rank(A) = r < p$, why can’t we just set $A=X$? Briefly explain how the Eckart-Young theorem modifies the solution derived in (b).


Solution 2: Matrix Gradient & Optimization

(a) Gradient Derivation

  • Objective: $f(A) = ||X - A||_F^2 = tr((X-A)^\top (X-A))$.
  • Expand: $f(A) = tr(X^\top X - X^\top A - A^\top X + A^\top A)$. Using $tr(X^\top A) = tr(A^\top X)$, we get: $f(A) = tr(X^\top X) - 2tr(A^\top X) + tr(A^\top A)$.
  • Differentiate wrt A:
    • $\nabla_A (constant) = 0$.
    • $\nabla_A (-2tr(A^\top X)) = -2X$.
    • $\nabla_A tr(A^\top A) = 2A$.
  • Result: $\nabla_A f(A) = 2A - 2X$.

(b) Optimal A

  • Set gradient to zero: $2A - 2X = 0 \implies 2A = 2X \implies A = X$.
  • This confirms that without constraints, the best approximation of a matrix is the matrix itself.

(c) Eckart-Young / Rank Constraint

  • If we require $rank(A) = r < p$, we cannot simply set $A=X$ (which has rank $p$).
  • The Eckart-Young theorem states that under the spectral norm or Frobenius norm, the best rank-$r$ approximation is given by the Truncated SVD: $A_{opt} = \sum_{i=1}^r d_i u_i v_i^\top = U_r D_r V_r^\top$.
  • This connects the optimization problem to PCA: PCA finds the subspace that minimizes this reconstruction error.

Question 3: MLE of Multivariate Normal Covariance

[Source: Note 8 FA / Midterm Review / Source 503-535] A classic “Knowledge Part” derivation that often appears to test understanding of the Trace Trick.

  • (a) The Trace Trick: The log-likelihood for MVN includes the term $\sum_{i=1}^n (x_i - \mu)^\top \Sigma^{-1} (x_i - \mu)$. Using the cyclic property of the trace ($tr(ABC)=tr(CAB)$), prove that this sum can be rewritten as $n \cdot tr(\Sigma^{-1} S)$, where $S$ is the MLE sample covariance (using $1/n$).
  • (b) Optimizing $\Sigma$: Considering the simplified log-likelihood $l(\Sigma) \propto -\frac{n}{2}\log|\Sigma| - \frac{n}{2}tr(\Sigma^{-1}S)$, differentiate with respect to $\Sigma^{-1}$ (or $\Sigma$) to prove that the Maximum Likelihood Estimator is $\hat{\Sigma} = S$.

Solution 3: MLE of Multivariate Normal Covariance

(a) The Trace Trick

  • Log-likelihood term: $\text{Sum} = \sum_{i=1}^n (x_i - \mu)^\top \Sigma^{-1} (x_i - \mu)$.
  • Note that $(x_i - \mu)^\top \Sigma^{-1} (x_i - \mu)$ is a scalar, so it equals its own trace.
  • $\text{Sum} = \sum_{i=1}^n tr((x_i - \mu)^\top \Sigma^{-1} (x_i - \mu))$.
  • Cyclic Property ($tr(ABC)=tr(CAB)$): $= \sum_{i=1}^n tr(\Sigma^{-1} (x_i - \mu)(x_i - \mu)^\top)$.
  • Linearity of Trace: Move summation inside. $= tr(\Sigma^{-1} \sum_{i=1}^n (x_i - \mu)(x_i - \mu)^\top)$.
  • Substitute Sample Covariance: Since $nS = \sum (x_i - \mu)(x_i - \mu)^\top$ (assuming $\mu = \bar{x}$ for MLE): $= tr(\Sigma^{-1} (nS)) = n \cdot tr(\Sigma^{-1} S)$.

(b) Optimizing $\Sigma$

  • Simplified Log-Likelihood: $l(\Sigma) = -\frac{n}{2}\log|\Sigma| - \frac{n}{2}tr(\Sigma^{-1}S)$.
  • Let $\Lambda = \Sigma^{-1}$ (Precision Matrix) for easier differentiation. $l(\Lambda) = \frac{n}{2}\log|\Lambda| - \frac{n}{2}tr(\Lambda S)$. (Since $\log|\Sigma| = -\log|\Sigma^{-1}|$).
  • Differentiate wrt $\Lambda$:
    • $\frac{\partial}{\partial \Lambda} \log|\Lambda| = \Lambda^{-1} = \Sigma$.
    • $\frac{\partial}{\partial \Lambda} tr(\Lambda S) = S^\top = S$ (S is symmetric).
  • First Order Condition: $\nabla_\Lambda l = \frac{n}{2}\Sigma - \frac{n}{2}S = 0$.
  • Result: $\Sigma = S$. Thus, the MLE $\hat{\Sigma} = S_{MLE} = \frac{1}{n}\sum (x_i - \bar{x})(x_i - \bar{x})^\top$.
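A numerical check of the trace trick from (a), with an arbitrary (hypothetical) SPD matrix standing in for $\Sigma$:

```python
import numpy as np

rng = np.random.default_rng(8)
n, p = 200, 3
X = rng.normal(size=(n, p))
A = rng.normal(size=(p, p))
Sigma = A @ A.T + np.eye(p)                      # some SPD "Sigma" for the check

D = X - X.mean(0)                                # centered rows x_i - xbar
S = D.T @ D / n                                  # MLE sample covariance (1/n)
Sigma_inv = np.linalg.inv(Sigma)
lhs = np.einsum('ij,jk,ik->', D, Sigma_inv, D)   # sum_i (x_i-xbar)' Sigma^{-1} (x_i-xbar)
rhs = n * np.trace(Sigma_inv @ S)
print(np.isclose(lhs, rhs))                      # True
```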

Question 4: The “Diamond” Distribution (ICA vs PCA)

[Source: Note 9 ICA / Simulation Pt.3 Q2] This is the specific counter-example from the Note Practice used to justify ICA.

  • (a) Covariance Calculation: Consider a random vector $S = (S_1, S_2)^\top$ uniformly distributed on 4 points: $(1,0), (-1,0), (0,1), (0,-1)$. Calculate the covariance matrix $Cov(S)$ and show that $S_1$ and $S_2$ are uncorrelated.
  • (b) Independence Check: Prove that despite being uncorrelated, $S_1$ and $S_2$ are not independent (Check $P(S_1=1, S_2=1)$ vs marginal probabilities).
  • (c) PCA vs ICA: Explain why PCA cannot separate these signals (hint: look at the covariance matrix from part a), whereas ICA can.


Solution 4: The “Diamond” Distribution (ICA vs PCA)

(a) Covariance Calculation

  • Data: $P(1,0)=P(-1,0)=P(0,1)=P(0,-1) = 1/4$.
  • Means: $E[S_1] = 1(1/4) + (-1)(1/4) + 0 + 0 = 0$. Similarly $E[S_2] = 0$.
  • Covariance: $Cov(S_1, S_2) = E[S_1 S_2] - E[S_1]E[S_2]$. $E[S_1 S_2] = (1)(0)\frac{1}{4} + (-1)(0)\frac{1}{4} + (0)(1)\frac{1}{4} + (0)(-1)\frac{1}{4} = 0$.
  • Result: $Cov(S_1, S_2) = 0$, so the variables are uncorrelated. Since $E[S_1^2] = E[S_2^2] = 1/2$, the covariance matrix is $\frac{1}{2}I$ (proportional to the identity).

(b) Independence Check

  • Check joint probability at $(1, 1)$. The point $(1,1)$ does not exist in the dataset. So $P(S_1=1, S_2=1) = 0$.
  • Check marginal probabilities: $P(S_1=1) = P((1,0)) = 1/4$. $P(S_2=1) = P((0,1)) = 1/4$.
  • Test: $P(S_1=1, S_2=1) = 0 \neq P(S_1=1)P(S_2=1) = 1/16$.
  • Conclusion: They are dependent (Not independent).

(c) PCA vs ICA

  • PCA: PCA looks at the covariance matrix $S$. Here $S = \frac{1}{2}I$. Since the eigenvalues are equal ($\lambda_1 = \lambda_2 = \frac{1}{2}$), PCA cannot find a unique “principal direction”. Any orthogonal rotation is equally valid for PCA. It cannot distinguish the “corners” of the diamond.
  • ICA: ICA maximizes Non-Gaussianity (|Kurtosis|). The projection onto an axis (e.g., $w=(1,0)$) takes values in $\{-1, 0, 1\}$ and has excess kurtosis $-1$, while the projection onto a 45-degree line ($w=(1/\sqrt{2}, 1/\sqrt{2})$) takes values $\pm 1/\sqrt{2}$ and has excess kurtosis $-2$. The kurtosis is therefore extremized along the diagonal directions, which are exactly the directions in which the components become independent ($(S_1+S_2)/\sqrt{2}$ and $(S_1-S_2)/\sqrt{2}$ are independent $\pm 1/\sqrt{2}$ variables). By following the kurtosis, ICA recovers this structure, which PCA cannot see at all (a numerical check follows).
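A small check of these numbers, assuming the four equally likely “diamond” points:

```python
import numpy as np

pts = np.array([[1, 0], [-1, 0], [0, 1], [0, -1]], dtype=float)  # each w.p. 1/4

def excess_kurtosis(y):
    # population excess kurtosis for equally likely values y
    y = y - y.mean()
    return np.mean(y**4) / np.mean(y**2) ** 2 - 3

print(np.cov(pts.T, bias=True))                       # 0.5 * I: uncorrelated
w_axis = np.array([1.0, 0.0])
w_diag = np.array([1.0, 1.0]) / np.sqrt(2)
print(excess_kurtosis(pts @ w_axis))                  # -1 along a coordinate axis
print(excess_kurtosis(pts @ w_diag))                  # -2 along the 45-degree diagonal
```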

Question 5: Precision Matrix & Conditional Independence

[Source: Knowledge Part / Block Matrix Inverse] Tests the relationship between the inverse covariance matrix and graphical models.

  • (a) Block Inverse Formula: For a partitioned covariance matrix $\Sigma = \begin{pmatrix} \Sigma_{11} & \Sigma_{12} \\ \Sigma_{21} & \Sigma_{22} \end{pmatrix}$, use the Schur Complement to write down the expression for the top-left block of the precision matrix $(\Sigma^{-1})_{11}$.
  • (b) Conditional Independence: If the $(i,j)$-th entry of the precision matrix is zero ($\omega_{ij} = (\Sigma^{-1})_{ij} = 0$), prove (or explain using the PDF exponent) that variables $X_i$ and $X_j$ are conditionally independent given all other variables.


Solution 5: Precision Matrix & Conditional Independence

(a) Block Inverse Formula

  • Given $\Sigma = \begin{pmatrix} \Sigma_{11} & \Sigma_{12} \\ \Sigma_{21} & \Sigma_{22} \end{pmatrix}$.
  • The inverse $\Omega = \Sigma^{-1}$ has its top-left block given by the inverse of the Schur Complement: $(\Sigma^{-1})_{11} = (\Sigma_{11} - \Sigma_{12}\Sigma_{22}^{-1}\Sigma_{21})^{-1}$.
  • (Alternatively denoted as $(\Sigma / \Sigma_{22})^{-1}$).

(b) Conditional Independence

  • The joint PDF of MVN is proportional to $\exp(-\frac{1}{2} x^\top \Sigma^{-1} x)$.
  • Let $\Omega = \Sigma^{-1}$. The term in the exponent is $-\frac{1}{2} \sum_i \sum_j x_i x_j \omega_{ij}$.
  • The interaction term between $x_i$ and $x_j$ is determined solely by $\omega_{ij}$.
  • If $\omega_{ij} = 0$, there is no $x_i x_j$ term in the exponent.
  • This means the conditional density $f(x_i, x_j | \text{others})$ factors into $g(x_i)h(x_j)$, which implies conditional independence.

Question 6: Derivation of Two-Sample Hotelling’s $T^2$

[Source: Simulation Pt.1 MVN / Source 560-581] The most complex derivation in the “MVN Testing” section.

  • (a) The Z Vector: We want to test $H_0: \mu_x = \mu_y$. Construct a standardized vector $Z$ involving $(\bar{x} - \bar{y})$ and $\Sigma$ that follows $N_p(0, I_p)$.
  • (b) The Wishart Matrix: Write down the definition of the pooled covariance $S_{pooled}$ (using the definition $(n+m-2)S_{pooled} = nS_x + mS_y$) and the corresponding Wishart matrix $M$.
  • (c) Constructing $T^2$: Combine $Z$ and $M$ into the definition $T^2 = (n+m-2)Z^\top M^{-1} Z$. Substitute back to derive the final formula: $T^2 = \frac{nm}{n+m}(\bar{x} - \bar{y})^\top S_{pooled}^{-1} (\bar{x} - \bar{y})$.


Solution 6: Derivation of Two-Sample Hotelling’s $T^2$

(a) The Z Vector

  • Goal: Standardize the difference of means $(\bar{x} - \bar{y})$ under $H_0: \mu_x = \mu_y$.
  • Variance of Difference: $Var(\bar{x} - \bar{y}) = Var(\bar{x}) + Var(\bar{y}) = \frac{\Sigma}{n} + \frac{\Sigma}{m} = (\frac{1}{n} + \frac{1}{m})\Sigma = (\frac{n+m}{nm})\Sigma$.
  • Construct Z: We standardize by the inverse square root of both the scalar factor and the covariance matrix. $Z = \sqrt{\frac{nm}{n+m}} \Sigma^{-1/2} (\bar{x} - \bar{y})$.
  • Distribution: $Z \sim N_p(0, I_p)$.

(b) The Wishart Matrix

  • Pooled Covariance: $(n+m-2)S_{pooled} = nS_x + mS_y = \sum (x_i-\bar{x})(x_i-\bar{x})^\top + \sum (y_i-\bar{y})(y_i-\bar{y})^\top$.
  • Wishart Matrix M: We define $W = (n+m-2)S_{pooled}$. We need the “Whitened” Wishart matrix $M$ corresponding to $Z$’s scale. $M = \Sigma^{-1/2} W \Sigma^{-1/2}$. $M \sim W_p(I_p, n+m-2)$.

(c) Constructing $T^2$

  • Definition: Hotelling’s $T^2$ is built from a standard normal vector and a Wishart matrix (analogous to $t^2 = z^2 / (s^2/\sigma^2)$ in one dimension). $T^2 = (df) Z^\top M^{-1} Z = (n+m-2) Z^\top M^{-1} Z$.
  • Substitution: Substitute $Z$ and $M$ back: $T^2 = (n+m-2) \left[ \sqrt{\frac{nm}{n+m}} (\bar{x}-\bar{y})^\top \Sigma^{-1/2} \right] \left[ \Sigma^{-1/2} W \Sigma^{-1/2} \right]^{-1} \left[ \sqrt{\frac{nm}{n+m}} \Sigma^{-1/2} (\bar{x}-\bar{y}) \right]$.
  • Simplify: The $\sqrt{\dots}$ terms square to $\frac{nm}{n+m}$. The inverse term: $[\Sigma^{-1/2} W \Sigma^{-1/2}]^{-1} = \Sigma^{1/2} W^{-1} \Sigma^{1/2}$. The $\Sigma$ terms cancel: $\Sigma^{-1/2} \Sigma^{1/2} = I$. We are left with: $T^2 = (n+m-2) \frac{nm}{n+m} (\bar{x}-\bar{y})^\top W^{-1} (\bar{x}-\bar{y})$.
  • Final Form: Since $W^{-1} = ((n+m-2)S_{pooled})^{-1} = \frac{1}{n+m-2} S_{pooled}^{-1}$, the $(n+m-2)$ cancels out. $T^2 = \frac{nm}{n+m} (\bar{x} - \bar{y})^\top S_{pooled}^{-1} (\bar{x} - \bar{y})$.

Factor Analysis (FA): New Predictions

Question 1: Factor Scoring - Bartlett vs. Regression Method

[Source: Note 8, Source 941-949] This question tests your ability to derive the estimators for the latent factors $Z$, a crucial step after fitting the model.

  • (a) The Weighted Regression (Bartlett’s) Approach: Assume the model $X \approx LZ + \epsilon$, where $\epsilon \sim \mathcal{N}(0, \Psi)$. We want to estimate $Z$ given an observation $x$. Set up the weighted least squares objective function: $$J(z) = (x - Lz)^\top \Psi^{-1} (x - Lz)$$ Derive the estimator $\hat{z}_{Bartlett}$ by minimizing this function with respect to $z$.
  • (b) The PCA Approach (Unweighted): If we fit FA via PCA (assuming $\Psi = \sigma^2 I$ or ignoring specific variances), the objective simplifies to unweighted least squares. Write down the objective and the resulting estimator $\hat{z}_{PCA}$.
  • (c) Comparison: Explain why the Bartlett method is generally preferred for Factor Analysis compared to the PCA method when $\Psi$ is not a scaled identity matrix.

FA Question 1 Solution: Factor Scoring

(a) Bartlett’s Method (Weighted Regression)

  • Objective: Minimize the squared error weighted by the inverse of the specific variances (noise). $$J(z) = (x - Lz)^\top \Psi^{-1} (x - Lz)$$
  • Gradient: Differentiate w.r.t $z$: $$\nabla_z J(z) = -2 L^\top \Psi^{-1} (x - Lz)$$
  • Solve: Set gradient to 0: $$L^\top \Psi^{-1} L z = L^\top \Psi^{-1} x$$ $$\hat{z}_{Bartlett} = (L^\top \Psi^{-1} L)^{-1} L^\top \Psi^{-1} x$$

(b) PCA Method (Unweighted)

  • Objective: Standard least squares (assuming $\Psi = I$ or similar). $$J_{PCA}(z) = ||x - Lz||^2 = (x - Lz)^\top (x - Lz)$$
  • Estimator: From standard OLS results: $$\hat{z}_{PCA} = (L^\top L)^{-1} L^\top x$$

(c) Comparison

  • Bartlett’s method accounts for heteroscedasticity. Since $\Psi = \text{diag}(\psi_1, ..., \psi_p)$, variables with high specific variance (high noise, large $\psi_i$) are given less weight ($\psi_i^{-1}$ is small) in determining the factor score.
  • The PCA method treats all variables as equally reliable, which is suboptimal if the uniquenesses ($\psi_i$) vary significantly across variables.
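A minimal sketch computing both factor-score estimators for a single observation, with hypothetical fitted $L$ and $\Psi$ (all numbers made up):

```python
import numpy as np

rng = np.random.default_rng(9)
p, r = 5, 2
L = rng.normal(size=(p, r))                              # "fitted" loadings
Psi_inv = np.diag(1.0 / rng.uniform(0.2, 2.0, size=p))   # Psi^{-1}, Psi diagonal
x = rng.normal(size=p)                                   # one (centered) observation

z_bartlett = np.linalg.solve(L.T @ Psi_inv @ L, L.T @ Psi_inv @ x)
z_pca      = np.linalg.solve(L.T @ L, L.T @ x)
print(z_bartlett)
print(z_pca)    # differs from Bartlett unless Psi is proportional to the identity
```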

Question 2: Non-Existence of FA Solution (Counter-Example)

[Source: Note 8, Source 120-128] The notes explicitly provide an example where a covariance matrix cannot be generated by a 1-factor model. This tests “Model Checking”.

  • (a) System of Equations: Consider a 3-dimensional random vector $X$ with covariance matrix: $$\Sigma = \begin{pmatrix} 1 & 0.9 & 0.7 \\ 0.9 & 1 & 0.4 \\ 0.7 & 0.4 & 1 \end{pmatrix}$$ We wish to fit a 1-factor model ($r=1$) such that $\Sigma = LL^\top + \Psi$. Write down the three equations relating the off-diagonal entries ($\sigma_{12}, \sigma_{23}, \sigma_{13}$) to the loading vector $L = (l_{11}, l_{21}, l_{31})^\top$.
  • (b) Inconsistency Proof: Solve this system for the absolute values of the loadings. Show that the value for $l_{11}$ implies a contradiction or an impossible statistical property (specifically, check if the resulting specific variance $\psi_1 = \sigma_{11} - l_{11}^2$ is valid, or if the correlation structure itself is violated). (Self-Correction based on Note: The note asks to “Find value of $l_{11}$” and “Show that it’s not possible for $\psi_1 = Var(\epsilon_1)$” .)

FA Question 2 Solution: Non-Existence of Solution

(a) System of Equations The 1-factor model implies $\Sigma \approx LL^\top$ for off-diagonal elements (since $\Psi$ is diagonal). Let $L = [l_1, l_2, l_3]^\top$. The equations are:

  1. $l_1 l_2 = 0.9$
  2. $l_1 l_3 = 0.7$
  3. $l_2 l_3 = 0.4$

(b) Inconsistency Proof

  • Multiply the first two equations: $(l_1 l_2)(l_1 l_3) = 0.9 \times 0.7 = 0.63$. $$l_1^2 (l_2 l_3) = 0.63$$
  • Substitute equation 3 ($l_2 l_3 = 0.4$): $$l_1^2 (0.4) = 0.63 \implies l_1^2 = \frac{0.63}{0.4} = 1.575$$
  • Check Variance Constraint: The model states $\Sigma_{11} = l_1^2 + \psi_1 = 1$. Since specific variance $\psi_1$ must be non-negative ($\psi_1 \ge 0$), we require $l_1^2 \le 1$.
  • Contradiction: We found $l_1^2 = 1.575 > 1$. This implies $\psi_1 = 1 - 1.575 = -0.575$, which is impossible for a variance.
  • Conclusion: No valid 1-factor model exists for this covariance matrix.

Independent Component Analysis (ICA): New Predictions

Question 3: Kurtosis of Sums (Derivation)

[Source: Note 9, Source 99-103, 149-151] This is a core property used to justify maximizing kurtosis. The notes list this derivation as “Exercise 10 Q3”.

  • (a) The Formula: Let $y_1$ and $y_2$ be independent random variables with zero mean. Let their variances be $\sigma_1^2, \sigma_2^2$ and their excess kurtoses be $\mathcal{K}(y_1), \mathcal{K}(y_2)$. Derive the formula for the kurtosis of their sum: $$\mathcal{K}(y_1 + y_2) = \frac{\sigma_1^4 \mathcal{K}(y_1) + \sigma_2^4 \mathcal{K}(y_2)}{(\sigma_1^2 + \sigma_2^2)^2}$$ (Hint: Expand $E[(y_1+y_2)^4]$ using independence and the binomial theorem).
  • (b) CLT Implication: If $y_1$ and $y_2$ are identically distributed with variance 1 and kurtosis $\kappa$, what is $\mathcal{K}(y_1 + y_2)$? Does it move closer to 0 (Gaussian)?

ICA Question 3 Solution: Kurtosis of Sums

(a) Derivation

  • Definition: Excess Kurtosis $\mathcal{K}(y) = E[y^4]/(\sigma^2)^2 - 3$.
  • Setup: Let $S = y_1 + y_2$. Since $y_i$ are independent with mean 0: $Var(S) = \sigma_1^2 + \sigma_2^2$.
  • Fourth Moment: $E[(y_1+y_2)^4] = E[y_1^4 + 4y_1^3y_2 + 6y_1^2y_2^2 + 4y_1y_2^3 + y_2^4]$. By independence the expectation factors over each cross term, and every cross term containing a first power of $y_1$ or $y_2$ vanishes because $E[y_i]=0$ (e.g., $4E[y_1^3 y_2] = 4E[y_1^3]E[y_2] = 0$). The only surviving cross term is $6E[y_1^2]E[y_2^2] = 6\sigma_1^2 \sigma_2^2$. So, $E[S^4] = E[y_1^4] + 6\sigma_1^2\sigma_2^2 + E[y_2^4]$.
  • Substitute Kurtosis: Since $E[y_i^4] = (\mathcal{K}(y_i)+3)\sigma_i^4$: $E[S^4] = (\mathcal{K}(y_1)+3)\sigma_1^4 + 6\sigma_1^2\sigma_2^2 + (\mathcal{K}(y_2)+3)\sigma_2^4$ $= \mathcal{K}(y_1)\sigma_1^4 + \mathcal{K}(y_2)\sigma_2^4 + 3(\sigma_1^4 + 2\sigma_1^2\sigma_2^2 + \sigma_2^4)$ $= \mathcal{K}(y_1)\sigma_1^4 + \mathcal{K}(y_2)\sigma_2^4 + 3(\sigma_1^2 + \sigma_2^2)^2$.
  • Final Kurtosis: $\mathcal{K}(S) = \frac{E[S^4]}{(\sigma_1^2+\sigma_2^2)^2} - 3$ $\mathcal{K}(S) = \frac{\sigma_1^4 \mathcal{K}(y_1) + \sigma_2^4 \mathcal{K}(y_2) + 3(\text{Var})^2}{(\text{Var})^2} - 3$ $\mathcal{K}(S) = \frac{\sigma_1^4 \mathcal{K}(y_1) + \sigma_2^4 \mathcal{K}(y_2)}{(\sigma_1^2+\sigma_2^2)^2} + 3 - 3$ $\mathcal{K}(S) = \frac{\sigma_1^4 \mathcal{K}(y_1) + \sigma_2^4 \mathcal{K}(y_2)}{(\sigma_1^2+\sigma_2^2)^2}$.

(b) CLT Implication If $\sigma_1 = \sigma_2 = 1$ and $\mathcal{K}(y_1) = \mathcal{K}(y_2) = \kappa$:

$$\mathcal{K}(Sum) = \frac{1 \cdot \kappa + 1 \cdot \kappa}{(1+1)^2} = \frac{2\kappa}{4} = \frac{\kappa}{2}$$

The kurtosis is halved. As we add more variables, the kurtosis approaches 0. This confirms that sums of independent variables become more Gaussian (Central Limit Theorem).
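A Monte Carlo check of the formula derived in (a), using two independent zero-mean sources (a Laplace and a uniform variable, chosen purely for illustration):

```python
import numpy as np
from scipy.stats import kurtosis

rng = np.random.default_rng(10)
N = 500_000
y1 = rng.laplace(scale=2.0, size=N)       # variance 8, excess kurtosis 3
y2 = rng.uniform(-3, 3, size=N)           # variance 3, excess kurtosis -1.2

s1, s2 = y1.var(), y2.var()
k1, k2 = kurtosis(y1), kurtosis(y2)
predicted = (s1**2 * k1 + s2**2 * k2) / (s1 + s2) ** 2
print(predicted, kurtosis(y1 + y2))       # agree up to Monte Carlo noise
```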


Question 4: The Permutation Ambiguity (Matrix Algebra)

[Source: Note 9, Source 135-142] ICA cannot recover the order of sources. This question formalizes that using Permutation Matrices.

  • (a) Permutation Matrix: Define a permutation matrix $P$ that swaps the $i$-th and $j$-th elements of a vector. Write down $P$ explicitly for a 2D case where it swaps the 1st and 2nd elements.
  • (b) Invariance: Let the ICA model be $X = LZ$. Suppose we permute the sources to define $\tilde{Z} = PZ$. Find the new mixing matrix $\tilde{L}$ such that $X = \tilde{L}\tilde{Z}$ holds.
  • (c) Orthogonality: Show that if the original sources $Z$ had $Cov(Z)=I$, the permuted sources $\tilde{Z}$ also satisfy $Cov(\tilde{Z})=I$. (Hint: Use the property $P P^\top = I$).

ICA Question 4 Solution: Permutation Ambiguity

(a) Permutation Matrix A matrix $P$ that swaps the 1st and 2nd elements in $\mathbb{R}^2$ is:

$$P = \begin{pmatrix} 0 & 1 \\ 1 & 0 \end{pmatrix}$$

(b) New Mixing Matrix

  • Model: $X = LZ$.
  • New sources: $\tilde{Z} = PZ \implies Z = P^{-1}\tilde{Z} = P^\top \tilde{Z}$ (Since P is orthogonal).
  • Substitute back: $X = L (P^\top \tilde{Z}) = (L P^\top) \tilde{Z}$.
  • Therefore, the new mixing matrix is $\tilde{L} = L P^\top$. This effectively permutes the columns of $L$. (If $L = [c_1, c_2]$, then $\tilde{L} = [c_2, c_1]$).

(c) Orthogonality Check

  • Given $Cov(Z) = I$.
  • $Cov(\tilde{Z}) = Cov(PZ) = P Cov(Z) P^\top = P I P^\top = P P^\top$.
  • Since permutation matrices are orthogonal ($P^{-1} = P^\top$): $P P^\top = I$.
  • Thus, the permuted sources remain uncorrelated with unit variance. This is why ICA cannot determine the order of the components.

Negentropy is usually tested in the ICA (Independent Component Analysis) part of the course. It is a more robust alternative to kurtosis for measuring non-Gaussianity. There are three main ways it tends to be examined, listed from most to least likely:


Exam Angle 1: Definition and Basic Properties (Concept Question)

What it tests: understanding why we use negentropy and why it is non-negative.

Predicted questions:

  1. Definition: Define the differential entropy $H(y)$ of a random vector $y$ with density $f(y)$.
  2. Negentropy: Define Negentropy $J(y)$ in terms of $H(y)$. Explain why $J(y) \ge 0$ always holds and under what condition $J(y) = 0$.
  3. Motivation: Why is maximizing Negentropy equivalent to finding independent components in ICA?

Reference answer:

  • Differential Entropy: $H(y) = - \int f(y) \log f(y) dy = E[-\log f(y)]$.
  • Negentropy definition: $J(y) = H(y_{gauss}) - H(y)$, where $y_{gauss}$ is a Gaussian random variable with the same covariance matrix as $y$.
  • Properties:
    • Always non-negative ($J(y) \ge 0$): this is an information-theoretic theorem: among all distributions with a fixed variance, the Gaussian has the largest differential entropy (maximum entropy). Hence $H(y_{gauss}) \ge H(y)$.
    • Zero condition: $J(y) = 0$ if and only if $y$ itself is Gaussian.
  • ICA Connection: the goal of ICA is to find the “most non-Gaussian” directions (because, by the Central Limit Theorem, mixed signals tend toward Gaussian). Since $J(y)$ measures the “distance” of $y$ from a Gaussian, maximizing negentropy $J(y)$ is equivalent to maximizing non-Gaussianity and therefore separates the independent source signals.

Exam Angle 2: The Relationship between Negentropy and Kurtosis (Derivation/Computation)

What it tests: the course material explicitly gives an approximation formula for negentropy and points out that in a special case it reduces to the squared kurtosis. This is the most likely spot for a computational derivation.

Predicted question: Approximating Negentropy. In practice, $f(y)$ is unknown, so we estimate $J(y)$ using expectations. The general approximation is:

$$J(y) \approx [E(G(y)) - E(G(\nu))]^2$$

where $\nu \sim N(0,1)$ and $G$ is a non-quadratic function. Question: If we choose the function $G(y) = y^4$, show that maximizing Negentropy is equivalent to maximizing the squared Excess Kurtosis. (Assume $y$ has mean 0 and variance 1).

Detailed derivation (memorize this):

  1. Set $G(y) = y^4$.
  2. Compute the Gaussian term $E[G(\nu)]$: for a standard normal $\nu \sim N(0,1)$, the fourth moment is $E[\nu^4] = 3$.
  3. Compute the term for $y$: $E[G(y)] = E[y^4]$.
  4. Substitute into the approximation: $$J(y) \propto [E(y^4) - 3]^2$$
  5. Recall the definition of excess kurtosis: $\mathcal{K}(y) = E[y^4] - 3$ (since the variance is 1).
  6. Conclusion: $$J(y) \propto (\mathcal{K}(y))^2$$ Therefore, when $y^4$ is used as the nonlinear function, maximizing negentropy is the same as maximizing the squared kurtosis. This explains why FastICA sometimes uses the kurtosis directly as its objective.

Exam Angle 3: Comparing the Two Criteria (Short Answer)

What it tests: the pros and cons of choosing negentropy versus kurtosis in ICA. The PDF says little about this, but Source 1013 mentions “non-quadratic functions”, hinting at robustness.

Predicted question: We can define Negentropy using different $G$ functions, such as $G(y) = y^4$ or $G(y) = \log(\cosh(y))$. Question: Why might we prefer a function like $\log(\cosh(y))$ over $y^4$ in robust ICA algorithms?

Reference answer (supplemented from general statistical knowledge; only implicit in the source material):

  • Using $G(y) = y^4$ (kurtosis) is very sensitive to outliers, because the fourth power amplifies values in the tails.
  • Using $G(y) = \log(\cosh(y))$ (a negentropy approximation) grows more slowly (quadratically near zero and roughly linearly in the tails), so it is more robust to noise and outliers in the data.
  • (The source material says $g$ is “some non-quadratic functions”, precisely to obtain better properties than a plain moment estimate.)

Summary: what you should do now

  1. Memorize the formula: $J(y) = H(y_{gauss}) - H(y)$.
  2. Memorize the theorem: for a fixed variance, the Gaussian distribution has maximal entropy.
  3. Know the derivation: when $G(y)=y^4$, $J(y) \approx (\text{Kurtosis})^2$.

Alex, for SVD (singular value decomposition), ED (eigendecomposition/spectral decomposition), and QR Decomposition, the definitions and properties are described in detail in the Linear Algebra part of Everything437.pdf.


STA437 Final Exam Prediction - Linear Algebra Special

Topics: SVD, Spectral Decomposition, QR in Statistics

Question 1: SVD vs. Eigen-Decomposition in PCA (The “Dual” Relationship)

[Source: Simulation Pt.1 Q2, Source 161, 713-723] This question connects SVD directly to the Sample Covariance Matrix, explaining why SVD is the preferred computational method for PCA.

  • (a) SVD to Covariance: Let the centered data matrix be $X \in \mathbb{R}^{n \times p}$. Its Singular Value Decomposition is $X = UDV^\top$. Show that the sample covariance matrix $S = \frac{1}{n}X^\top X$ can be diagonalized as $S = V \Lambda V^\top$. Express the eigenvalues $\Lambda$ of $S$ explicitly in terms of the singular values $D$ of $X$.
  • (b) The “Dual” PCA (High Dimension Case): Suppose $p \gg n$ (more features than samples, e.g., Genomics). The matrix $X^\top X$ is $p \times p$ (huge), but $XX^\top$ is $n \times n$ (small). Show how to compute the right singular vectors $V$ (the Principal Components) using only the eigenvalues/vectors of the smaller matrix $XX^\top$. (Hint: Use the relationship $X^\top (u_i) = \dots$?)

Solution 1: SVD vs. Eigen-Decomposition

(a) SVD to Covariance

  1. Definitions: $X = UDV^\top$, where $U^\top U = I_n$ (or $I_p$ for thin SVD), $V^\top V = VV^\top = I_p$, and $D = \text{diag}(d_1, \dots, d_p)$.
  2. Covariance Formula: $S = \frac{1}{n} X^\top X$.
  3. Substitution: $$S = \frac{1}{n} (UDV^\top)^\top (UDV^\top) = \frac{1}{n} (V D^\top U^\top) (U D V^\top)$$
  4. Simplify: Since $U$ is column-orthogonal, $U^\top U = I$. $$S = \frac{1}{n} V D (I) D V^\top = V (\frac{1}{n}D^2) V^\top$$
  5. Conclusion: This matches the spectral decomposition form $S = V \Lambda V^\top$. The eigenvalues of $S$ are related to singular values of $X$ by: $$\lambda_i = \frac{d_i^2}{n}$$

(b) The “Dual” PCA (Small $n$, Large $p$)

  1. Problem: We want $V$ (eigenvectors of $X^\top X$), but $X^\top X$ is too big to compute.
  2. Use $XX^\top$: Compute eigenvalues/vectors of the smaller $n \times n$ matrix $XX^\top$. $$XX^\top = (UDV^\top)(UDV^\top)^\top = UDV^\top V D U^\top = U D^2 U^\top$$ This gives us $U$ (left singular vectors) and $D^2$ (squared singular values).
  3. Recover $V$: transpose the SVD equation $X = UDV^\top$ to get $X^\top = VDU^\top$. Right-multiplying by $U$ (and using $U^\top U = I$) gives: $$X^\top U = VD$$ Right-multiplying by $D^{-1}$: $$V = X^\top U D^{-1}$$
  4. Result: We can compute $v_i = \frac{1}{d_i} X^\top u_i$. This allows us to find Principal Components $V$ without ever forming the huge covariance matrix.
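A sketch of this dual computation on hypothetical data with $p \gg n$, compared against numpy's direct SVD (the singular vectors agree up to sign):

```python
import numpy as np

rng = np.random.default_rng(11)
n, p = 10, 1000
X = rng.normal(size=(n, p))        # treat as an already-centered data matrix

d2, U = np.linalg.eigh(X @ X.T)    # small n x n eigenproblem: X X^T = U D^2 U^T
order = np.argsort(d2)[::-1]
d, U = np.sqrt(d2[order]), U[:, order]
V_dual = X.T @ U / d               # v_i = (1/d_i) X^T u_i

_, d_svd, Vt = np.linalg.svd(X, full_matrices=False)
print(np.allclose(d, d_svd))                                    # same singular values
print(np.allclose(np.abs(np.sum(V_dual * Vt.T, axis=0)), 1.0))  # same right vectors (up to sign)
```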

Question 2: OLS Estimation via QR Decomposition

[Source: Note Linear Algebra, Source 311-316] Although the course focuses on covariance structures, QR decomposition is introduced as a tool for linear models. This question tests algebraic simplification.

  • (a) QR Setup: Let $X \in \mathbb{R}^{n \times p}$ be a full-rank design matrix. We decompose it as $X = QR$, where $Q \in \mathbb{R}^{n \times p}$ has orthonormal columns ($Q^\top Q = I_p$) and $R \in \mathbb{R}^{p \times p}$ is an upper-triangular invertible matrix.
  • (b) Simplifying OLS: The Ordinary Least Squares (OLS) estimator is given by $\hat{\beta} = (X^\top X)^{-1} X^\top y$. Substitute $X=QR$ into this equation and derive the simplified expression for $\hat{\beta}$ that does not involve any matrix inversions of the form $(\cdot)^{-1}$ except for $R^{-1}$ (which is easy to compute via back-substitution).
  • (c) Projection Matrix: Show that the “Hat Matrix” $H = X(X^\top X)^{-1}X^\top$ simplifies to $QQ^\top$ using the QR decomposition.

Solution 2: OLS Estimation via QR Decomposition

(a) QR Setup $X = QR$, $Q^\top Q = I$, $R$ is upper triangular.

(b) Simplifying OLS

  1. OLS Formula: $\hat{\beta} = (X^\top X)^{-1} X^\top y$.
  2. Substitute $X=QR$: $$\hat{\beta} = ((QR)^\top (QR))^{-1} (QR)^\top y$$ $$\hat{\beta} = (R^\top Q^\top Q R)^{-1} R^\top Q^\top y$$
  3. Use Orthogonality ($Q^\top Q = I$): $$\hat{\beta} = (R^\top I R)^{-1} R^\top Q^\top y = (R^\top R)^{-1} R^\top Q^\top y$$
  4. Expand Inverse: Note that $(AB)^{-1} = B^{-1}A^{-1}$. $$(R^\top R)^{-1} = R^{-1} (R^\top)^{-1}$$
  5. Final Simplification: $$\hat{\beta} = R^{-1} (R^\top)^{-1} R^\top Q^\top y$$ Since $(R^\top)^{-1} R^\top = I$: $$\hat{\beta} = R^{-1} Q^\top y$$ (Benefit: solving $R\hat{\beta} = Q^\top y$ only requires back-substitution on a triangular system, which is fast and numerically stable.)

(c) Projection Matrix $H$

  1. Definition: $H = X(X^\top X)^{-1}X^\top$.
  2. From (b), we know: $(X^\top X)^{-1}X^\top = R^{-1}Q^\top$. (Since $\hat{\beta} = (X^\top X)^{-1}X^\top y = R^{-1}Q^\top y$).
  3. Substitute into H: $$H = X (R^{-1} Q^\top)$$ $$H = (QR) R^{-1} Q^\top$$ $$H = Q (R R^{-1}) Q^\top = Q I Q^\top = Q Q^\top$$
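A minimal sketch of the QR route to OLS and to the hat matrix, checked against the normal-equation formulas on hypothetical data:

```python
import numpy as np
from scipy.linalg import solve_triangular

rng = np.random.default_rng(12)
n, p = 30, 4
X = rng.normal(size=(n, p))
y = rng.normal(size=n)

Q, R = np.linalg.qr(X)                         # thin QR: Q is n x p, R is p x p upper triangular
beta_qr = solve_triangular(R, Q.T @ y)         # back-substitution, no explicit inverse
beta_ne = np.linalg.solve(X.T @ X, X.T @ y)    # normal equations
print(np.allclose(beta_qr, beta_ne))           # True

H = X @ np.linalg.solve(X.T @ X, X.T)
print(np.allclose(H, Q @ Q.T))                 # True: H = Q Q^T
```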

Question 3: Matrix Powers & “Whitening” (Sphering)

[Source: Linear Algebra, Source 359-360, 466-469] This tests the application of Spectral Decomposition (ED) in standardizing multivariate data, a prerequisite for ICA and CCA.

  • (a) The Inverse Square Root: Let $\Sigma$ be a $p \times p$ symmetric Positive Definite (PD) matrix with spectral decomposition $\Sigma = U \Lambda U^\top$. Define the matrix $\Sigma^{-1/2}$. Show that $\Sigma^{-1/2}$ is symmetric.
  • (b) Whitening Transformation: Let $X$ be a random vector with mean 0 and covariance $\Sigma$. Define the transformed vector $Z = \Sigma^{-1/2}X$. Prove that the covariance matrix of $Z$ is the Identity matrix $I_p$. (This process is called “Whitening” or “Sphering”).
  • (c) Mahalanobis Distance: Show that the squared Euclidean norm of the whitened vector, $||Z||^2$, is exactly equal to the squared Mahalanobis distance of the original vector $X$ from the origin: $X^\top \Sigma^{-1} X$.

Solution 3: Matrix Powers & “Whitening”

(a) The Inverse Square Root

  1. Spectral Decomposition: $\Sigma = U \Lambda U^\top$, where $\Lambda = \text{diag}(\lambda_1, \dots, \lambda_p)$ with $\lambda_i > 0$.
  2. Definition: $\Sigma^{-1/2} = U \Lambda^{-1/2} U^\top$, where $\Lambda^{-1/2} = \text{diag}(1/\sqrt{\lambda_1}, \dots, 1/\sqrt{\lambda_p})$.
  3. Symmetry Check: $$(\Sigma^{-1/2})^\top = (U \Lambda^{-1/2} U^\top)^\top = (U^\top)^\top (\Lambda^{-1/2})^\top U^\top = U \Lambda^{-1/2} U^\top$$ (Since diagonal matrices are symmetric). Thus, $(\Sigma^{-1/2})^\top = \Sigma^{-1/2}$, so it is symmetric.

(b) Whitening Transformation

  1. Setup: $Z = \Sigma^{-1/2}X$. We need $Cov(Z)$.
  2. Covariance Calculation: $$Cov(Z) = Cov(\Sigma^{-1/2}X) = \Sigma^{-1/2} Cov(X) (\Sigma^{-1/2})^\top$$
  3. Substitute $Cov(X) = \Sigma$: $$Cov(Z) = \Sigma^{-1/2} \Sigma \Sigma^{-1/2}$$ (Using symmetry from part a).
  4. Use Eigen-decomposition: $$Cov(Z) = (U \Lambda^{-1/2} U^\top) (U \Lambda U^\top) (U \Lambda^{-1/2} U^\top)$$ Using $U^\top U = I$: $$Cov(Z) = U (\Lambda^{-1/2} \Lambda \Lambda^{-1/2}) U^\top$$ The diagonal term: $\frac{1}{\sqrt{\lambda}} \cdot \lambda \cdot \frac{1}{\sqrt{\lambda}} = 1$. So inner term is $I$. $$Cov(Z) = U I U^\top = U U^\top = I$$

(c) Mahalanobis Distance

  1. Squared Norm: $||Z||^2 = Z^\top Z$.
  2. Substitute Z: $$||Z||^2 = (\Sigma^{-1/2}X)^\top (\Sigma^{-1/2}X) = X^\top (\Sigma^{-1/2})^\top \Sigma^{-1/2} X$$
  3. Simplify Matrix Product: $$(\Sigma^{-1/2})^\top \Sigma^{-1/2} = \Sigma^{-1/2} \Sigma^{-1/2} = \Sigma^{-1}$$ (Since $A^{1/2}A^{1/2} = A$, so $A^{-1/2}A^{-1/2} = A^{-1}$).
  4. Result: $$||Z||^2 = X^\top \Sigma^{-1} X$$ This is the definition of the squared Mahalanobis distance.
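A numerical check of the whitening and Mahalanobis facts, with an arbitrary (hypothetical) SPD covariance:

```python
import numpy as np

rng = np.random.default_rng(13)
A = rng.normal(size=(3, 3))
Sigma = A @ A.T + np.eye(3)                       # SPD covariance

w, U = np.linalg.eigh(Sigma)
Sigma_inv_sqrt = U @ np.diag(w ** -0.5) @ U.T     # U Lambda^{-1/2} U^T (symmetric)

X = rng.multivariate_normal(np.zeros(3), Sigma, size=100_000)
Z = X @ Sigma_inv_sqrt                            # each row is Sigma^{-1/2} x_i
print(np.cov(Z, rowvar=False).round(2))           # approximately the identity

x = X[0]
z = Sigma_inv_sqrt @ x
print(np.isclose(z @ z, x @ np.linalg.solve(Sigma, x)))   # ||z||^2 = squared Mahalanobis distance
```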

Alex, computing the eigenvalues and eigenvectors of a matrix is the most basic computational skill in this course (especially for PCA, FA, and MVN testing). On the exam, hand calculation usually only involves $2 \times 2$ or simple $3 \times 3$ matrices.

Here is a foolproof, standardized three-step procedure; write it on your cheat sheet:

Core Definition

For a matrix $A$, a vector whose direction is unchanged by the transformation (only its length changes) satisfies:

$$A v = \lambda v$$

Rewritten as an equation to solve:

$$(A - \lambda I)v = 0$$

Step 1: Find the Eigenvalues $\lambda$ (The Characteristic Equation)

Goal: find the values of $\lambda$ that make $(A - \lambda I)$ singular (determinant equal to 0).

Formula

$$\det(A - \lambda I) = 0$$

Worked example: suppose the covariance matrix is $S = \begin{pmatrix} 4 & 2 \\ 2 & 7 \end{pmatrix}$.

  1. Write out $S - \lambda I$: $$\begin{pmatrix} 4 - \lambda & 2 \\ 2 & 7 - \lambda \end{pmatrix}$$
  2. Compute the determinant (product of the diagonal minus product of the anti-diagonal): $$(4 - \lambda)(7 - \lambda) - (2)(2) = 0$$
  3. Solve the quadratic: $$\lambda^2 - 11\lambda + 28 - 4 = 0$$ $$\lambda^2 - 11\lambda + 24 = 0$$ $$(\lambda - 3)(\lambda - 8) = 0$$ Result: the eigenvalues are $\lambda_1 = 8, \lambda_2 = 3$ (usually listed from largest to smallest).

Step 2: Find the Eigenvectors $v$ (The Null Space)

Goal: plug each computed $\lambda$ back into $(A - \lambda I)v = 0$ and solve for $v$. Key point: this system must have infinitely many solutions (the rows are multiples of each other). If you obtain the unique solution $v=0$, Step 1 was done incorrectly.

Continuing the example, Case 1: $\lambda_1 = 8$

  1. Substitute into $(S - 8I)v = 0$: $$\begin{pmatrix} 4-8 & 2 \\ 2 & 7-8 \end{pmatrix} \begin{pmatrix} x_1 \\ x_2 \end{pmatrix} = \begin{pmatrix} 0 \\ 0 \end{pmatrix}$$ $$\begin{pmatrix} -4 & 2 \\ 2 & -1 \end{pmatrix} \begin{pmatrix} x_1 \\ x_2 \end{pmatrix} = 0$$
  2. Observe that the first row $-4x_1 + 2x_2 = 0$ and the second row $2x_1 - x_2 = 0$ are really the same equation (the second row is $-0.5$ times the first).
  3. Solve: $x_2 = 2x_1$.
  4. Pick a simple integer solution (say $x_1=1$): $$v_1 = \begin{pmatrix} 1 \\ 2 \end{pmatrix}$$

Case 2: $\lambda_2 = 3$

  1. Substitute into $(S - 3I)v = 0$: $$\begin{pmatrix} 4-3 & 2 \\ 2 & 7-3 \end{pmatrix} \begin{pmatrix} x_1 \\ x_2 \end{pmatrix} = 0$$ $$\begin{pmatrix} 1 & 2 \\ 2 & 4 \end{pmatrix} \begin{pmatrix} x_1 \\ x_2 \end{pmatrix} = 0$$
  2. Observe the equation: $x_1 + 2x_2 = 0$.
  3. Solve: $x_1 = -2x_2$.
  4. Pick a simple integer solution (say $x_2=1$): $$v_2 = \begin{pmatrix} -2 \\ 1 \end{pmatrix}$$

Step 3: Normalization (Required in STA437)

In statistics (PCA/FA) we require eigenvectors of unit length ($||v||=1$). Procedure: compute the vector's length, then divide by it.

  1. For $v_1 = \begin{pmatrix} 1 \\ 2 \end{pmatrix}$:

    • Length: $||v_1|| = \sqrt{1^2 + 2^2} = \sqrt{5}$.
    • Normalized: $u_1 = \begin{pmatrix} 1/\sqrt{5} \\ 2/\sqrt{5} \end{pmatrix}$.
  2. For $v_2 = \begin{pmatrix} -2 \\ 1 \end{pmatrix}$:

    • Length: $||v_2|| = \sqrt{(-2)^2 + 1^2} = \sqrt{5}$.
    • Normalized: $u_2 = \begin{pmatrix} -2/\sqrt{5} \\ 1/\sqrt{5} \end{pmatrix}$. (A quick numpy check of this worked example follows.)
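The same worked example checked with numpy (only the ordering and the signs may differ from the hand computation):

```python
import numpy as np

S = np.array([[4.0, 2.0],
              [2.0, 7.0]])
vals, vecs = np.linalg.eigh(S)   # eigenvalues in ascending order
print(vals)                      # [3. 8.]
print(vecs)                      # columns: +/-(-2, 1)/sqrt(5) and +/-(1, 2)/sqrt(5)
```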

Quick “Cheat” Sanity Checks for the Exam

If you finish a computation and are not sure it is right, check it with these three properties:

  1. Trace Rule: the sum of the eigenvalues equals the sum of the diagonal entries of the matrix.

    • Example: $\lambda_1 + \lambda_2 = 8 + 3 = 11$.
    • Matrix: $4 + 7 = 11$. ✅ It matches.
  2. Determinant Rule: the product of the eigenvalues equals the determinant of the matrix.

    • Example: $\lambda_1 \cdot \lambda_2 = 8 \times 3 = 24$.
    • Matrix: $(4)(7) - (2)(2) = 28 - 4 = 24$. ✅ It matches.
  3. Symmetric Matrix Rule: if the matrix is symmetric (e.g., a covariance matrix), eigenvectors belonging to distinct eigenvalues must be orthogonal (dot product 0).

    • Check: $v_1 \cdot v_2 = (1)(-2) + (2)(1) = -2 + 2 = 0$. ✅ It matches.

Special case reminder: projection matrices. If a Projection Matrix $P$ (such as the Hat Matrix $H$) appears in a question, then according to the PDF you do not need to compute a determinant.

  • The eigenvalues of $P$ can only be 1 or 0.
  • The number of eigenvalues equal to 1 = the rank of the matrix.
  • The number of eigenvalues equal to 0 = the dimension $n$ minus the rank.