Key Reminders

  1. Matrix differentiation and properties of the trace: in the Regression and PCA derivations (especially Question 1c and Question 2a), the cyclic property of the trace ($tr(ABC)=tr(CAB)$) is the key to the solution.
  2. PCA vs. FA: exams love asking “Why is specific variance important?” or making you compute a communality. Remember that FA splits the variance into a common part (through $L$) and a unique part ($\Psi$), while PCA makes no such distinction.
  3. Extremes of kurtosis: the core idea of ICA is that the Central Limit Theorem tells us mixed signals look more Gaussian, so we maximize non-Gaussianity (kurtosis) to separate the signals. Be sure to review the proof for Question 5b on page 9 of the PDF.
  4. The MVN conditional distribution formula: $E[Y|X] = \mu_Y + \Sigma_{YX}\Sigma_{XX}^{-1}(X-\mu_X)$. If you do not memorize it, re-deriving it during the exam will be slow.

Question 4: Factor Analysis (FA) vs. PCA

Key topics: model assumptions, the Heywood case, rotation invariance

Focuses on the “Identifiability” and model checking (Heywood cases).

  • (a) The Model: Write down the orthogonal Factor Analysis model equation involving Loadings $L$, Factors $Z$, and Specific Variances $\Psi$. State the assumptions on the covariance of $Z$ and $\epsilon$.
  • (b) Heywood Case: Suppose you fit a 1-factor model and find that for one variable, the estimated loading squared $l_i^2$ is greater than the total variance of that variable (standardized variance = 1). This implies the specific variance $\psi_i = 1 - l_i^2$ is negative. Is this a valid statistical model? Explain why or why not.
  • (c) Rotation Invariance: Prove that the “Total Communality” (the total variance explained by the common factors, $\sum h_i$) is invariant to orthogonal rotation of the loadings matrix $L$. (Hint: Use the Trace property) .

(a) The Model

The orthogonal factor model used in Factor Analysis (FA) is defined as:

$$X = LZ + \epsilon$$

where:

  • $X \in \mathbb{R}^p$ is the vector of observed variables.
  • $L \in \mathbb{R}^{p \times r}$ is the loadings matrix.
  • $Z \in \mathbb{R}^r$ is the vector of latent factors, assumed to satisfy $Z \sim \mathcal{N}(0, I_r)$. This means the factors are mutually uncorrelated and standardized.
  • $\epsilon \in \mathbb{R}^p$ is the vector of specific errors (noise), assumed to satisfy $\epsilon \sim \mathcal{N}(0, \Psi)$, where $\Psi = \text{diag}(\psi_1, ..., \psi_p)$ is a diagonal matrix (the specific variances).
  • Key assumption: $Z$ and $\epsilon$ are independent, i.e., $Cov(Z, \epsilon) = 0$.

This yields the covariance structure of $X$:

$$Cov(X) = \Sigma = LL^\top + \Psi$$

(b) Heywood Case (Boundary Solutions)

Answer: no, this is not a valid statistical model. Explanation: under the assumption of standardized variables, the total variance is 1. According to the model, the variance of the $i$-th variable decomposes as:

$$Var(X_i) = \sum_{j=1}^r l_{ij}^2 + \psi_i = 1$$

If the estimated squared loading $l_i^2$ (in the 1-factor model) is greater than 1, then by $\psi_i = 1 - l_i^2$ the estimated specific variance $\psi_i$ is negative. A variance must be non-negative by definition ($\psi_i \ge 0$). This situation is called a Heywood case and usually indicates model misspecification (e.g., extracting too many factors) or unstable estimates caused by a small sample size.

(c) Rotation Invariance

Problem: show that the total communality $\sum h_i$ is invariant to rotation. Proof:

  1. Definition: the communality $h_i$ is the variance of the $i$-th variable explained by the common factors, i.e., the sum of squares of the $i$-th row of $L$. The total communality is the sum of all $h_i$: $$\text{Total Communality} = \sum_{i=1}^p h_i = \sum_{i=1}^p \sum_{j=1}^r l_{ij}^2 = ||L||_F^2 = tr(LL^\top)$$ (using the Frobenius norm and the properties of the trace).
  2. Rotation: let $Q$ be an orthogonal rotation matrix ($Q^\top Q = I$). The rotated loadings matrix is $L^* = LQ$.
  3. Total communality after rotation: $$\text{Total Comm}^* = tr(L^* (L^*)^\top) = tr((LQ)(LQ)^\top)$$ $$= tr(L Q Q^\top L^\top)$$ Since $Q$ is orthogonal, $Q Q^\top = I$: $$= tr(L I L^\top) = tr(LL^\top)$$
  4. Conclusion: $\text{Total Comm}^* = \text{Total Comm}$, so the total communality is invariant to rotation (see the numerical sketch below).
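A minimal numerical sketch of this invariance, using an arbitrary (hypothetical) loadings matrix and a random orthogonal $Q$; only numpy is assumed:

```python
import numpy as np

rng = np.random.default_rng(0)
L = rng.normal(size=(6, 2))                    # p = 6 variables, r = 2 factors (made up)
Q, _ = np.linalg.qr(rng.normal(size=(2, 2)))   # a random 2x2 orthogonal rotation

L_star = L @ Q                                 # rotated loadings
total_comm = np.trace(L @ L.T)                 # sum of communalities = tr(L L^T)
total_comm_rot = np.trace(L_star @ L_star.T)
print(np.isclose(total_comm, total_comm_rot))  # True: invariant under rotation
```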

Question 5: Independent Component Analysis (ICA)

ICA is distinct because of the Gaussianity constraint. This question targets the “Kurtosis” maximization proof.

  • (a) Kurtosis & Scale Invariance: Let $X$ be a random variable and $w$ be a scalar. Prove that the Excess Kurtosis is scale-invariant, i.e., $\mathcal{K}(wX) = \mathcal{K}(X)$.
  • (b) Maximizing Non-Gaussianity: We define the ICA objective as maximizing $| \mathcal{K}(y) |$ where $y = w_1 z_1 + w_2 z_2$ (a mixture of independent sources). Under the whitening constraint $w_1^2 + w_2^2 = 1$, show that the maximum occurs only at the boundaries (e.g., $w=(1,0)$). Explain what this implies physically about recovering the original sources.
  • (c) Why not PCA?: Consider a dataset where the variables are uncorrelated but dependent (e.g., uniformly distributed on a diamond shape). Explain why PCA cannot separate these signals (Hint: What is the rotation matrix for uncorrelated data?), whereas ICA can.

Question 5: Independent Component Analysis (ICA)

Key topics: properties of kurtosis, the ICA optimization objective

(a) Kurtosis Scale Invariance

Problem: prove that $\mathcal{K}(wX) = \mathcal{K}(X)$. Proof: excess kurtosis is defined as $\mathcal{K}(X) = \frac{E[(X-\mu)^4]}{(\sigma^2)^2} - 3$. Let $Y = wX$.

  1. Mean: $\mu_Y = w\mu_X$.
  2. Centered variable: $Y - \mu_Y = w(X - \mu_X)$.
  3. Variance: $\sigma_Y^2 = Var(wX) = w^2 \sigma_X^2$.
  4. Substitute into the definition: $$\mathcal{K}(wX) = \frac{E[(w(X-\mu_X))^4]}{(w^2 \sigma_X^2)^2} - 3$$ $$= \frac{w^4 E[(X-\mu_X)^4]}{w^4 (\sigma_X^2)^2} - 3$$ $$= \frac{E[(X-\mu_X)^4]}{(\sigma_X^2)^2} - 3 = \mathcal{K}(X)$$ Conclusion: the scalar $w$ cancels between numerator and denominator (as long as $w \neq 0$), so kurtosis is scale invariant (see the quick check below).
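A quick empirical check of the scale invariance, on a hypothetical exponential sample (scipy's `kurtosis` returns excess kurtosis):

```python
import numpy as np
from scipy.stats import kurtosis     # Fisher definition: excess kurtosis

rng = np.random.default_rng(1)
x = rng.exponential(size=100_000)    # any non-Gaussian sample will do
w = 5.7                              # arbitrary nonzero scale factor
print(kurtosis(x), kurtosis(w * x))  # the two values agree (up to floating point)
```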

(b) Maximizing Non-Gaussianity

Problem: under the constraint $w_1^2 + w_2^2 = 1$, show that $|\mathcal{K}(w_1 z_1 + w_2 z_2)|$ is maximized on the boundary. Proof: by the linear-combination property from the PDF (assuming the $z_i$ are independent and standardized):

$$\mathcal{K}(w_1 z_1 + w_2 z_2) = w_1^4 \mathcal{K}(z_1) + w_2^4 \mathcal{K}(z_2)$$

We need to maximize the objective $J(w) = |w_1^4 \mathcal{K}(z_1) + w_2^4 \mathcal{K}(z_2)|$. Using the triangle inequality and the constraint $w_1^2 + w_2^2 = 1$ (which implies $w_i^4 \le w_i^2$):

$$|\mathcal{K}(y)| \le w_1^4 |\mathcal{K}(z_1)| + w_2^4 |\mathcal{K}(z_2)| \le (w_1^4 + w_2^4) \max(|\mathcal{K}(z_1)|, |\mathcal{K}(z_2)|)$$

Since $w_1^4 + w_2^4 \le (w_1^2 + w_2^2)^2 = 1$, this sum attains its maximum value 1 only when one $w_i^2 = 1$ and the other is 0. The maximum can therefore only occur at boundary points such as $w=(1, 0)$ or $w=(0, 1)$.

Physical interpretation: the solution $w=(1, 0)$ means $y = 1 \cdot z_1 + 0 \cdot z_2 = z_1$. In other words, the “most non-Gaussian” direction is exactly the direction of one of the original, independent source signals. Non-Gaussianity is maximized only when a source is fully separated; any genuine mixture has a smaller absolute kurtosis (it is closer to Gaussian, the CLT effect). A small simulation illustrating this follows.
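A minimal simulation sketch of this boundary behaviour, assuming two independent unit-variance Laplace sources (chosen only because they have nonzero excess kurtosis):

```python
import numpy as np
from scipy.stats import kurtosis

rng = np.random.default_rng(2)
z = rng.laplace(size=(2, 200_000)) / np.sqrt(2)   # two unit-variance Laplace sources
for w1 in [1.0, 0.9, 1 / np.sqrt(2)]:             # boundary -> progressively more mixed
    w2 = np.sqrt(1 - w1**2)
    y = w1 * z[0] + w2 * z[1]
    print(round(w1, 3), abs(kurtosis(y)))         # |kurtosis| shrinks as mixing increases
```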

(c) Why not PCA?

Answer: PCA relies on diagonalizing the covariance matrix to remove correlation (decorrelation).

  • If the data are already uncorrelated ($Cov = 0$) but statistically dependent, e.g., the points $(0,1), (0,-1), (1,0), (-1,0)$ distributed uniformly on a diamond, as mentioned in the PDF.
  • For such data the covariance matrix is already proportional to the identity (diagonal). The $X^\top X$ that PCA sees carries no preferred direction, so PCA cannot single out any particular rotation; to PCA, every orthogonal rotation is equally valid.
  • ICA's advantage: ICA uses not only second moments (covariance) but also higher-order moments (the fourth-moment kurtosis). Even when the covariance matrix is diagonal, ICA can use the kurtosis of the projections to pick out the directions along which the components are independent and thereby recover the source structure.

Question 6: Multivariate Hypothesis Testing & Conditional Distributions

The “Calculation” heavy question involving partitioned matrices.

  • (a) Conditional Distribution: Let $X \sim N_p(\mu, \Sigma)$ be partitioned into $X_A$ and $X_B$. Write down the formula for the conditional mean $E[X_A | X_B = x_B]$ and conditional variance $Var(X_A | X_B = x_B)$ using Schur complements.
  • (b) Independence vs. Correlation: In the context of Multivariate Normal Distribution (MVN), prove or explain why zero covariance (uncorrelatedness) implies statistical independence. Does this hold for non-Gaussian distributions?
  • (c) Two-Sample $T^2$ Test: You have two samples with sizes $n$ and $m$. Write down the expression for the pooled covariance matrix $S_{pooled}$. Then, state the Hotelling’s $T^2$ statistic for testing $H_0: \mu_x = \mu_y$ and its distribution under the null.

Question 6: Multivariate Hypothesis Testing

Key topics: the MVN conditional distribution formula, the $T^2$ statistic

(a) Conditional Distribution

Partition $X$ into two parts $X_A$ and $X_B$, i.e., $X = \begin{pmatrix} X_A \\ X_B \end{pmatrix} \sim \mathcal{N}_p \left( \begin{pmatrix} \mu_A \\ \mu_B \end{pmatrix}, \begin{pmatrix} \Sigma_{AA} & \Sigma_{AB} \\ \Sigma_{BA} & \Sigma_{BB} \end{pmatrix} \right)$.

Given $X_B = x_B$, the conditional distribution of $X_A$ is again multivariate normal with parameters:

  • Conditional mean: $$E[X_A | X_B = x_B] = \mu_A + \Sigma_{AB}\Sigma_{BB}^{-1}(x_B - \mu_B)$$ This is exactly the fitted value from regressing $X_A$ on $X_B$.
  • Conditional variance: $$Var(X_A | X_B = x_B) = \Sigma_{AA} - \Sigma_{AB}\Sigma_{BB}^{-1}\Sigma_{BA}$$ This is the Schur complement. Note that the conditional variance is a constant matrix; it does not depend on the particular value of $x_B$. (A numerical sketch follows.)
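A minimal numerical sketch of these formulas, using a hypothetical 3-dimensional MVN with $X_A = X_1$ and $X_B = (X_2, X_3)$:

```python
import numpy as np

mu = np.array([1.0, 2.0, 3.0])
Sigma = np.array([[4.0, 1.0, 0.5],
                  [1.0, 3.0, 0.8],
                  [0.5, 0.8, 2.0]])
A, B = [0], [1, 2]                              # X_A = X1, X_B = (X2, X3)
S_AA, S_AB = Sigma[np.ix_(A, A)], Sigma[np.ix_(A, B)]
S_BA, S_BB = Sigma[np.ix_(B, A)], Sigma[np.ix_(B, B)]

x_B = np.array([2.5, 2.0])                      # an observed value of X_B
cond_mean = mu[A] + S_AB @ np.linalg.solve(S_BB, x_B - mu[B])
cond_var  = S_AA - S_AB @ np.linalg.solve(S_BB, S_BA)   # Schur complement
print(cond_mean, cond_var)
```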

(b) Independence vs. Correlation

Proof/explanation: for the multivariate normal distribution (MVN), $\text{Uncorrelated} \iff \text{Independent}$.

  • If $X_A$ and $X_B$ are uncorrelated, then $\Sigma_{AB} = 0$.
  • The quadratic form $(x-\mu)^\top \Sigma^{-1} (x-\mu)$ in the joint density then splits into two pieces: since $\Sigma$ is block diagonal ($\Sigma_{AB}=0$), $\Sigma^{-1}$ is block diagonal as well.
  • Hence the joint density factors into the product of the marginals, $f(x_A, x_B) = f(x_A)f(x_B)$, which is exactly independence.

Non-Gaussian case: the property does not hold. Two non-Gaussian variables can be uncorrelated ($Cov=0$) yet still dependent, e.g., the diamond-distribution example mentioned in Question 5(c).

(c) Two-Sample $T^2$ Test

Pooled covariance: suppose the two samples have sizes $n$ and $m$ with sample covariance matrices $S_x$ and $S_y$.

$$S_{pooled} = \frac{(n-1)S_x + (m-1)S_y}{n + m - 2}$$

(Note: the PDF writes $(n+m-2)S_{pooled} = nS_x + mS_y$, which is based on the biased definition of the sample covariance with denominator $n$. In the exam, if you use the unbiased estimators (denominator $n-1$), use the formula above; if you use the PDF's definition, copy the PDF exactly.) To be safe, the PDF's version is $S_{pooled} = \frac{nS_x + mS_y}{n+m-2}$ (where $S_x, S_y$ are defined with $1/n$).

Hotelling’s $T^2$ Statistic:

$$T^2 = \frac{nm}{n+m} (\bar{x} - \bar{y})^\top S_{pooled}^{-1} (\bar{x} - \bar{y})$$

The coefficient $\frac{nm}{n+m}$ comes from $\frac{1}{1/n + 1/m}$.

Distribution: under the null hypothesis $H_0: \mu_x = \mu_y$, the statistic follows

$$T^2 \sim T^2(p, n+m-2)$$

i.e., the Hotelling $T^2$ distribution with degrees of freedom $p$ and $n+m-2$. A worked numerical sketch follows.
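A minimal sketch of the two-sample computation on simulated (hypothetical) data, using the unbiased $n-1$ convention for $S_x$ and $S_y$:

```python
import numpy as np

rng = np.random.default_rng(3)
n, m, p = 30, 40, 3
X = rng.normal(size=(n, p))                     # sample 1
Y = rng.normal(size=(m, p))                     # sample 2 (same mean under H0)

Sx, Sy = np.cov(X, rowvar=False), np.cov(Y, rowvar=False)   # n-1 denominators
S_pooled = ((n - 1) * Sx + (m - 1) * Sy) / (n + m - 2)
diff = X.mean(axis=0) - Y.mean(axis=0)
T2 = (n * m) / (n + m) * diff @ np.linalg.solve(S_pooled, diff)
print(T2)                                       # compare with the T^2(p, n+m-2) distribution
```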


Question 1: Generalized Least Squares & Ridge Regression (Linear Models)

This question tests your understanding of estimator properties when assumptions are violated or modified.

  • (a) GLS Transformation: Consider the linear model $y = X\beta + \epsilon$ where $Var(\epsilon) = \sigma^2 \Psi$ and $\Psi$ is known but not Identity. Show how to find a matrix $L$ such that the transformed model satisfies standard OLS assumptions. Explicitly state the relationship between $L$ and $\Psi$.
  • (b) Ridge Bias: For the Ridge Regression estimator $\hat{\beta}_R = (X^\top X + \lambda I)^{-1}X^\top y$, prove that it is a biased estimator of $\beta$. Derive the specific expression for the bias $E[\hat{\beta}_R] - \beta$.
  • (c) Hat Matrix Trace: For a standard OLS model, the “Hat Matrix” is $H = X(X^\top X)^{-1}X^\top$. Prove that the trace of the Hat Matrix equals the number of predictors $p$. Explain what the diagonal elements $H_{ii}$ represent in terms of “leverage”.

Key topics: the GLS transformation matrix, the ridge bias derivation, the trace of the hat matrix

(a) GLS Transformation

Problem: find a matrix $L$ such that the transformed model $Ly = LX\beta + L\epsilon$ satisfies the standard OLS assumptions (white noise). Solution: we are given $\epsilon \sim \mathcal{N}(0, \sigma^2 \Psi)$, where $\Psi$ is a known positive definite matrix. For the new error term $\epsilon^* = L\epsilon$ to satisfy the OLS assumptions (covariance $\sigma^2 I$), we need:

$$Var(L\epsilon) = L Var(\epsilon) L^\top = L (\sigma^2 \Psi) L^\top = \sigma^2 (L \Psi L^\top)$$

We want $L \Psi L^\top = I$, which is equivalent to $L^\top L = \Psi^{-1}$. We can therefore take $L$ to be a Cholesky factor of $\Psi^{-1}$, or the inverse square root $\Psi^{-1/2}$ obtained from the eigendecomposition of $\Psi$. Concretely, with $\Psi^{-1} = L^\top L$ (Cholesky) or $L = \Psi^{-1/2}$, the transformed model

$$y^* = X^*\beta + \epsilon^*$$

where $y^* = Ly$ and $X^* = LX$, has errors satisfying $Var(\epsilon^*) = \sigma^2 I$ and can be estimated by OLS (see the sketch below).
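A minimal sketch of the transformation, assuming a known diagonal $\Psi$ (purely illustrative) and using $L = \Psi^{-1/2}$ built from the eigendecomposition:

```python
import numpy as np

rng = np.random.default_rng(4)
n, p = 200, 2
X = rng.normal(size=(n, p))
Psi = np.diag(rng.uniform(0.5, 3.0, size=n))      # known positive definite Psi
beta_true = np.array([1.0, -2.0])
y = X @ beta_true + rng.multivariate_normal(np.zeros(n), Psi)

w, V = np.linalg.eigh(Psi)
L = V @ np.diag(w ** -0.5) @ V.T                  # Psi^{-1/2}, so L Psi L^T = I
print(np.allclose(L @ Psi @ L.T, np.eye(n)))      # True: transformed noise is white

beta_gls, *_ = np.linalg.lstsq(L @ X, L @ y, rcond=None)  # OLS on the transformed model
print(beta_gls)                                   # close to beta_true
```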

(b) Ridge Estimator Bias

Problem: derive the expectation $E[\hat{\beta}_R]$ and show that it is biased. Solution: the ridge regression estimator is defined as

$$\hat{\beta}_R = (X^\top X + \lambda I)^{-1} X^\top y$$

Substituting the true model $y = X\beta + \epsilon$:

$$\hat{\beta}_R = (X^\top X + \lambda I)^{-1} X^\top (X\beta + \epsilon)$$

$$= (X^\top X + \lambda I)^{-1} (X^\top X)\beta + (X^\top X + \lambda I)^{-1} X^\top \epsilon$$

Taking expectations on both sides and using $E[\epsilon] = 0$:

$$E[\hat{\beta}_R] = (X^\top X + \lambda I)^{-1} (X^\top X)\beta$$

Note that if $\lambda = 0$ the leading factor reduces to the identity and we recover $\beta$ (unbiased). But for $\lambda > 0$, $(X^\top X + \lambda I)^{-1} (X^\top X) \neq I$, so $E[\hat{\beta}_R] \neq \beta$ and the ridge estimator is biased. The bias term is:

$$\text{Bias}(\hat{\beta}_R) = E[\hat{\beta}_R] - \beta = [(X^\top X + \lambda I)^{-1} (X^\top X) - I] \beta$$

(c) Trace of the Hat Matrix

Problem: show that $tr(H) = p$ and explain “leverage”. Solution: the hat matrix is defined as $H = X(X^\top X)^{-1}X^\top$. Using the cyclic property of the trace ($tr(ABC) = tr(CAB)$):

$$tr(H) = tr(X(X^\top X)^{-1}X^\top) = tr(X^\top X (X^\top X)^{-1})$$

Since $(X^\top X)(X^\top X)^{-1} = I_p$ (the $p \times p$ identity):

$$tr(H) = tr(I_p) = p$$

Interpretation: the diagonal element $H_{ii}$ is called the leverage score of observation $i$; it measures how strongly the $i$-th observation influences its own fitted value $\hat{y}_i$. Since $\sum_i H_{ii} = tr(H) = p$, the average leverage is $p/n$. A quick numerical check follows.
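A quick numpy check of the trace and leverage facts on a hypothetical design matrix:

```python
import numpy as np

rng = np.random.default_rng(5)
n, p = 20, 4
X = rng.normal(size=(n, p))
H = X @ np.linalg.solve(X.T @ X, X.T)   # hat matrix X (X^T X)^{-1} X^T
print(np.trace(H))                      # equals p = 4 (up to rounding)
print(np.diag(H).sum() / n)             # average leverage = p / n
```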


Question 2: Kernel PCA & Centering

Key topics: centering in feature space, the kernel trick, linear-kernel equivalence

Question 2: Kernel PCA & Centering (Feature Space)

This focuses on the “trickiest” part of Kernel PCA: data centering in the high-dimensional space.

  • (a) The Centering Problem: In Kernel PCA, we cannot explicitly compute the mean in feature space to center the data. We are given a kernel matrix $K_{ij} = \mathcal{K}(x_i, x_j)$. Let $C = I_n - \frac{1}{n}1_n 1_n^\top$ be the centering matrix. Prove (or explain the row/column operations) that the centered kernel matrix $\tilde{K}$ is computed as $\tilde{K} = CKC$.
  • (b) Eigenvalues: Does the centering operation $CKC$ change the eigenvalues compared to the uncentered $K$? Why is this step strictly necessary before performing the eigendecomposition to find Principal Components?.
  • (c) Linear Kernel Equivalence: Explain why performing Kernel PCA with the linear kernel $\mathcal{K}(a, b) = \langle a, b \rangle$ (and properly centering it) is mathematically equivalent to performing standard PCA on the original dataset.

(a) The Centering Matrix

Problem: show that the centered kernel matrix is $\tilde{K} = CKC$, where $C = I_n - \frac{1}{n}1_n 1_n^\top$. Solution: in feature space, let $\Phi$ be the uncentered matrix of feature vectors (one row per observation) and $\bar{\Phi}$ its centered version; we need to compute $\tilde{K} = \bar{\Phi}\bar{\Phi}^\top$. Centering is algebraically a left multiplication by the centering matrix, $\bar{\Phi} = C\Phi$ (just as centering an $n \times p$ data matrix $X$ is $CX$: the column means are removed). In terms of the kernel matrix itself, multiplying $K$ by $C$ on the left centers each column and multiplying on the right centers each row, so

$$\tilde{K} = C K C$$

Expanding:

$$\tilde{K} = (I - \frac{1}{n}11^\top) K (I - \frac{1}{n}11^\top) = K - \frac{1}{n}11^\top K - \frac{1}{n}K11^\top + \frac{1}{n^2}11^\top K 11^\top$$

This operation guarantees that, in the unknown feature space, the centroid of the data points is moved to the origin.

(b) Effect on Eigenvalues

Problem: does centering change the eigenvalues? Why is it necessary? Solution: yes, it changes them. The uncentered kernel matrix $K$ contains information about the distance from the origin to each data point, not just the relative variance structure among the points. Without centering, the first principal component may point toward the data mean (from the origin to the center of the data cloud) rather than along the direction of largest variance. PCA is defined as finding the directions of maximal variance; without subtracting the mean, $\frac{1}{n}X^\top X$ is only the second-moment matrix, not the covariance matrix. We therefore must use $\tilde{K} = CKC$ to ensure that we are decomposing the covariance structure.

(c) Linear Kernel Check

Problem: explain why linear-kernel Kernel PCA is equivalent to standard PCA. Solution: the linear kernel is $\mathcal{K}(x_i, x_j) = \langle x_i, x_j \rangle = x_i^\top x_j$, so the kernel matrix is $K = XX^\top$. After centering, $\tilde{K} = C(XX^\top)C = (CX)(CX)^\top$, and $CX$ is exactly the centered data matrix (call it $\tilde{X}$). Kernel PCA eigendecomposes $\tilde{K} = \tilde{X}\tilde{X}^\top$; standard PCA, in its dual form, eigendecomposes the same matrix, because $\tilde{X}^\top \tilde{X}$ and $\tilde{X}\tilde{X}^\top$ share their nonzero eigenvalues. Hence Kernel PCA with the linear kernel is mathematically equivalent to standard PCA (a small check follows).
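A small check of the centering formula and the linear-kernel equivalence on hypothetical data; the nonzero eigenvalues of the centered kernel match those of $\tilde{X}^\top \tilde{X}$:

```python
import numpy as np

rng = np.random.default_rng(6)
n, p = 8, 3
X = rng.normal(size=(n, p))
C = np.eye(n) - np.ones((n, n)) / n          # centering matrix
K = X @ X.T                                  # linear kernel
K_c = C @ K @ C                              # double-centered kernel
Xc = C @ X                                   # column-centered data
print(np.allclose(K_c, Xc @ Xc.T))           # True: CKC = (CX)(CX)^T

eig_K = np.sort(np.linalg.eigvalsh(K_c))[::-1][:p]
eig_S = np.sort(np.linalg.eigvalsh(Xc.T @ Xc))[::-1]
print(np.allclose(eig_K, eig_S))             # shared nonzero eigenvalues
```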


Question 3: Canonical Correlation Analysis (CCA) Setup

Key topics: the CCA objective, the whitening transformation, the connection to regression

Question 3: Canonical Correlation Analysis (CCA) Optimization

This tests the fundamental derivation of CCA using SVD, a core concept in the notes.

  • (a) The Objective: We want to find vectors $u, v$ to maximize $Corr(X^\top u, Y^\top v)$. Formulate this as a constrained optimization problem: “Maximize $u^\top S_{XY} v$ subject to…”? Explain why we set the variance constraints to 1.
  • (b) Whitening Step: The notes define a “transformed” or “whitened” cross-covariance matrix $\tilde{S}_{xy} = S_x^{-1/2}S_{xy}S_y^{-1/2}$. Show how the singular values of this specific matrix $\tilde{S}_{xy}$ relate to the canonical correlations.
  • (c) Relation to Regression: If $Y$ is univariate ($q=1$), show that the first canonical direction vector $u$ is proportional to the Ordinary Least Squares (OLS) regression coefficient $\hat{\beta}$.

(a) The Objective

Problem: write down the optimization problem for CCA. Solution: CCA seeks projection vectors $u$ (for $X$) and $v$ (for $Y$) that maximize the correlation between the projected variables. Because correlation is scale invariant, we fix the projected variances at 1 to make the solution unique. Optimization problem:

$$\text{Maximize } u^\top S_{XY} v$$

Subject to constraints:

$$u^\top S_X u = 1$$

$$v^\top S_Y v = 1$$

where $S_X, S_Y$ are the sample covariance matrices and $S_{XY}$ is the cross-covariance matrix.

(b) Whitening & Solution

Problem: how is the solution found using $\tilde{S}_{xy}$? Solution: to solve the constrained problem above we “whiten” via a change of variables. Define the transformed vectors $\tilde{u} = S_X^{1/2}u$ and $\tilde{v} = S_Y^{1/2}v$, so the constraints become Euclidean norm constraints $||\tilde{u}||^2 = 1$ and $||\tilde{v}||^2 = 1$. The objective becomes:

$$u^\top S_{XY} v = (S_X^{-1/2}\tilde{u})^\top S_{XY} (S_Y^{-1/2}\tilde{v}) = \tilde{u}^\top (S_X^{-1/2} S_{XY} S_Y^{-1/2}) \tilde{v}$$

Define the whitened matrix $\tilde{S}_{xy} = S_X^{-1/2} S_{XY} S_Y^{-1/2}$. The optimal $\tilde{u}, \tilde{v}$ are exactly the leading left and right singular vectors of $\tilde{S}_{xy}$, and the largest canonical correlation is its largest singular value (see the sketch below).
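A minimal sketch of this SVD recipe on simulated (hypothetical) data: the top singular value of the whitened cross-covariance equals the sample correlation of the projected variables.

```python
import numpy as np

def inv_sqrt(S):
    # symmetric inverse square root via the spectral decomposition
    w, V = np.linalg.eigh(S)
    return V @ np.diag(w ** -0.5) @ V.T

rng = np.random.default_rng(7)
n = 500
X = rng.normal(size=(n, 3))
Y = 0.5 * X[:, :2] + rng.normal(size=(n, 2))          # Y correlated with X

Xc, Yc = X - X.mean(0), Y - Y.mean(0)
Sx, Sy, Sxy = Xc.T @ Xc / n, Yc.T @ Yc / n, Xc.T @ Yc / n
M = inv_sqrt(Sx) @ Sxy @ inv_sqrt(Sy)                 # whitened cross-covariance
U, d, Vt = np.linalg.svd(M)
u, v = inv_sqrt(Sx) @ U[:, 0], inv_sqrt(Sy) @ Vt[0]   # first canonical directions
print(d[0], np.corrcoef(Xc @ u, Yc @ v)[0, 1])        # the two numbers agree
```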

(c) Relation to Regression

Problem: if $Y$ is univariate ($q=1$), show that $u$ is proportional to the OLS coefficient $\hat{\beta}$. Solution: when $Y$ is a single variable $y$, $S_Y$ is the scalar variance $s_y^2$ and $S_{XY}$ is the vector $s_{xy}$. CCA maximizes $Corr(X^\top u, y)$, which is equivalent to maximizing the coefficient of determination $R^2$, i.e., finding the linear combination of $X$ closest to $y$. That is exactly the goal of multiple linear regression, whose coefficient is $\hat{\beta} = S_X^{-1} s_{xy}$. In CCA we maximize $u^\top s_{xy}$ subject to $u^\top S_X u = 1$; by the Cauchy-Schwarz inequality (or Lagrange multipliers), the optimal $u$ is parallel to $S_X^{-1} s_{xy}$. Hence the CCA direction $u$ is proportional to the OLS regression coefficient $\hat{\beta}$.

Alex, no problem. Since this is an open-book exam and you want material you can print and bring into the exam room, this set of questions focuses entirely on the hard derivations and counter-examples that appeared in the “Note Practice” and “Midterm 1/2” materials.

This Round 3 set avoids the previous round's Kernel/CCA/FA topics and instead attacks matrix calculus, detailed MLE derivations, and the classic ICA counter-example. These are the inconspicuous parts of the notes that are most likely to show up as long-answer questions.


STA437 Final Exam - Round 3 Prediction

Focus: Derivations, Note Practice & Properties


Question 1: Spectral Properties of the “Residual Maker” (Linear Algebra)

[Source: Simulation Pt.1 Q1 & Midterm 1 Review] This question tests the geometric interpretation of Least Squares using SVD.

  • (a) SVD Representation of Hat Matrix: Let $H = X(X^\top X)^{-1}X^\top$ be the Hat matrix. Using the Thin Singular Value Decomposition $X = UDV^\top$ (where $U \in \mathbb{R}^{n \times p}$ is column-orthogonal, $U^\top U = I_p$), show that $H$ simplifies to $UU^\top$.
  • (b) Eigenvalues of the Residual Maker: Let $M = I_n - H$ be the matrix that generates residuals ($e = My$). Determine the eigenvalues of $M$ and their multiplicities. Based on this, explain why $M$ is Positive Semi-Definite (PSD).
  • (c) Trace of H: Show that the trace of the Hat matrix equals the rank of $X$ (assume full rank $p$).


Solution 1: Spectral Properties of the “Residual Maker”

(a) SVD Representation of $H$

  • Definition: $H = X(X^\top X)^{-1}X^\top$.
  • Substitute SVD: Let $X = UDV^\top$ (Thin SVD, $U \in \mathbb{R}^{n \times p}, D \in \mathbb{R}^{p \times p}, V \in \mathbb{R}^{p \times p}$). Note that $V$ is orthogonal ($V^\top V = VV^\top = I_p$) and $U$ is column-orthogonal ($U^\top U = I_p$).
  • Compute inner term: $X^\top X = (UDV^\top)^\top (UDV^\top) = V D^\top U^\top U D V^\top = V D I_p D V^\top = V D^2 V^\top$.
  • Compute Inverse: $(X^\top X)^{-1} = (V D^2 V^\top)^{-1} = (V^\top)^{-1} (D^2)^{-1} V^{-1} = V D^{-2} V^\top$.
  • Substitute back into H: $H = (UDV^\top) (V D^{-2} V^\top) (UDV^\top)^\top$ $H = U D (V^\top V) D^{-2} (V^\top V) D U^\top$ $H = U D I D^{-2} I D U^\top = U (D D^{-2} D) U^\top = U I U^\top = UU^\top$.
  • Result: $H = UU^\top$.

(b) Eigenvalues of $M$

  • Definition: $M = I_n - H = I_n - UU^\top$.
  • Eigenvalues of $H$: Since $H = UU^\top$ is a projection matrix onto a p-dimensional subspace (span of U), it has $p$ eigenvalues equal to 1, and $n-p$ eigenvalues equal to 0.
  • Eigenvalues of $M$: The eigenvalues of $I - H$ are simply $1 - \lambda_i(H)$.
    • For the $p$ eigenvalues where $\lambda(H)=1$: $\lambda(M) = 1 - 1 = 0$.
    • For the $n-p$ eigenvalues where $\lambda(H)=0$: $\lambda(M) = 1 - 0 = 1$.
  • Conclusion: The eigenvalues are 1 (with multiplicity $n-p$) and 0 (with multiplicity $p$).
  • PSD Property: Since all eigenvalues $\lambda_i \ge 0$, the matrix $M$ is Positive Semi-Definite (PSD).

(c) Trace of $H$

  • Using the property $H=UU^\top$: $tr(H) = tr(UU^\top)$.
  • Using the cyclic property of trace ($tr(AB) = tr(BA)$): $tr(H) = tr(U^\top U)$.
  • Since $U$ is column-orthogonal ($U^\top U = I_p$): $tr(H) = tr(I_p) = p$.

Question 2: Matrix Gradient & Optimization (PCA Foundation)

[Source: Simulation Pt.1 Q2(d), Exercise 13/14] This derivation is the mathematical foundation for “Minimum Reconstruction Error” in PCA.

  • (a) Gradient Derivation: Define the objective function $f(A) = ||X - A||_F^2$. Using matrix calculus properties (specifically $\nabla_A tr(A^\top A) = 2A$), derive the gradient $\nabla_A f(A)$.
  • (b) Optimal A: Show that setting the gradient to zero implies $A=X$.
  • (c) Rank Constraint Connection: If we impose a constraint that $rank(A) = r < p$, why can’t we just set $A=X$? Briefly explain how the Eckart-Young theorem modifies the solution derived in (b).


Solution 2: Matrix Gradient & Optimization

(a) Gradient Derivation

  • Objective: $f(A) = ||X - A||_F^2 = tr((X-A)^\top (X-A))$.
  • Expand: $f(A) = tr(X^\top X - X^\top A - A^\top X + A^\top A)$. Using $tr(X^\top A) = tr(A^\top X)$, we get: $f(A) = tr(X^\top X) - 2tr(A^\top X) + tr(A^\top A)$.
  • Differentiate wrt A:
    • $\nabla_A (constant) = 0$.
    • $\nabla_A (-2tr(A^\top X)) = -2X$.
    • $\nabla_A tr(A^\top A) = 2A$.
  • Result: $\nabla_A f(A) = 2A - 2X$.

(b) Optimal A

  • Set gradient to zero: $2A - 2X = 0 \implies 2A = 2X \implies A = X$.
  • This confirms that without constraints, the best approximation of a matrix is the matrix itself.

(c) Eckart-Young / Rank Constraint

  • If we require $rank(A) = r < p$, we cannot simply set $A=X$ (which has rank $p$).
  • The Eckart-Young theorem states that under the spectral norm or Frobenius norm, the best rank-$r$ approximation is given by the Truncated SVD: $A_{opt} = \sum_{i=1}^r d_i u_i v_i^\top = U_r D_r V_r^\top$.
  • This connects the optimization problem to PCA: PCA finds the subspace that minimizes this reconstruction error.

Question 3: MLE of Multivariate Normal Covariance

[Source: Note 8 FA / Midterm Review / Source 503-535] A classic “Knowledge Part” derivation that often appears to test understanding of the Trace Trick.

  • (a) The Trace Trick: The log-likelihood for MVN includes the term $\sum_{i=1}^n (x_i - \mu)^\top \Sigma^{-1} (x_i - \mu)$. Using the cyclic property of the trace ($tr(ABC)=tr(CAB)$), prove that this sum can be rewritten as $n \cdot tr(\Sigma^{-1} S)$, where $S$ is the MLE sample covariance (using $1/n$).
  • (b) Optimizing $\Sigma$: Considering the simplified log-likelihood $l(\Sigma) \propto -\frac{n}{2}\log|\Sigma| - \frac{n}{2}tr(\Sigma^{-1}S)$, differentiate with respect to $\Sigma^{-1}$ (or $\Sigma$) to prove that the Maximum Likelihood Estimator is $\hat{\Sigma} = S$.

Solution 3: MLE of Multivariate Normal Covariance

(a) The Trace Trick

  • Log-likelihood term: $\text{Sum} = \sum_{i=1}^n (x_i - \mu)^\top \Sigma^{-1} (x_i - \mu)$.
  • Note that $(x_i - \mu)^\top \Sigma^{-1} (x_i - \mu)$ is a scalar, so it equals its own trace.
  • $\text{Sum} = \sum_{i=1}^n tr((x_i - \mu)^\top \Sigma^{-1} (x_i - \mu))$.
  • Cyclic Property ($tr(ABC)=tr(CAB)$): $= \sum_{i=1}^n tr(\Sigma^{-1} (x_i - \mu)(x_i - \mu)^\top)$.
  • Linearity of Trace: Move summation inside. $= tr(\Sigma^{-1} \sum_{i=1}^n (x_i - \mu)(x_i - \mu)^\top)$.
  • Substitute Sample Covariance: Since $nS = \sum (x_i - \mu)(x_i - \mu)^\top$ (assuming $\mu = \bar{x}$ for MLE): $= tr(\Sigma^{-1} (nS)) = n \cdot tr(\Sigma^{-1} S)$.

(b) Optimizing $\Sigma$

  • Simplified Log-Likelihood: $l(\Sigma) = -\frac{n}{2}\log|\Sigma| - \frac{n}{2}tr(\Sigma^{-1}S)$.
  • Let $\Lambda = \Sigma^{-1}$ (Precision Matrix) for easier differentiation. $l(\Lambda) = \frac{n}{2}\log|\Lambda| - \frac{n}{2}tr(\Lambda S)$. (Since $\log|\Sigma| = -\log|\Sigma^{-1}|$).
  • Differentiate wrt $\Lambda$:
    • $\frac{\partial}{\partial \Lambda} \log|\Lambda| = \Lambda^{-1} = \Sigma$.
    • $\frac{\partial}{\partial \Lambda} tr(\Lambda S) = S^\top = S$ (S is symmetric).
  • First Order Condition: $\nabla_\Lambda l = \frac{n}{2}\Sigma - \frac{n}{2}S = 0$.
  • Result: $\Sigma = S$. Thus, the MLE $\hat{\Sigma} = S_{MLE} = \frac{1}{n}\sum (x_i - \bar{x})(x_i - \bar{x})^\top$.
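A numerical check of the trace trick from (a), with an arbitrary (hypothetical) SPD matrix standing in for $\Sigma$:

```python
import numpy as np

rng = np.random.default_rng(8)
n, p = 200, 3
X = rng.normal(size=(n, p))
A = rng.normal(size=(p, p))
Sigma = A @ A.T + np.eye(p)                      # some SPD "Sigma" for the check

D = X - X.mean(0)                                # centered rows x_i - xbar
S = D.T @ D / n                                  # MLE sample covariance (1/n)
Sigma_inv = np.linalg.inv(Sigma)
lhs = np.einsum('ij,jk,ik->', D, Sigma_inv, D)   # sum_i (x_i-xbar)' Sigma^{-1} (x_i-xbar)
rhs = n * np.trace(Sigma_inv @ S)
print(np.isclose(lhs, rhs))                      # True
```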

Question 4: The “Diamond” Distribution (ICA vs PCA)

[Source: Note 9 ICA / Simulation Pt.3 Q2] This is the specific counter-example from the Note Practice used to justify ICA.

  • (a) Covariance Calculation: Consider a random vector $S = (S_1, S_2)^\top$ uniformly distributed on 4 points: $(1,0), (-1,0), (0,1), (0,-1)$. Calculate the covariance matrix $Cov(S)$ and show that $S_1$ and $S_2$ are uncorrelated.
  • (b) Independence Check: Prove that despite being uncorrelated, $S_1$ and $S_2$ are not independent (Check $P(S_1=1, S_2=1)$ vs marginal probabilities).
  • (c) PCA vs ICA: Explain why PCA cannot separate these signals (hint: look at the covariance matrix from part a), whereas ICA can.


Solution 4: The “Diamond” Distribution (ICA vs PCA)

(a) Covariance Calculation

  • Data: $P(1,0)=P(-1,0)=P(0,1)=P(0,-1) = 1/4$.
  • Means: $E[S_1] = 1(1/4) + (-1)(1/4) + 0 + 0 = 0$. Similarly $E[S_2] = 0$.
  • Covariance: $Cov(S_1, S_2) = E[S_1 S_2] - E[S_1]E[S_2]$. $E[S_1 S_2] = (1)(0)\frac{1}{4} + (-1)(0)\frac{1}{4} + (0)(1)\frac{1}{4} + (0)(-1)\frac{1}{4} = 0$.
  • Result: $Cov(S_1, S_2) = 0$, so the variables are uncorrelated. Since $E[S_1^2] = E[S_2^2] = 1/2$, the covariance matrix is $\frac{1}{2}I$ (proportional to the identity).

(b) Independence Check

  • Check joint probability at $(1, 1)$. The point $(1,1)$ does not exist in the dataset. So $P(S_1=1, S_2=1) = 0$.
  • Check marginal probabilities: $P(S_1=1) = P((1,0)) = 1/4$. $P(S_2=1) = P((0,1)) = 1/4$.
  • Test: $P(S_1=1, S_2=1) = 0 \neq P(S_1=1)P(S_2=1) = 1/16$.
  • Conclusion: They are dependent (Not independent).

(c) PCA vs ICA

  • PCA: PCA looks at the covariance matrix $S$. Here $S = \frac{1}{2}I$. Since the eigenvalues are equal ($\lambda_1 = \lambda_2 = \frac{1}{2}$), PCA cannot find a unique “principal direction”. Any orthogonal rotation is equally valid for PCA. It cannot distinguish the “corners” of the diamond.
  • ICA: ICA maximizes Non-Gaussianity (|Kurtosis|). The projection onto an axis (e.g., $w=(1,0)$) takes values in $\{-1, 0, 1\}$ and has excess kurtosis $-1$, while the projection onto a 45-degree line ($w=(1/\sqrt{2}, 1/\sqrt{2})$) takes values $\pm 1/\sqrt{2}$ and has excess kurtosis $-2$. The kurtosis is therefore extremized along the diagonal directions, which are exactly the directions in which the components become independent ($(S_1+S_2)/\sqrt{2}$ and $(S_1-S_2)/\sqrt{2}$ are independent $\pm 1/\sqrt{2}$ variables). By following the kurtosis, ICA recovers this structure, which PCA cannot see at all (a numerical check follows).
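A small check of these numbers, assuming the four equally likely “diamond” points:

```python
import numpy as np

pts = np.array([[1, 0], [-1, 0], [0, 1], [0, -1]], dtype=float)  # each w.p. 1/4

def excess_kurtosis(y):
    # population excess kurtosis for equally likely values y
    y = y - y.mean()
    return np.mean(y**4) / np.mean(y**2) ** 2 - 3

print(np.cov(pts.T, bias=True))                       # 0.5 * I: uncorrelated
w_axis = np.array([1.0, 0.0])
w_diag = np.array([1.0, 1.0]) / np.sqrt(2)
print(excess_kurtosis(pts @ w_axis))                  # -1 along a coordinate axis
print(excess_kurtosis(pts @ w_diag))                  # -2 along the 45-degree diagonal
```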

Question 5: Precision Matrix & Conditional Independence

[Source: Knowledge Part / Block Matrix Inverse] Tests the relationship between the inverse covariance matrix and graphical models.

  • (a) Block Inverse Formula: For a partitioned covariance matrix $\Sigma = \begin{pmatrix} \Sigma_{11} & \Sigma_{12} \\ \Sigma_{21} & \Sigma_{22} \end{pmatrix}$, use the Schur Complement to write down the expression for the top-left block of the precision matrix $(\Sigma^{-1})_{11}$.
  • (b) Conditional Independence: If the $(i,j)$-th entry of the precision matrix is zero ($\omega_{ij} = (\Sigma^{-1})_{ij} = 0$), prove (or explain using the PDF exponent) that variables $X_i$ and $X_j$ are conditionally independent given all other variables.


Solution 5: Precision Matrix & Conditional Independence

(a) Block Inverse Formula

  • Given $\Sigma = \begin{pmatrix} \Sigma_{11} & \Sigma_{12} \\ \Sigma_{21} & \Sigma_{22} \end{pmatrix}$.
  • The inverse $\Omega = \Sigma^{-1}$ has its top-left block given by the inverse of the Schur Complement: $(\Sigma^{-1})_{11} = (\Sigma_{11} - \Sigma_{12}\Sigma_{22}^{-1}\Sigma_{21})^{-1}$.
  • (Alternatively denoted as $(\Sigma / \Sigma_{22})^{-1}$).

(b) Conditional Independence

  • The joint PDF of MVN is proportional to $\exp(-\frac{1}{2} x^\top \Sigma^{-1} x)$.
  • Let $\Omega = \Sigma^{-1}$. The term in the exponent is $-\frac{1}{2} \sum_i \sum_j x_i x_j \omega_{ij}$.
  • The interaction term between $x_i$ and $x_j$ is determined solely by $\omega_{ij}$.
  • If $\omega_{ij} = 0$, there is no $x_i x_j$ term in the exponent.
  • This means the conditional density $f(x_i, x_j | \text{others})$ factors into $g(x_i)h(x_j)$, which implies conditional independence.

Question 6: Derivation of Two-Sample Hotelling’s $T^2$

[Source: Simulation Pt.1 MVN / Source 560-581] The most complex derivation in the “MVN Testing” section.

  • (a) The Z Vector: We want to test $H_0: \mu_x = \mu_y$. Construct a standardized vector $Z$ involving $(\bar{x} - \bar{y})$ and $\Sigma$ that follows $N_p(0, I_p)$.
  • (b) The Wishart Matrix: Write down the definition of the pooled covariance $S_{pooled}$ (using the definition $(n+m-2)S_{pooled} = nS_x + mS_y$) and the corresponding Wishart matrix $M$.
  • (c) Constructing $T^2$: Combine $Z$ and $M$ into the definition $T^2 = (n+m-2)Z^\top M^{-1} Z$. Substitute back to derive the final formula: $T^2 = \frac{nm}{n+m}(\bar{x} - \bar{y})^\top S_{pooled}^{-1} (\bar{x} - \bar{y})$.


Solution 6: Derivation of Two-Sample Hotelling’s $T^2$

(a) The Z Vector

  • Goal: Standardize the difference of means $(\bar{x} - \bar{y})$ under $H_0: \mu_x = \mu_y$.
  • Variance of Difference: $Var(\bar{x} - \bar{y}) = Var(\bar{x}) + Var(\bar{y}) = \frac{\Sigma}{n} + \frac{\Sigma}{m} = (\frac{1}{n} + \frac{1}{m})\Sigma = (\frac{n+m}{nm})\Sigma$.
  • Construct Z: We standardize by the inverse square root of both the scalar factor and the covariance matrix. $Z = \sqrt{\frac{nm}{n+m}} \Sigma^{-1/2} (\bar{x} - \bar{y})$.
  • Distribution: $Z \sim N_p(0, I_p)$.

(b) The Wishart Matrix

  • Pooled Covariance: $(n+m-2)S_{pooled} = nS_x + mS_y = \sum (x_i-\bar{x})(x_i-\bar{x})^\top + \sum (y_i-\bar{y})(y_i-\bar{y})^\top$.
  • Wishart Matrix M: We define $W = (n+m-2)S_{pooled}$. We need the “Whitened” Wishart matrix $M$ corresponding to $Z$’s scale. $M = \Sigma^{-1/2} W \Sigma^{-1/2}$. $M \sim W_p(I_p, n+m-2)$.

(c) Constructing $T^2$

  • Definition: Hotelling’s $T^2$ is built from a standard normal vector and a Wishart matrix (analogous to $t^2 = z^2 / (s^2/\sigma^2)$ in one dimension). $T^2 = (df) Z^\top M^{-1} Z = (n+m-2) Z^\top M^{-1} Z$.
  • Substitution: Substitute $Z$ and $M$ back: $T^2 = (n+m-2) \left[ \sqrt{\frac{nm}{n+m}} (\bar{x}-\bar{y})^\top \Sigma^{-1/2} \right] \left[ \Sigma^{-1/2} W \Sigma^{-1/2} \right]^{-1} \left[ \sqrt{\frac{nm}{n+m}} \Sigma^{-1/2} (\bar{x}-\bar{y}) \right]$.
  • Simplify: The $\sqrt{\dots}$ terms square to $\frac{nm}{n+m}$. The inverse term: $[\Sigma^{-1/2} W \Sigma^{-1/2}]^{-1} = \Sigma^{1/2} W^{-1} \Sigma^{1/2}$. The $\Sigma$ terms cancel: $\Sigma^{-1/2} \Sigma^{1/2} = I$. We are left with: $T^2 = (n+m-2) \frac{nm}{n+m} (\bar{x}-\bar{y})^\top W^{-1} (\bar{x}-\bar{y})$.
  • Final Form: Since $W^{-1} = ((n+m-2)S_{pooled})^{-1} = \frac{1}{n+m-2} S_{pooled}^{-1}$, the $(n+m-2)$ cancels out. $T^2 = \frac{nm}{n+m} (\bar{x} - \bar{y})^\top S_{pooled}^{-1} (\bar{x} - \bar{y})$.

Factor Analysis (FA): New Predictions

Question 1: Factor Scoring - Bartlett vs. Regression Method

[Source: Note 8, Source 941-949] This question tests your ability to derive the estimators for the latent factors $Z$, a crucial step after fitting the model.

  • (a) The Weighted Regression (Bartlett’s) Approach: Assume the model $X \approx LZ + \epsilon$, where $\epsilon \sim \mathcal{N}(0, \Psi)$. We want to estimate $Z$ given an observation $x$. Set up the weighted least squares objective function: $$J(z) = (x - Lz)^\top \Psi^{-1} (x - Lz)$$ Derive the estimator $\hat{z}_{Bartlett}$ by minimizing this function with respect to $z$.
  • (b) The PCA Approach (Unweighted): If we fit FA via PCA (assuming $\Psi = \sigma^2 I$ or ignoring specific variances), the objective simplifies to unweighted least squares. Write down the objective and the resulting estimator $\hat{z}_{PCA}$.
  • (c) Comparison: Explain why the Bartlett method is generally preferred for Factor Analysis compared to the PCA method when $\Psi$ is not a scaled identity matrix.

FA Question 1 Solution: Factor Scoring

(a) Bartlett’s Method (Weighted Regression)

  • Objective: Minimize the squared error weighted by the inverse of the specific variances (noise). $$J(z) = (x - Lz)^\top \Psi^{-1} (x - Lz)$$
  • Gradient: Differentiate w.r.t $z$: $$\nabla_z J(z) = -2 L^\top \Psi^{-1} (x - Lz)$$
  • Solve: Set gradient to 0: $$L^\top \Psi^{-1} L z = L^\top \Psi^{-1} x$$ $$\hat{z}_{Bartlett} = (L^\top \Psi^{-1} L)^{-1} L^\top \Psi^{-1} x$$

(b) PCA Method (Unweighted)

  • Objective: Standard least squares (assuming $\Psi = I$ or similar). $$J_{PCA}(z) = ||x - Lz||^2 = (x - Lz)^\top (x - Lz)$$
  • Estimator: From standard OLS results: $$\hat{z}_{PCA} = (L^\top L)^{-1} L^\top x$$

(c) Comparison

  • Bartlett’s method accounts for heteroscedasticity. Since $\Psi = \text{diag}(\psi_1, ..., \psi_p)$, variables with high specific variance (high noise, large $\psi_i$) are given less weight ($\psi_i^{-1}$ is small) in determining the factor score.
  • The PCA method treats all variables as equally reliable, which is suboptimal if the uniquenesses ($\psi_i$) vary significantly across variables.
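A minimal sketch computing both factor-score estimators for a single observation, with hypothetical fitted $L$ and $\Psi$ (all numbers made up):

```python
import numpy as np

rng = np.random.default_rng(9)
p, r = 5, 2
L = rng.normal(size=(p, r))                              # "fitted" loadings
Psi_inv = np.diag(1.0 / rng.uniform(0.2, 2.0, size=p))   # Psi^{-1}, Psi diagonal
x = rng.normal(size=p)                                   # one (centered) observation

z_bartlett = np.linalg.solve(L.T @ Psi_inv @ L, L.T @ Psi_inv @ x)
z_pca      = np.linalg.solve(L.T @ L, L.T @ x)
print(z_bartlett)
print(z_pca)    # differs from Bartlett unless Psi is proportional to the identity
```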

Question 2: Non-Existence of FA Solution (Counter-Example)

[Source: Note 8, Source 120-128] The notes explicitly provide an example where a covariance matrix cannot be generated by a 1-factor model. This tests “Model Checking”.

  • (a) System of Equations: Consider a 3-dimensional random vector $X$ with covariance matrix: $$\Sigma = \begin{pmatrix} 1 & 0.9 & 0.7 \\ 0.9 & 1 & 0.4 \\ 0.7 & 0.4 & 1 \end{pmatrix}$$ We wish to fit a 1-factor model ($r=1$) such that $\Sigma = LL^\top + \Psi$. Write down the three equations relating the off-diagonal entries ($\sigma_{12}, \sigma_{23}, \sigma_{13}$) to the loading vector $L = (l_{11}, l_{21}, l_{31})^\top$.
  • (b) Inconsistency Proof: Solve this system for the absolute values of the loadings. Show that the value for $l_{11}$ implies a contradiction or an impossible statistical property (specifically, check if the resulting specific variance $\psi_1 = \sigma_{11} - l_{11}^2$ is valid, or if the correlation structure itself is violated). (Self-Correction based on Note: The note asks to “Find value of $l_{11}$” and “Show that it’s not possible for $\psi_1 = Var(\epsilon_1)$” .)

FA Question 2 Solution: Non-Existence of Solution

(a) System of Equations The 1-factor model implies $\Sigma \approx LL^\top$ for off-diagonal elements (since $\Psi$ is diagonal). Let $L = [l_1, l_2, l_3]^\top$. The equations are:

  1. $l_1 l_2 = 0.9$
  2. $l_1 l_3 = 0.7$
  3. $l_2 l_3 = 0.4$

(b) Inconsistency Proof

  • Multiply the first two equations: $(l_1 l_2)(l_1 l_3) = 0.9 \times 0.7 = 0.63$. $$l_1^2 (l_2 l_3) = 0.63$$
  • Substitute equation 3 ($l_2 l_3 = 0.4$): $$l_1^2 (0.4) = 0.63 \implies l_1^2 = \frac{0.63}{0.4} = 1.575$$
  • Check Variance Constraint: The model states $\Sigma_{11} = l_1^2 + \psi_1 = 1$. Since specific variance $\psi_1$ must be non-negative ($\psi_1 \ge 0$), we require $l_1^2 \le 1$.
  • Contradiction: We found $l_1^2 = 1.575 > 1$. This implies $\psi_1 = 1 - 1.575 = -0.575$, which is impossible for a variance.
  • Conclusion: No valid 1-factor model exists for this covariance matrix.

Independent Component Analysis (ICA): New Predictions

Question 3: Kurtosis of Sums (Derivation)

[Source: Note 9, Source 99-103, 149-151] This is a core property used to justify maximizing kurtosis. The notes list this derivation as “Exercise 10 Q3”.

  • (a) The Formula: Let $y_1$ and $y_2$ be independent random variables with zero mean. Let their variances be $\sigma_1^2, \sigma_2^2$ and their excess kurtoses be $\mathcal{K}(y_1), \mathcal{K}(y_2)$. Derive the formula for the kurtosis of their sum: $$\mathcal{K}(y_1 + y_2) = \frac{\sigma_1^4 \mathcal{K}(y_1) + \sigma_2^4 \mathcal{K}(y_2)}{(\sigma_1^2 + \sigma_2^2)^2}$$ (Hint: Expand $E[(y_1+y_2)^4]$ using independence and the binomial theorem).
  • (b) CLT Implication: If $y_1$ and $y_2$ are identically distributed with variance 1 and kurtosis $\kappa$, what is $\mathcal{K}(y_1 + y_2)$? Does it move closer to 0 (Gaussian)?

ICA Question 3 Solution: Kurtosis of Sums

(a) Derivation

  • Definition: Excess Kurtosis $\mathcal{K}(y) = E[y^4]/(\sigma^2)^2 - 3$.
  • Setup: Let $S = y_1 + y_2$. Since $y_i$ are independent with mean 0: $Var(S) = \sigma_1^2 + \sigma_2^2$.
  • Fourth Moment: $E[(y_1+y_2)^4] = E[y_1^4 + 4y_1^3y_2 + 6y_1^2y_2^2 + 4y_1y_2^3 + y_2^4]$. By independence the expectation factors over each cross term, and every cross term containing a first power of $y_1$ or $y_2$ vanishes because $E[y_i]=0$ (e.g., $4E[y_1^3 y_2] = 4E[y_1^3]E[y_2] = 0$). The only surviving cross term is $6E[y_1^2]E[y_2^2] = 6\sigma_1^2 \sigma_2^2$. So, $E[S^4] = E[y_1^4] + 6\sigma_1^2\sigma_2^2 + E[y_2^4]$.
  • Substitute Kurtosis: Since $E[y_i^4] = (\mathcal{K}(y_i)+3)\sigma_i^4$: $E[S^4] = (\mathcal{K}(y_1)+3)\sigma_1^4 + 6\sigma_1^2\sigma_2^2 + (\mathcal{K}(y_2)+3)\sigma_2^4$ $= \mathcal{K}(y_1)\sigma_1^4 + \mathcal{K}(y_2)\sigma_2^4 + 3(\sigma_1^4 + 2\sigma_1^2\sigma_2^2 + \sigma_2^4)$ $= \mathcal{K}(y_1)\sigma_1^4 + \mathcal{K}(y_2)\sigma_2^4 + 3(\sigma_1^2 + \sigma_2^2)^2$.
  • Final Kurtosis: $\mathcal{K}(S) = \frac{E[S^4]}{(\sigma_1^2+\sigma_2^2)^2} - 3$ $\mathcal{K}(S) = \frac{\sigma_1^4 \mathcal{K}(y_1) + \sigma_2^4 \mathcal{K}(y_2) + 3(\text{Var})^2}{(\text{Var})^2} - 3$ $\mathcal{K}(S) = \frac{\sigma_1^4 \mathcal{K}(y_1) + \sigma_2^4 \mathcal{K}(y_2)}{(\sigma_1^2+\sigma_2^2)^2} + 3 - 3$ $\mathcal{K}(S) = \frac{\sigma_1^4 \mathcal{K}(y_1) + \sigma_2^4 \mathcal{K}(y_2)}{(\sigma_1^2+\sigma_2^2)^2}$.

(b) CLT Implication If $\sigma_1 = \sigma_2 = 1$ and $\mathcal{K}(y_1) = \mathcal{K}(y_2) = \kappa$:

$$\mathcal{K}(Sum) = \frac{1 \cdot \kappa + 1 \cdot \kappa}{(1+1)^2} = \frac{2\kappa}{4} = \frac{\kappa}{2}$$

The kurtosis is halved. As we add more variables, the kurtosis approaches 0. This confirms that sums of independent variables become more Gaussian (Central Limit Theorem).
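A Monte Carlo check of the formula derived in (a), using two independent zero-mean sources (a Laplace and a uniform variable, chosen purely for illustration):

```python
import numpy as np
from scipy.stats import kurtosis

rng = np.random.default_rng(10)
N = 500_000
y1 = rng.laplace(scale=2.0, size=N)       # variance 8, excess kurtosis 3
y2 = rng.uniform(-3, 3, size=N)           # variance 3, excess kurtosis -1.2

s1, s2 = y1.var(), y2.var()
k1, k2 = kurtosis(y1), kurtosis(y2)
predicted = (s1**2 * k1 + s2**2 * k2) / (s1 + s2) ** 2
print(predicted, kurtosis(y1 + y2))       # agree up to Monte Carlo noise
```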


Question 4: The Permutation Ambiguity (Matrix Algebra)

[Source: Note 9, Source 135-142] ICA cannot recover the order of sources. This question formalizes that using Permutation Matrices.

  • (a) Permutation Matrix: Define a permutation matrix $P$ that swaps the $i$-th and $j$-th elements of a vector. Write down $P$ explicitly for a 2D case where it swaps the 1st and 2nd elements.
  • (b) Invariance: Let the ICA model be $X = LZ$. Suppose we permute the sources to define $\tilde{Z} = PZ$. Find the new mixing matrix $\tilde{L}$ such that $X = \tilde{L}\tilde{Z}$ holds.
  • (c) Orthogonality: Show that if the original sources $Z$ had $Cov(Z)=I$, the permuted sources $\tilde{Z}$ also satisfy $Cov(\tilde{Z})=I$. (Hint: Use the property $P P^\top = I$).

ICA Question 4 Solution: Permutation Ambiguity

(a) Permutation Matrix A matrix $P$ that swaps the 1st and 2nd elements in $\mathbb{R}^2$ is:

$$P = \begin{pmatrix} 0 & 1 \\ 1 & 0 \end{pmatrix}$$

(b) New Mixing Matrix

  • Model: $X = LZ$.
  • New sources: $\tilde{Z} = PZ \implies Z = P^{-1}\tilde{Z} = P^\top \tilde{Z}$ (Since P is orthogonal).
  • Substitute back: $X = L (P^\top \tilde{Z}) = (L P^\top) \tilde{Z}$.
  • Therefore, the new mixing matrix is $\tilde{L} = L P^\top$. This effectively permutes the columns of $L$. (If $L = [c_1, c_2]$, then $\tilde{L} = [c_2, c_1]$).

(c) Orthogonality Check

  • Given $Cov(Z) = I$.
  • $Cov(\tilde{Z}) = Cov(PZ) = P Cov(Z) P^\top = P I P^\top = P P^\top$.
  • Since permutation matrices are orthogonal ($P^{-1} = P^\top$): $P P^\top = I$.
  • Thus, the permuted sources remain uncorrelated with unit variance. This is why ICA cannot determine the order of the components.

Negentropy is usually tested in the ICA (Independent Component Analysis) part of the course. It is a more robust alternative to kurtosis for measuring non-Gaussianity. There are three main ways it tends to be examined, listed from most to least likely:


Exam Angle 1: Definition and Basic Properties (Concept Question)

What it tests: understanding why we use negentropy and why it is non-negative.

Predicted questions:

  1. Definition: Define the differential entropy $H(y)$ of a random vector $y$ with density $f(y)$.
  2. Negentropy: Define Negentropy $J(y)$ in terms of $H(y)$. Explain why $J(y) \ge 0$ always holds and under what condition $J(y) = 0$.
  3. Motivation: Why is maximizing Negentropy equivalent to finding independent components in ICA?

Reference answer:

  • Differential Entropy: $H(y) = - \int f(y) \log f(y) dy = E[-\log f(y)]$.
  • Negentropy definition: $J(y) = H(y_{gauss}) - H(y)$, where $y_{gauss}$ is a Gaussian random variable with the same covariance matrix as $y$.
  • Properties:
    • Always non-negative ($J(y) \ge 0$): this is an information-theoretic theorem: among all distributions with a fixed variance, the Gaussian has the largest differential entropy (maximum entropy). Hence $H(y_{gauss}) \ge H(y)$.
    • Zero condition: $J(y) = 0$ if and only if $y$ itself is Gaussian.
  • ICA Connection: the goal of ICA is to find the “most non-Gaussian” directions (because, by the Central Limit Theorem, mixed signals tend toward Gaussian). Since $J(y)$ measures the “distance” of $y$ from a Gaussian, maximizing negentropy $J(y)$ is equivalent to maximizing non-Gaussianity and therefore separates the independent source signals.

Exam Angle 2: The Relationship between Negentropy and Kurtosis (Derivation/Computation)

What it tests: the course material explicitly gives an approximation formula for negentropy and points out that in a special case it reduces to the squared kurtosis. This is the most likely spot for a computational derivation.

Predicted question: Approximating Negentropy. In practice, $f(y)$ is unknown, so we estimate $J(y)$ using expectations. The general approximation is:

$$J(y) \approx [E(G(y)) - E(G(\nu))]^2$$

where $\nu \sim N(0,1)$ and $G$ is a non-quadratic function. Question: If we choose the function $G(y) = y^4$, show that maximizing Negentropy is equivalent to maximizing the squared Excess Kurtosis. (Assume $y$ has mean 0 and variance 1).

Detailed derivation (memorize this):

  1. Set $G(y) = y^4$.
  2. Compute the Gaussian term $E[G(\nu)]$: for a standard normal $\nu \sim N(0,1)$, the fourth moment is $E[\nu^4] = 3$.
  3. Compute the term for $y$: $E[G(y)] = E[y^4]$.
  4. Substitute into the approximation: $$J(y) \propto [E(y^4) - 3]^2$$
  5. Recall the definition of excess kurtosis: $\mathcal{K}(y) = E[y^4] - 3$ (since the variance is 1).
  6. Conclusion: $$J(y) \propto (\mathcal{K}(y))^2$$ Therefore, when $y^4$ is used as the nonlinear function, maximizing negentropy is the same as maximizing the squared kurtosis. This explains why FastICA sometimes uses the kurtosis directly as its objective.

Exam Angle 3: Comparing the Two Criteria (Short Answer)

What it tests: the pros and cons of choosing negentropy versus kurtosis in ICA. The PDF says little about this, but Source 1013 mentions “non-quadratic functions”, hinting at robustness.

Predicted question: We can define Negentropy using different $G$ functions, such as $G(y) = y^4$ or $G(y) = \log(\cosh(y))$. Question: Why might we prefer a function like $\log(\cosh(y))$ over $y^4$ in robust ICA algorithms?

Reference answer (supplemented from general statistical knowledge; only implicit in the source material):

  • Using $G(y) = y^4$ (kurtosis) is very sensitive to outliers, because the fourth power amplifies values in the tails.
  • Using $G(y) = \log(\cosh(y))$ (a negentropy approximation) grows more slowly (quadratically near zero and roughly linearly in the tails), so it is more robust to noise and outliers in the data.
  • (The source material says $g$ is “some non-quadratic functions”, precisely to obtain better properties than a plain moment estimate.)

Summary: what you should do now

  1. Memorize the formula: $J(y) = H(y_{gauss}) - H(y)$.
  2. Memorize the theorem: for a fixed variance, the Gaussian distribution has maximal entropy.
  3. Know the derivation: when $G(y)=y^4$, $J(y) \approx (\text{Kurtosis})^2$.

Alex, for SVD (singular value decomposition), ED (eigendecomposition/spectral decomposition), and QR Decomposition, the definitions and properties are described in detail in the Linear Algebra part of Everything437.pdf.


STA437 Final Exam Prediction - Linear Algebra Special

Topics: SVD, Spectral Decomposition, QR in Statistics

Question 1: SVD vs. Eigen-Decomposition in PCA (The “Dual” Relationship)

[Source: Simulation Pt.1 Q2, Source 161, 713-723] This question connects SVD directly to the Sample Covariance Matrix, explaining why SVD is the preferred computational method for PCA.

  • (a) SVD to Covariance: Let the centered data matrix be $X \in \mathbb{R}^{n \times p}$. Its Singular Value Decomposition is $X = UDV^\top$. Show that the sample covariance matrix $S = \frac{1}{n}X^\top X$ can be diagonalized as $S = V \Lambda V^\top$. Express the eigenvalues $\Lambda$ of $S$ explicitly in terms of the singular values $D$ of $X$.
  • (b) The “Dual” PCA (High Dimension Case): Suppose $p \gg n$ (more features than samples, e.g., Genomics). The matrix $X^\top X$ is $p \times p$ (huge), but $XX^\top$ is $n \times n$ (small). Show how to compute the right singular vectors $V$ (the Principal Components) using only the eigenvalues/vectors of the smaller matrix $XX^\top$. (Hint: Use the relationship $X^\top (u_i) = \dots$?)

Solution 1: SVD vs. Eigen-Decomposition

(a) SVD to Covariance

  1. Definitions: $X = UDV^\top$, where $U^\top U = I_n$ (or $I_p$ for thin SVD), $V^\top V = VV^\top = I_p$, and $D = \text{diag}(d_1, \dots, d_p)$.
  2. Covariance Formula: $S = \frac{1}{n} X^\top X$.
  3. Substitution: $$S = \frac{1}{n} (UDV^\top)^\top (UDV^\top) = \frac{1}{n} (V D^\top U^\top) (U D V^\top)$$
  4. Simplify: Since $U$ is column-orthogonal, $U^\top U = I$. $$S = \frac{1}{n} V D (I) D V^\top = V (\frac{1}{n}D^2) V^\top$$
  5. Conclusion: This matches the spectral decomposition form $S = V \Lambda V^\top$. The eigenvalues of $S$ are related to singular values of $X$ by: $$\lambda_i = \frac{d_i^2}{n}$$

(b) The “Dual” PCA (Small $n$, Large $p$)

  1. Problem: We want $V$ (eigenvectors of $X^\top X$), but $X^\top X$ is too big to compute.
  2. Use $XX^\top$: Compute eigenvalues/vectors of the smaller $n \times n$ matrix $XX^\top$. $$XX^\top = (UDV^\top)(UDV^\top)^\top = UDV^\top V D U^\top = U D^2 U^\top$$ This gives us $U$ (left singular vectors) and $D^2$ (squared singular values).
  3. Recover $V$: transpose the SVD equation $X = UDV^\top$ to get $X^\top = VDU^\top$. Right-multiplying by $U$ (and using $U^\top U = I$) gives: $$X^\top U = VD$$ Right-multiplying by $D^{-1}$: $$V = X^\top U D^{-1}$$
  4. Result: We can compute $v_i = \frac{1}{d_i} X^\top u_i$. This allows us to find Principal Components $V$ without ever forming the huge covariance matrix.
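A sketch of this dual computation on hypothetical data with $p \gg n$, compared against numpy's direct SVD (the singular vectors agree up to sign):

```python
import numpy as np

rng = np.random.default_rng(11)
n, p = 10, 1000
X = rng.normal(size=(n, p))        # treat as an already-centered data matrix

d2, U = np.linalg.eigh(X @ X.T)    # small n x n eigenproblem: X X^T = U D^2 U^T
order = np.argsort(d2)[::-1]
d, U = np.sqrt(d2[order]), U[:, order]
V_dual = X.T @ U / d               # v_i = (1/d_i) X^T u_i

_, d_svd, Vt = np.linalg.svd(X, full_matrices=False)
print(np.allclose(d, d_svd))                                    # same singular values
print(np.allclose(np.abs(np.sum(V_dual * Vt.T, axis=0)), 1.0))  # same right vectors (up to sign)
```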

Question 2: OLS Estimation via QR Decomposition

[Source: Note Linear Algebra, Source 311-316] Although the course focuses on covariance structures, QR decomposition is introduced as a tool for linear models. This question tests algebraic simplification.

  • (a) QR Setup: Let $X \in \mathbb{R}^{n \times p}$ be a full-rank design matrix. We decompose it as $X = QR$, where $Q \in \mathbb{R}^{n \times p}$ has orthonormal columns ($Q^\top Q = I_p$) and $R \in \mathbb{R}^{p \times p}$ is an upper-triangular invertible matrix.
  • (b) Simplifying OLS: The Ordinary Least Squares (OLS) estimator is given by $\hat{\beta} = (X^\top X)^{-1} X^\top y$. Substitute $X=QR$ into this equation and derive the simplified expression for $\hat{\beta}$ that does not involve any matrix inversions of the form $(\cdot)^{-1}$ except for $R^{-1}$ (which is easy to compute via back-substitution).
  • (c) Projection Matrix: Show that the “Hat Matrix” $H = X(X^\top X)^{-1}X^\top$ simplifies to $QQ^\top$ using the QR decomposition.

Solution 2: OLS Estimation via QR Decomposition

(a) QR Setup $X = QR$, $Q^\top Q = I$, $R$ is upper triangular.

(b) Simplifying OLS

  1. OLS Formula: $\hat{\beta} = (X^\top X)^{-1} X^\top y$.
  2. Substitute $X=QR$: $$\hat{\beta} = ((QR)^\top (QR))^{-1} (QR)^\top y$$ $$\hat{\beta} = (R^\top Q^\top Q R)^{-1} R^\top Q^\top y$$
  3. Use Orthogonality ($Q^\top Q = I$): $$\hat{\beta} = (R^\top I R)^{-1} R^\top Q^\top y = (R^\top R)^{-1} R^\top Q^\top y$$
  4. Expand Inverse: Note that $(AB)^{-1} = B^{-1}A^{-1}$. $$(R^\top R)^{-1} = R^{-1} (R^\top)^{-1}$$
  5. Final Simplification: $$\hat{\beta} = R^{-1} (R^\top)^{-1} R^\top Q^\top y$$ Since $(R^\top)^{-1} R^\top = I$: $$\hat{\beta} = R^{-1} Q^\top y$$ (Benefit: solving $R\hat{\beta} = Q^\top y$ only requires back-substitution on a triangular system, which is fast and numerically stable.)

(c) Projection Matrix $H$

  1. Definition: $H = X(X^\top X)^{-1}X^\top$.
  2. From (b), we know: $(X^\top X)^{-1}X^\top = R^{-1}Q^\top$. (Since $\hat{\beta} = (X^\top X)^{-1}X^\top y = R^{-1}Q^\top y$).
  3. Substitute into H: $$H = X (R^{-1} Q^\top)$$ $$H = (QR) R^{-1} Q^\top$$ $$H = Q (R R^{-1}) Q^\top = Q I Q^\top = Q Q^\top$$
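A minimal sketch of the QR route to OLS and to the hat matrix, checked against the normal-equation formulas on hypothetical data:

```python
import numpy as np
from scipy.linalg import solve_triangular

rng = np.random.default_rng(12)
n, p = 30, 4
X = rng.normal(size=(n, p))
y = rng.normal(size=n)

Q, R = np.linalg.qr(X)                         # thin QR: Q is n x p, R is p x p upper triangular
beta_qr = solve_triangular(R, Q.T @ y)         # back-substitution, no explicit inverse
beta_ne = np.linalg.solve(X.T @ X, X.T @ y)    # normal equations
print(np.allclose(beta_qr, beta_ne))           # True

H = X @ np.linalg.solve(X.T @ X, X.T)
print(np.allclose(H, Q @ Q.T))                 # True: H = Q Q^T
```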

Question 3: Matrix Powers & “Whitening” (Sphering)

[Source: Linear Algebra, Source 359-360, 466-469] This tests the application of Spectral Decomposition (ED) in standardizing multivariate data, a prerequisite for ICA and CCA.

  • (a) The Inverse Square Root: Let $\Sigma$ be a $p \times p$ symmetric Positive Definite (PD) matrix with spectral decomposition $\Sigma = U \Lambda U^\top$. Define the matrix $\Sigma^{-1/2}$. Show that $\Sigma^{-1/2}$ is symmetric.
  • (b) Whitening Transformation: Let $X$ be a random vector with mean 0 and covariance $\Sigma$. Define the transformed vector $Z = \Sigma^{-1/2}X$. Prove that the covariance matrix of $Z$ is the Identity matrix $I_p$. (This process is called “Whitening” or “Sphering”).
  • (c) Mahalanobis Distance: Show that the squared Euclidean norm of the whitened vector, $||Z||^2$, is exactly equal to the squared Mahalanobis distance of the original vector $X$ from the origin: $X^\top \Sigma^{-1} X$.

Solution 3: Matrix Powers & “Whitening”

(a) The Inverse Square Root

  1. Spectral Decomposition: $\Sigma = U \Lambda U^\top$, where $\Lambda = \text{diag}(\lambda_1, \dots, \lambda_p)$ with $\lambda_i > 0$.
  2. Definition: $\Sigma^{-1/2} = U \Lambda^{-1/2} U^\top$, where $\Lambda^{-1/2} = \text{diag}(1/\sqrt{\lambda_1}, \dots, 1/\sqrt{\lambda_p})$.
  3. Symmetry Check: $$(\Sigma^{-1/2})^\top = (U \Lambda^{-1/2} U^\top)^\top = (U^\top)^\top (\Lambda^{-1/2})^\top U^\top = U \Lambda^{-1/2} U^\top$$ (Since diagonal matrices are symmetric). Thus, $(\Sigma^{-1/2})^\top = \Sigma^{-1/2}$, so it is symmetric.

(b) Whitening Transformation

  1. Setup: $Z = \Sigma^{-1/2}X$. We need $Cov(Z)$.
  2. Covariance Calculation: $$Cov(Z) = Cov(\Sigma^{-1/2}X) = \Sigma^{-1/2} Cov(X) (\Sigma^{-1/2})^\top$$
  3. Substitute $Cov(X) = \Sigma$: $$Cov(Z) = \Sigma^{-1/2} \Sigma \Sigma^{-1/2}$$ (Using symmetry from part a).
  4. Use Eigen-decomposition: $$Cov(Z) = (U \Lambda^{-1/2} U^\top) (U \Lambda U^\top) (U \Lambda^{-1/2} U^\top)$$ Using $U^\top U = I$: $$Cov(Z) = U (\Lambda^{-1/2} \Lambda \Lambda^{-1/2}) U^\top$$ The diagonal term: $\frac{1}{\sqrt{\lambda}} \cdot \lambda \cdot \frac{1}{\sqrt{\lambda}} = 1$. So inner term is $I$. $$Cov(Z) = U I U^\top = U U^\top = I$$

(c) Mahalanobis Distance

  1. Squared Norm: $||Z||^2 = Z^\top Z$.
  2. Substitute Z: $$||Z||^2 = (\Sigma^{-1/2}X)^\top (\Sigma^{-1/2}X) = X^\top (\Sigma^{-1/2})^\top \Sigma^{-1/2} X$$
  3. Simplify Matrix Product: $$(\Sigma^{-1/2})^\top \Sigma^{-1/2} = \Sigma^{-1/2} \Sigma^{-1/2} = \Sigma^{-1}$$ (Since $A^{1/2}A^{1/2} = A$, so $A^{-1/2}A^{-1/2} = A^{-1}$).
  4. Result: $$||Z||^2 = X^\top \Sigma^{-1} X$$ This is the definition of the squared Mahalanobis distance.
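A numerical check of the whitening and Mahalanobis facts, with an arbitrary (hypothetical) SPD covariance:

```python
import numpy as np

rng = np.random.default_rng(13)
A = rng.normal(size=(3, 3))
Sigma = A @ A.T + np.eye(3)                       # SPD covariance

w, U = np.linalg.eigh(Sigma)
Sigma_inv_sqrt = U @ np.diag(w ** -0.5) @ U.T     # U Lambda^{-1/2} U^T (symmetric)

X = rng.multivariate_normal(np.zeros(3), Sigma, size=100_000)
Z = X @ Sigma_inv_sqrt                            # each row is Sigma^{-1/2} x_i
print(np.cov(Z, rowvar=False).round(2))           # approximately the identity

x = X[0]
z = Sigma_inv_sqrt @ x
print(np.isclose(z @ z, x @ np.linalg.solve(Sigma, x)))   # ||z||^2 = squared Mahalanobis distance
```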

Alex, computing the eigenvalues and eigenvectors of a matrix is the most basic computational skill in this course (especially for PCA, FA, and MVN testing). On the exam, hand calculation usually only involves $2 \times 2$ or simple $3 \times 3$ matrices.

Here is a foolproof, standardized three-step procedure; write it on your cheat sheet:

Core Definition

For a matrix $A$, a vector whose direction is unchanged by the transformation (only its length changes) satisfies:

$$A v = \lambda v$$

Rewritten as an equation to solve:

$$(A - \lambda I)v = 0$$

Step 1: Find the Eigenvalues $\lambda$ (The Characteristic Equation)

Goal: find the values of $\lambda$ that make $(A - \lambda I)$ singular (determinant equal to 0).

Formula

$$\det(A - \lambda I) = 0$$

Worked example: suppose the covariance matrix is $S = \begin{pmatrix} 4 & 2 \\ 2 & 7 \end{pmatrix}$.

  1. Write out $S - \lambda I$: $$\begin{pmatrix} 4 - \lambda & 2 \\ 2 & 7 - \lambda \end{pmatrix}$$
  2. Compute the determinant (product of the diagonal minus product of the anti-diagonal): $$(4 - \lambda)(7 - \lambda) - (2)(2) = 0$$
  3. Solve the quadratic: $$\lambda^2 - 11\lambda + 28 - 4 = 0$$ $$\lambda^2 - 11\lambda + 24 = 0$$ $$(\lambda - 3)(\lambda - 8) = 0$$ Result: the eigenvalues are $\lambda_1 = 8, \lambda_2 = 3$ (usually listed from largest to smallest).

Step 2: Find the Eigenvectors $v$ (The Null Space)

Goal: plug each computed $\lambda$ back into $(A - \lambda I)v = 0$ and solve for $v$. Key point: this system must have infinitely many solutions (the rows are multiples of each other). If you obtain the unique solution $v=0$, Step 1 was done incorrectly.

Continuing the example, Case 1: $\lambda_1 = 8$

  1. Substitute into $(S - 8I)v = 0$: $$\begin{pmatrix} 4-8 & 2 \\ 2 & 7-8 \end{pmatrix} \begin{pmatrix} x_1 \\ x_2 \end{pmatrix} = \begin{pmatrix} 0 \\ 0 \end{pmatrix}$$ $$\begin{pmatrix} -4 & 2 \\ 2 & -1 \end{pmatrix} \begin{pmatrix} x_1 \\ x_2 \end{pmatrix} = 0$$
  2. Observe that the first row $-4x_1 + 2x_2 = 0$ and the second row $2x_1 - x_2 = 0$ are really the same equation (the second row is $-0.5$ times the first).
  3. Solve: $x_2 = 2x_1$.
  4. Pick a simple integer solution (say $x_1=1$): $$v_1 = \begin{pmatrix} 1 \\ 2 \end{pmatrix}$$

Case 2: $\lambda_2 = 3$

  1. Substitute into $(S - 3I)v = 0$: $$\begin{pmatrix} 4-3 & 2 \\ 2 & 7-3 \end{pmatrix} \begin{pmatrix} x_1 \\ x_2 \end{pmatrix} = 0$$ $$\begin{pmatrix} 1 & 2 \\ 2 & 4 \end{pmatrix} \begin{pmatrix} x_1 \\ x_2 \end{pmatrix} = 0$$
  2. Observe the equation: $x_1 + 2x_2 = 0$.
  3. Solve: $x_1 = -2x_2$.
  4. Pick a simple integer solution (say $x_2=1$): $$v_2 = \begin{pmatrix} -2 \\ 1 \end{pmatrix}$$

Step 3: Normalization (Required in STA437)

In statistics (PCA/FA) we require eigenvectors of unit length ($||v||=1$). Procedure: compute the vector's length, then divide by it.

  1. For $v_1 = \begin{pmatrix} 1 \\ 2 \end{pmatrix}$:

    • Length: $||v_1|| = \sqrt{1^2 + 2^2} = \sqrt{5}$.
    • Normalized: $u_1 = \begin{pmatrix} 1/\sqrt{5} \\ 2/\sqrt{5} \end{pmatrix}$.
  2. For $v_2 = \begin{pmatrix} -2 \\ 1 \end{pmatrix}$:

    • Length: $||v_2|| = \sqrt{(-2)^2 + 1^2} = \sqrt{5}$.
    • Normalized: $u_2 = \begin{pmatrix} -2/\sqrt{5} \\ 1/\sqrt{5} \end{pmatrix}$. (A quick numpy check of this worked example follows.)
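The same worked example checked with numpy (only the ordering and the signs may differ from the hand computation):

```python
import numpy as np

S = np.array([[4.0, 2.0],
              [2.0, 7.0]])
vals, vecs = np.linalg.eigh(S)   # eigenvalues in ascending order
print(vals)                      # [3. 8.]
print(vecs)                      # columns: +/-(-2, 1)/sqrt(5) and +/-(1, 2)/sqrt(5)
```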

Quick “Cheat” Sanity Checks for the Exam

If you finish a computation and are not sure it is right, check it with these three properties:

  1. Trace Rule: the sum of the eigenvalues equals the sum of the diagonal entries of the matrix.

    • Example: $\lambda_1 + \lambda_2 = 8 + 3 = 11$.
    • Matrix: $4 + 7 = 11$. ✅ It matches.
  2. Determinant Rule: the product of the eigenvalues equals the determinant of the matrix.

    • Example: $\lambda_1 \cdot \lambda_2 = 8 \times 3 = 24$.
    • Matrix: $(4)(7) - (2)(2) = 28 - 4 = 24$. ✅ It matches.
  3. Symmetric Matrix Rule: if the matrix is symmetric (e.g., a covariance matrix), eigenvectors belonging to distinct eigenvalues must be orthogonal (dot product 0).

    • Check: $v_1 \cdot v_2 = (1)(-2) + (2)(1) = -2 + 2 = 0$. ✅ It matches.

Special case reminder: projection matrices. If a Projection Matrix $P$ (such as the Hat Matrix $H$) appears in a question, then according to the PDF you do not need to compute a determinant.

  • The eigenvalues of $P$ can only be 1 or 0.
  • The number of eigenvalues equal to 1 = the rank of the matrix.
  • The number of eigenvalues equal to 0 = the dimension $n$ minus the rank.