Part A: Matrix Algebra & Spectral Analysis (The “Hat” Matrix)
Context: In Midterm 1, you analyzed the centering matrix $C$. Now, consider the general “Hat Matrix” $H$ used in regression. Let $X \in \mathbb{R}^{n \times p}$ be a data matrix with full column rank ($p < n$). Let $H = X(X^\top X)^{-1}X^\top$. Let $M = I_n - H$ (The “Residual Maker”).
Q1. [8 marks] Spectral Properties of the Residual Maker
(a) [2] Using the Singular Value Decomposition (SVD) of $X = UDV^\top$, show that $H$ can be written simply as $UU^\top$. (Hint: Substitute the SVD into the definition of $H$. This generalizes the $XX^\top$ property you saw in Midterm 1 Q3.)
(b) [2] Determine the eigenvalues of $M$ and their multiplicities. Based on this, explain why $M$ is Positive Semi-Definite (PSD). (Connection: Similar to Midterm 1 Q1(c), where you found the eigenvalues of $C$, but now for the general regression case).
(c) [2] In Exercise 7, we showed that for a projection matrix $P$, the diagonal elements satisfy $0 \le P_{ii} \le 1$. Show that the trace of the Hat matrix is equal to the number of predictors: $\text{tr}(H) = p$. Consequently, what is the average value of the “leverage” (the diagonal elements $H_{ii}$)?
(d) [2] Let $y = X\beta + \epsilon$. We know the residuals are $r = My$. Prove that $\|r\|^2 = \text{tr}(y^\top M y)$. (Hint: Use the cyclic property of the trace).
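Open-book aid: a minimal numpy sketch (the random $X$ and $y$ are illustrative simulations, not part of the question) that numerically sanity-checks parts (a)-(d) above.

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 20, 3
X = rng.standard_normal((n, p))            # full column rank with probability 1

H = X @ np.linalg.inv(X.T @ X) @ X.T       # hat matrix
M = np.eye(n) - H                          # residual maker

# (a) H equals U U^T, where U holds the first p left singular vectors of X
U, d, Vt = np.linalg.svd(X, full_matrices=False)
print(np.allclose(H, U @ U.T))

# (b) eigenvalues of M: 0 with multiplicity p, 1 with multiplicity n - p (all >= 0, so PSD)
eigvals = np.linalg.eigvalsh(M)            # ascending order
print(np.allclose(eigvals[:p], 0), np.allclose(eigvals[p:], 1))

# (c) tr(H) = p, so the average leverage H_ii is p / n
print(np.isclose(np.trace(H), p))

# (d) ||r||^2 = tr(y^T M y) for any y (a 1x1 matrix equals its trace)
y = rng.standard_normal(n)
r = M @ y
print(np.isclose(r @ r, np.trace(y[None, :] @ M @ y[:, None])))
```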
Part B: Data Transformation & SVD (Generalizing Variance)
Context: In Midterm 1 Q2, you calculated sample covariance for a column-orthogonal matrix. Here, we look at the general case using SVD. Let $X \in \mathbb{R}^{n \times p}$ be a column-centered data matrix (but NOT necessarily orthogonal). Let its SVD be $X = UDV^\top$. Let $S = \frac{1}{n}X^\top X$ be the sample covariance matrix.
Q2. [8 marks] Total Variance and Principal Components
(a) [2] Show that the sample covariance matrix can be written as $S = \frac{1}{n} V D^2 V^\top$. (Connection: This generalizes the result in Midterm Q2(b) where $S$ was diagonal. Now $S$ is dense, but diagonalizable).
(b) [2] The “Total Variance” of the dataset is defined as the sum of the variances of all variables: $\text{tr}(S)$. Using the Frobenius Norm relation from Exercise 12, show that:
$$\text{Total Variance} = \frac{1}{n} \sum_{i=1}^p d_i^2$$
where $d_i$ are the singular values of $X$.
(c) [2] Consider the transformed data $Z = X V$. Find the sample covariance matrix of $Z$. Are the transformed variables (the columns of $Z$) correlated or uncorrelated? (Prediction: This tests whether you understand that $V$ contains the eigenvectors that diagonalize the covariance).
(d) [2] Gradient Application. Define the objective function $f(A) = \|X - A\|_F^2$. Using the gradient result from Exercise 13 or 14, explain (or show) why the derivative is zero when $A=X$. (Note: Since it’s open book, don’t just write the answer. Briefly verify the gradient step).
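Open-book aid: a numpy sketch checking Q2 on simulated, column-centered data (the dimensions are illustrative).

```python
import numpy as np

rng = np.random.default_rng(1)
n, p = 50, 4
X = rng.standard_normal((n, p))
X = X - X.mean(axis=0)                     # column-center

U, d, Vt = np.linalg.svd(X, full_matrices=False)
V = Vt.T
S = X.T @ X / n                            # sample covariance

# (a) S = (1/n) V D^2 V^T
print(np.allclose(S, V @ np.diag(d**2) @ V.T / n))

# (b) total variance tr(S) = (1/n) * sum of squared singular values
print(np.isclose(np.trace(S), (d**2).sum() / n))

# (c) the covariance of Z = XV is diagonal, i.e. the transformed variables are uncorrelated
Z = X @ V
S_Z = Z.T @ Z / n
print(np.allclose(S_Z, np.diag(np.diag(S_Z))))

# (d) the gradient of f(A) = ||X - A||_F^2 is -2(X - A), which vanishes at A = X
A = X.copy()
print(np.allclose(-2 * (X - A), 0))
```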
Part C: Statistical Inference (Ridge Regression Context)
Context: Midterm 1 Q3 dealt with OLS. Final exams often introduce a slight variation, like “Ridge Regression” (penalized least squares), to see if you can apply the same MVN rules. Consider the model $y \sim \mathcal{N}_n(X\beta, \sigma^2 I_n)$. Instead of the standard estimator, consider the Ridge Estimator:
$$\hat{\beta}_R = (X^\top X + \lambda I_p)^{-1} X^\top y$$
where $\lambda > 0$ is a fixed scalar constant.
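Open-book aid: a minimal numpy sketch of the ridge estimator on simulated data (the values of $n$, $p$, $\lambda$, $\sigma$, and $\beta$ below are illustrative only), which you can use to sanity-check parts (a)-(c) of Q3 below.

```python
import numpy as np

rng = np.random.default_rng(2)
n, p, lam, sigma = 100, 3, 5.0, 1.0
X = rng.standard_normal((n, p))
beta = np.array([1.0, -2.0, 0.5])
y = X @ beta + sigma * rng.standard_normal(n)

beta_R = np.linalg.inv(X.T @ X + lam * np.eye(p)) @ X.T @ y    # ridge estimator
A_lam = X @ np.linalg.inv(X.T @ X + lam * np.eye(p)) @ X.T     # "ridge hat" matrix

# A_lambda is symmetric but not idempotent, so it is not a projection (part (c))
print(np.allclose(A_lam, A_lam.T), np.allclose(A_lam @ A_lam, A_lam))
```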
Q3. [8 marks] Distribution of the Ridge Estimator
(a) [2] Is $\hat{\beta}_R$ an unbiased estimator of $\beta$? Calculate $\mathbb{E}[\hat{\beta}_R]$. (Hint: Use $\mathbb{E}[Ay] = A\mathbb{E}[y]$). (Contrast: In Midterm 1, OLS was unbiased. Here, you should find a bias term dependent on $\lambda$).
(b) [2] Derive the covariance matrix of $\hat{\beta}_R$, denoted as $\text{Var}(\hat{\beta}_R)$. (Hint: Use $\text{Var}(Ay) = A \text{Var}(y) A^\top$. Remember $\text{Var}(y) = \sigma^2 I$).
(c) [2] Let the “Ridge Fitted Values” be $\hat{y}_R = X \hat{\beta}_R$. Write $\hat{y}_R$ as $A_\lambda y$ for some matrix $A_\lambda$. Is $A_\lambda$ a projection matrix? Explain why or why not. (Hint: Check if $A_\lambda^2 = A_\lambda$. In Midterm Q3, $P$ was a projection. Here, the $\lambda$ might break that property).
(d) [2] Joint Distribution. Let $u$ be a fixed new input vector. Let $\hat{y}_{new} = u^\top \hat{\beta}_R$. State the full distribution of $\hat{y}_{new}$.
$$\hat{y}_{new} \sim \mathcal{N}(?, ?)$$
FINAL Pt.2 Simulation
Part A: Advanced Inference in Linear Models (GLS & Constraints)
Context: In Midterm 2, you derived the MLE for the standard OLS model $y = X\beta + \epsilon$ where $\epsilon \sim \mathcal{N}(0, \sigma^2 I)$. For the Final, assume the noise terms are correlated/heteroscedastic: $\epsilon \sim \mathcal{N}(0, \sigma^2 \Psi)$, where $\Psi$ is a known positive definite matrix.
Q1. [8 marks] Generalized Least Squares (GLS)
(a) [2] Transformation: Find a matrix $L$ such that the transformed model $L y = L X \beta + L \epsilon$ satisfies the standard OLS assumptions (i.e., the new noise is white noise $\sim \mathcal{N}(0, \sigma^2 I)$). (Hint: Use the Cholesky or Eigen-decomposition of $\Psi^{-1}$).
(b) [2] The Estimator: Using the transformation from (a) and the standard OLS result from Midterm 2 ($\hat{\beta} = (X^\top X)^{-1}X^\top y$), derive the Generalized Least Squares estimator $\hat{\beta}_{GLS}$ in terms of $X, y, \text{and } \Psi$.
(c) [2] Distribution: What is the distribution of $\hat{\beta}_{GLS}$? State its mean and covariance matrix.
(d) [2] Hypothesis Testing (General Constraint): Suppose we want to test $H_0: A\beta = 0$ vs $H_1: A\beta \neq 0$, where $A$ is a full-rank $r \times p$ matrix. Construct a test statistic using $\hat{\beta}_{GLS}$ that follows a $\chi^2$ distribution under the null. (Hint: Recall Midterm 2 Q1(c) used $\hat{\beta}^\top (\text{Var}(\hat{\beta}))^{-1} \hat{\beta}$. Apply the same logic to the vector $A\hat{\beta}_{GLS}$).
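Open-book aid: a numpy sketch of the GLS pipeline (the simulated $X$, $y$, the constructed positive definite $\Psi$, and the contrast matrix $A$ are illustrative assumptions), useful for checking (a), (b), and (d) above.

```python
import numpy as np

rng = np.random.default_rng(3)
n, p, sigma = 60, 3, 1.0
X = rng.standard_normal((n, p))
beta = np.array([2.0, 0.0, -1.0])

B = rng.standard_normal((n, n))
Psi = B @ B.T / n + np.eye(n)                      # known positive definite Psi
y = X @ beta + rng.multivariate_normal(np.zeros(n), sigma**2 * Psi)

# (a) whitening: with Psi = R R^T (Cholesky), L = R^{-1} gives L Psi L^T = I
R = np.linalg.cholesky(Psi)
L = np.linalg.inv(R)
print(np.allclose(L @ Psi @ L.T, np.eye(n)))

# (b) OLS on the transformed model (LX, Ly) equals the GLS formula
Xt, yt = L @ X, L @ y
beta_ols_transformed = np.linalg.solve(Xt.T @ Xt, Xt.T @ yt)
Psi_inv = np.linalg.inv(Psi)
beta_gls = np.linalg.solve(X.T @ Psi_inv @ X, X.T @ Psi_inv @ y)
print(np.allclose(beta_ols_transformed, beta_gls))

# (d) chi-square statistic for H0: A beta = 0 (here H0 is true since beta_2 = 0)
A = np.array([[0.0, 1.0, 0.0]])
cov_beta = sigma**2 * np.linalg.inv(X.T @ Psi_inv @ X)
stat = (A @ beta_gls) @ np.linalg.inv(A @ cov_beta @ A.T) @ (A @ beta_gls)
print(float(stat))                                 # compare against chi2(r = 1) quantiles
```

Any square root of $\Psi^{-1}$ works as the whitening matrix $L$; the Cholesky factor used above is just one convenient choice.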
Part B: PCA Duality & Reconstruction (The “SVD” View)
Context: Midterm 2 Q2 asked you to compute PCA on a small dataset $X \in \mathbb{R}^{3 \times 2}$. In high-dimensional settings (e.g., genetics), we often have $p \gg n$. Let $X$ be centered with SVD $X = UDV^\top$.
Q2. [8 marks] The Dual PCA Approach
(a) [2] Eigenvalues: Midterm 2 Q3(b) mentioned that $X^\top X$ and $XX^\top$ share non-zero eigenvalues. Let $\lambda_i$ be the eigenvalues of the sample covariance $S = \frac{1}{n}X^\top X$. Express $\lambda_i$ in terms of the singular values $d_i$ of $X$.
(b) [2] Scores vs. Loadings: In PCA, the “Scores” are $Z = XV$ and the “Loadings” are $V$. Show that the scores can be computed directly from the left singular vectors $U$ and singular values $D$, without ever computing the $p \times p$ covariance matrix. (This is crucial for computational efficiency when $p$ is large).
(c) [2] Eckart-Young Theorem (Reconstruction): Let $\hat{X}_k$ be the reconstruction of data $X$ using only the first $k$ principal components. Write $\hat{X}_k$ in terms of $U, D, V$. What is the squared Frobenius norm of the reconstruction error, $\|X - \hat{X}_k\|_F^2$, in terms of the singular values?
(d) [2] Geometry: If $X$ is centered, explain why $\|X\|_F^2$ is proportional to the total variance. Combine this with (c) to explain why maximizing “Variance Explained” is mathematically equivalent to minimizing “Reconstruction Error”.
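Open-book aid: a numpy check of the dual-PCA identities above in a simulated $p \gg n$ setting (the dimensions and $k$ are illustrative).

```python
import numpy as np

rng = np.random.default_rng(4)
n, p, k = 10, 50, 3                        # p >> n regime
X = rng.standard_normal((n, p))
X = X - X.mean(axis=0)                     # center

U, d, Vt = np.linalg.svd(X, full_matrices=False)

# (a) the nonzero eigenvalues of S = X^T X / n are d_i^2 / n
lam = np.sort(np.linalg.eigvalsh(X.T @ X / n))[::-1][:len(d)]
print(np.allclose(lam, d**2 / n))

# (b) the scores Z = XV equal U D, computed without forming the p x p covariance matrix
print(np.allclose(X @ Vt.T, U * d))

# (c) rank-k reconstruction; error = sum of the discarded squared singular values
Xk = U[:, :k] @ np.diag(d[:k]) @ Vt[:k, :]
print(np.isclose(np.linalg.norm(X - Xk, 'fro')**2, (d[k:]**2).sum()))

# (d) ||X||_F^2 = sum of all d_i^2 = n * (total variance)
print(np.isclose(np.linalg.norm(X, 'fro')**2, (d**2).sum()))
```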
Part C: Kernel Methods (The “Centering” Problem)
Context: Midterm 2 Q3 verified that a Kernel matrix is PSD. However, standard PCA requires centered data ($X$ has column means of 0). In Kernel PCA, we only have the matrix $K$, and we cannot explicitly compute the feature vectors $\Phi(x)$ in order to center them.
Q3. [8 marks] Centering in Feature Space
(a) [2] Distance in Feature Space: The “Kernel Trick” allows us to compute distances without knowing $\Phi$. Show that the squared Euclidean distance between two points in feature space, $\|\Phi(x_i) - \Phi(x_j)\|^2$, can be computed solely using the kernel function values $K(x_i, x_i), K(x_j, x_j), \text{and } K(x_i, x_j)$.
(b) [2] The Centering Matrix: Let $\tilde{\Phi}$ be the centered feature matrix. We want the centered kernel matrix $\tilde{K} = \tilde{\Phi}\tilde{\Phi}^\top$. Show that $\tilde{K} = C K C$, where $C = I_n - \frac{1}{n}1_n 1_n^\top$ is the centering matrix from Midterm 1. (Hint: You don’t need to do the full algebra derivation if you can explain what multiplying by $C$ on the left and right does to the rows and columns).
(c) [2] Effect on Eigenvalues: Does centering the kernel matrix change its eigenvalues? Why is this step necessary before performing the eigendecomposition to find Kernel PCs?
(d) [2] Linear Kernel Check: If we use the linear kernel $K = XX^\top$ (where $X$ is NOT centered), does computing $\tilde{K} = CKC$ effectively center the original data $X$? (Hint: Recall $CX$ centers the columns of X).
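Open-book aid: a numpy check of Q3 using the linear kernel on simulated, deliberately uncentered data (the dimensions and the offset are illustrative).

```python
import numpy as np

rng = np.random.default_rng(5)
n, p = 8, 3
X = rng.standard_normal((n, p)) + 5.0      # deliberately NOT centered
K = X @ X.T                                # linear kernel
C = np.eye(n) - np.ones((n, n)) / n        # centering matrix from Midterm 1

# (a) kernel-trick distance: ||phi(x_i) - phi(x_j)||^2 = K_ii + K_jj - 2 K_ij
i, j = 0, 1
print(np.isclose(np.sum((X[i] - X[j])**2), K[i, i] + K[j, j] - 2 * K[i, j]))

# (b)/(d) double-centering the kernel equals the kernel of the column-centered data
Xc = C @ X                                 # C X centers the columns of X
print(np.allclose(C @ K @ C, Xc @ Xc.T))

# (c) centering does change the eigenvalues in general
print(np.linalg.eigvalsh(K)[::-1][:3])
print(np.linalg.eigvalsh(C @ K @ C)[::-1][:3])
```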
Part D: CCA Optimization (Lagrange Multipliers)
Context: Midterm 2 Q4 solved CCA by inspection for a perfectly correlated case. For the Final, consider the general optimization formulation (Note 8 Q4).
Q4. [6 marks] Canonical Correlation Analysis Setup
(a) [2] The Objective: We want to find vectors $u, v$ to maximize $\text{Corr}(X^\top u, Y^\top v)$. Since correlation is scale-invariant, we can fix the variances to 1. Write down the optimization problem: “Maximize $u^\top S_{XY} v$ subject to …?”
(b) [2] Independence: If $X$ and $Y$ are statistically independent, what should theoretically happen to the solution of the optimization problem in (a)? What would the maximum canonical correlation be?
(c) [2] Relation to Regression: If $Y$ is just a single variable $y$ (dimension $q=1$), show that the CCA direction $u$ is proportional to the Ordinary Least Squares regression coefficient $\hat{\beta}$. (Hint: This connects CCA back to Regression, bridging Part A and Part D).
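Open-book aid: a numpy check of part (c) on simulated, centered data (the data-generating coefficients are illustrative); with a single response, the maximizer of the correlation is $u \propto S_{XX}^{-1} S_{Xy}$, which should coincide with the OLS direction.

```python
import numpy as np

rng = np.random.default_rng(6)
n, p = 200, 4
X = rng.standard_normal((n, p))
y = X @ np.array([1.0, 0.5, -1.0, 0.0]) + rng.standard_normal(n)
X, y = X - X.mean(axis=0), y - y.mean()    # center both

beta_ols = np.linalg.solve(X.T @ X, X.T @ y)

Sxx, Sxy = X.T @ X / n, X.T @ y / n
u_cca = np.linalg.solve(Sxx, Sxy)          # CCA direction for q = 1, up to scale

# proportionality check: the cosine similarity should be +/- 1
cos = u_cca @ beta_ols / (np.linalg.norm(u_cca) * np.linalg.norm(beta_ols))
print(np.isclose(abs(cos), 1.0))
```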
FINAL Pt.3
Part A: Factor Analysis - Model Existence & Rotation
Source Mapping: Exercise 9 Q4 (Impossible Solution) & FA Notes “Identifiability”.
Context: The FA model decomposes covariance as $\Sigma = LL^\top + \Psi$. In Exercise 9, you saw a case where a solution did not exist. This is a classic “Open Book” trap—giving you a matrix that breaks the model.
Q1. [8 marks] The One-Factor Model Constraints
Consider a 3-dimensional random vector $X$ with correlation matrix $R$. We wish to fit a 1-factor model ($r=1$):
$$X_i = l_i Z + \epsilon_i, \quad i=1,2,3$$
Assume standardized variables so that $\Sigma = R$. Let the observed correlations be $r_{12} = 0.9$, $r_{13} = 0.9$, and $r_{23} = 0.3$.
(a) [2] System of Equations: Write down the equations relating the loadings $l_1, l_2, l_3$ to the observed correlations $r_{ij}$, assuming the specific variances $\psi_i > 0$. (Hint: Recall $\Sigma_{ij} = (LL^\top)_{ij}$ for $i \neq j$).
(b) [3] The “Heywood Case” (Boundary Solution): Solve for the values of $l_1^2, l_2^2, l_3^2$. Using these values, calculate the specific variance $\psi_1 = 1 - l_1^2$. Is this a valid statistical model? Explain why or why not. (Prediction: This mimics Exercise 9 Q4, where you check whether the parameters make sense. If $\psi < 0$, the specific variance is negative, which is impossible).
(c) [3] Rotation Invariance: Exercise 9 Q3 asks whether PVE is rotation invariant. Let $L \in \mathbb{R}^{p \times r}$ be the loadings and let $Q$ be an orthogonal rotation matrix. Define the communality of variable $i$ as $h_i^2 = \sum_{j=1}^r l_{ij}^2 = (LL^\top)_{ii}$, and the “Total Communality” as $\sum_{i=1}^p h_i^2$. Prove that the Total Communality is invariant to rotation (i.e., it is the same for $L$ and $L^* = LQ$). (Hint: The proof uses the trace property).
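Open-book aid: a short numpy check of the one-factor solution in (b) and of the rotation invariance in (c) (the $5 \times 2$ loading matrix used for the rotation check is an arbitrary illustration).

```python
import numpy as np

# (b) one-factor solution from l1*l2 = r12, l1*l3 = r13, l2*l3 = r23
r12, r13, r23 = 0.9, 0.9, 0.3
l1_sq = r12 * r13 / r23                    # = 2.7
l2_sq = r12 * r23 / r13                    # = 0.3
l3_sq = r13 * r23 / r12                    # = 0.3
print(l1_sq, l2_sq, l3_sq, 1 - l1_sq)      # psi_1 = 1 - l1^2 = -1.7 < 0: Heywood case

# (c) total communality tr(L L^T) is invariant under L -> L Q for orthogonal Q
rng = np.random.default_rng(7)
L = rng.standard_normal((5, 2))
Q, _ = np.linalg.qr(rng.standard_normal((2, 2)))
print(np.isclose(np.trace(L @ L.T), np.trace((L @ Q) @ (L @ Q).T)))
```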
Part B: ICA - The “Uncorrelated vs Independent” Trap
Source Mapping: Exercise 10 Q2 (Uncorrelated Dependent) & ICA Notes “Preprocessing”.
Context: ICA works because it looks for Independence, not just Decorrelation. Exercise 10 Q2 provides a “Diamond” distribution that is uncorrelated but dependent. This is the perfect counter-example to test if you understand why we need ICA over PCA.
Q2. [8 marks] Why PCA Fails for Blind Source Separation
Consider the random vector $S = (S_1, S_2)^\top$ defined in Exercise 10 Q2, where the joint probability mass is distributed equally on the four points $(0,1), (0,-1), (1,0), (-1,0)$.
(a) [2] Covariance Check: You proved in the exercise that $\text{Cov}(S_1, S_2) = 0$. If we apply PCA (whitening) to this data, what will the rotation matrix be? Will PCA change the orientation of this data? (Hint: Since the components are already uncorrelated with equal variances, the whitening step only rescales; the PCA rotation is effectively the identity).
(b) [2] Independence Check: Are $S_1$ and $S_2$ independent? Justify your answer using the definition $P(S_1, S_2) = P(S_1)P(S_2)$ for the point $(1,1)$. (Connection: This confirms PCA cannot separate dependent signals that happen to be uncorrelated).
(c) [2] Kurtosis Calculation: Using the excess-kurtosis formula $\mathcal{K}(y) = E[y^4] - 3$ for a standardized variable $y$, calculate the kurtosis of the standardized $S_1$ (i.e., $S_1/\sqrt{\text{Var}(S_1)}$). Does this distribution have positive (super-Gaussian) or negative (sub-Gaussian) kurtosis? (Note: $S_1$ takes the values 0, 1, -1. This calculation justifies using ICA algorithms that maximize non-Gaussianity).
(d) [2] The ICA Objective: We want to find a direction $w$ such that $w^\top S$ maximizes non-Gaussianity. Explain, using the Central Limit Theorem logic provided in the notes, why a mixture of independent signals usually has a kurtosis closer to 0 (Gaussian) than the original signals.
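Open-book aid: exact calculations for Q2 directly from the four-point “diamond” pmf (no simulation needed).

```python
import numpy as np

pts = np.array([(0, 1), (0, -1), (1, 0), (-1, 0)], dtype=float)
probs = np.full(4, 0.25)
S1, S2 = pts[:, 0], pts[:, 1]

# (a) Cov(S1, S2) = E[S1 S2] = 0 (both means are zero)
print(np.sum(probs * S1 * S2))

# (b) independence fails: P(S1 = 1, S2 = 1) = 0, but P(S1 = 1) * P(S2 = 1) = 1/16
print(probs[(S1 == 1) & (S2 == 1)].sum(), probs[S1 == 1].sum() * probs[S2 == 1].sum())

# (c) excess kurtosis of the standardized S1: Var(S1) = 1/2, kurtosis = -1 (sub-Gaussian)
var1 = np.sum(probs * S1**2)
print(var1, np.sum(probs * (S1 / np.sqrt(var1))**4) - 3)
```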
FINAL3-Q3: ICA, Scale-invariant, Optimization, EVT.
In Exercise 10 Q3 you derived the formula for the kurtosis of a sum of independent variables.
The conclusion was that mixing signals generally decreases the absolute kurtosis, making the distribution “more Gaussian”.
Goal: In ICA, our goal is to reverse this process. We seek a linear combination that maximizes non-Gaussianity to recover the original independent sources.
Setup:
Let $z_1, z_2$ be independent random variables with zero mean, unit variance ($Var(z_i)=1$), and identical kurtosis $\mathcal{K}(z_1) = \mathcal{K}(z_2) = \kappa > 0$. We are searching for an unmixing vector $w = (w_1, w_2)^\top$ such that the linear combination $y = w_1 z_1 + w_2 z_2$ recovers one of the sources. We impose the whitening constraint: $w_1^2 + w_2^2 = 1$.
(a) [3 marks] Proof of Scale-Invariance
Before analyzing the mixture, we must understand how scaling affects non-Gaussianity. Let $X$ be a random variable and let $w$ be a non-zero scalar constant. Using the definition of excess kurtosis:
$$\mathcal{K}(X) = \frac{E[(X - \mu_X)^4]}{(\sigma_X^2)^2} - 3$$
Prove that $\mathcal{K}(wX) = \mathcal{K}(X)$.
(This implies that simply amplifying a signal does not make it more or less Gaussian).
(b) [3 marks] Derivation of the Objective Function
We want to express the kurtosis of our recovered signal $y$ in terms of the weights $w_1, w_2$. Recall the formula you derived in Exercise 10 Q3 for the sum of two independent variables $A$ and $B$:
$$\mathcal{K}(A+B) = \frac{\sigma_A^4 \mathcal{K}(A) + \sigma_B^4 \mathcal{K}(B)}{(\sigma_A^2 + \sigma_B^2)^2}$$
Let $A = w_1 z_1$ and $B = w_2 z_2$.
Derive the expression for $\mathcal{K}(y)$ solely in terms of $\kappa, w_1,$ and $w_2$. (Hint: You must apply the Scale-Invariance property proven in part (a) and the Whitening constraint provided in the Setup).
(c) [4 marks] Optimization & Source Recovery
We define our ICA objective function as maximizing the absolute non-Gaussianity: $J(w) = |\mathcal{K}(y)|$. Based on your result from (b):
Show mathematically that under the constraint $w_1^2 + w_2^2 = 1$, the maximum value of $J(w)$ occurs only at the boundaries (endpoints) of the parameter space: $w = (1, 0)$ or $w = (0, 1)$, up to sign. (Hint: You may treat $a = w_1^2$ as a variable on the interval $[0,1]$).
Explain the physical meaning: What does the solution $w=(1,0)$ imply about the relationship between our recovered signal $y$ and the original sources $z_1, z_2$?
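Open-book aid: a short numpy sketch of (c), assuming the simplification from (b), namely $\mathcal{K}(y) = (w_1^4 + w_2^4)\,\kappa = \big(a^2 + (1-a)^2\big)\kappa$ with $a = w_1^2$ (the value of $\kappa$ below is illustrative).

```python
import numpy as np

kappa = 1.5                                # illustrative positive source kurtosis
a = np.linspace(0.0, 1.0, 1001)            # a = w1^2 on [0, 1]
J = np.abs((a**2 + (1 - a)**2) * kappa)    # |K(y)| as a function of a

print(J.argmax() in (0, len(a) - 1))       # the maximum sits at an endpoint (a = 0 or 1)
print(J[0], J[len(a) // 2], J[-1])         # |kappa| at the endpoints, |kappa|/2 at a = 1/2
```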