Question 1
Consider the linear regression model
$$ y = X\beta + \epsilon, $$where $X \in \mathbb{R}^{n \times p}$ is a fixed column-orthogonal design matrix (i.e., $X^\top X = I_p$), $\beta \in \mathbb{R}^p$ is a vector of unknown regression coefficients, and $\epsilon \sim N_n(0, \sigma^2 I_n)$ is a Gaussian noise vector. Assume that $\sigma^2$ is fixed and known. The response vector follows a multivariate normal distribution:
$$ y \sim N_n(X\beta, \sigma^2 I_n). $$(a) Write down the log-likelihood function $\ell(y_1, \dots, y_n; \beta)$, and explain why finding the maximum likelihood estimate (MLE) for $\beta$ is equivalent to solving the least-squares problem, i.e., minimizing $\|y - X\beta\|^2$ with respect to $\beta$.
(b) From the previous part, we conclude that the MLE is $\hat{\beta} = X^\top y$. What is the distribution of $\hat{\beta}$?
(c) Suppose you would like to test the hypothesis $H_0: \beta = 0$ vs. $H_1: \beta \neq 0$. Compute the test statistic corresponding to this hypothesis and indicate what distribution this statistic follows under the null hypothesis.
(d) Suppose the p-value you obtained is less than 0.05. What can we conclude about the components of $\beta = (\beta_1, \dots, \beta_p)^\top$?
Solution 1
(a) The probability density function for $y$ is
$$ f(y; \beta) = (2\pi \sigma^2)^{-n/2} \exp\left( -\frac{1}{2\sigma^2} \|y - X\beta\|^2 \right). $$Taking the logarithm, the log-likelihood function is:
$$ \ell(y; \beta) = -\frac{n}{2} \log(2\pi) - \frac{n}{2} \log(\sigma^2) - \frac{1}{2\sigma^2} \|y - X\beta\|^2. $$Since the first two terms are constants with respect to $\beta$ and $\frac{1}{2\sigma^2} > 0$, maximizing $\ell(y; \beta)$ is equivalent to minimizing the term $\|y - X\beta\|^2$. Thus, the MLE is equivalent to the least-squares solution.
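As a quick numerical check (not part of the derivation above), the sketch below simulates data from such a model and confirms that the log-likelihood is maximized exactly at the least-squares solution, which for a column-orthogonal design is $X^\top y$. The sample size, coefficients, and random seed are arbitrary choices for illustration.

```python
# Sketch: for an orthogonal design, the log-likelihood peaks at beta_hat = X^T y,
# the minimizer of ||y - X beta||^2. Parameters below are arbitrary.
import numpy as np

rng = np.random.default_rng(0)
n, p, sigma2 = 50, 3, 1.0

# Build a column-orthogonal X via QR, so X^T X = I_p.
X, _ = np.linalg.qr(rng.normal(size=(n, p)))
beta_true = np.array([1.0, -2.0, 0.5])
y = X @ beta_true + np.sqrt(sigma2) * rng.normal(size=n)

def log_likelihood(beta):
    rss = np.sum((y - X @ beta) ** 2)
    return -0.5 * n * np.log(2 * np.pi * sigma2) - rss / (2 * sigma2)

beta_hat = X.T @ y                        # least-squares / MLE for orthogonal X
print(log_likelihood(beta_hat))           # maximal value
print(log_likelihood(beta_hat + 0.1))     # any perturbation gives a lower value
```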
(b) Since $y \sim N_n(X\beta, \sigma^2 I_n)$ and $\hat{\beta} = X^\top y$ is a linear transformation of $y$, the estimator $\hat{\beta}$ is itself Gaussian. Its mean and covariance are:
$$ E[\hat{\beta}] = X^\top E[y] = X^\top (X\beta) = (X^\top X)\beta = \beta, $$$$ Var(\hat{\beta}) = X^\top Var(y) X = X^\top (\sigma^2 I_n) X = \sigma^2 (X^\top X) = \sigma^2 I_p. $$Therefore, the distribution is:
$$ \hat{\beta} \sim N_p(\beta, \sigma^2 I_p). $$(c) Under the null hypothesis $H_0: \beta = 0$, we have $\hat{\beta} \sim N_p(0, \sigma^2 I_p)$. The test statistic is:
$$ \chi^2 = (\hat{\beta} - 0)^\top (\sigma^2 I_p)^{-1} (\hat{\beta} - 0) = \frac{\|\hat{\beta}\|^2}{\sigma^2}. $$This statistic follows a Chi-squared distribution with $p$ degrees of freedom:
$$ \frac{\|\hat{\beta}\|^2}{\sigma^2} \sim \chi^2_p. $$(d) Since the p-value is below 0.05, we reject the null hypothesis $H_0: \beta = 0$ at the 5% level and conclude that $\beta$ is not the zero vector, i.e., at least one component $\beta_i$ is non-zero. Note that this global test does not indicate which component(s) are non-zero.
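The following sketch (an illustration, not part of the solution) carries out the test from parts (b)–(d) on data simulated under the null, so the statistic should behave like a $\chi^2_p$ draw. The design, sample size, and noise level are arbitrary assumptions.

```python
# Sketch: compute the chi-squared statistic ||beta_hat||^2 / sigma^2 and its
# p-value on data generated under H0: beta = 0.
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
n, p, sigma2 = 100, 4, 2.0

X, _ = np.linalg.qr(rng.normal(size=(n, p)))   # column-orthogonal: X^T X = I_p
y = np.sqrt(sigma2) * rng.normal(size=n)       # data generated under H0: beta = 0

beta_hat = X.T @ y                             # MLE from part (b)
chi2_stat = beta_hat @ beta_hat / sigma2       # ||beta_hat||^2 / sigma^2
p_value = stats.chi2.sf(chi2_stat, df=p)       # upper-tail probability
print(chi2_stat, p_value)
```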
Question 2
Consider the following dataset with 3 observations and 2 features:
$$ X = \begin{pmatrix} 0 & 0 \\ 1 & 1 \\ -1 & -1 \end{pmatrix} \in \mathbb{R}^{3 \times 2}. $$(a) Find the first principal component direction $v_1$. (Hint: drawing a scatterplot may help.)
(b) How much variance is explained by the first principal component?
(c) Find the second principal component direction $v_2$. How much variance is explained by the second component?
(d) Draw a biplot using the first and second principal components, and indicate the proportion of variance explained by each component.
Solution 2
(a) By inspecting the data or drawing a scatterplot, we observe that the observations $(0,0)$, $(1,1)$, and $(-1,-1)$ have sample mean $(0,0)$ (the data are already centered) and lie exactly on the line through the origin where the second feature equals the first. The direction of maximum variance therefore aligns with this line, i.e., with the vector $(1, 1)^\top$. Normalizing to unit length, we get the first principal component direction:
$$ v_1 = \begin{pmatrix} \frac{1}{\sqrt{2}} \\ \frac{1}{\sqrt{2}} \end{pmatrix}. $$(b) The projected scores $z_{i(1)} = x_i^\top v_1$ on the first principal component are:
$$ \begin{aligned} z_{1(1)} &= 0 \cdot \frac{1}{\sqrt{2}} + 0 \cdot \frac{1}{\sqrt{2}} = 0 \\ z_{2(1)} &= 1 \cdot \frac{1}{\sqrt{2}} + 1 \cdot \frac{1}{\sqrt{2}} = \sqrt{2} \\ z_{3(1)} &= -1 \cdot \frac{1}{\sqrt{2}} - 1 \cdot \frac{1}{\sqrt{2}} = -\sqrt{2} \end{aligned} $$The variance explained by the first component is the variance of these projections:
$$ \lambda_1 = \text{Var}(z_{(1)}) = \frac{1}{3} \sum_{i=1}^3 (z_{i(1)} - \bar{z}_{(1)})^2 = \frac{1}{3} (0^2 + (\sqrt{2})^2 + (-\sqrt{2})^2) = \frac{4}{3}. $$(c) The second principal component direction $v_2$ must be orthogonal to $v_1$ and have unit norm. In $\mathbb{R}^2$, the direction orthogonal to $(1, 1)^\top$ is $(-1, 1)^\top$ (or $(1, -1)^\top$). Normalizing this:
$$ v_2 = \begin{pmatrix} -\frac{1}{\sqrt{2}} \\ \frac{1}{\sqrt{2}} \end{pmatrix}. $$Calculating the projections $z_{i(2)} = x_i^\top v_2$:
$$ \begin{aligned} z_{1(2)} &= 0 \\ z_{2(2)} &= 1(-\frac{1}{\sqrt{2}}) + 1(\frac{1}{\sqrt{2}}) = 0 \\ z_{3(2)} &= -1(-\frac{1}{\sqrt{2}}) - 1(\frac{1}{\sqrt{2}}) = 0 \end{aligned} $$Since all projected values are 0, the variance explained is:
$$ \lambda_2 = \text{Var}(z_{(2)}) = 0. $$(d) The total variance is $\lambda_1 + \lambda_2 = 4/3 + 0 = 4/3$. The proportion of variance explained (PVE) is:
$$ \begin{aligned} \text{PVE}_1 &= \frac{4/3}{4/3} = 100\% \\ \text{PVE}_2 &= \frac{0}{4/3} = 0\% \end{aligned} $$The biplot (score plot) would display the observations in the new coordinate system $(z_{(1)}, z_{(2)})$:
- Observation 1: $(0, 0)$
- Observation 2: $(\sqrt{2}, 0)$
- Observation 3: $(-\sqrt{2}, 0)$
All points lie on the PC1 axis, reflecting that 100% of the variance is captured by the first component.
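For completeness, here is a small numpy sketch verifying the hand computations above; it uses the same $1/n$ variance convention as the solution, and the reversal of `eigh`'s ascending output is just to list components in decreasing order of variance.

```python
# Sketch: PCA on the 3x2 dataset from Question 2 via the covariance matrix.
import numpy as np

X = np.array([[0.0, 0.0],
              [1.0, 1.0],
              [-1.0, -1.0]])
Xc = X - X.mean(axis=0)                  # already centered here

S = Xc.T @ Xc / X.shape[0]               # sample covariance with the 1/n convention
eigvals, eigvecs = np.linalg.eigh(S)     # ascending eigenvalue order
print(eigvals[::-1])                     # [4/3, 0]
print(eigvecs[:, ::-1])                  # columns ~ (1,1)/sqrt(2), (-1,1)/sqrt(2) up to sign
print(Xc @ eigvecs[:, ::-1])             # scores: PC1 = (0, sqrt(2), -sqrt(2)) up to sign, PC2 = 0
```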
Question 3
[4] Suppose we apply kernel PCA to a dataset
$$ X = \begin{pmatrix} -x_1^\top- \\ \vdots \\ -x_n^\top- \end{pmatrix} \in \mathbb{R}^{n \times p} $$using a kernel function $K(a, b)$. We compute the kernel matrix $K \in \mathbb{R}^{n \times n}$ with entries $K_{ij} = K(x_i, x_j)$.
(a) [2] Is the kernel matrix $K$ always positive semidefinite? Explain why or why not.
(b) [2] Explain why using the linear kernel $K(a, b) = \langle a, b \rangle$ in kernel PCA is equivalent to performing standard PCA on the original dataset.
Solution 3
(a) Yes, the kernel matrix $K$ is always positive semidefinite. By definition, a valid kernel corresponds to an inner product in some feature space: $K(a, b) = \langle \Phi(a), \Phi(b) \rangle$ for a feature map $\Phi$. Writing $\Phi$ for the matrix whose $i$-th row is $\Phi(x_i)^\top$, the kernel matrix is $K = \Phi \Phi^\top$. For any vector $v \in \mathbb{R}^n$:
$$ v^\top K v = v^\top \Phi \Phi^\top v = (\Phi^\top v)^\top (\Phi^\top v) = \|\Phi^\top v\|^2 \geq 0. $$Since the quadratic form is non-negative for any $v$, $K$ is positive semidefinite.
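As an empirical illustration (the RBF kernel and random data below are arbitrary choices, not from the question), one can build a kernel matrix and check that its smallest eigenvalue is non-negative up to floating-point error:

```python
# Sketch: numerically check positive semidefiniteness of a Gaussian/RBF kernel matrix.
import numpy as np

rng = np.random.default_rng(2)
X = rng.normal(size=(20, 5))

def rbf_kernel(a, b, gamma=0.5):
    return np.exp(-gamma * np.sum((a - b) ** 2))

K = np.array([[rbf_kernel(xi, xj) for xj in X] for xi in X])
print(np.linalg.eigvalsh(K).min())   # smallest eigenvalue: ~0 or positive
```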
(b) The linear kernel $K(a, b) = \langle a, b \rangle$ corresponds to the identity feature map $\Phi(x) = x$, so the kernel matrix is the Gram matrix $K = XX^\top$. Standard PCA diagonalizes the covariance matrix $S \propto X^\top X$, while kernel PCA diagonalizes $K = XX^\top$ (assuming the data are centered; for the linear kernel, centering the kernel matrix is equivalent to centering the columns of $X$). By SVD duality, if $X = UDV^\top$, then $X^\top X = VD^2V^\top$ and $XX^\top = UD^2U^\top$ share the same non-zero eigenvalues, and the principal component scores $XV = UD$ can be recovered directly from the eigenvectors and eigenvalues of $XX^\top$. Therefore, kernel PCA with a linear kernel yields the same explained variances and the same projections as standard PCA on the original data.
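To make the duality concrete, here is a sketch on arbitrary random centered data (not from the question) comparing the two decompositions; the non-zero eigenvalues coincide and the scores agree up to sign.

```python
# Sketch: standard PCA via X^T X versus linear-kernel PCA via the Gram matrix X X^T.
import numpy as np

rng = np.random.default_rng(3)
X = rng.normal(size=(30, 4))
X -= X.mean(axis=0)                       # center the columns

# Standard PCA: eigendecomposition of X^T X (proportional to the covariance).
evals_cov, V = np.linalg.eigh(X.T @ X)
scores_pca = X @ V[:, ::-1]               # scores, components in decreasing order

# Linear-kernel PCA: eigendecomposition of the Gram matrix X X^T; scores are U * D.
evals_gram, U = np.linalg.eigh(X @ X.T)
scores_kpca = U[:, ::-1] * np.sqrt(np.maximum(evals_gram[::-1], 0))

print(np.allclose(sorted(evals_cov, reverse=True),
                  sorted(evals_gram, reverse=True)[:4]))            # shared non-zero eigenvalues
print(np.allclose(np.abs(scores_pca), np.abs(scores_kpca[:, :4])))  # same scores up to sign
```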
Question 4
[6] Consider two centered feature columns $f_1 \in \mathbb{R}^n$ and $f_2 \in \mathbb{R}^n$ that are uncorrelated and have unit sample variance:
$$ s_{f_1}^2 = s_{f_2}^2 = 1, \quad s_{f_1 f_2} = 0. $$We observe two datasets:
$$ X = \begin{pmatrix} | & | \\ f_1 & f_2 \\ | & | \end{pmatrix} \in \mathbb{R}^{n \times 2}, \quad Y = \begin{pmatrix} | \\ f_1 \\ | \end{pmatrix} \in \mathbb{R}^{n \times 1}. $$(a) [2] Write down the sample covariance matrices $S_X$, $S_Y$, and $S_{XY}$ required for canonical correlation analysis.
(b) [2] What is the first canonical correlation between $X$ and $Y$? (Hint: you can solve this question without computing an SVD, just use the structure of the data.)
(c) [2] Determine the corresponding canonical directions $u_1$ for $X$ and $v_1$ for $Y$.
Solution 4
(a) Given that $f_1$ and $f_2$ have unit variance and are uncorrelated:
$$ S_X = \begin{pmatrix} Var(f_1) & Cov(f_1, f_2) \\ Cov(f_2, f_1) & Var(f_2) \end{pmatrix} = \begin{pmatrix} 1 & 0 \\ 0 & 1 \end{pmatrix} = I_2, $$$$ S_Y = Var(f_1) = 1. $$The cross-covariance matrix $S_{XY}$ contains the covariance between columns of $X$ and $Y$:
$$ S_{XY} = \begin{pmatrix} Cov(f_1, f_1) \\ Cov(f_2, f_1) \end{pmatrix} = \begin{pmatrix} 1 \\ 0 \end{pmatrix}. $$(b) CCA seeks linear combinations of $X$ and $Y$ that maximize correlation. Notice that the single column in $Y$ ($f_1$) is identical to the first column of $X$. If we choose the linear combination $U = 1 \cdot f_1 + 0 \cdot f_2$ from $X$ and $V = 1 \cdot f_1$ from $Y$, the variables $U$ and $V$ are identical. The correlation of a variable with itself is 1. Since correlation cannot exceed 1, the first canonical correlation is 1.
(c) The canonical directions $u_1 \in \mathbb{R}^2$ and $v_1 \in \mathbb{R}^1$ must maximize the correlation subject to the variance constraints $u^\top S_X u = 1$ and $v^\top S_Y v = 1$. Based on part (b), we select the first feature of $X$ and the only feature of $Y$:
$$ u_1 = \begin{pmatrix} 1 \\ 0 \end{pmatrix}, \quad v_1 = 1. $$Checking constraints:
$$ u_1^\top S_X u_1 = \begin{pmatrix} 1 & 0 \end{pmatrix} \begin{pmatrix} 1 & 0 \\ 0 & 1 \end{pmatrix} \begin{pmatrix} 1 \\ 0 \end{pmatrix} = 1. $$$$ v_1^\top S_Y v_1 = 1 \cdot 1 \cdot 1 = 1. $$Both constraints are satisfied.
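As a final numerical illustration (not required by the question), the sketch below constructs two uncorrelated, unit-variance, centered columns and confirms that the first canonical correlation between $X = (f_1, f_2)$ and $Y = f_1$ is 1. The sample size, random seed, and orthogonalization step are assumptions made for the example.

```python
# Sketch: verify the CCA result of Question 4 on synthetic f1, f2.
import numpy as np

rng = np.random.default_rng(4)
n = 200
Z = rng.normal(size=(n, 2))
Z -= Z.mean(axis=0)

# Orthonormalize and rescale so the sample covariance (1/n convention) is I_2.
Q, _ = np.linalg.qr(Z)
F = Q * np.sqrt(n)                      # columns f1, f2: variance 1, covariance 0
f1, f2 = F[:, 0], F[:, 1]

X = np.column_stack([f1, f2])
Y = f1[:, None]

S_X = X.T @ X / n                       # ~ I_2
S_Y = Y.T @ Y / n                       # ~ 1
S_XY = X.T @ Y / n                      # ~ (1, 0)^T

# The first canonical correlation is the largest singular value of
# S_X^{-1/2} S_XY S_Y^{-1/2}; since S_X = I_2 and S_Y = 1, this reduces to S_XY.
rho1 = np.linalg.svd(S_XY, compute_uv=False)[0]
print(rho1)                             # ~1.0
```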