Question 1

Consider the linear regression model

$$ y = X\beta + \epsilon, $$

where $X \in \mathbb{R}^{n \times p}$ is a fixed column-orthogonal design matrix (i.e., $X^\top X = I_p$), $\beta \in \mathbb{R}^p$ is a vector of unknown regression coefficients, and $\epsilon \sim N_n(0, \sigma^2 I_n)$ is a Gaussian noise vector. Assume that $\sigma^2$ is fixed and known. The response vector follows a multivariate normal distribution:

$$ y \sim N_n(X\beta, \sigma^2 I_n). $$

(a) Write down the log-likelihood function $\ell(y_1, \dots, y_n; \beta)$, and explain why finding the maximum likelihood estimate (MLE) for $\beta$ is equivalent to solving the least-squares problem, i.e., minimizing $\|y - X\beta\|^2$ with respect to $\beta$.

(b) From the previous part, we conclude that the MLE is $\hat{\beta} = X^\top y$. What is the distribution of $\hat{\beta}$?

(c) Suppose you would like to test the hypothesis $H_0: \beta = 0$ vs. $H_1: \beta \neq 0$. Compute the test statistic corresponding to this hypothesis and indicate what distribution this statistic follows under the null hypothesis.

(d) Suppose the p-value you obtained is less than 0.05. What can we conclude about the components of $\beta = (\beta_1, \dots, \beta_p)^\top$?


Solution 1

(a) The probability density function for $y$ is

$$ f(y; \beta) = (2\pi \sigma^2)^{-n/2} \exp\left( -\frac{1}{2\sigma^2} \|y - X\beta\|^2 \right). $$

Taking the logarithm, the log-likelihood function is:

$$ \ell(y; \beta) = -\frac{n}{2} \log(2\pi) - \frac{n}{2} \log(\sigma^2) - \frac{1}{2\sigma^2} \|y - X\beta\|^2. $$

Since the first two terms are constant with respect to $\beta$ and $\frac{1}{2\sigma^2} > 0$, maximizing $\ell(y; \beta)$ is equivalent to minimizing $\|y - X\beta\|^2$. Thus, the MLE coincides with the least-squares solution; because $X^\top X = I_p$, this solution is $\hat{\beta} = (X^\top X)^{-1} X^\top y = X^\top y$.
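
As a quick numerical sanity check (not part of the required derivation), the following NumPy sketch builds a small column-orthogonal design via a QR decomposition, using hypothetical values of $n$, $p$, $\sigma$, and $\beta$, and confirms that the least-squares solution coincides with $X^\top y$:

```python
import numpy as np

rng = np.random.default_rng(0)
n, p, sigma = 50, 3, 1.0                         # hypothetical sizes and noise level

# Column-orthogonal design: Q from a reduced QR decomposition satisfies Q.T @ Q = I_p.
X, _ = np.linalg.qr(rng.normal(size=(n, p)))
beta = np.array([2.0, -1.0, 0.5])                # hypothetical coefficients
y = X @ beta + sigma * rng.normal(size=n)

beta_ls = np.linalg.lstsq(X, y, rcond=None)[0]   # minimizer of ||y - X b||^2
beta_mle = X.T @ y                               # closed form when X.T @ X = I_p

print(np.allclose(beta_ls, beta_mle))            # True (up to numerical precision)
```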

(b) Since $y \sim N_n(X\beta, \sigma^2 I_n)$ and $\hat{\beta} = X^\top y$ is a linear transformation of $y$:

$$ E[\hat{\beta}] = X^\top E[y] = X^\top (X\beta) = (X^\top X)\beta = \beta, $$

$$ Var(\hat{\beta}) = X^\top Var(y) X = X^\top (\sigma^2 I_n) X = \sigma^2 (X^\top X) = \sigma^2 I_p. $$

Therefore, the distribution is:

$$ \hat{\beta} \sim N_p(\beta, \sigma^2 I_p). $$
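
Continuing the same hypothetical setup, a short Monte Carlo sketch (illustrative only) can check the mean and covariance of $\hat{\beta}$ empirically:

```python
import numpy as np

rng = np.random.default_rng(0)
n, p, sigma = 50, 3, 1.0                          # hypothetical sizes and noise level
X, _ = np.linalg.qr(rng.normal(size=(n, p)))      # column-orthogonal design
beta = np.array([2.0, -1.0, 0.5])                 # hypothetical coefficients

# Draw many responses and form beta_hat = X.T @ y for each replicate.
betas = np.array([X.T @ (X @ beta + sigma * rng.normal(size=n)) for _ in range(20000)])
print(betas.mean(axis=0))                         # approximately beta
print(np.cov(betas, rowvar=False))                # approximately sigma^2 * I_p
```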

(c) Under the null hypothesis $H_0: \beta = 0$, we have $\hat{\beta} \sim N_p(0, \sigma^2 I_p)$. The test statistic is:

$$ \chi^2 = (\hat{\beta} - 0)^\top (\sigma^2 I_p)^{-1} (\hat{\beta} - 0) = \frac{\|\hat{\beta}\|^2}{\sigma^2}. $$

Since $\sigma^2$ is known, this statistic follows a chi-squared distribution with $p$ degrees of freedom under $H_0$:

$$ \frac{\|\hat{\beta}\|^2}{\sigma^2} \sim \chi^2_p. $$
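
For illustration (assuming SciPy is available; the values of $\hat{\beta}$ and $\sigma^2$ below are hypothetical), the statistic and its p-value could be computed as:

```python
import numpy as np
from scipy import stats

sigma2 = 1.0                              # known noise variance (assumed for illustration)
beta_hat = np.array([0.9, -1.4, 0.3])     # hypothetical MLE X^T y
p = beta_hat.size

chi2_stat = beta_hat @ beta_hat / sigma2  # ||beta_hat||^2 / sigma^2
p_value = stats.chi2.sf(chi2_stat, df=p)  # P(chi^2_p >= observed statistic)
print(chi2_stat, p_value)
```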

(d) A p-value below 0.05 means we reject $H_0: \beta = 0$ at the 5% significance level. We therefore conclude that $\beta$ is not the zero vector, i.e., at least one component $\beta_i$ is non-zero; the test does not tell us which components are non-zero.


Question 2

Consider the following dataset with 3 observations and 2 features:

$$ X = \begin{pmatrix} 0 & 0 \\ 1 & 1 \\ -1 & -1 \end{pmatrix} \in \mathbb{R}^{3 \times 2}. $$

(a) Find the first principal component direction $v_1$. (Hint: drawing a scatterplot may help.)

(b) How much variance is explained by the first principal component?

(c) Find the second principal component direction $v_2$. How much variance is explained by the second component?

(d) Draw a biplot using the first and second principal components, and indicate the proportion of variance explained by each component.


Solution 2

(a) By inspecting the data or drawing a scatterplot, we observe that the points $x_1=(0,0)^\top$, $x_2=(1,1)^\top$, and $x_3=(-1,-1)^\top$ are centered at the origin and lie exactly on the line where the second feature equals the first. The direction of maximum variance aligns with this line, i.e., with the vector $(1, 1)^\top$. Normalizing to unit length, we get the first principal component direction:

$$ v_1 = \begin{pmatrix} \frac{1}{\sqrt{2}} \\ \frac{1}{\sqrt{2}} \end{pmatrix}. $$

(b) The projected scores $z_{i(1)} = x_i^\top v_1$ on the first principal component are:

$$ \begin{aligned} z_{1(1)} &= 0 \cdot \frac{1}{\sqrt{2}} + 0 \cdot \frac{1}{\sqrt{2}} = 0 \\ z_{2(1)} &= 1 \cdot \frac{1}{\sqrt{2}} + 1 \cdot \frac{1}{\sqrt{2}} = \sqrt{2} \\ z_{3(1)} &= -1 \cdot \frac{1}{\sqrt{2}} - 1 \cdot \frac{1}{\sqrt{2}} = -\sqrt{2} \end{aligned} $$

The variance explained by the first component is the variance of these projections (using the $1/n$ convention for the sample variance):

$$ \lambda_1 = \text{Var}(z_{(1)}) = \frac{1}{3} \sum_{i=1}^3 (z_{i(1)} - \bar{z}_{(1)})^2 = \frac{1}{3} (0^2 + (\sqrt{2})^2 + (-\sqrt{2})^2) = \frac{4}{3}. $$

(c) The second principal component direction $v_2$ must be orthogonal to $v_1$ and have unit norm. In $\mathbb{R}^2$, the direction orthogonal to $(1, 1)^\top$ is $(-1, 1)^\top$ (or $(1, -1)^\top$). Normalizing this:

$$ v_2 = \begin{pmatrix} -\frac{1}{\sqrt{2}} \\ \frac{1}{\sqrt{2}} \end{pmatrix}. $$

Calculating the projections $z_{i(2)} = x_i^\top v_2$:

$$ \begin{aligned} z_{1(2)} &= 0 \\ z_{2(2)} &= 1(-\frac{1}{\sqrt{2}}) + 1(\frac{1}{\sqrt{2}}) = 0 \\ z_{3(2)} &= -1(-\frac{1}{\sqrt{2}}) - 1(\frac{1}{\sqrt{2}}) = 0 \end{aligned} $$

Since all projected values are 0, the variance explained is:

$$ \lambda_2 = \text{Var}(z_{(2)}) = 0. $$

(d) The total variance is $\lambda_1 + \lambda_2 = 4/3 + 0 = 4/3$. The proportion of variance explained (PVE) is:

$$ \begin{aligned} \text{PVE}_1 &= \frac{4/3}{4/3} = 100\% \\ \text{PVE}_2 &= \frac{0}{4/3} = 0\% \end{aligned} $$

The biplot displays the observations as scores in the new coordinate system $(z_{(1)}, z_{(2)})$:

  • Observation 1: $(0, 0)$
  • Observation 2: $(\sqrt{2}, 0)$
  • Observation 3: $(-\sqrt{2}, 0)$

All points lie on the PC1 axis, reflecting that 100% of the variance is captured by the first component. To complete the biplot, the feature loadings are overlaid as arrows from the origin: feature 1 at $(1/\sqrt{2}, -1/\sqrt{2})$ and feature 2 at $(1/\sqrt{2}, 1/\sqrt{2})$ in the (PC1, PC2) coordinates.
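
The following NumPy sketch (using the same $1/n$ variance convention as above) reproduces these quantities numerically; note that eigenvector signs are arbitrary:

```python
import numpy as np

X = np.array([[0.0, 0.0], [1.0, 1.0], [-1.0, -1.0]])
Xc = X - X.mean(axis=0)               # the data are already centered; shown for generality

S = Xc.T @ Xc / Xc.shape[0]           # sample covariance with the 1/n convention
eigvals, eigvecs = np.linalg.eigh(S)  # eigh returns eigenvalues in ascending order
eigvals, eigvecs = eigvals[::-1], eigvecs[:, ::-1]

print(eigvals)                        # approximately [4/3, 0]
print(eigvecs[:, 0])                  # +/- (1/sqrt(2), 1/sqrt(2)); sign is arbitrary
print(Xc @ eigvecs)                   # scores: (0, 0), (+/-sqrt(2), 0), (-/+sqrt(2), 0)
print(eigvals / eigvals.sum())        # PVE: approximately [1.0, 0.0]
```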



Question 3

[4] Suppose we apply kernel PCA to a dataset

$$ X = \begin{pmatrix} -x_1^\top- \\ \vdots \\ -x_n^\top- \end{pmatrix} \in \mathbb{R}^{n \times p} $$

using a kernel function $K(a, b)$. We compute the kernel matrix $K \in \mathbb{R}^{n \times n}$ with entries $K_{ij} = K(x_i, x_j)$.

(a) [2] Is the kernel matrix $K$ always positive semidefinite? Explain why or why not.

(b) [2] Explain why using the linear kernel $K(a, b) = \langle a, b \rangle$ in kernel PCA is equivalent to performing standard PCA on the original dataset.


Solution 3

(a) Yes, the kernel matrix $K$ is always positive semidefinite. By definition, a valid kernel corresponds to an inner product in some feature space via a feature map $\Phi$, so $K_{ij} = \langle \Phi(x_i), \Phi(x_j) \rangle$ and we can write $K = \Phi \Phi^\top$, where the $i$-th row of $\Phi$ is $\Phi(x_i)^\top$. For any vector $v \in \mathbb{R}^n$:

$$ v^\top K v = v^\top \Phi \Phi^\top v = (\Phi^\top v)^\top (\Phi^\top v) = \|\Phi^\top v\|^2 \geq 0. $$

Since the quadratic form is non-negative for any $v$, $K$ is positive semidefinite.
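
As a numerical illustration, here is a sketch using a Gaussian (RBF) kernel, one example of a valid kernel chosen purely for illustration, showing that the smallest eigenvalue of $K$ is non-negative up to floating-point error:

```python
import numpy as np

def rbf_kernel(a, b, gamma=1.0):
    """Gaussian (RBF) kernel K(a, b) = exp(-gamma * ||a - b||^2)."""
    return np.exp(-gamma * np.sum((a - b) ** 2))

rng = np.random.default_rng(0)
X = rng.normal(size=(20, 3))                     # arbitrary example data

K = np.array([[rbf_kernel(xi, xj) for xj in X] for xi in X])
print(np.linalg.eigvalsh(K).min())               # >= 0, up to tiny numerical error
```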

(b) The linear kernel is defined as $K(a, b) = \langle a, b \rangle$, which corresponds to the identity feature map $\Phi(x) = x$. In this case the kernel matrix is the Gram matrix $K = XX^\top$. Standard PCA diagonalizes the covariance matrix $S \propto X^\top X$ (assuming the columns of $X$ are centered), whereas kernel PCA diagonalizes $K = XX^\top$. By SVD duality, $X^\top X$ and $XX^\top$ share the same non-zero eigenvalues (variances), and their eigenvectors are linked through $X$: if $K u = \lambda u$ with $\|u\| = 1$, then $v = X^\top u / \sqrt{\lambda}$ is a unit eigenvector of $X^\top X$, and the kernel PCA scores $\sqrt{\lambda}\, u$ equal the standard PCA scores $Xv$. Therefore, kernel PCA with a linear kernel yields the same principal components and projections as standard PCA on the original data.
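
A short sketch of this eigenvalue duality on hypothetical centered data (so that $S \propto X^\top X$):

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(10, 4))                           # hypothetical data, n = 10, p = 4
Xc = X - X.mean(axis=0)                                # center so the Gram matrix matches PCA

gram_eigs = np.linalg.eigvalsh(Xc @ Xc.T)[::-1][:4]    # top eigenvalues of X X^T (linear-kernel PCA)
cov_eigs = np.linalg.eigvalsh(Xc.T @ Xc)[::-1]         # eigenvalues of X^T X (standard PCA, up to 1/n)

print(np.allclose(gram_eigs, cov_eigs))                # True: the non-zero spectra coincide
```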


Question 4

[6] Consider two centered feature columns $f_1 \in \mathbb{R}^n$ and $f_2 \in \mathbb{R}^n$ that are uncorrelated and have unit sample variance:

$$ s_{f_1}^2 = s_{f_2}^2 = 1, \quad s_{f_1 f_2} = 0. $$

We observe two datasets:

$$ X = \begin{pmatrix} | & | \\ f_1 & f_2 \\ | & | \end{pmatrix} \in \mathbb{R}^{n \times 2}, \quad Y = \begin{pmatrix} | \\ f_1 \\ | \end{pmatrix} \in \mathbb{R}^{n \times 1}. $$

(a) [2] Write down the sample covariance matrices $S_X$, $S_Y$, and $S_{XY}$ required for canonical correlation analysis.

(b) [2] What is the first canonical correlation between $X$ and $Y$? (Hint: you can solve this question without computing an SVD, just use the structure of the data.)

(c) [2] Determine the corresponding canonical directions $u_1$ for $X$ and $v_1$ for $Y$.


Solution 4

(a) Given that $f_1$ and $f_2$ have unit variance and are uncorrelated:

$$ S_X = \begin{pmatrix} Var(f_1) & Cov(f_1, f_2) \\ Cov(f_2, f_1) & Var(f_2) \end{pmatrix} = \begin{pmatrix} 1 & 0 \\ 0 & 1 \end{pmatrix} = I_2, $$

$$ S_Y = Var(f_1) = 1. $$

The cross-covariance matrix $S_{XY}$ contains the covariance between columns of $X$ and $Y$:

$$ S_{XY} = \begin{pmatrix} Cov(f_1, f_1) \\ Cov(f_2, f_1) \end{pmatrix} = \begin{pmatrix} 1 \\ 0 \end{pmatrix}. $$

(b) CCA seeks linear combinations of $X$ and $Y$ that maximize correlation. Notice that the single column in $Y$ ($f_1$) is identical to the first column of $X$. If we choose the linear combination $U = 1 \cdot f_1 + 0 \cdot f_2$ from $X$ and $V = 1 \cdot f_1$ from $Y$, the variables $U$ and $V$ are identical. The correlation of a variable with itself is 1. Since correlation cannot exceed 1, the first canonical correlation is 1.

(c) The canonical directions $u_1 \in \mathbb{R}^2$ and $v_1 \in \mathbb{R}^1$ must maximize the correlation subject to the variance constraints $u^\top S_X u = 1$ and $v^\top S_Y v = 1$. Based on part (b), we select the first feature of $X$ and the only feature of $Y$:

$$ u_1 = \begin{pmatrix} 1 \\ 0 \end{pmatrix}, \quad v_1 = 1. $$

Checking constraints:

$$ u_1^\top S_X u_1 = \begin{pmatrix} 1 & 0 \end{pmatrix} \begin{pmatrix} 1 & 0 \\ 0 & 1 \end{pmatrix} \begin{pmatrix} 1 \\ 0 \end{pmatrix} = 1. $$

$$ v_1^\top S_Y v_1 = 1 \cdot 1 \cdot 1 = 1. $$

Both constraints are satisfied.
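
A numerical sketch of this solution (assuming NumPy; the whitening step manufactures centered, sample-uncorrelated, unit-variance features matching the problem setup):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100                                        # hypothetical sample size

# Build two centered features whose sample covariance (1/n convention) is exactly I_2.
F = rng.normal(size=(n, 2))
F -= F.mean(axis=0)
L = np.linalg.cholesky(F.T @ F / n)
F = F @ np.linalg.inv(L).T                     # whiten: now F.T @ F / n == I_2

X, Y = F, F[:, [0]]                            # Y is the first feature column of X

u1 = np.array([1.0, 0.0])                      # canonical direction for X
v1 = np.array([1.0])                           # canonical direction for Y
scores_x, scores_y = X @ u1, Y @ v1
print(np.corrcoef(scores_x, scores_y)[0, 1])   # 1.0: the first canonical correlation
```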