Midterm 1: Multivariate Statistics
1. Sample Mean and Covariance (Data Matrix Properties)
Problem Setup: Let $X \in \mathbb{R}^{10 \times 3}$ be a data matrix where rows correspond to observations and columns correspond to variables.
$$X = \begin{pmatrix} -x_1- \\ \vdots \\ -x_{10}- \end{pmatrix}$$

Assume that $X$ is column-centered (each column has mean 0) and column-orthogonal ($X^T X = I_3$).
(a) Find the sample mean vector $\bar{x}$ and the sample covariance matrix $S$ of $X$.
Sample Mean $\bar{x}$: The sample mean is the vector of the averages of each column. Since $X$ is explicitly stated to be column-centered:
$$ \bar{x} = \begin{pmatrix} \bar{x}_1 \\ \bar{x}_2 \\ \bar{x}_3 \end{pmatrix} = \begin{pmatrix} 0 \\ 0 \\ 0 \end{pmatrix} = \vec{0} \in \mathbb{R}^{3 \times 1} $$

Sample Covariance $S$: Using the definition $S = \frac{1}{n} \sum_{i=1}^n (x_i - \bar{x})(x_i - \bar{x})^T = \frac{X^T C X}{n}$, where $C$ is the centering matrix. Since $X$ is already centered, $CX = X$ and $C$ can be dropped.
$$ S = \frac{X^T X}{n} = \frac{1}{10} I_3 = \begin{pmatrix} 1/10 & 0 & 0 \\ 0 & 1/10 & 0 \\ 0 & 0 & 1/10 \end{pmatrix} $$

(b) Unit Conversion
Suppose $f_1$ represents meters, $f_2$ centimeters, and $f_3$ millimeters. Convert all to centimeters by defining $Y$. What is the sample covariance matrix of $Y$?
Conversion:
- $f_1$ (m) $\to$ cm: Multiply by 100.
- $f_2$ (cm) $\to$ cm: Multiply by 1.
- $f_3$ (mm) $\to$ cm: Multiply by 0.1.
This is a matrix multiplication with a diagonal matrix $D$:
$$Y = XD = X \begin{pmatrix} 100 & 0 & 0 \\ 0 & 1 & 0 \\ 0 & 0 & 0.1 \end{pmatrix}$$

Sample Covariance of $Y$:
$$ \begin{aligned} S_Y &= \frac{Y^T Y}{n} = \frac{(XD)^T (XD)}{n} \\ &= \frac{D^T (X^T X) D}{10} \\ &= \frac{1}{10} D I_3 D = \frac{1}{10} D^2 \\ &= \frac{1}{10} \begin{pmatrix} 100^2 & 0 & 0 \\ 0 & 1^2 & 0 \\ 0 & 0 & 0.1^2 \end{pmatrix} = \begin{pmatrix} 1000 & 0 & 0 \\ 0 & 0.1 & 0 \\ 0 & 0 & 0.001 \end{pmatrix} \end{aligned} $$

(c) Linear Transformation Mean
Let $v \in \mathbb{R}^3$ be a unit vector ($||v||=1$). Define $z = Xv$. Find the sample mean of $z$.
$$ \begin{aligned} \bar{z} &= \text{Mean}(Xv) \\ &= \bar{x}^T v \\ &= \vec{0}^T v \\ &= 0 \end{aligned} $$

(d) Linear Transformation Variance
What is the sample variance of $z$?
$$ \begin{aligned} S_z &= \frac{z^T z}{n} \quad (\text{Since } \bar{z}=0) \\ &= \frac{(Xv)^T (Xv)}{10} \\ &= \frac{v^T (X^T X) v}{10} \\ &= \frac{v^T I v}{10} \\ &= \frac{||v||^2}{10} = \frac{1}{10} \end{aligned} $$

2. Geometric PCA
Dataset: Consider 3 observations with 2 features:
$$X = \begin{pmatrix} 0 & 0 \\ 1 & 1 \\ -1 & -1 \end{pmatrix} \in \mathbb{R}^{3 \times 2}$$

The original space is $\mathbb{R}^2$; all three points lie on the line $y = x$.
(a) Find the first principal component direction $v_1$.
(Hint: drawing a scatterplot may help)
Observing the scatterplot (Points lie on $y=x$): The unit vector along the line $y=x$ is:
$$v_1 = \begin{pmatrix} 1/\sqrt{2} \\ 1/\sqrt{2} \end{pmatrix}$$

(b) How much variance is explained by the first principal component?
The PC scores are the projections (dot products) of the data onto the new axis; the variance explained is the sample variance of these scores. Let $z_{i(1)} = x_i^T v_1$.
- Point 1: $(0,0) \cdot v_1 = 0$
- Point 2: $(1,1) \cdot v_1 = \frac{1}{\sqrt{2}} + \frac{1}{\sqrt{2}} = \sqrt{2}$
- Point 3: $(-1,-1) \cdot v_1 = -\sqrt{2}$
Mean $\bar{z}_1 = 0$.
$$ \lambda_1 = Var(z_{(1)}) = \frac{1}{3} \sum (z_i - \bar{z})^2 = \frac{1}{3} (0 + 2 + 2) = \frac{4}{3} $$

(c) Find the second principal component direction $v_2$.
$v_2$ must be orthogonal to $v_1$ ($v_2 \cdot v_1 = 0$).
$$v_2 = \begin{pmatrix} -1/\sqrt{2} \\ 1/\sqrt{2} \end{pmatrix}$$

Variance Explained: Projections onto $y=-x$:
- $(0,0) \to 0$
- $(1,1) \to 0$
- $(-1,-1) \to 0$
Conclusion: Variance Explained is 0.
(d) Biplot
- PC1 Axis: Points are at $-\sqrt{2}, 0, \sqrt{2}$.
- PC2 Axis: All points project to 0.
- PVE (Proportion of Variance Explained):
  - $PVE_1 = \frac{4/3}{4/3 + 0} = 1$
  - $PVE_2 = \frac{0}{4/3 + 0} = 0$
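All of Problem 2 can be sanity-checked numerically. A minimal NumPy sketch, using the same $1/n$ covariance convention as above (the eigenvector sign from `eigh` is arbitrary, so compare absolute values):

```python
import numpy as np

# The toy dataset from the problem (rows = observations).
X = np.array([[0., 0.], [1., 1.], [-1., -1.]])
n = X.shape[0]

Xc = X - X.mean(axis=0)           # already centered here
S = Xc.T @ Xc / n                 # 1/n covariance convention
evals, evecs = np.linalg.eigh(S)  # eigenvalues in ascending order

lam1, lam2 = evals[-1], evals[0]  # lam1 = 4/3, lam2 = 0
v1 = evecs[:, -1]                 # proportional to (1, 1)/sqrt(2)
```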
3. Kernel PCA
Setup: Dataset $X \in \mathbb{R}^{n \times p}$, Kernel function $\mathcal{K}(a, b)$. Kernel Matrix $K \in \mathbb{R}^{n \times n}$ where $K_{ij} = \mathcal{K}(x_i, x_j)$.
(a) Is the kernel matrix $K$ always positive semi-definite?
Yes. Proof: $K$ is a Gram matrix. Let $\Phi \in \mathbb{R}^{n \times d}$ be the matrix whose $i$-th row is the feature map $\phi(x_i)^T$. Then $K_{ij} = \phi(x_i)^T \phi(x_j)$, so $K = \Phi \Phi^T$. For any vector $v$:
$$v^T K v = v^T \Phi \Phi^T v = ||\Phi^T v||^2 \ge 0$$

Thus, $K$ is PSD.
(b) Equivalence of Linear Kernel
Explain why $\mathcal{K}(a,b) = \langle a, b \rangle$ corresponds to standard PCA.
- Linear Kernel PCA: Operates on $K = XX^T \in \mathbb{R}^{n \times n}$. We solve $(XX^T)v = \lambda v$.
- Standard PCA: Operates on $S \propto X^T X \in \mathbb{R}^{p \times p}$. We solve $(X^T X)w = \lambda w$.
- Connection (SVD):
Let $X = UDV^T$.
- $X^T X = V D^2 V^T$ (Eigenvectors are $V$).
- $K = X X^T = U D^2 U^T$ (Eigenvectors are $U$).
- Since $X = UDV^T$, we have $XV = UD$. The coordinates in Kernel PCA (scaled $U$) correspond to the principal component scores ($XV$) in standard PCA.
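The eigenvalue correspondence between $XX^T$ and $X^TX$ is easy to illustrate numerically (random toy data, not from any problem above):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(6, 3))       # n=6 observations, p=3 features

K = X @ X.T                        # linear-kernel Gram matrix (n x n)
G = X.T @ X                        # p x p matrix from standard PCA

eig_K = np.linalg.eigvalsh(K)      # ascending; 3 zeros + 3 positive
eig_G = np.linalg.eigvalsh(G)      # the same 3 positive eigenvalues
```

Since $n > p$ here, $K$ has $n - p$ zero eigenvalues; its nonzero eigenvalues match those of $X^TX$ exactly, which is the SVD connection stated above.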
4. Canonical Correlation Analysis (CCA)
Problem: Two centered feature columns $f_1, f_2 \in \mathbb{R}^n$. Uncorrelated ($s_{f_1 f_2} = 0$) and unit variance ($s_{f_1}^2 = s_{f_2}^2 = 1$). Datasets: $X = (f_1, f_2) \in \mathbb{R}^{n \times 2}$, $Y = (f_1) \in \mathbb{R}^{n \times 1}$.
(a) Sample Covariance Matrices
- $S_X = \begin{pmatrix} Var(f_1) & Cov(f_1, f_2) \\ Cov(f_2, f_1) & Var(f_2) \end{pmatrix} = \begin{pmatrix} 1 & 0 \\ 0 & 1 \end{pmatrix} = I_2$.
- $S_Y = Var(f_1) = 1$.
- $S_{XY} = \begin{pmatrix} Cov(f_1, f_1) \\ Cov(f_2, f_1) \end{pmatrix} = \begin{pmatrix} 1 \\ 0 \end{pmatrix}$.
(b) First Canonical Correlation
Answer: 1. CCA finds linear combinations of $X$ columns to maximally correlate with $Y$. Since $Y$ ($f_1$) coincides exactly with the first column of $X$, we can achieve a perfect correlation of 1 by selecting only that column.
(c) Canonical Directions
- For X ($u_1$): Put full weight on the first column and none on the second. $$u_1 = \begin{pmatrix} 1 \\ 0 \end{pmatrix} \in \mathbb{R}^2$$
- For Y ($v_1$): $Y$ has only one column, so $$v_1 = 1 \in \mathbb{R}$$
Verification: Subject to $u_1^T S_X u_1 = 1$ and $v_1^T S_Y v_1 = 1$.
- $u_1^T S_X u_1 = (1, 0) I (1, 0)^T = 1$.
- $v_1^T S_Y v_1 = 1 \cdot 1 \cdot 1 = 1$.
- Maximized Correlation: $u_1^T S_{XY} v_1 = (1, 0) \begin{pmatrix} 1 \\ 0 \end{pmatrix} (1) = 1$.
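The verification in (c) can be written out directly from the covariance blocks in (a):

```python
import numpy as np

# Covariance blocks from part (a).
S_X = np.eye(2)
S_Y = np.array([[1.0]])
S_XY = np.array([[1.0], [0.0]])

u1 = np.array([1.0, 0.0])   # canonical direction for X
v1 = np.array([1.0])        # canonical direction for Y

unit_u = u1 @ S_X @ u1      # variance constraint: should be 1
unit_v = v1 @ S_Y @ v1      # variance constraint: should be 1
rho = u1 @ S_XY @ v1        # canonical correlation: should be 1
```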
Midterm 2: Linear Regression & MVN
Formulas:
- MVN PDF: $f(y) = (2\pi)^{-n/2}|\Sigma|^{-1/2} e^{-\frac{1}{2}(y-\mu)^T \Sigma^{-1} (y-\mu)}$.
- Linear Transformation: $y \sim N(\mu, \Sigma) \implies Ay \sim N(A\mu, A\Sigma A^T)$.
- Matrix Calculus: $\partial_\beta [\beta^T A \beta] = 2A\beta$ (if symmetric).
1. Linear Regression Model
$$y = X\beta + \epsilon, \quad \epsilon \sim N_n(0, \sigma^2 I_n)$$

where $X$ is column-orthogonal ($X^T X = I_p$).
(a) Log-Likelihood & MLE vs Least Squares
$y \sim N_n(X\beta, \sigma^2 I_n)$.
$$ \ell(y; \beta) = -\frac{n}{2}\ln(2\pi) - \frac{n}{2}\ln(\sigma^2) - \frac{1}{2\sigma^2} ||y - X\beta||^2 $$

To maximize $\ell$ over $\beta$, we must minimize the only $\beta$-dependent term, $||y - X\beta||^2$. Thus, MLE is equivalent to Least Squares (OLS).
(b) Distribution of MLE $\hat{\beta}$
The OLS/MLE solution is $\hat{\beta} = (X^T X)^{-1} X^T y = X^T y$, using $X^T X = I_p$.
- Expectation: $E(\hat{\beta}) = X^T E(y) = X^T (X\beta) = (X^T X)\beta = I_p \beta = \beta$. (Unbiased).
- Variance: $Var(\hat{\beta}) = X^T Var(y) X = X^T (\sigma^2 I_n) X = \sigma^2 (X^T X) = \sigma^2 I_p$. Result: $\hat{\beta} \sim N_p(\beta, \sigma^2 I_p)$.
(c) Distribution of Fitted Values $\hat{y}$
$\hat{y} = X\hat{\beta} = X(X^T y) = (XX^T)y = Py$. $P = XX^T$ is a projection matrix (Symmetric and Idempotent).
- Expectation: $E(\hat{y}) = P E(y) = XX^T X \beta = X \beta$.
- Variance: $Var(\hat{y}) = P Var(y) P^T = P(\sigma^2 I) P = \sigma^2 P^2 = \sigma^2 P$. Result: $\hat{y} \sim N_n(X\beta, \sigma^2 XX^T)$.
(d) Distribution of Residuals $r$
$r = y - \hat{y} = (I_n - P)y = P_{\perp} y$. $P_{\perp} = I_n - XX^T$ projects onto the orthogonal complement of the column space of $X$.
- Expectation: $E(r) = (I-P)X\beta = X\beta - X\beta = 0$.
- Variance: $Var(r) = (I-P)(\sigma^2 I)(I-P)^T = \sigma^2 (I-P)$. Result: $r \sim N_n(0, \sigma^2 (I_n - XX^T))$.
(e/f) Independence of $\hat{y}$ and $r$
Using the MVN property: for jointly Gaussian vectors, independent $\iff$ uncorrelated. Since $\hat{y}$ and $r$ are both linear functions of the Gaussian $y$, they are jointly Gaussian.
$$ \begin{aligned} Cov(\hat{y}, r) &= Cov(Py, (I-P)y) \\ &= P Var(y) (I-P)^T \\ &= P (\sigma^2 I) (I-P) \\ &= \sigma^2 (P - P^2) \\ &= 0 \quad (\text{Since } P=P^2) \end{aligned} $$

Thus, they are independent.
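The projection identities behind (b)–(f) can be confirmed numerically. A sketch using a randomly generated column-orthogonal $X$ (an assumption standing in for the problem's $X$):

```python
import numpy as np

rng = np.random.default_rng(1)
A = rng.normal(size=(8, 3))
X, _ = np.linalg.qr(A)           # columns orthonormal, so X^T X = I_3

P = X @ X.T                      # hat matrix
y = rng.normal(size=8)
y_hat = P @ y
r = y - y_hat

sym = np.allclose(P, P.T)        # P symmetric
idem = np.allclose(P @ P, P)     # P idempotent
orth = y_hat @ r                 # ~0: fitted values orthogonal to residuals
```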
(g) Joint Distribution
$$ \begin{pmatrix} \hat{y} \\ r \end{pmatrix} \sim N_{2n} \left( \begin{pmatrix} X\beta \\ 0 \end{pmatrix}, \sigma^2 \begin{pmatrix} XX^T & 0 \\ 0 & I_n - XX^T \end{pmatrix} \right) $$

(h) Conditional Distribution $r | \hat{y}$
Since they are independent:
$$r | \hat{y} \sim N_n(0, \sigma^2(I - XX^T))$$

2. Hypothesis Testing & Kurtosis
(c) Hypothesis Test $H_0: \beta = 0$
Test statistic based on $\hat{\beta} \sim N(\beta, \sigma^2 I_p)$.
$$ \chi^2 = (\hat{\beta} - 0)^T (\sigma^2 I_p)^{-1} (\hat{\beta} - 0) = \frac{\hat{\beta}^T \hat{\beta}}{\sigma^2} = \frac{||\hat{\beta}||^2}{\sigma^2} $$

Under $H_0$ (and with $\sigma^2$ known), this follows a chi-square distribution with $p$ degrees of freedom.
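A Monte Carlo sanity check of the null distribution, simulating under $H_0$ with hypothetical values of $n$, $p$, and $\sigma$: the statistic's sample mean and variance should land near $p$ and $2p$, the moments of $\chi^2_p$.

```python
import numpy as np

rng = np.random.default_rng(2)
n, p, sigma = 20, 4, 2.0
X, _ = np.linalg.qr(rng.normal(size=(n, p)))   # X^T X = I_p

# Simulate under H0: beta = 0, so y is pure noise.
reps = 2000
stats = np.empty(reps)
for k in range(reps):
    y = sigma * rng.normal(size=n)
    beta_hat = X.T @ y                          # MLE when X^T X = I_p
    stats[k] = beta_hat @ beta_hat / sigma**2

# chi^2_p has mean p and variance 2p.
```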
(d) P-value < 0.05 Interpretation
If the p-value is below 0.05, we reject $H_0$ at the 5% level: there is significant evidence that at least one $\beta_i \neq 0$.
(e) Kurtosis of Mixed Signals (Proof)
Problem: $y_1 \sim (\mu_1, \sigma_1^2), y_2 \sim (\mu_2, \sigma_2^2)$ are independent. Show:
$$\mathcal{K}(y_1 + y_2) = \frac{\sigma_1^4 \mathcal{K}(y_1) + \sigma_2^4 \mathcal{K}(y_2)}{(\sigma_1^2 + \sigma_2^2)^2}$$

Proof:
- Centering: Kurtosis is translation invariant. Assume $\mu_1 = \mu_2 = 0$.
- Definition: Excess Kurtosis $\mathcal{K}(y) = \frac{E(y^4)}{\sigma^4} - 3 \implies E(y^4) = (\mathcal{K}(y)+3)\sigma^4$.
- Sum Variance: Let $S = y_1 + y_2$. By independence, $\sigma_S^2 = \sigma_1^2 + \sigma_2^2$.
- Expectation of Sum^4: Expand $(y_1 + y_2)^4 = y_1^4 + 4y_1^3 y_2 + 6y_1^2 y_2^2 + 4y_1 y_2^3 + y_2^4$. Since independent and mean 0, $E(y_1^3 y_2) = E(y_1^3)E(y_2) = 0$. $$ \begin{aligned} E(S^4) &= E(y_1^4) + 6E(y_1^2)E(y_2^2) + E(y_2^4) \\ &= [\mathcal{K}(y_1)+3]\sigma_1^4 + 6\sigma_1^2 \sigma_2^2 + [\mathcal{K}(y_2)+3]\sigma_2^4 \\ &= \mathcal{K}(y_1)\sigma_1^4 + \mathcal{K}(y_2)\sigma_2^4 + 3(\sigma_1^4 + 2\sigma_1^2 \sigma_2^2 + \sigma_2^4) \\ &= \mathcal{K}(y_1)\sigma_1^4 + \mathcal{K}(y_2)\sigma_2^4 + 3(\sigma_1^2 + \sigma_2^2)^2 \end{aligned} $$
- Calculate Kurtosis: $$ \begin{aligned} \mathcal{K}(S) &= \frac{E(S^4)}{(\sigma_S^2)^2} - 3 \\ &= \frac{\mathcal{K}(y_1)\sigma_1^4 + \mathcal{K}(y_2)\sigma_2^4}{(\sigma_1^2 + \sigma_2^2)^2} + \frac{3(\sigma_1^2+\sigma_2^2)^2}{(\sigma_1^2+\sigma_2^2)^2} - 3 \\ &= \frac{\mathcal{K}(y_1)\sigma_1^4 + \mathcal{K}(y_2)\sigma_2^4}{(\sigma_1^2 + \sigma_2^2)^2} \end{aligned} $$ Q.E.D.
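A Monte Carlo check of the formula, mixing a uniform signal (variance $1/3$, excess kurtosis $-1.2$, both standard facts) with an independent standard normal (excess kurtosis $0$):

```python
import numpy as np

rng = np.random.default_rng(3)
N = 500_000

y1 = rng.uniform(-1, 1, N)     # variance 1/3, excess kurtosis -1.2
y2 = rng.normal(0, 1, N)       # variance 1, excess kurtosis 0

def excess_kurtosis(y):
    c = y - y.mean()
    return (c**4).mean() / (c**2).mean()**2 - 3

s1, s2 = 1/3, 1.0
predicted = (s1**2 * (-1.2) + s2**2 * 0.0) / (s1 + s2)**2   # = -0.075
observed = excess_kurtosis(y1 + y2)
```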
Note 9: Factor Analysis (FA)
Model: $X = \mu + Lz + \epsilon$.
- $z \sim N(0, I)$, $\epsilon \sim N(0, \Psi)$, with $z$ and $\epsilon$ independent.
Q1. Logic Flow: Marginal $\to$ Joint $\to$ Conditional
- Marginal $X$: $E(X) = \mu$. $Var(X) = LL^T + \Psi$.
- Joint $(X, z)$: $$\begin{pmatrix} X \\ z \end{pmatrix} \sim N \left( \begin{pmatrix} \mu \\ 0 \end{pmatrix}, \begin{pmatrix} LL^T + \Psi & L \\ L^T & I \end{pmatrix} \right)$$ (Proof of Cov(X,z): $E(X z^T) = E((Lz+\epsilon)z^T) = L E(zz^T) = L$).
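The marginal covariance $LL^T + \Psi$ can be verified by simulation; the loadings and uniquenesses below are made-up illustrative values:

```python
import numpy as np

rng = np.random.default_rng(4)
# Hypothetical model: p=3 observed variables, r=2 factors.
L = np.array([[0.9, 0.0],
              [0.7, 0.5],
              [0.0, 0.8]])
Psi = np.diag([0.2, 0.3, 0.4])

N = 200_000
z = rng.normal(size=(N, 2))                            # z ~ N(0, I)
eps = rng.normal(size=(N, 3)) * np.sqrt(np.diag(Psi))  # eps ~ N(0, Psi)
X = z @ L.T + eps                                      # draws of X (mu = 0)

S_hat = np.cov(X.T)            # empirical covariance of the draws
Sigma = L @ L.T + Psi          # model-implied covariance
```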
Q2. Proportion of Variance Explained (PVE)
Show that PVE by $j$-th factor in FA via PCA is $\frac{\lambda_j}{tr(\Sigma)}$.
Solution: In FA via PCA, the loading vector is defined as $l_j = \sqrt{\lambda_j} v_j$.
$$PVE_j = \frac{||l_j||^2}{tr(\Sigma)} = \frac{(\sqrt{\lambda_j} v_j)^T (\sqrt{\lambda_j} v_j)}{\sum_k \lambda_k} = \frac{\lambda_j (v_j^T v_j)}{tr(\Sigma)} = \frac{\lambda_j}{tr(\Sigma)}$$

Q3. Rotation Invariance
Is PVE overall and PVE individual invariant after rotation $\tilde{L} = LQ$?
- Overall: Yes. $trace(\tilde{L}\tilde{L}^T) = trace(LQ Q^T L^T) = trace(LL^T) = ||L||_F^2$.
- Individual: No. $||\tilde{l}_j||^2 = (Lq_j)^T (Lq_j)$. This depends on the specific column $q_j$ of the rotation matrix.
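A quick numerical confirmation with an arbitrary loading matrix and a random orthogonal $Q$:

```python
import numpy as np

rng = np.random.default_rng(5)
L = rng.normal(size=(5, 2))                      # arbitrary loadings
Q, _ = np.linalg.qr(rng.normal(size=(2, 2)))     # random orthogonal matrix
L_rot = L @ Q

overall = np.sum(L**2)          # = trace(L L^T) = ||L||_F^2
overall_rot = np.sum(L_rot**2)  # unchanged by rotation

col = np.sum(L**2, axis=0)          # per-factor explained variance
col_rot = np.sum(L_rot**2, axis=0)  # generally different after rotation
```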
Q4. Identifiability (Example)
Given $\Sigma = \begin{pmatrix} 1 & 0.9 & 0.7 \\ 0.9 & 1 & 0.4 \\ 0.7 & 0.4 & 1 \end{pmatrix}$, try to fit one factor ($r=1$): $L = (l_{11}, l_{21}, l_{31})^T$, $\Sigma \approx LL^T + \Psi$. Matching the off-diagonal entries gives:
- $l_{11}l_{21} = 0.9$
- $l_{11}l_{31} = 0.7$
- $l_{21}l_{31} = 0.4$
Solving for $l_{11}$:
$$l_{11}^2 = \frac{(l_{11}l_{21})(l_{11}l_{31})}{l_{21}l_{31}} = \frac{0.9 \times 0.7}{0.4} = 1.575 \implies l_{11} \approx 1.255$$

Check Diagonal $\Sigma_{11}$:
$$\Sigma_{11} = l_{11}^2 + \psi_1 \implies 1 = 1.575 + \psi_1 \implies \psi_1 = -0.575$$

Since the unique variance $\psi_1$ cannot be negative, no valid one-factor solution exists (a Heywood case).
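The arithmetic of Q4 in two lines:

```python
# Off-diagonal equations from Sigma ~ L L^T + Psi with one factor.
s12, s13, s23 = 0.9, 0.7, 0.4

l11_sq = (s12 * s13) / s23       # l11^2 = 1.575
psi1 = 1.0 - l11_sq              # implied unique variance: -0.575 < 0
```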
Q5. Scoring (Regression Method)
Using the conditional distribution $z|x$:
$$ E(z|x) = 0 + L^T(LL^T + \Psi)^{-1}(x-\mu) $$

$$ Var(z|x) = I - L^T(LL^T + \Psi)^{-1}L $$

The score $\hat{z}$ is the posterior mean $E(z|x)$.
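A sketch of the regression-scoring formulas for a hypothetical one-factor model (all numbers invented for illustration):

```python
import numpy as np

# Hypothetical model: p=2 observed variables, r=1 factor.
L = np.array([[0.9], [0.8]])
Psi = np.diag([0.19, 0.36])
Sigma = L @ L.T + Psi            # marginal covariance of x
mu = np.zeros(2)

x = np.array([1.0, 0.5])         # one observation
z_hat = L.T @ np.linalg.solve(Sigma, x - mu)          # E(z | x)
z_var = np.eye(1) - L.T @ np.linalg.solve(Sigma, L)   # Var(z | x)
```

Using `np.linalg.solve` avoids forming the explicit inverse; the posterior variance lies strictly between 0 and 1, reflecting that observing $x$ reduces but never eliminates uncertainty about $z$.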
Note 10: Independent Component Analysis (ICA)
Model: $X = Lz$. Goal: Find $W = L^{-1}$ to recover $z = WX$.
Q1. Permutation Matrix
Definition: $P$ is a square matrix with a single 1 in each row and column, 0 elsewhere.
- Explicitly, for a swap (transposition) of indices $i$ and $j$: $P_{kk}=1$ for $k \neq i, j$, $P_{ij} = P_{ji} = 1$, and all other entries are 0.
- Inverse: every permutation matrix is orthogonal, so $P^{-1} = P^T$. A transposition is additionally symmetric ($P^T = P$), hence $P^{-1} = P$.
- Proof of ambiguity: with $\tilde{L} = LP^{-1}$ and $\tilde{z} = Pz$, we get $\tilde{L}\tilde{z} = (LP^{-1})(Pz) = L(P^{-1}P)z = Lz$. The permuted model reproduces $X$ exactly (label-switching ambiguity).
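A concrete check of the label-switching ambiguity with a hypothetical mixing matrix:

```python
import numpy as np

P = np.array([[0., 1.],
              [1., 0.]])        # swap the two sources
L = np.array([[2., 1.],
              [0., 3.]])        # hypothetical mixing matrix
z = np.array([1., -1.])

L_tilde = L @ np.linalg.inv(P)  # permuted mixing matrix
z_tilde = P @ z                 # permuted sources

x_orig = L @ z
x_perm = L_tilde @ z_tilde      # identical observed signal
```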
Q2. Uncorrelated $\neq$ Independent
Counter-example: Random vector $x$ takes values $\{(0,1), (0,-1), (1,0), (-1,0)\}$ with probability $1/4$ each.
- Uncorrelated: $E(x_1) = 0, E(x_2) = 0$. $x_1 x_2$ is always 0. $Cov(x_1, x_2) = 0$.
- Not Independent:
- $P(x_1=0) = 1/2$.
- $P(x_2=1) = 1/4$.
- Product: $1/8$.
- Joint $P(x_1=0, x_2=1) = 1/4$.
- $1/4 \neq 1/8$, thus not independent.
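The counter-example can be verified by direct enumeration over the support:

```python
import numpy as np

# The four equally likely support points (x1, x2).
pts = np.array([[0., 1.], [0., -1.], [1., 0.], [-1., 0.]])
p = np.full(4, 0.25)

cov = np.sum(p * pts[:, 0] * pts[:, 1])               # x1*x2 = 0 always
p_x1_0 = p[pts[:, 0] == 0].sum()                      # 1/2
p_x2_1 = p[pts[:, 1] == 1].sum()                      # 1/4
joint = p[(pts[:, 0] == 0) & (pts[:, 1] == 1)].sum()  # 1/4, not 1/8
```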