Midterm 1: Multivariate Statistics
1. Sample Mean and Covariance (Data Matrix Properties)
Problem Setup: Let $X \in \mathbb{R}^{10 \times 3}$ be a data matrix where rows correspond to observations and columns correspond to variables.
$$X = \begin{pmatrix} -x_1- \\ \vdots \\ -x_{10}- \end{pmatrix}$$

Assume that $X$ is column-centered (each column has mean 0) and column-orthogonal ($X^T X = I_3$).
(a) Find the sample mean vector $\bar{x}$ and the sample covariance matrix $S$ of $X$.
Sample Mean $\bar{x}$: The sample mean is the vector of the averages of each column. Since $X$ is explicitly stated to be column-centered:
$$ \bar{x} = \begin{pmatrix} \bar{x}_1 \\ \bar{x}_2 \\ \bar{x}_3 \end{pmatrix} = \begin{pmatrix} 0 \\ 0 \\ 0 \end{pmatrix} = \vec{0} \in \mathbb{R}^{3 \times 1} $$

Sample Covariance $S$: Using the definition $S = \frac{1}{n} \sum_{i=1}^n (x_i - \bar{x})(x_i - \bar{x})^T = \frac{X^T C X}{n}$, where $C$ is the centering matrix. Since $X$ is already centered, $CX = X$ and $C$ can be dropped.
$$ S = \frac{X^T X}{n} = \frac{1}{10} I_3 = \begin{pmatrix} 1/10 & 0 & 0 \\ 0 & 1/10 & 0 \\ 0 & 0 & 1/10 \end{pmatrix} $$

(b) Unit Conversion
Suppose $f_1$ represents meters, $f_2$ centimeters, and $f_3$ millimeters. Convert all to centimeters by defining $Y$. What is the sample covariance matrix of $Y$?
Conversion:
- $f_1$ (m) $\to$ cm: Multiply by 100.
- $f_2$ (cm) $\to$ cm: Multiply by 1.
- $f_3$ (mm) $\to$ cm: Multiply by 0.1.
This is a matrix multiplication with a diagonal matrix $D$:
$$Y = XD = X \begin{pmatrix} 100 & 0 & 0 \\ 0 & 1 & 0 \\ 0 & 0 & 0.1 \end{pmatrix}$$

Sample Covariance of $Y$:
$$ \begin{aligned} S_Y &= \frac{Y^T Y}{n} = \frac{(XD)^T (XD)}{n} \\ &= \frac{D^T (X^T X) D}{10} \\ &= \frac{1}{10} D I_3 D = \frac{1}{10} D^2 \\ &= \frac{1}{10} \begin{pmatrix} 100^2 & 0 & 0 \\ 0 & 1^2 & 0 \\ 0 & 0 & 0.1^2 \end{pmatrix} = \begin{pmatrix} 1000 & 0 & 0 \\ 0 & 0.1 & 0 \\ 0 & 0 & 0.001 \end{pmatrix} \end{aligned} $$

(c) Linear Transformation Mean
Let $v \in \mathbb{R}^3$ be a unit vector ($||v||=1$). Define $z = Xv$. Find the sample mean of $z$.
$$ \begin{aligned} \bar{z} &= \text{Mean}(Xv) \\ &= \bar{x}^T v \\ &= \vec{0}^T v \\ &= 0 \end{aligned} $$

(d) Linear Transformation Variance
What is the sample variance of $z$?
$$ \begin{aligned} S_z &= \frac{z^T z}{n} \quad (\text{Since } \bar{z}=0) \\ &= \frac{(Xv)^T (Xv)}{10} \\ &= \frac{v^T (X^T X) v}{10} \\ &= \frac{v^T I v}{10} \\ &= \frac{||v||^2}{10} = \frac{1}{10} \end{aligned} $$

2. Geometric PCA
Dataset: Consider 3 observations with 2 features:
$$X = \begin{pmatrix} 0 & 0 \\ 1 & 1 \\ -1 & -1 \end{pmatrix} \in \mathbb{R}^{3 \times 2}$$

The original space is $\mathbb{R}^2$; all three points lie on the line $y = x$.
(a) Find the first principal component direction $v_1$.
(Hint: drawing a scatterplot may help)
Observing the scatterplot (Points lie on $y=x$): The unit vector along the line $y=x$ is:
$$v_1 = \begin{pmatrix} 1/\sqrt{2} \\ 1/\sqrt{2} \end{pmatrix}$$

(b) How much variance is explained by the first principal component?
The PC scores are the projections (dot products) of the data onto the new axis; the variance explained is the sample variance of these scores. Let $z_{i(1)} = x_i^T v_1$.
- Point 1: $(0,0) \cdot v_1 = 0$
- Point 2: $(1,1) \cdot v_1 = \frac{1}{\sqrt{2}} + \frac{1}{\sqrt{2}} = \sqrt{2}$
- Point 3: $(-1,-1) \cdot v_1 = -\sqrt{2}$
Mean $\bar{z}_1 = 0$.
$$ \lambda_1 = Var(z_{(1)}) = \frac{1}{3} \sum (z_i - \bar{z})^2 = \frac{1}{3} (0 + 2 + 2) = \frac{4}{3} $$

(c) Find the second principal component direction $v_2$.
$v_2$ must be orthogonal to $v_1$ ($v_2 \cdot v_1 = 0$).
$$v_2 = \begin{pmatrix} -1/\sqrt{2} \\ 1/\sqrt{2} \end{pmatrix}$$

Variance Explained: Projections onto $y=-x$:
- $(0,0) \to 0$
- $(1,1) \to 0$
- $(-1,-1) \to 0$
Conclusion: Variance Explained is 0.
(d) Biplot
- PC1 Axis: Points are at $-\sqrt{2}, 0, \sqrt{2}$.
- PC2 Axis: All points project to 0.
- PVE (Proportion of Variance Explained):
  - $PVE_1 = \frac{4/3}{4/3 + 0} = 1$
  - $PVE_2 = \frac{0}{4/3 + 0} = 0$
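All of Problem 2 can be sanity-checked numerically. A minimal NumPy sketch, using the same $1/n$ covariance convention as above (the eigenvector sign from `eigh` is arbitrary, so compare absolute values):

```python
import numpy as np

# The toy dataset from the problem (rows = observations).
X = np.array([[0., 0.], [1., 1.], [-1., -1.]])
n = X.shape[0]

Xc = X - X.mean(axis=0)           # already centered here
S = Xc.T @ Xc / n                 # 1/n covariance convention
evals, evecs = np.linalg.eigh(S)  # eigenvalues in ascending order

lam1, lam2 = evals[-1], evals[0]  # lam1 = 4/3, lam2 = 0
v1 = evecs[:, -1]                 # proportional to (1, 1)/sqrt(2)
```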
3. Kernel PCA
Setup: Dataset $X \in \mathbb{R}^{n \times p}$, Kernel function $\mathcal{K}(a, b)$. Kernel Matrix $K \in \mathbb{R}^{n \times n}$ where $K_{ij} = \mathcal{K}(x_i, x_j)$.
(a) Is the kernel matrix $K$ always positive semi-definite?
Yes. Proof: $K$ is a Gram matrix. Let $\Phi \in \mathbb{R}^{n \times d}$ be the matrix whose $i$-th row is the feature map $\phi(x_i)^T$. Then $K_{ij} = \phi(x_i)^T \phi(x_j)$, so $K = \Phi \Phi^T$. For any vector $v$:
$$v^T K v = v^T \Phi \Phi^T v = ||\Phi^T v||^2 \ge 0$$

Thus, $K$ is PSD.
(b) Equivalence of Linear Kernel
Explain why $\mathcal{K}(a,b) = \langle a, b \rangle$ corresponds to standard PCA.
- Linear Kernel PCA: Operates on $K = XX^T \in \mathbb{R}^{n \times n}$. We solve $(XX^T)v = \lambda v$.
- Standard PCA: Operates on $S \propto X^T X \in \mathbb{R}^{p \times p}$. We solve $(X^T X)w = \lambda w$.
- Connection (SVD):
Let $X = UDV^T$.
- $X^T X = V D^2 V^T$ (Eigenvectors are $V$).
- $K = X X^T = U D^2 U^T$ (Eigenvectors are $U$).
- Since $X = UDV^T$, we have $XV = UD$. The coordinates in Kernel PCA (scaled $U$) correspond to the principal component scores ($XV$) in standard PCA.
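The eigenvalue correspondence between $XX^T$ and $X^TX$ is easy to illustrate numerically (random toy data, not from any problem above):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(6, 3))       # n=6 observations, p=3 features

K = X @ X.T                        # linear-kernel Gram matrix (n x n)
G = X.T @ X                        # p x p matrix from standard PCA

eig_K = np.linalg.eigvalsh(K)      # ascending; 3 zeros + 3 positive
eig_G = np.linalg.eigvalsh(G)      # the same 3 positive eigenvalues
```

Since $n > p$ here, $K$ has $n - p$ zero eigenvalues; its nonzero eigenvalues match those of $X^TX$ exactly, which is the SVD connection stated above.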
4. Canonical Correlation Analysis (CCA)
Problem: Two centered feature columns $f_1, f_2 \in \mathbb{R}^n$. Uncorrelated ($s_{f_1 f_2} = 0$) and unit variance ($s_{f_1}^2 = s_{f_2}^2 = 1$). Datasets: $X = (f_1, f_2) \in \mathbb{R}^{n \times 2}$, $Y = (f_1) \in \mathbb{R}^{n \times 1}$.
(a) Sample Covariance Matrices
- $S_X = \begin{pmatrix} Var(f_1) & Cov(f_1, f_2) \\ Cov(f_2, f_1) & Var(f_2) \end{pmatrix} = \begin{pmatrix} 1 & 0 \\ 0 & 1 \end{pmatrix} = I_2$.
- $S_Y = Var(f_1) = 1$.
- $S_{XY} = \begin{pmatrix} Cov(f_1, f_1) \\ Cov(f_2, f_1) \end{pmatrix} = \begin{pmatrix} 1 \\ 0 \end{pmatrix}$.
(b) First Canonical Correlation
Answer: 1. CCA finds linear combinations of $X$ columns to maximally correlate with $Y$. Since $Y$ ($f_1$) coincides exactly with the first column of $X$, we can achieve a perfect correlation of 1 by selecting only that column.
(c) Canonical Directions
- For X ($u_1$): Put full weight on the first column and none on the second. $$u_1 = \begin{pmatrix} 1 \\ 0 \end{pmatrix} \in \mathbb{R}^2$$
- For Y ($v_1$): $Y$ has only one column, so $$v_1 = 1 \in \mathbb{R}$$
Verification: Subject to $u_1^T S_X u_1 = 1$ and $v_1^T S_Y v_1 = 1$.
- $u_1^T S_X u_1 = (1, 0) I (1, 0)^T = 1$.
- $v_1^T S_Y v_1 = 1 \cdot 1 \cdot 1 = 1$.
- Maximized Correlation: $u_1^T S_{XY} v_1 = (1, 0) \begin{pmatrix} 1 \\ 0 \end{pmatrix} (1) = 1$.
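The verification in (c) can be written out directly from the covariance blocks in (a):

```python
import numpy as np

# Covariance blocks from part (a).
S_X = np.eye(2)
S_Y = np.array([[1.0]])
S_XY = np.array([[1.0], [0.0]])

u1 = np.array([1.0, 0.0])   # canonical direction for X
v1 = np.array([1.0])        # canonical direction for Y

unit_u = u1 @ S_X @ u1      # variance constraint: should be 1
unit_v = v1 @ S_Y @ v1      # variance constraint: should be 1
rho = u1 @ S_XY @ v1        # canonical correlation: should be 1
```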
Midterm 2: Linear Regression & MVN
Formulas:
- MVN PDF: $f(y) = (2\pi)^{-n/2}|\Sigma|^{-1/2} e^{-\frac{1}{2}(y-\mu)^T \Sigma^{-1} (y-\mu)}$.
- Linear Transformation: $y \sim N(\mu, \Sigma) \implies Ay \sim N(A\mu, A\Sigma A^T)$.
- Matrix Calculus: $\partial_\beta [\beta^T A \beta] = 2A\beta$ (if symmetric).
1. Linear Regression Model
$$y = X\beta + \epsilon, \quad \epsilon \sim N_n(0, \sigma^2 I_n)$$

where $X$ is column-orthogonal ($X^T X = I_p$).
(a) Log-Likelihood & MLE vs Least Squares
$y \sim N_n(X\beta, \sigma^2 I_n)$.
$$ \ell(y; \beta) = -\frac{n}{2}\ln(2\pi) - \frac{n}{2}\ln(\sigma^2) - \frac{1}{2\sigma^2} ||y - X\beta||^2 $$

To maximize $\ell$ over $\beta$, we must minimize the only $\beta$-dependent term, $||y - X\beta||^2$. Thus, MLE is equivalent to Least Squares (OLS).
(b) Distribution of MLE $\hat{\beta}$
The OLS/MLE solution is $\hat{\beta} = (X^T X)^{-1} X^T y = X^T y$, using $X^T X = I_p$.
- Expectation: $E(\hat{\beta}) = X^T E(y) = X^T (X\beta) = (X^T X)\beta = I_p \beta = \beta$. (Unbiased).
- Variance: $Var(\hat{\beta}) = X^T Var(y) X = X^T (\sigma^2 I_n) X = \sigma^2 (X^T X) = \sigma^2 I_p$. Result: $\hat{\beta} \sim N_p(\beta, \sigma^2 I_p)$.
(c) Distribution of Fitted Values $\hat{y}$
$\hat{y} = X\hat{\beta} = X(X^T y) = (XX^T)y = Py$. $P = XX^T$ is a projection matrix (Symmetric and Idempotent).
- Expectation: $E(\hat{y}) = P E(y) = XX^T X \beta = X \beta$.
- Variance: $Var(\hat{y}) = P Var(y) P^T = P(\sigma^2 I) P = \sigma^2 P^2 = \sigma^2 P$. Result: $\hat{y} \sim N_n(X\beta, \sigma^2 XX^T)$.
(d) Distribution of Residuals $r$
$r = y - \hat{y} = (I_n - P)y = P_{\perp} y$. $P_{\perp} = I_n - XX^T$ projects onto the orthogonal complement of the column space of $X$.
- Expectation: $E(r) = (I-P)X\beta = X\beta - X\beta = 0$.
- Variance: $Var(r) = (I-P)(\sigma^2 I)(I-P)^T = \sigma^2 (I-P)$. Result: $r \sim N_n(0, \sigma^2 (I_n - XX^T))$.
(e/f) Independence of $\hat{y}$ and $r$
Using the MVN property: for jointly Gaussian vectors, independent $\iff$ uncorrelated. Since $\hat{y}$ and $r$ are both linear functions of the Gaussian $y$, they are jointly Gaussian.
$$ \begin{aligned} Cov(\hat{y}, r) &= Cov(Py, (I-P)y) \\ &= P Var(y) (I-P)^T \\ &= P (\sigma^2 I) (I-P) \\ &= \sigma^2 (P - P^2) \\ &= 0 \quad (\text{Since } P=P^2) \end{aligned} $$

Thus, they are independent.
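The projection identities behind (b)–(f) can be confirmed numerically. A sketch using a randomly generated column-orthogonal $X$ (an assumption standing in for the problem's $X$):

```python
import numpy as np

rng = np.random.default_rng(1)
A = rng.normal(size=(8, 3))
X, _ = np.linalg.qr(A)           # columns orthonormal, so X^T X = I_3

P = X @ X.T                      # hat matrix
y = rng.normal(size=8)
y_hat = P @ y
r = y - y_hat

sym = np.allclose(P, P.T)        # P symmetric
idem = np.allclose(P @ P, P)     # P idempotent
orth = y_hat @ r                 # ~0: fitted values orthogonal to residuals
```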
(g) Joint Distribution
$$ \begin{pmatrix} \hat{y} \\ r \end{pmatrix} \sim N_{2n} \left( \begin{pmatrix} X\beta \\ 0 \end{pmatrix}, \sigma^2 \begin{pmatrix} XX^T & 0 \\ 0 & I_n - XX^T \end{pmatrix} \right) $$

(h) Conditional Distribution $r | \hat{y}$
Since they are independent:
$$r | \hat{y} \sim N_n(0, \sigma^2(I - XX^T))$$

2. Hypothesis Testing & Kurtosis
(c) Hypothesis Test $H_0: \beta = 0$
Test statistic based on $\hat{\beta} \sim N(\beta, \sigma^2 I_p)$.
$$ \chi^2 = (\hat{\beta} - 0)^T (\sigma^2 I_p)^{-1} (\hat{\beta} - 0) = \frac{\hat{\beta}^T \hat{\beta}}{\sigma^2} = \frac{||\hat{\beta}||^2}{\sigma^2} $$

Under $H_0$ (and with $\sigma^2$ known), this follows a chi-square distribution with $p$ degrees of freedom.
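A Monte Carlo sanity check of the null distribution, simulating under $H_0$ with hypothetical values of $n$, $p$, and $\sigma$: the statistic's sample mean and variance should land near $p$ and $2p$, the moments of $\chi^2_p$.

```python
import numpy as np

rng = np.random.default_rng(2)
n, p, sigma = 20, 4, 2.0
X, _ = np.linalg.qr(rng.normal(size=(n, p)))   # X^T X = I_p

# Simulate under H0: beta = 0, so y is pure noise.
reps = 2000
stats = np.empty(reps)
for k in range(reps):
    y = sigma * rng.normal(size=n)
    beta_hat = X.T @ y                          # MLE when X^T X = I_p
    stats[k] = beta_hat @ beta_hat / sigma**2

# chi^2_p has mean p and variance 2p.
```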
(d) P-value < 0.05 Interpretation
If the p-value is below 0.05, we reject $H_0$ at the 5% level: there is significant evidence that at least one $\beta_i \neq 0$.
(e) Kurtosis of Mixed Signals (Proof)
Problem: $y_1 \sim (\mu_1, \sigma_1^2), y_2 \sim (\mu_2, \sigma_2^2)$ are independent. Show:
$$\mathcal{K}(y_1 + y_2) = \frac{\sigma_1^4 \mathcal{K}(y_1) + \sigma_2^4 \mathcal{K}(y_2)}{(\sigma_1^2 + \sigma_2^2)^2}$$

Proof:
- Centering: Kurtosis is translation invariant. Assume $\mu_1 = \mu_2 = 0$.
- Definition: Excess Kurtosis $\mathcal{K}(y) = \frac{E(y^4)}{\sigma^4} - 3 \implies E(y^4) = (\mathcal{K}(y)+3)\sigma^4$.
- Sum Variance: Let $S = y_1 + y_2$. By independence, $\sigma_S^2 = \sigma_1^2 + \sigma_2^2$.
- Expectation of Sum^4: Expand $(y_1 + y_2)^4 = y_1^4 + 4y_1^3 y_2 + 6y_1^2 y_2^2 + 4y_1 y_2^3 + y_2^4$. Since independent and mean 0, $E(y_1^3 y_2) = E(y_1^3)E(y_2) = 0$. $$ \begin{aligned} E(S^4) &= E(y_1^4) + 6E(y_1^2)E(y_2^2) + E(y_2^4) \\ &= [\mathcal{K}(y_1)+3]\sigma_1^4 + 6\sigma_1^2 \sigma_2^2 + [\mathcal{K}(y_2)+3]\sigma_2^4 \\ &= \mathcal{K}(y_1)\sigma_1^4 + \mathcal{K}(y_2)\sigma_2^4 + 3(\sigma_1^4 + 2\sigma_1^2 \sigma_2^2 + \sigma_2^4) \\ &= \mathcal{K}(y_1)\sigma_1^4 + \mathcal{K}(y_2)\sigma_2^4 + 3(\sigma_1^2 + \sigma_2^2)^2 \end{aligned} $$
- Calculate Kurtosis: $$ \begin{aligned} \mathcal{K}(S) &= \frac{E(S^4)}{(\sigma_S^2)^2} - 3 \\ &= \frac{\mathcal{K}(y_1)\sigma_1^4 + \mathcal{K}(y_2)\sigma_2^4}{(\sigma_1^2 + \sigma_2^2)^2} + \frac{3(\sigma_1^2+\sigma_2^2)^2}{(\sigma_1^2+\sigma_2^2)^2} - 3 \\ &= \frac{\mathcal{K}(y_1)\sigma_1^4 + \mathcal{K}(y_2)\sigma_2^4}{(\sigma_1^2 + \sigma_2^2)^2} \end{aligned} $$ Q.E.D.
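A Monte Carlo check of the formula, mixing a uniform signal (variance $1/3$, excess kurtosis $-1.2$, both standard facts) with an independent standard normal (excess kurtosis $0$):

```python
import numpy as np

rng = np.random.default_rng(3)
N = 500_000

y1 = rng.uniform(-1, 1, N)     # variance 1/3, excess kurtosis -1.2
y2 = rng.normal(0, 1, N)       # variance 1, excess kurtosis 0

def excess_kurtosis(y):
    c = y - y.mean()
    return (c**4).mean() / (c**2).mean()**2 - 3

s1, s2 = 1/3, 1.0
predicted = (s1**2 * (-1.2) + s2**2 * 0.0) / (s1 + s2)**2   # = -0.075
observed = excess_kurtosis(y1 + y2)
```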
Note 9: Factor Analysis (FA)
Model: $X = \mu + Lz + \epsilon$.
- $z \sim N(0, I)$, $\epsilon \sim N(0, \Psi)$, with $z$ and $\epsilon$ independent.
Q1. Logic Flow: Marginal $\to$ Joint $\to$ Conditional
- Marginal $X$: $E(X) = \mu$. $Var(X) = LL^T + \Psi$.
- Joint $(X, z)$: $$\begin{pmatrix} X \\ z \end{pmatrix} \sim N \left( \begin{pmatrix} \mu \\ 0 \end{pmatrix}, \begin{pmatrix} LL^T + \Psi & L \\ L^T & I \end{pmatrix} \right)$$ (Proof of Cov(X,z): $E(X z^T) = E((Lz+\epsilon)z^T) = L E(zz^T) = L$).
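The marginal covariance $LL^T + \Psi$ can be verified by simulation; the loadings and uniquenesses below are made-up illustrative values:

```python
import numpy as np

rng = np.random.default_rng(4)
# Hypothetical model: p=3 observed variables, r=2 factors.
L = np.array([[0.9, 0.0],
              [0.7, 0.5],
              [0.0, 0.8]])
Psi = np.diag([0.2, 0.3, 0.4])

N = 200_000
z = rng.normal(size=(N, 2))                            # z ~ N(0, I)
eps = rng.normal(size=(N, 3)) * np.sqrt(np.diag(Psi))  # eps ~ N(0, Psi)
X = z @ L.T + eps                                      # draws of X (mu = 0)

S_hat = np.cov(X.T)            # empirical covariance of the draws
Sigma = L @ L.T + Psi          # model-implied covariance
```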
Q2. Proportion of Variance Explained (PVE)
Show that PVE by $j$-th factor in FA via PCA is $\frac{\lambda_j}{tr(\Sigma)}$.
Solution: In FA via PCA, the loading vector is defined as $l_j = \sqrt{\lambda_j} v_j$.
$$PVE_j = \frac{||l_j||^2}{tr(\Sigma)} = \frac{(\sqrt{\lambda_j} v_j)^T (\sqrt{\lambda_j} v_j)}{\sum_k \lambda_k} = \frac{\lambda_j (v_j^T v_j)}{tr(\Sigma)} = \frac{\lambda_j}{tr(\Sigma)}$$

Q3. Rotation Invariance
Is PVE overall and PVE individual invariant after rotation $\tilde{L} = LQ$?
- Overall: Yes. $trace(\tilde{L}\tilde{L}^T) = trace(LQ Q^T L^T) = trace(LL^T) = ||L||_F^2$.
- Individual: No. $||\tilde{l}_j||^2 = (Lq_j)^T (Lq_j)$. This depends on the specific column $q_j$ of the rotation matrix.
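A quick numerical confirmation with an arbitrary loading matrix and a random orthogonal $Q$:

```python
import numpy as np

rng = np.random.default_rng(5)
L = rng.normal(size=(5, 2))                      # arbitrary loadings
Q, _ = np.linalg.qr(rng.normal(size=(2, 2)))     # random orthogonal matrix
L_rot = L @ Q

overall = np.sum(L**2)          # = trace(L L^T) = ||L||_F^2
overall_rot = np.sum(L_rot**2)  # unchanged by rotation

col = np.sum(L**2, axis=0)          # per-factor explained variance
col_rot = np.sum(L_rot**2, axis=0)  # generally different after rotation
```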
Q4. Identifiability (Example)
Given $\Sigma = \begin{pmatrix} 1 & 0.9 & 0.7 \\ 0.9 & 1 & 0.4 \\ 0.7 & 0.4 & 1 \end{pmatrix}$, try to fit one factor ($r=1$): $L = (l_{11}, l_{21}, l_{31})^T$, $\Sigma \approx LL^T + \Psi$. Matching the off-diagonal entries gives:
- $l_{11}l_{21} = 0.9$
- $l_{11}l_{31} = 0.7$
- $l_{21}l_{31} = 0.4$
Solving for $l_{11}$:
$$l_{11}^2 = \frac{(l_{11}l_{21})(l_{11}l_{31})}{l_{21}l_{31}} = \frac{0.9 \times 0.7}{0.4} = 1.575 \implies l_{11} \approx 1.255$$

Check Diagonal $\Sigma_{11}$:
$$\Sigma_{11} = l_{11}^2 + \psi_1 \implies 1 = 1.575 + \psi_1 \implies \psi_1 = -0.575$$

Since the unique variance $\psi_1$ cannot be negative, no valid one-factor solution exists (a Heywood case).
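The arithmetic of Q4 in two lines:

```python
# Off-diagonal equations from Sigma ~ L L^T + Psi with one factor.
s12, s13, s23 = 0.9, 0.7, 0.4

l11_sq = (s12 * s13) / s23       # l11^2 = 1.575
psi1 = 1.0 - l11_sq              # implied unique variance: -0.575 < 0
```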
Q5. Scoring (Regression Method)
Using the conditional distribution $z|x$:
$$ E(z|x) = 0 + L^T(LL^T + \Psi)^{-1}(x-\mu) $$

$$ Var(z|x) = I - L^T(LL^T + \Psi)^{-1}L $$

The score $\hat{z}$ is the posterior mean $E(z|x)$.
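A sketch of the regression-scoring formulas for a hypothetical one-factor model (all numbers invented for illustration):

```python
import numpy as np

# Hypothetical model: p=2 observed variables, r=1 factor.
L = np.array([[0.9], [0.8]])
Psi = np.diag([0.19, 0.36])
Sigma = L @ L.T + Psi            # marginal covariance of x
mu = np.zeros(2)

x = np.array([1.0, 0.5])         # one observation
z_hat = L.T @ np.linalg.solve(Sigma, x - mu)          # E(z | x)
z_var = np.eye(1) - L.T @ np.linalg.solve(Sigma, L)   # Var(z | x)
```

Using `np.linalg.solve` avoids forming the explicit inverse; the posterior variance lies strictly between 0 and 1, reflecting that observing $x$ reduces but never eliminates uncertainty about $z$.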
Note 10: Independent Component Analysis (ICA)
Model: $X = Lz$. Goal: Find $W = L^{-1}$ to recover $z = WX$.
Q1. Permutation Matrix
Definition: $P$ is a square matrix with a single 1 in each row and column, 0 elsewhere.
- Explicitly, for a swap (transposition) of indices $i$ and $j$: $P_{kk}=1$ for $k \neq i, j$, $P_{ij} = P_{ji} = 1$, and all other entries are 0.
- Inverse: every permutation matrix is orthogonal, so $P^{-1} = P^T$. A transposition is additionally symmetric ($P^T = P$), hence $P^{-1} = P$.
- Proof of ambiguity: with $\tilde{L} = LP^{-1}$ and $\tilde{z} = Pz$, we get $\tilde{L}\tilde{z} = (LP^{-1})(Pz) = L(P^{-1}P)z = Lz$. The permuted model reproduces $X$ exactly (label-switching ambiguity).
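A concrete check of the label-switching ambiguity with a hypothetical mixing matrix:

```python
import numpy as np

P = np.array([[0., 1.],
              [1., 0.]])        # swap the two sources
L = np.array([[2., 1.],
              [0., 3.]])        # hypothetical mixing matrix
z = np.array([1., -1.])

L_tilde = L @ np.linalg.inv(P)  # permuted mixing matrix
z_tilde = P @ z                 # permuted sources

x_orig = L @ z
x_perm = L_tilde @ z_tilde      # identical observed signal
```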
Q2. Uncorrelated $\neq$ Independent
Counter-example: Random vector $x$ takes values $\{(0,1), (0,-1), (1,0), (-1,0)\}$ with probability $1/4$ each.
- Uncorrelated: $E(x_1) = 0, E(x_2) = 0$. $x_1 x_2$ is always 0. $Cov(x_1, x_2) = 0$.
- Not Independent:
- $P(x_1=0) = 1/2$.
- $P(x_2=1) = 1/4$.
- Product: $1/8$.
- Joint $P(x_1=0, x_2=1) = 1/4$.
- $1/4 \neq 1/8$, thus not independent.
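The counter-example can be verified by direct enumeration over the support:

```python
import numpy as np

# The four equally likely support points (x1, x2).
pts = np.array([[0., 1.], [0., -1.], [1., 0.], [-1., 0.]])
p = np.full(4, 0.25)

cov = np.sum(p * pts[:, 0] * pts[:, 1])               # x1*x2 = 0 always
p_x1_0 = p[pts[:, 0] == 0].sum()                      # 1/2
p_x2_1 = p[pts[:, 1] == 1].sum()                      # 1/4
joint = p[(pts[:, 0] == 0) & (pts[:, 1] == 1)].sum()  # 1/4, not 1/8
```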