Simulation 2 — Bayesian Linear Regression + Gaussian Process
Q1. Quadratic-form diagnostic — 10 pts
Let
$$ p(w)\propto \exp\left\{-\frac12 w^T A w + b^T w\right\}, $$
where $A\in\mathbb{R}^{D\times D}$ is symmetric positive definite and $b\in\mathbb{R}^D$.
1. Show that $p(w)$ is a multivariate Gaussian.
2. Identify its covariance matrix and mean in terms of $A$ and $b$.
3. Apply your result to
$$ p(w\mid y,X)\propto \exp\left\{-\frac12 w^T\left(I+\frac{1}{\sigma^2}X^TX\right)w +\left(\mu+\frac{1}{\sigma^2}X^Ty\right)^T w\right\}. $$
State the posterior mean and posterior covariance immediately, without re-deriving from scratch.
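A minimal NumPy sketch (not part of the graded problem) of the Q1 diagnostic: if $p(w)\propto\exp\{-\frac12 w^TAw+b^Tw\}$ is Gaussian with mean $A^{-1}b$ and covariance $A^{-1}$, then the exponent and the completed square $-\frac12(w-m)^TA(w-m)$ with $m=A^{-1}b$ must differ by a constant that does not depend on $w$. The dimensions and random seed below are arbitrary choices for illustration.

```python
import numpy as np

# Numerically check the complete-the-square identity behind Q1:
# -1/2 w^T A w + b^T w = -1/2 (w - m)^T A (w - m) + const, with m = A^{-1} b.
rng = np.random.default_rng(0)
D = 4
M = rng.standard_normal((D, D))
A = M @ M.T + D * np.eye(D)      # symmetric positive definite
b = rng.standard_normal(D)

m = np.linalg.solve(A, b)        # candidate mean A^{-1} b

def exponent(w):
    return -0.5 * w @ A @ w + b @ w

def completed(w):
    return -0.5 * (w - m) @ A @ (w - m)

# The difference must be the same constant for every w.
w1, w2 = rng.standard_normal(D), rng.standard_normal(D)
c1 = exponent(w1) - completed(w1)
c2 = exponent(w2) - completed(w2)
print(np.isclose(c1, c2))  # True: so mean = A^{-1} b and covariance = A^{-1}
```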
Q2. OLS, MLE, and “what matters” — 12 pts
Suppose
$$ y\mid X,w,\Sigma \sim \mathcal{N}(Xw,\Sigma), $$
where $y\in\mathbb{R}^N$, $X\in\mathbb{R}^{N\times D}$, $N>D$, and $X^TX$ is invertible.
1. Derive the ordinary least squares estimator
$$ \hat w_{\mathrm{LS}}=\arg\min_{w} \lVert y-Xw\rVert^2. $$
2. Write the log-likelihood of $w$ up to additive constants independent of $w$.
3. For what class of covariance matrices $\Sigma$ does the MLE $\hat w_{\mathrm{MLE}}$ coincide with $\hat w_{\mathrm{LS}}$?
4. State in one sentence why this question does not require you to compute any marginal density such as $p(y)$.
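A quick NumPy sketch (illustrative only) of the OLS part: when $\Sigma=\sigma^2 I$ the log-likelihood is, up to constants, $-\lVert y-Xw\rVert^2/(2\sigma^2)$, so the MLE is the least-squares solution of the normal equations, which can be checked against `np.linalg.lstsq`. The sizes below are arbitrary.

```python
import numpy as np

# With Sigma = sigma^2 I, maximizing the Gaussian log-likelihood is the same
# as minimizing ||y - Xw||^2, so the MLE equals the least-squares estimator.
rng = np.random.default_rng(1)
N, D = 20, 3
X = rng.standard_normal((N, D))
y = rng.standard_normal(N)

w_ls = np.linalg.solve(X.T @ X, X.T @ y)      # normal equations (X^T X) w = X^T y
w_ref = np.linalg.lstsq(X, y, rcond=None)[0]  # library least-squares solution
print(np.allclose(w_ls, w_ref))  # True
```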
Q3. Bayesian linear regression posterior — 14 pts
Now assume
$$ \Sigma=\sigma^2 I,\qquad p(w)=\mathcal{N}(w\mid \mu,I). $$
1. Starting from
$$ p(w\mid y,X)\propto p(w)\,p(y\mid X,w,\sigma^2), $$
expand the exponent fully.
2. Collect the quadratic term in $w$ and identify the posterior precision matrix.
3. Collect the linear term in $w$ and identify the posterior mean.
4. Your final answer must be
$$ p(w\mid y,X,\sigma^2)=\mathcal{N}(w\mid \mu_N,\Sigma_N), $$with explicit formulas for $\mu_N$ and $\Sigma_N$.
Do not compute the normalization constant.
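A numeric sanity check (a sketch, not a derivation) for the posterior you should obtain in Q3: if $\Sigma_N=(I+X^TX/\sigma^2)^{-1}$ and $\mu_N=\Sigma_N(\mu+X^Ty/\sigma^2)$ are correct, then $\log p(w)+\log p(y\mid X,w)$ minus the unnormalized log-density of $\mathcal{N}(w\mid\mu_N,\Sigma_N)$ is constant in $w$ (that constant is exactly the normalizer you are told not to compute). The candidate formulas in the code are assumptions to be verified against your own answer.

```python
import numpy as np

# Check the candidate posterior N(mu_N, Sigma_N) by verifying that
# log prior + log likelihood - log posterior is constant in w.
rng = np.random.default_rng(2)
N, D, sigma2 = 15, 3, 0.5
X = rng.standard_normal((N, D))
y = rng.standard_normal(N)
mu = rng.standard_normal(D)

Sigma_N = np.linalg.inv(np.eye(D) + X.T @ X / sigma2)   # candidate covariance
mu_N = Sigma_N @ (mu + X.T @ y / sigma2)                # candidate mean

def log_joint(w):   # log p(w) + log p(y | X, w), up to w-independent constants
    return -0.5 * (w - mu) @ (w - mu) - 0.5 * np.sum((y - X @ w) ** 2) / sigma2

def log_post(w):    # log N(w | mu_N, Sigma_N), up to its normalizer
    return -0.5 * (w - mu_N) @ np.linalg.solve(Sigma_N, w - mu_N)

w1, w2 = rng.standard_normal(D), rng.standard_normal(D)
print(np.isclose(log_joint(w1) - log_post(w1),
                 log_joint(w2) - log_post(w2)))  # True
```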
Q4. BLR weighted-average geometry — 10 pts
Continue from Q3 and assume
$$ X^TX=I. $$
1. Show that
$$ \hat w_{\mathrm{LS}}=X^T y. $$
2. Show that the posterior mean can be written as
$$ \mu_N=(1-\lambda)\mu+\lambda \hat w_{\mathrm{LS}} $$
for some scalar $\lambda$. Find $\lambda$.
3. Explain in one sentence what happens to $\mu_N$ when $\sigma^2\to 0$.
4. Explain in one sentence what happens to $\mu_N$ when $\sigma^2\to \infty$.
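A NumPy sketch to check your Q4 answer. A matrix with $X^TX=I$ can be built from a QR factorization; the value $\lambda=1/(1+\sigma^2)$ used below is one candidate answer (an assumption here, to be compared with your own derivation).

```python
import numpy as np

# With orthonormal columns (X^T X = I), the posterior mean should be a
# convex combination of the prior mean and the least-squares estimate.
rng = np.random.default_rng(3)
N, D, sigma2 = 10, 3, 0.7
X, _ = np.linalg.qr(rng.standard_normal((N, D)))  # reduced QR: X^T X = I
y = rng.standard_normal(N)
mu = rng.standard_normal(D)

w_ls = X.T @ y                                    # least squares when X^T X = I
Sigma_N = np.linalg.inv(np.eye(D) + X.T @ X / sigma2)
mu_N = Sigma_N @ (mu + X.T @ y / sigma2)

lam = 1.0 / (1.0 + sigma2)                        # candidate lambda (assumption)
print(np.allclose(mu_N, (1 - lam) * mu + lam * w_ls))  # True
```

Note how the two limits read off directly: $\sigma^2\to 0$ gives $\lambda\to 1$, and $\sigma^2\to\infty$ gives $\lambda\to 0$.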
Q5. Writing the design matrix explicitly — 8 pts
Let $D=3$, $N=4$, and
$$ \psi(x)= \begin{bmatrix} \psi_1(x)\\ \psi_2(x)\\ \psi_3(x) \end{bmatrix}. $$
1. Write the design matrix $\Psi$ explicitly, row by row, for training inputs $x^{(1)},x^{(2)},x^{(3)},x^{(4)}$.
2. Write $w$, $y_N$, and $\hat y$ explicitly with their dimensions.
3. Verify by explicit dimensions that
$$ \hat y=\Psi w $$
is valid.
4. State the dimensions of
$$ \Psi,\quad w,\quad \Psi^T\Psi,\quad \Psi\Psi^T,\quad \psi_*:=\psi(x_*). $$
Q6. GP from weight-space — 12 pts
Assume
$$ \hat y(x)=w^T\psi(x),\qquad w\sim \mathcal{N}(0,\alpha^{-1}I), $$
and define
$$ \hat y= \begin{bmatrix} \hat y(x^{(1)})\\ \vdots\\ \hat y(x^{(N)}) \end{bmatrix} =\Psi w. $$
1. Derive the distribution of $\hat y$.
2. Define
$$ K_N=\frac{1}{\alpha} \Psi\Psi^T. $$
Show that
$$ \hat y\sim \mathcal{N}(0,K_N). $$
3. Now assume noisy observations
$$ y_N=\hat y+\varepsilon,\qquad \varepsilon\sim \mathcal{N}(0,\sigma^2 I), $$
with $\varepsilon$ independent of $w$. Derive the marginal distribution of $y_N$.
4. Define
$$ C_N=K_N+\sigma^2 I. $$
Write the final answer using $C_N$.
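A Monte Carlo sketch (illustrative, with arbitrary sizes) of the Q6 claim $y_N\sim\mathcal{N}(0,C_N)$: sample $w$ and $\varepsilon$ from their priors, form $y_N=\Psi w+\varepsilon$, and compare the empirical covariance with $C_N=\frac{1}{\alpha}\Psi\Psi^T+\sigma^2 I$.

```python
import numpy as np

# Empirically verify Var(y_N) = (1/alpha) Psi Psi^T + sigma^2 I by sampling
# w ~ N(0, alpha^{-1} I) and eps ~ N(0, sigma^2 I), then y = Psi w + eps.
rng = np.random.default_rng(4)
N, D, alpha, sigma2 = 3, 5, 2.0, 0.3
Psi = rng.standard_normal((N, D))

S = 200_000
w = rng.standard_normal((S, D)) / np.sqrt(alpha)     # rows: draws of w
eps = rng.standard_normal((S, N)) * np.sqrt(sigma2)  # rows: draws of eps
Y = w @ Psi.T + eps                                  # rows: draws of y_N

C_N = Psi @ Psi.T / alpha + sigma2 * np.eye(N)
C_hat = np.cov(Y, rowvar=False)
print(np.max(np.abs(C_hat - C_N)))  # small (Monte Carlo error shrinks with S)
```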
Q7. GP predictive block matrix — 14 pts
Let $x_*=x^{(N+1)}$ be a new test input. Define
$$ \psi_*=\psi(x_*),\qquad \hat y_* = w^T\psi_*, \qquad y_*=\hat y_*+\varepsilon_*, \qquad \varepsilon_*\sim \mathcal{N}(0,\sigma^2), $$
independent of everything else.
1. Derive
$$ \operatorname{Cov}(y_N,y_*). $$
Your answer must be written as an $N\times 1$ vector.
2. Derive
$$ \operatorname{Var}(y_*). $$
3. Write the joint Gaussian distribution of
$$ \begin{bmatrix} y_N\\ y_* \end{bmatrix} $$
in block form, clearly identifying $\Sigma_{11},\Sigma_{12},\Sigma_{21},\Sigma_{22}$.
4. Using the Gaussian conditioning formula, derive
$$ p(y_*\mid y_N). $$
Your final mean and variance must be written using
$$ C_N,\qquad k_*:=\frac{1}{\alpha} \Psi\psi_*,\qquad c_{**}:=\frac{1}{\alpha} \psi_*^T\psi_*+\sigma^2. $$
Q8. BLR–GP bridge — 10 pts
Assume the same feature map $\psi$ and prior
$$ w\sim \mathcal{N}(0,\alpha^{-1}I). $$
1. Show that for any two inputs $x,x'$,
$$ \operatorname{Cov}(\hat y(x),\hat y(x'))=\frac{1}{\alpha} \psi(x)^T\psi(x'). $$
2. Hence identify the kernel
$$ k(x,x')=\frac{1}{\alpha} \psi(x)^T\psi(x'). $$
3. Explain in one sentence why this GP is the function-space view of the same Bayesian linear model.
4. State the predictive mean from the BLR side and from the GP side, and show they are expressions for the same object.
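A NumPy sketch of the bridge in item 4 (with arbitrary sizes and a zero prior mean, matching the $w\sim\mathcal{N}(0,\alpha^{-1}I)$ prior above): the weight-space predictive mean $\psi_*^T\mu_N$ and the function-space predictive mean $k_*^TC_N^{-1}y_N$ should agree to floating-point precision.

```python
import numpy as np

# Compare the BLR (weight-space) and GP (function-space) predictive means.
rng = np.random.default_rng(5)
N, D, alpha, sigma2 = 12, 4, 1.5, 0.2
Psi = rng.standard_normal((N, D))
y = rng.standard_normal(N)
psi_star = rng.standard_normal(D)

# Weight space: posterior precision alpha I + Psi^T Psi / sigma^2.
A = alpha * np.eye(D) + Psi.T @ Psi / sigma2
mu_N = np.linalg.solve(A, Psi.T @ y / sigma2)   # posterior mean (zero prior mean)
mean_blr = psi_star @ mu_N

# Function space: k_*^T C_N^{-1} y with the kernel k(x,x') = psi^T psi' / alpha.
C_N = Psi @ Psi.T / alpha + sigma2 * np.eye(N)
k_star = Psi @ psi_star / alpha
mean_gp = k_star @ np.linalg.solve(C_N, y)

print(np.isclose(mean_blr, mean_gp))  # True: the same predictive object
```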
Q9. Short diagnostic — 10 pts
For each statement, write True or False and give a one-line justification.
1. In BLR, the posterior normalization constant is needed in order to identify the posterior mean and covariance.
2. If
$$ p(w\mid y,X)\propto \exp\left\{-\frac12 w^T A w+b^T w\right\}, $$
then the posterior covariance is $A$.
3. In the weight-space GP construction,
$$ K_N=\frac{1}{\alpha} \Psi\Psi^T $$
is an $N\times N$ matrix.
4. In the GP predictive block matrix, $\Sigma_{12}$ must have the same shape as $y_N$.
5. If $\varepsilon$ and $w$ are independent and centered, then
$$ \operatorname{Cov}(\Psi w,\varepsilon)=0. $$
6. When $X^TX=I$, the BLR posterior mean is always exactly equal to $\hat w_{\mathrm{LS}}$, regardless of $\sigma^2$.
7. In the GP noisy-observation setup,
$$ \operatorname{Var}(y_N)=K_N+\sigma^2 I. $$
8. In BLR, the quadratic term determines the posterior mean, while the linear term determines the posterior covariance.
9. In GP predictive conditioning, $\Sigma_{21}=\Sigma_{12}^T$.
10. In BLR, seeing
$$ -\frac12 w^T A w+b^T w $$
should immediately make you think “complete the square and read off $\Sigma_N^{-1}=A$”.