Setup

Task:

Given training data, predict the output for a new input and quantify predictive uncertainty.

Data:

Three training points with input $x \in \mathbb{R}$ and output $y \in \mathbb{R}$.

$n$	$x_n$	$y_n$
1	0	1.2
2	1	2.8
3	2	4.1

Model:

$$y = w_0 + w_1 x + \varepsilon, \qquad \varepsilon \sim \mathcal{N}(0, \sigma^2)$$

Use the feature map

$$\psi(x) = (1, x)^\top$$

so that

$$y = w^\top \psi(x) + \varepsilon, \qquad w = (w_0, w_1)^\top \in \mathbb{R}^2$$

Notation:

$w \in \mathbb{R}^2$: weight vector
$\Psi \in \mathbb{R}^{N \times D}$: design matrix with $N = 3$ and $D = 2$
$y \in \mathbb{R}^N$: observed response vector
$S \in \mathbb{R}^{D \times D}$: prior covariance matrix
$\sigma^2 \in \mathbb{R}$: known noise variance
$\mu_{\text{post}}, \Sigma_{\text{post}}$: posterior mean and covariance

Known / fixed:

Noise variance: $\sigma^2 = 1$
Prior: $w \sim \mathcal{N}(0, S)$ with

$$S = 2I = \begin{pmatrix} 2 & 0 \\ 0 & 2 \end{pmatrix}$$

Phase 0: Build the Matrices

Design matrix $\Psi$:

Each row is $\psi(x_n)^\top = (1, x_n)$.

$$\Psi = \begin{pmatrix} 1 & 0 \\ 1 & 1 \\ 1 & 2 \end{pmatrix} \in \mathbb{R}^{3 \times 2}$$

Observed vector:

$$y = \begin{pmatrix} 1.2 \\ 2.8 \\ 4.1 \end{pmatrix}$$

Precompute $\Psi^\top \Psi$:

$$\Psi^\top \Psi = \begin{pmatrix} 1 & 1 & 1 \\ 0 & 1 & 2 \end{pmatrix}\begin{pmatrix} 1 & 0 \\ 1 & 1 \\ 1 & 2 \end{pmatrix} = \begin{pmatrix} 3 & 3 \\ 3 & 5 \end{pmatrix}$$

Precompute $\Psi^\top y$:

$$\Psi^\top y = \begin{pmatrix} 1 & 1 & 1 \\ 0 & 1 & 2 \end{pmatrix}\begin{pmatrix} 1.2 \\ 2.8 \\ 4.1 \end{pmatrix} = \begin{pmatrix} 8.1 \\ 11.0 \end{pmatrix}$$

Precompute $S^{-1}$:

$$S^{-1} = (2I)^{-1} = \frac{1}{2}I = \begin{pmatrix} 0.5 & 0 \\ 0 & 0.5 \end{pmatrix}$$
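The Phase 0 quantities can be reproduced numerically. The following is a minimal sketch assuming NumPy is available; the variable names (`Psi`, `PtP`, `Pty`, `S_inv`) are illustrative choices, not part of the original notes.

```python
import numpy as np

# Training data from the Setup table.
x = np.array([0.0, 1.0, 2.0])
y = np.array([1.2, 2.8, 4.1])

# Design matrix: each row is psi(x_n)^T = (1, x_n).
Psi = np.column_stack([np.ones_like(x), x])

# Prior covariance S = 2I and its inverse.
S = 2.0 * np.eye(2)
S_inv = np.linalg.inv(S)

# Precomputed products used in the later phases.
PtP = Psi.T @ Psi   # Psi^T Psi
Pty = Psi.T @ y     # Psi^T y

print(PtP)
print(Pty)
print(S_inv)
```

The printed matrices should agree with the hand-computed $\Psi^\top \Psi$, $\Psi^\top y$, and $S^{-1}$ above.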

Phase 1: Posterior Derivation

Start from

$$\log p(w \mid y, \Psi) = \log p(w) + \log p(y \mid w, \Psi) + \text{const}$$

Prior term:

$$\log p(w) = -\frac{1}{2} w^\top S^{-1} w + \text{const}$$

Likelihood term:

$$\log p(y \mid w, \Psi) = -\frac{1}{2\sigma^2}\lVert y - \Psi w \rVert^2 + \text{const}$$

Plug in $\sigma^2 = 1$:

$$\log p(w \mid y) = -\frac{1}{2} w^\top S^{-1} w - \frac{1}{2}(y - \Psi w)^\top (y - \Psi w) + \text{const}$$

Expand the quadratic form:

$$\lVert y - \Psi w \rVert^2 = y^\top y - 2 y^\top \Psi w + w^\top \Psi^\top \Psi w$$

Collect all $w$-dependent terms:

$$\log p(w \mid y) = -\frac{1}{2} w^\top (S^{-1} + \sigma^{-2}\Psi^\top \Psi) w + \sigma^{-2} y^\top \Psi w + \text{const}$$

This is a quadratic form in $w$, so the posterior is Gaussian:

$$w \mid y \sim \mathcal{N}(\mu_{\text{post}}, \Sigma_{\text{post}})$$

Posterior covariance:

$$\Sigma_{\text{post}} = (\sigma^{-2}\Psi^\top \Psi + S^{-1})^{-1}$$

Posterior mean:

$$\mu_{\text{post}} = \sigma^{-2}\Sigma_{\text{post}}\Psi^\top y$$

Phase 2: Compute the Posterior Numerically

Step 1. Posterior precision matrix

$$\Sigma_{\text{post}}^{-1} = \sigma^{-2}\Psi^\top \Psi + S^{-1} = \begin{pmatrix} 3 & 3 \\ 3 & 5 \end{pmatrix} + \begin{pmatrix} 0.5 & 0 \\ 0 & 0.5 \end{pmatrix} = \begin{pmatrix} 3.5 & 3 \\ 3 & 5.5 \end{pmatrix}$$

Step 2. Invert to get $\Sigma_{\text{post}}$

For a $2 \times 2$ matrix,

$$\begin{pmatrix} a & b \\ c & d \end{pmatrix}^{-1} = \frac{1}{ad - bc}\begin{pmatrix} d & -b \\ -c & a \end{pmatrix}$$

$$\det(\Sigma_{\text{post}}^{-1}) = 3.5 \times 5.5 - 3 \times 3 = 19.25 - 9 = 10.25$$

Therefore,

$$\Sigma_{\text{post}} = \frac{1}{10.25}\begin{pmatrix} 5.5 & -3 \\ -3 & 3.5 \end{pmatrix} \approx \begin{pmatrix} 0.537 & -0.293 \\ -0.293 & 0.341 \end{pmatrix}$$

Step 3. Posterior mean

$$\mu_{\text{post}} = \Sigma_{\text{post}}\Psi^\top y = \frac{1}{10.25}\begin{pmatrix} 5.5 & -3 \\ -3 & 3.5 \end{pmatrix}\begin{pmatrix} 8.1 \\ 11.0 \end{pmatrix} \approx \begin{pmatrix} 1.127 \\ 1.385 \end{pmatrix}$$
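Steps 1 to 3 can be checked with a few lines of NumPy. This is a sketch under the assumptions of this example ($\sigma^2 = 1$, $S = 2I$); the names `precision`, `Sigma_post`, and `mu_post` are illustrative. The mean is obtained with `np.linalg.solve` rather than by multiplying by the explicit inverse, which is the usual numerically safer choice.

```python
import numpy as np

Psi = np.array([[1.0, 0.0], [1.0, 1.0], [1.0, 2.0]])
y = np.array([1.2, 2.8, 4.1])
sigma2 = 1.0            # known noise variance
S = 2.0 * np.eye(2)     # prior covariance

# Step 1: posterior precision = sigma^-2 Psi^T Psi + S^-1.
precision = Psi.T @ Psi / sigma2 + np.linalg.inv(S)

# Step 2: posterior covariance is the inverse of the precision.
Sigma_post = np.linalg.inv(precision)

# Step 3: posterior mean solves precision @ mu = sigma^-2 Psi^T y.
mu_post = np.linalg.solve(precision, Psi.T @ y / sigma2)

print(np.linalg.det(precision))
print(Sigma_post)
print(mu_post)
```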

Interpretation:

The posterior mean gives $w_0 \approx 1.13$ for the intercept and $w_1 \approx 1.39$ for the slope. Because the prior mean is $0$, both values are shrunk slightly toward the origin relative to an unregularized least-squares fit; this is the regularization effect of the prior in Bayesian inference.

Phase 3: Compare MLE / MAP / Ridge

MLE (no prior):

$$\hat{w}_{\text{MLE}} = (\Psi^\top \Psi)^{-1}\Psi^\top y$$

$$(\Psi^\top \Psi)^{-1} = \frac{1}{3 \times 5 - 9}\begin{pmatrix} 5 & -3 \\ -3 & 3 \end{pmatrix} = \frac{1}{6}\begin{pmatrix} 5 & -3 \\ -3 & 3 \end{pmatrix} = \begin{pmatrix} 0.833 & -0.5 \\ -0.5 & 0.5 \end{pmatrix}$$

$$\hat{w}_{\text{MLE}} = \begin{pmatrix} 0.833 & -0.5 \\ -0.5 & 0.5 \end{pmatrix}\begin{pmatrix} 8.1 \\ 11.0 \end{pmatrix} = \begin{pmatrix} 1.250 \\ 1.450 \end{pmatrix}$$

MAP:

Because the posterior is Gaussian, the posterior mean and posterior mode are the same:

$$\hat{w}_{\text{MAP}} = \mu_{\text{post}} \approx \begin{pmatrix} 1.127 \\ 1.385 \end{pmatrix}$$

Ridge connection:

With $\sigma^2 = 1$ and $S = 2I$, maximizing the posterior is equivalent to minimizing

$$\lVert y - \Psi w \rVert^2 + \frac{1}{2}\lVert w \rVert^2$$

which is exactly a ridge-style penalty.
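The ridge connection can be verified numerically. The sketch below assumes NumPy and uses the closed-form ridge solution $(\Psi^\top \Psi + \lambda I)^{-1}\Psi^\top y$ with $\lambda = 0.5$; the names `w_mle`, `w_ridge`, and `ridge_objective` are illustrative.

```python
import numpy as np

Psi = np.array([[1.0, 0.0], [1.0, 1.0], [1.0, 2.0]])
y = np.array([1.2, 2.8, 4.1])

# MLE = ordinary least squares (no prior).
w_mle, *_ = np.linalg.lstsq(Psi, y, rcond=None)

# Ridge penalty lam * ||w||^2 with lam = sigma^2 / 2 = 0.5,
# which matches the prior S = 2I when sigma^2 = 1.
lam = 0.5
w_ridge = np.linalg.solve(Psi.T @ Psi + lam * np.eye(2), Psi.T @ y)

def ridge_objective(w):
    """Penalized squared error ||y - Psi w||^2 + lam * ||w||^2."""
    r = y - Psi @ w
    return r @ r + lam * (w @ w)

print(w_mle)
print(w_ridge)
print(ridge_objective(w_mle), ridge_objective(w_ridge))
```

The ridge solution should coincide with the posterior mean from Phase 2, and it should give a lower value of the penalized objective than the MLE does.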

Comparison

	$w_0$	$w_1$
MLE	1.250	1.450
Posterior mean / MAP	1.127	1.385
Prior mean	0	0

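In both coordinates the posterior mean lies between the prior mean and the MLE. This is the weighted-average effect, with weights set by the prior precision and the data precision, that often appears in STA414 exam questions.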
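To make the weighting explicit, here is a short check that uses only quantities already computed (with $\sigma^2 = 1$). Since $\Psi^\top \Psi\, \hat{w}_{\text{MLE}} = \Psi^\top y$, the posterior mean can be rewritten as

$$\mu_{\text{post}} = (\Psi^\top \Psi + S^{-1})^{-1}\left(\Psi^\top \Psi\, \hat{w}_{\text{MLE}} + S^{-1} \cdot 0\right)$$

so the data precision $\Psi^\top \Psi$ weights the MLE and the prior precision $S^{-1}$ weights the prior mean $0$. Numerically,

$$\Psi^\top \Psi\, \hat{w}_{\text{MLE}} = \begin{pmatrix} 3 & 3 \\ 3 & 5 \end{pmatrix}\begin{pmatrix} 1.25 \\ 1.45 \end{pmatrix} = \begin{pmatrix} 8.1 \\ 11.0 \end{pmatrix} = \Psi^\top y$$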

Phase 4: Predictive Distribution

Goal:

For a new input $x_* = 3$, compute the predictive distribution of $y_*$.

Formula:

$$p(y_* \mid x_*, y, \Psi) = \mathcal{N}(y_*; \mu_*, \sigma_*^2)$$

$$\mu_* = \psi(x_*)^\top \mu_{\text{post}}$$

$$\sigma_*^2 = \psi(x_*)^\top \Sigma_{\text{post}} \psi(x_*) + \sigma^2$$

Step 1. Feature vector

$$\psi(x_*) = \psi(3) = \begin{pmatrix} 1 \\ 3 \end{pmatrix}$$

Step 2. Predictive mean

$$\mu_* = \begin{pmatrix} 1 & 3 \end{pmatrix}\begin{pmatrix} 1.127 \\ 1.385 \end{pmatrix} \approx 5.282$$

Step 3. Predictive variance

$$\psi(x_*)^\top \Sigma_{\text{post}} \psi(x_*) = \begin{pmatrix} 1 & 3 \end{pmatrix}\frac{1}{10.25}\begin{pmatrix} 5.5 & -3 \\ -3 & 3.5 \end{pmatrix}\begin{pmatrix} 1 \\ 3 \end{pmatrix} \approx 1.854$$

$$\sigma_*^2 \approx 1.854 + 1 = 2.854$$

Result:

$$y_* \mid x_* = 3 \sim \mathcal{N}(5.282, 2.854)$$
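The predictive mean and variance can be wrapped in a small helper. This is a sketch assuming NumPy; `predict` is a hypothetical function name introduced for illustration, and the posterior is recomputed so the block runs on its own.

```python
import numpy as np

Psi = np.array([[1.0, 0.0], [1.0, 1.0], [1.0, 2.0]])
y = np.array([1.2, 2.8, 4.1])
sigma2 = 1.0
S = 2.0 * np.eye(2)

# Posterior from Phase 2.
precision = Psi.T @ Psi / sigma2 + np.linalg.inv(S)
Sigma_post = np.linalg.inv(precision)
mu_post = np.linalg.solve(precision, Psi.T @ y / sigma2)

def predict(x_star):
    """Return the predictive mean and variance of y_* at input x_star."""
    psi = np.array([1.0, x_star])               # feature vector psi(x_*)
    mean = psi @ mu_post                        # psi^T mu_post
    var = psi @ Sigma_post @ psi + sigma2       # parameter + noise variance
    return mean, var

mean, var = predict(3.0)
print(mean, var)
```

The output should match the hand calculation up to rounding in the third decimal place, because the hand calculation uses the rounded posterior mean.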

Interpretation:

The prediction is about $5.28$, and uncertainty grows because $x_* = 3$ lies outside the observed training range. This is the core intuition behind posterior predictive uncertainty.
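To see how the uncertainty changes with the input, the predictive variance can be evaluated at several points. This is an illustrative sketch assuming NumPy; the evaluation points in `xs` are arbitrary choices, not values from the notes.

```python
import numpy as np

Psi = np.array([[1.0, 0.0], [1.0, 1.0], [1.0, 2.0]])
sigma2 = 1.0
S = 2.0 * np.eye(2)

# Posterior covariance (does not depend on the observed y).
Sigma_post = np.linalg.inv(Psi.T @ Psi / sigma2 + np.linalg.inv(S))

def predictive_variance(x_star):
    """psi(x_*)^T Sigma_post psi(x_*) + sigma^2 for psi(x) = (1, x)."""
    psi = np.array([1.0, x_star])
    return psi @ Sigma_post @ psi + sigma2

xs = [0.0, 1.0, 2.0, 3.0, 5.0]
variances = [predictive_variance(x) for x in xs]
for x, v in zip(xs, variances):
    print(f"x* = {x:.0f}: predictive variance = {v:.3f}")
```

The variance is smallest near the middle of the training inputs and increases as $x_*$ moves further away, which is the behaviour described above.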

Concept Chain

Data + model
  -> build design matrix Ψ
    -> combine prior + likelihood
      -> posterior: Σ_post, μ_post
        -> compare MLE / MAP
          -> predictive distribution: μ_*, σ_*^2
            -> uncertainty increases outside the training range

Reference

Filling in the missing steps

Where does $-\log p(y | \Psi)$ come from?

1. The full Bayes formula

Without dropping any terms, the posterior of the parameter $w$ is given by:

$$p(w | y, \Psi) = \frac{p(y | w, \Psi) p(w)}{p(y | \Psi)}$$

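$p(w | y, \Psi)$: the posterior (the distribution of $w$ after observing the data).
$p(y | w, \Psi)$: the likelihood (the probability density of the observed data $y$ given $w$).
$p(w)$: the prior (our assumption about $w$ before seeing the data).
$p(y | \Psi)$: the marginal likelihood, also called the evidence. This is where the $-\log p(y | \Psi)$ term comes from.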

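2. Taking the log

To turn products and quotients into sums and differences (which is easier to compute with and avoids numerical underflow), take the natural log of both sides.

The usual high-school log rules apply:

Product to sum: $\log(A \cdot B) = \log(A) + \log(B)$
Quotient to difference: $\log(\frac{A}{B}) = \log(A) - \log(B)$

Applying them to the full Bayes formula: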

$$\log p(w | y, \Psi) = \log \left( \frac{p(y | w, \Psi) p(w)}{p(y | \Psi)} \right)$$

Splitting it up:

$$\log p(w | y, \Psi) = \log(p(w)) + \log(p(y | w, \Psi)) - \log(p(y | \Psi))$$

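Because $p(y | \Psi)$ sits in the denominator of the original formula, it becomes a subtracted term after taking the log.

3. Why does it end up as const?

In machine learning and Bayesian inference, the core task is usually to find the best model parameters $w$ (the maximum a posteriori estimate, MAP).

Look closely at $p(y | \Psi)$: it does not contain $w$ at all. It is a fixed number that depends only on the observed data $y$ and the design matrix $\Psi$ (together with the fixed $\sigma^2$ and $S$).

Since we are looking for the $w$ that maximizes the left-hand side, any term on the right that does not contain $w$ has zero derivative with respect to $w$. It only shifts the whole function up or down and does not move the optimal $w$. So, to simplify the calculation, it is lumped into $\text{const}$ and ignored.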
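The full log form of Bayes' formula, including the evidence term, can be checked numerically on the example data. The sketch below assumes NumPy and relies on the standard result that, for this model, the evidence is $y \mid \Psi \sim \mathcal{N}(0, \Psi S \Psi^\top + \sigma^2 I)$; `gauss_logpdf` and `both_sides` are hypothetical helper names.

```python
import numpy as np

def gauss_logpdf(x, mean, cov):
    """Log density of a multivariate Gaussian N(mean, cov) at x."""
    d = x - mean
    _, logdet = np.linalg.slogdet(cov)
    return -0.5 * (len(x) * np.log(2 * np.pi) + logdet + d @ np.linalg.solve(cov, d))

Psi = np.array([[1.0, 0.0], [1.0, 1.0], [1.0, 2.0]])
y = np.array([1.2, 2.8, 4.1])
sigma2 = 1.0
S = 2.0 * np.eye(2)

precision = Psi.T @ Psi / sigma2 + np.linalg.inv(S)
Sigma_post = np.linalg.inv(precision)
mu_post = np.linalg.solve(precision, Psi.T @ y / sigma2)

# Evidence term: a single number that does not involve w.
log_evidence = gauss_logpdf(y, np.zeros(3), Psi @ S @ Psi.T + sigma2 * np.eye(3))

def both_sides(w):
    """Return (log posterior, log prior + log likelihood - log evidence) at w."""
    lhs = gauss_logpdf(w, mu_post, Sigma_post)
    rhs = (gauss_logpdf(w, np.zeros(2), S)
           + gauss_logpdf(y, Psi @ w, sigma2 * np.eye(3))
           - log_evidence)
    return lhs, rhs

for w in ([0.0, 0.0], [1.0, 1.0], [1.25, 1.45]):
    print(both_sides(np.array(w)))
```

The two numbers in each printed pair should agree, and `log_evidence` is the same for every `w`, which is why it can be absorbed into the constant.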

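Where does the likelihood in Phase 1 come from?

The omitted intermediate steps can be expanded one at a time.

Step 1: The model and the conditional distribution

The starting model, written in vector form, is: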

$$y = \Psi w + \epsilon$$

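where:

$y$ is the observed data (a vector).
$\Psi w$ is the model's deterministic prediction (the matrix $\Psi$ times the weight vector $w$).
$\epsilon$ is the noise term, assumed to be multivariate Gaussian with mean 0 and covariance $\sigma^2 I$, i.e. $\epsilon \sim \mathcal{N}(0, \sigma^2 I)$. Here $I$ is the identity matrix, so the noise components are independent and identically distributed.

Writing down the likelihood $p(y \mid w, \Psi)$ amounts to asking: "given the parameters $w$ and the features $\Psi$, what is the distribution of $y$?"

Because $y$ equals a fixed term ($\Psi w$, which is given) plus zero-mean Gaussian noise ($\epsilon$), $y$ is itself Gaussian. Its mean is shifted to $\Psi w$, and its covariance is that of the noise.

This gives the conditional distribution: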

$$y \mid w, \Psi \sim \mathcal{N}(\Psi w, \sigma^2 I)$$

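Step 2: Substitute into the multivariate Gaussian probability density function (PDF)

For a $D$-dimensional Gaussian $x \sim \mathcal{N}(\mu, \Sigma)$ (here $D$ is a generic dimension, not the feature count from the Notation section), the density is: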

$$p(x) = \frac{1}{(2\pi)^{D/2} |\Sigma|^{1/2}} \exp\left( -\frac{1}{2} (x - \mu)^T \Sigma^{-1} (x - \mu) \right)$$

Now substitute the distribution from Step 1, $\mathcal{N}(\Psi w, \sigma^2 I)$, into this formula:

The random variable $x$ becomes $y$
The mean vector $\mu$ becomes $\Psi w$
The covariance matrix $\Sigma$ becomes $\sigma^2 I$
The dimension becomes $N$, the length of $y$ (not the feature count $D = 2$)

Since $\Sigma = \sigma^2 I$, its inverse is $\Sigma^{-1} = \frac{1}{\sigma^2} I$.

After substitution, the full likelihood is:

$$p(y \mid w, \Psi) = \frac{1}{(2\pi)^{N/2} |\sigma^2 I|^{1/2}} \exp\left( -\frac{1}{2} (y - \Psi w)^T \left(\frac{1}{\sigma^2} I\right) (y - \Psi w) \right)$$

Pull the constant $\frac{1}{\sigma^2}$ out of the exponent, and note that the inner product $(y - \Psi w)^T (y - \Psi w)$ is the squared $L_2$ norm $||y - \Psi w||^2$.

So the exponential factor simplifies to:

$$\exp\left( -\frac{1}{2\sigma^2} ||y - \Psi w||^2 \right)$$

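Step 3: Take the log of both sides

To get the log-likelihood, take the natural log of the expression above.

Using the log property $\log(A \cdot e^B) = \log(A) + B$:

$$\log p(y \mid w, \Psi) = \log\left( \frac{1}{(2\pi)^{N/2} |\sigma^2 I|^{1/2}} \right) + \left( -\frac{1}{2\sigma^2} ||y - \Psi w||^2 \right)$$

Step 4: Separate out the constant (const)

Now look at the logged expression. The end goal is to find the best parameters $w$ under the posterior.

The first term, $\log\left( \frac{1}{(2\pi)^{N/2} |\sigma^2 I|^{1/2}} \right)$, involves $\pi$, the number of observations $N$, and the noise variance $\sigma^2$. It does not contain $w$.

When differentiating or optimizing with respect to $w$, every term without $w$ acts as a constant (it does not affect which $w$ maximizes or minimizes the function). So this whole first term is lumped into const.

This gives the likelihood term used in Phase 1: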

$$\log p(y \mid w, \Psi) = -\frac{1}{2\sigma^2} ||y - \Psi w||^2 + \text{const}$$
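A quick numerical check of this split is to evaluate the full log-likelihood and the quadratic term at different weight vectors and confirm that their difference does not depend on $w$. This is a sketch assuming NumPy; `full_loglik`, `quad_term`, and the two test vectors are illustrative choices.

```python
import numpy as np

Psi = np.array([[1.0, 0.0], [1.0, 1.0], [1.0, 2.0]])
y = np.array([1.2, 2.8, 4.1])
sigma2 = 1.0
N = len(y)

def quad_term(w):
    """The w-dependent part: -||y - Psi w||^2 / (2 sigma^2)."""
    r = y - Psi @ w
    return -0.5 * (r @ r) / sigma2

def full_loglik(w):
    """Full Gaussian log-likelihood, including the normalizing term."""
    # log of 1 / ((2 pi)^(N/2) |sigma^2 I|^(1/2)); |sigma^2 I| = (sigma^2)^N.
    log_norm = -0.5 * N * np.log(2 * np.pi) - 0.5 * N * np.log(sigma2)
    return log_norm + quad_term(w)

w_a = np.array([1.0, 1.0])
w_b = np.array([1.25, 1.45])
const_a = full_loglik(w_a) - quad_term(w_a)
const_b = full_loglik(w_b) - quad_term(w_b)
print(const_a, const_b)
```

Both differences should equal $-\frac{N}{2}\log(2\pi\sigma^2)$, the constant that the derivation drops.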

—

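Notation clarification

$p(\theta \mid D) = \frac{p(D \mid \theta)p(\theta)}{p(D)}$ is the standard abstract form of Bayes' formula in statistics. The version used in these notes is the same formula with the fixed conditioning quantity $\Psi$ written out explicitly.

1. $w \iff \theta$ (model parameters)

In the standard formula, $\theta$ is the unknown model parameter.
In these notes, $w$ stands for weights. In machine learning, whether for linear regression or neural networks, the parameters learned from data are conventionally called $w$.

2. $y \iff D$ (observed data)

In the standard formula, $D$ stands for data.
In these notes, $y$ is the observed target variable (targets/labels). In supervised learning, $x$ usually denotes the input features and $y$ the output. The inputs $x$ do not appear separately because they enter through $\Psi$, so $y$ alone plays the role of the observed data to be explained.

3. $\Psi$ (fixed quantities that are conditioned on)

This is the one item the standard formula does not write out explicitly.
In these notes $\Psi$ is the design matrix built from the inputs $x_n$ (Phase 0). It is known and fixed, and it is not updated by Bayes' rule. Other fixed quantities, such as the noise variance $\sigma^2$ and the prior covariance $S$, play the same conditioning role but are left implicit.
Writing $| \Psi$ on the right of the conditioning bar is for mathematical rigor. It means "given that these fixed quantities have already been specified".

Summary of the mapping:

If, for ease of understanding, you temporarily cover up the fixed background $\Psi$, the core derivation is just: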

$$p(w \mid y) \propto p(y \mid w) p(w)$$

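This has exactly the same form as the standard formula:

$$p(\theta \mid D) \propto p(D \mid \theta) p(\theta)$$

So $p(y \mid w, \Psi)$ corresponds to the familiar $p(D \mid \theta)$, the likelihood: under the given model parameters $w$ (and the fixed design matrix $\Psi$), how probable is the observed data $y$?

Where does the part of Phase 1 after the likelihood come from?

This is a classic derivation in Bayesian linear regression: using completing the square to find the parameters of the posterior.

The goal: given that the prior and the likelihood are both Gaussian, multiply them (add them in log space), show that the resulting posterior is again Gaussian, and derive its covariance matrix $\Sigma_{\text{post}}$ and mean vector $\mu_{\text{post}}$.

(Note: Phase 1 writes "Plug in $\sigma^2 = 1$" above its combined log-posterior equation, but then reintroduces $\sigma^{-2}$ when collecting terms. This looks like a slip in the original notes. For rigor, the walkthrough below keeps $\sigma^2$ general throughout.)

The steps in detail:

1. Identify the "target shape" to match

The log density of any multivariate Gaussian $w \sim \mathcal{N}(\mu, \Sigma)$ can, once expanded, be written as the following quadratic in $w$: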

$$\log p(w) = -\frac{1}{2} w^\top \Sigma^{-1} w + w^\top \Sigma^{-1} \mu + \text{const}$$

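Quadratic term (of the form $w^\top \dots w$): determines the inverse covariance $\Sigma^{-1}$.
Linear term (containing $w$ or $w^\top$): determines the mean $\mu$ once $\Sigma^{-1}$ is known.

The strategy: add the prior and the likelihood, expand all brackets, then pull out the quadratic and linear terms in $w$ and compare them with this target shape.

2. Write the log posterior

By Bayes' formula, $\log(\text{posterior}) = \log(\text{prior}) + \log(\text{likelihood}) + \text{const}$.

Prior term (mean $0$, covariance $S$): $-\frac{1}{2} w^\top S^{-1} w$
Likelihood term: $-\frac{1}{2\sigma^2} \|y - \Psi w\|^2$

Adding them: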

$$\log p(w \mid y) = -\frac{1}{2} w^\top S^{-1} w - \frac{1}{2\sigma^2} \|y - \Psi w\|^2 + \text{const}$$

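3. Expand the square in the likelihood term (Expand the quadratic form)

This corresponds to the "Expand the quadratic form" step in Phase 1. Write the squared $L_2$ norm as a matrix product:

$$\begin{aligned} \|y - \Psi w\|^2 &= (y - \Psi w)^\top (y - \Psi w) \\ &= (y^\top - w^\top \Psi^\top) (y - \Psi w) \\ &= y^\top y - y^\top \Psi w - w^\top \Psi^\top y + w^\top \Psi^\top \Psi w \end{aligned}$$

Because $y^\top \Psi w$ is a scalar (a single number) and a scalar equals its own transpose, $y^\top \Psi w = (y^\top \Psi w)^\top = w^\top \Psi^\top y$. The two middle terms can therefore be merged:

$$\|y - \Psi w\|^2 = y^\top y - 2y^\top \Psi w + w^\top \Psi^\top \Psi w$$

4. Collect the terms involving $w$

Now substitute the expansion back into the equation from step 2: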

$$\log p(w \mid y) = -\frac{1}{2} w^\top S^{-1} w - \frac{1}{2\sigma^2} (y^\top y - 2y^\top \Psi w + w^\top \Psi^\top \Psi w) + \text{const}$$

Next, group everything of the form $w^\top \dots w$ together and everything linear in $w$ together. Since $y^\top y$ does not contain $w$, it goes straight into $\text{const}$.

Quadratic terms $w^\top (\dots) w$:
$-\frac{1}{2} w^\top S^{-1} w - \frac{1}{2\sigma^2} w^\top \Psi^\top \Psi w = -\frac{1}{2} w^\top (S^{-1} + \sigma^{-2} \Psi^\top \Psi) w$
Linear terms:
$-\frac{1}{2\sigma^2} (-2y^\top \Psi w) = \sigma^{-2} y^\top \Psi w$ (the factors $-1/2$ and $-2$ multiply to $1$)

Combining them gives the key step of Phase 1:

$$\log p(w \mid y) = -\frac{1}{2} w^\top (S^{-1} + \sigma^{-2} \Psi^\top \Psi) w + \sigma^{-2} y^\top \Psi w + \text{const}$$

5. Compare with the standard form and conclude (This is a Gaussian quadratic form)

Now compare the derived expression with the standard Gaussian quadratic from step 1:

Quadratic term:

Ours: $-\frac{1}{2} w^\top (S^{-1} + \sigma^{-2} \Psi^\top \Psi) w$

Standard: $-\frac{1}{2} w^\top \Sigma_{\text{post}}^{-1} w$

Conclusion 1: the inverse posterior covariance is $\Sigma_{\text{post}}^{-1} = S^{-1} + \sigma^{-2} \Psi^\top \Psi$. Inverting both sides gives the $\Sigma_{\text{post}}$ formula from Phase 1.

Linear term:

Ours: $w^\top (\sigma^{-2} \Psi^\top y)$ (the transpose of $\sigma^{-2} y^\top \Psi w$, which leaves it unchanged because it is a scalar)

Standard: $w^\top (\Sigma_{\text{post}}^{-1} \mu_{\text{post}})$

Conclusion 2: $\Sigma_{\text{post}}^{-1} \mu_{\text{post}} = \sigma^{-2} \Psi^\top y$. Left-multiplying both sides by $\Sigma_{\text{post}}$ gives the mean formula from Phase 1: $\mu_{\text{post}} = \sigma^{-2} \Sigma_{\text{post}} \Psi^\top y$.

This is conjugacy in action: a Gaussian prior times a Gaussian likelihood is again a Gaussian.
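The result of completing the square can be sanity-checked numerically: the unnormalized log posterior and the Gaussian quadratic form centred at $\mu_{\text{post}}$ should differ only by a constant. This is a sketch assuming NumPy; the function names and the random test points are illustrative.

```python
import numpy as np

Psi = np.array([[1.0, 0.0], [1.0, 1.0], [1.0, 2.0]])
y = np.array([1.2, 2.8, 4.1])
sigma2 = 1.0
S_inv = np.linalg.inv(2.0 * np.eye(2))

# Posterior parameters read off by matching quadratic and linear terms.
precision = S_inv + Psi.T @ Psi / sigma2
mu_post = np.linalg.solve(precision, Psi.T @ y / sigma2)

def log_post_unnorm(w):
    """log prior + log likelihood, dropping terms that do not involve w."""
    r = y - Psi @ w
    return -0.5 * w @ S_inv @ w - 0.5 * (r @ r) / sigma2

def gaussian_quad(w):
    """-(1/2) (w - mu_post)^T Sigma_post^{-1} (w - mu_post)."""
    d = w - mu_post
    return -0.5 * d @ precision @ d

rng = np.random.default_rng(0)
ws = rng.normal(size=(5, 2))   # arbitrary test points
diffs = np.array([log_post_unnorm(w) - gaussian_quad(w) for w in ws])
print(diffs)
```

All entries of `diffs` should be equal up to floating-point error, confirming that the two expressions describe the same Gaussian in $w$.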
