STA414 Final Aidsheet Strategy

This aidsheet is not ordered by chapter.
It is ordered by exam call frequency + compressibility + on-the-spot blanking risk + the question-type weights exposed by the practice final.

There is only one core principle:

The aidsheet is not a summary note; it is an exam-time retrieval device.

What gets baked in is content that is:

  • easily confused
  • easy to get stuck on
  • fatal to the whole question if forgotten
  • yet highly compressible

Final layout

Front side

  1. Graphical Models + Conditional Independence — 22%
  2. Variable Elimination / Message Passing / Belief Propagation / HMM — 16%
  3. Bayesian Regression — 16%
  4. Gaussian Processes + Kernel View — 16%

Back side

  1. Decision Theory — 12%
  2. Variational Inference + EM + Exponential Family — 10%
  3. Sampling: Monte Carlo / Importance Sampling / MCMC — 10%
  4. VAE + Diffusion — 10%
  5. Thin utility strip: Gaussian / matrix tools — 8%

Why this layout

The optimal strategy here is not to split the course content evenly, but to split by question-type clusters.

From the practice final, the heaviest clusters are:

  • Graphical Models
  • Bayesian Linear Regression
  • Gaussian Processes
  • Decision Theory

These are not fringe questions; they are clearly the high-value backbone question types.

So this aidsheet should be organized around these four axes, with everything else attached in compressed form.


Module 1: Graphical Models + Conditional Independence

This is the top-priority module.

Put only the following

DAG factorization

$$ p(x_1,\dots,x_n)=\prod_i p(x_i \mid \mathrm{pa}(x_i)) $$

Plate notation reading rule

d-separation templates

The three path types must be written as ultra-short templates:

  • chain: $A \to B \to C$
  • fork: $A \leftarrow B \to C$
  • collider: $A \to B \leftarrow C$

And state explicitly:

  • conditioning on chain / fork middle node blocks path
  • collider blocks path by default
  • conditioning on collider or its descendant opens path
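The collider rule is the one most often misapplied under pressure. A toy simulation (hypothetical, not for the sheet itself) makes it concrete: A and C are independent bits, B = A XOR C is a collider A → B ← C, and conditioning on B makes A and C perfectly dependent.

```python
import random

# A and C are independent coin flips; B = A XOR C is a collider A -> B <- C.
# Marginally A and C are independent; conditioning on B couples them.
random.seed(0)
samples = [(random.randint(0, 1), random.randint(0, 1)) for _ in range(100_000)]

def cond_probs_given_b(b):
    # Among samples with B = b, compare P(A=1 | C=1) and P(A=1 | C=0).
    sub = [(a, c) for a, c in samples if (a ^ c) == b]
    p_a1_c1 = sum(a for a, c in sub if c == 1) / max(1, sum(1 for _, c in sub if c == 1))
    p_a1_c0 = sum(a for a, c in sub if c == 0) / max(1, sum(1 for _, c in sub if c == 0))
    return p_a1_c1, p_a1_c0

# Given B = 0, A must equal C: P(A=1|C=1) = 1 and P(A=1|C=0) = 0,
# even though A and C are marginally independent.
p1, p0 = cond_probs_given_b(0)
```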

Markov blanket

How to infer graph from a given factorization

Directed graphical model vs MRF

Write a single comparison line; do not expand.

Why this deserves the largest area

Because these questions do not test whether you "know what a graphical model is"; they test whether you can:

  • write the factorization
  • read off conditional independences
  • recover the graph from the joint form
  • judge which paths are blocked / opened

Under exam pressure, a single collider or conditioning slip here loses a whole chain of marks.


Module 2: Variable Elimination / Message Passing / Belief Propagation / HMM

Write this module as a pure template zone, not as long explanations.

Variable elimination skeleton

$$ \text{collect relevant factors} \;\to\; \text{multiply} \;\to\; \text{sum out variable} \;\to\; \text{create new factor} $$

Then add one line:

complexity depends on elimination order

Sum-product message

Pairwise message

$$ m_{i \to j}(x_j) = \sum_{x_i} \phi_i(x_i)\psi_{ij}(x_i,x_j) \prod_{k \in N(i)\setminus j} m_{k \to i}(x_i) $$

Belief

$$ b_i(x_i)\propto \phi_i(x_i)\prod_{k\in N(i)} m_{k\to i}(x_i) $$
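The message and belief formulas above can be sanity-checked on a tiny chain; the potentials below are made-up toy numbers, and the belief at the middle node is compared against a brute-force marginal of the joint.

```python
import numpy as np

# Toy 3-node chain x1 - x2 - x3 with binary states; phi are unary
# potentials, psi pairwise (illustrative numbers, not from the notes).
phi = [np.array([1.0, 2.0]), np.array([1.0, 1.0]), np.array([3.0, 1.0])]
psi12 = np.array([[2.0, 1.0], [1.0, 2.0]])
psi23 = np.array([[1.0, 3.0], [3.0, 1.0]])

# Leaf-to-root messages toward node 2:
# m_{1->2}(x2) = sum_{x1} phi1(x1) psi12(x1, x2)
m12 = psi12.T @ phi[0]
# m_{3->2}(x2) = sum_{x3} phi3(x3) psi23(x2, x3)
m32 = psi23 @ phi[2]

# Belief at node 2: b2(x2) ∝ phi2(x2) * m_{1->2}(x2) * m_{3->2}(x2)
b2 = phi[1] * m12 * m32
b2 /= b2.sum()

# Brute-force check against the full joint.
joint = np.einsum('i,j,k,ij,jk->ijk', phi[0], phi[1], phi[2], psi12, psi23)
marg2 = joint.sum(axis=(0, 2))
marg2 /= marg2.sum()
assert np.allclose(b2, marg2)
```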

HMM factorization

$$ p(z_{1:T},x_{1:T}) = p(z_1)\prod_{t=2}^T p(z_t \mid z_{t-1}) \prod_{t=1}^T p(x_t \mid z_t) $$

Forward-backward

Keep only the forward / backward recursion skeletons; no prose.
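For reference, the forward (alpha) recursion in code form; the 2-state HMM numbers here are illustrative only.

```python
import numpy as np

# alpha recursion: alpha_t(z) = p(x_t | z) * sum_{z'} alpha_{t-1}(z') p(z | z')
pi = np.array([0.6, 0.4])                 # p(z_1)
A = np.array([[0.7, 0.3], [0.2, 0.8]])    # A[i, j] = p(z_t = j | z_{t-1} = i)
B = np.array([[0.9, 0.1], [0.3, 0.7]])    # B[i, k] = p(x_t = k | z_t = i)
obs = [0, 1, 1]

alpha = pi * B[:, obs[0]]
for x in obs[1:]:
    alpha = (alpha @ A) * B[:, x]

evidence = alpha.sum()   # p(x_{1:T})
```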

Why this is separate from Module 1

Module 1 is structure-recognition questions; this module is computation questions.
They are retrieved differently and must not be merged into one big block.


Module 3: Bayesian Regression

This block holds only the endpoint formulas and the completing-the-square template.

Likelihood

$$ p(y \mid X,w,\sigma^2)=\mathcal N(y \mid Xw,\sigma^2 I) $$

Ordinary least squares

$$ \hat w_{LS} = (X^T X)^{-1}X^T y $$

Key point

MLE and OLS coincide whenever the noise covariance only rescales the squared-error objective by a scalar, as with isotropic Gaussian noise.

Memorize the most common exam version directly as:

  • isotropic Gaussian noise
  • Gaussian prior
  • posterior is Gaussian

Posterior structure

Write it in precision form:

$$ \Lambda_{\mathrm{post}} = \Lambda_{\mathrm{prior}} + \Lambda_{\mathrm{like}} $$

$$ \mu_{\mathrm{post}} = \Lambda_{\mathrm{post}}^{-1} \bigl(\text{prior linear term} + \text{likelihood linear term}\bigr) $$

For the most common version (Gaussian prior $w \sim \mathcal N(\mu, I)$), write directly:

$$ \Sigma_{\mathrm{post}} = \left(I + \frac{1}{\sigma^2}X^T X\right)^{-1} $$

$$ \mu_{\mathrm{post}} = \left(I + \frac{1}{\sigma^2}X^T X\right)^{-1} \left(\mu + \frac{1}{\sigma^2}X^T y\right) $$
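These two formulas are easy to check numerically. A minimal sketch, assuming the common case above (prior $w \sim \mathcal N(\mu, I)$, isotropic noise); the data here are synthetic.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(20, 3))
w_true = np.array([1.0, -2.0, 0.5])
sigma2 = 0.25
y = X @ w_true + rng.normal(scale=np.sqrt(sigma2), size=20)
mu = np.zeros(3)   # prior mean

# Posterior covariance and mean in the common-case form.
Sigma_post = np.linalg.inv(np.eye(3) + X.T @ X / sigma2)
mu_post = Sigma_post @ (mu + X.T @ y / sigma2)
```

A useful self-check: the posterior precision times the posterior mean must reproduce the combined linear term.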

Orthogonal features case

$$ X^T X = I $$

Then the posterior mean is a weighted average of the prior mean and the OLS solution.

This line must go on the sheet: it is a natural short-derivation question.

Warning

This block most often dies on:

  • errors in the quadratic expansion
  • a dropped sign on the linear term
  • mean / covariance swapped in the final completing-the-square step

So keep a small formula next to it:

For

$$ -\frac12 w^T A w + b^T w + \text{const} $$

the corresponding Gaussian mean / covariance are

$$ \Sigma = A^{-1}, \qquad \mu = A^{-1}b $$

Module 4: Gaussian Processes + Kernel View

Do not write GP as a long definition.
Write only the block Gaussian conditioning template.

Prior

$$ f \sim GP(m,k) $$

Observation model

$$ y = f + \epsilon, \qquad \epsilon \sim \mathcal N(0,\sigma^2 I) $$

Gram matrix

$$ K_{ij} = k(x_i,x_j) $$

Training marginal

$$ y_N \sim \mathcal N(0, K_N + \sigma^2 I) $$

Joint train-test form

$$ \begin{bmatrix} y_N \\ y_* \end{bmatrix} \sim \mathcal N \left( 0, \begin{bmatrix} K_N+\sigma^2 I & k_* \\ k_*^T & c \end{bmatrix} \right) $$

Predictive mean

$$ \mu_* = k_*^T (K_N+\sigma^2 I)^{-1} y_N $$

Predictive variance

$$ \sigma_*^2 = c - k_*^T (K_N+\sigma^2 I)^{-1} k_* $$
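The two predictive equations translate directly to code. A sketch with an RBF kernel and toy 1-D data (both illustrative choices, not fixed by the notes):

```python
import numpy as np

def k(a, b, ell=1.0):
    # RBF kernel Gram matrix between 1-D input vectors a and b.
    return np.exp(-0.5 * (a[:, None] - b[None, :])**2 / ell**2)

Xtr = np.array([-1.0, 0.0, 1.0])
ytr = np.array([0.2, 0.9, 0.1])
xs = np.array([0.5])      # single test point
sigma2 = 0.1

KN = k(Xtr, Xtr)          # K_N
ks = k(Xtr, xs)           # k_*, shape (N, 1)
c = k(xs, xs)             # prior variance at the test point

A = KN + sigma2 * np.eye(len(Xtr))
mu_star = ks.T @ np.linalg.solve(A, ytr)
var_star = c - ks.T @ np.linalg.solve(A, ks)
```

Note the checkable property: the predictive variance is always strictly below the prior variance `c` once any data are observed.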

Kernel view

One sentence suffices:

GP prediction is Gaussian conditioning on the joint covariance induced by the kernel.

Why this is a main module

These questions are rarely failed through misunderstanding; they are failed by blanking on:

  • whether the training covariance is $K$ or $K+\sigma^2 I$
  • where the minus sign sits in the predictive variance
  • the structure of the joint block Gaussian

That makes it ideal as a formula card.


Module 5: Decision Theory

This time it cannot be relegated to a corner.

Bayes decision rule

$$ \text{choose class 1 if } p(t=1 \mid x) > p(t=0 \mid x) $$

Under equal priors, write directly:

$$ p(x \mid t=1) > p(x \mid t=0) $$

Misclassification rate

$$ P(\mathrm{error}) = P(x \in R_0, t=1) + P(x \in R_1, t=0) $$

Equal variance Gaussian classes

$$ x \mid t=0 \sim \mathcal N(\mu_0,\sigma^2), \qquad x \mid t=1 \sim \mathcal N(\mu_1,\sigma^2) $$

Then the decision boundary is at the midpoint of the two means.

Same mean, unequal variances

If the means are equal but the variances differ, the decision region becomes a centered-interval / outside-interval structure.
Write the threshold on its own line; the full derivation is unnecessary.

Standard normal CDF transform

$$ P(a \le X \le b) $$

becomes

$$ \Phi\left(\frac{b-\mu}{\sigma}\right)-\Phi\left(\frac{a-\mu}{\sigma}\right) $$
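A worked instance of the equal-variance case using this transform; the numbers are illustrative, and `Phi` is a small helper built on the standard library's `math.erf`.

```python
from math import erf, sqrt

def Phi(z):
    # Standard normal CDF via the error function.
    return 0.5 * (1.0 + erf(z / sqrt(2.0)))

# Equal priors, x|t=0 ~ N(0, 1), x|t=1 ~ N(2, 1): boundary at the midpoint 1.
mu0, mu1, sigma = 0.0, 2.0, 1.0
thresh = (mu0 + mu1) / 2

# P(error) = 0.5 * P(x > thresh | t=0) + 0.5 * P(x < thresh | t=1)
p_err = 0.5 * (1 - Phi((thresh - mu0) / sigma)) + 0.5 * Phi((thresh - mu1) / sigma)
# Both terms equal Phi(-1), so p_err = Phi(-1) ≈ 0.1587
```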

Warning

The most common mistakes here:

  • writing the posterior comparison as a prior comparison
  • integrating the two error regions in the wrong directions
  • correct threshold, but the wrong region in the misclassification formula

Module 6: Variational Inference + EM + Exponential Family

Compressing these three together saves the most area.

ELBO

$$ \mathcal L(q) = \mathbb E_{q(z\mid x)} \bigl[\log p(x,z)-\log q(z\mid x)\bigr] $$

Equivalent form:

$$ \mathcal L(q) = \mathbb E_q[\log p(x\mid z)] - KL(q(z\mid x)\,\|\,p(z)) $$

Write it only where it applies.

ELBO identity

$$ \log p(x) = \mathcal L(q) + KL(q(z\mid x)\,\|\,p(z\mid x)) $$

Hence

$$ \mathcal L(q)\le \log p(x) $$

Jensen

One sentence suffices:

for concave $\log$,

$$ \log \mathbb E[X] \ge \mathbb E[\log X] $$

KL direction

One line only:

  • $KL(q\|p)$ often mode-seeking
  • $KL(p\|q)$ often mass-covering

EM

$$ Q(\theta,\theta^{old}) = \mathbb E_{p(z\mid x,\theta^{old})} [\log p(x,z \mid \theta)] $$
  • E-step: compute posterior over latent variables using old parameters
  • M-step: maximize $Q(\theta,\theta^{old})$ over $\theta$

Then add one line:

EM maximizes expected complete-data log-likelihood, not the incomplete-data likelihood directly.
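A toy 1-D mixture of two unit-variance Gaussians (made-up data and initialization) makes the E-step / M-step loop concrete and shows the guaranteed monotone increase of the incomplete-data log-likelihood:

```python
import numpy as np

rng = np.random.default_rng(1)
x = np.concatenate([rng.normal(-2, 1, 200), rng.normal(3, 1, 300)])

mu = np.array([-1.0, 1.0])   # component means (to be learned)
pi = np.array([0.5, 0.5])    # mixing weights (to be learned)

def loglik():
    comp = pi * np.exp(-0.5 * (x[:, None] - mu)**2) / np.sqrt(2 * np.pi)
    return np.log(comp.sum(axis=1)).sum()

lls = [loglik()]
for _ in range(30):
    # E-step: responsibilities r[n, k] = p(z_n = k | x_n, theta_old)
    comp = pi * np.exp(-0.5 * (x[:, None] - mu)**2) / np.sqrt(2 * np.pi)
    r = comp / comp.sum(axis=1, keepdims=True)
    # M-step: maximize the expected complete-data log-likelihood
    Nk = r.sum(axis=0)
    mu = (r * x[:, None]).sum(axis=0) / Nk
    pi = Nk / len(x)
    lls.append(loglik())
```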

Exponential family

$$ p(x \mid \eta) = h(x)\exp\{\eta^T T(x)-A(\eta)\} $$

Keep only:

  • sufficient statistic $T(x)$
  • natural parameter $\eta$
  • log-partition $A(\eta)$

Module 7: Sampling — MC / IS / MCMC

Write only estimator templates here.

Simple Monte Carlo

$$ \hat e = \frac{1}{S}\sum_{i=1}^S f(x^{(i)}), \qquad x^{(i)} \sim p $$

Unbiasedness

$$ \mathbb E[\hat e] = \mathbb E_p[f(x)] $$

Variance

$$ \mathrm{Var}(\hat e) = \frac{\mathrm{Var}_p(f(x))}{S} $$

Importance sampling identity

$$ \mathbb E_p[f(x)] = \mathbb E_q\left[f(x)\frac{p(x)}{q(x)}\right] $$

IS estimator

$$ \hat e_{IS} = \frac{1}{S}\sum_{i=1}^S f(x^{(i)}) \frac{p(x^{(i)})}{q(x^{(i)})}, \qquad x^{(i)}\sim q $$

Normalized weights

$$ \tilde w_i = \frac{w_i}{\sum_j w_j} $$
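Both estimators can be checked on a case with a known answer; the target $p = \mathcal N(0,1)$, test function $f(x)=x^2$ (truth: 1), and proposal $q = \mathcal N(0, 2^2)$ are all illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(0)
S = 200_000
f = lambda x: x**2

# Simple Monte Carlo: sample from p = N(0, 1) directly.
xs = rng.normal(size=S)
mc = f(xs).mean()

# Importance sampling: sample from q = N(0, q_sd^2), reweight by p/q.
q_sd = 2.0
xq = rng.normal(scale=q_sd, size=S)
log_w = (-0.5 * xq**2) - (-0.5 * (xq / q_sd)**2 - np.log(q_sd))
w = np.exp(log_w)
is_est = (f(xq) * w).mean()
```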

MCMC

Keep only the skeleton:

  • proposal
  • acceptance ratio
  • target as stationary distribution

No long explanations.


Module 8: VAE + Diffusion

Write AE / VAE as ready-made answer-sentence templates.

Deterministic autoencoder

  • encoder maps input to a single point
  • reconstruction loss only
  • no latent regularization
  • no guarantee that nearby inputs map to nearby latent codes
  • discontinuities / holes can appear in latent space

VAE encoder

$$ q_\phi(z\mid x)=\mathcal N(\mu_\phi(x), \mathrm{diag}(\sigma_\phi^2(x))) $$

Reparameterization

$$ z = \mu + \sigma \odot \epsilon, \qquad \epsilon \sim \mathcal N(0,I) $$
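A minimal sketch of why this matters: $z$ becomes a deterministic function of $(\mu,\sigma)$ and noise, so pathwise derivatives of an expectation can be estimated by plain sampling (toy numbers below).

```python
import numpy as np

rng = np.random.default_rng(0)
mu, sigma = 1.5, 0.5
eps = rng.normal(size=100_000)
z = mu + sigma * eps       # reparameterized sample of N(mu, sigma^2)

# Pathwise gradient of E[z^2] w.r.t. mu: E[2z * dz/dmu] = E[2z] = 2*mu.
grad_mu = (2 * z).mean()
```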

VAE ELBO

$$ \mathbb E_{q_\phi(z\mid x)}[\log p_\theta(x\mid z)] - KL(q_\phi(z\mid x)\,\|\,p(z)) $$

Interpretation

  • reconstruction term preserves information
  • KL term regularizes latent space
  • encourages overlap and smoothness
  • makes interpolation / generation meaningful

Diffusion

Write only the three core lines:

  • forward process gradually adds noise
  • reverse process learns denoising transitions
  • training often predicts noise / score instead of direct data reconstruction

Thin utility strip

Do not make this a standalone module; put it as a thin strip along the edge.

Must include

Bayes rule

$$ p(z\mid x)=\frac{p(x\mid z)p(z)}{p(x)} $$

Gaussian conditioning

$$ \begin{bmatrix} y_1\\ y_2 \end{bmatrix} \sim \mathcal N \left( \begin{bmatrix} \mu_1\\ \mu_2 \end{bmatrix}, \begin{bmatrix} \Sigma_{11} & \Sigma_{12}\\ \Sigma_{21} & \Sigma_{22} \end{bmatrix} \right) $$

$$ y_2 \mid (y_1=a) \sim \mathcal N \bigl( \mu_2+\Sigma_{21}\Sigma_{11}^{-1}(a-\mu_1), \Sigma_{22}-\Sigma_{21}\Sigma_{11}^{-1}\Sigma_{12} \bigr) $$
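The same conditioning formula as a small helper; the function name and toy numbers are mine, for illustration only.

```python
import numpy as np

def condition(mu, Sigma, idx_obs, a):
    """Mean and covariance of the remaining coordinates given y_obs = a."""
    n = len(mu)
    rest = [i for i in range(n) if i not in idx_obs]
    S11 = Sigma[np.ix_(idx_obs, idx_obs)]
    S21 = Sigma[np.ix_(rest, idx_obs)]
    S22 = Sigma[np.ix_(rest, rest)]
    gain = S21 @ np.linalg.inv(S11)          # Sigma_21 Sigma_11^{-1}
    mu_c = mu[rest] + gain @ (a - mu[idx_obs])
    Sigma_c = S22 - gain @ S21.T             # Sigma_22 - Sigma_21 Sigma_11^{-1} Sigma_12
    return mu_c, Sigma_c

mu = np.array([0.0, 1.0])
Sigma = np.array([[2.0, 0.6], [0.6, 1.0]])
mu_c, Sigma_c = condition(mu, Sigma, [0], np.array([1.0]))
# y_2 | y_1 = 1: mean = 1 + 0.3 * 1 = 1.3, variance = 1 - 0.18 = 0.82
```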

Completing the square

For

$$ -\frac12 w^T A w + b^T w + c $$

$$ \mu=A^{-1}b,\qquad \Sigma=A^{-1} $$

Matrix derivatives

$$ \nabla_w (b^T w)=b $$

$$ \nabla_w (w^T A w)=(A+A^T)w $$

If $A$ is symmetric:

$$ \nabla_w (w^T A w)=2Aw $$
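The gradient identity is easy to verify by central finite differences on a random (non-symmetric) $A$ and $w$; an illustrative check:

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.normal(size=(3, 3))   # deliberately not symmetric
w = rng.normal(size=3)

analytic = (A + A.T) @ w      # grad of w^T A w

# Central difference along each coordinate direction e.
eps = 1e-6
numeric = np.array([
    ((w + eps * e) @ A @ (w + eps * e) - (w - eps * e) @ A @ (w - eps * e)) / (2 * eps)
    for e in np.eye(3)
])
```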

Jensen

Standard normal CDF notation

$$ \Phi(\cdot) $$

What should not be on the aidsheet

Do not include:

  • definitions you already know cold
  • long natural-language explanations
  • pure intuition pictures
  • historical background
  • prose that cannot directly trigger a solution step
  • specific numeric answers from the practice final

The aidsheet's goal is not to "explain things clearly".
It is to pull you out of a stuck point under exam pressure.


Final working rule

Each module keeps only three layers:

  1. one-line definition
  2. one-line core formula
  3. one-line warning / common failure point

Anything still explaining beyond these three layers should be cut.


Final conclusion

This STA414 aidsheet should not be split evenly across chapters.
It should be organized around four axes:

  • Graphical Models
  • Bayesian Regression
  • Gaussian Processes
  • Decision Theory

Everything else becomes compressed attachments.

That is the layout best suited to a double-sided A4 sheet.