STA414 Final Aidsheet Strategy
This aidsheet is not ordered by chapter.
It is ordered by exam-time recall frequency + compressibility + risk of blanking under pressure + question-type weights revealed by the practice final.
There is only one core principle:
the aidsheet is not a summary note; it is an exam-time retrieval device.
What gets baked in is content that is:
- easy to confuse
- easy to get stuck on
- fatal to the whole question if forgotten
- yet highly compressible
Final layout
Front side
- Graphical Models + Conditional Independence — 22%
- Variable Elimination / Message Passing / Belief Propagation / HMM — 16%
- Bayesian Regression — 16%
- Gaussian Processes + Kernel View — 16%
Back side
- Decision Theory — 12%
- Variational Inference + EM + Exponential Family — 10%
- Sampling: Monte Carlo / Importance Sampling / MCMC — 10%
- VAE + Diffusion — 10%
- Thin utility strip: Gaussian / matrix tools — 8%
Why this layout
The optimal strategy this time is not to split the course content evenly, but to cut by question-type cluster.
Judging from the practice final, the heaviest areas are:
- Graphical Models
- Bayesian Linear Regression
- Gaussian Processes
- Decision Theory
These are not fringe questions; they are clearly the high-value backbone question types.
This aidsheet should therefore be built around these four axes, with everything else attached in compressed form.
Module 1: Graphical Models + Conditional Independence
This is the top-priority module.
Put only the following
DAG factorization
$$ p(x_1,\dots,x_n)=\prod_i p(x_i \mid \mathrm{pa}(x_i)) $$
Plate notation reading rule
d-separation templates
The three path types must be written as ultra-short templates:
- chain: $A \to B \to C$
- fork: $A \leftarrow B \to C$
- collider: $A \to B \leftarrow C$
And state explicitly:
- conditioning on chain / fork middle node blocks path
- collider blocks path by default
- conditioning on collider or its descendant opens path
Markov blanket
How to infer graph from a given factorization
Directed graphical model vs MRF
Write a single comparison line; do not expand.
Why this deserves the largest area
These questions are not testing whether you "know what a graphical model is"; they test whether you can:
- write the factorization
- read off conditional independences
- recover the graph from a joint form
- judge which paths are blocked / open
Under exam pressure, these questions most often lose an entire chain of marks to a single collider or conditioning mistake.
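The three path templates above collapse into a tiny lookup — a sketch for self-testing the blocking rules, where the function and argument names are illustrative, not course notation:

```python
# d-separation along a single three-node path A-B-C.
# Returns True when the path is blocked (A and C are d-separated along it).
def path_blocked(kind, middle_observed, descendant_observed=False):
    """kind in {'chain', 'fork', 'collider'}."""
    if kind in ('chain', 'fork'):
        # conditioning on the middle node blocks chain / fork paths
        return middle_observed
    # collider: blocked by default; observing it or a descendant opens it
    return not (middle_observed or descendant_observed)
```

Useful as a drill: quiz yourself on a structure, then check against the function.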
Module 2: Variable Elimination / Message Passing / Belief Propagation / HMM
Write this module as a pure template zone, not as long explanations.
Variable elimination skeleton
$$ \text{collect relevant factors} \;\to\; \text{multiply} \;\to\; \text{sum out variable} \;\to\; \text{create new factor} $$
Then add one line:
complexity depends on elimination order
Sum-product message
Pairwise message
$$ m_{i \to j}(x_j) = \sum_{x_i} \phi_i(x_i)\psi_{ij}(x_i,x_j) \prod_{k \in N(i)\setminus j} m_{k \to i}(x_i) $$
Belief
$$ b_i(x_i)\propto \phi_i(x_i)\prod_{k\in N(i)} m_{k\to i}(x_i) $$
HMM factorization
$$ p(z_{1:T},x_{1:T}) = p(z_1)\prod_{t=2}^T p(z_t \mid z_{t-1}) \prod_{t=1}^T p(x_t \mid z_t) $$
Forward-backward
Keep only the forward / backward recursion skeleton; no prose.
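The forward recursion for the HMM factorization above can be sanity-checked in a few lines; the 2-state parameters below are made-up illustrative numbers:

```python
import numpy as np

pi = np.array([0.6, 0.4])              # p(z_1)
A  = np.array([[0.7, 0.3],
               [0.2, 0.8]])            # A[i, j] = p(z_t = j | z_{t-1} = i)
B  = np.array([[0.9, 0.1],
               [0.3, 0.7]])            # B[i, x] = p(x_t = x | z_t = i)

def likelihood(obs):
    """p(x_{1:T}) via alpha_t(z) = p(x_t|z) * sum_{z'} alpha_{t-1}(z') p(z|z')."""
    alpha = pi * B[:, obs[0]]
    for x in obs[1:]:
        alpha = (alpha @ A) * B[:, x]  # sum out z_{t-1}, then multiply emission
    return alpha.sum()                 # sum out z_T
```

Summing `likelihood` over all observation sequences of a fixed length returns 1, which is a quick correctness check on the recursion.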
Why this is separate from Module 1
Module 1 is structure recognition.
This module is computation.
They are retrieved in different ways and must not be merged into one big block.
Module 3: Bayesian Regression
Put only the endpoint formulas and the completing-the-square template here.
Likelihood
$$ p(y \mid X,w,\sigma^2)=\mathcal N(y \mid Xw,\sigma^2 I) $$
Ordinary least squares
$$ \hat w_{LS} = (X^T X)^{-1}X^T y $$
Key point
MLE coincides with OLS whenever the noise covariance is a scalar multiple of the identity, since the noise variance then only rescales the squared-error objective.
The most common exam version is memorized directly as:
- isotropic Gaussian noise
- Gaussian prior
- posterior is Gaussian
Posterior structure
Write it in precision form:
$$ \Lambda_{\mathrm{post}} = \Lambda_{\mathrm{prior}} + \Lambda_{\mathrm{like}} $$
$$ \mu_{\mathrm{post}} = \Lambda_{\mathrm{post}}^{-1} \bigl(\text{prior linear term} + \text{likelihood linear term}\bigr) $$
For the most common version, write directly:
$$ \Sigma_{\mathrm{post}} = \left(I + \frac{1}{\sigma^2}X^T X\right)^{-1} $$
$$ \mu_{\mathrm{post}} = \left(I + \frac{1}{\sigma^2}X^T X\right)^{-1} \left(\mu + \frac{1}{\sigma^2}X^T y\right) $$
Orthogonal features case
If
$$ X^T X = I $$
then the posterior mean is a weighted average of the prior mean and the OLS solution.
This line must be on the sheet; it is well suited to a short derivation question.
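A numeric sanity check of the posterior formulas, assuming prior $w \sim \mathcal N(\mu, I)$ and noise variance $\sigma^2$; the data below are made up:

```python
import numpy as np

rng = np.random.default_rng(0)
X = np.linalg.qr(rng.normal(size=(10, 3)))[0]   # orthonormal columns: X^T X = I
y = rng.normal(size=10)
mu, sigma2 = np.full(3, 0.5), 2.0               # prior mean, noise variance

Sigma_post = np.linalg.inv(np.eye(3) + X.T @ X / sigma2)
mu_post = Sigma_post @ (mu + X.T @ y / sigma2)

# With X^T X = I, the OLS solution reduces to X^T y, and the posterior
# mean is a convex combination of the prior mean and OLS:
w_ols = X.T @ y
alpha = (1 / sigma2) / (1 + 1 / sigma2)
```

Verifying `mu_post == (1 - alpha) * mu + alpha * w_ols` is exactly the weighted-average statement above.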
Warning
The likeliest failure points here:
- errors in the quadratic expansion
- dropped signs in the linear term
- swapping the final mean / covariance when completing the square
So keep a small formula next to it:
for
$$ -\frac12 w^T A w + b^T w + \text{const} $$
the corresponding Gaussian mean / covariance are
$$ \Sigma = A^{-1}, \qquad \mu = A^{-1}b $$
Module 4: Gaussian Processes + Kernel View
Do not write the GP as a long definition.
Write only the block Gaussian conditioning template.
Prior
$$ f \sim GP(m,k) $$
Observation model
$$ y = f + \epsilon, \qquad \epsilon \sim \mathcal N(0,\sigma^2 I) $$
Gram matrix
$$ K_{ij} = k(x_i,x_j) $$
Training marginal
$$ y_N \sim \mathcal N(0, K_N + \sigma^2 I) $$
Joint train-test form
$$ \begin{bmatrix} y_N \\ y_* \end{bmatrix} \sim \mathcal N \left( 0, \begin{bmatrix} K_N+\sigma^2 I & k_* \\ k_*^T & c \end{bmatrix} \right) $$
Predictive mean
$$ \mu_* = k_*^T (K_N+\sigma^2 I)^{-1} y_N $$
Predictive variance
$$ \sigma_*^2 = c - k_*^T (K_N+\sigma^2 I)^{-1} k_* $$
Kernel view
One line suffices:
GP prediction is Gaussian conditioning on the joint covariance induced by the kernel.
Why this is a main module
The failure mode here is not misunderstanding but forgetting:
- whether the training covariance is $K$ or $K+\sigma^2 I$
- where the minus sign goes in the predictive variance
- the structure of the joint block Gaussian
So it is ideal as a pure formula card.
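The predictive equations above in executable form — a sketch where the RBF kernel, lengthscale, and tiny dataset are all illustrative choices:

```python
import numpy as np

def k(a, b, ell=1.0):
    """RBF kernel matrix between 1-D input arrays a and b."""
    return np.exp(-0.5 * (a[:, None] - b[None, :]) ** 2 / ell ** 2)

xN = np.array([-1.0, 0.0, 1.5])        # training inputs
yN = np.array([0.3, -0.2, 0.8])        # training targets
xs = np.array([0.5])                   # test input
s2 = 0.1                               # noise variance sigma^2

KN, ks, c = k(xN, xN), k(xN, xs), k(xs, xs)
A = np.linalg.inv(KN + s2 * np.eye(len(xN)))   # note: K + sigma^2 I, not K
mu_star = (ks.T @ A @ yN)[0]                   # predictive mean
var_star = (c - ks.T @ A @ ks)[0, 0]           # predictive variance of f_*
```

The predictive variance must come out strictly between 0 and the prior value $c = k(x_*, x_*)$, which is a quick check on the minus-sign placement.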
Module 5: Decision Theory
This time it can no longer be relegated to a corner.
Bayes decision rule
$$ \text{choose class 1 if } p(t=1 \mid x) > p(t=0 \mid x) $$
Under equal priors, write directly:
$$ p(x \mid t=1) > p(x \mid t=0) $$
Misclassification rate
$$ P(\mathrm{error}) = P(x \in R_0, t=1) + P(x \in R_1, t=0) $$
Equal variance Gaussian classes
If
$$ x \mid t=0 \sim \mathcal N(\mu_0,\sigma^2), \qquad x \mid t=1 \sim \mathcal N(\mu_1,\sigma^2) $$
then the decision boundary lies at the midpoint $(\mu_0+\mu_1)/2$.
Same mean, unequal variances
If the means are equal but the variances differ, the decision region becomes a centered-interval / outside-interval structure.
Write the threshold on its own line; the full derivation is unnecessary.
Standard normal CDF transform
Convert
$$ P(a \le X \le b) $$
into
$$ \Phi\left(\frac{b-\mu}{\sigma}\right)-\Phi\left(\frac{a-\mu}{\sigma}\right) $$
Warning
The most common errors here:
- writing the posterior comparison as a prior comparison
- integrating the two error regions in the wrong directions
- finding the right threshold but writing the misclassification formula over the wrong region
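The equal-variance case above, worked end to end with illustrative numbers (equal priors, midpoint threshold, error rate via $\Phi$):

```python
from statistics import NormalDist

mu0, mu1, sigma = 0.0, 2.0, 1.0
thr = (mu0 + mu1) / 2                  # midpoint decision boundary
Phi = NormalDist().cdf                 # standard normal CDF

# P(error) = 0.5 * P(x > thr | t=0) + 0.5 * P(x < thr | t=1)
p_err = 0.5 * (1 - Phi((thr - mu0) / sigma)) + 0.5 * Phi((thr - mu1) / sigma)
```

By symmetry both error terms equal $\Phi(-1)$ here, so the total is $\Phi(-1) \approx 0.159$; checking that catches the region-direction mistakes listed above.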
Module 6: Variational Inference + EM + Exponential Family
Compressing these three together saves the most area.
ELBO
$$ \mathcal L(q) = \mathbb E_{q(z\mid x)} \bigl[\log p(x,z)-\log q(z\mid x)\bigr] $$
Equivalent form:
$$ \mathcal L(q) = \mathbb E_q[\log p(x\mid z)] - KL(q(z\mid x)\,\|\,p(z)) $$
Write this only where it applies.
ELBO identity
$$ \log p(x) = \mathcal L(q) + KL(q(z\mid x)\,\|\,p(z\mid x)) $$
Hence
$$ \mathcal L(q)\le \log p(x) $$
Jensen
One line suffices:
for concave $\log$,
$$ \log \mathbb E[X] \ge \mathbb E[\log X] $$
KL direction
Write one line only:
- $KL(q\|p)$ often mode-seeking
- $KL(p\|q)$ often mass-covering
EM
$$ Q(\theta,\theta^{old}) = \mathbb E_{p(z\mid x,\theta^{old})} [\log p(x,z \mid \theta)] $$
- E-step: compute posterior over latent variables using old parameters
- M-step: maximize $Q(\theta,\theta^{old})$ over $\theta$
Then add one line:
EM maximizes expected complete-data log-likelihood, not the incomplete-data likelihood directly.
Exponential family
$$ p(x \mid \eta) = h(x)\exp\{\eta^T T(x)-A(\eta)\} $$
Keep only:
- sufficient statistic $T(x)$
- natural parameter $\eta$
- log-partition $A(\eta)$
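The exponential-family form above, checked on Bernoulli($p$) using the standard identities $h(x)=1$, $T(x)=x$, $\eta=\log\frac{p}{1-p}$, $A(\eta)=\log(1+e^\eta)$:

```python
import math

p = 0.3
eta = math.log(p / (1 - p))            # natural parameter
A = math.log(1 + math.exp(eta))        # log-partition
# p(x | eta) = exp(eta * T(x) - A(eta)) with T(x) = x, h(x) = 1
probs = {x: math.exp(eta * x - A) for x in (0, 1)}
```

Both probabilities recover $p$ and $1-p$ exactly, which is a fast way to confirm you remembered $\eta$ and $A(\eta)$ correctly.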
Module 7: Sampling — MC / IS / MCMC
Write only estimator templates here.
Simple Monte Carlo
$$ \hat e = \frac{1}{S}\sum_{i=1}^S f(x^{(i)}), \qquad x^{(i)} \sim p $$
Unbiasedness
$$ \mathbb E[\hat e] = \mathbb E_p[f(x)] $$
Variance
$$ \mathrm{Var}(\hat e) = \frac{\mathrm{Var}_p(f(x))}{S} $$
Importance sampling identity
$$ \mathbb E_p[f(x)] = \mathbb E_q\left[f(x)\frac{p(x)}{q(x)}\right] $$
IS estimator
$$ \hat e_{IS} = \frac{1}{S}\sum_{i=1}^S f(x^{(i)}) \frac{p(x^{(i)})}{q(x^{(i)})}, \qquad x^{(i)}\sim q $$
Normalized weights
$$ \tilde w_i = \frac{w_i}{\sum_j w_j} $$
MCMC
Keep only the skeleton:
- proposal
- acceptance ratio
- target as stationary distribution
Do not expand into long explanations.
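The IS estimator above can be sanity-checked numerically; the target $p = \mathcal N(0,1)$, proposal $q = \mathcal N(0,2^2)$, and $f(x)=x^2$ (so $\mathbb E_p[f]=1$) are illustrative choices:

```python
import math
import random

def normal_pdf(x, s):
    """Density of N(0, s^2)."""
    return math.exp(-0.5 * (x / s) ** 2) / (s * math.sqrt(2 * math.pi))

random.seed(0)
S = 200_000
total = 0.0
for _ in range(S):
    x = random.gauss(0.0, 2.0)                      # x ~ q
    total += x ** 2 * normal_pdf(x, 1.0) / normal_pdf(x, 2.0)
est = total / S                                     # estimates E_p[x^2] = 1
```

A proposal wider than the target keeps the weights bounded, so the estimate concentrates near 1; swapping to a narrower proposal is a good way to see weight degeneracy.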
Module 8: VAE + Diffusion
Write AE / VAE as "answer-sentence templates".
Deterministic autoencoder
- encoder maps input to a single point
- reconstruction loss only
- no latent regularization
- no guarantee that nearby inputs map to nearby latent codes
- discontinuities / holes can appear in latent space
VAE encoder
$$ q_\phi(z\mid x)=\mathcal N(\mu_\phi(x), \mathrm{diag}(\sigma_\phi^2(x))) $$
Reparameterization
$$ z = \mu + \sigma \odot \epsilon, \qquad \epsilon \sim \mathcal N(0,I) $$
VAE ELBO
$$ \mathbb E_{q_\phi(z\mid x)}[\log p_\theta(x\mid z)] - KL(q_\phi(z\mid x)\,\|\,p(z)) $$
Interpretation
- reconstruction term preserves information
- KL term regularizes latent space
- encourages overlap and smoothness
- makes interpolation / generation meaningful
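A toy sketch of why the reparameterization above matters: the gradient of $\mathbb E_q[z^2]$ with respect to $\mu$ can be estimated by differentiating through $z = \mu + \sigma\epsilon$. All numbers are illustrative:

```python
import random

random.seed(0)
mu, sigma, S = 1.5, 0.5, 100_000
# d/dmu E[(mu + sigma*eps)^2] = E[2 (mu + sigma*eps)], true value 2*mu = 3
grad_est = sum(2 * (mu + sigma * random.gauss(0.0, 1.0)) for _ in range(S)) / S
```

The same expectation written as sampling $z$ directly gives no path for the gradient; reparameterization moves the randomness into $\epsilon$ so $\mu$ stays inside a differentiable expression.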
Diffusion
Write only the three core lines:
- forward process gradually adds noise
- reverse process learns denoising transitions
- training often predicts noise / score instead of direct data reconstruction
Thin utility strip
Do not make this a standalone module; keep it as a thin strip along the margin.
Must include
Bayes rule
$$ p(z\mid x)=\frac{p(x\mid z)p(z)}{p(x)} $$
Gaussian conditioning
If
$$ \begin{bmatrix} y_1\\ y_2 \end{bmatrix} \sim \mathcal N \left( \begin{bmatrix} \mu_1\\ \mu_2 \end{bmatrix}, \begin{bmatrix} \Sigma_{11} & \Sigma_{12}\\ \Sigma_{21} & \Sigma_{22} \end{bmatrix} \right) $$
then
$$ y_2 \mid (y_1=a) \sim \mathcal N \bigl( \mu_2+\Sigma_{21}\Sigma_{11}^{-1}(a-\mu_1), \Sigma_{22}-\Sigma_{21}\Sigma_{11}^{-1}\Sigma_{12} \bigr) $$
Completing the square
For
$$ -\frac12 w^T A w + b^T w + c $$
we have
$$ \mu=A^{-1}b,\qquad \Sigma=A^{-1} $$Matrix derivatives
$$ \nabla_w (b^T w)=b $$
$$ \nabla_w (w^T A w)=(A+A^T)w $$
If $A$ is symmetric:
$$ \nabla_w (w^T A w)=2Aw $$Jensen
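A finite-difference check of the gradient identities above; the random $A$ and $w$ are illustrative:

```python
import numpy as np

rng = np.random.default_rng(1)
A = rng.normal(size=(3, 3))             # deliberately non-symmetric
w = rng.normal(size=3)

grad = (A + A.T) @ w                    # claimed gradient of w^T A w
eps = 1e-6
# central differences along each coordinate direction
fd = np.array([
    ((w + eps * e) @ A @ (w + eps * e)
     - (w - eps * e) @ A @ (w - eps * e)) / (2 * eps)
    for e in np.eye(3)
])
```

Running this with a non-symmetric $A$ is a quick way to remember that the answer is $(A+A^T)w$, not $2Aw$, in the general case.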
Standard normal CDF notation
$$ \Phi(\cdot) $$What should not be on the aidsheet
Do not include:
- definitions you already know cold
- long natural-language explanations
- pure intuition pictures
- historical background
- prose that cannot directly trigger a solution step
- concrete numeric answers from the practice final
The goal of the aidsheet is not "explaining things clearly".
It is to pull you out of stuck points under exam pressure.
Final working rule
Each module keeps only three layers:
- one-line definition
- one-line core formula
- one-line warning / common failure point
Anything that keeps explaining beyond these three layers should be cut.
Final conclusion
This STA414 aidsheet should not be divided evenly across chapters.
It should be built around four axes:
- Graphical Models
- Bayesian Regression
- Gaussian Processes
- Decision Theory
Everything else becomes a compressed attachment.
That is the layout best suited to a double-sided A4 sheet.