STA414 Final Aidsheet Strategy

This aidsheet is not ordered by chapter.
It is ordered by exam call frequency + compressibility + on-the-spot blanking risk + the question-type weights exposed by the practice final.

There is only one core principle:

The aidsheet is not a summary note; it is an exam-time retrieval device.

What gets baked in is content that is:

  • easily confused
  • easy to get stuck on
  • fatal to the whole question if forgotten
  • yet highly compressible

Final layout

Front side

  1. Graphical Models + Conditional Independence — 22%
  2. Variable Elimination / Message Passing / Belief Propagation / HMM — 16%
  3. Bayesian Regression — 16%
  4. Gaussian Processes + Kernel View — 16%

Back side

  1. Decision Theory — 12%
  2. Variational Inference + EM + Exponential Family — 10%
  3. Sampling: Monte Carlo / Importance Sampling / MCMC — 10%
  4. VAE + Diffusion — 10%
  5. Thin utility strip: Gaussian / matrix tools — 8%

Why this layout

The optimal strategy here is not to split the course content evenly, but to split by question-type clusters.

From the practice final, the heaviest clusters are:

  • Graphical Models
  • Bayesian Linear Regression
  • Gaussian Processes
  • Decision Theory

These are not fringe questions; they are clearly the high-value backbone question types.

So this aidsheet should be organized around these four axes, with everything else attached in compressed form.


Module 1: Graphical Models + Conditional Independence

This is the top-priority module.

Put only the following

DAG factorization

$$ p(x_1,\dots,x_n)=\prod_i p(x_i \mid \mathrm{pa}(x_i)) $$

Plate notation reading rule

d-separation templates

The three path types must be written as ultra-short templates:

  • chain: $A \to B \to C$
  • fork: $A \leftarrow B \to C$
  • collider: $A \to B \leftarrow C$

And state explicitly:

  • conditioning on chain / fork middle node blocks path
  • collider blocks path by default
  • conditioning on collider or its descendant opens path
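The collider rule is the one most often misapplied under pressure. A toy simulation (hypothetical, not for the sheet itself) makes it concrete: A and C are independent bits, B = A XOR C is a collider A → B ← C, and conditioning on B makes A and C perfectly dependent.

```python
import random

# A and C are independent coin flips; B = A XOR C is a collider A -> B <- C.
# Marginally A and C are independent; conditioning on B couples them.
random.seed(0)
samples = [(random.randint(0, 1), random.randint(0, 1)) for _ in range(100_000)]

def cond_probs_given_b(b):
    # Among samples with B = b, compare P(A=1 | C=1) and P(A=1 | C=0).
    sub = [(a, c) for a, c in samples if (a ^ c) == b]
    p_a1_c1 = sum(a for a, c in sub if c == 1) / max(1, sum(1 for _, c in sub if c == 1))
    p_a1_c0 = sum(a for a, c in sub if c == 0) / max(1, sum(1 for _, c in sub if c == 0))
    return p_a1_c1, p_a1_c0

# Given B = 0, A must equal C: P(A=1|C=1) = 1 and P(A=1|C=0) = 0,
# even though A and C are marginally independent.
p1, p0 = cond_probs_given_b(0)
```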

Markov blanket

How to infer graph from a given factorization

Directed graphical model vs MRF

Write a single comparison line; do not expand.

Why this deserves the largest area

Because these questions do not test whether you "know what a graphical model is"; they test whether you can:

  • write the factorization
  • read off conditional independences
  • recover the graph from the joint form
  • judge which paths are blocked / opened

Under exam pressure, a single collider or conditioning slip here loses a whole chain of marks.


Module 2: Variable Elimination / Message Passing / Belief Propagation / HMM

Write this module as a pure template zone, not as long explanations.

Variable elimination skeleton

$$ \text{collect relevant factors} \;\to\; \text{multiply} \;\to\; \text{sum out variable} \;\to\; \text{create new factor} $$

Then add one line:

complexity depends on elimination order

Sum-product message

Pairwise message

$$ m_{i \to j}(x_j) = \sum_{x_i} \phi_i(x_i)\psi_{ij}(x_i,x_j) \prod_{k \in N(i)\setminus j} m_{k \to i}(x_i) $$

Belief

$$ b_i(x_i)\propto \phi_i(x_i)\prod_{k\in N(i)} m_{k\to i}(x_i) $$
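The message and belief formulas above can be sanity-checked on a tiny chain; the potentials below are made-up toy numbers, and the belief at the middle node is compared against a brute-force marginal of the joint.

```python
import numpy as np

# Toy 3-node chain x1 - x2 - x3 with binary states; phi are unary
# potentials, psi pairwise (illustrative numbers, not from the notes).
phi = [np.array([1.0, 2.0]), np.array([1.0, 1.0]), np.array([3.0, 1.0])]
psi12 = np.array([[2.0, 1.0], [1.0, 2.0]])
psi23 = np.array([[1.0, 3.0], [3.0, 1.0]])

# Leaf-to-root messages toward node 2:
# m_{1->2}(x2) = sum_{x1} phi1(x1) psi12(x1, x2)
m12 = psi12.T @ phi[0]
# m_{3->2}(x2) = sum_{x3} phi3(x3) psi23(x2, x3)
m32 = psi23 @ phi[2]

# Belief at node 2: b2(x2) ∝ phi2(x2) * m_{1->2}(x2) * m_{3->2}(x2)
b2 = phi[1] * m12 * m32
b2 /= b2.sum()

# Brute-force check against the full joint.
joint = np.einsum('i,j,k,ij,jk->ijk', phi[0], phi[1], phi[2], psi12, psi23)
marg2 = joint.sum(axis=(0, 2))
marg2 /= marg2.sum()
assert np.allclose(b2, marg2)
```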

HMM factorization

$$ p(z_{1:T},x_{1:T}) = p(z_1)\prod_{t=2}^T p(z_t \mid z_{t-1}) \prod_{t=1}^T p(x_t \mid z_t) $$

Forward-backward

Keep only the forward / backward recursion skeletons; no prose.
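For reference, the forward (alpha) recursion in code form; the 2-state HMM numbers here are illustrative only.

```python
import numpy as np

# alpha recursion: alpha_t(z) = p(x_t | z) * sum_{z'} alpha_{t-1}(z') p(z | z')
pi = np.array([0.6, 0.4])                 # p(z_1)
A = np.array([[0.7, 0.3], [0.2, 0.8]])    # A[i, j] = p(z_t = j | z_{t-1} = i)
B = np.array([[0.9, 0.1], [0.3, 0.7]])    # B[i, k] = p(x_t = k | z_t = i)
obs = [0, 1, 1]

alpha = pi * B[:, obs[0]]
for x in obs[1:]:
    alpha = (alpha @ A) * B[:, x]

evidence = alpha.sum()   # p(x_{1:T})
```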

Why this is separate from Module 1

Module 1 is structure-recognition questions; this module is computation questions.
They are retrieved differently and must not be merged into one big block.


Module 3: Bayesian Regression

This block holds only the endpoint formulas and the completing-the-square template.

Likelihood

$$ p(y \mid X,w,\sigma^2)=\mathcal N(y \mid Xw,\sigma^2 I) $$

Ordinary least squares

$$ \hat w_{LS} = (X^T X)^{-1}X^T y $$

Key point

MLE and OLS coincide whenever the noise covariance only rescales the squared-error objective by a scalar, as with isotropic Gaussian noise.

Memorize the most common exam version directly as:

  • isotropic Gaussian noise
  • Gaussian prior
  • posterior is Gaussian

Posterior structure

Write it in precision form:

$$ \Lambda_{\mathrm{post}} = \Lambda_{\mathrm{prior}} + \Lambda_{\mathrm{like}} $$

$$ \mu_{\mathrm{post}} = \Lambda_{\mathrm{post}}^{-1} \bigl(\text{prior linear term} + \text{likelihood linear term}\bigr) $$

For the most common version (Gaussian prior $w \sim \mathcal N(\mu, I)$), write directly:

$$ \Sigma_{\mathrm{post}} = \left(I + \frac{1}{\sigma^2}X^T X\right)^{-1} $$

$$ \mu_{\mathrm{post}} = \left(I + \frac{1}{\sigma^2}X^T X\right)^{-1} \left(\mu + \frac{1}{\sigma^2}X^T y\right) $$
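These two formulas are easy to check numerically. A minimal sketch, assuming the common case above (prior $w \sim \mathcal N(\mu, I)$, isotropic noise); the data here are synthetic.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(20, 3))
w_true = np.array([1.0, -2.0, 0.5])
sigma2 = 0.25
y = X @ w_true + rng.normal(scale=np.sqrt(sigma2), size=20)
mu = np.zeros(3)   # prior mean

# Posterior covariance and mean in the common-case form.
Sigma_post = np.linalg.inv(np.eye(3) + X.T @ X / sigma2)
mu_post = Sigma_post @ (mu + X.T @ y / sigma2)
```

A useful self-check: the posterior precision times the posterior mean must reproduce the combined linear term.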

Orthogonal features case

$$ X^T X = I $$

Then the posterior mean is a weighted average of the prior mean and the OLS solution.

This line must go on the sheet: it is a natural short-derivation question.

Warning

This block most often dies on:

  • errors in the quadratic expansion
  • a dropped sign on the linear term
  • mean / covariance swapped in the final completing-the-square step

So keep a small formula next to it:

For

$$ -\frac12 w^T A w + b^T w + \text{const} $$

the corresponding Gaussian mean / covariance are

$$ \Sigma = A^{-1}, \qquad \mu = A^{-1}b $$

Module 4: Gaussian Processes + Kernel View

Do not write GP as a long definition.
Write only the block Gaussian conditioning template.

Prior

$$ f \sim GP(m,k) $$

Observation model

$$ y = f + \epsilon, \qquad \epsilon \sim \mathcal N(0,\sigma^2 I) $$

Gram matrix

$$ K_{ij} = k(x_i,x_j) $$

Training marginal

$$ y_N \sim \mathcal N(0, K_N + \sigma^2 I) $$

Joint train-test form

$$ \begin{bmatrix} y_N \\ y_* \end{bmatrix} \sim \mathcal N \left( 0, \begin{bmatrix} K_N+\sigma^2 I & k_* \\ k_*^T & c \end{bmatrix} \right) $$

Predictive mean

$$ \mu_* = k_*^T (K_N+\sigma^2 I)^{-1} y_N $$

Predictive variance

$$ \sigma_*^2 = c - k_*^T (K_N+\sigma^2 I)^{-1} k_* $$
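The two predictive equations translate directly to code. A sketch with an RBF kernel and toy 1-D data (both illustrative choices, not fixed by the notes):

```python
import numpy as np

def k(a, b, ell=1.0):
    # RBF kernel Gram matrix between 1-D input vectors a and b.
    return np.exp(-0.5 * (a[:, None] - b[None, :])**2 / ell**2)

Xtr = np.array([-1.0, 0.0, 1.0])
ytr = np.array([0.2, 0.9, 0.1])
xs = np.array([0.5])      # single test point
sigma2 = 0.1

KN = k(Xtr, Xtr)          # K_N
ks = k(Xtr, xs)           # k_*, shape (N, 1)
c = k(xs, xs)             # prior variance at the test point

A = KN + sigma2 * np.eye(len(Xtr))
mu_star = ks.T @ np.linalg.solve(A, ytr)
var_star = c - ks.T @ np.linalg.solve(A, ks)
```

Note the checkable property: the predictive variance is always strictly below the prior variance `c` once any data are observed.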

Kernel view

One sentence suffices:

GP prediction is Gaussian conditioning on the joint covariance induced by the kernel.

Why this is a main module

These questions are rarely failed through misunderstanding; they are failed by blanking on:

  • whether the training covariance is $K$ or $K+\sigma^2 I$
  • where the minus sign sits in the predictive variance
  • the structure of the joint block Gaussian

That makes it ideal as a formula card.


Module 5: Decision Theory

This time it cannot be relegated to a corner.

Bayes decision rule

$$ \text{choose class 1 if } p(t=1 \mid x) > p(t=0 \mid x) $$

Under equal priors, write directly:

$$ p(x \mid t=1) > p(x \mid t=0) $$

Misclassification rate

$$ P(\mathrm{error}) = P(x \in R_0, t=1) + P(x \in R_1, t=0) $$

Equal variance Gaussian classes

$$ x \mid t=0 \sim \mathcal N(\mu_0,\sigma^2), \qquad x \mid t=1 \sim \mathcal N(\mu_1,\sigma^2) $$

Then the decision boundary is at the midpoint of the two means.

Same mean, unequal variances

If the means are equal but the variances differ, the decision region becomes a centered-interval / outside-interval structure.
Write the threshold on its own line; the full derivation is unnecessary.

Standard normal CDF transform

$$ P(a \le X \le b) $$

becomes

$$ \Phi\left(\frac{b-\mu}{\sigma}\right)-\Phi\left(\frac{a-\mu}{\sigma}\right) $$
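A worked instance of the equal-variance case using this transform; the numbers are illustrative, and `Phi` is a small helper built on the standard library's `math.erf`.

```python
from math import erf, sqrt

def Phi(z):
    # Standard normal CDF via the error function.
    return 0.5 * (1.0 + erf(z / sqrt(2.0)))

# Equal priors, x|t=0 ~ N(0, 1), x|t=1 ~ N(2, 1): boundary at the midpoint 1.
mu0, mu1, sigma = 0.0, 2.0, 1.0
thresh = (mu0 + mu1) / 2

# P(error) = 0.5 * P(x > thresh | t=0) + 0.5 * P(x < thresh | t=1)
p_err = 0.5 * (1 - Phi((thresh - mu0) / sigma)) + 0.5 * Phi((thresh - mu1) / sigma)
# Both terms equal Phi(-1), so p_err = Phi(-1) ≈ 0.1587
```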

Warning

The most common mistakes here:

  • writing the posterior comparison as a prior comparison
  • integrating the two error regions in the wrong directions
  • correct threshold, but the wrong region in the misclassification formula

Module 6: Variational Inference + EM + Exponential Family

Compressing these three together saves the most area.

ELBO

$$ \mathcal L(q) = \mathbb E_{q(z\mid x)} \bigl[\log p(x,z)-\log q(z\mid x)\bigr] $$

Equivalent form:

$$ \mathcal L(q) = \mathbb E_q[\log p(x\mid z)] - KL(q(z\mid x)\,\|\,p(z)) $$

Write it only where it applies.

ELBO identity

$$ \log p(x) = \mathcal L(q) + KL(q(z\mid x)\,\|\,p(z\mid x)) $$

Hence

$$ \mathcal L(q)\le \log p(x) $$

Jensen

One sentence suffices:

for concave $\log$,

$$ \log \mathbb E[X] \ge \mathbb E[\log X] $$

KL direction

One line only:

  • $KL(q\|p)$ often mode-seeking
  • $KL(p\|q)$ often mass-covering

EM

$$ Q(\theta,\theta^{old}) = \mathbb E_{p(z\mid x,\theta^{old})} [\log p(x,z \mid \theta)] $$
  • E-step: compute posterior over latent variables using old parameters
  • M-step: maximize $Q(\theta,\theta^{old})$ over $\theta$

Then add one line:

EM maximizes expected complete-data log-likelihood, not the incomplete-data likelihood directly.
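A toy 1-D mixture of two unit-variance Gaussians (made-up data and initialization) makes the E-step / M-step loop concrete and shows the guaranteed monotone increase of the incomplete-data log-likelihood:

```python
import numpy as np

rng = np.random.default_rng(1)
x = np.concatenate([rng.normal(-2, 1, 200), rng.normal(3, 1, 300)])

mu = np.array([-1.0, 1.0])   # component means (to be learned)
pi = np.array([0.5, 0.5])    # mixing weights (to be learned)

def loglik():
    comp = pi * np.exp(-0.5 * (x[:, None] - mu)**2) / np.sqrt(2 * np.pi)
    return np.log(comp.sum(axis=1)).sum()

lls = [loglik()]
for _ in range(30):
    # E-step: responsibilities r[n, k] = p(z_n = k | x_n, theta_old)
    comp = pi * np.exp(-0.5 * (x[:, None] - mu)**2) / np.sqrt(2 * np.pi)
    r = comp / comp.sum(axis=1, keepdims=True)
    # M-step: maximize the expected complete-data log-likelihood
    Nk = r.sum(axis=0)
    mu = (r * x[:, None]).sum(axis=0) / Nk
    pi = Nk / len(x)
    lls.append(loglik())
```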

Exponential family

$$ p(x \mid \eta) = h(x)\exp\{\eta^T T(x)-A(\eta)\} $$

Keep only:

  • sufficient statistic $T(x)$
  • natural parameter $\eta$
  • log-partition $A(\eta)$

Module 7: Sampling — MC / IS / MCMC

Write only estimator templates here.

Simple Monte Carlo

$$ \hat e = \frac{1}{S}\sum_{i=1}^S f(x^{(i)}), \qquad x^{(i)} \sim p $$

Unbiasedness

$$ \mathbb E[\hat e] = \mathbb E_p[f(x)] $$

Variance

$$ \mathrm{Var}(\hat e) = \frac{\mathrm{Var}_p(f(x))}{S} $$

Importance sampling identity

$$ \mathbb E_p[f(x)] = \mathbb E_q\left[f(x)\frac{p(x)}{q(x)}\right] $$

IS estimator

$$ \hat e_{IS} = \frac{1}{S}\sum_{i=1}^S f(x^{(i)}) \frac{p(x^{(i)})}{q(x^{(i)})}, \qquad x^{(i)}\sim q $$

Normalized weights

$$ \tilde w_i = \frac{w_i}{\sum_j w_j} $$
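Both estimators can be checked on a case with a known answer; the target $p = \mathcal N(0,1)$, test function $f(x)=x^2$ (truth: 1), and proposal $q = \mathcal N(0, 2^2)$ are all illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(0)
S = 200_000
f = lambda x: x**2

# Simple Monte Carlo: sample from p = N(0, 1) directly.
xs = rng.normal(size=S)
mc = f(xs).mean()

# Importance sampling: sample from q = N(0, q_sd^2), reweight by p/q.
q_sd = 2.0
xq = rng.normal(scale=q_sd, size=S)
log_w = (-0.5 * xq**2) - (-0.5 * (xq / q_sd)**2 - np.log(q_sd))
w = np.exp(log_w)
is_est = (f(xq) * w).mean()
```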

MCMC

Keep only the skeleton:

  • proposal
  • acceptance ratio
  • target as stationary distribution

No long explanations.


Module 8: VAE + Diffusion

Write AE / VAE as ready-made answer-sentence templates.

Deterministic autoencoder

  • encoder maps input to a single point
  • reconstruction loss only
  • no latent regularization
  • no guarantee that nearby inputs map to nearby latent codes
  • discontinuities / holes can appear in latent space

VAE encoder

$$ q_\phi(z\mid x)=\mathcal N(\mu_\phi(x), \mathrm{diag}(\sigma_\phi^2(x))) $$

Reparameterization

$$ z = \mu + \sigma \odot \epsilon, \qquad \epsilon \sim \mathcal N(0,I) $$
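A minimal sketch of why this matters: $z$ becomes a deterministic function of $(\mu,\sigma)$ and noise, so pathwise derivatives of an expectation can be estimated by plain sampling (toy numbers below).

```python
import numpy as np

rng = np.random.default_rng(0)
mu, sigma = 1.5, 0.5
eps = rng.normal(size=100_000)
z = mu + sigma * eps       # reparameterized sample of N(mu, sigma^2)

# Pathwise gradient of E[z^2] w.r.t. mu: E[2z * dz/dmu] = E[2z] = 2*mu.
grad_mu = (2 * z).mean()
```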

VAE ELBO

$$ \mathbb E_{q_\phi(z\mid x)}[\log p_\theta(x\mid z)] - KL(q_\phi(z\mid x)\,\|\,p(z)) $$

Interpretation

  • reconstruction term preserves information
  • KL term regularizes latent space
  • encourages overlap and smoothness
  • makes interpolation / generation meaningful

Diffusion

Write only the three core lines:

  • forward process gradually adds noise
  • reverse process learns denoising transitions
  • training often predicts noise / score instead of direct data reconstruction

Thin utility strip

Do not make this a standalone module; put it as a thin strip along the edge.

Must include

Bayes rule

$$ p(z\mid x)=\frac{p(x\mid z)p(z)}{p(x)} $$

Gaussian conditioning

$$ \begin{bmatrix} y_1\\ y_2 \end{bmatrix} \sim \mathcal N \left( \begin{bmatrix} \mu_1\\ \mu_2 \end{bmatrix}, \begin{bmatrix} \Sigma_{11} & \Sigma_{12}\\ \Sigma_{21} & \Sigma_{22} \end{bmatrix} \right) $$

$$ y_2 \mid (y_1=a) \sim \mathcal N \bigl( \mu_2+\Sigma_{21}\Sigma_{11}^{-1}(a-\mu_1), \Sigma_{22}-\Sigma_{21}\Sigma_{11}^{-1}\Sigma_{12} \bigr) $$
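The same conditioning formula as a small helper; the function name and toy numbers are mine, for illustration only.

```python
import numpy as np

def condition(mu, Sigma, idx_obs, a):
    """Mean and covariance of the remaining coordinates given y_obs = a."""
    n = len(mu)
    rest = [i for i in range(n) if i not in idx_obs]
    S11 = Sigma[np.ix_(idx_obs, idx_obs)]
    S21 = Sigma[np.ix_(rest, idx_obs)]
    S22 = Sigma[np.ix_(rest, rest)]
    gain = S21 @ np.linalg.inv(S11)          # Sigma_21 Sigma_11^{-1}
    mu_c = mu[rest] + gain @ (a - mu[idx_obs])
    Sigma_c = S22 - gain @ S21.T             # Sigma_22 - Sigma_21 Sigma_11^{-1} Sigma_12
    return mu_c, Sigma_c

mu = np.array([0.0, 1.0])
Sigma = np.array([[2.0, 0.6], [0.6, 1.0]])
mu_c, Sigma_c = condition(mu, Sigma, [0], np.array([1.0]))
# y_2 | y_1 = 1: mean = 1 + 0.3 * 1 = 1.3, variance = 1 - 0.18 = 0.82
```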

Completing the square

For

$$ -\frac12 w^T A w + b^T w + c $$

$$ \mu=A^{-1}b,\qquad \Sigma=A^{-1} $$

Matrix derivatives

$$ \nabla_w (b^T w)=b $$

$$ \nabla_w (w^T A w)=(A+A^T)w $$

If $A$ is symmetric:

$$ \nabla_w (w^T A w)=2Aw $$
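The gradient identity is easy to verify by central finite differences on a random (non-symmetric) $A$ and $w$; an illustrative check:

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.normal(size=(3, 3))   # deliberately not symmetric
w = rng.normal(size=3)

analytic = (A + A.T) @ w      # grad of w^T A w

# Central difference along each coordinate direction e.
eps = 1e-6
numeric = np.array([
    ((w + eps * e) @ A @ (w + eps * e) - (w - eps * e) @ A @ (w - eps * e)) / (2 * eps)
    for e in np.eye(3)
])
```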

Jensen

Standard normal CDF notation

$$ \Phi(\cdot) $$

What should not be on the aidsheet

Do not include:

  • definitions you already know cold
  • long natural-language explanations
  • pure intuition pictures
  • historical background
  • prose that cannot directly trigger a solution step
  • specific numeric answers from the practice final

The aidsheet's goal is not to "explain things clearly".
It is to pull you out of a stuck point under exam pressure.


Final working rule

Each module keeps only three layers:

  1. one-line definition
  2. one-line core formula
  3. one-line warning / common failure point

Anything still explaining beyond these three layers should be cut.


Final conclusion

This STA414 aidsheet should not be split evenly across chapters.
It should be organized around four axes:

  • Graphical Models
  • Bayesian Regression
  • Gaussian Processes
  • Decision Theory

Everything else becomes compressed attachments.

That is the layout best suited to a double-sided A4 sheet.