Backpropagation
Backpropagation computes gradients of a scalar loss through a layered computation graph by repeated application of the chain rule. In a feedforward neural network, it gives all parameter gradients in time comparable to one forward pass.
1. Setup
For layers $\ell = 1, \dots, L$, define
$$z^{(\ell)} = W^{(\ell)} a^{(\ell-1)} + b^{(\ell)}, \qquad a^{(\ell)} = \phi^{(\ell)}(z^{(\ell)}), \qquad a^{(0)} = x.$$

Let $\ell(y, a^{(L)})$ denote the loss for one observation. To avoid collision with the existing ELBO notation $\mathcal{L}$ in STA414, this note uses lowercase $\ell$ for the supervised loss and $J(\theta)$ for the full objective; context distinguishes the loss $\ell(\cdot,\cdot)$ from the layer index $\ell$.
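The forward pass above can be sketched in a few lines of NumPy. The layer sizes, the tanh activation, and the random seed are illustrative assumptions, not from the notes:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical layer sizes (assumption): 3 -> 4 -> 2.
sizes = [3, 4, 2]
W = [rng.standard_normal((m, n)) * 0.1 for n, m in zip(sizes[:-1], sizes[1:])]
b = [np.zeros(m) for m in sizes[1:]]

def forward(x):
    """Return pre-activations z^(1..L) and activations a^(0..L)."""
    zs, acts = [], [x]                 # a^(0) = x
    a = x
    for Wl, bl in zip(W, b):
        z = Wl @ a + bl                # z^(ell) = W^(ell) a^(ell-1) + b^(ell)
        a = np.tanh(z)                 # a^(ell) = phi(z^(ell)); tanh is an assumption
        zs.append(z)
        acts.append(a)
    return zs, acts

zs, acts = forward(rng.standard_normal(3))
```

Caching every $z^{(\ell)}$ and $a^{(\ell)}$ here is what makes the backward pass in Section 3 a single sweep rather than a recomputation.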
2. Error Signal
Define the backpropagated error at layer $\ell$ by
$$\delta^{(\ell)} := \frac{\partial \ell}{\partial z^{(\ell)}}.$$

At the output layer,

$$\delta^{(L)} = \nabla_{a^{(L)}} \ell \odot \left(\phi^{(L)}\right)'(z^{(L)}),$$

where $\odot$ denotes element-wise multiplication.
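This element-wise formula can be checked numerically against finite differences. The squared-error loss and tanh output used here are illustrative assumptions, not choices made in the notes:

```python
import numpy as np

# Illustrative output layer (assumptions): tanh activation with
# squared-error loss L(y, a) = 0.5 * ||a - y||^2, so grad_a = a - y.
z = np.array([0.2, -1.0, 0.7])              # z^(L)
y = np.array([0.0, 1.0, 0.0])

a = np.tanh(z)
delta = (a - y) * (1.0 - np.tanh(z) ** 2)   # grad_a loss ⊙ phi'(z)

def loss(z):
    return 0.5 * float(np.sum((np.tanh(z) - y) ** 2))

# Central finite differences for d loss / d z, one coordinate at a time.
eps = 1e-6
fd = np.array([(loss(z + eps * e) - loss(z - eps * e)) / (2 * eps)
               for e in np.eye(3)])
```

The analytic `delta` and the finite-difference vector `fd` agree to numerical precision, which is exactly what the formula asserts.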
For softmax output with cross-entropy loss, this simplifies to
$$\delta^{(L)} = \hat{y} - y, \qquad \hat{y} = a^{(L)}.$$

3. Hidden-Layer Recursion
For $\ell = L-1, \dots, 1$,
$$\delta^{(\ell)} = \left(W^{(\ell+1)}\right)^\top \delta^{(\ell+1)} \odot \left(\phi^{(\ell)}\right)'(z^{(\ell)}).$$

This is the key recursive step: gradients are propagated backward from the output layer to earlier layers.
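One step of this recursion is a matrix-transpose product followed by an element-wise multiply. The layer widths, tanh activation, and random values below are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(1)

# Illustrative shapes (assumptions): layer ell has 4 units, layer ell+1 has 2.
W_next = rng.standard_normal((2, 4))     # W^(ell+1)
delta_next = rng.standard_normal(2)      # delta^(ell+1), assumed already computed
z = rng.standard_normal(4)               # z^(ell)

# tanh is an illustrative activation; its derivative is 1 - tanh(z)^2.
phi_prime = 1.0 - np.tanh(z) ** 2
delta = (W_next.T @ delta_next) * phi_prime   # the recursion above
```

Note that $\left(W^{(\ell+1)}\right)^\top$ maps the error signal from the width of layer $\ell+1$ (here 2) back to the width of layer $\ell$ (here 4).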
4. Parameter Gradients
Once $\delta^{(\ell)}$ is known, the gradients with respect to the parameters of layer $\ell$ are
$$\frac{\partial \ell}{\partial W^{(\ell)}} = \delta^{(\ell)} \left(a^{(\ell-1)}\right)^\top, \qquad \frac{\partial \ell}{\partial b^{(\ell)}} = \delta^{(\ell)}.$$

For a dataset objective

$$J(\theta) = \sum_{i=1}^n \ell_i \qquad \text{or} \qquad J(\theta) = \frac{1}{n}\sum_{i=1}^n \ell_i,$$

the total gradient is obtained by summing or averaging the per-observation gradients.
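Putting Sections 2–4 together gives a complete backward pass for one observation, which can be validated against finite differences. The network sizes, tanh hidden layer, softmax output, and cross-entropy loss are all illustrative assumptions (the softmax–cross-entropy pairing also exercises the $\delta^{(L)} = \hat{y} - y$ shortcut):

```python
import numpy as np

rng = np.random.default_rng(0)

# Tiny illustrative network (assumption): 3 -> 4 -> 2, tanh hidden layer,
# softmax output, cross-entropy loss.
W1, b1 = rng.standard_normal((4, 3)) * 0.5, np.zeros(4)
W2, b2 = rng.standard_normal((2, 4)) * 0.5, np.zeros(2)

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def loss_and_grads(x, y):
    # Forward pass, caching z^(ell) and a^(ell).
    z1 = W1 @ x + b1; a1 = np.tanh(z1)
    z2 = W2 @ a1 + b2; a2 = softmax(z2)
    loss = -float(y @ np.log(a2))

    # Backward pass. delta^(L) = yhat - y for softmax + cross-entropy.
    d2 = a2 - y
    gW2, gb2 = np.outer(d2, a1), d2          # delta (a^(ell-1))^T and delta
    d1 = (W2.T @ d2) * (1.0 - a1 ** 2)       # hidden-layer recursion
    gW1, gb1 = np.outer(d1, x), d1
    return loss, (gW1, gb1, gW2, gb2)

x = rng.standard_normal(3)
y = np.array([1.0, 0.0])                     # one-hot target
loss, (gW1, gb1, gW2, gb2) = loss_and_grads(x, y)

# Finite-difference check on a single entry of W1.
eps = 1e-6
W1[0, 0] += eps; lp, _ = loss_and_grads(x, y)
W1[0, 0] -= 2 * eps; lm, _ = loss_and_grads(x, y)
W1[0, 0] += eps
fd = (lp - lm) / (2 * eps)
```

A per-coordinate check like `fd` costs two forward passes per parameter, which is why backpropagation's one-backward-pass cost (Section 6) matters.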
5. Why It Matters
- Backpropagation turns a deep chain rule calculation into a dynamic-programming recursion.
- It is the standard way to compute gradients for gradient descent or stochastic gradient descent in neural networks.
- In modern probabilistic models, it also appears when optimizing differentiable objectives such as the ELBO, especially together with the Reparameterization Trick.
6. Complexity
For a feedforward network, one backward pass has the same order of cost as one forward pass: linear in the number of edges/parameters of the network.