Backpropagation

Backpropagation computes gradients of a scalar loss through a layered computation graph by repeated application of the chain rule. In a feedforward neural network, it gives all parameter gradients in time comparable to one forward pass.

1. Setup

For layers $\ell = 1, \dots, L$, define

$$z^{(\ell)} = W^{(\ell)} a^{(\ell-1)} + b^{(\ell)}, \qquad a^{(\ell)} = \phi^{(\ell)}(z^{(\ell)}), \qquad a^{(0)} = x.$$

Let $\ell(y, a^{(L)})$ denote the loss for one observation. To avoid collision with the existing ELBO notation $\mathcal{L}$ in STA414, this note uses lowercase $\ell$ for the supervised loss and $J(\theta)$ for the full objective. Note that $\ell$ also serves as the layer index above; the intended meaning is always clear from context.
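As a concrete sketch of this setup (the function name `forward` and the cache lists are illustrative, not from any particular library), the forward pass with caching can be written as:

```python
import numpy as np

def forward(x, weights, biases, activations):
    """Forward pass, caching z^(l) and a^(l) for later use in backprop.

    weights[l-1], biases[l-1], activations[l-1] play the roles of
    W^(l), b^(l), phi^(l) for l = 1, ..., L.
    """
    a = x
    zs, As = [], [a]                 # As[0] is a^(0) = x
    for W, b, phi in zip(weights, biases, activations):
        z = W @ a + b                # z^(l) = W^(l) a^(l-1) + b^(l)
        a = phi(z)                   # a^(l) = phi^(l)(z^(l))
        zs.append(z)
        As.append(a)
    return zs, As
```

The cached `zs` and `As` are exactly the quantities the backward pass consumes.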

2. Error Signal

Define the backpropagated error at layer $\ell$ by

$$\delta^{(\ell)} := \frac{\partial \ell}{\partial z^{(\ell)}}.$$

At the output layer,

$$\delta^{(L)} = \nabla_{a^{(L)}} \ell \odot \left(\phi^{(L)}\right)'(z^{(L)}),$$

where $\odot$ denotes element-wise multiplication. This form assumes $\phi^{(L)}$ acts element-wise; for an activation whose Jacobian is not diagonal (such as softmax), replace the element-wise product with the Jacobian-transpose-vector product $\left(\partial a^{(L)} / \partial z^{(L)}\right)^\top \nabla_{a^{(L)}} \ell$.

For a softmax output layer with cross-entropy loss and one-hot targets $y$, this Jacobian-vector product collapses to

$$\delta^{(L)} = \hat{y} - y, \qquad \hat{y} = a^{(L)}.$$
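This simplification is easy to verify numerically. The sketch below (all names are illustrative) compares $\hat{y} - y$ against central finite differences of the softmax cross-entropy loss:

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())          # shift for numerical stability
    return e / e.sum()

def cross_entropy(z, y):
    # one-hot y; loss = -sum_k y_k log softmax(z)_k
    return -float(y @ np.log(softmax(z)))

z = np.array([0.5, -1.0, 2.0])       # logits z^(L)
y = np.array([0.0, 0.0, 1.0])        # one-hot target

analytic = softmax(z) - y            # delta^(L) = y_hat - y

# central finite differences in each coordinate of z
eps = 1e-6
numeric = np.array([
    (cross_entropy(z + eps * e, y) - cross_entropy(z - eps * e, y)) / (2 * eps)
    for e in np.eye(3)
])
assert np.allclose(analytic, numeric, atol=1e-6)
```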

3. Hidden-Layer Recursion

For $\ell = L-1, \dots, 1$,

$$\delta^{(\ell)} = \left[\left(W^{(\ell+1)}\right)^\top \delta^{(\ell+1)}\right] \odot \left(\phi^{(\ell)}\right)'(z^{(\ell)}).$$

This is the key recursive step: gradients are propagated backward from the output layer to earlier layers.
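To see where the recursion comes from, apply the chain rule through $z^{(\ell+1)} = W^{(\ell+1)} \phi^{(\ell)}(z^{(\ell)}) + b^{(\ell+1)}$ componentwise:

$$\delta^{(\ell)}_j = \sum_k \frac{\partial \ell}{\partial z^{(\ell+1)}_k} \frac{\partial z^{(\ell+1)}_k}{\partial z^{(\ell)}_j} = \sum_k \delta^{(\ell+1)}_k \, W^{(\ell+1)}_{kj} \left(\phi^{(\ell)}\right)'(z^{(\ell)}_j),$$

which is exactly the vector identity above, written entrywise.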

4. Parameter Gradients

Once $\delta^{(\ell)}$ is known, the gradients with respect to the parameters of layer $\ell$ are

$$\frac{\partial \ell}{\partial W^{(\ell)}} = \delta^{(\ell)} \left(a^{(\ell-1)}\right)^\top, \qquad \frac{\partial \ell}{\partial b^{(\ell)}} = \delta^{(\ell)}.$$
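Putting the error signal, the recursion, and the parameter gradients together, a minimal backward pass for a small sigmoid MLP with squared-error loss $\ell = \tfrac{1}{2}\|a^{(L)} - y\|^2$ might look as follows (the loss and activation are my choices for illustration; the structure is the same for any element-wise activation and differentiable loss):

```python
import numpy as np

sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))
dsigmoid = lambda z: sigmoid(z) * (1.0 - sigmoid(z))

def forward_backward(x, y, Ws, bs):
    """One forward/backward pass; returns dl/dW^(l) and dl/db^(l) per layer."""
    # forward, caching pre-activations z and activations a
    a, zs, As = x, [], [x]
    for W, b in zip(Ws, bs):
        z = W @ a + b
        a = sigmoid(z)
        zs.append(z)
        As.append(a)
    # output-layer error: grad_a l = a^(L) - y, times phi'(z^(L)) elementwise
    delta = (a - y) * dsigmoid(zs[-1])
    gWs, gbs = [None] * len(Ws), [None] * len(Ws)
    for l in reversed(range(len(Ws))):
        gWs[l] = np.outer(delta, As[l])   # dl/dW^(l) = delta (a^(l-1))^T
        gbs[l] = delta                    # dl/db^(l) = delta
        if l > 0:                         # hidden-layer recursion
            delta = (Ws[l].T @ delta) * dsigmoid(zs[l - 1])
    return gWs, gbs
```

A quick finite-difference check on any single weight entry confirms the returned gradients.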

For a dataset objective

$$J(\theta) = \sum_{i=1}^n \ell_i \qquad \text{or} \qquad J(\theta) = \frac{1}{n}\sum_{i=1}^n \ell_i,$$

the total gradient is obtained by summing or averaging the per-observation gradients.
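In practice the per-observation outer products are not formed one at a time. Stacking the $n$ samples as columns of matrices $A$ (activations) and $\Delta$ (errors), the averaged layer gradient is the single matrix product $\tfrac{1}{n}\Delta A^\top$. A sketch with illustrative names:

```python
import numpy as np

rng = np.random.default_rng(0)
n, d_in, d_out = 5, 3, 2
A = rng.normal(size=(d_in, n))       # a^(l-1) for n samples, one per column
Delta = rng.normal(size=(d_out, n))  # delta^(l) for the same samples

# averaged batch gradient in one matrix product
gW_batch = Delta @ A.T / n

# identical to the mean of the per-sample outer products
gW_loop = np.mean([np.outer(Delta[:, i], A[:, i]) for i in range(n)], axis=0)
assert np.allclose(gW_batch, gW_loop)
```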

5. Why It Matters

  • Backpropagation turns a deep chain rule calculation into a dynamic-programming recursion.
  • It is the standard way to compute gradients for gradient descent or stochastic gradient descent in neural networks.
  • In modern probabilistic models, it also appears when optimizing differentiable objectives such as the ELBO, especially together with the Reparameterization Trick.

6. Complexity

For a feedforward network, one backward pass has the same order of cost as one forward pass: linear in the number of edges/parameters of the network.