Backpropagation

Backpropagation computes gradients of a scalar loss through a layered computation graph by repeated application of the chain rule. In a feedforward neural network, it gives all parameter gradients in time comparable to one forward pass.

1. Setup

For layers $\ell = 1, \dots, L$, define

$$z^{(\ell)} = W^{(\ell)} a^{(\ell-1)} + b^{(\ell)}, \qquad a^{(\ell)} = \phi^{(\ell)}(z^{(\ell)}), \qquad a^{(0)} = x.$$

Let $\ell(y, a^{(L)})$ denote the loss for one observation. To avoid collision with the existing ELBO notation $\mathcal{L}$ in STA414, this note uses lowercase $\ell$ for the supervised loss and $J(\theta)$ for the full objective. Note that $\ell$ also serves as the layer index above; the intended meaning is always clear from context.
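As a concrete sketch of this setup (the function name `forward` and the cache lists are illustrative, not from any particular library), the forward pass with caching can be written as:

```python
import numpy as np

def forward(x, weights, biases, activations):
    """Forward pass, caching z^(l) and a^(l) for later use in backprop.

    weights[l-1], biases[l-1], activations[l-1] play the roles of
    W^(l), b^(l), phi^(l) for l = 1, ..., L.
    """
    a = x
    zs, As = [], [a]                 # As[0] is a^(0) = x
    for W, b, phi in zip(weights, biases, activations):
        z = W @ a + b                # z^(l) = W^(l) a^(l-1) + b^(l)
        a = phi(z)                   # a^(l) = phi^(l)(z^(l))
        zs.append(z)
        As.append(a)
    return zs, As
```

The cached `zs` and `As` are exactly the quantities the backward pass consumes.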

2. Error Signal

Define the backpropagated error at layer $\ell$ by

$$\delta^{(\ell)} := \frac{\partial \ell}{\partial z^{(\ell)}}.$$

At the output layer,

$$\delta^{(L)} = \nabla_{a^{(L)}} \ell \odot \left(\phi^{(L)}\right)'(z^{(L)}),$$

where $\odot$ denotes element-wise multiplication. This form assumes $\phi^{(L)}$ acts element-wise; for an activation whose Jacobian is not diagonal (such as softmax), replace the element-wise product with the Jacobian-transpose-vector product $\left(\partial a^{(L)} / \partial z^{(L)}\right)^\top \nabla_{a^{(L)}} \ell$.

For a softmax output layer with cross-entropy loss and one-hot targets $y$, this Jacobian-vector product collapses to

$$\delta^{(L)} = \hat{y} - y, \qquad \hat{y} = a^{(L)}.$$
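This simplification is easy to verify numerically. The sketch below (all names are illustrative) compares $\hat{y} - y$ against central finite differences of the softmax cross-entropy loss:

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())          # shift for numerical stability
    return e / e.sum()

def cross_entropy(z, y):
    # one-hot y; loss = -sum_k y_k log softmax(z)_k
    return -float(y @ np.log(softmax(z)))

z = np.array([0.5, -1.0, 2.0])       # logits z^(L)
y = np.array([0.0, 0.0, 1.0])        # one-hot target

analytic = softmax(z) - y            # delta^(L) = y_hat - y

# central finite differences in each coordinate of z
eps = 1e-6
numeric = np.array([
    (cross_entropy(z + eps * e, y) - cross_entropy(z - eps * e, y)) / (2 * eps)
    for e in np.eye(3)
])
assert np.allclose(analytic, numeric, atol=1e-6)
```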

3. Hidden-Layer Recursion

For $\ell = L-1, \dots, 1$,

$$\delta^{(\ell)} = \left[\left(W^{(\ell+1)}\right)^\top \delta^{(\ell+1)}\right] \odot \left(\phi^{(\ell)}\right)'(z^{(\ell)}).$$

This is the key recursive step: gradients are propagated backward from the output layer to earlier layers.
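To see where the recursion comes from, apply the chain rule through $z^{(\ell+1)} = W^{(\ell+1)} \phi^{(\ell)}(z^{(\ell)}) + b^{(\ell+1)}$ componentwise:

$$\delta^{(\ell)}_j = \sum_k \frac{\partial \ell}{\partial z^{(\ell+1)}_k} \frac{\partial z^{(\ell+1)}_k}{\partial z^{(\ell)}_j} = \sum_k \delta^{(\ell+1)}_k \, W^{(\ell+1)}_{kj} \left(\phi^{(\ell)}\right)'(z^{(\ell)}_j),$$

which is exactly the vector identity above, written entrywise.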

4. Parameter Gradients

Once $\delta^{(\ell)}$ is known, the gradients with respect to the parameters of layer $\ell$ are

$$\frac{\partial \ell}{\partial W^{(\ell)}} = \delta^{(\ell)} \left(a^{(\ell-1)}\right)^\top, \qquad \frac{\partial \ell}{\partial b^{(\ell)}} = \delta^{(\ell)}.$$
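Putting the error signal, the recursion, and the parameter gradients together, a minimal backward pass for a small sigmoid MLP with squared-error loss $\ell = \tfrac{1}{2}\|a^{(L)} - y\|^2$ might look as follows (the loss and activation are my choices for illustration; the structure is the same for any element-wise activation and differentiable loss):

```python
import numpy as np

sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))
dsigmoid = lambda z: sigmoid(z) * (1.0 - sigmoid(z))

def forward_backward(x, y, Ws, bs):
    """One forward/backward pass; returns dl/dW^(l) and dl/db^(l) per layer."""
    # forward, caching pre-activations z and activations a
    a, zs, As = x, [], [x]
    for W, b in zip(Ws, bs):
        z = W @ a + b
        a = sigmoid(z)
        zs.append(z)
        As.append(a)
    # output-layer error: grad_a l = a^(L) - y, times phi'(z^(L)) elementwise
    delta = (a - y) * dsigmoid(zs[-1])
    gWs, gbs = [None] * len(Ws), [None] * len(Ws)
    for l in reversed(range(len(Ws))):
        gWs[l] = np.outer(delta, As[l])   # dl/dW^(l) = delta (a^(l-1))^T
        gbs[l] = delta                    # dl/db^(l) = delta
        if l > 0:                         # hidden-layer recursion
            delta = (Ws[l].T @ delta) * dsigmoid(zs[l - 1])
    return gWs, gbs
```

A quick finite-difference check on any single weight entry confirms the returned gradients.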

For a dataset objective

$$J(\theta) = \sum_{i=1}^n \ell_i \qquad \text{or} \qquad J(\theta) = \frac{1}{n}\sum_{i=1}^n \ell_i,$$

the total gradient is obtained by summing or averaging the per-observation gradients.
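In practice the per-observation outer products are not formed one at a time. Stacking the $n$ samples as columns of matrices $A$ (activations) and $\Delta$ (errors), the averaged layer gradient is the single matrix product $\tfrac{1}{n}\Delta A^\top$. A sketch with illustrative names:

```python
import numpy as np

rng = np.random.default_rng(0)
n, d_in, d_out = 5, 3, 2
A = rng.normal(size=(d_in, n))       # a^(l-1) for n samples, one per column
Delta = rng.normal(size=(d_out, n))  # delta^(l) for the same samples

# averaged batch gradient in one matrix product
gW_batch = Delta @ A.T / n

# identical to the mean of the per-sample outer products
gW_loop = np.mean([np.outer(Delta[:, i], A[:, i]) for i in range(n)], axis=0)
assert np.allclose(gW_batch, gW_loop)
```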

5. Why It Matters

  • Backpropagation turns a deep chain rule calculation into a dynamic-programming recursion.
  • It is the standard way to compute gradients for gradient descent or stochastic gradient descent in neural networks.
  • In modern probabilistic models, it also appears when optimizing differentiable objectives such as the ELBO, especially together with the Reparameterization Trick.

6. Complexity

For a feedforward network, one backward pass has the same order of cost as one forward pass: linear in the number of edges/parameters of the network.