Question 1

What if there are more than two layers?

Accepted Answer

Apply the chain rule repeatedly, from the outside in.  For h(g(f(x))) you get h'(g(f(x))) · g'(f(x)) · f'(x).  Each layer contributes its derivative evaluated at whatever was inside it.
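As a quick sanity check, here is a minimal sketch that compares the three-factor product against a finite-difference estimate.  The layers f, g, h below (square, sine, exponential) and the test point are assumptions picked only for illustration; they do not come from the question.

```python
import math

# Hypothetical layers, chosen only for illustration.
def f(x): return x * x          # innermost layer
def g(u): return math.sin(u)    # middle layer
def h(v): return math.exp(v)    # outermost layer

# Their derivatives, written out by hand.
def df(x): return 2 * x
def dg(u): return math.cos(u)
def dh(v): return math.exp(v)

def chain_derivative(x):
    """h'(g(f(x))) * g'(f(x)) * f'(x), working from the outside in."""
    u = f(x)          # what sits inside g
    v = g(u)          # what sits inside h
    return dh(v) * dg(u) * df(x)

def numeric_derivative(x, eps=1e-6):
    """Central-difference estimate of d/dx h(g(f(x)))."""
    composite = lambda t: h(g(f(t)))
    return (composite(x + eps) - composite(x - eps)) / (2 * eps)

x0 = 0.7
analytic = chain_derivative(x0)
numeric = numeric_derivative(x0)
print(analytic, numeric)  # the two values should agree to several digits
```

Adding a fourth layer would just mean one more factor, evaluated at the output of the three layers inside it.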

Question 2

Why does it work?

Accepted Answer

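From the definition of the derivative.  Write y = f(u) and u = g(x).  For a small change Δx, Δy/Δx = (Δy/Δu)·(Δu/Δx) whenever Δu ≠ 0.  Because u is differentiable in x, it is also continuous, so Δu → 0 as Δx → 0; the two ratios then tend to dy/du and du/dx, and the product becomes the chain rule dy/dx = (dy/du)·(du/dx).  A rigorous proof handles the case Δu = 0 separately to avoid dividing by zero.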
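The limiting argument can be watched numerically.  The sketch below is an illustration rather than a proof: it assumes the hypothetical pair u = x³ and y = sin(u), shrinks Δx, and prints the two difference quotients next to their product.

```python
import math

# Hypothetical inner and outer functions, chosen only for illustration.
def g(x): return x ** 3          # inner: u = g(x)
def f(u): return math.sin(u)     # outer: y = f(u)

x0 = 1.2
# Chain rule value: f'(g(x0)) * g'(x0)
exact = math.cos(g(x0)) * 3 * x0 ** 2

rows = []
for dx in (1e-1, 1e-2, 1e-3, 1e-4):
    du = g(x0 + dx) - g(x0)              # change in u caused by dx (nonzero here)
    dy = f(g(x0 + dx)) - f(g(x0))        # change in y caused by dx
    rows.append((dx, dy / du, du / dx, dy / dx))
    print(f"dx={dx:g}  dy/du={dy / du:.6f}  du/dx={du / dx:.6f}  dy/dx={dy / dx:.6f}")

print("chain rule value:", exact)
```

In this example Δu is never zero because x³ is strictly increasing near x0, which is exactly the situation where the simple argument goes through without the extra case.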

Question 3

How is it related to back-propagation in neural networks?

Accepted Answer

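A neural network is a deeply nested composition of layers, each parameterised by weights.  Back-propagation computes the gradient of the loss with respect to each weight by repeated application of the chain rule, working backwards from the loss layer to the input layer.  Because the loss is a scalar, starting the multiplication at the loss end keeps every intermediate result a vector, and the gradient passed back through each layer is computed once and reused by all earlier layers.  That is why the full gradient costs roughly a constant multiple of one forward pass, O(n) in the number of weights n, rather than the O(n²) of differentiating with respect to each weight separately.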
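To make the backward sweep concrete, here is a minimal sketch on a toy network.  Everything in it is an assumption made for illustration: each layer is a single scalar weight followed by tanh, the loss is half the squared error, and the names forward, backward and numeric_grad are hypothetical helpers, not part of any library.

```python
import math

def forward(weights, x):
    """Return all activations: a[0] = x, a[i+1] = tanh(w[i] * a[i])."""
    activations = [x]
    for w in weights:
        activations.append(math.tanh(w * activations[-1]))
    return activations

def loss_fn(output, target):
    return 0.5 * (output - target) ** 2

def backward(weights, activations, target):
    """One backward sweep; returns dL/dw for every weight."""
    grads = [0.0] * len(weights)
    upstream = activations[-1] - target            # dL/d(output)
    for i in reversed(range(len(weights))):
        a_in, a_out = activations[i], activations[i + 1]
        local = 1.0 - a_out ** 2                   # tanh'(z) = 1 - tanh(z)^2
        grads[i] = upstream * local * a_in         # chain rule: dL/dw[i]
        upstream = upstream * local * weights[i]   # dL/d(a_in), reused by earlier layers
    return grads

weights = [0.5, -1.3, 0.8]
x, target = 0.9, 0.2
grads = backward(weights, forward(weights, x), target)

def numeric_grad(i, eps=1e-6):
    """Central difference in weight i alone; needs two forward passes per weight."""
    up, down = list(weights), list(weights)
    up[i] += eps
    down[i] -= eps
    return (loss_fn(forward(up, x)[-1], target)
            - loss_fn(forward(down, x)[-1], target)) / (2 * eps)

numeric = [numeric_grad(i) for i in range(len(weights))]
print(grads)
print(numeric)
```

The finite-difference check reruns the whole network for every weight, while the backward sweep visits each layer once and carries `upstream` along, which is the reuse described above.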

Question 4

What's the multivariate version?

Accepted Answer

If z = f(x, y) where x = g(t) and y = h(t), then dz/dt = (∂f/∂x)·(dx/dt) + (∂f/∂y)·(dy/dt).  Each path from t to z contributes a product of partial derivatives along the path; sum over all paths.
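A small numerical check of the two-path formula, as a sketch under assumed functions: f(x, y) = x²y + sin(y), x = cos(t) and y = t² are hypothetical choices made only so the partial derivatives are easy to write by hand.

```python
import math

# Hypothetical functions, chosen only for illustration.
def g(t): return math.cos(t)                      # x = g(t)
def h(t): return t ** 2                           # y = h(t)
def f(x, y): return x * x * y + math.sin(y)       # z = f(x, y)

def dz_dt(t):
    """Sum over both paths t -> x -> z and t -> y -> z."""
    x, y = g(t), h(t)
    df_dx = 2 * x * y               # partial of f with respect to x
    df_dy = x * x + math.cos(y)     # partial of f with respect to y
    dx_dt = -math.sin(t)
    dy_dt = 2 * t
    return df_dx * dx_dt + df_dy * dy_dt

def numeric_dz_dt(t, eps=1e-6):
    """Central-difference estimate of dz/dt."""
    z = lambda s: f(g(s), h(s))
    return (z(t + eps) - z(t - eps)) / (2 * eps)

t0 = 0.8
analytic = dz_dt(t0)
numeric = numeric_dz_dt(t0)
print(analytic, numeric)  # the two values should agree to several digits
```

Dropping either term from the sum makes the two numbers disagree, which is a quick way to see that both paths really do contribute.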

The chain rule: differentiating composite functions
