The primary goal of backpropagation is to calculate the partial derivatives of the cost function $C$ with respect to every weight and bias in the network.
To facilitate a clean derivation using matrix calculus, we adopt the following conventions:
- $a^l$: Activation vector of the $l$-th layer ($n_l \times 1$).
- $z^l$: Weighted input vector of the $l$-th layer ($n_l \times 1$), where $z^l = W^l a^{l-1} + b^l$.
- $W^l$: Weight matrix of the $l$-th layer ($n_l \times n_{l-1}$).
- $b^l$: Bias vector of the $l$-th layer ($n_l \times 1$).
- $\delta^l$: Error vector of the $l$-th layer, defined as $\delta^l \equiv \frac{\partial C}{\partial z^l}$.
- $\odot$: Hadamard product (element-wise multiplication).
- $\sigma'(z^l)$: The derivative of the activation function, applied element-wise to $z^l$.
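To make the notation concrete, here is a minimal NumPy sketch of the forward pass that produces the $z^l$ and $a^l$ vectors used below; the layer sizes, the sigmoid activation, and the random initialization are illustrative assumptions, not part of the derivation.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Assumed example network: n_0 = 3 inputs, n_1 = 4 hidden units, n_2 = 2 outputs.
rng = np.random.default_rng(0)
sizes = [3, 4, 2]
W = [rng.standard_normal((n, m)) for m, n in zip(sizes[:-1], sizes[1:])]  # W^l is (n_l x n_{l-1})
b = [rng.standard_normal((n, 1)) for n in sizes[1:]]                      # b^l is (n_l x 1)

def forward(x):
    """Return the weighted inputs z^l and activations a^l for the input column vector x (= a^0)."""
    a = x
    zs, activations = [], [a]
    for Wl, bl in zip(W, b):
        z = Wl @ a + bl   # z^l = W^l a^{l-1} + b^l
        a = sigmoid(z)    # a^l = sigma(z^l)
        zs.append(z)
        activations.append(a)
    return zs, activations
```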
2. Derivation of the Equations
BP1: Error at the Output Layer
Goal: $\delta^L = \nabla_{a^L} C \odot \sigma'(z^L)$
Derivation Steps:
- Apply the Chain Rule: The error $\delta^L$ represents how the cost $C$ changes with respect to the weighted input $z^L$. Since $C$ depends on $z^L$ through the activations $a^L$:
$$\delta^L = \frac{\partial C}{\partial z^L} = \left(\frac{\partial a^L}{\partial z^L}\right)^{T} \frac{\partial C}{\partial a^L}$$
- Compute the Jacobian: Because $a_j^L = \sigma(z_j^L)$ (each output depends only on its own input), the Jacobian matrix $\frac{\partial a^L}{\partial z^L}$ is diagonal:
$$\frac{\partial a^L}{\partial z^L} = \mathrm{diag}\bigl(\sigma'(z_1^L), \ldots, \sigma'(z_{n_L}^L)\bigr)$$
- Result: A diagonal matrix is symmetric, so its transpose drops out, and multiplying a diagonal matrix by a vector is equivalent to the Hadamard product:
$$\delta^L = \nabla_{a^L} C \odot \sigma'(z^L)$$
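As a quick sanity check on BP1, the sketch below (continuing the example network above) computes $\delta^L$ both as $\nabla_{a^L} C \odot \sigma'(z^L)$ and via the explicit diagonal Jacobian; the quadratic cost $C = \tfrac{1}{2}\lVert a^L - y \rVert^2$, for which $\nabla_{a^L} C = a^L - y$, is an assumed example choice.

```python
def sigmoid_prime(z):
    s = sigmoid(z)
    return s * (1.0 - s)

x = rng.standard_normal((3, 1))   # example input a^0
y = rng.standard_normal((2, 1))   # example target
zs, activations = forward(x)
aL, zL = activations[-1], zs[-1]

grad_aL = aL - y                                              # nabla_{a^L} C for the quadratic cost
delta_L = grad_aL * sigmoid_prime(zL)                         # BP1, Hadamard form
delta_L_diag = np.diag(sigmoid_prime(zL).ravel()) @ grad_aL   # diagonal-Jacobian form
assert np.allclose(delta_L, delta_L_diag)
```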
BP2: Propagating Error to Hidden Layers
Goal: $\delta^l = \left((W^{l+1})^{T} \delta^{l+1}\right) \odot \sigma'(z^l)$
Derivation Steps:
- Link Consecutive Layers: We express the error at layer $l$ in terms of the error at layer $l+1$ using the multivariate chain rule:
$$\delta^l = \frac{\partial C}{\partial z^l} = \left(\frac{\partial z^{l+1}}{\partial z^l}\right)^{T} \frac{\partial C}{\partial z^{l+1}} = \left(\frac{\partial z^{l+1}}{\partial z^l}\right)^{T} \delta^{l+1}$$
- Differentiate the Linear Transformation: Recall $z^{l+1} = W^{l+1} a^l + b^{l+1}$ and $a^l = \sigma(z^l)$. Applying the chain rule to find $\frac{\partial z^{l+1}}{\partial z^l}$:
$$\frac{\partial z^{l+1}}{\partial z^l} = \frac{\partial z^{l+1}}{\partial a^l} \cdot \frac{\partial a^l}{\partial z^l} = W^{l+1} \cdot \mathrm{diag}\bigl(\sigma'(z^l)\bigr)$$
- Transpose and Simplify: Substitute back and use the property $(AB)^{T} = B^{T} A^{T}$:
$$\delta^l = \left(W^{l+1} \cdot \mathrm{diag}\bigl(\sigma'(z^l)\bigr)\right)^{T} \delta^{l+1} = \mathrm{diag}\bigl(\sigma'(z^l)\bigr)\,(W^{l+1})^{T} \delta^{l+1}$$
- Final Form: Convert the diagonal matrix multiplication to a Hadamard product:
$$\delta^l = \left((W^{l+1})^{T} \delta^{l+1}\right) \odot \sigma'(z^l)$$
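Continuing the same example, the snippet below applies BP2 to propagate $\delta^{L}$ from the output layer back to the hidden layer ($l = 1$, $l+1 = 2$ in the assumed two-layer network) and checks that the Hadamard form matches the diagonal-matrix form from the derivation.

```python
W2 = W[-1]    # W^{l+1}
z1 = zs[0]    # z^l of the hidden layer

delta_1 = (W2.T @ delta_L) * sigmoid_prime(z1)                       # BP2, Hadamard form
delta_1_diag = np.diag(sigmoid_prime(z1).ravel()) @ W2.T @ delta_L   # diagonal-matrix form
assert np.allclose(delta_1, delta_1_diag)
```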
BP3: Gradient for Biases
Goal: $\frac{\partial C}{\partial b^l} = \delta^l$
Derivation Steps:
- Chain Rule via the Intermediate $z^l$:
$$\frac{\partial C}{\partial b^l} = \left(\frac{\partial z^l}{\partial b^l}\right)^{T} \frac{\partial C}{\partial z^l}$$
- Compute the Local Gradient: From $z^l = W^l a^{l-1} + b^l$, we see that $b^l$ is added directly to the weighted sum. Thus, $\frac{\partial z^l}{\partial b^l}$ is the identity matrix $I$.
- Conclusion:
$$\frac{\partial C}{\partial b^l} = I^{T} \delta^l = \delta^l$$
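BP3 says the bias gradient is the error vector itself. The snippet below, still under the assumed quadratic cost and example network, confirms this for one output-layer bias with a forward finite difference.

```python
def cost(x, y):
    """Quadratic cost C = 0.5 * ||a^L - y||^2 for the current parameters."""
    _, acts = forward(x)
    return 0.5 * float(np.sum((acts[-1] - y) ** 2))

eps, j = 1e-6, 0
base = cost(x, y)
b[-1][j, 0] += eps                    # perturb one output-layer bias
numeric = (cost(x, y) - base) / eps   # finite-difference estimate of dC/db^L_j
b[-1][j, 0] -= eps                    # restore the parameter
assert np.isclose(numeric, delta_L[j, 0], atol=1e-4)   # BP3: dC/db^L = delta^L
```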
BP4: Gradient for Weights
Goal: $\frac{\partial C}{\partial W^l} = \delta^l (a^{l-1})^{T}$
Derivation Steps:
- Element-wise Approach: Consider a single weight $w_{jk}^l$ connecting neuron $k$ in layer $l-1$ to neuron $j$ in layer $l$:
$$\frac{\partial C}{\partial w_{jk}^l} = \frac{\partial C}{\partial z_j^l} \frac{\partial z_j^l}{\partial w_{jk}^l} = \delta_j^l \frac{\partial z_j^l}{\partial w_{jk}^l}$$
- Solve for the Partial Derivative: Since $z_j^l = \sum_m w_{jm}^l a_m^{l-1} + b_j^l$, the derivative with respect to $w_{jk}^l$ is simply $a_k^{l-1}$. Therefore, $\frac{\partial C}{\partial w_{jk}^l} = \delta_j^l a_k^{l-1}$.
- Vectorize as an Outer Product: The collection of all such derivatives $\delta_j^l a_k^{l-1}$ for all $j, k$ forms the outer product of the error vector $\delta^l$ and the input activation vector $a^{l-1}$:
$$\frac{\partial C}{\partial W^l} = \delta^l (a^{l-1})^{T}$$
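Under the same assumptions, the final sketch forms the weight gradients of both layers as outer products per BP4 and spot-checks one entry against a finite difference.

```python
grad_W2 = delta_L @ activations[1].T   # dC/dW^2 = delta^2 (a^1)^T, shape (2, 4)
grad_W1 = delta_1 @ activations[0].T   # dC/dW^1 = delta^1 (a^0)^T, shape (4, 3)

# Spot-check one entry of dC/dW^2 numerically.
eps, j, k = 1e-6, 1, 2
base = cost(x, y)
W[-1][j, k] += eps
numeric = (cost(x, y) - base) / eps
W[-1][j, k] -= eps
assert np.isclose(numeric, grad_W2[j, k], atol=1e-4)
```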