
Backpropagation: A Vector Calculus Perspective

The primary goal of backpropagation is to calculate the partial derivatives of the cost function $C$ with respect to every weight $\mathbf{W}$ and bias $\mathbf{b}$ in the network.

1. Notation (Matrix Form)

To facilitate a clean derivation using matrix calculus, we adopt the following conventions:

  • $\mathbf{a}^l$: Activation vector of the $l$-th layer ($n_l \times 1$).
  • $\mathbf{z}^l$: Weighted input vector of the $l$-th layer ($n_l \times 1$), where $\mathbf{z}^l = \mathbf{W}^l \mathbf{a}^{l-1} + \mathbf{b}^l$.
  • $\mathbf{W}^l$: Weight matrix of the $l$-th layer ($n_l \times n_{l-1}$).
  • $\mathbf{b}^l$: Bias vector of the $l$-th layer ($n_l \times 1$).
  • $\boldsymbol{\delta}^l$: Error vector of the $l$-th layer, defined as $\boldsymbol{\delta}^l \equiv \frac{\partial C}{\partial \mathbf{z}^l}$.
  • $\odot$: Hadamard product (element-wise multiplication).
  • $\sigma'(\mathbf{z}^l)$: The derivative of the activation function applied element-wise to $\mathbf{z}^l$.
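
To ground this notation, here is a minimal NumPy sketch of the forward pass for a single layer. The layer sizes, the sigmoid activation, and the variable names (`W`, `b`, `a_prev`) are illustrative assumptions made for this example, not part of the derivation.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Illustrative sizes: n_{l-1} = 3 inputs, n_l = 2 outputs (assumed for the example).
rng = np.random.default_rng(0)
W = rng.standard_normal((2, 3))       # W^l: (n_l x n_{l-1})
b = rng.standard_normal((2, 1))       # b^l: (n_l x 1)
a_prev = rng.standard_normal((3, 1))  # a^{l-1}: (n_{l-1} x 1)

z = W @ a_prev + b   # z^l = W^l a^{l-1} + b^l
a = sigmoid(z)       # a^l = sigma(z^l)
```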

2. Derivation of the Equations

BP1: Error at the Output Layer

Goal: $\boldsymbol{\delta}^L = \nabla_{\mathbf{a}^L} C \odot \sigma'(\mathbf{z}^L)$

Derivation Steps:

  1. Apply the Chain Rule: The error $\boldsymbol{\delta}^L$ represents how the cost $C$ changes with respect to the weighted input $\mathbf{z}^L$. Since $C$ depends on $\mathbf{z}^L$ only through the activations $\mathbf{a}^L$:

$$\boldsymbol{\delta}^L = \frac{\partial C}{\partial \mathbf{z}^L} = \left( \frac{\partial \mathbf{a}^L}{\partial \mathbf{z}^L} \right)^T \frac{\partial C}{\partial \mathbf{a}^L}$$

  2. Compute the Jacobian: Because $a^L_j = \sigma(z^L_j)$ (each output depends only on its own input), the Jacobian matrix $\frac{\partial \mathbf{a}^L}{\partial \mathbf{z}^L}$ is a diagonal matrix:

$$\frac{\partial \mathbf{a}^L}{\partial \mathbf{z}^L} = \text{diag}(\sigma'(z^L_1), \dots, \sigma'(z^L_n))$$

  3. Result: Since the Jacobian is diagonal (and hence symmetric), the transpose has no effect, and multiplying a vector by a diagonal matrix is equivalent to taking the Hadamard product with its diagonal:

$$\boldsymbol{\delta}^L = \nabla_{\mathbf{a}^L} C \odot \sigma'(\mathbf{z}^L)$$
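
As a concrete illustration of BP1, the sketch below computes $\boldsymbol{\delta}^L$ for an assumed quadratic cost $C = \frac{1}{2}\lVert \mathbf{a}^L - \mathbf{y} \rVert^2$ with a sigmoid output layer, for which $\nabla_{\mathbf{a}^L} C = \mathbf{a}^L - \mathbf{y}$. The cost choice and variable names are assumptions made for illustration.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sigmoid_prime(z):
    s = sigmoid(z)
    return s * (1.0 - s)

rng = np.random.default_rng(1)
z_L = rng.standard_normal((4, 1))  # z^L: weighted input of the output layer
a_L = sigmoid(z_L)                 # a^L = sigma(z^L)
y = rng.standard_normal((4, 1))    # target (assumed quadratic cost)

grad_a = a_L - y                        # nabla_{a^L} C for the quadratic cost
delta_L = grad_a * sigmoid_prime(z_L)   # BP1: element-wise (Hadamard) product
```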

BP2: Propagating Error to Hidden Layers

Goal: $\boldsymbol{\delta}^l = ((\mathbf{W}^{l+1})^T \boldsymbol{\delta}^{l+1}) \odot \sigma'(\mathbf{z}^l)$

Derivation Steps:

  1. Link Consecutive Layers: We express the error at layer $l$ in terms of the error at layer $l+1$ using the multivariate chain rule:

$$\boldsymbol{\delta}^l = \frac{\partial C}{\partial \mathbf{z}^l} = \left( \frac{\partial \mathbf{z}^{l+1}}{\partial \mathbf{z}^l} \right)^T \frac{\partial C}{\partial \mathbf{z}^{l+1}} = \left( \frac{\partial \mathbf{z}^{l+1}}{\partial \mathbf{z}^l} \right)^T \boldsymbol{\delta}^{l+1}$$

  2. Differentiate the Linear Transformation: Recall $\mathbf{z}^{l+1} = \mathbf{W}^{l+1} \mathbf{a}^l + \mathbf{b}^{l+1}$ and $\mathbf{a}^l = \sigma(\mathbf{z}^l)$. Applying the chain rule to find $\frac{\partial \mathbf{z}^{l+1}}{\partial \mathbf{z}^l}$:

$$\frac{\partial \mathbf{z}^{l+1}}{\partial \mathbf{z}^l} = \frac{\partial \mathbf{z}^{l+1}}{\partial \mathbf{a}^l} \cdot \frac{\partial \mathbf{a}^l}{\partial \mathbf{z}^l} = \mathbf{W}^{l+1} \cdot \text{diag}(\sigma'(\mathbf{z}^l))$$

  3. Transpose and Simplify: Substitute back and use the property $(AB)^T = B^T A^T$, noting that the diagonal matrix is symmetric, so its transpose is itself:

$$\boldsymbol{\delta}^l = \left( \mathbf{W}^{l+1} \cdot \text{diag}(\sigma'(\mathbf{z}^l)) \right)^T \boldsymbol{\delta}^{l+1} = \text{diag}(\sigma'(\mathbf{z}^l)) \, (\mathbf{W}^{l+1})^T \boldsymbol{\delta}^{l+1}$$

  4. Final Form: Convert the diagonal matrix multiplication to a Hadamard product:

$$\boldsymbol{\delta}^l = ((\mathbf{W}^{l+1})^T \boldsymbol{\delta}^{l+1}) \odot \sigma'(\mathbf{z}^l)$$
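
In code, BP2 is one matrix-vector product followed by a Hadamard product. The sketch below continues the assumed NumPy conventions; the shapes and names (`W_next`, `delta_next`, `z_l`) are illustrative.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sigmoid_prime(z):
    s = sigmoid(z)
    return s * (1.0 - s)

rng = np.random.default_rng(2)
W_next = rng.standard_normal((4, 3))      # W^{l+1}: (n_{l+1} x n_l)
delta_next = rng.standard_normal((4, 1))  # delta^{l+1}: (n_{l+1} x 1)
z_l = rng.standard_normal((3, 1))         # z^l: (n_l x 1)

# BP2: pull the error back through W^{l+1}, then gate it by sigma'(z^l).
delta_l = (W_next.T @ delta_next) * sigmoid_prime(z_l)
```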

BP3: Gradient for Biases

Goal: $\frac{\partial C}{\partial \mathbf{b}^l} = \boldsymbol{\delta}^l$

Derivation Steps:

  1. Chain Rule via the Intermediate $\mathbf{z}^l$:

$$\frac{\partial C}{\partial \mathbf{b}^l} = \left( \frac{\partial \mathbf{z}^l}{\partial \mathbf{b}^l} \right)^T \frac{\partial C}{\partial \mathbf{z}^l}$$

  2. Compute the Local Gradient: From $\mathbf{z}^l = \mathbf{W}^l \mathbf{a}^{l-1} + \mathbf{b}^l$, we see that $\mathbf{b}^l$ is added directly to the weighted sum. Thus, $\frac{\partial \mathbf{z}^l}{\partial \mathbf{b}^l}$ is the identity matrix $\mathbf{I}$.

  3. Conclusion:

$$\frac{\partial C}{\partial \mathbf{b}^l} = \mathbf{I}^T \boldsymbol{\delta}^l = \boldsymbol{\delta}^l$$
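
In code, the bias gradient requires no extra computation; it is the error vector itself. The sketch below pairs BP3 with a finite-difference sanity check under the same assumed single-layer, quadratic-cost setup used in the earlier sketches.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sigmoid_prime(z):
    s = sigmoid(z)
    return s * (1.0 - s)

rng = np.random.default_rng(3)
W = rng.standard_normal((2, 3))       # W^l
b = rng.standard_normal((2, 1))       # b^l
a_prev = rng.standard_normal((3, 1))  # a^{l-1}
y = rng.standard_normal((2, 1))       # target (assumed)

def cost(b_vec):
    # Assumed quadratic cost for a single sigmoid layer.
    a = sigmoid(W @ a_prev + b_vec)
    return 0.5 * np.sum((a - y) ** 2)

z = W @ a_prev + b
delta = (sigmoid(z) - y) * sigmoid_prime(z)  # BP1 for this single layer

grad_b = delta  # BP3: the bias gradient is exactly the error vector

# Finite-difference check of the first component.
eps = 1e-6
e0 = np.zeros_like(b)
e0[0, 0] = 1.0
numeric = (cost(b + eps * e0) - cost(b - eps * e0)) / (2 * eps)
print(np.isclose(numeric, grad_b[0, 0]))  # expected: True
```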

BP4: Gradient for Weights

Goal: $\frac{\partial C}{\partial \mathbf{W}^l} = \boldsymbol{\delta}^l (\mathbf{a}^{l-1})^T$

Derivation Steps:

  1. Element-wise Approach: Consider a single weight $w^l_{jk}$ connecting neuron $k$ in layer $l-1$ to neuron $j$ in layer $l$:

$$\frac{\partial C}{\partial w^l_{jk}} = \frac{\partial C}{\partial z^l_j} \frac{\partial z^l_j}{\partial w^l_{jk}} = \delta^l_j \frac{\partial z^l_j}{\partial w^l_{jk}}$$

  2. Solve for the Partial Derivative: Since $z^l_j = \sum_m w^l_{jm} a^{l-1}_m + b^l_j$, the derivative with respect to $w^l_{jk}$ is simply $a^{l-1}_k$. Therefore, $\frac{\partial C}{\partial w^l_{jk}} = \delta^l_j a^{l-1}_k$.

  3. Vectorize as an Outer Product: The collection of all such derivatives $\delta^l_j a^{l-1}_k$ for all $j, k$ forms the outer product of the error vector $\boldsymbol{\delta}^l$ and the input activation vector $\mathbf{a}^{l-1}$:

$$\frac{\partial C}{\partial \mathbf{W}^l} = \boldsymbol{\delta}^l (\mathbf{a}^{l-1})^T$$
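
In code, BP4 is a single outer product. The short sketch below reuses the assumed NumPy conventions, with `delta_l` and `a_prev` standing in for $\boldsymbol{\delta}^l$ and $\mathbf{a}^{l-1}$.

```python
import numpy as np

rng = np.random.default_rng(4)
delta_l = rng.standard_normal((2, 1))  # delta^l: (n_l x 1)
a_prev = rng.standard_normal((3, 1))   # a^{l-1}: (n_{l-1} x 1)

# BP4: the outer product gives the (n_l x n_{l-1}) weight gradient.
grad_W = delta_l @ a_prev.T
print(grad_W.shape)  # (2, 3)
```

The resulting gradient has the same shape as $\mathbf{W}^l$ ($n_l \times n_{l-1}$), which is a convenient dimension check.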

About this Post

This post is written by Louis C Deng, licensed under CC BY-NC 4.0.

#CS231n #Math #Vector Calculus