The primary goal of backpropagation is to calculate the partial derivatives of the cost function $C$ with respect to every weight and bias in the network.
To facilitate a clean derivation using matrix calculus, we adopt the following conventions:
- $a^l$: Activation vector of the $l$-th layer ($n_l \times 1$).
- $z^l$: Weighted input vector of the $l$-th layer ($n_l \times 1$), where $z^l = W^l a^{l-1} + b^l$.
- $W^l$: Weight matrix of the $l$-th layer ($n_l \times n_{l-1}$).
- $b^l$: Bias vector of the $l$-th layer ($n_l \times 1$).
- $\delta^l$: Error vector of the $l$-th layer, defined as $\delta^l \equiv \frac{\partial C}{\partial z^l}$.
- $\odot$: Hadamard product (element-wise multiplication).
- $\sigma'(z^l)$: The derivative of the activation function, applied element-wise to $z^l$.
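To make the notation concrete, here is a minimal NumPy sketch of the forward pass that produces the $z^l$ and $a^l$ vectors used below; the layer sizes, the sigmoid activation, and the random initialization are illustrative assumptions, not part of the derivation.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Assumed example network: n_0 = 3 inputs, n_1 = 4 hidden units, n_2 = 2 outputs.
rng = np.random.default_rng(0)
sizes = [3, 4, 2]
W = [rng.standard_normal((n, m)) for m, n in zip(sizes[:-1], sizes[1:])]  # W^l is (n_l x n_{l-1})
b = [rng.standard_normal((n, 1)) for n in sizes[1:]]                      # b^l is (n_l x 1)

def forward(x):
    """Return the weighted inputs z^l and activations a^l for the input column vector x (= a^0)."""
    a = x
    zs, activations = [], [a]
    for Wl, bl in zip(W, b):
        z = Wl @ a + bl   # z^l = W^l a^{l-1} + b^l
        a = sigmoid(z)    # a^l = sigma(z^l)
        zs.append(z)
        activations.append(a)
    return zs, activations
```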
2. Derivation of the Equations
BP1: Error at the Output Layer
Goal: $\delta^L = \nabla_{a^L} C \odot \sigma'(z^L)$
Derivation Steps:
- Apply the Chain Rule: The error $\delta^L$ represents how the cost $C$ changes with respect to the weighted input $z^L$. Since $C$ depends on $z^L$ through the activations $a^L$:
$$\delta^L = \frac{\partial C}{\partial z^L} = \left(\frac{\partial a^L}{\partial z^L}\right)^{T} \frac{\partial C}{\partial a^L}$$
- Compute the Jacobian: Because $a_j^L = \sigma(z_j^L)$ (each output depends only on its own input), the Jacobian matrix $\frac{\partial a^L}{\partial z^L}$ is diagonal:
$$\frac{\partial a^L}{\partial z^L} = \mathrm{diag}\bigl(\sigma'(z_1^L), \ldots, \sigma'(z_{n_L}^L)\bigr)$$
- Result: A diagonal matrix is symmetric, so its transpose drops out, and multiplying a diagonal matrix by a vector is equivalent to the Hadamard product:
$$\delta^L = \nabla_{a^L} C \odot \sigma'(z^L)$$
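As a quick sanity check on BP1, the sketch below (continuing the example network above) computes $\delta^L$ both as $\nabla_{a^L} C \odot \sigma'(z^L)$ and via the explicit diagonal Jacobian; the quadratic cost $C = \tfrac{1}{2}\lVert a^L - y \rVert^2$, for which $\nabla_{a^L} C = a^L - y$, is an assumed example choice.

```python
def sigmoid_prime(z):
    s = sigmoid(z)
    return s * (1.0 - s)

x = rng.standard_normal((3, 1))   # example input a^0
y = rng.standard_normal((2, 1))   # example target
zs, activations = forward(x)
aL, zL = activations[-1], zs[-1]

grad_aL = aL - y                                              # nabla_{a^L} C for the quadratic cost
delta_L = grad_aL * sigmoid_prime(zL)                         # BP1, Hadamard form
delta_L_diag = np.diag(sigmoid_prime(zL).ravel()) @ grad_aL   # diagonal-Jacobian form
assert np.allclose(delta_L, delta_L_diag)
```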
BP2: Propagating Error to Hidden Layers
Goal: $\delta^l = \left((W^{l+1})^{T} \delta^{l+1}\right) \odot \sigma'(z^l)$
Derivation Steps:
- Link Consecutive Layers: We express the error at layer $l$ in terms of the error at layer $l+1$ using the multivariate chain rule:
$$\delta^l = \frac{\partial C}{\partial z^l} = \left(\frac{\partial z^{l+1}}{\partial z^l}\right)^{T} \frac{\partial C}{\partial z^{l+1}} = \left(\frac{\partial z^{l+1}}{\partial z^l}\right)^{T} \delta^{l+1}$$
- Differentiate the Linear Transformation: Recall $z^{l+1} = W^{l+1} a^l + b^{l+1}$ and $a^l = \sigma(z^l)$. Applying the chain rule to find $\frac{\partial z^{l+1}}{\partial z^l}$:
$$\frac{\partial z^{l+1}}{\partial z^l} = \frac{\partial z^{l+1}}{\partial a^l} \cdot \frac{\partial a^l}{\partial z^l} = W^{l+1} \cdot \mathrm{diag}\bigl(\sigma'(z^l)\bigr)$$
- Transpose and Simplify: Substitute back and use the property $(AB)^{T} = B^{T} A^{T}$:
$$\delta^l = \left(W^{l+1} \cdot \mathrm{diag}\bigl(\sigma'(z^l)\bigr)\right)^{T} \delta^{l+1} = \mathrm{diag}\bigl(\sigma'(z^l)\bigr)\,(W^{l+1})^{T} \delta^{l+1}$$
- Final Form: Convert the diagonal matrix multiplication to a Hadamard product:
$$\delta^l = \left((W^{l+1})^{T} \delta^{l+1}\right) \odot \sigma'(z^l)$$
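Continuing the same example, the snippet below applies BP2 to propagate $\delta^{L}$ from the output layer back to the hidden layer ($l = 1$, $l+1 = 2$ in the assumed two-layer network) and checks that the Hadamard form matches the diagonal-matrix form from the derivation.

```python
W2 = W[-1]    # W^{l+1}
z1 = zs[0]    # z^l of the hidden layer

delta_1 = (W2.T @ delta_L) * sigmoid_prime(z1)                       # BP2, Hadamard form
delta_1_diag = np.diag(sigmoid_prime(z1).ravel()) @ W2.T @ delta_L   # diagonal-matrix form
assert np.allclose(delta_1, delta_1_diag)
```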
BP3: Gradient for Biases
Goal: $\frac{\partial C}{\partial b^l} = \delta^l$
Derivation Steps:
- Chain Rule via the Intermediate $z^l$:
$$\frac{\partial C}{\partial b^l} = \left(\frac{\partial z^l}{\partial b^l}\right)^{T} \frac{\partial C}{\partial z^l}$$
- Compute the Local Gradient: From $z^l = W^l a^{l-1} + b^l$, we see that $b^l$ is added directly to the weighted sum. Thus, $\frac{\partial z^l}{\partial b^l}$ is the identity matrix $I$.
- Conclusion:
$$\frac{\partial C}{\partial b^l} = I^{T} \delta^l = \delta^l$$
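BP3 says the bias gradient is the error vector itself. The snippet below, still under the assumed quadratic cost and example network, confirms this for one output-layer bias with a forward finite difference.

```python
def cost(x, y):
    """Quadratic cost C = 0.5 * ||a^L - y||^2 for the current parameters."""
    _, acts = forward(x)
    return 0.5 * float(np.sum((acts[-1] - y) ** 2))

eps, j = 1e-6, 0
base = cost(x, y)
b[-1][j, 0] += eps                    # perturb one output-layer bias
numeric = (cost(x, y) - base) / eps   # finite-difference estimate of dC/db^L_j
b[-1][j, 0] -= eps                    # restore the parameter
assert np.isclose(numeric, delta_L[j, 0], atol=1e-4)   # BP3: dC/db^L = delta^L
```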
BP4: Gradient for Weights
Goal: $\frac{\partial C}{\partial W^l} = \delta^l (a^{l-1})^{T}$
Derivation Steps:
- Element-wise Approach: Consider a single weight $w_{jk}^l$ connecting neuron $k$ in layer $l-1$ to neuron $j$ in layer $l$:
$$\frac{\partial C}{\partial w_{jk}^l} = \frac{\partial C}{\partial z_j^l} \frac{\partial z_j^l}{\partial w_{jk}^l} = \delta_j^l \frac{\partial z_j^l}{\partial w_{jk}^l}$$
- Solve for the Partial Derivative: Since $z_j^l = \sum_m w_{jm}^l a_m^{l-1} + b_j^l$, the derivative with respect to $w_{jk}^l$ is simply $a_k^{l-1}$. Therefore, $\frac{\partial C}{\partial w_{jk}^l} = \delta_j^l a_k^{l-1}$.
- Vectorize as an Outer Product: The collection of all such derivatives $\delta_j^l a_k^{l-1}$ for all $j, k$ forms the outer product of the error vector $\delta^l$ and the input activation vector $a^{l-1}$:
$$\frac{\partial C}{\partial W^l} = \delta^l (a^{l-1})^{T}$$
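Under the same assumptions, the final sketch forms the weight gradients of both layers as outer products per BP4 and spot-checks one entry against a finite difference.

```python
grad_W2 = delta_L @ activations[1].T   # dC/dW^2 = delta^2 (a^1)^T, shape (2, 4)
grad_W1 = delta_1 @ activations[0].T   # dC/dW^1 = delta^1 (a^0)^T, shape (4, 3)

# Spot-check one entry of dC/dW^2 numerically.
eps, j, k = 1e-6, 1, 2
base = cost(x, y)
W[-1][j, k] += eps
numeric = (cost(x, y) - base) / eps
W[-1][j, k] -= eps
assert np.isclose(numeric, grad_W2[j, k], atol=1e-4)
```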