Here we introduce neural networks and backpropagation.
Neural Networks
We will start with a 2-layer example:
$$f = W_2 \max(0, W_1 x)$$
$$x \in \mathbb{R}^D, \quad W_1 \in \mathbb{R}^{H \times D}, \quad W_2 \in \mathbb{R}^{C \times H}$$
The $\max$ here creates some non-linearity between $W_1$ and $W_2$. It is called the activation function.
In practice, we will usually add a learnable bias at each layer as well.
Here's a 3-layer example:
$$f = W_3 \max(0, W_2 \max(0, W_1 x))$$
$$x \in \mathbb{R}^D, \quad W_1 \in \mathbb{R}^{H_1 \times D}, \quad W_2 \in \mathbb{R}^{H_2 \times H_1}, \quad W_3 \in \mathbb{R}^{C \times H_2}$$
The networks we are talking about here are more accurately called fully-connected networks, or sometimes multi-layer perceptrons (MLPs).
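To make the shapes concrete, here is a minimal numpy sketch of the 2-layer forward pass above (the sizes D, H, C are arbitrary values chosen for illustration, not anything from the original notes):

```python
import numpy as np

# Minimal sketch of f = W2 max(0, W1 x); D, H, C are arbitrary sizes
D, H, C = 8, 16, 4
x = np.random.randn(D)
W1 = np.random.randn(H, D)
W2 = np.random.randn(C, H)

h = np.maximum(0, W1 @ x)   # hidden layer with ReLU activation, shape (H,)
f = W2 @ h                  # class scores, shape (C,)
print(f.shape)              # (4,)
```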
Activation Functions
The most important role of an activation function is to introduce non-linearity.
There are multiple activation functions to choose from.
The function $\max(0, x)$ is called ReLU. It is a good default choice for most problems.
$$\text{ReLU}(x) = \max(0, x)$$
We also have:
Leaky ReLU: $\max(0.1x, x)$ — Allows a small gradient for negative inputs. Avoids dead neurons.
Sigmoid: $\sigma(x) = \frac{1}{1+e^{-x}}$ — S-shaped curve squashing output between 0 and 1.
Tanh: $\tanh(x)$ — S-shaped curve squashing output between -1 and 1.
ELU: Exponential Linear Unit.
GELU & SiLU: Newer functions often used in modern architectures like Transformers.
Choosing an activation function is largely an empirical decision.
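For reference, here is a small numpy sketch of the first few activation functions listed above (the function names are my own, not a library API):

```python
import numpy as np

def relu(x):
    return np.maximum(0, x)

def leaky_relu(x, alpha=0.1):
    return np.maximum(alpha * x, x)

def sigmoid(x):
    return 1 / (1 + np.exp(-x))

x = np.linspace(-3, 3, 7)
print(relu(x))        # zero for negative inputs
print(leaky_relu(x))  # small negative slope instead of zero
print(sigmoid(x))     # squashed into (0, 1)
print(np.tanh(x))     # squashed into (-1, 1)
```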
Architecture
In a Neural Network, we have an input layer , an output layer , and several hidden layers .
Sample Code
```python
import numpy as np
from numpy.random import randn

# Random data and weights for a 2-layer network with a sigmoid hidden layer
N, D_in, H, D_out = 64, 1000, 100, 10
x, y = randn(N, D_in), randn(N, D_out)
w1, w2 = randn(D_in, H), randn(H, D_out)

for t in range(2000):
    # Forward pass: sigmoid hidden layer, linear output, L2 loss
    h = 1 / (1 + np.exp(-x.dot(w1)))
    y_pred = h.dot(w2)
    loss = np.square(y_pred - y).sum()
    print(t, loss)

    # Backward pass: hand-derived gradients of the loss w.r.t. w1 and w2
    grad_y_pred = 2.0 * (y_pred - y)
    grad_w2 = h.T.dot(grad_y_pred)
    grad_h = grad_y_pred.dot(w2.T)
    grad_w1 = x.T.dot(grad_h * h * (1 - h))

    # Gradient descent step
    w1 -= 1e-4 * grad_w1
    w2 -= 1e-4 * grad_w2
```
The number of neurons in the hidden layers is related to the network's capacity: more neurons, more capacity.
A regularizer keeps the network simple and prevents overfitting. Do not use the size of the neural network as a regularizer; use stronger regularization instead.
Derivatives
$$f(x, y) = x + y \to \frac{\partial f}{\partial x} = 1, \quad \frac{\partial f}{\partial y} = 1$$
$$f(x, y) = \max(x, y) \to \frac{\partial f}{\partial x} = \mathbb{1}(x \geq y), \quad \frac{\partial f}{\partial y} = \mathbb{1}(y \geq x)$$
$$f(x, y) = xy \to \frac{\partial f}{\partial x} = y, \quad \frac{\partial f}{\partial y} = x$$
The derivative with respect to each variable tells you the sensitivity of the whole expression to that variable's value.
add gate: gradient distributor
mul gate: “swap multiplier”
copy gate: gradient adder
max gate: gradient router
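These gate patterns can be checked with a tiny PyTorch sketch (my own example; the values are arbitrary):

```python
import torch

x = torch.tensor(3.0, requires_grad=True)
y = torch.tensor(-4.0, requires_grad=True)

# add gate: distributes the upstream gradient to both inputs
(x + y).backward()
print(x.grad, y.grad)        # tensor(1.) tensor(1.)

x.grad, y.grad = None, None
# mul gate: "swap multiplier" -- each input receives the other's value
(x * y).backward()
print(x.grad, y.grad)        # tensor(-4.) tensor(3.)

x.grad, y.grad = None, None
# max gate: routes the gradient to the larger input only
torch.max(x, y).backward()
print(x.grad, y.grad)        # tensor(1.) tensor(0.)
```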
Backpropagation
The problem now is to calculate the gradients in order to learn the parameters.
We can't simply derive the gradients on paper; that quickly becomes tedious and is not feasible for complex models.
Here's a better idea: computational graphs + backpropagation.
Backpropagation is essentially the repeated application of the chain rule.
$$\text{Downstream} = \text{Upstream} \times \text{Local}$$
$$\frac{\partial L}{\partial x} = \frac{\partial L}{\partial z} \cdot \frac{\partial z}{\partial x}$$
```
================================================================================
COMPUTATIONAL GRAPH NODE: BACKPROPAGATION
================================================================================
FORWARD PASS
Computes output z from inputs x, y

[ Input x ] ----\
                 \
                  +--> [ Node f ] -----> [ Output z ]
                 /
[ Input y ] ----/

================================================================================
BACKWARD PASS
Chain Rule: Downstream = Upstream * Local

Upstream gradient (received):  dL/dz

Downstream gradient to x:  dL/dx = (dL/dz) * (dz/dx)
Downstream gradient to y:  dL/dy = (dL/dz) * (dz/dy)

where dz/dx and dz/dy are the local gradients computed inside Node f.
```
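As a worked scalar example of this rule (my own numbers, not from the original notes), take $f = (x + y)\,z$ with $x = -2$, $y = 5$, $z = -4$:

```python
# Forward pass
x, y, z = -2.0, 5.0, -4.0
q = x + y          # q = 3
f = q * z          # f = -12

# Backward pass: Downstream = Upstream * Local
df_dq = z          # mul gate: local gradient of q is z   -> -4
df_dz = q          # mul gate: local gradient of z is q   ->  3
df_dx = df_dq * 1  # add gate: distributes df/dq to x     -> -4
df_dy = df_dq * 1  # add gate: distributes df/dq to y     -> -4

print(df_dx, df_dy, df_dz)   # -4.0 -4.0 3.0
```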
Modularize: Forward/Backward API
Gate / Node / Function object: Actual PyTorch code
```python
import torch

class Multiply(torch.autograd.Function):
    @staticmethod
    def forward(ctx, x, y):
        # Save the inputs; they are needed for the local gradients
        ctx.save_for_backward(x, y)
        z = x * y
        return z

    @staticmethod
    def backward(ctx, grad_z):
        # Downstream = Upstream * Local: d(xy)/dx = y, d(xy)/dy = x
        x, y = ctx.saved_tensors
        grad_x = y * grad_z
        grad_y = x * grad_z
        return grad_x, grad_y
```
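A quick usage sketch for the Multiply Function above (Multiply.apply is the standard way to invoke a custom autograd Function; the tensor values here are arbitrary):

```python
import torch

x = torch.tensor([2.0, 3.0], requires_grad=True)
y = torch.tensor([4.0, 5.0], requires_grad=True)

z = Multiply.apply(x, y)   # runs Multiply.forward
z.sum().backward()         # runs Multiply.backward with grad_z = [1, 1]

print(x.grad)              # tensor([4., 5.])  = y
print(y.grad)              # tensor([2., 3.])  = x
```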
Aside: Vector Derivatives
For vector derivatives:
Vector to Scalar
$$x \in \mathbb{R}^N, \quad y \in \mathbb{R}$$
The derivative is a gradient.
$$\frac{\partial y}{\partial x} \in \mathbb{R}^N, \quad \left(\frac{\partial y}{\partial x}\right)_n = \frac{\partial y}{\partial x_n}$$
Vector to Vector
$$x \in \mathbb{R}^N, \quad y \in \mathbb{R}^M$$
The derivative is a Jacobian matrix.
$$\frac{\partial y}{\partial x} \in \mathbb{R}^{N \times M}, \quad \left(\frac{\partial y}{\partial x}\right)_{n,m} = \frac{\partial y_m}{\partial x_n}$$
Aside: Jacobian Matrix
A Jacobian matrix is the derivative of a function that maps a vector to a vector.
Definition
Let $\mathbf{f} : \mathbb{R}^n \to \mathbb{R}^m$ be a function such that each of its first-order partial derivatives exists on $\mathbb{R}^n$. This function takes a point $\mathbf{x} = (x_1, \ldots, x_n) \in \mathbb{R}^n$ as input and produces the vector $\mathbf{f}(\mathbf{x}) = (f_1(\mathbf{x}), \ldots, f_m(\mathbf{x})) \in \mathbb{R}^m$ as output.
Then the Jacobian matrix of $\mathbf{f}$, denoted $\mathbf{J}_{\mathbf{f}}$, is the $m \times n$ matrix whose $(i, j)$ entry is $\frac{\partial f_i}{\partial x_j}$; explicitly:
$$\mathbf{J}_{\mathbf{f}} = \begin{bmatrix} \frac{\partial \mathbf{f}}{\partial x_1} & \cdots & \frac{\partial \mathbf{f}}{\partial x_n} \end{bmatrix} = \begin{bmatrix} \nabla^T f_1 \\ \vdots \\ \nabla^T f_m \end{bmatrix} = \begin{bmatrix} \frac{\partial f_1}{\partial x_1} & \cdots & \frac{\partial f_1}{\partial x_n} \\ \vdots & \ddots & \vdots \\ \frac{\partial f_m}{\partial x_1} & \cdots & \frac{\partial f_m}{\partial x_n} \end{bmatrix}$$
where $\nabla^T f_i$ is the transpose (row vector) of the gradient of the $i$-th component.
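For instance (a standard textbook example, not from the original notes), for $\mathbf{f}(x, y) = (x^2 y,\ 5x + \sin y)$:

$$\mathbf{J}_{\mathbf{f}} = \begin{bmatrix} \frac{\partial f_1}{\partial x} & \frac{\partial f_1}{\partial y} \\ \frac{\partial f_2}{\partial x} & \frac{\partial f_2}{\partial y} \end{bmatrix} = \begin{bmatrix} 2xy & x^2 \\ 5 & \cos y \end{bmatrix}$$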
Backpropagation with Vectors
```
================================================================================
COMPUTATIONAL GRAPH NODE: BACKPROP WITH VECTORS
================================================================================
FORWARD PASS
Computes output vector z from input vectors x, y

[ Vector x ]  shape Dx ----\
                            \
                             +--> [ Node f ] -----> [ Vector z ]  shape Dz
                            /
[ Vector y ]  shape Dy ----/

================================================================================
BACKWARD PASS
Matrix-vector multiply: Downstream = Local * Upstream

Upstream gradient (received):  dL/dz, shape Dz

Downstream gradient to x:  dL/dx = (dz/dx) * (dL/dz),  Jacobian dz/dx is [Dx x Dz]
Downstream gradient to y:  dL/dy = (dz/dy) * (dL/dz),  Jacobian dz/dy is [Dy x Dz]

The result dL/dx has shape Dx and dL/dy has shape Dy.
```
Jacobians can be sparse. For element-wise operations, the off-diagonal elements are always zero; the Jacobian becomes a diagonal matrix.
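A small numpy sketch (my own illustration) of why this matters: for an element-wise ReLU, the explicit diagonal Jacobian and a simple element-wise mask give the same downstream gradient, so the full Jacobian never needs to be materialized in practice.

```python
import numpy as np

x = np.array([1.0, -2.0, 3.0])          # input to an element-wise ReLU
upstream = np.array([0.1, 0.2, 0.3])    # dL/dz

# Explicit Jacobian: diagonal matrix with 1 where x > 0, 0 elsewhere
J = np.diag((x > 0).astype(float))
downstream_dense = J @ upstream         # dL/dx via a full matrix multiply

# Implicit form: just mask the upstream gradient element-wise
downstream_masked = upstream * (x > 0)

assert np.allclose(downstream_dense, downstream_masked)
print(downstream_masked)                # [0.1 0.  0.3]
```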
Four Fundamental Equations of Backpropagation
These allow us to propagate the error ($\delta$) from the output backward through the network. They all follow from simple applications of the chain rule.
Notation:
$\odot$: Hadamard (element-wise) product
$\sigma'$: Derivative of the activation function
$\delta^l$: Error term for layer $l$
(BP1) Error at the Output Layer
$$\delta^L = \nabla_a C \odot \sigma'(z^L)$$
Interpretation: How much the output layer “missed” the target, scaled by the activation’s sensitivity.
(BP2) Propagating Error to Hidden Layers
$$\delta^l = ((w^{l+1})^T \delta^{l+1}) \odot \sigma'(z^l)$$
Interpretation: The error at layer $l$ is the weighted sum of errors from the next layer ($l+1$), pulled backward through the transpose of the weights.
(BP3) Gradient for Biases
$$\frac{\partial C}{\partial b^l_j} = \delta^l_j$$
Interpretation: The gradient for a bias is exactly equal to the error at that neuron.
(BP4) Gradient for Weights
$$\frac{\partial C}{\partial w^l_{jk}} = a^{l-1}_k \delta^l_j$$
Interpretation: The gradient for a weight is the product of the input activation ($a^{l-1}$) and the output error ($\delta^l$).
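Putting the four equations together, here is a minimal numpy sketch (my own illustration, with assumed layer sizes, a sigmoid activation, and a squared-error cost) of one full forward and backward pass:

```python
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

def sigmoid_prime(z):
    s = sigmoid(z)
    return s * (1 - s)

rng = np.random.default_rng(0)
sizes = [4, 5, 3]                                        # layer widths (assumed)
Ws = [rng.standard_normal((m, n)) for n, m in zip(sizes[:-1], sizes[1:])]
bs = [rng.standard_normal((m, 1)) for m in sizes[1:]]

x = rng.standard_normal((sizes[0], 1))
y = rng.standard_normal((sizes[-1], 1))                  # target

# Forward pass: store pre-activations z^l and activations a^l
a, zs, activations = x, [], [x]
for W, b in zip(Ws, bs):
    z = W @ a + b
    a = sigmoid(z)
    zs.append(z)
    activations.append(a)

# BP1: error at the output layer (for C = 0.5 * ||a - y||^2, grad_a C = a - y)
delta = (activations[-1] - y) * sigmoid_prime(zs[-1])

grads_W = [None] * len(Ws)
grads_b = [None] * len(bs)
grads_b[-1] = delta                                      # BP3
grads_W[-1] = delta @ activations[-2].T                  # BP4

# BP2: propagate the error backward through the hidden layers
for l in range(2, len(sizes)):
    delta = (Ws[-l + 1].T @ delta) * sigmoid_prime(zs[-l])
    grads_b[-l] = delta                                  # BP3
    grads_W[-l] = delta @ activations[-l - 1].T          # BP4

print([g.shape for g in grads_W])                        # [(5, 4), (3, 5)]
```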