
CS231n Lecture Note IV: Neural Networks and Backpropagation

This note introduces neural networks and backpropagation.

Neural Networks

We will start with a 2-layer example:

f = W_2 \max(0, W_1 x)

x \in \mathbb{R}^D, W_1 \in \mathbb{R}^{H \times D}, W_2 \in \mathbb{R}^{C \times H}

The \max here introduces non-linearity between W_1 and W_2. It is called the activation function.

In practice, we will usually add a learnable bias at each layer as well.

Here is an example of a 3-layer network:

f = W_3 \max(0, W_2 \max(0, W_1 x))

x \in \mathbb{R}^D, W_1 \in \mathbb{R}^{H_1 \times D}, W_2 \in \mathbb{R}^{H_2 \times H_1}, W_3 \in \mathbb{R}^{C \times H_2}

The networks we are talking about here are more accurately called fully-connected networks, or sometimes multi-layer perceptrons (MLPs).
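To make the shapes concrete, here is a minimal NumPy sketch of the 3-layer forward pass above, including the learnable biases mentioned earlier; the dimensions and random initialization are hypothetical choices for illustration.

import numpy as np

# Hypothetical dimensions: input D, hidden H1 and H2, output classes C
D, H1, H2, C = 8, 16, 16, 4
rng = np.random.default_rng(0)

x = rng.standard_normal(D)
W1, b1 = rng.standard_normal((H1, D)), np.zeros(H1)
W2, b2 = rng.standard_normal((H2, H1)), np.zeros(H2)
W3, b3 = rng.standard_normal((C, H2)), np.zeros(C)

h1 = np.maximum(0, W1 @ x + b1)   # first hidden layer, ReLU activation
h2 = np.maximum(0, W2 @ h1 + b2)  # second hidden layer, ReLU activation
f = W3 @ h2 + b3                  # class scores, shape (C,)
print(f.shape)                    # (4,)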

Activation Functions

The most important role of an activation function is to introduce non-linearity.

There are multiple activation functions to choose from.

The function \max(0, x) is called ReLU (Rectified Linear Unit). It is a good default choice for most problems.

\text{ReLU}(x) = \max(0, x)

We also have:

  • Leaky ReLU: \max(0.1x, x). Allows a small gradient for negative inputs. Avoids dead neurons.
  • Sigmoid: \sigma(x) = \frac{1}{1+e^{-x}}. S-shaped curve squashing output between 0 and 1.
  • Tanh: \tanh(x). S-shaped curve squashing output between -1 and 1.
  • ELU: Exponential Linear Unit.
  • GELU & SiLU: Newer functions often used in modern architectures like Transformers.

In practice, choosing an activation function is largely an empirical decision.
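For reference, here is a minimal NumPy sketch of most of the activations listed above (GELU is omitted; the ELU formula and its default alpha follow the common textbook definition and are an assumption, not something fixed by this note):

import numpy as np

def relu(x):
    return np.maximum(0, x)

def leaky_relu(x):
    return np.maximum(0.1 * x, x)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def tanh(x):
    return np.tanh(x)

def elu(x, alpha=1.0):
    # alpha = 1.0 is the usual default (assumption)
    return np.where(x > 0, x, alpha * (np.exp(x) - 1.0))

def silu(x):
    # SiLU (also called swish): x * sigmoid(x)
    return x * sigmoid(x)

x = np.linspace(-3, 3, 7)
print(relu(x))  # negative inputs are clamped to 0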

Architecture

In a Neural Network, we have an input layer, an output layer, and several hidden layers.

Sample Code

import numpy as np
from numpy.random import randn

# Describe the network
# N: Batch size (number of inputs processed at once)
# D_in: Input dimension (features)
# H: Hidden layer dimension
# D_out: Output dimension
N, D_in, H, D_out = 64, 1000, 100, 10
x, y = randn(N, D_in), randn(N, D_out)
w1, w2 = randn(D_in, H), randn(H, D_out)

for t in range(2000):
    # Forward pass (sigmoid activation in the hidden layer)
    h = 1 / (1 + np.exp(-x.dot(w1)))
    y_pred = h.dot(w2)
    loss = np.square(y_pred - y).sum()
    print(t, loss)

    # Calculate the analytical gradients
    grad_y_pred = 2.0 * (y_pred - y)
    grad_w2 = h.T.dot(grad_y_pred)
    grad_h = grad_y_pred.dot(w2.T)
    grad_w1 = x.T.dot(grad_h * h * (1 - h))

    # Gradient descent
    w1 -= 1e-4 * grad_w1
    w2 -= 1e-4 * grad_w2

The number of neurons in the hidden layers determines the capacity of the network: more neurons, more capacity.

A regularizer keeps the network simple and prevents overfitting. Do not use the size of the neural network as a regularizer; use stronger regularization instead.

Derivatives

f(x, y) = x + y \to \frac{\partial f}{\partial x} = 1, \frac{\partial f}{\partial y} = 1

f(x, y) = \max(x, y) \to \frac{\partial f}{\partial x} = \mathbb{1}(x \geq y), \frac{\partial f}{\partial y} = \mathbb{1}(y \geq x)

f(x, y) = xy \to \frac{\partial f}{\partial x} = y, \frac{\partial f}{\partial y} = x

The derivative with respect to each variable tells you the sensitivity of the whole expression to that variable's value.

  • add gate: gradient distributor
  • mul gate: “swap multiplier”
  • copy gate: gradient adder
  • max gate: gradient router
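To make these gate rules concrete, here is a tiny scalar sketch; all the values are made up for illustration:

upstream = 2.0  # pretend this is the upstream gradient dL/dz

# add gate (z = x + y): distributes the upstream gradient to both inputs
dx_add, dy_add = upstream * 1.0, upstream * 1.0            # 2.0, 2.0

# mul gate (z = x * y): local gradients are the "swapped" inputs
x, y = 3.0, -4.0
dx_mul, dy_mul = upstream * y, upstream * x                # -8.0, 6.0

# max gate (z = max(x, y)): routes the gradient to the larger input only
dx_max, dy_max = upstream * (x >= y), upstream * (y >= x)  # 2.0, 0.0

# copy gate (one value used in two places): adds the upstream gradients
upstream_a, upstream_b = 2.0, 5.0
dx_copy = upstream_a + upstream_b                          # 7.0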

Backpropagation

The problem now is to calculate the gradients in order to learn the parameters.

We cannot simply derive the gradients on paper: that quickly becomes tedious and is not feasible for complex models.

Here’s the better idea: Computational graphs + Backpropagation.

Backpropagation is essentially a repeated application of the chain rule.

\text{Downstream} = \text{Upstream} \times \text{Local}

\frac{\partial L}{\partial x} = \frac{\partial L}{\partial z} \cdot \frac{\partial z}{\partial x}
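As a small worked example of this rule, take the toy expression f = (x + y) * z with hypothetical values x = -2, y = 5, z = -4:

# Forward pass
x, y, z = -2.0, 5.0, -4.0
q = x + y            # q = 3
f = q * z            # f = -12

# Backward pass: start with upstream gradient df/df = 1 at the output
df_df = 1.0
df_dq = z * df_df    # mul gate: local gradient w.r.t. q is z  -> -4
df_dz = q * df_df    # mul gate: local gradient w.r.t. z is q  ->  3
df_dx = 1.0 * df_dq  # add gate passes the upstream gradient through -> -4
df_dy = 1.0 * df_dq  # add gate -> -4

print(df_dx, df_dy, df_dz)  # -4.0 -4.0 3.0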

================================================================================
COMPUTATIONAL GRAPH NODE: BACKPROPAGATION
================================================================================

FORWARD PASS
Computes output z from inputs x, y

[ Input x ] -------------\
                          \
                           v
                     _____________
                    |             |
                    |   Node f    | ------------> [ Output z ]
                    |_____________|
                           ^
                          /
[ Input y ] -------------/


================================================================================

BACKWARD PASS
Chain Rule: Downstream = Upstream * Local

                                                          Upstream Gradient
                                                             (Received)
                                                                dL/dz
                                                                  |
                                                                  v
 Downstream Gradient          _____________                 _____________
 (To x)                      |             |               |             |
 dL/dx = dL/dz * dz/dx <-----| Local Grad  |<--------------|             |
                             |    dz/dx    |               |             |
                             |_____________|               |   Node f    |
                                                           |             |
 Downstream Gradient          _____________                |             |
 (To y)                      |             |               |             |
 dL/dy = dL/dz * dz/dy <-----| Local Grad  |<--------------|             |
                             |    dz/dy    |               |             |
                             |_____________|               |_____________|

Modularize: Forward/Backward API

Gate / Node / Function object: Actual PyTorch code

import torch

class Multiply(torch.autograd.Function):
    @staticmethod
    def forward(ctx, x, y):
        ctx.save_for_backward(x, y)  # Need to cache some values for use in backward
        z = x * y
        return z

    @staticmethod
    def backward(ctx, grad_z):  # Upstream gradient (grad_z)
        x, y = ctx.saved_tensors
        grad_x = y * grad_z  # dz/dx * dL/dz <-- Multiply upstream and local gradients
        grad_y = x * grad_z  # dz/dy * dL/dz
        return grad_x, grad_y
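As a quick sanity check (the values here are arbitrary), a custom autograd Function is invoked through its apply method:

x = torch.tensor(3.0, requires_grad=True)
y = torch.tensor(4.0, requires_grad=True)
z = Multiply.apply(x, y)   # forward pass through the custom Function
z.backward()               # backward pass; upstream gradient dL/dz = 1
print(x.grad, y.grad)      # tensor(4.) tensor(3.)  (i.e., y and x)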

Aside: Vector Derivatives

For vector derivatives:

  1. Vector to Scalar

x \in \mathbb{R}^N, y \in \mathbb{R}

The derivative is a gradient.

\frac{\partial y}{\partial x} \in \mathbb{R}^N \quad \left(\frac{\partial y}{\partial x}\right)_n = \frac{\partial y}{\partial x_n}

  2. Vector to Vector

x \in \mathbb{R}^N, y \in \mathbb{R}^M

The derivative is a Jacobian Matrix.

\frac{\partial y}{\partial x} \in \mathbb{R}^{N \times M} \quad \left(\frac{\partial y}{\partial x}\right)_{n,m} = \frac{\partial y_m}{\partial x_n}

Aside: Jacobian Matrix

A Jacobian matrix is the derivative of a function that maps a vector to a vector.

Definition

Let \mathbf{f} : \mathbb{R}^n \to \mathbb{R}^m be a function such that each of its first-order partial derivatives exists on \mathbb{R}^n. This function takes a point \mathbf{x} = (x_1, \ldots, x_n) \in \mathbb{R}^n as input and produces the vector \mathbf{f}(\mathbf{x}) = (f_1(\mathbf{x}), \ldots, f_m(\mathbf{x})) \in \mathbb{R}^m as output.

Then the Jacobian matrix of \mathbf{f}, denoted \mathbf{J}_{\mathbf{f}}, is the m \times n matrix whose (i, j) entry is \frac{\partial f_i}{\partial x_j}; explicitly:

\mathbf{J}_{\mathbf{f}} = \begin{bmatrix} \frac{\partial \mathbf{f}}{\partial x_1} & \cdots & \frac{\partial \mathbf{f}}{\partial x_n} \end{bmatrix} = \begin{bmatrix} \nabla^T f_1 \\ \vdots \\ \nabla^T f_m \end{bmatrix} = \begin{bmatrix} \frac{\partial f_1}{\partial x_1} & \cdots & \frac{\partial f_1}{\partial x_n} \\ \vdots & \ddots & \vdots \\ \frac{\partial f_m}{\partial x_1} & \cdots & \frac{\partial f_m}{\partial x_n} \end{bmatrix}

where \nabla^T f_i is the transpose (row vector) of the gradient of the i-th component.
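To make the definition concrete, here is a small NumPy sketch for a hypothetical function f: R^2 -> R^2, comparing the analytic Jacobian with a central finite-difference check:

import numpy as np

def f(x):
    # f(x1, x2) = (x1^2 * x2, 5*x1 + sin(x2))
    x1, x2 = x
    return np.array([x1**2 * x2, 5.0 * x1 + np.sin(x2)])

def jacobian_analytic(x):
    x1, x2 = x
    return np.array([[2.0 * x1 * x2, x1**2],
                     [5.0,           np.cos(x2)]])

def jacobian_numeric(f, x, eps=1e-6):
    # Central finite differences: perturb each input coordinate in turn
    J = np.zeros((f(x).size, x.size))
    for j in range(x.size):
        dx = np.zeros_like(x)
        dx[j] = eps
        J[:, j] = (f(x + dx) - f(x - dx)) / (2 * eps)
    return J

x = np.array([1.5, -0.3])
print(np.allclose(jacobian_analytic(x), jacobian_numeric(f, x), atol=1e-5))  # True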

Backpropagation with Vectors

================================================================================
COMPUTATIONAL GRAPH NODE: BACKPROP WITH VECTORS
================================================================================

FORWARD PASS (Green Arrows)
Computes output vector z from input vectors x, y

[ Vector x ] ------------\
  Shape: Dx               \
                           v
                     _____________
                    |             |
                    |   Node f    | ------------> [ Vector z ]
                    |_____________|                 Shape: Dz
                           ^
                          /
[ Vector y ] ------------/
  Shape: Dy


================================================================================

BACKWARD PASS (Red Arrows)
Matrix-Vector Multiply: Downstream = Local * Upstream

                                                          Upstream Gradient
                                                               (dL/dz)
                                                              Shape: Dz
                                                                  |
                                                                  v
 Downstream Gradient          _____________                 _____________
 (dL/dx)                     |  Jacobian   |               |             |
 Shape: Dx <-----------------|  [Dx x Dz]  |<--------------|             |
   = (dz/dx) * (dL/dz)       |    dz/dx    |               |             |
                             |_____________|               |   Node f    |
                                                           |             |
 Downstream Gradient          _____________                |             |
 (dL/dy)                     |  Jacobian   |               |             |
 Shape: Dy <-----------------|  [Dy x Dz]  |<--------------|             |
   = (dz/dy) * (dL/dz)       |    dz/dy    |               |_____________|
                             |_____________|
Jacobians can be sparse. For element-wise operations, the off-diagonal elements are always zero; the Jacobian becomes a diagonal matrix.
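Here is a small NumPy sketch of that point, using ReLU as the element-wise op (the shapes and values are arbitrary): multiplying by the explicit diagonal Jacobian gives the same downstream gradient as an element-wise mask, so the full matrix never needs to be built.

import numpy as np

rng = np.random.default_rng(0)
x = rng.standard_normal(5)
grad_z = rng.standard_normal(5)          # upstream gradient dL/dz

mask = (x > 0).astype(x.dtype)           # local derivative of ReLU, element-wise
J = np.diag(mask)                        # explicit diagonal Jacobian dz/dx

grad_x_full = J @ grad_z                 # full matrix-vector multiply
grad_x_fast = mask * grad_z              # element-wise shortcut
print(np.allclose(grad_x_full, grad_x_fast))  # True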

Four Fundamental Equations of Backpropagation

These allow us to propagate the error (\delta) from the output backward through the network. They all follow from the chain rule.

Notation:

  • \odot: Hadamard (element-wise) product
  • \sigma': Derivative of the activation function
  • \delta^l: Error term for layer l

(BP1) Error at the Output Layer

\delta^L = \nabla_a C \odot \sigma'(z^L)

Interpretation: How much the output layer “missed” the target, scaled by the activation’s sensitivity.

(BP2) Propagating Error to Hidden Layers

\delta^l = ((w^{l+1})^T \delta^{l+1}) \odot \sigma'(z^l)

Interpretation: The error at layer l is the weighted sum of errors from the next layer (l+1), pulled backward through the transpose of the weights.

(BP3) Gradient for Biases

\frac{\partial C}{\partial b^l_j} = \delta^l_j

Interpretation: The gradient for a bias is exactly equal to the error at that neuron.

(BP4) Gradient for Weights

\frac{\partial C}{\partial w^l_{jk}} = a^{l-1}_k \delta^l_j

Interpretation: The gradient for a weight is the product of the input activation (a^{l-1}) and the output error (\delta^l).
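Putting the four equations together, here is a minimal NumPy sketch for a tiny fully-connected network with sigmoid activations and the quadratic cost C = 0.5 * ||a^L - y||^2; the layer sizes and random data are hypothetical.

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sigmoid_prime(z):
    s = sigmoid(z)
    return s * (1.0 - s)

# Hypothetical tiny network: 3 -> 4 -> 2
rng = np.random.default_rng(0)
sizes = [3, 4, 2]
weights = [rng.standard_normal((n_out, n_in)) for n_in, n_out in zip(sizes[:-1], sizes[1:])]
biases = [rng.standard_normal(n_out) for n_out in sizes[1:]]
x, y = rng.standard_normal(sizes[0]), rng.standard_normal(sizes[-1])

# Forward pass: cache pre-activations z^l and activations a^l
activations, zs = [x], []
for w, b in zip(weights, biases):
    zs.append(w @ activations[-1] + b)
    activations.append(sigmoid(zs[-1]))

grad_b = [np.zeros_like(b) for b in biases]
grad_w = [np.zeros_like(w) for w in weights]

# BP1: error at the output layer (gradient of the quadratic cost is a^L - y)
delta = (activations[-1] - y) * sigmoid_prime(zs[-1])
grad_b[-1] = delta                                     # BP3
grad_w[-1] = np.outer(delta, activations[-2])          # BP4

# BP2: pull the error backward through the hidden layers
for l in range(2, len(sizes)):
    delta = (weights[-l + 1].T @ delta) * sigmoid_prime(zs[-l])
    grad_b[-l] = delta                                 # BP3
    grad_w[-l] = np.outer(delta, activations[-l - 1])  # BP4

print([g.shape for g in grad_w])  # [(4, 3), (2, 4)]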

About this Post

This post is written by Louis C Deng, licensed under CC BY-NC 4.0.

#CS231n #Deep Learning #Computer Vision