Here we introduce neural networks and backpropagation.
Neural Networks
We will start with a 2-layer example:
$$f = W_2 \max(0, W_1 x)$$
$$x \in \mathbb{R}^D, \quad W_1 \in \mathbb{R}^{H \times D}, \quad W_2 \in \mathbb{R}^{C \times H}$$
The $\max$ here creates some non-linearity between $W_1$ and $W_2$. It is called the activation function.
In practice, we will usually add a learnable bias at each layer as well.
Here's a 3-layer example:
$$f = W_3 \max(0, W_2 \max(0, W_1 x))$$
$$x \in \mathbb{R}^D, \quad W_1 \in \mathbb{R}^{H_1 \times D}, \quad W_2 \in \mathbb{R}^{H_2 \times H_1}, \quad W_3 \in \mathbb{R}^{C \times H_2}$$
The networks we are talking about here are more accurately called fully-connected networks, or sometimes multi-layer perceptrons (MLPs).
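To make the shapes concrete, here is a minimal numpy sketch of the 2-layer forward pass above (the sizes D, H, C are arbitrary values chosen for illustration, not anything from the original notes):

```python
import numpy as np

# Minimal sketch of f = W2 max(0, W1 x); D, H, C are arbitrary sizes
D, H, C = 8, 16, 4
x = np.random.randn(D)
W1 = np.random.randn(H, D)
W2 = np.random.randn(C, H)

h = np.maximum(0, W1 @ x)   # hidden layer with ReLU activation, shape (H,)
f = W2 @ h                  # class scores, shape (C,)
print(f.shape)              # (4,)
```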
Activation Functions
The most important role of an activation function is to introduce non-linearity.
There are multiple activation functions to choose from.
The function $\max(0, x)$ is called ReLU. It is a good default choice for most problems.
$$\text{ReLU}(x) = \max(0, x)$$
We also have:
Leaky ReLU: $\max(0.1x, x)$ — Allows a small gradient for negative inputs. Avoids dead neurons.
Sigmoid: $\sigma(x) = \frac{1}{1+e^{-x}}$ — S-shaped curve squashing output between 0 and 1.
Tanh: $\tanh(x)$ — S-shaped curve squashing output between -1 and 1.
ELU: Exponential Linear Unit.
GELU & SiLU: Newer functions often used in modern architectures like Transformers.
Choosing an activation function is largely an empirical decision.
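For reference, here is a small numpy sketch of the first few activation functions listed above (the function names are my own, not a library API):

```python
import numpy as np

def relu(x):
    return np.maximum(0, x)

def leaky_relu(x, alpha=0.1):
    return np.maximum(alpha * x, x)

def sigmoid(x):
    return 1 / (1 + np.exp(-x))

x = np.linspace(-3, 3, 7)
print(relu(x))        # zero for negative inputs
print(leaky_relu(x))  # small negative slope instead of zero
print(sigmoid(x))     # squashed into (0, 1)
print(np.tanh(x))     # squashed into (-1, 1)
```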
Architecture
In a Neural Network, we have an input layer , an output layer , and several hidden layers .
Sample Code
```python
import numpy as np
from numpy.random import randn

# Random data and weights for a 2-layer network with a sigmoid hidden layer
N, D_in, H, D_out = 64, 1000, 100, 10
x, y = randn(N, D_in), randn(N, D_out)
w1, w2 = randn(D_in, H), randn(H, D_out)

for t in range(2000):
    # Forward pass: sigmoid hidden layer, linear output, L2 loss
    h = 1 / (1 + np.exp(-x.dot(w1)))
    y_pred = h.dot(w2)
    loss = np.square(y_pred - y).sum()
    print(t, loss)

    # Backward pass: hand-derived gradients of the loss w.r.t. w1 and w2
    grad_y_pred = 2.0 * (y_pred - y)
    grad_w2 = h.T.dot(grad_y_pred)
    grad_h = grad_y_pred.dot(w2.T)
    grad_w1 = x.T.dot(grad_h * h * (1 - h))

    # Gradient descent step
    w1 -= 1e-4 * grad_w1
    w2 -= 1e-4 * grad_w2
```
The number of neurons in the hidden layers is related to the network's capacity: more neurons, more capacity.
A regularizer keeps the network simple and prevents overfitting. Do not use the size of the neural network as a regularizer; use stronger regularization instead.
Derivatives
$$f(x, y) = x + y \to \frac{\partial f}{\partial x} = 1, \quad \frac{\partial f}{\partial y} = 1$$
$$f(x, y) = \max(x, y) \to \frac{\partial f}{\partial x} = \mathbb{1}(x \geq y), \quad \frac{\partial f}{\partial y} = \mathbb{1}(y \geq x)$$
$$f(x, y) = xy \to \frac{\partial f}{\partial x} = y, \quad \frac{\partial f}{\partial y} = x$$
The derivative with respect to each variable tells you the sensitivity of the whole expression to that variable's value.
add gate: gradient distributor
mul gate: “swap multiplier”
copy gate: gradient adder
max gate: gradient router
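These gate patterns can be checked with a tiny PyTorch sketch (my own example; the values are arbitrary):

```python
import torch

x = torch.tensor(3.0, requires_grad=True)
y = torch.tensor(-4.0, requires_grad=True)

# add gate: distributes the upstream gradient to both inputs
(x + y).backward()
print(x.grad, y.grad)        # tensor(1.) tensor(1.)

x.grad, y.grad = None, None
# mul gate: "swap multiplier" -- each input receives the other's value
(x * y).backward()
print(x.grad, y.grad)        # tensor(-4.) tensor(3.)

x.grad, y.grad = None, None
# max gate: routes the gradient to the larger input only
torch.max(x, y).backward()
print(x.grad, y.grad)        # tensor(1.) tensor(0.)
```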
Backpropagation
The problem now is to calculate the gradients in order to learn the parameters.
We can't simply derive the gradients on paper; that quickly becomes tedious and is not feasible for complex models.
Here's a better idea: computational graphs + backpropagation.
Backpropagation is essentially the repeated application of the chain rule.
$$\text{Downstream} = \text{Upstream} \times \text{Local}$$
$$\frac{\partial L}{\partial x} = \frac{\partial L}{\partial z} \cdot \frac{\partial z}{\partial x}$$
```
================================================================================
COMPUTATIONAL GRAPH NODE: BACKPROPAGATION
================================================================================
FORWARD PASS
Computes output z from inputs x, y

[ Input x ] ----\
                 \
                  +--> [ Node f ] -----> [ Output z ]
                 /
[ Input y ] ----/

================================================================================
BACKWARD PASS
Chain Rule: Downstream = Upstream * Local

Upstream gradient (received):  dL/dz

Downstream gradient to x:  dL/dx = (dL/dz) * (dz/dx)
Downstream gradient to y:  dL/dy = (dL/dz) * (dz/dy)

where dz/dx and dz/dy are the local gradients computed inside Node f.
```
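As a worked scalar example of this rule (my own numbers, not from the original notes), take $f = (x + y)\,z$ with $x = -2$, $y = 5$, $z = -4$:

```python
# Forward pass
x, y, z = -2.0, 5.0, -4.0
q = x + y          # q = 3
f = q * z          # f = -12

# Backward pass: Downstream = Upstream * Local
df_dq = z          # mul gate: local gradient of q is z   -> -4
df_dz = q          # mul gate: local gradient of z is q   ->  3
df_dx = df_dq * 1  # add gate: distributes df/dq to x     -> -4
df_dy = df_dq * 1  # add gate: distributes df/dq to y     -> -4

print(df_dx, df_dy, df_dz)   # -4.0 -4.0 3.0
```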
Modularize: Forward/Backward API
Gate / Node / Function object: Actual PyTorch code
```python
import torch

class Multiply(torch.autograd.Function):
    @staticmethod
    def forward(ctx, x, y):
        # Save the inputs; they are needed for the local gradients
        ctx.save_for_backward(x, y)
        z = x * y
        return z

    @staticmethod
    def backward(ctx, grad_z):
        # Downstream = Upstream * Local: d(xy)/dx = y, d(xy)/dy = x
        x, y = ctx.saved_tensors
        grad_x = y * grad_z
        grad_y = x * grad_z
        return grad_x, grad_y
```
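A quick usage sketch for the Multiply Function above (Multiply.apply is the standard way to invoke a custom autograd Function; the tensor values here are arbitrary):

```python
import torch

x = torch.tensor([2.0, 3.0], requires_grad=True)
y = torch.tensor([4.0, 5.0], requires_grad=True)

z = Multiply.apply(x, y)   # runs Multiply.forward
z.sum().backward()         # runs Multiply.backward with grad_z = [1, 1]

print(x.grad)              # tensor([4., 5.])  = y
print(y.grad)              # tensor([2., 3.])  = x
```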
Aside: Vector Derivatives
For vector derivatives:
Vector to Scalar
$$x \in \mathbb{R}^N, \quad y \in \mathbb{R}$$
The derivative is a gradient.
$$\frac{\partial y}{\partial x} \in \mathbb{R}^N, \quad \left(\frac{\partial y}{\partial x}\right)_n = \frac{\partial y}{\partial x_n}$$
Vector to Vector
$$x \in \mathbb{R}^N, \quad y \in \mathbb{R}^M$$
The derivative is a Jacobian matrix.
$$\frac{\partial y}{\partial x} \in \mathbb{R}^{N \times M}, \quad \left(\frac{\partial y}{\partial x}\right)_{n,m} = \frac{\partial y_m}{\partial x_n}$$
Aside: Jacobian Matrix
A Jacobian matrix is the derivative of a function that maps a vector to a vector.
Definition
Let $\mathbf{f} : \mathbb{R}^n \to \mathbb{R}^m$ be a function such that each of its first-order partial derivatives exists on $\mathbb{R}^n$. This function takes a point $\mathbf{x} = (x_1, \ldots, x_n) \in \mathbb{R}^n$ as input and produces the vector $\mathbf{f}(\mathbf{x}) = (f_1(\mathbf{x}), \ldots, f_m(\mathbf{x})) \in \mathbb{R}^m$ as output.
Then the Jacobian matrix of $\mathbf{f}$, denoted $\mathbf{J}_{\mathbf{f}}$, is the $m \times n$ matrix whose $(i, j)$ entry is $\frac{\partial f_i}{\partial x_j}$; explicitly:
$$\mathbf{J}_{\mathbf{f}} = \begin{bmatrix} \frac{\partial \mathbf{f}}{\partial x_1} & \cdots & \frac{\partial \mathbf{f}}{\partial x_n} \end{bmatrix} = \begin{bmatrix} \nabla^T f_1 \\ \vdots \\ \nabla^T f_m \end{bmatrix} = \begin{bmatrix} \frac{\partial f_1}{\partial x_1} & \cdots & \frac{\partial f_1}{\partial x_n} \\ \vdots & \ddots & \vdots \\ \frac{\partial f_m}{\partial x_1} & \cdots & \frac{\partial f_m}{\partial x_n} \end{bmatrix}$$
where $\nabla^T f_i$ is the transpose (row vector) of the gradient of the $i$-th component.
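For instance (a standard textbook example, not from the original notes), for $\mathbf{f}(x, y) = (x^2 y,\ 5x + \sin y)$:

$$\mathbf{J}_{\mathbf{f}} = \begin{bmatrix} \frac{\partial f_1}{\partial x} & \frac{\partial f_1}{\partial y} \\ \frac{\partial f_2}{\partial x} & \frac{\partial f_2}{\partial y} \end{bmatrix} = \begin{bmatrix} 2xy & x^2 \\ 5 & \cos y \end{bmatrix}$$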
Backpropagation with Vectors
```
================================================================================
COMPUTATIONAL GRAPH NODE: BACKPROP WITH VECTORS
================================================================================
FORWARD PASS
Computes output vector z from input vectors x, y

[ Vector x ]  shape Dx ----\
                            \
                             +--> [ Node f ] -----> [ Vector z ]  shape Dz
                            /
[ Vector y ]  shape Dy ----/

================================================================================
BACKWARD PASS
Matrix-vector multiply: Downstream = Local * Upstream

Upstream gradient (received):  dL/dz, shape Dz

Downstream gradient to x:  dL/dx = (dz/dx) * (dL/dz),  Jacobian dz/dx is [Dx x Dz]
Downstream gradient to y:  dL/dy = (dz/dy) * (dL/dz),  Jacobian dz/dy is [Dy x Dz]

The result dL/dx has shape Dx and dL/dy has shape Dy.
```
Jacobians can be sparse. For element-wise operations, the off-diagonal elements are always zero; the Jacobian becomes a diagonal matrix.
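A small numpy sketch (my own illustration) of why this matters: for an element-wise ReLU, the explicit diagonal Jacobian and a simple element-wise mask give the same downstream gradient, so the full Jacobian never needs to be materialized in practice.

```python
import numpy as np

x = np.array([1.0, -2.0, 3.0])          # input to an element-wise ReLU
upstream = np.array([0.1, 0.2, 0.3])    # dL/dz

# Explicit Jacobian: diagonal matrix with 1 where x > 0, 0 elsewhere
J = np.diag((x > 0).astype(float))
downstream_dense = J @ upstream         # dL/dx via a full matrix multiply

# Implicit form: just mask the upstream gradient element-wise
downstream_masked = upstream * (x > 0)

assert np.allclose(downstream_dense, downstream_masked)
print(downstream_masked)                # [0.1 0.  0.3]
```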
Four Fundamental Equations of Backpropagation
These allow us to propagate the error ($\delta$) from the output backward through the network. They all follow from simple applications of the chain rule.
Notation:
$\odot$: Hadamard (element-wise) product
$\sigma'$: Derivative of the activation function
$\delta^l$: Error term for layer $l$
(BP1) Error at the Output Layer
$$\delta^L = \nabla_a C \odot \sigma'(z^L)$$
Interpretation: How much the output layer “missed” the target, scaled by the activation’s sensitivity.
(BP2) Propagating Error to Hidden Layers
$$\delta^l = ((w^{l+1})^T \delta^{l+1}) \odot \sigma'(z^l)$$
Interpretation: The error at layer $l$ is the weighted sum of errors from the next layer ($l+1$), pulled backward through the transpose of the weights.
(BP3) Gradient for Biases
$$\frac{\partial C}{\partial b^l_j} = \delta^l_j$$
Interpretation: The gradient for a bias is exactly equal to the error at that neuron.
(BP4) Gradient for Weights
$$\frac{\partial C}{\partial w^l_{jk}} = a^{l-1}_k \delta^l_j$$
Interpretation: The gradient for a weight is the product of the input activation ($a^{l-1}$) and the output error ($\delta^l$).
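Putting the four equations together, here is a minimal numpy sketch (my own illustration, with assumed layer sizes, a sigmoid activation, and a squared-error cost) of one full forward and backward pass:

```python
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

def sigmoid_prime(z):
    s = sigmoid(z)
    return s * (1 - s)

rng = np.random.default_rng(0)
sizes = [4, 5, 3]                                        # layer widths (assumed)
Ws = [rng.standard_normal((m, n)) for n, m in zip(sizes[:-1], sizes[1:])]
bs = [rng.standard_normal((m, 1)) for m in sizes[1:]]

x = rng.standard_normal((sizes[0], 1))
y = rng.standard_normal((sizes[-1], 1))                  # target

# Forward pass: store pre-activations z^l and activations a^l
a, zs, activations = x, [], [x]
for W, b in zip(Ws, bs):
    z = W @ a + b
    a = sigmoid(z)
    zs.append(z)
    activations.append(a)

# BP1: error at the output layer (for C = 0.5 * ||a - y||^2, grad_a C = a - y)
delta = (activations[-1] - y) * sigmoid_prime(zs[-1])

grads_W = [None] * len(Ws)
grads_b = [None] * len(bs)
grads_b[-1] = delta                                      # BP3
grads_W[-1] = delta @ activations[-2].T                  # BP4

# BP2: propagate the error backward through the hidden layers
for l in range(2, len(sizes)):
    delta = (Ws[-l + 1].T @ delta) * sigmoid_prime(zs[-l])
    grads_b[-l] = delta                                  # BP3
    grads_W[-l] = delta @ activations[-l - 1].T          # BP4

print([g.shape for g in grads_W])                        # [(5, 4), (3, 5)]
```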