
CS231n Lecture Note III: Optimization

With the score function and the loss function in hand, we now focus on how to minimize the loss. Optimization is the process of finding the set of parameters $W$ that minimizes the loss function.

Core idea: iterative refinement

Strategy #2: Follow the slope

In 1-dimension, the derivative of a function:

$$\frac{df(x)}{dx} = \lim_{h \to 0} \frac{f(x+h) - f(x)}{h}$$

In multiple dimensions, the gradient $\nabla_W L$ is the vector of partial derivatives along each dimension.

The slope in any direction is the dot product of that direction with the gradient. The direction of steepest descent is the negative gradient.

Numerical Gradient

We can compute an approximate numerical gradient using finite differences. It is often sufficient to use a very small value of $h$ (such as 1e-5).

This is easy to write, but might be slow.
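A minimal sketch of this approach, assuming f is a scalar-valued loss function and x is a numpy array of parameters (both names are stand-ins, not from the original notes):

import numpy as np

def eval_numerical_gradient(f, x, h=1e-5):
    # Approximate df/dx at x with the forward-difference formula [f(x + h) - f(x)] / h
    grad = np.zeros_like(x)
    fx = f(x)  # loss at the original point
    it = np.nditer(x, flags=['multi_index'], op_flags=['readwrite'])
    while not it.finished:
        ix = it.multi_index
        old_value = x[ix]
        x[ix] = old_value + h       # perturb a single dimension
        fxh = f(x)                  # loss at the perturbed point
        x[ix] = old_value           # restore the original value
        grad[ix] = (fxh - fx) / h   # slope along this dimension
        it.iternext()
    return grad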

Analytic Gradient

We can also use calculus to get the exact value of gradients.

Practical considerations

Always use analytic gradient, but check implementation with numerical gradient. This is called a gradient check.

It often works better in practice to compute the numeric gradient using the centered difference formula: $\frac{f(x+h) - f(x-h)}{2h}$.
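A gradient check then compares the analytic gradient against the numerical one, for example with a relative error (a sketch; analytic_grad is a hypothetical array computed by calculus for the same f and x, and eval_numerical_gradient is the routine sketched above, ideally switched to the centered formula):

num_grad = eval_numerical_gradient(f, x)
# Relative error between the two gradients; roughly 1e-7 or smaller
# usually indicates a correct analytic implementation.
rel_error = np.abs(analytic_grad - num_grad) / np.maximum(1e-8, np.abs(analytic_grad) + np.abs(num_grad))
print(rel_error.max())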

Gradient Descent

# Vanilla Gradient Descent

while True:
    weights_grad = evaluate_gradient(loss_fun, data, weights)
    weights += - step_size * weights_grad  # perform parameter update

But in large-scale applications, the training data can have on the order of millions of examples. It seems wasteful to compute the full loss function over the entire training set in order to perform only a single parameter update.

Stochastic Gradient Descent (SGD)

A very common approach to addressing this challenge is to compute the gradient over batches of the training data, which is called the Mini-batch Gradient Descent.

The gradient from a mini-batch is a good approximation of the gradient of the full objective. Therefore, much faster convergence can be achieved in practice by evaluating the mini-batch gradients to perform more frequent parameter updates.

# Vanilla Minibatch Gradient Descent

while True:
    data_batch = sample_training_data(data, 256)  # sample 256 examples
    weights_grad = evaluate_gradient(loss_fun, data_batch, weights)
    weights += - step_size * weights_grad  # perform parameter update

The extreme case of this is a setting where the mini-batch contains only a single example. This process is called Stochastic Gradient Descent (SGD).

The size of the mini-batch is a hyperparameter. It is usually based on memory constraints (if any), or set to some value, e.g. 32, 64 or 128. We use powers of 2 in practice because many vectorized operation implementations work faster when their inputs are sized in powers of 2.

Problems with SGD

  1. Jittering: if the loss changes quickly in one direction and slowly in another, the updates zig-zag and make slow progress.
  2. Local minima or saddle points, where the gradient is zero and progress stalls.
  3. Gradients computed from mini-batches can be noisy.

SGD + Momentum

Momentum is a method that helps accelerate SGD in the relevant direction and dampens oscillations.

$$v_{t+1} = \rho v_t + \nabla f(x_t)$$

$$x_{t+1} = x_t - \alpha v_{t+1}$$

vx = 0
while True:
    dx = compute_gradient(x)
    vx = rho * vx + dx
    x -= learning_rate * vx

  • $v$: Velocity (accumulates gradient directions).
  • $\rho$ (rho): Momentum coefficient (friction), typically 0.9 or 0.99.
  • $\alpha$ (alpha): Learning rate.
  • $\nabla f(x)$: Gradient.

Aside: Nesterov Momentum

Standard Momentum calculates the gradient first, and then adds the previous velocity to it.

Unlike Standard Momentum, Nesterov Momentum first adds the previous velocity to the parameters, and then calculates the gradient at the updated point. It "looks ahead", which makes it more stable.

$$v_{t+1} = \rho v_t - \alpha \nabla f(x_t + \rho v_t)$$

$$x_{t+1} = x_t + v_{t+1}$$

We can let $\tilde{x}_t = x_t + \rho v_t$ to make it look nicer. Rearranging:

$$v_{t+1} = \rho v_t - \alpha \nabla f(\tilde{x}_t)$$

$$\tilde{x}_{t+1} = \tilde{x}_t - \rho v_t + (1 + \rho)v_{t+1} = \tilde{x}_t + v_{t+1} + \rho(v_{t+1} - v_t)$$

“Look ahead” to the point where updating with the velocity would take us; compute the gradient there and mix it with the velocity to get the actual update direction.
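In the rearranged variables, Nesterov momentum can be written in the same style as the momentum loop above (a sketch, with rho and learning_rate as before):

v = 0
while True:
    dx = compute_gradient(x)
    old_v = v
    v = rho * v - learning_rate * dx   # velocity update using the look-ahead gradient
    # Matches x_{t+1} = x_t - rho * v_t + (1 + rho) * v_{t+1} in the tilde variables
    x += -rho * old_v + (1 + rho) * v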

RMSProp

RMSProp (Root Mean Square Propagation) is an adaptive learning rate optimization algorithm designed to improve the performance and speed of training deep learning models.

It adds element-wise scaling of the gradient based on historical sums of squares in each dimension (with decay).

Mathematical Formulation:

  1. Compute the gradient:

$$g_t = \nabla_{\theta} J(\theta_t)$$

  2. Update the moving average of squared gradients:

$$E[g^2]_t = \gamma E[g^2]_{t-1} + (1 - \gamma) g_t^2$$

  3. Update the parameters:

$$\theta_{t+1} = \theta_t - \frac{\eta}{\sqrt{E[g^2]_t + \epsilon}} \odot g_t$$

where:

  • Learning Rate (η): Controls the step size during parameter updates.
  • Decay Rate (γ): Determines how quickly the moving average of squared gradients decays.
  • Epsilon (ϵ): A small constant (e.g., 1e-8) added to the denominator to prevent division by zero and ensure numerical stability.
grad_squared = 0
while True:
    dx = compute_gradient(x)
    # Update moving average of the squared gradients
    grad_squared = decay_rate * grad_squared + (1 - decay_rate) * dx * dx
    # Update weights
    x -= learning_rate * dx / (np.sqrt(grad_squared) + 1e-7)

Aside: AdaGrad

grad_squared = 0
while True:
    dx = compute_gradient(x)
    # Difference HERE: AdaGrad accumulates the squared gradients without decay
    grad_squared += dx * dx
    x -= learning_rate * dx / (np.sqrt(grad_squared) + 1e-7)

RMSProp is essentially AdaGrad with decay: the accumulated sum of squared gradients is replaced by a decaying moving average, so the effective step size does not shrink monotonically toward zero.

Adam & AdamW

Adam (Adaptive Moment Estimation) optimizer combines the advantages of Momentum and RMSprop techniques to adjust learning rates during training.

first_moment = 0
second_moment = 0
for t in range(1, num_iterations):
    dx = compute_gradient(x)
    # Momentum
    first_moment = beta1 * first_moment + (1 - beta1) * dx
    # RMSProp
    second_moment = beta2 * second_moment + (1 - beta2) * dx * dx
    # Bias correction
    first_unbias = first_moment / (1 - beta1 ** t)
    second_unbias = second_moment / (1 - beta2 ** t)
    # RMSProp-style update using the bias-corrected moments
    x -= learning_rate * first_unbias / (np.sqrt(second_unbias) + 1e-7)

However, since the first and second moment estimates start at zero, the early steps would otherwise divide by a tiny second moment and could be gigantic. Bias correction is applied to prevent this early-stage instability.

For standard Adam with L2 regularization, the regularization term is added to the loss, so it enters the gradient and gets rescaled by the moment estimates.

For AdamW (decoupled weight decay), the weight-decay term is instead applied directly to x in the final update, after the moment-based step.
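A sketch of the two update rules side by side (bias correction omitted for brevity; weight_decay and the exact coupling with the learning rate are illustrative choices, not from the original notes):

# Adam + L2 regularization: the penalty enters the gradient,
# so it gets rescaled by the adaptive moments
dx = compute_gradient(x) + weight_decay * x
first_moment = beta1 * first_moment + (1 - beta1) * dx
second_moment = beta2 * second_moment + (1 - beta2) * dx * dx
x -= learning_rate * first_moment / (np.sqrt(second_moment) + 1e-7)

# AdamW: decoupled weight decay, applied directly to x in the final update
dx = compute_gradient(x)
first_moment = beta1 * first_moment + (1 - beta1) * dx
second_moment = beta2 * second_moment + (1 - beta2) * dx * dx
x -= learning_rate * (first_moment / (np.sqrt(second_moment) + 1e-7) + weight_decay * x)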

Learning Rate Schedules

The learning rate is a hyperparameter. We may make it decay over time.

  1. Reduce the learning rate at a few fixed points (step decay)
  2. Cosine decay
  3. Linear decay
  4. Inverse sqrt decay
  5. etc.

Empirical rule of thumb: If you increase the batch size by N, also scale the initial learning rate by N.
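As a concrete example, a cosine decay schedule can be computed as in this short sketch (base_lr and num_epochs are illustrative names):

import numpy as np

def cosine_lr(epoch, base_lr, num_epochs):
    # Cosine decay: start at base_lr and smoothly decay toward 0 over num_epochs
    return 0.5 * base_lr * (1 + np.cos(np.pi * epoch / num_epochs))

print(cosine_lr(epoch=50, base_lr=1e-3, num_epochs=100))  # halfway through: ~0.5 * base_lr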

Second-Order Optimization

We can use gradient and Hessian to form a quadratic approximation, then step to its minima.

Second-Order Taylor Series:

$$J(\boldsymbol{\theta}) \approx J(\boldsymbol{\theta}_0) + (\boldsymbol{\theta} - \boldsymbol{\theta}_0)^\top \nabla_{\boldsymbol{\theta}} J(\boldsymbol{\theta}_0) + \frac{1}{2} (\boldsymbol{\theta} - \boldsymbol{\theta}_0)^\top \mathbf{H} (\boldsymbol{\theta} - \boldsymbol{\theta}_0)$$

Solving for the critical point we obtain the Newton parameter update:

$$\boldsymbol{\theta}^* = \boldsymbol{\theta}_0 - \mathbf{H}^{-1} \nabla_{\boldsymbol{\theta}} J(\boldsymbol{\theta}_0)$$

This is impractical for deep learning: with $N$ parameters the Hessian has $N^2$ entries, and inverting it costs roughly $O(N^3)$.

  • Quasi-Newton methods (BFGS)
  • L-BFGS (Limited-memory BFGS): does not store the full inverse Hessian.
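For a small problem where the Hessian can be formed explicitly, a single Newton step is just a linear solve. A toy sketch on a hypothetical quadratic objective (not something you would run on a deep network):

import numpy as np

# Toy quadratic objective J(theta) = 0.5 * theta^T A theta - b^T theta
A = np.array([[3.0, 1.0], [1.0, 2.0]])   # the (constant) Hessian of J
b = np.array([1.0, -1.0])

theta0 = np.zeros(2)
grad = A @ theta0 - b                             # gradient of J at theta0
theta_star = theta0 - np.linalg.solve(A, grad)    # Newton update: theta0 - H^{-1} grad
print(theta_star)  # for a quadratic, one Newton step lands exactly on the minimum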

Aside: Hessian Matrices

The Hessian matrix (or simply the Hessian), denoted $\mathbf{H}$, collects the second-order partial derivatives of a function. While the gradient ($\nabla$) tells you the slope (its negative is the direction of steepest descent), the Hessian tells you the curvature of the loss surface.

It is a square matrix of second-order partial derivatives of a scalar-valued function $f: \mathbb{R}^n \to \mathbb{R}$. It describes the local curvature of a function of many variables.

$$\mathbf{H}(f) = \begin{bmatrix} \frac{\partial^2 f}{\partial x_1^2} & \frac{\partial^2 f}{\partial x_1 \partial x_2} & \cdots & \frac{\partial^2 f}{\partial x_1 \partial x_n} \\ \frac{\partial^2 f}{\partial x_2 \partial x_1} & \frac{\partial^2 f}{\partial x_2^2} & \cdots & \frac{\partial^2 f}{\partial x_2 \partial x_n} \\ \vdots & \vdots & \ddots & \vdots \\ \frac{\partial^2 f}{\partial x_n \partial x_1} & \frac{\partial^2 f}{\partial x_n \partial x_2} & \cdots & \frac{\partial^2 f}{\partial x_n^2} \end{bmatrix}$$

$$\mathbf{H}_{ij} = \frac{\partial^2 f}{\partial x_i \partial x_j}$$

Hessian matrices are symmetric whenever the second partial derivatives are continuous (Schwarz's theorem).
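For example, for $f(x, y) = x^2 y$ (a simple illustration, not from the original notes):

$$\mathbf{H}(f) = \begin{bmatrix} \frac{\partial^2 f}{\partial x^2} & \frac{\partial^2 f}{\partial x \partial y} \\ \frac{\partial^2 f}{\partial y \partial x} & \frac{\partial^2 f}{\partial y^2} \end{bmatrix} = \begin{bmatrix} 2y & 2x \\ 2x & 0 \end{bmatrix}$$

The two mixed partials agree, so the matrix is symmetric.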

About this Post

This post is written by Louis C Deng, licensed under CC BY-NC 4.0.

#CS231n #Deep Learning #Computer Vision