With the score function and the loss function in hand, we now turn to how to minimize the loss. Optimization is the process of finding the set of parameters that minimizes the loss function.
Strategy #1: A first, very bad idea: random search
Core idea: iterative refinement
Strategy #2: Follow the slope
In one dimension, the derivative of a function is:

$$\frac{df(x)}{dx} = \lim_{h \to 0} \frac{f(x+h) - f(x)}{h}$$
In multiple dimensions, the gradient is the vector of partial derivatives along each dimension.
The slope in any direction is the dot product of that direction with the gradient. The direction of steepest descent is the negative gradient.
Numerical Gradient
We can compute an approximate numerical gradient with finite differences: perturb each dimension by a small step h and measure how the loss changes. It is often sufficient to use a very small value such as h = 1e-5.
This is easy to write, but slow: it requires a separate loss evaluation for every parameter dimension.
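As a minimal sketch of the procedure (assuming `f` is a scalar loss over a NumPy parameter array `x`; all names are illustrative):

```python
import numpy as np

def eval_numerical_gradient(f, x, h=1e-5):
    # Forward-difference approximation: grad_i ≈ [f(x + h*e_i) - f(x)] / h.
    fx = f(x)                        # loss at the original point
    grad = np.zeros_like(x)
    for i in range(x.size):
        old = x.flat[i]
        x.flat[i] = old + h          # perturb the i-th dimension
        grad.flat[i] = (f(x) - fx) / h
        x.flat[i] = old              # restore the original value
    return grad
```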
Analytic Gradient
We can also use calculus to derive the exact gradient.
Practical considerations
Always use the analytic gradient, but check the implementation against the numerical gradient. This is called a gradient check.
It often works better to compute the numerical gradient using the centered difference formula $[f(x+h) - f(x-h)]/(2h)$, as in the sketch below.
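A centered-difference variant and a relative-error comparison might look like this (a sketch; the ~1e-7 threshold is a common rule of thumb, not a guarantee):

```python
import numpy as np

def eval_numerical_gradient_centered(f, x, h=1e-5):
    # Centered differences: grad_i ≈ [f(x + h*e_i) - f(x - h*e_i)] / (2h).
    grad = np.zeros_like(x)
    for i in range(x.size):
        old = x.flat[i]
        x.flat[i] = old + h
        fp = f(x)                    # f(x + h)
        x.flat[i] = old - h
        fm = f(x)                    # f(x - h)
        x.flat[i] = old              # restore
        grad.flat[i] = (fp - fm) / (2 * h)
    return grad

def rel_error(a, b):
    # Relative error between analytic and numerical gradients;
    # values around 1e-7 or below usually indicate a correct implementation.
    return np.max(np.abs(a - b) / np.maximum(1e-8, np.abs(a) + np.abs(b)))
```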
Gradient Descent
```python
# Vanilla Gradient Descent
while True:
    weights_grad = evaluate_gradient(loss_fun, data, weights)
    weights += -step_size * weights_grad  # perform a parameter update
```
But in large-scale applications, the training data can have on the order of millions of examples. It seems wasteful to compute the full loss function over the entire training set in order to perform only a single parameter update.
Stochastic Gradient Descent (SGD)
A very common approach to addressing this challenge is to compute the gradient over batches of the training data; this is called Mini-batch Gradient Descent.
The gradient from a mini-batch is a good approximation of the gradient of the full objective. Therefore, much faster convergence can be achieved in practice by evaluating the mini-batch gradients to perform more frequent parameter updates.
```python
# Vanilla Minibatch Gradient Descent
while True:
    data_batch = sample_training_data(data, 256)  # sample 256 examples
    weights_grad = evaluate_gradient(loss_fun, data_batch, weights)
    weights += -step_size * weights_grad  # perform a parameter update
```
The extreme case of this is a setting where the mini-batch contains only a single example. This process is called Stochastic Gradient Descent (SGD).
The size of the mini-batch is a hyperparameter. It is usually based on memory constraints (if any), or set to some value, e.g. 32, 64 or 128. We use powers of 2 in practice because many vectorized operation implementations work faster when their inputs are sized in powers of 2.
Problems with SGD
- Jittering: if the loss changes quickly in one direction and slowly in another, SGD makes slow progress along the shallow dimension while jittering along the steep one.
- Local minima or saddle points: zero-gradient points where SGD can get stuck; in high dimensions, saddle points are far more common than local minima.
- Gradients from mini-batches are noisy estimates of the full gradient.
SGD + Momentum
Momentum is a method that helps accelerate SGD in the relevant direction and dampens oscillations.
```python
vx = 0
while True:
    dx = compute_gradient(x)
    vx = rho * vx + dx        # integrate velocity
    x -= learning_rate * vx   # step in the direction of the velocity
```
- $v$: velocity (accumulates gradient directions).
- $\rho$ (rho): momentum coefficient (friction), typically 0.9 or 0.99.
- $\alpha$ (alpha): learning rate.
- $\nabla f(x)$: gradient.
Aside: Nesterov Momentum
Standard Momentum calculates the gradient at the current point first, and then adds the previous velocity to it.
Unlike standard Momentum, Nesterov Momentum first adds the previous velocity to the parameters, and then calculates the gradient at that look-ahead point:

$$v_{t+1} = \rho v_t - \alpha \nabla f(x_t + \rho v_t), \qquad x_{t+1} = x_t + v_{t+1}$$

It looks ahead, making it more stable.
The gradient is evaluated at $x_t + \rho v_t$ rather than at the current parameters, which is awkward. We can let $\tilde{x}_t = x_t + \rho v_t$ to make it look nicer. Rearranging:

$$v_{t+1} = \rho v_t - \alpha \nabla f(\tilde{x}_t)$$

$$\tilde{x}_{t+1} = \tilde{x}_t - \rho v_t + (1 + \rho)\, v_{t+1}$$
“Look ahead” to the point where updating with the velocity would take us; compute the gradient there and mix it with the velocity to get the actual update direction.
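In the same pseudocode style as the snippets above, the rearranged update can be sketched as:

```python
v = 0
while True:
    old_v = v
    v = rho * v - learning_rate * compute_gradient(x)  # gradient at the look-ahead point
    x += -rho * old_v + (1 + rho) * v                  # rearranged parameter update
```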
RMSProp
RMSProp (Root Mean Square Propagation) is an adaptive learning rate optimization algorithm designed to improve the performance and speed of training deep learning models.
It adds element-wise scaling of the gradient based on historical sums of squares in each dimension (with decay).
Mathematical Formulation:
- Compute the gradient: $g_t = \nabla_\theta L(\theta_t)$
- Update the moving average of squared gradients: $E[g^2]_t = \gamma\, E[g^2]_{t-1} + (1 - \gamma)\, g_t^2$
- Update the parameters: $\theta_{t+1} = \theta_t - \frac{\eta}{\sqrt{E[g^2]_t} + \epsilon}\, g_t$
where:
- Learning Rate (η): Controls the step size during parameter updates.
- Decay Rate (γ): Determines how quickly the moving average of squared gradients decays.
- Epsilon (ϵ): A small constant (e.g., 1e-8) added to the denominator to prevent division by zero and ensure numerical stability.
```python
grad_squared = 0
while True:
    dx = compute_gradient(x)
    # decaying moving average of squared gradients (element-wise)
    grad_squared = decay_rate * grad_squared + (1 - decay_rate) * dx * dx
    x -= learning_rate * dx / (np.sqrt(grad_squared) + 1e-8)
```
Aside: AdaGrad
```python
grad_squared = 0
while True:
    dx = compute_gradient(x)
    grad_squared += dx * dx   # running sum of squared gradients, no decay
    x -= learning_rate * dx / (np.sqrt(grad_squared) + 1e-8)
```
RMSProp is basically AdaGrad with decay: the moving average keeps the effective learning rate from shrinking monotonically toward zero, as AdaGrad's ever-growing sum does.
Adam & AdamW
The Adam (Adaptive Moment Estimation) optimizer combines the advantages of the Momentum and RMSProp techniques to adjust learning rates during training.
```python
first_moment = 0
second_moment = 0
for t in range(1, num_iterations + 1):
    dx = compute_gradient(x)
    first_moment = beta1 * first_moment + (1 - beta1) * dx         # Momentum
    second_moment = beta2 * second_moment + (1 - beta2) * dx * dx  # RMSProp
    first_unbias = first_moment / (1 - beta1 ** t)                 # bias correction
    second_unbias = second_moment / (1 - beta2 ** t)               # bias correction
    x -= learning_rate * first_unbias / (np.sqrt(second_unbias) + 1e-8)
```
However, since the first and second moment estimates start at zero, the second moment is tiny during the first steps, which could make the initial step gigantic. We therefore apply bias correction (the `first_unbias` and `second_unbias` lines above) to prevent this early-stage instability.
For standard Adam, the L2 regularization term is applied to x before the gradient computation, so it enters the gradient and is rescaled by the moment estimates.
For AdamW (weight decay), the regularization term is instead added to the final x update, after the moments are computed.
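A sketch of where the decay term enters in each case, reusing the variables from the Adam loop above (`weight_decay` is an assumed hyperparameter name):

```python
# Standard Adam + L2 regularization: the decay term enters the gradient,
# so it gets rescaled by the adaptive moment estimates.
dx = compute_gradient(x) + weight_decay * x

# AdamW: decoupled weight decay, applied directly to the parameters
# in the final update, outside the moment-based rescaling.
x -= learning_rate * (first_unbias / (np.sqrt(second_unbias) + 1e-8)
                      + weight_decay * x)
```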
Learning Rate Schedules
The learning rate is a hyperparameter. We may make it decay over time.
- Step decay: reduce the learning rate at a few fixed points (e.g. multiply it by 0.1 every 30 epochs).
- Cosine decay (sketched after this list)
- Linear decay
- Inverse sqrt decay
- etc.
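As one concrete illustration, cosine decay can be sketched as follows (`alpha_0` and `T`, the initial learning rate and the total number of steps, are illustrative names):

```python
import math

def cosine_decay(t, alpha_0, T):
    # Decay the learning rate from alpha_0 at t = 0 to 0 at t = T:
    # alpha_t = 0.5 * alpha_0 * (1 + cos(pi * t / T))
    return 0.5 * alpha_0 * (1 + math.cos(math.pi * t / T))
```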
Empirical rule of thumb: if you increase the batch size by a factor of N, also scale the initial learning rate by N (e.g. quadrupling the batch size from 256 to 1024 suggests quadrupling the learning rate).
Second-Order Optimization
We can use the gradient and the Hessian to form a quadratic approximation of the loss, then step directly to the minimum of that approximation.
Second-order Taylor series:

$$L(\theta) \approx L(\theta_0) + (\theta - \theta_0)^\top \nabla_\theta L(\theta_0) + \frac{1}{2} (\theta - \theta_0)^\top H\, (\theta - \theta_0)$$

Solving for the critical point, we obtain the Newton parameter update:

$$\theta^* = \theta_0 - H^{-1} \nabla_\theta L(\theta_0)$$
This can be bad for deep learning because of the amount of computation required: with $N$ parameters, the Hessian has $O(N^2)$ elements and inverting it costs $O(N^3)$. Cheaper alternatives:
- Quasi-Newton methods (BFGS): build an approximation to the inverse Hessian from rank-1 updates instead of computing it directly.
- L-BFGS (Limited-memory BFGS): does not store the full inverse Hessian.
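A minimal sketch of a single Newton step under the definitions above, assuming callables for the gradient and Hessian:

```python
import numpy as np

def newton_step(theta, grad_fn, hessian_fn):
    # theta_new = theta - H^{-1} g; solve the linear system H p = g
    # rather than explicitly inverting the Hessian.
    g = grad_fn(theta)
    H = hessian_fn(theta)
    return theta - np.linalg.solve(H, g)
```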
Aside: Hessian Matrices
The Hessian matrix (or simply the Hessian), denoted $H$ (or $\nabla^2 f$), represents the second-order partial derivatives of a function. While the gradient ($\nabla f$) tells you the slope of the loss surface, the Hessian tells you its curvature.
It is a square matrix of the second-order partial derivatives of a scalar-valued function $f : \mathbb{R}^n \to \mathbb{R}$, with entries $H_{ij} = \partial^2 f / \partial x_i \partial x_j$. It describes the local curvature of a function of many variables.
Hessian matrices are symmetric whenever the second partial derivatives are continuous (Schwarz's theorem), which holds for typical loss functions.
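As a small worked example, for $f(x, y) = x^2 + xy + y^2$:

$$H = \begin{pmatrix} \frac{\partial^2 f}{\partial x^2} & \frac{\partial^2 f}{\partial x \partial y} \\ \frac{\partial^2 f}{\partial y \partial x} & \frac{\partial^2 f}{\partial y^2} \end{pmatrix} = \begin{pmatrix} 2 & 1 \\ 1 & 2 \end{pmatrix}$$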