If you are diving into the mechanics of neural networks, you will inevitably encounter the backpropagation of the Softmax and Cross-Entropy loss. At first glance, the matrix calculus can feel a bit intimidating. However, once you break it down step-by-step using the chain rule, you will discover that the final gradients are incredibly elegant and intuitive.
In this post, we will walk through the complete mathematical derivation of the gradients for a linear classifier, moving from single variables to full matrix vectorization.
1. The Setup: Forward Propagation
Let’s define our variables for a single training sample. Assume our input features have a dimension of $D$, and we are classifying them into $C$ distinct classes.
- Input $x$: A $D \times 1$ column vector.
- Weights $W$: A $C \times D$ matrix.
- Bias $b$: A $C \times 1$ column vector.
- True Label $y$: The true class index (or a one-hot encoded vector $\mathbf{y}$ where only the true class index is $1$).
The Linear Layer (Logits):
First, we compute the raw scores (logits) for each class:

$$z = Wx + b, \qquad z_i = \sum_{j=1}^{D} W_{ij}\, x_j + b_i$$
The Softmax Layer:
We convert these raw scores into a valid probability distribution. The probability of the sample belonging to class $i$ is:

$$p_i = \frac{e^{z_i}}{\sum_{k=1}^{C} e^{z_k}}$$
The Cross-Entropy Loss:
For a single sample where the true class is $y$, the loss only cares about the predicted probability assigned to that true class:

$$L = -\log p_y$$

By substituting the Softmax formula into the loss function, we get:

$$L = -\log \frac{e^{z_y}}{\sum_{k=1}^{C} e^{z_k}} = -z_y + \log \sum_{k=1}^{C} e^{z_k}$$
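Before differentiating anything, it helps to see the forward pass as code. Here is a minimal NumPy sketch (the function name `forward` and its signature are illustrative, not from a particular library); the max-subtraction trick shifts the logits without changing the softmax output, purely for numerical stability:

```python
import numpy as np

def forward(W, b, x, y):
    """Forward pass for one sample: logits -> softmax -> cross-entropy loss.

    W: (C, D) weights, b: (C,) bias, x: (D,) input, y: int true class index.
    """
    z = W @ x + b                    # logits z = Wx + b, shape (C,)
    z = z - z.max()                  # stabilize: softmax is shift-invariant
    p = np.exp(z) / np.exp(z).sum()  # softmax probabilities, sums to 1
    loss = -np.log(p[y])             # cross-entropy: -log p_y
    return p, loss
```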
2. The Core Derivation: Gradient with respect to Logits ($z$)
To perform backpropagation, we first need to find how the loss changes with respect to each logit $z_i$. We denote this gradient as $\frac{\partial L}{\partial z_i}$.
We must split this into two scenarios: when $i$ is the true class, and when $i$ is any other class.
Case 1: Deriving for the true class ($i = y$)
Applying the chain rule to the logarithm term:

$$\frac{\partial L}{\partial z_y} = \frac{\partial}{\partial z_y}\left(-z_y + \log \sum_{k} e^{z_k}\right) = -1 + \frac{e^{z_y}}{\sum_{k} e^{z_k}}$$

Notice that the fractional term is exactly our definition of $p_y$!

$$\frac{\partial L}{\partial z_y} = p_y - 1$$
Case 2: Deriving for an incorrect class ($i \neq y$)
Because $-z_y$ does not contain $z_i$, the derivative of the first term is $0$:

$$\frac{\partial L}{\partial z_i} = 0 + \frac{e^{z_i}}{\sum_{k} e^{z_k}}$$

Again, the fractional term is $p_i$:

$$\frac{\partial L}{\partial z_i} = p_i$$
The Vectorized Form:
We can beautifully combine these two cases using an indicator function $\mathbb{1}(i = y)$ (which is $1$ if $i$ is the true class, and $0$ otherwise):

$$\frac{\partial L}{\partial z_i} = p_i - \mathbb{1}(i = y)$$

Let $\delta$ be the gradient vector $\frac{\partial L}{\partial z}$, and let $\mathbf{y}$ be the one-hot label vector. Its vectorized form is simply:

$$\delta = p - \mathbf{y}$$
Intuition Check: This result is incredibly logical. The gradient is simply the Predicted Probability minus the True Probability. If the model is 100% confident and correct ($p_y = 1$), the gradient is $0$, and no weights will be updated. The larger the error, the larger the gradient pushing the model to learn.
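The $\delta = p - \mathbf{y}$ result is simple enough to verify numerically. A sketch (the helper `softmax` is defined locally for the check, not taken from any library) that compares the analytic gradient against central finite differences of the loss:

```python
import numpy as np

def grad_logits(p, y):
    """Gradient of cross-entropy w.r.t. logits: dL/dz = p - one_hot(y)."""
    delta = p.copy()
    delta[y] -= 1.0   # subtract 1 only at the true class index
    return delta
```

Running a finite-difference check on, say, $z = (2, -1, 0.5)$ with true class $0$ should agree with `grad_logits` to within numerical precision.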
3. Gradients with respect to Weights ($W$) and Bias ($b$)
Now that we have the gradient of the loss with respect to the logits ($\delta = p - \mathbf{y}$), we use the chain rule to pass this error signal back to our parameters $W$ and $b$.
Deriving for $W$:
Let’s look at a single weight element $W_{ij}$. It only influences the loss through the specific logit $z_i$:

$$\frac{\partial L}{\partial W_{ij}} = \frac{\partial L}{\partial z_i} \cdot \frac{\partial z_i}{\partial W_{ij}}$$

Since $z_i = \sum_{j} W_{ij}\, x_j + b_i$, the local gradient $\frac{\partial z_i}{\partial W_{ij}}$ is simply $x_j$. Therefore:

$$\frac{\partial L}{\partial W_{ij}} = \delta_i\, x_j$$

To vectorize this back into a matrix (the same shape as $W$), we take the outer product of the column vector $\delta$ and the row vector $x^{\top}$:

$$\frac{\partial L}{\partial W} = \delta\, x^{\top} = (p - \mathbf{y})\, x^{\top}$$
Deriving for $b$:
Since $z_i = \sum_{j} W_{ij}\, x_j + b_i$, the local derivative $\frac{\partial z_i}{\partial b_i}$ is $1$. Thus, the error signal passes directly through:

$$\frac{\partial L}{\partial b} = \delta = p - \mathbf{y}$$
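Both single-sample parameter gradients fit in a few lines of NumPy. A sketch (the function name `param_grads` is illustrative), where `np.outer` implements the $\delta\, x^{\top}$ outer product directly:

```python
import numpy as np

def param_grads(p, y, x):
    """Single-sample gradients: dL/dW = (p - one_hot(y)) x^T, dL/db = p - one_hot(y).

    p: (C,) softmax output, y: int true class, x: (D,) input.
    """
    delta = p.copy()
    delta[y] -= 1.0           # delta = p - one_hot(y), shape (C,)
    dW = np.outer(delta, x)   # outer product, shape (C, D) -- same as W
    db = delta                # bias gradient is the error signal itself
    return dW, db
```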
4. Scaling up: The Mini-Batch Form
In real-world training, we process $N$ samples at a time to stabilize gradients and utilize parallel computing.
- $x$ becomes a $D \times N$ matrix $X$, with one sample per column.
- $z$ and $p$ become $C \times N$ matrices $Z$ and $P$, and the one-hot labels stack into a $C \times N$ matrix $Y$.
To find the average gradient across the entire batch, we perform a matrix multiplication and divide by $N$:
Batch Weight Gradient:

$$\frac{\partial L}{\partial W} = \frac{1}{N}\, (P - Y)\, X^{\top}$$

Batch Bias Gradient:
Sum the errors across all samples for each class, then average:

$$\frac{\partial L}{\partial b} = \frac{1}{N}\, (P - Y)\, \mathbf{1}_N$$

where $\mathbf{1}_N$ is the length-$N$ vector of ones (summing the columns of $P - Y$).
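The batch formulas translate almost symbol-for-symbol into NumPy. A sketch (the function name `batch_grads` and the column-wise sample layout are assumptions matching the $D \times N$ convention above), which can be verified against the average of the per-sample gradients:

```python
import numpy as np

def batch_grads(W, b, X, labels):
    """Average gradients over a mini-batch stored column-wise.

    W: (C, D), b: (C,), X: (D, N) with one sample per column,
    labels: (N,) int array of true class indices.
    """
    N = X.shape[1]
    Z = W @ X + b[:, None]                            # logits, (C, N)
    Z = Z - Z.max(axis=0, keepdims=True)              # stabilize each column
    P = np.exp(Z) / np.exp(Z).sum(axis=0, keepdims=True)  # column-wise softmax
    Y = np.zeros_like(P)
    Y[labels, np.arange(N)] = 1.0                     # one-hot labels, (C, N)
    dZ = (P - Y) / N                                  # averaged error signal
    dW = dZ @ X.T                                     # (1/N)(P - Y) X^T, (C, D)
    db = dZ.sum(axis=1)                               # (1/N)(P - Y) 1_N, (C,)
    return dW, db
```

Note how the single matrix product `dZ @ X.T` sums the per-sample outer products $\delta^{(n)} x^{(n)\top}$ implicitly; no explicit loop over the batch is needed.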
Summary
By systematically applying the chain rule, we transformed a seemingly complex matrix calculus problem into clean, highly efficient linear algebra operations. Understanding this derivation is the fundamental key to grasping how classification networks “learn” from their mistakes.