Demystifying Softmax Loss: A Step-by-Step Derivation for Linear Classifiers

If you are diving into the mechanics of neural networks, you will inevitably encounter the backpropagation of the Softmax and Cross-Entropy loss. At first glance, the matrix calculus can feel a bit intimidating. However, once you break it down step-by-step using the chain rule, you will discover that the final gradients are incredibly elegant and intuitive.

In this post, we will walk through the complete mathematical derivation of the gradients for a linear classifier, moving from single variables to full matrix vectorization.

1. The Setup: Forward Propagation

Let’s define our variables for a single training sample. Assume our input features have a dimension of $D$, and we are classifying them into $C$ distinct classes.

  • Input $x$: a $D \times 1$ column vector.
  • Weights $W$: a $C \times D$ matrix.
  • Bias $b$: a $C \times 1$ column vector.
  • True label $y$: the true class index (or a $C \times 1$ one-hot encoded vector where only the true class index is $1$).

The Linear Layer (Logits):
First, we compute the raw scores (logits) $z$ for each class:

$$z = Wx + b$$

The Softmax Layer:
We convert these raw scores into a valid probability distribution. The probability of the sample belonging to class $i$ is:

$$p_i = \frac{e^{z_i}}{\sum_{j=1}^{C} e^{z_j}}$$

The Cross-Entropy Loss:
For a single sample where the true class is $y$, the loss only cares about the predicted probability assigned to that true class:

$$L = -\log(p_y)$$

By substituting the Softmax formula into the loss function, we get:

$$L = -z_y + \log\left(\sum_{j=1}^{C} e^{z_j}\right)$$
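Before moving to the gradients, the forward pass above can be sketched in NumPy (this code is an illustration, not from the original post; the max-subtraction inside the softmax is a standard numerical-stability trick that cancels in the ratio and does not change the math):

```python
import numpy as np

def forward(W, b, x, y):
    """Forward pass for one sample: logits -> softmax -> cross-entropy loss."""
    z = W @ x + b                 # logits, shape (C,)
    e = np.exp(z - np.max(z))     # shift by max for numerical stability
    p = e / e.sum()               # softmax probabilities
    loss = -np.log(p[y])          # cross-entropy for true class index y
    return z, p, loss

# Tiny example: D = 3 features, C = 4 classes
rng = np.random.default_rng(0)
W = rng.normal(size=(4, 3))
b = rng.normal(size=4)
x = rng.normal(size=3)

z, p, loss = forward(W, b, x, y=2)
print(p.sum())  # probabilities sum to 1
```

Note that the shapes match the definitions above: $W$ is $C \times D$, $x$ and $b$ are length-$D$ and length-$C$ vectors, and $p$ is a valid distribution over the $C$ classes.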

2. The Core Derivation: Gradient with respect to Logits ($z$)

To perform backpropagation, we first need to find how the loss changes with respect to each logit $z_i$. We denote this gradient as $\frac{\partial L}{\partial z_i}$.

We must split this into two scenarios: when $i$ is the true class, and when $i$ is any other class.

Case 1: Deriving for the true class ($i = y$)

$$\frac{\partial L}{\partial z_y} = \frac{\partial}{\partial z_y}\left(-z_y + \log\left(\sum_{j=1}^{C} e^{z_j}\right)\right)$$

Applying the chain rule to the logarithm term:

$$\frac{\partial L}{\partial z_y} = -1 + \frac{1}{\sum_{j=1}^{C} e^{z_j}} \cdot e^{z_y}$$

Notice that the fractional term is exactly our definition of $p_y$!

$$\frac{\partial L}{\partial z_y} = p_y - 1$$

Case 2: Deriving for an incorrect class ($i \neq y$)
Because the first term $-z_y$ does not depend on $z_i$, its derivative is $0$.

$$\frac{\partial L}{\partial z_i} = 0 + \frac{1}{\sum_{j=1}^{C} e^{z_j}} \cdot e^{z_i}$$

Again, the fractional term is $p_i$:

$$\frac{\partial L}{\partial z_i} = p_i$$

The Vectorized Form:
We can beautifully combine these two cases using an indicator function $\mathbb{I}(i=y)$ (which is $1$ if $i$ is the true class, and $0$ otherwise):

$$\frac{\partial L}{\partial z_i} = p_i - \mathbb{I}(i=y)$$

Let $dz$ be the gradient vector $\frac{\partial L}{\partial z}$. With $y$ written as its one-hot vector, the vectorized form is simply:

$$dz = p - y$$

Intuition Check: This result is incredibly logical. The gradient is simply the predicted probability minus the true probability. If the model is 100% confident and correct ($p_y \approx 1$), the gradient is nearly $0$, and the weights barely change. The larger the error, the larger the gradient pushing the model to learn.
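A quick way to trust the $dz = p - y$ result is to compare it against a numerical gradient. The sketch below (illustrative helper names, random test values) checks the analytic formula with central differences:

```python
import numpy as np

def softmax(z):
    e = np.exp(z - np.max(z))
    return e / e.sum()

def loss_fn(z, y):
    return -np.log(softmax(z)[y])

rng = np.random.default_rng(1)
z = rng.normal(size=5)
y = 3

# Analytic gradient: dz = p - y (one-hot)
p = softmax(z)
dz = p.copy()
dz[y] -= 1.0

# Numerical gradient via central differences
eps = 1e-6
dz_num = np.zeros(5)
for i in range(5):
    zp, zm = z.copy(), z.copy()
    zp[i] += eps
    zm[i] -= eps
    dz_num[i] = (loss_fn(zp, y) - loss_fn(zm, y)) / (2 * eps)

print(np.max(np.abs(dz - dz_num)))  # tiny finite-difference error
```

The two gradients agree to within finite-difference error, which is exactly what the two-case derivation above predicts.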

3. Gradients with respect to Weights ($W$) and Bias ($b$)

Now that we have the gradient of the loss with respect to the logits ($dz$), we use the chain rule to pass this error signal back to our parameters $W$ and $b$.

Deriving for $W$:
Let’s look at a single weight element $W_{ij}$. It only influences the loss through the specific logit $z_i$.

$$\frac{\partial L}{\partial W_{ij}} = \frac{\partial L}{\partial z_i} \cdot \frac{\partial z_i}{\partial W_{ij}}$$

Since $z_i = \sum_k W_{ik} x_k + b_i$, the local gradient $\frac{\partial z_i}{\partial W_{ij}}$ is simply $x_j$. Therefore:

$$\frac{\partial L}{\partial W_{ij}} = dz_i \cdot x_j$$

To vectorize this back into a $C \times D$ matrix (the same shape as $W$), we take the outer product of the column vector $dz$ and the row vector $x^T$:

$$\frac{\partial L}{\partial W} = dz \cdot x^T$$

Deriving for $b$:
Since $z = Wx + b$, the local derivative $\frac{\partial z}{\partial b}$ is the identity, so the error signal passes straight through:

$$\frac{\partial L}{\partial b} = dz$$
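Putting both single-sample gradients together, a minimal NumPy sketch (function names are illustrative, not from the post) with a finite-difference spot check is:

```python
import numpy as np

def backward_single(p, y, x):
    """Gradients of the loss w.r.t. W and b for one sample.
    p: softmax output (C,), y: true class index, x: input (D,)."""
    dz = p.copy()
    dz[y] -= 1.0              # dz = p - y (one-hot)
    dW = np.outer(dz, x)      # dL/dW = dz x^T, shape (C, D)
    db = dz                   # dL/db = dz
    return dW, db

def loss_fn(W, b, x, y):
    z = W @ x + b
    e = np.exp(z - z.max())
    return -np.log((e / e.sum())[y])

rng = np.random.default_rng(2)
W = rng.normal(size=(4, 3)); b = rng.normal(size=4)
x = rng.normal(size=3); y = 1

z = W @ x + b
e = np.exp(z - z.max()); p = e / e.sum()
dW, db = backward_single(p, y, x)

# Spot-check one weight entry against a finite difference
eps = 1e-6
W2 = W.copy(); W2[0, 2] += eps
num = (loss_fn(W2, b, x, y) - loss_fn(W, b, x, y)) / eps
print(abs(num - dW[0, 2]))  # small finite-difference error
```

The outer product reproduces the element-wise rule $\frac{\partial L}{\partial W_{ij}} = dz_i \cdot x_j$ for every entry at once.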

4. Scaling up: The Mini-Batch Form

In real-world training, we process $N$ samples at a time to stabilize gradients and utilize parallel computing.

  • $X$ becomes a $D \times N$ matrix.
  • $dZ = P - Y$ becomes a $C \times N$ matrix.

To find the average gradient across the entire batch, we perform a matrix multiplication and divide by $N$:

Batch Weight Gradient:

$$\frac{\partial L}{\partial W} = \frac{1}{N}\, dZ \cdot X^T$$

Batch Bias Gradient:
Sum the errors across all $N$ samples (the columns of $dZ$) for each class, then average:

$$\frac{\partial L}{\partial b} = \frac{1}{N} \sum_{n=1}^{N} dZ^{(n)}$$
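The batch formulas can be collected into one function. Below is a minimal NumPy sketch (an illustration, not the post's code) assuming the column-wise conventions above: $X$ is $D \times N$ and $Y$ holds one-hot label columns:

```python
import numpy as np

def batch_gradients(W, b, X, Y):
    """Average gradients over a mini-batch.
    X: (D, N) inputs, Y: (C, N) one-hot labels."""
    N = X.shape[1]
    Z = W @ X + b[:, None]            # logits, (C, N)
    E = np.exp(Z - Z.max(axis=0))     # stabilized exponentials
    P = E / E.sum(axis=0)             # column-wise softmax, (C, N)
    dZ = P - Y                        # (C, N)
    dW = (dZ @ X.T) / N               # (C, D), averaged over the batch
    db = dZ.sum(axis=1) / N           # (C,), averaged column sum
    return dW, db

# Usage: D = 3 features, C = 4 classes, N = 5 samples
rng = np.random.default_rng(3)
W = rng.normal(size=(4, 3)); b = rng.normal(size=4)
X = rng.normal(size=(3, 5))
labels = rng.integers(0, 4, size=5)
Y = np.eye(4)[labels].T               # one-hot columns, (4, 5)

dW, db = batch_gradients(W, b, X, Y)
print(dW.shape, db.shape)
```

The single matrix product $dZ \cdot X^T$ sums the per-sample outer products $dz^{(n)} \cdot x^{(n)T}$ implicitly, which is why the batch form is so efficient.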

Summary

By systematically applying the chain rule, we transformed a seemingly complex matrix calculus problem into clean, highly efficient linear algebra operations. Understanding this $dz = p - y$ dynamic is the fundamental key to grasping how classification networks “learn” from their mistakes.

About this Post

This post is written by Louis C Deng, licensed under CC BY-NC 4.0.

#Deep Learning #Mathematics