Demystifying Softmax Loss: A Step-by-Step Derivation for Linear Classifiers

If you are diving into the mechanics of neural networks, you will inevitably encounter the backpropagation of the Softmax and Cross-Entropy loss. At first glance, the matrix calculus can feel a bit intimidating. However, once you break it down step-by-step using the chain rule, you will discover that the final gradients are incredibly elegant and intuitive.

In this post, we will walk through the complete mathematical derivation of the gradients for a linear classifier, moving from single variables to full matrix vectorization.

1. The Setup: Forward Propagation

Let’s define our variables for a single training sample. Assume our input features have a dimension of $D$, and we are classifying them into $C$ distinct classes.

  • Input $x$: a $D \times 1$ column vector.
  • Weights $W$: a $C \times D$ matrix.
  • Bias $b$: a $C \times 1$ column vector.
  • True label $y$: the true class index (or a $C \times 1$ one-hot encoded vector where only the true class index is $1$).

The Linear Layer (Logits):
First, we compute the raw scores (logits) $z$ for each class:

$$z = Wx + b$$

The Softmax Layer:
We convert these raw scores into a valid probability distribution. The probability of the sample belonging to class $i$ is:

$$p_i = \frac{e^{z_i}}{\sum_{j=1}^{C} e^{z_j}}$$

The Cross-Entropy Loss:
For a single sample where the true class is $y$, the loss only cares about the predicted probability assigned to that true class:

$$L = -\log(p_y)$$

By substituting the Softmax formula into the loss function, we get:

$$L = -z_y + \log\left(\sum_{j=1}^{C} e^{z_j}\right)$$
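Before moving to the gradients, the forward pass above can be sketched in NumPy (this code is an illustration, not from the original post; the max-subtraction inside the softmax is a standard numerical-stability trick that cancels in the ratio and does not change the math):

```python
import numpy as np

def forward(W, b, x, y):
    """Forward pass for one sample: logits -> softmax -> cross-entropy loss."""
    z = W @ x + b                 # logits, shape (C,)
    e = np.exp(z - np.max(z))     # shift by max for numerical stability
    p = e / e.sum()               # softmax probabilities
    loss = -np.log(p[y])          # cross-entropy for true class index y
    return z, p, loss

# Tiny example: D = 3 features, C = 4 classes
rng = np.random.default_rng(0)
W = rng.normal(size=(4, 3))
b = rng.normal(size=4)
x = rng.normal(size=3)

z, p, loss = forward(W, b, x, y=2)
print(p.sum())  # probabilities sum to 1
```

Note that the shapes match the definitions above: $W$ is $C \times D$, $x$ and $b$ are length-$D$ and length-$C$ vectors, and $p$ is a valid distribution over the $C$ classes.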

2. The Core Derivation: Gradient with respect to Logits ($z$)

To perform backpropagation, we first need to find how the loss changes with respect to each logit $z_i$. We denote this gradient as $\frac{\partial L}{\partial z_i}$.

We must split this into two scenarios: when $i$ is the true class, and when $i$ is any other class.

Case 1: Deriving for the true class ($i = y$)

$$\frac{\partial L}{\partial z_y} = \frac{\partial}{\partial z_y}\left(-z_y + \log\left(\sum_{j=1}^{C} e^{z_j}\right)\right)$$

Applying the chain rule to the logarithm term:

$$\frac{\partial L}{\partial z_y} = -1 + \frac{1}{\sum_{j=1}^{C} e^{z_j}} \cdot e^{z_y}$$

Notice that the fractional term is exactly our definition of $p_y$!

$$\frac{\partial L}{\partial z_y} = p_y - 1$$

Case 2: Deriving for an incorrect class ($i \neq y$)
Because the first term $-z_y$ does not depend on $z_i$, its derivative is $0$.

$$\frac{\partial L}{\partial z_i} = 0 + \frac{1}{\sum_{j=1}^{C} e^{z_j}} \cdot e^{z_i}$$

Again, the fractional term is $p_i$:

$$\frac{\partial L}{\partial z_i} = p_i$$

The Vectorized Form:
We can beautifully combine these two cases using an indicator function $\mathbb{I}(i=y)$ (which is $1$ if $i$ is the true class, and $0$ otherwise):

$$\frac{\partial L}{\partial z_i} = p_i - \mathbb{I}(i=y)$$

Let $dz$ be the gradient vector $\frac{\partial L}{\partial z}$. With $y$ written as its one-hot vector, the vectorized form is simply:

$$dz = p - y$$

Intuition Check: This result is incredibly logical. The gradient is simply the predicted probability minus the true probability. If the model is 100% confident and correct ($p_y \approx 1$), the gradient is nearly $0$, and the weights barely change. The larger the error, the larger the gradient pushing the model to learn.
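A quick way to trust the $dz = p - y$ result is to compare it against a numerical gradient. The sketch below (illustrative helper names, random test values) checks the analytic formula with central differences:

```python
import numpy as np

def softmax(z):
    e = np.exp(z - np.max(z))
    return e / e.sum()

def loss_fn(z, y):
    return -np.log(softmax(z)[y])

rng = np.random.default_rng(1)
z = rng.normal(size=5)
y = 3

# Analytic gradient: dz = p - y (one-hot)
p = softmax(z)
dz = p.copy()
dz[y] -= 1.0

# Numerical gradient via central differences
eps = 1e-6
dz_num = np.zeros(5)
for i in range(5):
    zp, zm = z.copy(), z.copy()
    zp[i] += eps
    zm[i] -= eps
    dz_num[i] = (loss_fn(zp, y) - loss_fn(zm, y)) / (2 * eps)

print(np.max(np.abs(dz - dz_num)))  # tiny finite-difference error
```

The two gradients agree to within finite-difference error, which is exactly what the two-case derivation above predicts.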

3. Gradients with respect to Weights ($W$) and Bias ($b$)

Now that we have the gradient of the loss with respect to the logits ($dz$), we use the chain rule to pass this error signal back to our parameters $W$ and $b$.

Deriving for $W$:
Let’s look at a single weight element $W_{ij}$. It only influences the loss through the specific logit $z_i$.

$$\frac{\partial L}{\partial W_{ij}} = \frac{\partial L}{\partial z_i} \cdot \frac{\partial z_i}{\partial W_{ij}}$$

Since $z_i = \sum_k W_{ik} x_k + b_i$, the local gradient $\frac{\partial z_i}{\partial W_{ij}}$ is simply $x_j$. Therefore:

$$\frac{\partial L}{\partial W_{ij}} = dz_i \cdot x_j$$

To vectorize this back into a $C \times D$ matrix (the same shape as $W$), we take the outer product of the column vector $dz$ and the row vector $x^T$:

$$\frac{\partial L}{\partial W} = dz \cdot x^T$$

Deriving for $b$:
Since $z = Wx + b$, the local derivative $\frac{\partial z}{\partial b}$ is the identity, so the error signal passes straight through:

$$\frac{\partial L}{\partial b} = dz$$
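Putting both single-sample gradients together, a minimal NumPy sketch (function names are illustrative, not from the post) with a finite-difference spot check is:

```python
import numpy as np

def backward_single(p, y, x):
    """Gradients of the loss w.r.t. W and b for one sample.
    p: softmax output (C,), y: true class index, x: input (D,)."""
    dz = p.copy()
    dz[y] -= 1.0              # dz = p - y (one-hot)
    dW = np.outer(dz, x)      # dL/dW = dz x^T, shape (C, D)
    db = dz                   # dL/db = dz
    return dW, db

def loss_fn(W, b, x, y):
    z = W @ x + b
    e = np.exp(z - z.max())
    return -np.log((e / e.sum())[y])

rng = np.random.default_rng(2)
W = rng.normal(size=(4, 3)); b = rng.normal(size=4)
x = rng.normal(size=3); y = 1

z = W @ x + b
e = np.exp(z - z.max()); p = e / e.sum()
dW, db = backward_single(p, y, x)

# Spot-check one weight entry against a finite difference
eps = 1e-6
W2 = W.copy(); W2[0, 2] += eps
num = (loss_fn(W2, b, x, y) - loss_fn(W, b, x, y)) / eps
print(abs(num - dW[0, 2]))  # small finite-difference error
```

The outer product reproduces the element-wise rule $\frac{\partial L}{\partial W_{ij}} = dz_i \cdot x_j$ for every entry at once.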

4. Scaling up: The Mini-Batch Form

In real-world training, we process $N$ samples at a time to stabilize gradients and utilize parallel computing.

  • $X$ becomes a $D \times N$ matrix.
  • $dZ = P - Y$ becomes a $C \times N$ matrix.

To find the average gradient across the entire batch, we perform a matrix multiplication and divide by $N$:

Batch Weight Gradient:

$$\frac{\partial L}{\partial W} = \frac{1}{N}\, dZ \cdot X^T$$

Batch Bias Gradient:
Sum the errors across all $N$ samples (the columns of $dZ$) for each class, then average:

$$\frac{\partial L}{\partial b} = \frac{1}{N} \sum_{n=1}^{N} dZ^{(n)}$$
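The batch formulas can be collected into one function. Below is a minimal NumPy sketch (an illustration, not the post's code) assuming the column-wise conventions above: $X$ is $D \times N$ and $Y$ holds one-hot label columns:

```python
import numpy as np

def batch_gradients(W, b, X, Y):
    """Average gradients over a mini-batch.
    X: (D, N) inputs, Y: (C, N) one-hot labels."""
    N = X.shape[1]
    Z = W @ X + b[:, None]            # logits, (C, N)
    E = np.exp(Z - Z.max(axis=0))     # stabilized exponentials
    P = E / E.sum(axis=0)             # column-wise softmax, (C, N)
    dZ = P - Y                        # (C, N)
    dW = (dZ @ X.T) / N               # (C, D), averaged over the batch
    db = dZ.sum(axis=1) / N           # (C,), averaged column sum
    return dW, db

# Usage: D = 3 features, C = 4 classes, N = 5 samples
rng = np.random.default_rng(3)
W = rng.normal(size=(4, 3)); b = rng.normal(size=4)
X = rng.normal(size=(3, 5))
labels = rng.integers(0, 4, size=5)
Y = np.eye(4)[labels].T               # one-hot columns, (4, 5)

dW, db = batch_gradients(W, b, X, Y)
print(dW.shape, db.shape)
```

The single matrix product $dZ \cdot X^T$ sums the per-sample outer products $dz^{(n)} \cdot x^{(n)T}$ implicitly, which is why the batch form is so efficient.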

Summary

By systematically applying the chain rule, we transformed a seemingly complex matrix calculus problem into clean, highly efficient linear algebra operations. Understanding this $dz = p - y$ dynamic is the fundamental key to grasping how classification networks “learn” from their mistakes.

About this Post

This post is written by Louis C Deng, licensed under CC BY-NC 4.0.

#Deep Learning #Mathematics