Uncovering Batch & Layer Normalization

Batch normalization and layer normalization improve training stability and reduce sensitivity to initialization by normalizing intermediate activations.

When training a deep neural network, the distribution of inputs to each layer can change as the network’s weights are updated. If the weights become too large, the inputs to subsequent layers may grow excessively; if the weights shrink toward zero, the inputs may diminish accordingly. These shifts in input distributions make training more difficult, as each layer must continuously adapt to changing conditions. This phenomenon is known as internal covariate shift.

Normalization allows us to use much higher learning rates and be less careful about initialization.

Plain Normalization

If normalization (e.g., centering the activations to zero mean) is performed outside of the gradient descent step, the optimizer will remain “blind” to the normalization’s effects.

In formal terms, if the optimizer treats the mean-subtraction operation as a fixed constant, the gradient updates will fail to reflect the true dynamics of the network.

Consider a layer that computes $x = u + b$ and subsequently normalizes it: $\hat{x} = x - E[x]$.

If the gradient $\frac{\partial \ell}{\partial b}$ is computed without considering how $E[x]$ depends on $b$, the optimizer will attempt to adjust $b$ to minimize the loss.

Because the normalization step subsequently subtracts the updated mean, the update $\Delta b$ is effectively cancelled. The output $\hat{x}$ remains numerically identical to its state prior to the update:

$$(u + b + \Delta b) - E[u + b + \Delta b] = u + b - E[u + b]$$

Since the output, and consequently the loss, remains invariant despite the update, the optimizer will continue to increase $b$ in a futile attempt to reach a lower loss. This results in unbounded parameter growth while the network's predictive performance stagnates.
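We can verify this cancellation numerically. Below is a minimal NumPy sketch (the values of `u`, `b`, and `delta_b` are arbitrary, chosen only for illustration):

```python
import numpy as np

u = np.array([0.5, -1.2, 2.0, 0.3])  # layer inputs for a small mini-batch
b, delta_b = 1.0, 10.0               # bias before the update, and the update itself

x_hat_before = (u + b) - np.mean(u + b)
x_hat_after = (u + b + delta_b) - np.mean(u + b + delta_b)

# The mean subtraction absorbs delta_b entirely, so the output is unchanged.
print(np.allclose(x_hat_before, x_hat_after))  # True
```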

To maintain training stability, normalization must be included within the computational graph so that gradients correctly capture its dependence on the parameters.

However, performing full whitening across all examples can be computationally expensive. This is why techniques like Batch Normalization and Layer Normalization are used instead.

Batch Normalization

Batch Normalization performs normalization for each training mini-batch.

In Batch Normalization, we normalize each scalar feature independently, giving it zero mean and unit variance.

$$\hat{x}^{(k)} = \frac{x^{(k)} - \mathbb{E}[x^{(k)}]}{\sqrt{\mathrm{Var}[x^{(k)}] + \epsilon}}$$

But simply normalizing each input of a layer may change what the layer can represent. The authors therefore ensure that the transformation inserted in the network can represent the identity transform, by introducing a learnable scale and shift:

$$y^{(k)} = \gamma^{(k)} \hat{x}^{(k)} + \beta^{(k)}$$

These parameters are learned along with the original model parameters and restore the representational power of the network.

By setting $\gamma^{(k)} = \sqrt{\mathrm{Var}[x^{(k)}]}$ and $\beta^{(k)} = \mathbb{E}[x^{(k)}]$, we could recover the original activations, if that were the optimal thing to do.
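A small sketch makes this concrete (note that with the $\epsilon$ term included, exact recovery requires $\gamma^{(k)} = \sqrt{\mathrm{Var}[x^{(k)}] + \epsilon}$):

```python
import numpy as np

eps = 1e-5
x = np.random.randn(64) * 3.0 + 5.0  # activations with mean ~5 and std ~3

x_hat = (x - x.mean()) / np.sqrt(x.var() + eps)

# Setting gamma = sqrt(Var[x] + eps) and beta = E[x] undoes the normalization.
gamma, beta = np.sqrt(x.var() + eps), x.mean()
y = gamma * x_hat + beta

print(np.allclose(y, x))  # True: the identity transform is representable
```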

Algorithm 1: Batch Normalization Forward Pass

The forward pass of the Batch Normalization layer transforms a mini-batch of activations $\mathcal{B} = \{x_{1 \dots m}\}$ into a normalized and linearly scaled output $\{y_i\}$. This process ensures that the input to subsequent layers maintains a stable distribution throughout training.

Input: Values of $x$ over a mini-batch: $\mathcal{B} = \{x_{1 \dots m}\}$; parameters to be learned: $\gamma, \beta$.
Output: $\{y_i = \text{BN}_{\gamma, \beta}(x_i)\}$.

  1. Mini-batch Mean:

$$\mu_{\mathcal{B}} \leftarrow \frac{1}{m} \sum_{i=1}^m x_i$$

  2. Mini-batch Variance:

$$\sigma_{\mathcal{B}}^2 \leftarrow \frac{1}{m} \sum_{i=1}^m (x_i - \mu_{\mathcal{B}})^2$$

  3. Normalize:

$$\hat{x}_i \leftarrow \frac{x_i - \mu_{\mathcal{B}}}{\sqrt{\sigma_{\mathcal{B}}^2 + \epsilon}}$$

  4. Scale and Shift:

$$y_i \leftarrow \gamma \hat{x}_i + \beta \equiv \text{BN}_{\gamma, \beta}(x_i)$$
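The four steps translate almost line for line into NumPy. The sketch below assumes a mini-batch `x` of shape `(m, d)`; the function name `batchnorm_forward` and the `cache` tuple are my own conventions for the later backward pass, not part of the original algorithm:

```python
import numpy as np

def batchnorm_forward(x, gamma, beta, eps=1e-5):
    """BN forward pass over a mini-batch x of shape (m, d).

    Each of the d features is normalized independently across
    the m examples, then scaled by gamma and shifted by beta.
    """
    mu = x.mean(axis=0)                      # step 1: mini-batch mean
    var = x.var(axis=0)                      # step 2: mini-batch variance
    x_hat = (x - mu) / np.sqrt(var + eps)    # step 3: normalize
    y = gamma * x_hat + beta                 # step 4: scale and shift
    cache = (x, x_hat, mu, var, gamma, eps)  # saved for the backward pass
    return y, cache
```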

Differentiation

To train the network using stochastic gradient descent, we must compute the gradient of the loss function $\ell$ with respect to the input $x_i$ and the learnable parameters $\gamma$ and $\beta$. This is achieved by applying the chain rule through the computational graph of the BN transform.

1. Gradients for Learnable Parameters

The parameters $\gamma$ and $\beta$ are updated based on their contribution to all samples in the mini-batch:

  • Gradient w.r.t. $\beta$: Since $\frac{\partial y_i}{\partial \beta} = 1$, the gradient is the sum of the upstream gradients:

$$\frac{\partial \ell}{\partial \beta} = \sum_{i=1}^m \frac{\partial \ell}{\partial y_i}$$

  • Gradient w.r.t. $\gamma$: Since $\frac{\partial y_i}{\partial \gamma} = \hat{x}_i$, the gradient is the sum of the products of the upstream gradient and the normalized input:

$$\frac{\partial \ell}{\partial \gamma} = \sum_{i=1}^m \frac{\partial \ell}{\partial y_i} \cdot \hat{x}_i$$
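In code, both gradients are one-line reductions over the batch axis. A sketch, where `dy` denotes the upstream gradient $\frac{\partial \ell}{\partial y}$ and the helper name is my own:

```python
import numpy as np

def bn_param_grads(dy, x_hat):
    """Gradients w.r.t. gamma and beta, for dy and x_hat of shape (m, d)."""
    dbeta = dy.sum(axis=0)             # sum of the upstream gradients
    dgamma = (dy * x_hat).sum(axis=0)  # weighted by the normalized inputs
    return dgamma, dbeta
```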

2. Gradient w.r.t. Intermediate Statistics

The gradient propagates backward from $y_i$ to the normalized value $\hat{x}_i$, and then to the batch statistics $\mu_{\mathcal{B}}$ and $\sigma_{\mathcal{B}}^2$:

  • Gradient w.r.t. $\hat{x}_i$:

$$\frac{\partial \ell}{\partial \hat{x}_i} = \frac{\partial \ell}{\partial y_i} \cdot \gamma$$

  • Gradient w.r.t. $\sigma_{\mathcal{B}}^2$: This accounts for how the variance affects every $\hat{x}_i$ in the batch:

$$\frac{\partial \ell}{\partial \sigma_{\mathcal{B}}^2} = \sum_{i=1}^m \frac{\partial \ell}{\partial \hat{x}_i} \cdot (x_i - \mu_{\mathcal{B}}) \cdot \frac{-1}{2} (\sigma_{\mathcal{B}}^2 + \epsilon)^{-3/2}$$

  • Gradient w.r.t. $\mu_{\mathcal{B}}$: The mean affects the loss both directly through the numerator of $\hat{x}_i$ and indirectly through the variance calculation:

$$\frac{\partial \ell}{\partial \mu_{\mathcal{B}}} = \left( \sum_{i=1}^m \frac{\partial \ell}{\partial \hat{x}_i} \cdot \frac{-1}{\sqrt{\sigma_{\mathcal{B}}^2 + \epsilon}} \right) + \frac{\partial \ell}{\partial \sigma_{\mathcal{B}}^2} \cdot \frac{\sum_{i=1}^m -2(x_i - \mu_{\mathcal{B}})}{m}$$

3. Gradient w.r.t. Input xix_i

Finally, the gradient with respect to the original input $x_i$ is a combination of three paths: the direct path through $\hat{x}_i$, the path through the variance $\sigma_{\mathcal{B}}^2$, and the path through the mean $\mu_{\mathcal{B}}$:

$$\frac{\partial \ell}{\partial x_i} = \frac{\partial \ell}{\partial \hat{x}_i} \cdot \frac{1}{\sqrt{\sigma_{\mathcal{B}}^2 + \epsilon}} + \frac{\partial \ell}{\partial \sigma_{\mathcal{B}}^2} \cdot \frac{2(x_i - \mu_{\mathcal{B}})}{m} + \frac{\partial \ell}{\partial \mu_{\mathcal{B}}} \cdot \frac{1}{m}$$
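Putting the three parts together yields the complete backward pass. This is a minimal NumPy sketch that consumes the `cache` produced by the hypothetical `batchnorm_forward` above:

```python
import numpy as np

def batchnorm_backward(dy, cache):
    """BN backward pass; dy is the upstream gradient of shape (m, d)."""
    x, x_hat, mu, var, gamma, eps = cache
    m = x.shape[0]
    inv_std = 1.0 / np.sqrt(var + eps)

    # 1. Gradients for the learnable parameters.
    dbeta = dy.sum(axis=0)
    dgamma = (dy * x_hat).sum(axis=0)

    # 2. Gradients w.r.t. the intermediate statistics.
    dx_hat = dy * gamma
    dvar = (dx_hat * (x - mu) * -0.5 * inv_std**3).sum(axis=0)
    dmu = (dx_hat * -inv_std).sum(axis=0) + dvar * (-2.0 * (x - mu)).sum(axis=0) / m

    # 3. Gradient w.r.t. the input: direct, variance, and mean paths combined.
    dx = dx_hat * inv_std + dvar * 2.0 * (x - mu) / m + dmu / m
    return dx, dgamma, dbeta
```

A finite-difference check against `batchnorm_forward` is a quick way to confirm that the analytic gradients are correct.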

Training and Inference with Batch Normalization

The normalization of activations enables efficient training, but it is neither necessary nor desirable during inference. Once the network has been trained, we therefore normalize using population statistics rather than mini-batch statistics:

$$\hat{x} = \frac{x - \mathbb{E}[x]}{\sqrt{\mathrm{Var}[x] + \epsilon}}$$

In practice, we use fixed moving averages computed during training. The mini-batch statistics are stochastic estimates of the true data distribution, while the moving averages serve as stable, low-variance estimates of the population mean $\mathbb{E}[x]$ and population variance $\mathrm{Var}[x]$.

These running statistics are typically updated at each training step $t$ using a momentum coefficient $\alpha$ (usually 0.9 or 0.99):

$$\hat{\mu}_{\text{new}} = \alpha \hat{\mu}_{\text{old}} + (1 - \alpha) \mu_{\mathcal{B}}$$

$$\hat{\sigma}^2_{\text{new}} = \alpha \hat{\sigma}^2_{\text{old}} + (1 - \alpha) \sigma^2_{\mathcal{B}}$$
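A sketch of the bookkeeping, matching the update rule above (the helper names are mine, not a library API):

```python
import numpy as np

def update_running_stats(mu_run, var_run, mu_b, var_b, alpha=0.9):
    """Exponential moving average of the batch statistics, momentum alpha."""
    mu_run = alpha * mu_run + (1 - alpha) * mu_b
    var_run = alpha * var_run + (1 - alpha) * var_b
    return mu_run, var_run

def batchnorm_inference(x, gamma, beta, mu_run, var_run, eps=1e-5):
    """At test time, normalize with the fixed running estimates."""
    x_hat = (x - mu_run) / np.sqrt(var_run + eps)
    return gamma * x_hat + beta
```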

Layer Normalization

Layer Normalization (LayerNorm) is an alternative normalization technique that normalizes across the features of a single sample rather than across a mini-batch.

For an input vector $x \in \mathbb{R}^d$, LayerNorm computes the mean and variance over the feature dimension, ensuring that each individual sample has zero mean and unit variance.

Unlike Batch Normalization, it does not depend on batch statistics, which makes it particularly suitable for recurrent neural networks and transformer architectures where batch sizes may be small or variable.

Similar to BatchNorm, LayerNorm includes learnable parameters $\gamma$ and $\beta$ to scale and shift the normalized output, preserving the model's representational capacity.
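A minimal sketch: the only substantive change from the BN forward pass above is the axis over which the statistics are computed (per sample instead of per feature):

```python
import numpy as np

def layernorm_forward(x, gamma, beta, eps=1e-5):
    """LayerNorm over the feature dimension of x, shape (m, d).

    Statistics are computed per sample (axis=-1), so the output
    for one sample is independent of the rest of the batch.
    """
    mu = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    x_hat = (x - mu) / np.sqrt(var + eps)
    return gamma * x_hat + beta
```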

BN vs. LN
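In short, BatchNorm normalizes each feature across the mini-batch, while LayerNorm normalizes each sample across its features. The toy sketch below shows that the difference lives in a single place, the reduction axis:

```python
import numpy as np

x = np.random.randn(4, 8)  # 4 samples, 8 features

bn_mu, bn_var = x.mean(axis=0), x.var(axis=0)  # BN: per feature, across samples
ln_mu, ln_var = x.mean(axis=1), x.var(axis=1)  # LN: per sample, across features

print(bn_mu.shape)  # (8,) -- one statistic per feature
print(ln_mu.shape)  # (4,) -- one statistic per sample
```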

Reference

  1. Ioffe, S., & Szegedy, C. (2015). Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift. arXiv:1502.03167.
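  2. Ba, J. L., Kiros, J. R., & Hinton, G. E. (2016). Layer Normalization. arXiv:1607.06450.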

About this Post

This post is written by Louis C Deng, licensed under CC BY-NC 4.0.

#Normalization #Deep Learning