
CS231n Lecture Note: Generative Models

  • Discriminative Model: Learns the conditional probability $p(y|x)$. It focuses on predicting a label $y$ given the input data $x$.
  • Generative Model: Learns the probability distribution of the data itself, $p(x)$.
  • Conditional Generative Model: Learns the probability $p(x|y)$, which is the likelihood of observing specific data $x$ given a certain label $y$.

The Density Function $p(x)$:

  • Likelihood: A density function assigns a positive number to each possible $x$; a higher number indicates that the specific value of $x$ is more likely.
  • Normalization: Density functions must be normalized so that the total probability across all possible values of $x$ equals 1, expressed as:

$$\int_{X} p(x)\,dx = 1$$

  • Competition: Because the total area must equal 1, different values of $x$ “compete” for density within the distribution.

With generative models, all possible images compete for probability mass; the model can “reject” unreasonable inputs by assigning them little probability mass.

Bayes’ Theorem:

$$P(x|y) = \frac{P(y|x)\,P(x)}{P(y)}$$

This means that with two of the three models above, we can derive the third.

Autoregressive Models

Maximum Likelihood Estimation (MLE): The core objective is to find the best parameters $W$ for our function $f(x, W)$ so that it accurately models the true distribution $p(x)$ of our data.

To train the model using a dataset $x^{(1)}, x^{(2)}, \dots, x^{(N)}$, we solve for the optimal weights $W^*$ through these steps:

We start by trying to maximize the joint probability of all training data points. This is represented as the product of individual probabilities: $\arg\max_{W} \prod_{i} p(x^{(i)})$.

Since multiplying many small probabilities can lead to numerical instability (underflow), we apply a logarithm. Because the log is a monotonic function, maximizing the log-likelihood is equivalent to maximizing the likelihood, but it allows us to swap the product for a sum: $\arg\max_{W} \sum_{i} \log p(x^{(i)})$.

By substituting our model function $f(x, W)$ for $p(x)$, we arrive at our final objective: $\arg\max_{W} \sum_{i} \log f(x^{(i)}, W)$.
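As a sanity check of this objective, here is a minimal sketch, assuming a hypothetical `model` object whose `log_prob(x)` method returns $\log f(x, W)$ as a differentiable scalar:

```python
# Minimal sketch of the MLE objective, assuming a hypothetical `model` whose
# log_prob(x) method returns log f(x, W) as a differentiable scalar tensor.
def negative_log_likelihood(model, dataset):
    # Maximizing sum_i log f(x^(i), W) is the same as minimizing its negative.
    return -sum(model.log_prob(x) for x in dataset)

# A gradient step on W would then look like:
#   loss = negative_log_likelihood(model, dataset)
#   loss.backward(); optimizer.step()
```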

Autoregressive models assume that the data $x$ is not a single static point, but a sequence of components:

$$x = (x_1, x_2, \dots, x_T)$$

An autoregressive model predicts the “next” part of the data based on what has already been generated.

It iterates through the elements, determining the probability of the next element given all previous elements.

To calculate the probability of the entire sequence $p(x)$, these models decompose the joint probability using the chain rule. Instead of looking at the whole sequence at once, the model breaks it down into a series of conditional probabilities:

$$p(x) = \prod_{t=1}^{T} p(x_t \mid x_1, \dots, x_{t-1})$$
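A minimal sketch of this factorization, assuming a hypothetical model whose `next_dist(prefix)` method returns the distribution $p(x_t \mid x_1, \dots, x_{t-1})$ (e.g. a `torch.distributions` object):

```python
# Minimal sketch of the chain-rule factorization, assuming a hypothetical model whose
# next_dist(prefix) returns the distribution p(x_t | x_1, ..., x_{t-1})
# (e.g. a torch.distributions object with a log_prob method).
def sequence_log_prob(model, x):
    log_p = 0.0
    for t in range(len(x)):
        dist = model.next_dist(x[:t])        # condition on all previously seen elements
        log_p = log_p + dist.log_prob(x[t])  # accumulate log p(x_t | x_<t)
    return log_p                             # log p(x) = sum_t log p(x_t | x_<t)
```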

Variational Autoencoders (VAE)

Variational Autoencoders (VAEs) define an intractable density that we cannot explicitly compute or optimize. Instead, we can derive and optimize a lower bound on it.

The autoencoder is an unsupervised method for learning to extract features $z$ from inputs $x$, without labels.

It trains an encoder and a decoder: the encoder produces an intermediate representation of the input, and the decoder reconstructs the input data from it. The intermediate representation $z$ has lower dimensionality than the input.

The loss function is the L2 distance between the input and the reconstructed data.

After training, we can use the encoder for downstream tasks.
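A minimal autoencoder sketch with the L2 reconstruction loss described above; the layer sizes (784-dimensional inputs, 32-dimensional codes) are illustrative assumptions, not from the lecture:

```python
import torch.nn as nn

# Minimal sketch of a plain autoencoder; sizes are illustrative assumptions.
class Autoencoder(nn.Module):
    def __init__(self, x_dim=784, z_dim=32):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(x_dim, 128), nn.ReLU(), nn.Linear(128, z_dim))
        self.decoder = nn.Sequential(nn.Linear(z_dim, 128), nn.ReLU(), nn.Linear(128, x_dim))

    def forward(self, x):
        z = self.encoder(x)      # lower-dimensional feature z
        return self.decoder(z)   # reconstruction of x

def reconstruction_loss(model, x):
    # L2 distance between the input and the reconstructed data
    return ((model(x) - x) ** 2).mean()
```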

For generative tasks, we can force all $z$ to come from a known distribution so that we can sample $z$ to generate outputs.

We assume the latent factor $z$ follows a Gaussian distribution. We train the model with maximum likelihood.

$$p_\theta(x) = \frac{p_\theta(x|z)\,p_\theta(z)}{p_\theta(z|x)}$$

We apply Bayes’ rule. Since we cannot compute the posterior $p_\theta(z|x)$, we train another network that learns $q_\phi(z|x) \approx p_\theta(z|x)$.

For VAE, we train two networks.

The Encoder Network ($q_\phi(z|x)$): takes input data $x$ and maps it to a distribution over latent codes $z$. Instead of a single point, it outputs the parameters of a distribution (typically Gaussian): a mean ($\mu_{z|x}$) and a variance ($\Sigma_{z|x}$).

The Decoder Network ($p_\theta(x|z)$): takes a latent code $z$ and reconstructs the data $x$. It outputs a distribution over the data, defined by its own mean ($\mu_{x|z}$) and variance ($\sigma^2$).

The fundamental challenge in VAEs is that the true posterior $p_\theta(z|x)$ is intractable. To solve this, VAEs use Variational Inference:

  1. Approximation: We ensure that our learned encoder distribution $q_\phi(z|x)$ is approximately equal to the true posterior $p_\theta(z|x)$.
  2. Estimation: If this approximation holds, we can estimate the data likelihood $p_\theta(x)$ using the following relationship:

$$p_\theta(x) \approx \frac{p_\theta(x|z)\,p(z)}{q_\phi(z|x)}$$

  3. Joint Training: We jointly train both the encoder and decoder to maximize this evidence.

ELBO

Building on the previous concepts, this derivation explains how we can optimize the log-likelihood of our data, $\log p_\theta(x)$, even when the true posterior is unknown.

$$\log p_\theta(x) = \log \frac{p_\theta(x|z)\,p(z)}{p_\theta(z|x)} = \log \frac{p_\theta(x|z)\,p(z)\,q_\phi(z|x)}{p_\theta(z|x)\,q_\phi(z|x)}$$

$$= E_{z \sim q_\phi(z|x)}[\log p_\theta(x|z)] - D_{KL}(q_\phi(z|x), p(z)) + D_{KL}(q_\phi(z|x), p_\theta(z|x))$$
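To see where the three terms come from, note that $\log p_\theta(x)$ does not depend on $z$, so we can take the expectation of the last expression over $z \sim q_\phi(z|x)$ and split the logarithm of the product (a standard intermediate step, spelled out here):

$$\log p_\theta(x) = E_{z \sim q_\phi(z|x)}\left[\log p_\theta(x|z)\right] + E_{z \sim q_\phi(z|x)}\left[\log \frac{p(z)}{q_\phi(z|x)}\right] + E_{z \sim q_\phi(z|x)}\left[\log \frac{q_\phi(z|x)}{p_\theta(z|x)}\right]$$

The second expectation is $-D_{KL}(q_\phi(z|x), p(z))$ and the third is $D_{KL}(q_\phi(z|x), p_\theta(z|x))$, which gives the decomposition above.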

The derivation uses the properties of logarithms and expectations to decompose the log-likelihood into three distinct terms:

  1. The Reconstruction Term ($E_{z \sim q_\phi(z|x)}[\log p_\theta(x|z)]$):

    • This measures how well the decoder can reconstruct the original input $x$ from the latent code $z$ sampled from the encoder.
    • In training, we want to maximize this to ensure the model retains data fidelity.
  2. The Prior Regularization Term ($D_{KL}(q_\phi(z|x), p(z))$):

    • This is the Kullback-Leibler (KL) divergence between the encoder’s distribution and a simple prior (usually a standard normal distribution).
    • It acts as a regularizer, forcing the latent space to be continuous and well-structured. We subtract this term, meaning we want the encoder to stay close to our prior.
  3. The Approximation Error ($D_{KL}(q_\phi(z|x), p_\theta(z|x))$):

    • This measures the divergence between our approximate posterior (the encoder) and the true, intractable posterior.
    • Because KL divergence is always $\ge 0$, we know that the first two terms combined form a lower bound on the total log-likelihood.

Since we cannot calculate the true posterior (the third term), we ignore it and maximize the first two terms—the Evidence Lower Bound (ELBO). By maximizing the ELBO, we simultaneously improve the quality of our data generation and the efficiency of our latent representation.

This ELBO is the lower bound on the log-likelihood that we actually maximize.

To train this, we first run the input data through the encoder to get a distribution over $z$, which should match the unit Gaussian prior. We then sample $z$ from that distribution and run it through the decoder to get the predicted data mean, which should match $x$ under the L2 loss.

The reconstruction loss wants $\Sigma_{z|x} = 0$ and $\mu_{z|x}$ to be unique for each $x$, so the decoder can deterministically reconstruct $x$.

The prior loss wants $\Sigma_{z|x} = \mathbf{I}$ and $\mu_{z|x} = 0$, so the encoder output is always a unit Gaussian.
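A minimal VAE training sketch that puts these pieces together, assuming flattened 784-dimensional inputs and a 32-dimensional latent (both illustrative); the reparameterization trick ($z = \mu + \sigma \cdot \epsilon$) keeps the sampling step differentiable:

```python
import torch
import torch.nn as nn

# Minimal VAE sketch; sizes (784-dim inputs, 32-dim latents) are illustrative assumptions.
class VAE(nn.Module):
    def __init__(self, x_dim=784, z_dim=32, h_dim=256):
        super().__init__()
        self.enc = nn.Sequential(nn.Linear(x_dim, h_dim), nn.ReLU())
        self.enc_mu = nn.Linear(h_dim, z_dim)      # mu_{z|x}
        self.enc_logvar = nn.Linear(h_dim, z_dim)  # log of (diagonal) Sigma_{z|x}
        self.dec = nn.Sequential(nn.Linear(z_dim, h_dim), nn.ReLU(),
                                 nn.Linear(h_dim, x_dim))  # predicts mu_{x|z}

    def forward(self, x):
        h = self.enc(x)
        mu, logvar = self.enc_mu(h), self.enc_logvar(h)
        # Reparameterization trick: z = mu + sigma * eps, so gradients flow to the encoder.
        z = mu + torch.exp(0.5 * logvar) * torch.randn_like(mu)
        return self.dec(z), mu, logvar

def negative_elbo(x, x_hat, mu, logvar):
    recon = ((x_hat - x) ** 2).sum(dim=1)                        # reconstruction (L2) term
    kl = 0.5 * (mu ** 2 + logvar.exp() - logvar - 1).sum(dim=1)  # KL(q(z|x) || N(0, I))
    return (recon + kl).mean()                                   # minimizing this maximizes the ELBO
```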

Generative Adversarial Networks

We don’t model $p(x)$ anymore; we only draw samples from it.

Generative Adversarial Networks: We have data $x_i$ drawn from a distribution $p_{data}(x)$, and we want to sample from $p_{data}$.

We introduce a latent variable $z$ with a simple prior $p(z)$ (e.g. unit Gaussian). Sample $z \sim p(z)$ and pass it to a Generator Network $x = G(z)$. Then $x$ is a sample from the generator distribution $p_G$.

Train the Generator Network $G$ to convert $z$ into fake data $x$ sampled from $p_G$, fooling the discriminator $D$.

Train the Discriminator Network $D$ to classify data as real or fake.

We jointly train $G$ and $D$ with a minimax game, hoping $p_G$ converges to $p_{data}$.

$$\min_G \max_D \left( E_{x \sim p_{data}}[\log D(x)] + E_{z \sim p(z)}[\log(1 - D(G(z)))] \right)$$

The generator wants $D(G(z)) = 1$. In practice we train the generator to minimize $-\log(D(G(z)))$, while the discriminator maximizes $\log(1 - D(G(z)))$ on fake samples, so the generator gets strong gradients at the start of training.
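One training iteration might look like the following sketch, assuming a generator `G`, a discriminator `D` that outputs one logit per example, their optimizers, a real batch `x_real`, and a latent dimension `z_dim`; the generator uses the non-saturating loss described above:

```python
import torch
import torch.nn.functional as F

# Minimal sketch of one GAN training iteration (G, D, optimizers, and data assumed given).
def gan_step(G, D, opt_g, opt_d, x_real, z_dim):
    n = x_real.size(0)
    real_labels = torch.ones(n, 1)
    fake_labels = torch.zeros(n, 1)

    # Discriminator step: push D(real) -> 1 and D(fake) -> 0.
    z = torch.randn(n, z_dim)
    x_fake = G(z).detach()  # do not backprop into G on this step
    d_loss = F.binary_cross_entropy_with_logits(D(x_real), real_labels) + \
             F.binary_cross_entropy_with_logits(D(x_fake), fake_labels)
    opt_d.zero_grad(); d_loss.backward(); opt_d.step()

    # Generator step: non-saturating loss, i.e. minimize -log D(G(z)).
    z = torch.randn(n, z_dim)
    g_loss = F.binary_cross_entropy_with_logits(D(G(z)), real_labels)
    opt_g.zero_grad(); g_loss.backward(); opt_g.step()
    return d_loss.item(), g_loss.item()
```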

Examples of GANs include DCGAN and StyleGAN.

The latent space is smooth: given latent vectors $z_0$ and $z_1$, we can interpolate between them, and the resulting images interpolate smoothly between the two samples.

However, GAN training is unstable.

Diffusion & Flow Matching

We model the transformation between a simple noise distribution and the complex data distribution.

Flow Matching: We have data $x$ drawn from a distribution $p_{data}$ and noise $z$ drawn from a simple prior $p_{noise}$ (e.g. unit Gaussian). We define a path that interpolates between them over time $t \in [0, 1]$.

On each training iteration, we sample $x \sim p_{data}$, $z \sim p_{noise}$, and a time step $t \sim \text{Uniform}[0, 1]$. We construct a noisy sample $x_t$ and a target velocity $v$:

$$x_t = (1 - t)\,x + t\,z, \quad v = z - x$$

Train a Velocity Network $f_\theta(x_t, t)$ to predict the vector $v$ that points from the data toward the noise. This is optimized using a regression loss:

$$L = \| f_\theta(x_t, t) - v \|_2^2$$
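A minimal sketch of one flow-matching training step, assuming a velocity network `model(x_t, t)` that accepts a broadcastable time tensor:

```python
import torch

# Minimal flow-matching training step (velocity network `model(x_t, t)` assumed given).
def flow_matching_step(model, optimizer, x):
    z = torch.randn_like(x)                             # z ~ p_noise (unit Gaussian)
    t = torch.rand(x.size(0), *([1] * (x.dim() - 1)))   # t ~ Uniform[0, 1], broadcastable
    x_t = (1 - t) * x + t * z                           # noisy sample on the straight path
    v_target = z - x                                    # target velocity (data -> noise)
    loss = ((model(x_t, t) - v_target) ** 2).mean()     # regression loss
    optimizer.zero_grad(); loss.backward(); optimizer.step()
    return loss.item()
```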

During Inference (Sampling), we start with pure noise $x_1 \sim p_{noise}$ and move backward toward the data distribution. We choose a number of steps $T$ (often $T = 50$):

For $t$ in $[1, 1 - \frac{1}{T}, 1 - \frac{2}{T}, \dots, 0]$:

  1. Evaluate the predicted velocity: $v_t = f_\theta(x_t, t)$
  2. Take a small step: $x = x - v_t / T$

This approach, known as Optimal Transport Flow Matching, results in straight trajectories between noise and data.
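A minimal sampling sketch under the same assumptions, taking $T$ Euler steps from noise back toward data:

```python
import torch

# Minimal flow-matching sampling sketch: T Euler steps from noise to data,
# assuming the trained velocity network `model(x_t, t)` and a batch shape `shape`.
@torch.no_grad()
def sample(model, shape, T=50):
    x = torch.randn(shape)                    # start from pure noise, x_1 ~ p_noise
    for i in range(T):
        t = 1.0 - i / T                       # t = 1, 1 - 1/T, ..., 1/T
        t_batch = torch.full((shape[0],) + (1,) * (len(shape) - 1), t)
        v = model(x, t_batch)                 # predicted velocity at (x_t, t)
        x = x - v / T                         # small Euler step toward the data distribution
    return x
```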

Classifier-Free Guidance (CFG): To control the generation process, we introduce a condition $y$ (e.g., a text prompt). During training, we occasionally drop $y$ so the model learns both conditional and unconditional distributions.

During sampling, we compute two separate velocities for a noisy sample $x_t$:

  • Unconditional velocity: $v^\emptyset = f_\theta(x_t, y_\emptyset, t)$ (points toward $p(x)$)
  • Conditional velocity: $v^y = f_\theta(x_t, y, t)$ (points toward $p(x|y)$)

We combine these using a guidance weight $w$ to shift the direction more strongly toward the condition:

$$v^{cfg} = (1 + w)\,v^y - w\,v^\emptyset$$

Increasing $w$ improves how well the image matches the prompt but can reduce sample diversity.
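A minimal sketch of the CFG combination at a single sampling step, assuming the velocity network also takes a condition embedding and that `y_null` is a (hypothetical) learned null embedding:

```python
# Minimal classifier-free guidance sketch for one sampling step.
def cfg_velocity(model, x_t, y, y_null, t, w=5.0):
    v_uncond = model(x_t, y_null, t)        # points toward p(x)
    v_cond = model(x_t, y, t)               # points toward p(x|y)
    return (1 + w) * v_cond - w * v_uncond  # shift further in the conditional direction
```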

Noise Schedules: We may use a non-uniform noise schedule; a common choice is logit-normal sampling of $t$. For high-resolution data, we often shift toward higher noise levels to account for pixel correlations.

Diffusion Distillation: We use distillation algorithms to reduce the number of sampling steps (sometimes all the way to 1).

Examples of these models include Stable Diffusion (which uses a similar diffusion objective) and Flux.

Unlike GANs, these models are much more stable to train and scale better to high-resolution data. However, sampling is typically slower because it requires multiple evaluations of the network.

Latent Diffusion Models (LDM)

Latent Diffusion Models are basically VAE + GAN + Diffusion.

We train an encoder + decoder to convert images to and from latents, then train a diffusion model to remove noise from the latents. At sampling time, we run the decoder to get an image from the denoised latents.

The encoder + decoder is a VAE. For the decoder, we add a GAN loss to prevent the outputs from being blurry.

Generalized Diffusion

This framework provides a unified view of generative modeling, where models like DDPM, DDIM, and Flow Matching are seen as specific configurations of a general training objective.

Core Concept: We define a forward process that transitions from clean data $x$ to noise $z$ over a continuous or discrete time $t$, and train a network to reverse or predict aspects of this transition.

1. Training Procedure

For every training step, we sample the components of the transition:

  • Samples: $x \sim p_{data}$ (clean data), $z \sim p_{noise}$ (standard noise), and $t \sim p_t$ (time step).
  • Noisy Sample Construction: We create an intermediate state $x_t$ using time-dependent scalar functions $a(t)$ and $b(t)$:

    $$x_t = a(t)\,x + b(t)\,z$$

  • Ground Truth Target: We define what the model should predict ($y_{gt}$) using another set of functions $c(t)$ and $d(t)$:

    $$y_{gt} = c(t)\,x + d(t)\,z$$

2. Optimization

The neural network $f_\theta$ takes the noisy sample and the time step to predict the target. We minimize the Mean Squared Error:

$$L = \| f_\theta(x_t, t) - y_{gt} \|_2^2$$
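A minimal sketch of this training step, where the coefficient functions `a`, `b`, `c`, `d` are passed in and pick the framework being instantiated (see the next subsection for the specific choices):

```python
import torch

# Minimal generalized-diffusion training step; a, b, c, d are callables of t,
# e.g. flow matching: a = lambda t: 1 - t, b = lambda t: t, c = lambda t: -1.0, d = lambda t: 1.0.
def generalized_diffusion_step(model, optimizer, x, a, b, c, d):
    z = torch.randn_like(x)                             # z ~ p_noise
    t = torch.rand(x.size(0), *([1] * (x.dim() - 1)))   # t ~ p_t (here Uniform[0, 1])
    x_t = a(t) * x + b(t) * z                           # noisy sample
    y_gt = c(t) * x + d(t) * z                          # ground-truth regression target
    loss = ((model(x_t, t) - y_gt) ** 2).mean()         # mean squared error
    optimizer.zero_grad(); loss.backward(); optimizer.step()
    return loss.item()
```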

3. Framework Unification

By varying the coefficients $a$, $b$, $c$, and $d$, this single loss function covers different popular models:

  • Standard Diffusion (DDPM): Predicts the noise $z$ added to the data.
    • $a(t) = \sqrt{\bar{\alpha}_t}, \quad b(t) = \sqrt{1 - \bar{\alpha}_t}$
    • Target: $c(t) = 0,\ d(t) = 1 \implies y_{gt} = z$
  • Flow Matching (Optimal Transport): Predicts the velocity $v$ along a straight line.
    • $a(t) = 1 - t, \quad b(t) = t$
    • Target: $c(t) = -1,\ d(t) = 1 \implies y_{gt} = z - x$
  • Data Prediction: The model directly estimates the clean sample $x$ from the noisy input.
    • Target: $c(t) = 1,\ d(t) = 0 \implies y_{gt} = x$

4. Classifier-Free Guidance (CFG)

To steer these models during sampling (e.g., using a text prompt $y$), we compute a weighted average of the conditional and unconditional predictions:

$$v^{cfg} = (1 + w)\,f_\theta(x_t, y, t) - w\,f_\theta(x_t, y_\emptyset, t)$$

  • Unconditional: $y_\emptyset$ (null or empty prompt).
  • Guidance Scale $w$: Higher values force the model to follow the condition $y$ more strictly, often at the cost of visual variety.

Summary: While GANs rely on a competitive game (minimax), Generalized Diffusion relies on direct regression. This makes training significantly more stable and allows for high-quality, diverse image and signal synthesis across various domains like photogrammetry and 3D reconstruction.

Perspectives on Diffusion Models

Diffusion models are multifaceted and can be understood through several mathematical and conceptual frameworks. Rather than just a single algorithm, they represent a class of models that bridge the gap between simple noise and complex data distributions.

1. Deep Latent Variable Perspective

In this view, diffusion is treated similarly to a Variational Autoencoder (VAE) but with a fixed encoder and a learned decoder.

  • Forward Process: We define a fixed “noising” process where Gaussian noise is iteratively added to the data $x_0$, moving through latent states $x_1, \dots, x_t, \dots, x_T$.
  • Backward Process: A neural network is trained to approximate the reverse step $p_\theta(x_{t-1} | x_t)$, effectively learning to “undo” the noise.
  • Optimization: The model is trained by optimizing the variational lower bound (VLB), ensuring the generated samples represent the underlying data distribution.

2. Score-Based Perspective

Instead of modeling the probability density $p(x)$ directly, we can model its gradient with respect to the input.

  • Score Function: Defined as $s(x) = \frac{\partial}{\partial x} \log p(x)$.
  • Vector Field: The score function acts as a vector field that points toward areas of high probability density in the data space.
  • Learning: The diffusion model learns a neural network to approximate this score function for $p_{data}$. During sampling, we “walk” along this vector field to find high-density regions (real data), as sketched after this list.
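The lecture does not prescribe a specific sampler here, but a standard way to “walk along” the score field is Langevin dynamics; a minimal sketch, assuming a trained score network `score(x)` that approximates the score of $p_{data}$:

```python
import torch

# Minimal Langevin-dynamics sampling sketch along a learned score field.
# `score(x)` is an assumed trained network approximating s(x) = d/dx log p_data(x).
@torch.no_grad()
def langevin_sample(score, shape, steps=1000, eps=1e-4):
    x = torch.randn(shape)                                   # start from random noise
    for _ in range(steps):
        noise = torch.randn_like(x)
        x = x + 0.5 * eps * score(x) + (eps ** 0.5) * noise  # score step + injected noise
    return x
```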

3. Stochastic Differential Equations (SDEs)

For a more continuous mathematical treatment, the noising process can be described as an SDE.

  • The Equation: We describe infinitesimal changes in data $x$, time $t$, and noise $w$ as:

    $$dx = f(x, t)\,dt + g(t)\,dw$$

  • Reverse SDE: Diffusion models learn a neural network to solve the reverse version of this equation, allowing for continuous-time generation and more flexible sampling strategies.

Summary of Perspectives

Beyond the frameworks above, diffusion models can also be viewed as:

  1. Autoencoders: Specifically, denoising autoencoders applied at different noise levels.
  2. Recurrent Neural Networks: Since they apply the same network iteratively over time steps.
  3. Autoregressive Models: When viewed as predicting the “next” state in a sequence from noise to data.
  4. Expectation Estimators: Estimating the conditional mean of the data given the noisy observation.

About this Post

This post is written by Louis C Deng, licensed under CC BY-NC 4.0.

#CS231n #Deep Learning