
CS231n Lecture Note: Generative Models

  • Discriminative Model: Learns the conditional probability $p(y|x)$. It focuses on predicting a label $y$ given the input data $x$.
  • Generative Model: Learns the probability distribution of the data itself, $p(x)$.
  • Conditional Generative Model: Learns the probability $p(x|y)$, which is the likelihood of observing specific data $x$ given a certain label $y$.

The Density Function $p(x)$:

  • Likelihood: A density function assigns a positive number to each possible $x$; a higher number indicates that the specific value of $x$ is more likely.
  • Normalization: Density functions must be normalized so that the total probability across all possible values of $x$ equals 1, expressed as:

$$\int_{X} p(x)\,dx = 1$$

  • Competition: Because the total area must equal 1, different values of $x$ “compete” for density within the distribution.

With generative models, all possible images compete for probability mass; the model can “reject” unreasonable inputs by assigning them little probability mass.

Bayes’ Theorem:

$$P(x|y) = \frac{P(y|x)\,P(x)}{P(y)}$$

This means that with two of the three models above, we can derive the third.

Autoregressive Models

Maximum Likelihood Estimation (MLE): The core objective is to find the best parameters $W$ for our function $f(x, W)$ so that it accurately models the true distribution $p(x)$ of our data.

To train the model using a dataset $x^{(1)}, x^{(2)}, \dots, x^{(N)}$, we solve for the optimal weights $W^*$ through these steps:

We start by trying to maximize the joint probability of all training data points. This is represented as the product of individual probabilities: $\arg\max_{W} \prod_{i} p(x^{(i)})$.

Since multiplying many small probabilities can lead to numerical instability (underflow), we apply a logarithm. Because the log is a monotonic function, maximizing the log-likelihood is equivalent to maximizing the likelihood, but it allows us to swap the product for a sum: $\arg\max_{W} \sum_{i} \log p(x^{(i)})$.

By substituting our model function $f(x, W)$ for $p(x)$, we arrive at our final objective: $\arg\max_{W} \sum_{i} \log f(x^{(i)}, W)$.
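As a sanity check of this objective, here is a minimal sketch, assuming a hypothetical `model` object whose `log_prob(x)` method returns $\log f(x, W)$ as a differentiable scalar:

```python
# Minimal sketch of the MLE objective, assuming a hypothetical `model` whose
# log_prob(x) method returns log f(x, W) as a differentiable scalar tensor.
def negative_log_likelihood(model, dataset):
    # Maximizing sum_i log f(x^(i), W) is the same as minimizing its negative.
    return -sum(model.log_prob(x) for x in dataset)

# A gradient step on W would then look like:
#   loss = negative_log_likelihood(model, dataset)
#   loss.backward(); optimizer.step()
```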

Autoregressive models assume that the data $x$ is not a single static point, but a sequence of components:

$$x = (x_1, x_2, \dots, x_T)$$

An autoregressive model predicts the “next” part of the data based on what has already been generated.

It iterates through the elements, determining the probability of the next element given all previous elements.

To calculate the probability of the entire sequence $p(x)$, these models decompose the joint probability using the chain rule. Instead of looking at the whole sequence at once, the model breaks it down into a series of conditional probabilities:

$$p(x) = \prod_{t=1}^{T} p(x_t \mid x_1, \dots, x_{t-1})$$
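A minimal sketch of this factorization, assuming a hypothetical model whose `next_dist(prefix)` method returns the distribution $p(x_t \mid x_1, \dots, x_{t-1})$ (e.g. a `torch.distributions` object):

```python
# Minimal sketch of the chain-rule factorization, assuming a hypothetical model whose
# next_dist(prefix) returns the distribution p(x_t | x_1, ..., x_{t-1})
# (e.g. a torch.distributions object with a log_prob method).
def sequence_log_prob(model, x):
    log_p = 0.0
    for t in range(len(x)):
        dist = model.next_dist(x[:t])        # condition on all previously seen elements
        log_p = log_p + dist.log_prob(x[t])  # accumulate log p(x_t | x_<t)
    return log_p                             # log p(x) = sum_t log p(x_t | x_<t)
```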

Variational Autoencoders (VAE)

Variational Autoencoders (VAEs) define an intractable density that we cannot explicitly compute or optimize. Instead, we can derive and optimize a lower bound on it.

The autoencoder is an unsupervised method for learning to extract features $z$ from inputs $x$, without labels.

It trains an encoder and a decoder: the encoder produces an intermediate representation of the input, and the decoder reconstructs the input data from it. The intermediate representation $z$ has lower dimensionality than the input.

The loss function is the L2 distance between the input and the reconstructed data.

After training, we can use the encoder for downstream tasks.
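A minimal autoencoder sketch with the L2 reconstruction loss described above; the layer sizes (784-dimensional inputs, 32-dimensional codes) are illustrative assumptions, not from the lecture:

```python
import torch.nn as nn

# Minimal sketch of a plain autoencoder; sizes are illustrative assumptions.
class Autoencoder(nn.Module):
    def __init__(self, x_dim=784, z_dim=32):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(x_dim, 128), nn.ReLU(), nn.Linear(128, z_dim))
        self.decoder = nn.Sequential(nn.Linear(z_dim, 128), nn.ReLU(), nn.Linear(128, x_dim))

    def forward(self, x):
        z = self.encoder(x)      # lower-dimensional feature z
        return self.decoder(z)   # reconstruction of x

def reconstruction_loss(model, x):
    # L2 distance between the input and the reconstructed data
    return ((model(x) - x) ** 2).mean()
```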

For generative tasks, we can force all $z$ to come from a known distribution so that we can sample $z$ to generate outputs.

We assume the latent factor $z$ follows a Gaussian distribution. We train the model with maximum likelihood.

$$p_\theta(x) = \frac{p_\theta(x|z)\,p_\theta(z)}{p_\theta(z|x)}$$

We apply Bayes’ rule. Since we cannot compute the posterior $p_\theta(z|x)$, we train another network that learns $q_\phi(z|x) \approx p_\theta(z|x)$.

For VAE, we train two networks.

The Encoder Network ($q_\phi(z|x)$): takes input data $x$ and maps it to a distribution over latent codes $z$. Instead of a single point, it outputs the parameters of a distribution (typically Gaussian): a mean ($\mu_{z|x}$) and a variance ($\Sigma_{z|x}$).

The Decoder Network ($p_\theta(x|z)$): takes a latent code $z$ and reconstructs the data $x$. It outputs a distribution over the data, defined by its own mean ($\mu_{x|z}$) and variance ($\sigma^2$).

The fundamental challenge in VAEs is that the true posterior $p_\theta(z|x)$ is intractable. To solve this, VAEs use Variational Inference:

  1. Approximation: We ensure that our learned encoder distribution $q_\phi(z|x)$ is approximately equal to the true posterior $p_\theta(z|x)$.
  2. Estimation: If this approximation holds, we can estimate the data likelihood $p_\theta(x)$ using the following relationship:

$$p_\theta(x) \approx \frac{p_\theta(x|z)\,p(z)}{q_\phi(z|x)}$$

  3. Joint Training: We jointly train both the encoder and decoder to maximize this evidence.

ELBO

Building on the previous concepts, this derivation explains how we can optimize the log-likelihood of our data, $\log p_\theta(x)$, even when the true posterior is unknown.

$$\log p_\theta(x) = \log \frac{p_\theta(x|z)\,p(z)}{p_\theta(z|x)} = \log \frac{p_\theta(x|z)\,p(z)\,q_\phi(z|x)}{p_\theta(z|x)\,q_\phi(z|x)}$$

$$= E_{z \sim q_\phi(z|x)}[\log p_\theta(x|z)] - D_{KL}(q_\phi(z|x), p(z)) + D_{KL}(q_\phi(z|x), p_\theta(z|x))$$
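To see where the three terms come from, note that $\log p_\theta(x)$ does not depend on $z$, so we can take the expectation of the last expression over $z \sim q_\phi(z|x)$ and split the logarithm of the product (a standard intermediate step, spelled out here):

$$\log p_\theta(x) = E_{z \sim q_\phi(z|x)}\left[\log p_\theta(x|z)\right] + E_{z \sim q_\phi(z|x)}\left[\log \frac{p(z)}{q_\phi(z|x)}\right] + E_{z \sim q_\phi(z|x)}\left[\log \frac{q_\phi(z|x)}{p_\theta(z|x)}\right]$$

The second expectation is $-D_{KL}(q_\phi(z|x), p(z))$ and the third is $D_{KL}(q_\phi(z|x), p_\theta(z|x))$, which gives the decomposition above.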

The derivation uses the properties of logarithms and expectations to decompose the log-likelihood into three distinct terms:

  1. The Reconstruction Term ($E_{z \sim q_\phi(z|x)}[\log p_\theta(x|z)]$):

    • This measures how well the decoder can reconstruct the original input $x$ from the latent code $z$ sampled from the encoder.
    • In training, we want to maximize this to ensure the model retains data fidelity.
  2. The Prior Regularization Term ($D_{KL}(q_\phi(z|x), p(z))$):

    • This is the Kullback-Leibler (KL) divergence between the encoder’s distribution and a simple prior (usually a standard normal distribution).
    • It acts as a regularizer, forcing the latent space to be continuous and well-structured. We subtract this term, meaning we want the encoder to stay close to our prior.
  3. The Approximation Error ($D_{KL}(q_\phi(z|x), p_\theta(z|x))$):

    • This measures the divergence between our approximate posterior (the encoder) and the true, intractable posterior.
    • Because KL divergence is always $\ge 0$, we know that the first two terms combined form a lower bound on the total log-likelihood.

Since we cannot calculate the true posterior (the third term), we ignore it and maximize the first two terms—the Evidence Lower Bound (ELBO). By maximizing the ELBO, we simultaneously improve the quality of our data generation and the efficiency of our latent representation.

This ELBO is the lower bound on the log-likelihood that we actually maximize.

To train this, we first run the input data through the encoder to get a distribution over $z$, which should match the unit Gaussian prior. We then sample $z$ from that distribution and run it through the decoder to get the predicted data mean, which should match $x$ under the L2 loss.

The reconstruction loss wants $\Sigma_{z|x} = 0$ and $\mu_{z|x}$ to be unique for each $x$, so the decoder can deterministically reconstruct $x$.

The prior loss wants $\Sigma_{z|x} = \mathbf{I}$ and $\mu_{z|x} = 0$, so the encoder output is always a unit Gaussian.
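A minimal VAE training sketch that puts these pieces together, assuming flattened 784-dimensional inputs and a 32-dimensional latent (both illustrative); the reparameterization trick ($z = \mu + \sigma \cdot \epsilon$) keeps the sampling step differentiable:

```python
import torch
import torch.nn as nn

# Minimal VAE sketch; sizes (784-dim inputs, 32-dim latents) are illustrative assumptions.
class VAE(nn.Module):
    def __init__(self, x_dim=784, z_dim=32, h_dim=256):
        super().__init__()
        self.enc = nn.Sequential(nn.Linear(x_dim, h_dim), nn.ReLU())
        self.enc_mu = nn.Linear(h_dim, z_dim)      # mu_{z|x}
        self.enc_logvar = nn.Linear(h_dim, z_dim)  # log of (diagonal) Sigma_{z|x}
        self.dec = nn.Sequential(nn.Linear(z_dim, h_dim), nn.ReLU(),
                                 nn.Linear(h_dim, x_dim))  # predicts mu_{x|z}

    def forward(self, x):
        h = self.enc(x)
        mu, logvar = self.enc_mu(h), self.enc_logvar(h)
        # Reparameterization trick: z = mu + sigma * eps, so gradients flow to the encoder.
        z = mu + torch.exp(0.5 * logvar) * torch.randn_like(mu)
        return self.dec(z), mu, logvar

def negative_elbo(x, x_hat, mu, logvar):
    recon = ((x_hat - x) ** 2).sum(dim=1)                        # reconstruction (L2) term
    kl = 0.5 * (mu ** 2 + logvar.exp() - logvar - 1).sum(dim=1)  # KL(q(z|x) || N(0, I))
    return (recon + kl).mean()                                   # minimizing this maximizes the ELBO
```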

Generative Adversarial Networks

We don’t model $p(x)$ anymore; we only draw samples from it.

Generative Adversarial Networks: We have data $x_i$ drawn from a distribution $p_{data}(x)$, and we want to sample from $p_{data}$.

We introduce a latent variable $z$ with a simple prior $p(z)$ (e.g. unit Gaussian). Sample $z \sim p(z)$ and pass it to a Generator Network $x = G(z)$. Then $x$ is a sample from the generator distribution $p_G$.

Train the Generator Network $G$ to convert $z$ into fake data $x$ sampled from $p_G$, fooling the discriminator $D$.

Train the Discriminator Network $D$ to classify data as real or fake.

We jointly train $G$ and $D$ with a minimax game, hoping $p_G$ converges to $p_{data}$.

$$\min_G \max_D \left( E_{x \sim p_{data}}[\log D(x)] + E_{z \sim p(z)}[\log(1 - D(G(z)))] \right)$$

The generator wants $D(G(z)) = 1$. In practice we train the generator to minimize $-\log(D(G(z)))$, while the discriminator maximizes $\log(1 - D(G(z)))$ on fake samples, so the generator gets strong gradients at the start of training.
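One training iteration might look like the following sketch, assuming a generator `G`, a discriminator `D` that outputs one logit per example, their optimizers, a real batch `x_real`, and a latent dimension `z_dim`; the generator uses the non-saturating loss described above:

```python
import torch
import torch.nn.functional as F

# Minimal sketch of one GAN training iteration (G, D, optimizers, and data assumed given).
def gan_step(G, D, opt_g, opt_d, x_real, z_dim):
    n = x_real.size(0)
    real_labels = torch.ones(n, 1)
    fake_labels = torch.zeros(n, 1)

    # Discriminator step: push D(real) -> 1 and D(fake) -> 0.
    z = torch.randn(n, z_dim)
    x_fake = G(z).detach()  # do not backprop into G on this step
    d_loss = F.binary_cross_entropy_with_logits(D(x_real), real_labels) + \
             F.binary_cross_entropy_with_logits(D(x_fake), fake_labels)
    opt_d.zero_grad(); d_loss.backward(); opt_d.step()

    # Generator step: non-saturating loss, i.e. minimize -log D(G(z)).
    z = torch.randn(n, z_dim)
    g_loss = F.binary_cross_entropy_with_logits(D(G(z)), real_labels)
    opt_g.zero_grad(); g_loss.backward(); opt_g.step()
    return d_loss.item(), g_loss.item()
```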

Examples of GANs include DCGAN and StyleGAN.

The latent space is smooth: given latent vectors $z_0$ and $z_1$, we can interpolate between them, and the resulting images interpolate smoothly between the two samples.

However, GAN training is unstable.

Diffusion & Flow Matching

We model the transformation between a simple noise distribution and the complex data distribution.

Flow Matching: We have data $x$ drawn from a distribution $p_{data}$ and noise $z$ drawn from a simple prior $p_{noise}$ (e.g. unit Gaussian). We define a path that interpolates between them over time $t \in [0, 1]$.

On each training iteration, we sample $x \sim p_{data}$, $z \sim p_{noise}$, and a time step $t \sim \text{Uniform}[0, 1]$. We construct a noisy sample $x_t$ and a target velocity $v$:

$$x_t = (1 - t)\,x + t\,z, \quad v = z - x$$

Train a Velocity Network $f_\theta(x_t, t)$ to predict the vector $v$ that points from the data toward the noise. This is optimized using a regression loss:

$$L = \| f_\theta(x_t, t) - v \|_2^2$$
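A minimal sketch of one flow-matching training step, assuming a velocity network `model(x_t, t)` that accepts a broadcastable time tensor:

```python
import torch

# Minimal flow-matching training step (velocity network `model(x_t, t)` assumed given).
def flow_matching_step(model, optimizer, x):
    z = torch.randn_like(x)                             # z ~ p_noise (unit Gaussian)
    t = torch.rand(x.size(0), *([1] * (x.dim() - 1)))   # t ~ Uniform[0, 1], broadcastable
    x_t = (1 - t) * x + t * z                           # noisy sample on the straight path
    v_target = z - x                                    # target velocity (data -> noise)
    loss = ((model(x_t, t) - v_target) ** 2).mean()     # regression loss
    optimizer.zero_grad(); loss.backward(); optimizer.step()
    return loss.item()
```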

During Inference (Sampling), we start with pure noise $x_1 \sim p_{noise}$ and move backward toward the data distribution. We choose a number of steps $T$ (often $T = 50$):

For $t$ in $[1, 1 - \frac{1}{T}, 1 - \frac{2}{T}, \dots, 0]$:

  1. Evaluate the predicted velocity: $v_t = f_\theta(x_t, t)$
  2. Take a small step: $x = x - v_t / T$

This approach, known as Optimal Transport Flow Matching, results in straight trajectories between noise and data.
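A minimal sampling sketch under the same assumptions, taking $T$ Euler steps from noise back toward data:

```python
import torch

# Minimal flow-matching sampling sketch: T Euler steps from noise to data,
# assuming the trained velocity network `model(x_t, t)` and a batch shape `shape`.
@torch.no_grad()
def sample(model, shape, T=50):
    x = torch.randn(shape)                    # start from pure noise, x_1 ~ p_noise
    for i in range(T):
        t = 1.0 - i / T                       # t = 1, 1 - 1/T, ..., 1/T
        t_batch = torch.full((shape[0],) + (1,) * (len(shape) - 1), t)
        v = model(x, t_batch)                 # predicted velocity at (x_t, t)
        x = x - v / T                         # small Euler step toward the data distribution
    return x
```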

Classifier-Free Guidance (CFG): To control the generation process, we introduce a condition $y$ (e.g., a text prompt). During training, we occasionally drop $y$ so the model learns both conditional and unconditional distributions.

During sampling, we compute two separate velocities for a noisy sample $x_t$:

  • Unconditional velocity: $v^\emptyset = f_\theta(x_t, y_\emptyset, t)$ (points toward $p(x)$)
  • Conditional velocity: $v^y = f_\theta(x_t, y, t)$ (points toward $p(x|y)$)

We combine these using a guidance weight $w$ to shift the direction more strongly toward the condition:

$$v^{cfg} = (1 + w)\,v^y - w\,v^\emptyset$$

Increasing $w$ improves how well the image matches the prompt but can reduce sample diversity.
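A minimal sketch of the CFG combination at a single sampling step, assuming the velocity network also takes a condition embedding and that `y_null` is a (hypothetical) learned null embedding:

```python
# Minimal classifier-free guidance sketch for one sampling step.
def cfg_velocity(model, x_t, y, y_null, t, w=5.0):
    v_uncond = model(x_t, y_null, t)        # points toward p(x)
    v_cond = model(x_t, y, t)               # points toward p(x|y)
    return (1 + w) * v_cond - w * v_uncond  # shift further in the conditional direction
```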

Noise Schedules: We may use a non-uniform noise schedule; a common choice is logit-normal sampling of $t$. For high-resolution data, we often shift toward higher noise levels to account for pixel correlations.

Diffusion Distillation: We use distillation algorithms to reduce the number of sampling steps (sometimes all the way to 1).

Examples of these models include Stable Diffusion (which uses a similar diffusion objective) and Flux.

Unlike GANs, these models are much more stable to train and scale better to high-resolution data. However, sampling is typically slower because it requires multiple evaluations of the network.

Latent Diffusion Models (LDM)

Latent Diffusion Models are basically VAE + GAN + Diffusion.

We train an encoder + decoder to convert images to and from latents, then train a diffusion model to remove noise from the latents. At sampling time, we run the decoder to get an image from the denoised latents.

The encoder + decoder is a VAE. For the decoder, we add a GAN loss to prevent the outputs from being blurry.

Generalized Diffusion

This framework provides a unified view of generative modeling, where models like DDPM, DDIM, and Flow Matching are seen as specific configurations of a general training objective.

Core Concept: We define a forward process that transitions from clean data $x$ to noise $z$ over a continuous or discrete time $t$, and train a network to reverse or predict aspects of this transition.

1. Training Procedure

For every training step, we sample the components of the transition:

  • Samples: $x \sim p_{data}$ (clean data), $z \sim p_{noise}$ (standard noise), and $t \sim p_t$ (time step).
  • Noisy Sample Construction: We create an intermediate state $x_t$ using time-dependent scalar functions $a(t)$ and $b(t)$:

    $$x_t = a(t)\,x + b(t)\,z$$

  • Ground Truth Target: We define what the model should predict ($y_{gt}$) using another set of functions $c(t)$ and $d(t)$:

    $$y_{gt} = c(t)\,x + d(t)\,z$$

2. Optimization

The neural network $f_\theta$ takes the noisy sample and the time step to predict the target. We minimize the Mean Squared Error:

$$L = \| f_\theta(x_t, t) - y_{gt} \|_2^2$$
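A minimal sketch of this training step, where the coefficient functions `a`, `b`, `c`, `d` are passed in and pick the framework being instantiated (see the next subsection for the specific choices):

```python
import torch

# Minimal generalized-diffusion training step; a, b, c, d are callables of t,
# e.g. flow matching: a = lambda t: 1 - t, b = lambda t: t, c = lambda t: -1.0, d = lambda t: 1.0.
def generalized_diffusion_step(model, optimizer, x, a, b, c, d):
    z = torch.randn_like(x)                             # z ~ p_noise
    t = torch.rand(x.size(0), *([1] * (x.dim() - 1)))   # t ~ p_t (here Uniform[0, 1])
    x_t = a(t) * x + b(t) * z                           # noisy sample
    y_gt = c(t) * x + d(t) * z                          # ground-truth regression target
    loss = ((model(x_t, t) - y_gt) ** 2).mean()         # mean squared error
    optimizer.zero_grad(); loss.backward(); optimizer.step()
    return loss.item()
```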

3. Framework Unification

By varying the coefficients $a$, $b$, $c$, and $d$, this single loss function covers different popular models:

  • Standard Diffusion (DDPM): Predicts the noise $z$ added to the data.
    • $a(t) = \sqrt{\bar{\alpha}_t}, \quad b(t) = \sqrt{1 - \bar{\alpha}_t}$
    • Target: $c(t) = 0,\ d(t) = 1 \implies y_{gt} = z$
  • Flow Matching (Optimal Transport): Predicts the velocity $v$ along a straight line.
    • $a(t) = 1 - t, \quad b(t) = t$
    • Target: $c(t) = -1,\ d(t) = 1 \implies y_{gt} = z - x$
  • Data Prediction: The model directly estimates the clean sample $x$ from the noisy input.
    • Target: $c(t) = 1,\ d(t) = 0 \implies y_{gt} = x$

4. Classifier-Free Guidance (CFG)

To steer these models during sampling (e.g., using a text prompt $y$), we compute a weighted average of the conditional and unconditional predictions:

$$v^{cfg} = (1 + w)\,f_\theta(x_t, y, t) - w\,f_\theta(x_t, y_\emptyset, t)$$

  • Unconditional: $y_\emptyset$ (null or empty prompt).
  • Guidance Scale $w$: Higher values force the model to follow the condition $y$ more strictly, often at the cost of visual variety.

Summary: While GANs rely on a competitive game (minimax), Generalized Diffusion relies on direct regression. This makes training significantly more stable and allows for high-quality, diverse image and signal synthesis across various domains like photogrammetry and 3D reconstruction.

Perspectives on Diffusion Models

Diffusion models are multifaceted and can be understood through several mathematical and conceptual frameworks. Rather than just a single algorithm, they represent a class of models that bridge the gap between simple noise and complex data distributions.

1. Deep Latent Variable Perspective

In this view, diffusion is treated similarly to a Variational Autoencoder (VAE) but with a fixed encoder and a learned decoder.

  • Forward Process: We define a fixed “noising” process where Gaussian noise is iteratively added to the data $x_0$, moving through latent states $x_1, \dots, x_t, \dots, x_T$.
  • Backward Process: A neural network is trained to approximate the reverse step $p_\theta(x_{t-1} | x_t)$, effectively learning to “undo” the noise.
  • Optimization: The model is trained by optimizing the variational lower bound (VLB), ensuring the generated samples represent the underlying data distribution.

2. Score-Based Perspective

Instead of modeling the probability density $p(x)$ directly, we can model its gradient with respect to the input.

  • Score Function: Defined as $s(x) = \frac{\partial}{\partial x} \log p(x)$.
  • Vector Field: The score function acts as a vector field that points toward areas of high probability density in the data space.
  • Learning: The diffusion model learns a neural network to approximate this score function for $p_{data}$. During sampling, we “walk” along this vector field to find high-density regions (real data), as sketched after this list.
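The lecture does not prescribe a specific sampler here, but a standard way to “walk along” the score field is Langevin dynamics; a minimal sketch, assuming a trained score network `score(x)` that approximates the score of $p_{data}$:

```python
import torch

# Minimal Langevin-dynamics sampling sketch along a learned score field.
# `score(x)` is an assumed trained network approximating s(x) = d/dx log p_data(x).
@torch.no_grad()
def langevin_sample(score, shape, steps=1000, eps=1e-4):
    x = torch.randn(shape)                                   # start from random noise
    for _ in range(steps):
        noise = torch.randn_like(x)
        x = x + 0.5 * eps * score(x) + (eps ** 0.5) * noise  # score step + injected noise
    return x
```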

3. Stochastic Differential Equations (SDEs)

For a more continuous mathematical treatment, the noising process can be described as an SDE.

  • The Equation: We describe infinitesimal changes in data $x$, time $t$, and noise $w$ as:

    $$dx = f(x, t)\,dt + g(t)\,dw$$

  • Reverse SDE: Diffusion models learn a neural network to solve the reverse version of this equation, allowing for continuous-time generation and more flexible sampling strategies.

Summary of Perspectives

Beyond the frameworks above, diffusion models can also be viewed as:

  1. Autoencoders: Specifically, denoising autoencoders applied at different noise levels.
  2. Recurrent Neural Networks: Since they apply the same network iteratively over time steps.
  3. Autoregressive Models: When viewed as predicting the “next” state in a sequence from noise to data.
  4. Expectation Estimators: Estimating the conditional mean of the data given the noisy observation.

About this Post

This post is written by Louis C Deng, licensed under CC BY-NC 4.0.

#CS231n #Deep Learning