Deep Generative Model, Basic thoery of GAN, Comparison to VAE, Equation derivations
Basic Idea of GAN

Each dimension of input vector represents some characteristic, which means that change the input of the noise, some of the characteristic of the output image will change.

The outout of discriminator is commonly a scalar. The larger value means realistic, while smaller means fake.
Algorithm
Algorithm Initialize $\theta_{d}$ for $\mathrm{D}$, and $\theta_{g}$ for $\mathrm{G}$
 In each training iteration:
 Sample m examples $\{ x^{1}, x^{2}, \ldots, x^{m} \}$ from database
 Sample m noise samples $\{z^{1}, z^{2}, \ldots, z^{m}\}$ from a distribution
 Obtaining generated data $\{\tilde{x}^{1}, \tilde{x}^{2}, \ldots, \tilde{x}^{m}\}, \tilde{x}^{i}=G\left(z^{i}\right)$
 Update discriminator paramters $\theta_{d}$ to maximize
\(\begin{aligned} &\tilde{V}=\frac{1}{m} \sum_{i=1}^{m} \log D\left(x^{i}\right)+\frac{1}{m} \sum_{i=1}^{m} \log \left(1D\left(\tilde{x}^{i}\right)\right) \\ &\theta_{d} \leftarrow \theta_{d}+\eta \nabla \tilde{V}\left(\theta_{d}\right) \end{aligned}\)  Sample m noise samples $\{z^{1}, z^{2}, \ldots, z^{m}\}$ from a distribution
 Update generator parameters $\theta_{g}$ to maximize
\(\begin{aligned} &\tilde{V}=\frac{1}{m} \sum_{i=1}^{m} \log \left(D\left(G\left(z^{i}\right)\right)\right) \\ &\theta_{g} \leftarrow \theta_{g}\eta \nabla \tilde{V}\left(\theta_{g}\right) \end{aligned}\)
 Takeaway:
 Fix generator G, update discriminator D. Discriminator learns to assign high scores to real objects and low scores to generated result. Gradient descent of the final score.
 Fix discriminator D, and update generator G. Gnerator learns to fool the discriminator. Gradient ascent of the final score.
Generator
 Generator can be designed as autoencoder
 Encoder part of the generator maps the original images to the latent space as the compact representation of the input object.
 Decoder part of the generator reconstructs the original object from code to generated result as closely as possible.
 After the training, we can discard the encoder part, replacing it with the input code as whatever we want.
 Change the input vector of the input noise, the generated result will vary smoothly, but may not make sense e.g. linear interpolation of the input noise
TODO VAE
Comparing to VAE
TODO TODO result in what?
 VAE: The relation between components are notcorrelated since neurons in the same layer of the network will not influence each other and they are independent
 Thus VAE can not discriminate naunced between features, which cause the result more vague than GAN
 To catch the correpondence among pixels, deep structure needed.
Discriminator
 Use discriminator to generate?
 Inference: Generate object $\widetilde{x}$ that
\(\widetilde{x}=\arg \max _{x \in X} D(x)\)  Enumerate all possible $x$, which is not always feasible, because the above equations can only be solved when it is linear. However linearity constrain the ability of the network.
 Inference: Generate object $\widetilde{x}$ that
 Algorithm:
 Given a set of positive examples and negative example generated randomly
 In each iteration
 Learn a discriminator D that can discriminate positive and negative examples
 Find relatively good examples from negative examples by $\widetilde{x}=\arg \max _{x \in X} D(x)$
 Replace the negative examples with relatively good examples
 From the above figure, we noticed that discriminator can only optimize the distribution corresponding to the generated distribution, which is one of the reason why GAN is difficult to optimize.
TODO structured learning
GAN
 From Discriminator’s point of view
 Use Generator to generate negative samples, replacing infeasible sovled
argmax
function
 Use Generator to generate negative samples, replacing infeasible sovled
 From Generator’s point of view
 Still generate the object without correspondence (pixelbypixel)
 But it is led by the instruction of the discriminator with global view
Basic Theory
Generation
 The realistic images from dataset can be regarded as the lowdimensional manifold in highdimensional space. Sampling from the dataset is the process of sampling from the manifold in the space.
Info: Manifold, in short, generalizes the triangle, circle, squares, etc. to higher dimensions. If we take a closer look at the Ndimensional manifold (Nmanifold), it looks like N1 dimensional manifold.

The ensemble of images from the sampling constitute the data distribution $P_{data}(x)$, which only occupies the tiny part of the whole highdimensional space.

The point outside of the manifold means there will be low probablity of the true sampling.

Building a generator is the process of building a mapping from one known distribution to the data distribution in high level.
Maximum Likelihood Estimation
 Target
 Given a data distribution $P_{data}(x)$
 We target at a distribution $P_{G}(x;\theta)$ parameterized by $\theta$
 To find $\theta$ such that $P_{G}(x;\theta)$ closed to $P_{data}(x)$
 e.g. if we assume $P_{G}(x;\theta)$ is Guassian Mixture Model, $\theta$ are means and variances of the Gaussians
 Step
 Derive $\{ x^{1}, x^{2}, \ldots, x^{m} \}$ by sampling from $P_{data}(x)$
 We can compute corresponding $P_{G}(x^{i};\theta)$
 Then find $\theta^{\ast}$ maximizing the likelihood
 Maximum Likelihood Estimation = Minimize KL Divergence
 KLdivergence measures the difference between two distribution $P_{\text {data}}$ and $P_{\text {G}}$
\(\begin{aligned} KL\left(P_{\text {data}} \ P_{G}\right) &= P_{\text {data}} \log P_{data} P_{data} \log P_{\text {G}} \\[8px] &= P_{\text {data}} \log \frac{P_{data}}{P_{\text {G}}} \end{aligned}\)  $x \sim data$ means $x$ is sampled from the distribution of data.
 $P_{data}(x)$ means the probablity of the distribution
data
at $x$ sampling point.  $P_{G}(x;\theta)$ means the probablity of the distribution
G
at $x$ sampling point with parameter $\theta$.
 KLdivergence measures the difference between two distribution $P_{\text {data}}$ and $P_{\text {G}}$
TODO: Expectation Maximum
 What if $P_G$ is not Gaussian Mixture Model but more complex and general model?
 The mean and variance of the more complex model is difficult to get, since the explicit expression of the corresponding probability density function is unknown.
 Thus parameters $\theta$ are implicit and impossible to found.
Drawback of GMM
 In the past, Generator is built as Guassian Mixture model (GMM) with explicit parameters $\theta$.
 As mentioned before, GMM fails to represent the lowdimensional manifold which images lie along in highdimensional space.
 Therefore, no matter what
mean
andvariance
of GMM looks like, it fails to resemble the target distribution.  As a result, the output of GMM is obscure and fuzzy.
Generator
 Generator
G
is a network, which defines a probability distirbution $P_G$, mapping the input distribution to the target. The network has an ability to approximate any functions.
 The input distribution is not decisive, normal distribution, uniform distribution, etc.
 Even the simplest distribution of input $z$ can be mapped to the complex one.
 The explicit formulations of $P_G$ and $P_{data}$ are still unknown, so it is impossible to do gradient descent on $G$ to minimize divergence.
 But we can sample data from two distribution $P_G$ and $P_{data}$.
 Sampling from $P_{data}$: randomly sample from the database.
 Sampling from $P_G$: randomly sample from normal, fed into the network and get outputs.
 How to compute the divergence between samplings from two distributions? Discriminator
 But we can sample data from two distribution $P_G$ and $P_{data}$.
Info: Div
is the divergence between two distribution, e.g. KLdivergence, JSdivergence.
Discriminator

Recall the objective function for generator
G
:
\(G^{*}=\arg \min _G \operatorname{Div}\left(P_{G}, P_{\text {data}}\right)\)  The objective function for discriminator
D
:
\(D^{*}=\arg \max _D \operatorname{V}(D, G) \\ V(G, D)=E_{x \sim P_{\text {data}}}[\log D(x)]+E_{x \sim P_{G}}[\log (1D(x))]\)  This is exactly same as training a Binary Classifier
 $P_{data}$ is binary class
1
 $P_G$ is binary class
0
D
is a binary classifier Binary classifier is to maximize Cross Entropy rather than Mean Square Error, since MSEbased objective is nonconvex optimization problem
 $P_{data}$ is binary class
 The maximum objective is related to JS divergence.
 Given
G
, the optimal $D^*$ maximizing:
\(\begin{aligned} D^{*}&=\arg \max _D \operatorname{V}(D, G) \\[10px] V(G, D) &=E_{x \sim P_{\text {data}}}[\log D(x)]+E_{x \sim P_{G}}[\log (1D(x))] \\[10px] &=\int_{x} P_{d a t a}(x) \log D(x) d x+\int_{x} P_{G}(x) \log (1D(x)) d x \\[10px] &=\int_{x}\left[P_{d a t a}(x) \log D(x)+P_{G}(x) \log (1D(x))\right] d x \end{aligned}\)  Assume that
D(x)
can represent any function. Consider the integral as the sum of each piecewise function(分段函数) for $x_1, x_2, …$, since we have assumed
D(x)
can represent any function, and neural network indeed can be regarded as piecewise function mathematically.  When each piecewise function reaches the largest individually, the final integral must also be the largest.
 Consider the integral as the sum of each piecewise function(分段函数) for $x_1, x_2, …$, since we have assumed
 Discretize the above equation, then the objective will be:
\(\begin{cases} P_{d a t a}(x) \log D(x)+P_{G}(x) \log (1D(x)) \\[10px] a = P_{data},\ b = P_{G} \end{cases}\) Solve for the extremum point:
\(\begin{aligned} \mathrm{f}(D) &=\operatorname{alog}(D)+b \log (1D) \\[8px] \frac{d \mathrm{f}(D)}{d D} &=a \times \frac{1}{D}+b \times \frac{1}{1D} \times (1) =0 \\[8px] a \times\left(1D^{*}\right) &=b \times D^{*} \\[5px] 0< D^{*} &=\frac{a}{a+b} =\frac{P_{\text {data}}(x)}{P_{\text {data}}(x)+P_{G}(x)}<1 \\[8px] 0< D^{*} &<1 \end{aligned}\)
 Solve for the extremum point:
 Substitue the optimal $D^* = \frac{P_{\text {data}}(x)}{P_{\text {data}}(x)+P_{G}(x)}$:
\(\begin{aligned} V &=E_{x \sim P_{\text {data}}}[\log D(x)]+E_{x \sim P_{G}}[\log (1D(x))] \\[15px] &=E_{x \sim P_{\text {data}}}\left[\log \frac{P_{\text {data}}(x)}{P_{\text {data}}(x)+P_{G}(x)}\right]+E_{x \sim P_{G}}\left[\log \frac{P_{G}(x)}{P_{\text {data}}(x)+P_{G}(x)}\right] \\[15px] &=\int_{x} P_{\text {data}}(x) \log \frac{P_{\text {data}}(x)}{P_{\text {data}}(x)+P_{G}(x)} d x+\int_{x} P_{G}(x) \log \frac{P_{G}(x)}{P_{\text {data}}(x)+P_{G}(x)} d x \\[15px] &=\int_{x} P_{\text {data}}(x) \log \frac{\frac{1}{2}P_{\text {data}}(x)}{\frac{1}{2}[P_{\text {data}}(x)+P_{G}(x)]} d x+\int_{x} P_{G}(x) \log \frac{\frac{1}{2}P_{G}(x)}{\frac{1}{2}[P_{\text {data}}(x)+P_{G}(x)]} d x \\[15px] &=2log2 + \int_{x} P_{\text {data}}(x) \log \frac{P_{\text {data}}(x)}{\frac{1}{2}[P_{\text {data}}(x)+P_{G}(x)]} d x+\int_{x} P_{G}(x) \log \frac{P_{G}(x)}{\frac{1}{2}[P_{\text {data}}(x)+P_{G}(x)]} d x \\[15px] &=2log2 + K L\left(P_{\text {data}} \ \frac{P_{\text {data}}(x)+P_{G}(x)}{2}\right) + K L\left(P_{\text {G}} \ \frac{P_{\text {data}}(x)+P_{G}(x)}{2}\right) \\[15px] &=2log2 + \color{red}{2JS\left(P_{\text {data}} \ P_{\text {G}}\right)} \end{aligned}\)
 JS divergence is symmetric, while KL divergence is asymmetric. \(\begin{aligned} JS\left(P_{\text {data}} \ P_{G}\right) &= \frac{1}{2}[KL(P_{\text {data}} \ M) + KL(P_{\text {G}} \ M)] \\ M &= \frac{1}{2}(P_{data} + P_{G}) \end{aligned}\)
TODO: inverse KL divergence
GAN
 The objective of Generator
G
is to minimize the divergence between $P_G$ and $P_{data}$  The divergence is measured by Discriminator
D
, which maximizes the cross entropy as a binary classifier.  The maximum objective is related to JS divergence.
 For each fixed generator, we can find a discriminator with the maximum $V(D, G_i)$
 For each $V(D, G_i)$, we can get a minimum generator
 In each iteration
 Fix generator
G
, update discriminator $D$.  Fix discriminator
D
, update generator $G$.
 Fix generator
Algorithm
 To find the best
G
minimizing the loss function
\(G^{*}=\arg \min _G \max _D \operatorname{V}(D, G) = \arg\min_G L(G)\\ \theta_{G} \leftarrow \theta_{G}\eta \partial L(G) / \partial \theta_{G}\) Let the maximum value $\max_D V(G,D)$ as $L(G)$
 $\theta_G$ defines
G
 The gradient of the max function is the gradient of max value of each piecewise function.
 Given $G_0$
 Find $D^*_0$ maximizing $V(G_0, D)$ , using Gradient Ascent
 $V(G_0, D_{0}^{*} )$ is the JS divergence between $P_data(x)$ and $P_{G_0}(x)$
 Obtain $G_1$
\(\theta_{G} \leftarrow \theta_{G}\eta \partial V\left(G, D_{0}^{*}\right) / \partial \theta_{G}\)  Give $G_1$
 Find $D^*_0$ maximizing $V(G_0, D)$ , using Gradient Ascent
 $V(G_1, D_1^*)$ is the JS divergence between $P_data(x)$ and $P_{G_0}(x)$
 Obtain $G_2$ \(\theta_{G} \leftarrow \theta_{G}\eta \partial V\left(G, D_{1}^{*}\right) / \partial \theta_{G}\)
Critical: Actually, after we get the optimal $D_i^*$ by maximizing the JSdivergence, given the fixed $G_i$, we do gradient descent optimizing $G_i$ a little bit per step to finally get $G_{i+1}$. However, in the process of the gradient descent of $G_i$ to $G_{i+1}$, the divergence function $V(G_i, D_i)$ may change a lot, which cause the maximum objective value of JSdivergence
shift to somewhere else.
 Therefore, the important thing is that we don’t update $G$ too much for every steps, and $D$ is welltrained comparing to current $G$, so that the function of $V(G_i, D_i)$ won’t change a lot and $D_i^* \approx D_{i+1}^*$.
In practical
 Actually, We can not get the expectation value of $P_{data}$ or $P_G$
 In other words, we can not calculate the maximum value of the below equation:
\(V =E_{x \sim P_{\text {data }}}[\log D(x)]+E_{x \sim P_{G}}[\log (1D(x))]\)
 In other words, we can not calculate the maximum value of the below equation:
 But we can do sampling from $P_{data}$ or $P_G$ to approximate the expectation.
 Given $G$, to compute $\max_D V(G,D)$
 Sample ${\tilde{x}^{1}, \tilde{x}^{2}, \ldots, \tilde{x}^{m}}$ from $P_{data}(x)$, sample ${x^{1}, x^{2}, \ldots, x^{m}}$ from generator $P_G(x)$
 To maximize:
\(\tilde{V}=\frac{1}{m} \sum_{i=1}^{m} \log D(x^{i})+\frac{1}{m} \sum_{i=1}^{m} \log (1D(\tilde{x}^{i}))\)
 Thus, $D$ is a binary classifier with sigmoid output, which can be designed to be very deep.
 ${\tilde{x}^{1}, \tilde{x}^{2}, \ldots, \tilde{x}^{m}}$ from $P_{data}(x)$ is
Positive examples
.  ${x^{1}, x^{2}, \ldots, x^{m}}$ from generator $P_G(x)$ is
Negative examples
.  This is equivalent to minimize Crossentropy.
 ${\tilde{x}^{1}, \tilde{x}^{2}, \ldots, \tilde{x}^{m}}$ from $P_{data}(x)$ is
Total Algorithm
Conclusion
 In each training iteration:
 Learning D: find lower bound of $\max_D V(G,D)$ and repeat
k
times for fully trained. Sample
m
examples $\{ x^{1}, x^{2}, \ldots, x^{m} \}$ from database distribution $P_{data}(x)$.  Sample
m
noise samples $\{z^{1}, z^{2}, \ldots, z^{m}\}$ from the prior $P_{prior}(z)$.  Obtaining generated data $\{\tilde{x}^{1}, \tilde{x}^{2}, \ldots, \tilde{x}^{m}\}, \tilde{x}^{i}=G\left(z^{i}\right), \tilde{x}^i = G(z^i)$.
 Update discriminator paramters $\theta_{d}$ to maximize.
\(\begin{aligned} &\tilde{V}=\frac{1}{m} \sum_{i=1}^{m} \log D\left(x^{i}\right)+\frac{1}{m} \sum_{i=1}^{m} \log \left(1D\left(\tilde{x}^{i}\right)\right) \\ &\theta_{d} \leftarrow \theta_{d}\color{red}{+}\eta \nabla \tilde{V}\left(\theta_{d}\right) \end{aligned}\)
 Sample
 Learning G: find the minimum of $\min_G V(G,D)$ only once.
 Sample another
m
noise samples $\{z^{1}, z^{2}, \ldots, z^{m}\}$ from the prior $P_{prior}(z)$.  Update generator parameters $\theta_{g}$ to maximize
\(\begin{aligned} &\tilde{V}=\underset{\text{irrelevant to } G}{\boxed{\frac{1}{m} \sum_{i=1}^{m} \log D\left(x^{i}\right) } } +\frac{1}{m} \sum_{i=1}^{m} \log \left(D\left(G\left(z^{i}\right)\right) \right) \\ &\theta_{g} \leftarrow \theta_{g}\color{red}{}\eta \nabla \tilde{V}\left(\theta_{g}\right) \end{aligned}\)
 Learning D: find lower bound of $\max_D V(G,D)$ and repeat
Note that it is impossible for $D$ to find the maximum value of $\max_D V(G,D)$, which means that the JSdivergence can never be reached, because:
1. The learning ability of $D$ is limited due to the iteration times and learning rate, the $D$ can not convergence in several update steps.
2. Even with the infinite training step and the convergence achieved, it is more likely for $D$ to fall in local minima.
3. Even without trapped in local minima, the capcity of the generatlization of $D$ is limited, which contradict the assumption that $D$ can represent all functions.
In Real Implementation
 The training of $G$ would be slow because of the initial output of $D(x)$ is closed to $0$ and the corresponding gradient to $G$ is very small. \(V =\underset{\text{irrelevant to } G}{\boxed{E_{x \sim P_{\text {data } }}[\log D(x)]}}+E_{x \sim P_{G}}[\log (1D(x))]\)
 Two workarounds
 Replace the target with the following, greatly increase the gradient of the very beginning stage, Minimax GAN (MMGAN) \(V = E_{x \sim P_{G}}[ \log (D(x))]\)
 Replace the negative label of $G$ with the positive label, Nonsaturating GAN (NSGAN)