Learning where and what to draw

Notes on the paper Learning What and Where to Draw.

Overview

Previous methods conditioned only on variables such as a class label or a non-localized caption, and did not allow control over where objects appear in the image.

This model learns image synthesis that is controllable in both content and location, that is, what to draw and where. There are two ways to encode the spatial constraints. The first incorporates spatial masking and cropping modules into a text-conditional GAN using spatial transformers. The second specifies the locations of object parts by a set of normalized coordinates (x, y), combined through a multiplicative gating mechanism.

Previous work

Two types of text-to-image model have been proposed: models trained to learn one-to-one mappings from the latent space to pixel space, and probabilistic models that approximate the distribution of each pixel TOREAD. In addition, GANs generally produce sharper images than VAE models.

Spatial Transformer Networks (STN) are an effective visual attention mechanism.

Network structure

The GAN and joint embedding structure used in this model are described in the note on Generative Adversarial Text to Image Synthesis.

Bounding-box conditional model

bounding-box model

Keypoint conditional model

keypoint model

Conditional keypoint generation model

The purpose is to have access to all of the conditional distributions of keypoints, given a subset of observed keypoints and the text description. However, a simple autoencoder is sparse and would not suffice.

A keypoint generation GAN is proposed instead. Using a gating mechanism, it fills in the full set of keypoints when only a few of them are observed.

The discriminator ($D$) of the GAN is simple: it only distinguishes real keypoint and text pairs $\left(\mathbf{k}_{real}, \mathbf{t}_{real}\right)$ from synthetic keypoints.

The generator ($G$) is more complicated and is formulated as follows:

$$G_{k}(z, \mathbf{t}, \mathbf{k}, \mathbf{s}) := \mathbf{s} \odot \mathbf{k} + (1-\mathbf{s}) \odot f(z, \mathbf{t}, \mathbf{k})$$

where $k_{i} := \left\{ x_{i}, y_{i}, v_{i} \right\}$, $i = 1,...,K$. Here $x$ and $y$ indicate the row and column position, and $v$ is a bit set to 1 if the part is visible and 0 otherwise. Note that $\mathbf{k} \in [0,1]^{K \times 3}$ encodes the keypoints into a matrix. The switch units $\mathbf{s} \in \{0,1\}^{K}$ are specified in advance, for example by the user: the $i$-th entry of $\mathbf{s}$ is set to 1 if the corresponding $i$-th part is given as a conditioning variable and 0 otherwise. $\odot$ denotes pointwise multiplication, and $f$ is a 3-layer fully-connected network that maps the concatenation of $z$, $\mathbf{t}$ and the flattened $\mathbf{k}$, a vector in $\mathbb{R}^{Z+T+3K}$, to a vector in $\mathbb{R}^{3K}$.
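The gating formula can be made concrete with a small sketch. The following is only an illustration in plain Python, not the paper's implementation: `make_f` builds an untrained stand-in for $f$, and the hidden width, the ReLU and sigmoid activations, the weight initialisation and all function names are assumptions introduced here. It shows that parts with a switch of 1 are copied from the input while the others are taken from the network output.

```python
import math
import random

def linear(x, weights, bias):
    """Dense layer: one output per row of `weights`."""
    return [sum(w * xi for w, xi in zip(row, x)) + b for row, b in zip(weights, bias)]

def init_layer(rng, n_in, n_out):
    """Random weights and zero biases (illustrative initialisation only)."""
    weights = [[rng.gauss(0.0, 0.1) for _ in range(n_in)] for _ in range(n_out)]
    return weights, [0.0] * n_out

def make_f(rng, z_dim, t_dim, num_parts, hidden=32):
    """Untrained 3-layer fully-connected stand-in for f: R^(Z+T+3K) -> R^(3K)."""
    sizes = [z_dim + t_dim + 3 * num_parts, hidden, hidden, 3 * num_parts]
    layers = [init_layer(rng, a, b) for a, b in zip(sizes, sizes[1:])]

    def f(z, t, k):
        # Concatenate z, t and the flattened K x 3 keypoint matrix.
        h = list(z) + list(t) + [v for row in k for v in row]
        for idx, (w, b) in enumerate(layers):
            h = linear(h, w, b)
            if idx < len(layers) - 1:
                h = [max(0.0, v) for v in h]                 # ReLU (assumed)
            else:
                h = [1.0 / (1.0 + math.exp(-v)) for v in h]  # sigmoid, keeps outputs in [0, 1] (assumed)
        return [h[3 * i:3 * i + 3] for i in range(num_parts)]  # reshape to K x 3
    return f

def generate_keypoints(f, z, t, k, s):
    """G_k(z, t, k, s) = s * k + (1 - s) * f(z, t, k), with s applied to each (x, y, v) row."""
    generated = f(z, t, k)
    return [
        [s_i * k_ij + (1 - s_i) * g_ij for k_ij, g_ij in zip(k_row, g_row)]
        for s_i, k_row, g_row in zip(s, k, generated)
    ]

rng = random.Random(0)
K, Z, T = 4, 8, 16                                  # hypothetical sizes
f = make_f(rng, Z, T, K)
z = [rng.gauss(0.0, 1.0) for _ in range(Z)]         # noise vector
t = [rng.gauss(0.0, 1.0) for _ in range(T)]         # placeholder text embedding
k = [[0.2, 0.3, 1.0], [0.5, 0.5, 1.0], [0.0, 0.0, 0.0], [0.0, 0.0, 0.0]]
s = [1, 1, 0, 0]                                    # first two parts are observed
out = generate_keypoints(f, z, t, k, s)
```

In this example the first two rows of `out` equal the first two rows of `k`, and the last two rows come from the stand-in network.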

As stated above, the parameters of $f$ are trainable. TOREAD

In order for $G_{k}$ to capture all of the conditional distributions over keypoints, the switch units $\mathbf{s}$ are randomly sampled during training.
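One possible way to sample the switch units is sketched below. The note only says that they are sampled randomly, so the independent per-part coin flip, the `keep_prob` parameter and the function name `sample_switches` are assumptions made for illustration.

```python
import random

def sample_switches(rng, num_parts, keep_prob=0.5):
    """Draw s in {0,1}^K; each part is marked as observed with probability keep_prob (assumed scheme)."""
    return [1 if rng.random() < keep_prob else 0 for _ in range(num_parts)]

rng = random.Random(0)
num_parts = 5                                   # hypothetical number of parts K
# A fresh switch vector per training example exposes the generator to
# many different subsets of observed keypoints.
batch_switches = [sample_switches(rng, num_parts) for _ in range(4)]
```

Because the conditioning subset changes from example to example, a single generator is trained on many different observed and missing patterns rather than one fixed pattern.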
