The note of Learning What and Where to Draw
Previous methods only used conditioning variables such as a class label or a non-localized caption, and did not allow control over where objects appear in the image.
This model learns to perform image synthesis that is controllable in both content and location, that is, what and where. There are two ways to encode spatial constraints: one incorporates spatial masking and cropping modules into a text-conditional GAN via spatial transformers; the other locates a part of the object by a set of normalized coordinates (x, y) combined through a multiplicative gating mechanism.
Two types of transformation from text to image have been proposed: one learns a one-to-one mapping from the latent space to pixel space, and the other learns a probabilistic model to approximate the distribution of each pixel TOREAD. Besides, GANs generally produce sharper images than VAE models.
Spatial Transformer Networks (STN) are an effective visual attention mechanism.
The GAN and joint embedding structure used in this model are articulated in the note of Generative Adversarial Text to Image Synthesis.
Bounding-box conditional model
Keypoint conditional model
Conditional keypoint generation model
The purpose is to have access to all of the conditional distributions of keypoints, given a subset of observed keypoints and the text description. However, because the observed keypoints are sparse, a simple autoencoder would not suffice.
The keypoint generation GAN is proposed to produce the full set of keypoints from only a few visible ones, using a gating mechanism.
The discriminator D_k of the GAN is simple: it only distinguishes real (keypoints, text) pairs from synthetic keypoints.
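As a rough sketch of such a discriminator (the layer sizes, the 2-layer MLP form, and the random stand-in weights are my assumptions, not details from the paper), it can be a small network scoring a (keypoints, text) pair:

```python
import numpy as np

K, T_DIM = 15, 128                 # assumed part count and text-embedding size
rng = np.random.default_rng(0)

# Random weights stand in for trained parameters of the scoring MLP.
V1 = rng.standard_normal((3 * K + T_DIM, 128)) * 0.02
V2 = rng.standard_normal((128, 1)) * 0.02

def D_k(k, t):
    """Score how plausible keypoints k (a K x 3 matrix) are for text embedding t."""
    h = np.maximum(np.concatenate([k.ravel(), t]) @ V1, 0.0)   # ReLU layer
    return 1.0 / (1.0 + np.exp(-(h @ V2)[0]))                  # sigmoid score
```

A score near 1 would mean "real keypoints matching the text", near 0 "synthetic".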
The generator G_k is relatively complicated and formulated as follows:

k' = G_k(z, t, k, s) := s ⊙ k + (1 − s) ⊙ f(z, t, k)

where k_i = (k_{i,x}, k_{i,y}, k_{i,v}), k_{i,x} and k_{i,y} indicate the row and column position of the i-th part, and k_{i,v} is a bit set to 1 if the part is visible and 0 otherwise. Note that k encodes the keypoints into a K × 3 matrix. The switch unit s is specified beforehand, for example by the user, with the i-th entry of s set to 1 if the corresponding i-th part is observed and 0 otherwise. ⊙ denotes pointwise multiplication (s broadcast across the three columns), and f is a 3-layer fully-connected network that transforms the concatenated z, the text embedding t, and the flattened k of shape 3K to the shape K × 3.
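A minimal NumPy sketch of this gating (the dimensions are assumptions, and random weights stand in for the trained parameters of f):

```python
import numpy as np

K, Z, T = 15, 100, 128             # assumed part count, noise dim, text dim
rng = np.random.default_rng(0)

# f: a 3-layer fully-connected net, here with untrained placeholder weights.
W1 = rng.standard_normal((Z + T + 3 * K, 256)) * 0.02
W2 = rng.standard_normal((256, 256)) * 0.02
W3 = rng.standard_normal((256, 3 * K)) * 0.02

def f(z, t, k):
    """Map concatenated [z; t; flattened k] to a K x 3 keypoint matrix."""
    h = np.concatenate([z, t, k.ravel()])
    h = np.maximum(h @ W1, 0.0)            # ReLU
    h = np.maximum(h @ W2, 0.0)            # ReLU
    return (h @ W3).reshape(K, 3)

def G_k(z, t, k, s):
    """Gated generator: keep the user-specified parts, synthesize the rest."""
    gate = s[:, None]                      # broadcast switch over (row, col, vis)
    return gate * k + (1 - gate) * f(z, t, k)
```

Note how the gate guarantees that observed keypoints pass through unchanged: wherever s_i = 1, the output row equals the input row exactly.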
As stated above, the parameters of f are trainable. TOREAD
In order for G_k to capture all of the conditional distributions over keypoints, the switch units s are randomly sampled during training.
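The sampling could look like the following sketch (the uniform choice of subset size is my assumption; the note does not pin down the exact distribution):

```python
import numpy as np

def sample_switch(num_parts, rng):
    """Return a random binary switch vector s, revealing a random subset of parts."""
    n_visible = rng.integers(0, num_parts + 1)   # reveal between 0 and K parts
    s = np.zeros(num_parts)
    s[rng.choice(num_parts, size=n_visible, replace=False)] = 1.0
    return s

rng = np.random.default_rng(7)
s = sample_switch(15, rng)
```

By varying which subset is revealed at each step, the generator is forced to learn every conditional "complete the missing keypoints" task, not just one fixed pattern.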