Text2Scene: Generating Compositional Scenes from Textual Descriptions



Notes on Text2Scene: Generating Compositional Scenes from Textual Descriptions


Unlike previous work, this paper does not rely on a generative adversarial network; instead it combines an encoder-decoder architecture with a semi-parametric retrieval-based approach. With minor modifications, the model generates different forms of scene representation reasonably well, including clip-art generation on Abstract Scenes, semantic layouts on COCO, and compositional image generation on COCO.


Difficulties encountered

Generating scenes from rich textual descriptions poses two main challenges. First, the input text only hints indirectly at the presence of attributes (e.g. “Mike is surprised” should change the facial attributes of the generated object “Mike”). Second, the text describes relative spatial configurations (e.g. “Mike is running towards his girlfriend” makes the orientation of “Mike” depend on “his girlfriend”).

Related work is as follows:

  • Semi-parametric Image Synthesis proposes a retrieval-based semi-parametric method for image synthesis from a human-provided input. Whereas that work takes a ground-truth semantic layout as input, this work learns to predict the location and layout of objects indirectly from the text.

  • Image generation from scene graphs proposes a graph-convolution model that generates images from a structured scene graph, where objects and their relationships are provided as inputs. In this work, the presence of objects is inferred from the text.

  • Inferring semantic layout for hierarchical text-to-image synthesis generates the layout as an intermediate representation in separately trained modules. This work instead generates pixel-wise outputs with a semi-parametric retrieval module and without adversarial training.

  • Visual dialog for collaborative drawing performs pictorial generation from chat logs, whereas this work takes text that is considerably more underspecified.

Network structure

Using a sequence-to-sequence approach, the model places objects one at a time, together with their attributes (locations, sizes, aspect ratios, pose, appearance and more), on an initially empty canvas.

At a high level, the Text2Scene model consists of:

  • a text encoder that maps the input sentence to a set of embedding representations
  • an object decoder that predicts the next object to generate, conditioned on the current scene state
  • an attribute decoder that determines the attributes of the next object
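
The interplay of the three components can be pictured as a simple generation loop. The sketch below is only an illustration: the `<start>` and `<end>` tokens, the function names, and the toy decoders are all assumptions made here, and the raw word list stands in for the output of the text encoder.

```python
def generate_scene(words, predict_object, predict_attributes, max_steps=10):
    """Add one object per step to an initially empty canvas."""
    canvas = []               # (object, attributes) pairs placed so far
    prev_obj = "<start>"      # hypothetical start token
    for _ in range(max_steps):
        obj = predict_object(words, canvas, prev_obj)
        if obj == "<end>":    # hypothetical end-of-sequence token
            break
        attrs = predict_attributes(words, canvas, obj)
        canvas.append((obj, attrs))
        prev_obj = obj
    return canvas


# Toy stand-ins for the learned decoders (illustration only).
VOCAB = ["mike", "jenny", "ball"]

def toy_object_decoder(words, canvas, prev_obj):
    # Pick the first mentioned vocabulary object that is not on the canvas yet.
    placed = {obj for obj, _ in canvas}
    for word in words:
        if word in VOCAB and word not in placed:
            return word
    return "<end>"

def toy_attribute_decoder(words, canvas, obj):
    # Put each new object in the next free grid column.
    return {"location": (0, len(canvas))}

scene = generate_scene("mike kicks the ball".split(),
                       toy_object_decoder, toy_attribute_decoder)
```

In the real model both decoders are learned networks that also see the encoded canvas; the loop structure is the only part this sketch tries to convey.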


Text encoder

For each word $i$ of the input text, the text encoder computes:

$$h_{i}^{E}=\operatorname{BiGRU}\left(x_{i}, h_{i-1}^{E}, h_{i+1}^{E}\right)$$

Here BiGRU is a bidirectional GRU cell, $x_{i}$ is a word embedding and $h_{i}^{E}$ is a vector encoding the current word and its context.
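
To make the bidirectional encoding concrete, here is a minimal NumPy sketch. It assumes one common GRU formulation without biases, random weights, and made-up dimensions; none of these choices come from the paper. Each output row concatenates a forward state (context to the left) with a backward state (context to the right).

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def init_gru(rng, d_in, d_h):
    """Random weights for one GRU direction (biases omitted for brevity)."""
    p = {}
    for gate in "zrn":
        p["W" + gate] = rng.normal(scale=0.1, size=(d_h, d_in))
        p["U" + gate] = rng.normal(scale=0.1, size=(d_h, d_h))
    return p

def gru_cell(x, h, p):
    z = sigmoid(p["Wz"] @ x + p["Uz"] @ h)          # update gate
    r = sigmoid(p["Wr"] @ x + p["Ur"] @ h)          # reset gate
    n = np.tanh(p["Wn"] @ x + p["Un"] @ (r * h))    # candidate state
    return (1.0 - z) * h + z * n

def run_gru(xs, p, d_h):
    h, states = np.zeros(d_h), []
    for x in xs:
        h = gru_cell(x, h, p)
        states.append(h)
    return states

def bigru(xs, p_fwd, p_bwd, d_h):
    """Return one row per word: [forward state; backward state]."""
    fwd = run_gru(xs, p_fwd, d_h)
    bwd = run_gru(xs[::-1], p_bwd, d_h)[::-1]
    return np.stack([np.concatenate(pair) for pair in zip(fwd, bwd)])

rng = np.random.default_rng(0)
d_in, d_h = 4, 3
embeddings = rng.normal(size=(5, d_in))   # five word embeddings x_i
h_E = bigru(embeddings, init_gru(rng, d_in, d_h), init_gru(rng, d_in, d_h), d_h)
```

Changing the last word leaves the forward half of the first row untouched but alters its backward half, which is the sense in which each $h_{i}^{E}$ encodes context from both sides.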

Image encoder and recurrent module

A convolutional network $\Omega$ encodes the current canvas $B_{t}$ into a $\mathcal{C} \times H \times W$ feature map representing the current scene state, and a convolutional GRU models the history of the scene states $\left\{h_{t}^{D}\right\}$. $h_{t}^{D}$ therefore carries both temporal and spatial information about the scene.

$$h_{t}^{D}=\operatorname{ConvGRU}\left(\Omega\left(B_{t}\right), h_{t-1}^{D}\right)$$

In case $h_{t}^{D}$ fails to capture small objects, the object predicted at the previous step, $o_{t-1}$, is also provided as input to the downstream decoders.
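
A convolutional GRU is a GRU whose matrix products are replaced by convolutions, so the hidden state keeps its spatial layout. The NumPy sketch below assumes 1x1 convolutions, no biases, and random features in place of $\Omega\left(B_{t}\right)$, purely to keep the example short; the kernel sizes and channel counts used in the paper are not stated in these notes.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def conv1x1(weight, x):
    """1x1 convolution: weight (C_out, C_in) applied to x (C_in, H, W)."""
    return np.einsum("oc,chw->ohw", weight, x)

def conv_gru_step(feat, h_prev, p):
    """One recurrent update; every gate is a feature map of shape (C_h, H, W)."""
    z = sigmoid(conv1x1(p["Wz"], feat) + conv1x1(p["Uz"], h_prev))
    r = sigmoid(conv1x1(p["Wr"], feat) + conv1x1(p["Ur"], h_prev))
    n = np.tanh(conv1x1(p["Wn"], feat) + conv1x1(p["Un"], r * h_prev))
    return (1.0 - z) * h_prev + z * n

rng = np.random.default_rng(0)
c_in, c_h, H, W = 8, 6, 4, 4
params = {}
for gate in "zrn":
    params["W" + gate] = rng.normal(scale=0.1, size=(c_h, c_in))
    params["U" + gate] = rng.normal(scale=0.1, size=(c_h, c_h))

h = np.zeros((c_h, H, W))                         # initial scene state
for _ in range(3):
    canvas_feat = rng.normal(size=(c_in, H, W))   # stand-in for Omega(B_t)
    h = conv_gru_step(canvas_feat, h, params)
```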

Attention-based object decoder

The attention-based object decoder outputs a likelihood score for every object in the object vocabulary $\mathcal{V}$. It takes as input:

  • the current scene state $h_{t}^{D}$
  • the text features $\left\{\left(h_{i}^{E}, x_{i}\right)\right\}$
  • the previously predicted object $o_{t-1}$

Spatial attention convolutional network

$\Psi^{o}$ is a convolutional network with spatial attention on $h_{t}^{D}$ that collects the spatial context needed for object prediction, e.g. which objects have already been added. An average pooling layer then fuses the attended features into a vector $u_{t}^{o}$.

$$u_{t}^{o}=\operatorname{AvgPooling}\left(\Psi^{o}\left(h_{t}^{D}\right)\right)$$
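
One way to picture this step is the sketch below. It assumes the spatial attention is a softmax over locations computed from a single learned weight vector `w_att`; that parameterization is a guess made for illustration, not the architecture of $\Psi^{o}$ in the paper.

```python
import numpy as np

def attend_and_pool(h, w_att):
    """h: (C, H, W) scene state; w_att: (C,) hypothetical attention weights."""
    C, H, W = h.shape
    flat = h.reshape(C, H * W)
    scores = w_att @ flat                 # one score per spatial location
    att = np.exp(scores - scores.max())
    att = att / att.sum()                 # softmax over the H*W locations
    attended = flat * att                 # reweight every channel by the map
    u = attended.mean(axis=1)             # average pooling over locations
    return u, att.reshape(H, W)

rng = np.random.default_rng(0)
scene_state = rng.normal(size=(6, 4, 4))
u_o, att_map = attend_and_pool(scene_state, rng.normal(size=6))
```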

Text-based attention module

The text-based attention module $\Phi^{o}$ uses $u_{t}^{o}$, together with $o_{t-1}$, to attend to the text features $\left\{\left(h_{i}^{E}, x_{i}\right)\right\}$. Ideally, the resulting context vector $c_{t}^{o}$ encodes information about all described objects that have not yet been added to the scene.

$$c_{t}^{o}=\Phi^{o}\left(\left[u_{t}^{o} ; o_{t-1}\right],\left\{\left(h_{i}^{E}, x_{i}\right)\right\}\right)$$
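
The attention over words can be sketched as follows. The example assumes plain dot-product attention with a hypothetical projection matrix `W_q` that maps the query $\left[u_{t}^{o} ; o_{t-1}\right]$ into the space of the text features; the scoring function actually used in the paper may differ.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def text_attention(query, text_feats, W_q):
    """query: (d_q,), text_feats: (T, d_text), W_q: (d_text, d_q)."""
    q = W_q @ query                   # project the query into text space
    scores = text_feats @ q           # one score per word
    weights = softmax(scores)         # attention distribution over words
    context = weights @ text_feats    # weighted sum of the text features
    return context, weights

rng = np.random.default_rng(0)
u_o = rng.normal(size=6)                      # pooled scene vector
o_prev = np.eye(5)[2]                         # one-hot previous object
text_feats = rng.normal(size=(7, 10))         # rows stand for [h_i^E; x_i]
W_q = rng.normal(scale=0.1, size=(10, 11))
c_o, word_weights = text_attention(np.concatenate([u_o, o_prev]), text_feats, W_q)
```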

Likelihood predicting perceptrons

$\Theta^{o}$ is a two-layer perceptron that predicts the likelihood of the next object $p\left(o_{t}\right)$ using a softmax function.

$$p\left(o_{t}\right) \propto \Theta^{o}\left(\left[u_{t}^{o} ; o_{t-1} ; c_{t}^{o}\right]\right)$$
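
A two-layer perceptron followed by a softmax can be written in a few lines. In this sketch the ReLU activation, the hidden size, and the random weights are assumptions; only the concatenated input and the softmax output follow the description above.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def object_likelihood(u, o_prev, c, W1, b1, W2, b2):
    """Return a probability for every object in the vocabulary."""
    z = np.concatenate([u, o_prev, c])        # [u_t^o; o_{t-1}; c_t^o]
    hidden = np.maximum(0.0, W1 @ z + b1)     # first layer (ReLU assumed)
    return softmax(W2 @ hidden + b2)          # second layer + softmax

rng = np.random.default_rng(0)
vocab_size, d_u, d_c, d_hidden = 5, 6, 10, 16
u = rng.normal(size=d_u)
o_prev = np.eye(vocab_size)[1]
c = rng.normal(size=d_c)
W1 = rng.normal(scale=0.1, size=(d_hidden, d_u + vocab_size + d_c))
W2 = rng.normal(scale=0.1, size=(vocab_size, d_hidden))
p_obj = object_likelihood(u, o_prev, c, W1, np.zeros(d_hidden), W2, np.zeros(vocab_size))
next_obj = int(np.argmax(p_obj))              # greedy choice of o_t
```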

Attention-based attribute decoder

For each spatial location in $h_{t}^{D}$, this part predicts both a location likelihood $\left\{l_{t}^{i}\right\}_{i=1 \ldots N}$ and attribute likelihoods $\left\{R_{t}^{k}\right\}$ for the object $o_{t}$. Possible locations are discretized to the same resolution as $h_{t}^{D}$.

“Zoom in” module

$\Phi^{a}$ zooms in on the language context of $o_{t}$ by attending to the input text features.

$$c_{t}^{a}=\Phi^{a}\left(o_{t},\left\{\left(h_{i}^{E}, x_{i}\right)\right\}\right)$$

Location prediction convolutional network

Compared to $c_{t}^{o}$, which may contain information about objects that have not been added yet, $c_{t}^{a}$ focuses more specifically on content related to the current object $o_{t}$.

$$u_{t}^{a}=\Psi^{a}\left(\left[h_{t}^{D} ; c_{t}^{a}\right]\right)$$

$\Psi^{a}$, a CNN that spatially attends to $h_{t}^{D}$, then finds an affordable location at which to add $o_{t}$.

Likelihood predicting convolutional network

$\Theta^{a}$ is implemented by a convolutional network with softmax classifiers over each value of $l_{t}$ and the discrete $R_{t}^{k}$.

$$p\left(l_{t},\left\{R_{t}^{k}\right\}\right)=\Theta^{a}\left(\left[u_{t}^{a} ; o_{t} ; c_{t}^{a}\right]\right)$$
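
The two kinds of softmax can be illustrated with a small NumPy sketch. It assumes a single attribute with `K` discrete values, 1x1 convolutions written as matrix products, and random inputs; the names `W_loc` and `W_attr` are hypothetical. The location softmax runs over all grid cells, while the attribute softmax runs over attribute values separately at every cell.

```python
import numpy as np

def attribute_head(feat, W_loc, W_attr):
    """feat: (C, H, W); W_loc: (C,); W_attr: (K, C) for one discrete attribute."""
    C, H, W = feat.shape
    flat = feat.reshape(C, H * W)

    # Location likelihood: one softmax over all H*W grid cells.
    loc_logits = W_loc @ flat
    loc = np.exp(loc_logits - loc_logits.max())
    p_loc = (loc / loc.sum()).reshape(H, W)

    # Attribute likelihood: a softmax over the K values at every cell.
    attr_logits = W_attr @ flat
    attr = np.exp(attr_logits - attr_logits.max(axis=0, keepdims=True))
    p_attr = (attr / attr.sum(axis=0, keepdims=True)).reshape(-1, H, W)
    return p_loc, p_attr

rng = np.random.default_rng(0)
feat = rng.normal(size=(8, 4, 4))             # stand-in for [u_t^a; o_t; c_t^a]
p_loc, p_attr = attribute_head(feat, rng.normal(size=8), rng.normal(size=(3, 8)))

# Pick the most likely cell, then read the attribute distribution there.
row, col = np.unravel_index(np.argmax(p_loc), p_loc.shape)
attr_at_loc = p_attr[:, row, col]
```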

Foreground patch embedding

For the third task, generating images composed of patches retrieved from other images, an additional embedding vector $Q_{t}$ is predicted at every location in the output feature map. At test time, $Q_{t}$ is used to retrieve similar patches from a pre-computed collection of object segments from other images. A patch embedding network, implemented as a CNN, reduces the foreground patch of the target image to a 1D vector $F_{t}$. To bring $Q_{t}$ close to $F_{t}$ in $\ell_{2}$ distance, training uses the triplet embedding loss ($P^{\text{color}}$, $P^{\text{mask}}$, $P^{\text{context}}$), which minimizes the distance $\left\|Q_{t}, F_{t}\right\|_{2}$ and maximizes the distance $\left\|Q_{t}, F_{k}\right\|_{2}$. Here $F_{k}$ is the feature of a negative patch randomly selected from the same category as $F_{t}$.

$$L_{\text{triplet}}\left(Q_{t}, F_{t}\right)=\max \left\{\left\|Q_{t}, F_{t}\right\|_{2}-\left\|Q_{t}, F_{k}\right\|_{2}+\alpha, 0\right\}$$

where $\alpha$ is a margin hyper-parameter.
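
As a quick numerical illustration, the triplet loss can be computed as below. The sketch reads $\left\|Q_{t}, F_{t}\right\|_{2}$ as the Euclidean distance between the two vectors, and the margin value of 0.5 is an arbitrary choice for the example.

```python
import numpy as np

def triplet_loss(q, f_pos, f_neg, margin=0.5):
    """Hinge on (distance to positive) - (distance to negative) + margin."""
    d_pos = np.linalg.norm(q - f_pos)   # distance to the matching patch F_t
    d_neg = np.linalg.norm(q - f_neg)   # distance to the negative patch F_k
    return max(d_pos - d_neg + margin, 0.0)

q = np.array([0.0, 0.0])
near = np.array([0.0, 0.0])
far = np.array([3.0, 4.0])

loss_good = triplet_loss(q, f_pos=near, f_neg=far)   # positive is closer
loss_bad = triplet_loss(q, f_pos=far, f_neg=near)    # positive is farther
```

The loss is zero once the positive patch is closer than the negative one by at least the margin, and it grows linearly otherwise.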


The loss function with reference values $\left(o_{t}, l_{t},\left\{R_{t}^{k}\right\}, F_{t}\right)$ is:

$$\begin{aligned} L = & -w_{o} \sum_{t} \log p\left(o_{t}\right)-w_{l} \sum_{t} \log p\left(l_{t}\right)-\sum_{k} w_{k} \sum_{t} \log p\left(R_{t}^{k}\right) \\ & +w_{e} \sum_{t} L_{\text{triplet}}\left(Q_{t}, F_{t}\right)+w_{a}^{O} L_{\text{attn}}^{O}+w_{a}^{A} L_{\text{attn}}^{A} \end{aligned}$$

where $L_{\text{attn}}^{*}$ are regularization terms inspired by the doubly stochastic attention module proposed in Section 4.2.1 of Show, attend and tell: Neural image caption generation with visual attention, and the $w$ are hyper-parameters controlling the relative contribution of each loss.
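
For reference, the doubly stochastic regularizer in Show, attend and tell penalizes attended items whose attention, summed over all time steps, deviates from 1. The sketch below implements that penalty for a generic attention matrix; how exactly Text2Scene applies it to its object and attribute attention is not covered in these notes, so treat the mapping to $L_{\text{attn}}^{O}$ and $L_{\text{attn}}^{A}$ as an assumption.

```python
import numpy as np

def attention_regularizer(alpha):
    """alpha: (T, N) attention weights, one row per step, each row sums to 1.

    Returns sum_i (1 - sum_t alpha[t, i])**2, which is zero when every
    attended item receives a total weight of exactly 1 across all steps.
    """
    total_per_item = alpha.sum(axis=0)
    return float(np.sum((1.0 - total_per_item) ** 2))

balanced = np.eye(3)                          # each step attends a new item
collapsed = np.array([[1.0, 0.0],
                      [1.0, 0.0]])            # both steps attend item 0

reg_balanced = attention_regularizer(balanced)
reg_collapsed = attention_regularizer(collapsed)
```

The collapsed case is penalized because one item is attended twice and the other never, which is the behaviour the regularizer discourages.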


The Text2Scene model proves capable on both abstract and real images, which opens the possibility for future work on transfer learning across domains.
