Notes on Semi-parametric Image Synthesis
The paper presents a semi-parametric approach to photographic image synthesis from semantic layouts.
- Parametric method
Current parametric methods generally use deep networks whose weights encode all data concerning photographic appearance. Their advantage is end-to-end training of highly expressive models.
- Non-parametric method
Non-parametric methods have drawn on databases of image segments, from which photographic references are retrieved and provided as source material. Their advantage is the ability to draw on large databases of original photographic content at test time.
- Semi-parametric method
This paper combines the complementary strengths of parametric and non-parametric techniques. Given a semantic layout, the system retrieves compatible segments from the database; the retrieved segments are used as raw material for synthesis and composited onto a canvas. Deep networks then align the components, resolve occlusion relationships, and rectify the canvas into the final photographic output.
In summary, the paper draws an analogy to human painters, who do not paint purely from memory (analogous to the weights of a neural network) but also use external references from the real world (analogous to the database of image segments) to reproduce detailed object appearance.
The first and crucial step is to build the external memory bank (the database): a set of color images paired with their corresponding semantic layouts, from which segments of different semantic categories are extracted.
At test time, a semantic map L in {0,1}^(m x n x c), where m x n is the image size and c is the number of semantic classes, and which was not seen during training, is decomposed into connected components {P_i}. For each P_i, a compatible segment from the memory bank M is retrieved based on shape, location, and context, and then aligned to P_i by a spatial transformation network T. With the help of the ordering network O, the relative front-back order of segments is determined, and a canvas C is synthesized with the boundaries of retrieved segments deliberately elided.
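The per-class decomposition of the layout into connected components can be sketched in numpy; the function name, the use of 4-connectivity, and the toy layout below are my assumptions for illustration, not the paper's implementation:

```python
import numpy as np

def connected_components(class_map, class_id):
    """4-connected components of one semantic class in an H x W label map.
    Returns a list of boolean masks, one per component."""
    mask = (class_map == class_id)
    visited = np.zeros_like(mask, dtype=bool)
    comps = []
    H, W = mask.shape
    for i in range(H):
        for j in range(W):
            if mask[i, j] and not visited[i, j]:
                # flood fill from the unvisited seed pixel (i, j)
                stack = [(i, j)]
                comp = np.zeros_like(mask)
                visited[i, j] = True
                while stack:
                    y, x = stack.pop()
                    comp[y, x] = True
                    for dy, dx in ((1, 0), (-1, 0), (0, 1), (0, -1)):
                        ny, nx = y + dy, x + dx
                        if 0 <= ny < H and 0 <= nx < W and mask[ny, nx] and not visited[ny, nx]:
                            visited[ny, nx] = True
                            stack.append((ny, nx))
                comps.append(comp)
    return comps

# Toy layout with two separate blobs of class 1.
layout = np.zeros((5, 6), dtype=int)
layout[0:2, 0:2] = 1
layout[3:5, 4:6] = 1
parts = connected_components(layout, 1)
print(len(parts))  # 2
```

In practice a library routine such as `scipy.ndimage.label` would do the same job faster.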
A synthesis network takes the canvas and the input layout as input, inpaints the missing regions, harmonizes the retrieved segments, blends boundaries, synthesizes shadows, and otherwise adjusts the final appearance.
Furthermore, a cascaded refinement network (CRN) is used to convert coarse, incomplete layouts into dense pixelwise layouts.
Representation of memory bank
A segment P is associated with a tuple (I_P, M_P, S_P), where I_P is a color image that contains the segment (other pixels are zeroed out, as shown in the first image), M_P is a binary mask that specifies the segment's footprint, and S_P is a semantic map representing the semantic context around P within a bounding box enlarged by 25% relative to P's bounding box.
Given a semantic layout at test time, M_P and S_P are computed for each semantic segment P, by analogy with the definitions above. Then the most compatible segment Q* is selected by maximizing an intersection-over-union score:

Q* = argmax_Q [ IoU(M_P, M_Q) + IoU(S_P, S_Q) ]

where the first term measures the overlap of the shapes, while the second measures the similarity of the surroundings, which helps retrieval when the surroundings affect appearance. Q iterates over segments in M that have the same semantic class as P.
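A numpy sketch of this retrieval score, assuming query and candidate masks have already been resampled to a common frame (function names are mine):

```python
import numpy as np

def iou(a, b):
    """Intersection-over-union of two boolean masks of equal shape."""
    inter = np.logical_and(a, b).sum()
    union = np.logical_or(a, b).sum()
    return inter / union if union else 0.0

def best_segment(mask_p, ctx_p, candidates):
    """candidates: list of (mask_q, ctx_q) pairs with the same semantic
    class as P. Score = shape IoU + context IoU; return the argmax index."""
    scores = [iou(mask_p, mq) + iou(ctx_p, cq) for mq, cq in candidates]
    return int(np.argmax(scores)), scores

mask_p = np.zeros((4, 4), bool); mask_p[1:3, 1:3] = True
ctx_p = mask_p.copy()
good = (mask_p.copy(), ctx_p.copy())              # perfect match -> score 2.0
bad = (np.roll(mask_p, 1, axis=1), ctx_p.copy())  # shifted shape -> lower score
idx, scores = best_segment(mask_p, ctx_p, [bad, good])
print(idx)  # 1
```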
The transformation network T is designed to transform Q* to align with P via translation, rotation, scaling, and cropping, while preserving the integrity of the appearance.
T can be trained by simulating the inconsistencies in shape, scale, and location that it encounters at test time: training inputs are produced by applying random affine transformations and crops to segments from the training set.
The loss function for T is as follows:

L(theta) = || I_P - T_theta(I_P') ||_1

where P' is the randomly perturbed copy of P. The loss is defined over the color images rather than the masks, which is more specific and better constrains the transformation.
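The training-pair construction and the color-based loss can be sketched as follows; I use a translation-only perturbation as a stand-in for the full random affine transform and crop, and all names are mine:

```python
import numpy as np

rng = np.random.default_rng(0)

def perturb(segment, max_shift=3):
    """Simulate a test-time misalignment by shifting the segment randomly
    (stand-in for the random affine transform + crop described in the note;
    rotation and scaling are omitted for brevity)."""
    dy, dx = rng.integers(-max_shift, max_shift + 1, size=2)
    return np.roll(np.roll(segment, dy, axis=0), dx, axis=1)

def l1_color_loss(pred, target):
    """L1 loss over color images (not binary masks): texture inside the
    segment penalizes residual misalignment that a silhouette alone
    would not detect."""
    return np.abs(pred.astype(float) - target.astype(float)).mean()

# A segment with internal texture (a gradient), then a perturbed copy.
seg = np.zeros((8, 8))
seg[2:6, 2:6] = np.arange(16).reshape(4, 4)
shifted = perturb(seg)
# A perfect transformation network would map `shifted` back onto `seg`,
# driving l1_color_loss toward zero.
```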
The ordering network O determines the front-back ordering of pairs of adjacent object segments. Its output is a 2-dimensional one-hot vector (which of the two segments is in front), treated as a classification problem with a cross-entropy loss.
For training this network, ground-truth orderings are taken directly from datasets such as Cityscapes and NYU.
The synthesis network has an encoder-decoder structure with skip connections to synthesize the final photo.
The encoder constructs a multiscale representation of the input based on VGG-19 and captures long-range correlations that help the decoder harmonize color, lighting, and texture.
The decoder uses this representation to synthesize progressively finer feature maps, culminating in a full-resolution output. It is also based on the cascaded refinement network framework described above.
To train the synthesis network f, the artifacts of a poor-quality canvas at test time must be simulated. Given a semantic layout L and a corresponding color image I from the training set, stenciling, color transfer, and boundary elision are applied to the pair to synthesize a simulated canvas C'. f is then trained to take the pair (C', L) and recover the original image I, using a feature-matching (content) loss:

L(theta) = sum_l lambda_l * || Phi_l(I) - Phi_l(f_theta(C', L)) ||_1

where Phi_l is the feature tensor in layer l of a pretrained perception network (VGG).
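This per-layer feature-matching loss can be sketched in numpy; as an assumption for illustration, an average-pool pyramid stands in for the VGG feature maps, so the block is runnable without a pretrained network:

```python
import numpy as np

def pyramid_features(img, levels=3):
    """Stand-in for VGG feature maps: an average-pool pyramid.
    The real loss uses layers of a pretrained VGG network."""
    feats, x = [], img.astype(float)
    for _ in range(levels):
        feats.append(x)
        H, W = x.shape[:2]
        x = x[: H // 2 * 2, : W // 2 * 2]
        # 2x2 average pooling
        x = (x[0::2, 0::2] + x[1::2, 0::2] + x[0::2, 1::2] + x[1::2, 1::2]) / 4
    return feats

def content_loss(pred, target, weights=(1.0, 1.0, 1.0)):
    """Weighted sum of per-layer L1 distances between feature tensors."""
    fp, ft = pyramid_features(pred), pyramid_features(target)
    return sum(w * np.abs(a - b).mean() for w, a, b in zip(weights, fp, ft))
```

Matching coarse layers encourages globally consistent structure, while fine layers penalize local texture errors.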
Stenciling: each training image is masked out using the masks of segments retrieved from the training set, so the network learns to generate both the context and the foreground.
Color transfer: different segments on the canvas generally have inconsistent tone and illumination, since they come from different source photographs taken under different conditions. Therefore, to perturb the color distribution of a segment in C', a segment with the same semantic class is retrieved from the training set and used as the target of the color transfer.
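A minimal global mean/std color transfer, one simple way to perturb a segment's color distribution toward a target (the paper's exact transfer method may differ):

```python
import numpy as np

def color_transfer(src, target):
    """Shift src's per-channel mean and std to match target's.
    A simple global statistic transfer over RGB; per-segment masks and
    color-space choices are omitted for brevity."""
    src = src.astype(float)
    target = target.astype(float)
    out = np.empty_like(src)
    for c in range(src.shape[-1]):
        s, t = src[..., c], target[..., c]
        s_std = s.std() or 1.0  # guard against a constant channel
        out[..., c] = (s - s.mean()) / s_std * t.std() + t.mean()
    return np.clip(out, 0, 255)

# Toy example: a dark patch pushed toward a bright target's statistics.
src = np.full((4, 4, 3), 50.0); src[0, 0] = 60.0
target = np.full((4, 4, 3), 200.0); target[0, 0] = 180.0
out = color_transfer(src, target)
```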
Boundary elision: segment boundaries are masked out randomly, so that the network is forced to learn to synthesize content near boundaries; inconsistencies arise along boundaries, not only inside segments. TOREAD
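Boundary elision can be sketched as zeroing a random subset of the segment's boundary ring; the 4-neighbourhood erosion and the drop probability `p` are implementation assumptions of mine:

```python
import numpy as np

rng = np.random.default_rng(1)

def elide_boundary(canvas, mask, p=0.5):
    """Randomly zero out canvas pixels on the segment's boundary ring so the
    synthesis network must generate content there.
    Boundary ring = mask minus its 4-neighbourhood erosion."""
    m = mask.astype(bool)
    interior = m.copy()
    interior[1:, :] &= m[:-1, :]   # require up-neighbour
    interior[:-1, :] &= m[1:, :]   # require down-neighbour
    interior[:, 1:] &= m[:, :-1]   # require left-neighbour
    interior[:, :-1] &= m[:, 1:]   # require right-neighbour
    boundary = m & ~interior
    drop = boundary & (rng.random(m.shape) < p)
    out = canvas.copy()
    out[drop] = 0
    return out, drop

canvas = np.ones((6, 6))
mask = np.zeros((6, 6), bool); mask[1:5, 1:5] = True
out, drop = elide_boundary(canvas, mask, p=1.0)  # p=1: elide the whole ring
```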
The SIMS approach is demonstrated to produce considerably more realistic images than recent purely parametric techniques (which depend entirely on the weights of a neural network, without a database).
SIMS is in a sense lower-bounded by the performance of parametric methods: if the memory bank is not useful, the network can simply ignore the canvas and perform parametric synthesis from the input semantic layout alone. TOREAD
Future work includes:
- acceleration of SIMS
- other forms of input, such as semantic instance segmentation or textual descriptions
- the frontier of video synthesis