Controlling image content with FC layers

A few days ago I got an idea to generate images based on fully connected layers of the VGG19 net. As the these layers should contain abstracted information on what the net sees in the input image, perhaps in this way I could make neural-style to generate images based on what the net thinks it is seeing (like houses, trees, people), rather than using the actual shapes it is seeing.

This idea is maybe influenced by Google’s inceptionism, at least in technical terms. While admitting that, I actually strongly dislike the typical images produced by Google’s solution.

Neural-style, following the original paper by Leon Gatys et al, does not use the fully connected layers at all. When I tried assigning, say, FC8 as a content layer, neural_style terminated with a “size mismatch” error. The creator of neural_style, jcjohnson, pointed me the way to the source of this error. I had bumped into a grey area in my understanding of neural networks: how a convolutional net handles images of varying size in practice. For a conv layer, it turned out, the varying image dimensions are no problem: the layer dimensions adapt themselved to the dimensions of the input image. The fully connected layers, however, are of fixed size. This is the reason for the mismatch.

The VGG19 net has been trained to classify images of 224x224px, and the dimensions of the FC layers match with the conv layers when using exactly this size. I then patched neural_style to resize input images to 224×224, and yes, it worked. At least somehow… most of the images produced were quite boring, and the optimization was not behaving well. The loss was jumping up and down instead of converging. Some of the images resembled the kind of psychedelic kitsch which is not my favorite style. One of the images was a faint imaginary landscape, though, that looked promising to me.

11215105_10154163669088729_1319398496884564628_n 12670624_10154163671323729_170637435934694477_n 1235514_10154163615258729_7272599325165600309_n

Between my experiments, I also looked more closely into how neural-style constructs the net it is using internally. For obvious reasons, it did not take into account adding the FC layers. I made some further modifications. At some point I noticed that the optimization was working much better towards a minimum loss. Likewise, I felt I was getting better images.

12928162_10154167413238729_8249157662987593086_n 12919906_10154167373038729_5938800725002576143_n 12718000_10154167285813729_1874033585272655881_n

In the images I have used the same photo both for content and style. Content has been taken from layer fc7 or fc8, style from the conv layers.

I liked what I was seeing, but being limited to 224×224 px was really annoying. I had been told that converting the fc layers to conv layers would solve that, but I didn’t feel ready for that. Instead, I inserted a SpatialAdaptiveMaxPooling layer between the conv and fc layers to ensure that the lowest fc layer always gets input of the correct dimensions (7x7x512).

It looks like this works. Furthermore, I rather like the style of the resulting images, which is generated without using any separate style image. At the present, I am not really sure how much of the feeling of the images results from the use of an fc layer, and how much from the application of the photo as a style image, kind of dissecting the image into elements and then reconstructing a new one. It appears that the role of the style effect is essential. Adding too much weight to the fc layers results in psychedelic chaos á la Google inceptionism. I will continue my experiments.






In der Altstadt