In my previous post, Controlling image content with FC layers, I described how I had modified neural-style to optimize content based on the classifications obtained from the topmost, fully connected layers (which neural-style does not use). I found that the modified program produced visually quite interesting results, but I was not sure whether this was due to the use of the FC layers or merely to using one and the same image for both content and style.
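To make the idea concrete, here is a minimal sketch of the modification in Torch, assuming neural-style's ContentLoss module and the layer names of the Caffe VGG-16 model; variable names like content_image_caffe and content_weight are placeholders, and the actual code differs in details:

```lua
-- Sketch: keep VGG's fully connected layers and attach a content loss
-- after fc8, so the optimizer matches the class activations of the
-- content image instead of only convolutional features.
-- Assumes neural-style's nn.ContentLoss(strength, target, normalize).
require 'nn'
require 'loadcaffe'

local cnn = loadcaffe.load('VGG_ILSVRC_16_layers_deploy.prototxt',
                           'VGG_ILSVRC_16_layers.caffemodel', 'nn')
local net = nn.Sequential()
for i = 1, #cnn do
  local layer = cnn:get(i)
  net:add(layer)
  if layer.name == 'fc8' then
    -- record the content image's class activations as the target
    local target = net:forward(content_image_caffe):clone()
    net:add(nn.ContentLoss(content_weight, target, false))
  end
end
```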
I have since confirmed that the modified program, which I have named Neural Mirage, really does optimize the content according to the classifications, while the details come from the style of the original photo. The result is a completely new image in which the spatial arrangement of the original is entirely absent. The image has been deconstructed and then reconstructed, sometimes looking like a map, sometimes like a puzzle whose pieces have been reordered, even repainted from different perspectives, yet still appearing to fit together seamlessly. As in the view of Munich below.
Or in the view of the Helsinki harbor seen from the sea:
Up to this point, I had used the original VGG16 net in these tests, modified as described in the previous post.
I then wanted to monitor which classes the program sees in the target image, and how it manages to work towards the correct ones. I could see that the content loss from the FC8 layer was decreasing properly, but I also wanted to know which specific classes were present in the target image as well as in each of the intermediate results.
After making sure there was a Softmax layer on top of the topmost FC layer, I got an array of probabilities for each of the 1000 classes in VGG16. I did not have a list of names for the classes, though, and my attempt to reconstruct one produced such improbable results that I had to find something else. At this point I found a VGG16 trained on places at http://places.csail.mit.edu/downloadCNN.html, together with a list of 205 classes such as “alley”, “river” and “desert sand”. Mirage could now display, at the beginning of a run, which features it sees in the content image. For the Helsinki harbor, for example:
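A sketch of this monitoring step, assuming the category list that ships with the Places download (here called categories_places205.txt, one “name index” pair per line) and a variable fc8_output holding the activations of the topmost FC layer; both names are assumptions:

```lua
-- Sketch: turn the top FC layer's activations into probabilities and
-- print every Places class above a 5% threshold, in the format used
-- in the logs below. File and variable names are assumptions.
local class_names = {}
for line in io.lines('categories_places205.txt') do
  local name, idx = line:match('(%S+)%s+(%d+)')
  class_names[tonumber(idx) + 1] = name  -- file indices start at 0
end

local probs = nn.SoftMax():forward(fc8_output):squeeze()
print('----------- seeing the features --------------')
for i = 1, probs:size(1) do
  if probs[i] > 0.05 then
    print(string.format('%.4f %s %d', probs[i], class_names[i], i - 1))
  end
end
```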
```
----------- seeing the features --------------
0.4539 /b/basilica 21
0.0939 /c/cathedral/outdoor 61
0.0704 /c/church/outdoor 62
0.0832 /h/harbor 90
0.0673 /p/palace 133
```
The first iterations, starting from an image of random noise, usually see little besides desert sand:
```
Iteration 10 / 1000
  Content 1 loss: 10157.208443
  Style 1 loss: 29152.839661
  Style 2 loss: 7147411.132812
  Style 3 loss: 10749539.062500
  Style 4 loss: 164257.827759
  Style 5 loss: 438.794106
  Total loss: 18100956.865281
0.3538 /d/desert/sand 68
0.0625 /m/marsh 116
```
Usually the content classes emerge quite early, and from there on the optimization mainly refines the style. Sometimes the optimization fails to find the correct classes; if this persists for an image, it usually helps to increase the content weight, as the sketch after the log below suggests. Here, by iteration 70, the correct classes have already taken over:
```
Iteration 70 / 1000
  Content 1 loss: 495.897047
  Style 1 loss: 27438.602448
  Style 2 loss: 233755.386353
  Style 3 loss: 186934.814453
  Style 4 loss: 14508.877754
  Style 5 loss: 177.618906
  Total loss: 463311.196961
0.4232 /b/basilica 21
0.0789 /c/cathedral/outdoor 61
0.0668 /c/church/outdoor 62
0.1070 /h/harbor 90
0.0911 /p/palace 133
```
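Why raising the content weight helps can be seen from how neural-style accumulates its losses: the weight enters as a multiplicative strength on the distance to the target activations, so a larger value makes the class-matching term dominate the style terms. A simplified sketch of the relevant function from neural-style's ContentLoss module:

```lua
-- Simplified sketch of ContentLoss:updateOutput in neural-style: the
-- strength (set via the content weight) scales the MSE against the
-- target FC8 activations, so increasing it pulls the image harder
-- towards the desired classes relative to the style losses.
function ContentLoss:updateOutput(input)
  self.loss = self.crit:forward(input, self.target) * self.strength
  self.output = input
  return self.output
end
```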
The VGG Places net is fun to work with now that one can monitor what it is actually seeing in the images. On the other hand, the images it produces may be somewhat less detailed than those made with the default VGG19, but I have not run enough tests to be sure.