Staying visual in a CLIP world

When CLIP-based image synthesis started to emerge in January/February 2021, I was torn between curiosity and perplexity. Suddenly it was possible to synthesise images from mere text prompts. On the one hand, I was really in need of something new to refresh my image creation process (mainly GAN-based, see my article). On the other hand, I had moved into visual arts in order to distance myself from the written word, which had always come to me so naturally. Turning to the visual meant for me a turn away from analytical thought, towards immediacy, feeling and intuition. Generating images from text ran counter to thinking visually, let alone feeling visually.

Anyhow, I did my own experiments and gradually started to look for ways to use CLIP together with as much visual control as possible. First, this involved adding visual inputs to the process otherwise dominated by CLIP, later on even moving CLIP to a secondary role in a mainly visual process.

An obvious way to achieve visual control is to start the process from a given image. This alone, together with a moderate learning rate and suitable settings, is enough to keep CLIP close to the original image while developing it in the direction given by the prompt. In the following video, the starting images were black and white photos taken on film.

Alternatively, especially when there is no direct way to initialise the image generator with a given image, one can add a term to the objective function that measures the distance from the original image, so that the process seeks a balance between textual and visual control.
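To make the idea of a balanced objective concrete, here is a minimal toy sketch. It is not my actual pipeline: the image is just a short list of pixel values, and `text_target` is a hypothetical stand-in for the pull that CLIP would exert towards the prompt (a real setup would compute that term from CLIP embeddings). The names `combined_loss`, `optimise`, `text_weight` and `image_weight` are assumptions made for this illustration.

```python
def mse(a, b):
    """Mean squared difference between two equally long pixel lists."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)


def combined_loss(image, original, text_target, text_weight=1.0, image_weight=0.5):
    """Weighted sum of a text term and a visual term.

    text_target is only a stand-in for the CLIP loss (assumption);
    the visual term is the distance from the original image.
    """
    return text_weight * mse(image, text_target) + image_weight * mse(image, original)


def optimise(original, text_target, text_weight=1.0, image_weight=0.5,
             lr=0.1, steps=500):
    """Plain gradient descent on combined_loss, starting from the original image."""
    image = list(original)
    n = len(image)
    for _ in range(steps):
        # Analytic gradient of the two weighted MSE terms.
        grad = [
            2 * (text_weight * (p - t) + image_weight * (p - o)) / n
            for p, o, t in zip(image, original, text_target)
        ]
        image = [p - lr * g for p, g in zip(image, grad)]
    return image


original = [0.0, 0.2, 0.8, 1.0]      # toy "photo"
text_target = [1.0, 1.0, 0.0, 0.0]   # toy direction suggested by the prompt
result = optimise(original, text_target)
print(result)
print(combined_loss(result, original, text_target))
```

In this toy case the result settles between the two targets, and raising `image_weight` keeps it closer to the original, which is the kind of balance described above.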

For someone like me, with a preference for transformative techniques (see my article) over purely generative ones, using CLIP in combination with my own image materials turned out, at times at least, to excel in style transfer. Even if the style most often did not feel like my own. It still doesn’t.

Here, I used a set of photos taken at home, originally meant to look a bit like landscapes, and developed them further in that direction using CLIP. Still, these, like the examples above, are mere experiments to me.

I have also experimented with using CLIP with other types of image generators besides the now ubiquitous VQGAN, including hooking CLIP up to my own GANs. Yet it was a totally untrained GAN generator that gave me the most interesting results. Its generative capabilities are very limited, yet CLIP keeps trying to push it further, which results in a distinctive style. Here I feel that I am coming to my own ground: I can relate to these images in a personal way, even as my own works.

I then realised that the simple “input text, wait, get an image” approach was tedious and restrictive. What if CLIP were used in an interactive session, allowing you to watch the image evolve while having control over it all the time? You could add a new seed/target image, apply masks to develop only a part of the image at a time, change the prompt on the fly, and so on. So I made Picsyn, and it was a totally different experience. I could start from an image with a prompt, then at some point make it develop into something more ice-like, then towards Giacometti, not all the way but just a bit. The possibilities felt endless and they still are, even if the process is still quite simple and straightforward. It can finally yield quite satisfying images, but it can also function in a realtime performance in which the image is constantly evolving under realtime control. The example below is purely experimental, but I am sure that with this interactive process I would also be able to make works that are really my own.
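The masking idea can be illustrated with a small toy sketch. This is not Picsyn's actual code; it only shows one simple way, assumed here for illustration, to let a mask decide which parts of an image are allowed to change in a given step. The function name `masked_blend` and the tiny 2×2 "images" are hypothetical.

```python
def masked_blend(current, proposed, mask):
    """Blend a proposed image into the current one, pixel by pixel.

    A mask value of 1.0 takes the proposed pixel, 0.0 keeps the current
    pixel, and values in between give a soft transition.
    """
    return [
        [m * p + (1.0 - m) * c for c, p, m in zip(crow, prow, mrow)]
        for crow, prow, mrow in zip(current, proposed, mask)
    ]


current = [[0.2, 0.2], [0.2, 0.2]]    # image as it is now
proposed = [[0.9, 0.9], [0.9, 0.9]]   # what the next update would produce
mask = [[1.0, 0.0], [0.5, 0.0]]       # develop the left side only, softly below
result = masked_blend(current, proposed, mask)
print(result)
```

Applied after each update in an interactive loop, a blend like this leaves the masked-out area untouched while the rest keeps evolving.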

Eventually, I want to tame CLIP into just another tool in my toolbox. I have resumed my GAN-based work, and even made some progress in getting it to work better than before. I have experimented with using CLIP in both pre- and post-processing the images in an otherwise GAN-based process: modifying the dataset images, maybe ever so slightly, and maybe smoothing out the GAN output, but only if it works in the context. With this process, with images like those shown below, I am squarely within my own territory, with enough visual control and a resulting style that is my own.

Another way of using CLIP output as raw material is mixed media and collage. In the following picture I have used CLIP in an interactive process to make an image, made a print on fine art paper and then attached a black and white large format (4×5″) film negative on top of the image. Here, again, I am within an experimental domain, though a very promising one: again something I can relate to and have enough control over.