Introduction to VQGAN+CLIP

Here is a tutorial on how to operate VQGAN+CLIP by Katherine Crowson! No coding knowledge necessary.


machine learning, image synthesis, graphics, design, surrealism, unreal

This is a brief tutorial on how to operate VQGAN+CLIP by Katherine Crowson. You don’t need any coding knowledge to operate it - my own knowledge of coding is very minimal.

I did not make this software, I just wanted to bring it to the public’s attention. Katherine Crowson is on Twitter @RiversHaveWings, and the notebook that inspired it (The Big Sleep, combining BigGAN+CLIP in the same way) was written by @advadnoun. Currently, the notebook is updated and maintained by @somewheresy, who regularly ports new editions through his GitHub.

If you’d like to see some really trippy video applications of this technique, check out the videos on @GlennIsZen’s YouTube page.

Note: I purchased a subscription to Google Colab Pro, which gives priority access to better and faster GPUs and decreases how quickly Colab times out. This is not necessary, and you can do all of this without it. If you want to run generations for longer periods of time, or have them run in the background without interruption, Colab Pro is a reasonable option. Additionally, there is Colab Pro+, which provides even higher priority than Pro, although at $50/month it is certainly outside my own price range.

STEP 1: Google Colab

Go to this Google Colab Notebook:

(Updated GitHub link)

Google Colab is like Google Docs for code. It gives you access to dedicated GPUs that Google operates through the cloud, so you don’t need to use your own computer’s processing power in order to run the program. This is useful, because generating art with AI can be very intensive.

You’ll see that the page is split into lots of individual “cells.” Most of these don’t require a great deal of interaction, so don’t worry.

STEP 2: Setting Up the Notebook

At the top-right of the Notebook, you’ll see a button that says “Connect.” When you click this, Colab will connect you to a runtime using one of their dedicated GPUs so that you can begin running code. Run the first cell titled “MIT Licence.” The next cell down is optional, and will inform you what type of GPU your runtime has connected to. If you want to try connecting to a different runtime with a different GPU, go up to the “Runtime” tab, and click “Factory Reset Runtime,” which will disconnect you from the GPU and allow you to connect from scratch.

The next pair of cells allow you to make use of either Colab’s own internal file storage, or your own personal Google Drive for the importing/exporting of images. If you want to use Colab’s folder, run the “Use Temp Filesystem” cell. If you want to use your own Google Drive, run the “Connect Google Drive” and “Make a new folder & set root path to that folder” cells.

Next, run the “Setup, Installing Libraries” cell, which will install all the necessary code libraries used in the image generation process.

STEP 3: Models

The next cell, “Selection of models to download,” allows you to choose which of VQGAN’s different models you would like to make use of. Simply click each model to toggle it on or off, and then run the cell to install them. If you change your mind, you can always edit your selection and run the cell again to download additional models.

Each model instructs VQGAN to draw from a different dataset - for example, the Imagenet_16384 model trains VQGAN on the Imagenet 16384 dataset of images. The Sflckr model trains VQGAN on a dataset derived from the image website Flickr. Each dataset is useful for slightly different things, so experiment with each one and find out which one generates the result you prefer!

Additionally, you can toggle whether to download the data necessary for the notebook to run Katherine Crowson’s GAN Diffusion software, which has been integrated into the VQGAN+CLIP notebook and can be found underneath the VQGAN Execution cell.

Once your models are installed, run the “Load libraries and definitions” cell.

STEP 4: Execution

This step is where we will determine the parameters of the image we want to make, and generate the image!

There are two parameters that are used in both VQGAN+CLIP and GAN Diffusion, so they are separated into the “Global Parameters” cell. These parameters are defined as follows:

seed: the seed will determine the map of noise that VQGAN will use as its initial image - similar to how the concept of a seed works in Minecraft. Setting the value of the seed to -1 will generate a random image every time. Using any positive integer will generate the same sheet of noise each time, allowing for comparisons in style and tone between different images.
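To illustrate the idea, here is a minimal sketch using Python’s standard library (the notebook itself seeds PyTorch, not `random`; this is my own illustration, not its actual code):

```python
import random

# Illustrative sketch: a seed of -1 means "random every time", while any
# fixed positive seed reproduces the same starting noise on every run.
def make_noise_sheet(seed, size=4):
    if seed == -1:
        rng = random.Random()      # fresh, unpredictable state each call
    else:
        rng = random.Random(seed)  # fixed state -> identical noise each call
    return [rng.random() for _ in range(size)]

# The same positive seed always yields the same starting noise:
assert make_noise_sheet(7) == make_noise_sheet(7)
```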

display_frequency: VQGAN+CLIP takes a starting image and iterates it dozens or hundreds of times until it eventually plateaus into a consistent image. The Display Frequency will determine how many iterations the machine will run before printing something into the text box below the cell. For instance, setting display_frequency to 1 will display every iteration VQGAN makes in the Execution cell. Setting display_frequency to 33 will only show you the 1st, 33rd, 66th, 99th images, and so on.
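The display logic can be sketched like this (the parameter names match the notebook, but the loop body is a stand-in for the real optimisation step):

```python
# Sketch of the display loop: show the first iteration, then every
# display_frequency-th iteration after that.
display_frequency = 33
max_iterations = 100

shown = []
for i in range(1, max_iterations + 1):
    # ... one VQGAN+CLIP optimisation step would run here ...
    if i == 1 or i % display_frequency == 0:
        shown.append(i)  # in the notebook, the current image is displayed

print(shown)  # -> [1, 33, 66, 99]
```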

The next cell, “VQGAN+CLIP Parameters and Execution,” contains all the remaining parameters that are exclusive to VQGAN+CLIP. When you run this cell, the image will begin to generate underneath using the parameters that were set when you pressed the Run Cell button. The parameters are defined as follows:

prompts: these are text prompts that CLIP will convert into suggestions for VQGAN. For instance, entering “a cityscape in the style of Van Gogh” into this box will prompt VQGAN to generate an image of a cityscape, using the artistic style of Van Gogh.

width: the width of the generated image in pixels.

height: the height of the generated image in pixels.

clip_model: the model of CLIP used by the machine.

vqgan_model: the model of VQGAN used by the machine.

initial_image: an image for the machine to begin with in place of a noise sheet. If you’re using Colab’s internal storage, simply enter the name of the image file you’ve imported, making sure to include the file extension (e.g. “olive picture.png”). You can drag these images into the file folder on the left side of the screen. The folder is marked with a small icon on the left taskbar.

target_images: similar to the text prompts, target images are pictures that VQGAN will “aim for” when generating the image. It can be used in combination with text prompts, or by itself. Setting the same image as both the Initial Image and the Target Image will have any additional text or image prompts function similarly to a filter.

max_iterations: the number of iterations the machine will run through before terminating the process.
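Gathered together, the parameters above might look something like this (the values are illustrative examples of mine, not the notebook’s defaults):

```python
# Hypothetical grouping of the parameters described above, to show the
# kind of value each one takes (names mirror the notebook's fields).
params = {
    "prompts": "a cityscape in the style of Van Gogh",
    "width": 700,              # pixels
    "height": 700,             # pixels
    "clip_model": "ViT-B/32",  # a commonly used CLIP model
    "vqgan_model": "vqgan_imagenet_f16_16384",
    "initial_image": "",       # e.g. "olive picture.png", or empty for noise
    "target_images": "",       # optional image prompt(s)
    "max_iterations": 300,
    # from the Global Parameters cell:
    "seed": -1,
    "display_frequency": 33,
}
```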

Additionally, there are four Advanced Parameters that alter how VQGAN processes the image internally. These tend to be pretty arcane and can generate wildly incoherent results, so experiment with them at your own peril.

STEP 5: Tips and Tricks

Some useful tips!

1. Once you’ve run the GAN for the first time, you don’t need to run all the cells all over again for your next attempt. As long as you don’t close the Notebook tab in your web browser, all you need to do to generate another image is run the “Parameters” cell and then the “Do the Execution” cell. You shouldn’t need to bother with any of the earlier cells unless you want to download more datasets.

2. You can stop the procedure at hand by going up to “Runtime” and clicking “Interrupt Execution.” Useful if you want to halt the output midway through and move on to video generation, or if you want to run a new prompt without waiting for the current one to finish.

3. In the text prompt section, you can enter multiple prompts by separating them with the “|” symbol.

4. You can ascribe percentage weights to your different prompts by adding a colon and then a number to each one, adding up to 100. For example, “a cityscape:50 | nightmare artist:25 | photorealism:25”

5. In the text prompt section, adding phrases like “unreal engine,” “hyperrealistic,” “photorealistic,” and “render” produces more HD-like results, which is pretty funny.

6. If you want to add your own images for use in the “initial_image” or “target_images” section, go to the left side of the screen and click on the little File icon. Drag and drop your images into the folder, and then type the name of the file (e.g. image.jpg, face.png) into the relevant section.
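Tips 3 and 4 can be illustrated with a small parser. This is my own sketch of the splitting logic, not the notebook’s exact code:

```python
# Split a "|"-separated prompt string into (prompt, weight) pairs,
# where an optional ":number" suffix sets the weight (default 1.0).
def parse_prompts(text):
    results = []
    for chunk in text.split("|"):
        chunk = chunk.strip()
        if ":" in chunk:
            prompt, weight = chunk.rsplit(":", 1)
            results.append((prompt.strip(), float(weight)))
        else:
            results.append((chunk, 1.0))
    return results

print(parse_prompts("a cityscape:50 | nightmare artist:25 | photorealism:25"))
# -> [('a cityscape', 50.0), ('nightmare artist', 25.0), ('photorealism', 25.0)]
```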

STEP 6: Aspect Ratios and Re-Sizing

This might just be me, but I struggle to generate images larger than 700x700 pixels - Colab runs out of memory. Typically, I keep the same total pixel count (700x700 = 490,000) whenever I try other aspect ratios. Here are some useful ones:

1:1 - 700x700

4:3 - 808x606

16:9 - 928x522

2:1 - 988x494

1.66:1 - 904x544

2:3 - 568x852

4:5 - 624x780

10:13 - 600x780

7:10 - 588x840

3:5 - 540x900

5:6 - 630x756

11:19 - 528x912

Cinemascope (2.4:1) - 1084x452

Manga (250:353) - 576x812

Super 8 (1:1.33) - 600x798
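The sizes above all keep the total pixel count near 700x700. A small helper (my own, not part of the notebook) can generate similar width/height pairs for any ratio:

```python
import math

# Pick a width x height for a given aspect ratio so that the total pixel
# count stays near the 700*700 = 490,000 budget used in the table above.
def dims_for_ratio(w_ratio, h_ratio, budget=700 * 700):
    scale = math.sqrt(budget / (w_ratio * h_ratio))
    # round each side to an even number of pixels
    width = int(round(w_ratio * scale / 2)) * 2
    height = int(round(h_ratio * scale / 2)) * 2
    return width, height

print(dims_for_ratio(1, 1))   # -> (700, 700)
print(dims_for_ratio(16, 9))  # close to the 928x522 row above
```

The results won’t match the table exactly (the author’s numbers look hand-rounded), but they land within a few pixels of the same budget.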

Type Google Doc
Published 03/05/2024, 21:43:33

