
oh wow, never thought of it that way



> consistency model tries to predict the movement trajectory by providing the current position in the image space. Hence, what used to be a step-by-step process becomes a one-step process.

No, that wasn't a sufficient explanation for me. What is the prediction method here? Why was diffusion necessary in the past? What tradeoffs does this approach have?
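The quoted intuition can be caricatured in one dimension. Everything below is invented for illustration (the "image" is a single number, the denoiser is a hand-written formula, and the consistency model is assumed perfectly trained): a diffusion sampler spends many network evaluations walking back along the noise trajectory, while a consistency model jumps from any point on that trajectory straight to its endpoint.

```python
X0 = 1.0  # the "clean image" (a single number in this 1-D caricature)

def denoise_step(x, t):
    # Hypothetical per-step denoiser: removes 1/t of the remaining
    # noise, standing in for one diffusion sampler step.
    return x + (X0 - x) / t

def diffusion_sample(x_noisy, n_steps=40):
    # Step-by-step refinement: n_steps network evaluations.
    x = x_noisy
    for t in range(n_steps, 0, -1):
        x = denoise_step(x, t)
    return x

def consistency_sample(x_noisy):
    # An idealized consistency model maps ANY point on the noise
    # trajectory straight to the trajectory's endpoint: one evaluation.
    return X0

print(diffusion_sample(5.0))    # walks to ~1.0 over 40 steps
print(consistency_sample(5.0))  # jumps to 1.0 in one step
```

The tradeoff discussed downthread lives in `consistency_sample`: a real consistency network only approximates that one-shot jump, which is why detail can suffer compared to taking 30-40 small steps.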


> what tradeoffs does this approach have?

From playing around with it a bit locally, LCM is much, much faster, but the detail is generally much lower with the latest SDXL model.

If your prompt is very simple, such as "a boy looking at the moon in a forest", it does pretty well. If your prompt is much more complex, asks for a lot more detail, or uses other LoRAs, it doesn't do nearly as well as other samplers and generates lower-quality, worse-matching images. Those other samplers take 30-40 steps, though, so they're several times slower.

From what I've seen, though, if you use ControlNet, or pass in some guide images and rely on a simple prompt (say, an existing video whose style you're trying to change), LCM can generate images in near real time like the OP on an RTX 4090, and maybe on slower cards with smaller/older models.

Another benefit is the decreased iteration time: you can quickly sweep over seeds to find output you like, and then spend some time with other samplers/upscalers on that seed to make the result higher quality.


If you use a tool like ComfyUI, you can use LCM-LoRA to generate fast but imperfect outputs and then refine them with an old-school sampler. I've been playing around and have found the quality of LCM to be excellent when used in combination with IP-Adapter instead of text prompts for conditioning.


Is it possible to use the seed from the LCM on the full-blown model to get a more detailed image, or is it a different latent space/decoder?


Exactly, and there are more things you can do in terms of what you use as input for the NN!


In my defense, if you look at the original paper

https://arxiv.org/pdf/2303.01469.pdf

the exact neural network used for the prediction method is left unspecified. Apparently many neural networks can be used for this prediction, as long as they fulfill certain requirements.

> why was diffusion necessary in the past?

In the paper, one way to train a consistency model is by distilling an existing diffusion model. But it can be trained from scratch too.
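A drastically simplified sketch of what distillation means here (all names and numbers below are made up for illustration; the real method solves an augmented PF-ODE rather than fitting a line): the "teacher" is a step-by-step denoiser, and the "student" is a one-shot map fitted to (noisy input, teacher endpoint) pairs.

```python
import random

TARGET = 0.5  # the clean value the teacher's trajectory lands on

def teacher_endpoint(x_noisy, n_steps=40):
    # "Teacher" diffusion: run the full step-by-step denoising loop
    # and return where the trajectory ends up.
    x = x_noisy
    for t in range(n_steps, 0, -1):
        x = x + (TARGET - x) / t
    return x

# Collect (noisy input -> teacher endpoint) training pairs.
random.seed(0)
xs = [random.uniform(-3.0, 3.0) for _ in range(200)]
ys = [teacher_endpoint(x) for x in xs]

# Fit the student f(x) = w*x + b by least squares. Since the teacher
# always lands on TARGET here, the student learns w ~ 0, b ~ 0.5.
mx = sum(xs) / len(xs)
my = sum(ys) / len(ys)
var = sum((x - mx) ** 2 for x in xs)
cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
w = cov / var
b = my - w * mx

def student_sample(x_noisy):
    # One evaluation instead of 40 teacher steps.
    return w * x_noisy + b
```

Training from scratch, by contrast, would mean fitting the student without ever running the teacher loop, using only the self-consistency property along the noise trajectory.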

"Why was it necessary in the past" doesn't bother me that much. Before people knew to use molds to make candles, they made them by dipping threads into wax. Why was thread dipping necessary? It's just a stepping stone of technological development.


Exactly.

The stepping-stone way of seeing things reminds me a lot of the thesis behind the book "Why Greatness Cannot Be Planned: The Myth of the Objective" (2015).


From https://arxiv.org/abs/2310.04378 it sounds like it's a form of distillation of an SD model. So I'm guessing it can't be trained directly, but once you have a trained diffusion model you can distill a predictor that cuts out the iterative steps.

While it can do 1-step generation, the output quality looks a lot better with additional steps.


You can train it directly. This is from the paper “An LCM demands merely 32 A100 GPUs Hours training for 2-step inference [...]”


Let's quote the whole sentence, shall we?

An LCM demands merely 32 A100 GPUs Hours training for 2-step inference, as depicted in Figure 1. [1]

Now let's look at the caption under Figure 1:

LCMs can be distilled from any pre-trained Stable Diffusion (SD) in only 4,000 training steps (∼32 A100 GPU Hours) for generating high quality 768×768 resolution images in 2∼4 steps or even one step, significantly accelerating text-to-image generation.

The training mentioned in your quote is distillation: it requires a previously trained SD model.

You could tell just by reading the introduction:

We propose a one-stage guided distillation method to efficiently convert a pretrained guided diffusion model into a latent consistency model by solving an augmented PF-ODE.

[1] Section 4.2 in https://arxiv.org/abs/2310.04378



