Seems the structure of the UNet hasn't changed other than the text-conditioning width (768 to 1024). The biggest change is the text encoder, switched from ViT-L/14 to ViT-H/14 and fine-tuned following https://arxiv.org/pdf/2109.01903.pdf.
Seems the 768-v model, if used properly, can substantially speed up generation, though I'm not sure about that yet. Switching my app over to the 512-base model next week looks straightforward.
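For reference, a minimal sketch of what that switch might look like, assuming the Hugging Face diffusers library and the stabilityai/stable-diffusion-2-base (512) and stabilityai/stable-diffusion-2 (768-v) hub checkpoints:

```python
import torch
from diffusers import StableDiffusionPipeline

# 512-base checkpoint; swap in "stabilityai/stable-diffusion-2" for the 768-v model
model_id = "stabilityai/stable-diffusion-2-base"

pipe = StableDiffusionPipeline.from_pretrained(model_id, torch_dtype=torch.float16)
pipe = pipe.to("cuda")

# The UNet is structurally the same as in 1.x; only the cross-attention
# conditioning width changed (768 -> 1024 with the OpenCLIP ViT-H/14 encoder).
print(pipe.unet.config.cross_attention_dim)  # expect 1024 for the 2.x models

image = pipe("a photo of an astronaut riding a horse", height=512, width=512).images[0]
image.save("out.png")
```

The 768-v checkpoint ships with its scheduler already configured for v-prediction, so the same code should work with just the model id and resolution changed.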
I'm disappointed they didn't push the parameter count higher, but I suppose they want to maintain the ability to run on older/lower-end consumer GPUs. Unfortunately that severely limits how good the output can get.
They're motivating that choice with this paper: https://arxiv.org/pdf/2203.15556.pdf (Chinchilla). It shows you can beat GPT-3 with a much smaller model if you scale up the training data, and therefore training time, by roughly 4x.
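A quick back-of-the-envelope version of that argument, using the common C ≈ 6·N·D approximation for training FLOPs and the headline numbers from the paper (Gopher: 280B params on 300B tokens, Chinchilla: 70B params on 1.4T tokens):

```python
# Training-compute approximation: C ~= 6 * N * D (N = parameters, D = tokens).
def train_flops(params, tokens):
    return 6 * params * tokens

gopher = train_flops(280e9, 300e9)      # ~5.0e23 FLOPs
chinchilla = train_flops(70e9, 1.4e12)  # ~5.9e23 FLOPs

# Roughly the same compute budget, but the 4x smaller model trained on
# ~4.7x more tokens comes out ahead on benchmarks.
print(f"Gopher:     {gopher:.2e} FLOPs")
print(f"Chinchilla: {chinchilla:.2e} FLOPs")
```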
Larger models are still much better. Google's Parti model renders text nearly perfectly and follows prompts far more accurately than Stable Diffusion. It's 20B parameters, and with the latest int8 optimizations it should in theory be possible to run it on a consumer 24GB card.
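The rough memory math behind that claim (weights only, ignoring activations and runtime overhead):

```python
# int8 stores one byte per parameter, so a 20B-parameter model is ~20 GB of
# weights, which just about fits in 24 GB; fp16 would already need ~40 GB.
params = 20e9
for name, bytes_per_param in [("fp32", 4), ("fp16", 2), ("int8", 1)]:
    gb = params * bytes_per_param / 1e9
    print(f"{name}: ~{gb:.0f} GB of weights")
```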
I think they're looking into larger models later, though.
Can't forget the time it takes to run inference, even on the latest A100/H100. Generating an image in under, say, ten seconds opens up more use cases (and so on, all the way until high-fps video is possible).
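If latency is the constraint, the simplest check is end-to-end wall-clock time per image on the actual target GPU; a sketch using the same assumed diffusers setup as above:

```python
import time
import torch
from diffusers import StableDiffusionPipeline

pipe = StableDiffusionPipeline.from_pretrained(
    "stabilityai/stable-diffusion-2-base", torch_dtype=torch.float16
).to("cuda")

# Synchronize around the call so we measure the full GPU time, not just
# the time to enqueue the kernels.
torch.cuda.synchronize()
start = time.perf_counter()
_ = pipe("a photo of an astronaut riding a horse", num_inference_steps=25).images[0]
torch.cuda.synchronize()
print(f"one 512x512 image in {time.perf_counter() - start:.1f}s")
```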