Thanks so much for this! This was exactly the problem my team had at a recent hackathon, and we decided to use diffusion instead. We theorised that the initial-image parameter might help, but we didn't have time to test it. Documenting this was really helpful. If you have any other articles or advice on using VQGAN (or other generative models) with CLIP (or other language embeddings), I'd love to see them.
Replying again, as I think my previous message didn't appear: glad you found the article useful. While diffusion can produce more coherent images, I prefer VQGAN as it's much more malleable. I'm writing a whole series of posts on AI generative art, so let me know if there's anything it would be useful to cover.