Gemma 4 is excellent for image to prompt
I used Qwen3 8B VL for a long time for image-to-prompt, but now that I have tried Gemma 4 26B I am delighted with how much more detail it can extract from the image, and how much it can improve the prompt. I've also tried larger Qwen3 models, but they can't even approach the Gemma models.
From LM Studio, I start Gemma, give it a picture, and have it make a prompt from it, structured for whichever image model I'm using: mostly Zit, sometimes Flux. I haven't tried ERNIE-Image yet, but I don't see a reason why I wouldn't get great results with it too.
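The workflow described above can be sketched against LM Studio's OpenAI-compatible local server (default port 1234). This is a hedged sketch, not the poster's exact setup: the model name, file path, and instruction text are assumptions.

```python
import base64

def image_prompt_messages(image_bytes: bytes, instruction: str) -> list:
    """Build an OpenAI-style chat payload that asks a vision model to
    rewrite an image as a text-to-image prompt."""
    b64 = base64.b64encode(image_bytes).decode()
    return [{
        "role": "user",
        "content": [
            {"type": "text", "text": instruction},
            {"type": "image_url",
             "image_url": {"url": f"data:image/jpeg;base64,{b64}"}},
        ],
    }]

# With LM Studio serving a Gemma vision build locally:
#   from openai import OpenAI
#   client = OpenAI(base_url="http://localhost:1234/v1", api_key="lm-studio")
#   reply = client.chat.completions.create(
#       model="gemma-27b",  # hypothetical name; use whatever build is loaded
#       messages=image_prompt_messages(
#           open("reference.jpg", "rb").read(),
#           "Describe this image as a detailed text-to-image prompt."))
#   print(reply.choices[0].message.content)
```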
https://redd.it/1snw7nt
@rStableDiffusion
Ernie Image Turbo is not bad at all (using an INT8 quant and Gemini for prompt enhancement, RTX 30-series GPU with low VRAM)
https://redd.it/1snxxh6
@rStableDiffusion
Klein 9B: Better quality at 1056x1584 than at 832x1216, which would be close to 1MP.
I always generated images at 832x1216 or 1024x1024 and then upscaled with SeedVR2, but I noticed that when generating directly at 1056x1584 the lighting and skin color become more realistic. As for anatomy errors like 3 arms or 6 fingers, they happen at both 832x1216 and 1024x1024 as well, so you can just rerun the prompt with more seeds to correct them.
Do you generate at a resolution close to 1 MP, which would be around 1024px, or above that? I'm referring to the KSampler output directly, not a post-KSampler upscale model.
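For reference, the pixel budgets of the resolutions mentioned work out as follows (a quick sketch, taking 1 MP = one million pixels):

```python
def megapixels(width: int, height: int) -> float:
    """Pixel count in millions (1 MP = 1e6 pixels)."""
    return width * height / 1e6

for w, h in [(832, 1216), (1024, 1024), (1056, 1584)]:
    print(f"{w}x{h}: {megapixels(w, h):.2f} MP")
# 832x1216 and 1024x1024 both land at roughly 1 MP;
# 1056x1584 is about 1.67 MP, i.e. roughly 60% more pixels per image.
```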
https://redd.it/1snzldm
@rStableDiffusion
We can finally watch TNG in 16:9
https://www.youtube.com/watch?v=LRKPhQiHLVI
https://redd.it/1so5d8g
@rStableDiffusion
YouTube
Expanding Classic Shows from 4:3 to 16:9 with AI
This video uses WanGP and the LTX 2.3 model with outpainting to expand classic 4:3 TV shows to a widescreen 16:9. Each 20s clip took about 10 minutes to expand, after quite a bit of trial-and-error. In some cases multiple takes were used and the best parts…
Cheaper Qwen VAE for Anima (and its training)
https://huggingface.co/Anzhc/Qwen2D-VAE
https://github.com/Anzhc/anzhc-qwen2d-comfyui/tree/main
Just a modification of the Qwen Image VAE that lets you skip the parts that are useless for non-video models. I have tried it with LoRA training as well; as far as I can see it works the same, so you can use it to save time on caching or to drastically speed up VAE processing in end-to-end training pipelines.
Overall, from my tests, this VAE produces results identical to the original, at 3x less VRAM and better speed.
Caching 51 images at 768px with the full VAE: 37 seconds
Caching 51 images at 1024px with the modified VAE: 34 seconds
(I know they are not the same resolution, but I was lazy)
VRAM picture:
https://preview.redd.it/shdvwje5esvg1.png?width=580&format=png&auto=webp&s=3b99db58f52b519680b2dafb2de6bb80aa577e4b
ComfyUI loading:
https://preview.redd.it/vslikw1yesvg1.png?width=647&format=png&auto=webp&s=8aa6f2d138f2c4955aa7358d78e34ec04488d695
85 MB vs 242 MB
Some benchmarks from ChatGPT:
https://preview.redd.it/me8gokk5fsvg1.png?width=757&format=png&auto=webp&s=482786eb94c25969e6bf764744b95065648de1b5
Benchmark results:
https://preview.redd.it/q2vw2bpcesvg1.png?width=1159&format=png&auto=webp&s=995a05c4bd7d55ebee31cc5f202599efa78f383a
Left: Modified, right: full qwen vae
Basically a noise-level change; the decode difference in practice is ±0.
Works interchangeably with the original on image content:
https://preview.redd.it/1ttkadtresvg1.png?width=2346&format=png&auto=webp&s=5328906d80372a241be96fc91a985dc2a52bcbb5
(other way around works too ofc)
The whole thing is basically collapsing Conv3D to Conv2D, which apparently results in virtually no loss in image encode/decode while making the VAE 3x smaller and 2.5x faster.
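The collapse described above can be illustrated in PyTorch. This is a sketch of the general idea, not the repo's actual code: for a single frame replicated across every temporal tap (a common padding scheme in video VAEs), summing a Conv3d kernel over its temporal axis yields an exactly equivalent Conv2d.

```python
import torch
import torch.nn as nn

def collapse_conv3d(conv3d: nn.Conv3d) -> nn.Conv2d:
    """Fold a Conv3d into an equivalent Conv2d for single-frame input,
    assuming the frame is replicated across all temporal kernel taps.
    Summing the weight over the temporal axis then matches exactly,
    and the parameter count drops by a factor of kT."""
    conv2d = nn.Conv2d(
        conv3d.in_channels, conv3d.out_channels,
        kernel_size=conv3d.kernel_size[1:],   # drop the temporal dim
        stride=conv3d.stride[1:], padding=conv3d.padding[1:],
        bias=conv3d.bias is not None,
    )
    with torch.no_grad():
        conv2d.weight.copy_(conv3d.weight.sum(dim=2))  # sum over kT
        if conv3d.bias is not None:
            conv2d.bias.copy_(conv3d.bias)
    return conv2d
```

With kT = 3, the weight tensor shrinks 3x, which is consistent with the ~3x size reduction reported here.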
Idk, that's it, use it if you want. I was just fed up with how inefficient usage of temporal vaes was for non-temporal goon models.
After installing the node, you can just replace your qwen vae with qwen2d one, that's it.
https://redd.it/1so865j
@rStableDiffusion
I made an entire cinematic shortfilm using LTX 2.3 in a week. How does it hold up? - The Felt Fox (statistics/details in comments)
https://youtu.be/yKZM66tcl9M
https://redd.it/1so8o8g
@rStableDiffusion
YouTube
The Felt Fox - cinematic storytelling using AI, made with $0 on consumer hardware
This shortfilm was created entirely using local generative AI models and free open source tools. It was completed in roughly 1 week (though some days were 12+ hours of work), and cost $0 to produce (excluding intangible expenses like electricity from running…
ComfyUI_RaykoStudio has been updated!
# Making an outpaint is now even easier! The new RS Outpaint node provides 100% expansion of your image within the limits you set!
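As a rough illustration of what "100% expansion within the limits you set" could mean for the canvas math (hypothetical arithmetic for intuition, not the node's actual implementation):

```python
def outpaint_canvas(w: int, h: int, expand_pct: float,
                    max_w: int, max_h: int) -> tuple:
    """Grow each dimension by expand_pct (100 doubles it),
    clamped to user-set limits."""
    return (min(round(w * (1 + expand_pct / 100)), max_w),
            min(round(h * (1 + expand_pct / 100)), max_h))

# Doubling a 832x1216 image under a 2048x2048 limit clamps the height:
print(outpaint_canvas(832, 1216, 100, 2048, 2048))  # (1664, 2048)
```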
https://preview.redd.it/wlq03x5iugvg1.jpg?width=1670&format=pjpg&auto=webp&s=dc7c61f63316cdce9d1c866c2cc28e7d2d5665de
https://preview.redd.it/8d5pkijjugvg1.jpg?width=1222&format=pjpg&auto=webp&s=8949cac782a375b9e20ba588a692bb7ed1fc1615
Link to nodes pack: [https://github.com/Raykosan/ComfyUI\_RaykoStudio](https://github.com/Raykosan/ComfyUI_RaykoStudio)
https://redd.it/1so5458
@rStableDiffusion