Gemma 4 is excellent for image-to-prompt

I used Qwen3 8B VL for a long time for image-to-prompt, but now that I've tried Gemma4 26B I'm delighted with how much more detail it can extract from the image, and how much it can improve the prompt. I've also tried larger Qwen3 models, but they can't even approach the Gemma models.
From LM Studio, I start Gemma, give it a picture, and have it write a prompt structured for the image model I'm using, mostly Zit, sometimes Flux. I haven't tried ERNIE-Image yet, but I don't see a reason why I wouldn't get great results with it too.
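As a rough sketch of automating this workflow: LM Studio exposes an OpenAI-compatible server (by default at http://localhost:1234/v1), so a request like the following could send an image for captioning. The model id and prompt wording here are placeholders I made up, not from the post:

```python
import base64
import json

# Hedged sketch: LM Studio serves an OpenAI-compatible chat API
# (default http://localhost:1234/v1). Model id and prompt text are
# placeholder assumptions; use whatever LM Studio actually lists.
def build_image_prompt_request(image_bytes: bytes, target_model: str) -> dict:
    b64 = base64.b64encode(image_bytes).decode("ascii")
    return {
        "model": "gemma4-26b",  # assumed id, check your LM Studio model list
        "messages": [{
            "role": "user",
            "content": [
                {"type": "text",
                 "text": f"Describe this image as a detailed {target_model} prompt."},
                {"type": "image_url",
                 "image_url": {"url": f"data:image/png;base64,{b64}"}},
            ],
        }],
    }

payload = build_image_prompt_request(b"\x89PNG fake bytes", "Flux")
# POST this as JSON to http://localhost:1234/v1/chat/completions
print(json.dumps(payload)[:100])
```

The same payload shape should work for any OpenAI-compatible local server; only the model id changes.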

https://redd.it/1snw7nt
@rStableDiffusion
Klein 9B: Better quality at 1056x1584 than at 832x1216, which would be close to 1MP.

I always generated images at 832x1216 or 1024x1024 and then did the upscale with SeedVR2, but I noticed that when generating directly at 1056x1584, the lighting and skin color become more realistic. As for anatomy issues (3 arms or 6 fingers), those happen at both 832x1216 and 1024x1024 as well, so I just rerun the prompt with more seeds to correct it.



Do you generate at a resolution close to 1 MP (around 1024x1024) or above that? I'm referring to the KSampler output directly, not a post-KSampler upscale model.
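For reference, a quick check of the pixel counts behind the resolutions discussed above (plain arithmetic, nothing model-specific):

```python
# Megapixel counts for the resolutions mentioned in the post.
resolutions = [(832, 1216), (1024, 1024), (1056, 1584)]
megapixels = {f"{w}x{h}": w * h / 1e6 for w, h in resolutions}
for name, mp in megapixels.items():
    print(f"{name}: {mp:.2f} MP")
# 832x1216 (~1.01 MP) and 1024x1024 (~1.05 MP) both sit near 1 MP;
# 1056x1584 is ~1.67 MP, i.e. roughly 65% more pixels.
```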

https://redd.it/1snzldm
@rStableDiffusion
I have extracted the LoRA from ERNIE Image Turbo.
https://redd.it/1so42k8
@rStableDiffusion
Cheaper Qwen VAE for Anima (and its training)

https://huggingface.co/Anzhc/Qwen2D-VAE

https://github.com/Anzhc/anzhc-qwen2d-comfyui/tree/main

Just a modification of the Qwen Image VAE that lets you skip the parts that are useless for non-video models. I have tried it with LoRA training as well; as far as I can see it works the same, so you can use it to save time on caching, or to drastically speed up VAE processing in end-to-end training pipelines.

Overall, from my tests, this VAE produces results identical to the original, but at 3x less VRAM and better speed.

Caching 51 images at 768px with the full VAE: 37 seconds
Caching 51 images at 1024px with the modified VAE: 34 seconds

(I know they are not the same resolution, but I was lazy.)

VRAM picture:

https://preview.redd.it/shdvwje5esvg1.png?width=580&format=png&auto=webp&s=3b99db58f52b519680b2dafb2de6bb80aa577e4b

Comfyui loading:

https://preview.redd.it/vslikw1yesvg1.png?width=647&format=png&auto=webp&s=8aa6f2d138f2c4955aa7358d78e34ec04488d695

85mb vs 242mb

Some benchmarks from ChatGPT:

https://preview.redd.it/me8gokk5fsvg1.png?width=757&format=png&auto=webp&s=482786eb94c25969e6bf764744b95065648de1b5




Benchmark results:

https://preview.redd.it/q2vw2bpcesvg1.png?width=1159&format=png&auto=webp&s=995a05c4bd7d55ebee31cc5f202599efa78f383a


Left: modified, right: full Qwen VAE

Basically just noise-level change. In practice, the difference in decode comes out to roughly ±0.


Works interchangeably with the original on image content:

https://preview.redd.it/1ttkadtresvg1.png?width=2346&format=png&auto=webp&s=5328906d80372a241be96fc91a985dc2a52bcbb5

(the other way around works too, of course)


The whole thing is basically collapsing Conv3D to Conv2D, which apparently results in virtually no loss in image encode/decode, while making the VAE 3x smaller and 2.5x faster.
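The Conv3D-to-Conv2D collapse can be sketched like this. This is a minimal PyTorch illustration of the general idea, not the actual Qwen2D-VAE code; it assumes causal temporal convolutions, where a lone frame is padded with zeros in front, so only the last temporal slice of the 3D kernel ever multiplies real data:

```python
import torch
import torch.nn as nn

def collapse_conv3d_to_conv2d(conv3d: nn.Conv3d) -> nn.Conv2d:
    """Collapse a causal Conv3d into a Conv2d for single-frame input.

    Assumption: temporal padding in front of a lone frame is zeros, so
    only the LAST temporal slice of the 3D kernel sees real data;
    keeping just that slice reproduces the single-image path.
    """
    kh, kw = conv3d.kernel_size[1:]
    conv2d = nn.Conv2d(
        conv3d.in_channels, conv3d.out_channels,
        kernel_size=(kh, kw),
        stride=conv3d.stride[1:],
        padding=conv3d.padding[1:],
        bias=conv3d.bias is not None,
    )
    with torch.no_grad():
        conv2d.weight.copy_(conv3d.weight[:, :, -1])  # last temporal tap
        if conv3d.bias is not None:
            conv2d.bias.copy_(conv3d.bias)
    return conv2d

# Sanity check: for one frame with causal (front) zero padding, the
# collapsed 2D conv matches the 3D conv exactly.
c3 = nn.Conv3d(4, 8, kernel_size=(3, 3, 3), padding=(0, 1, 1))
c2 = collapse_conv3d_to_conv2d(c3)
x = torch.randn(1, 4, 1, 32, 32)                             # one frame
x_pad = torch.cat([torch.zeros(1, 4, 2, 32, 32), x], dim=2)  # causal pad
out3 = c3(x_pad)[:, :, 0]
out2 = c2(x[:, :, 0])
print(torch.allclose(out3, out2, atol=1e-5))  # True
```

This also explains the size numbers: dropping the temporal kernel dimension (here 3 taps down to 1) shrinks each conv's weight tensor by roughly that factor.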


Idk, that's it; use it if you want. I was just fed up with how inefficient the use of temporal VAEs is for non-temporal goon models.

After installing the node, you can just replace your Qwen VAE with the Qwen2D one; that's it.

https://redd.it/1so865j
@rStableDiffusion
ComfyUI_RaykoStudio has been updated!

# Outpainting is now even easier! The new RS Outpaint node provides up to 100% expansion of your image within the limits you set!

https://preview.redd.it/wlq03x5iugvg1.jpg?width=1670&format=pjpg&auto=webp&s=dc7c61f63316cdce9d1c866c2cc28e7d2d5665de

https://preview.redd.it/8d5pkijjugvg1.jpg?width=1222&format=pjpg&auto=webp&s=8949cac782a375b9e20ba588a692bb7ed1fc1615

Link to nodes pack: [https://github.com/Raykosan/ComfyUI\_RaykoStudio](https://github.com/Raykosan/ComfyUI_RaykoStudio)
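For intuition, outpainting in general works by enlarging the canvas and masking the new area for the model to fill. A toy NumPy sketch of that idea (not the RS Outpaint implementation; function name and parameters are made up for illustration):

```python
import numpy as np

# Toy sketch (NOT the RS Outpaint code): outpainting enlarges the canvas
# and builds a mask marking the new region for the model to paint.
def make_outpaint_canvas(img, left=0, right=0, top=0, bottom=0):
    h, w, c = img.shape
    canvas = np.zeros((h + top + bottom, w + left + right, c), img.dtype)
    canvas[top:top + h, left:left + w] = img
    mask = np.ones(canvas.shape[:2], np.uint8)   # 1 = area to generate
    mask[top:top + h, left:left + w] = 0         # 0 = keep original pixels
    return canvas, mask

img = np.full((4, 4, 3), 255, np.uint8)
canvas, mask = make_outpaint_canvas(img, left=2, right=2)  # 100% width growth
print(canvas.shape, int(mask.sum()))  # (4, 8, 3) 16
```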

https://redd.it/1so5458
@rStableDiffusion