Where are Steps 2 and 3 in Qwen 2509 Image Edit?
I am using the Qwen 2509 Image Edit template from the ComfyUI templates section. When I enter the subgraph, I only see Step 1 - Load Models and Step 4 - Prompt. The tutorials I've seen online have a Step 2 - Upload image for editing and a Step 3 - Image size. Where are these?
https://preview.redd.it/wt87c2ecv11h1.png?width=3600&format=png&auto=webp&s=cba9109379eab9216e10e7bd83a05ebf99e74f6f
https://redd.it/1tcq4y5
@rStableDiffusion
Guy posts a real painting, disguising it as a generated image. AI critics have a lot to critique.
https://x.com/SHL0MS/status/2054280631807316329
https://redd.it/1tcrjkf
@rStableDiffusion
X (formerly Twitter)
𒐪 (@SHL0MS) on X
i just generated an image in the style of a Monet painting using AI
please describe, in as much detail as possible, what makes this inferior to a real Monet painting
Someone posted a real Monet to Twitter but said it was AI-generated. The replies are amazing, pretentious, and confidently wrong
https://redd.it/1tcxmdy
@rStableDiffusion
Anima TrainFlow — Simple One-Page LoRA Trainer for Anima 2B (Portable, 6GB VRAM, Optimized Config)
https://redd.it/1tcxhoq
@rStableDiffusion
Qwen-Image-VAE-2.0 Technical Report
arxiv.org/pdf/2605.13565
"We present Qwen-Image-VAE-2.0, a suite of high-compression Variational Autoencoders (VAEs) that achieve significant advances in both reconstruction fidelity and diffusability. To address the reconstruction bottlenecks of high compression, we adopt an improved architecture featuring Global Skip Connections (GSC) and expanded latent channels. Moreover, we scale training to billions of images and incorporate a synthetic rendering engine to improve performance in text-rich scenarios. To tackle the convergence challenges of high-dimensional latent space, we implement an enhanced semantic alignment strategy to make the latent space highly amenable to diffusion modeling. To optimize computational efficiency, we leverage an asymmetric and attention-free encoder-decoder backbone to minimize encoding overhead. We present a comprehensive evaluation of Qwen-Image-VAE-2.0 on public reconstruction benchmarks. To evaluate performance in text-rich scenarios, we propose OmniDoc-TokenBench, a new benchmark comprising a diverse collection of real-world documents coupled with specialized OCR-based evaluation metrics. Qwen-Image-VAE-2.0 achieves state-of-the-art reconstruction performance, demonstrating exceptional capabilities in both general domains and text-rich scenarios at high compression ratio. Furthermore, downstream DiT experiments reveal our models possess superior diffusability, significantly accelerating convergence compared to existing high-compression baselines. These establish Qwen-Image-VAE-2.0 as a leading model with high compression, superior reconstruction, and exceptional diffusability."
Key innovations:
Global Skip Connections (GSC): This architectural change allows the model to "remember" fine details from the original image and pass them directly through the compression bottleneck, significantly improving the clarity of the final output.
Asymmetric & Attention-Free Backbone: They made the encoder (which processes the image) very lightweight and fast while keeping the decoder (which reconstructs the image) powerful. By removing "Attention" layers in the VAE itself, they drastically reduced the computational cost (FLOPs).
Semantic Alignment Strategy: To make the model better for generating images (diffusability), they forced the latent space to align more closely with visual "meaning." This helps downstream models learn much faster.
Synthetic Rendering for Text: They trained the model on billions of images, including a massive set of synthetically rendered documents. This makes the VAE exceptionally good at reconstructing text-rich images (documents, posters, covers, etc.) where most other VAEs fail.
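To make "high compression" and "expanded latent channels" concrete, here is a minimal sketch of the latent-size arithmetic. It assumes the variant names quoted from the report (f16c128, f32c192) follow the common convention where `fN` is the spatial downsampling factor per side and `cM` is the number of latent channels; the report excerpt here does not spell that out. The label `f8c16` is a hypothetical stand-in for a generic f8 VAE with 16 latent channels, and all function names are illustrative.

```python
import re


def parse_variant(name):
    """Split a name like 'f16c128' into (spatial_factor, latent_channels).

    Assumption: 'f' is the per-side downsampling factor, 'c' the channel count.
    """
    match = re.fullmatch(r"f(\d+)c(\d+)", name)
    if match is None:
        raise ValueError(f"unrecognised variant name: {name!r}")
    return int(match.group(1)), int(match.group(2))


def latent_shape(height, width, variant):
    """Return (channels, latent_height, latent_width) for an image of this size."""
    factor, channels = parse_variant(variant)
    if height % factor or width % factor:
        raise ValueError("image sides must be divisible by the spatial factor")
    return channels, height // factor, width // factor


def compression_ratio(height, width, variant):
    """Ratio of RGB input values (3 * H * W) to latent values (C * h * w)."""
    channels, h, w = latent_shape(height, width, variant)
    return (3 * height * width) / (channels * h * w)


if __name__ == "__main__":
    # 'f8c16' is a hypothetical label, not a name taken from the report.
    for variant in ("f8c16", "f16c128", "f32c192"):
        shape = latent_shape(256, 256, variant)
        print(variant, shape, compression_ratio(256, 256, variant))
```

Under these assumptions, a 256×256 image maps to a 16×16 grid with 128 channels for f16c128, so the latent grid has a quarter as many positions as an f8 grid while each position carries far more channels. That trade is presumably what "expanded latent channels" refers to.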
alibaba/OmniDoc-TokenBench
"We conduct a comprehensive evaluation on OmniDoc-TokenBench (~3K text-rich images, 256×256 resolution). Models are grouped by spatial compression factor and sorted by NED within each group.
Our Qwen-Image-VAE-2.0 achieves state-of-the-art reconstruction across all compression ratios. The f16c128 variant attains SSIM 0.9706 and PSNR 30.45 dB, surpassing the best f8 baseline (FLUX.1-dev at 0.9364 / 26.24 dB) despite 2× higher spatial compression. In terms of text fidelity (NED), f16c128 reaches 0.9617, exceeding all evaluated VAEs. Even under extreme f32 compression, our f32c192 achieves NED 0.8555, surpassing multiple f16
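For readers unfamiliar with the metrics quoted above, here is a rough sketch of two of them. PSNR follows its standard definition. The NED function is only an assumption: since higher values are reported as better, it is written as one minus the Levenshtein distance divided by the longer string's length, which may differ from the benchmark's exact OCR-based formula. Function names are illustrative.

```python
import math


def psnr(reference, reconstruction, max_value=255.0):
    """Peak signal-to-noise ratio in dB between two equal-length pixel sequences."""
    if len(reference) != len(reconstruction) or not reference:
        raise ValueError("inputs must be non-empty and the same length")
    mse = sum((a - b) ** 2 for a, b in zip(reference, reconstruction)) / len(reference)
    if mse == 0:
        return math.inf  # identical inputs: no noise at all
    return 10.0 * math.log10(max_value ** 2 / mse)


def levenshtein(a, b):
    """Minimum number of single-character insertions, deletions, or substitutions."""
    previous = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        current = [i]
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def ned_score(expected, recognised):
    """Assumed NED-style similarity: 1.0 means the OCR text matches exactly."""
    longest = max(len(expected), len(recognised))
    if longest == 0:
        return 1.0
    return 1.0 - levenshtein(expected, recognised) / longest


if __name__ == "__main__":
    print(psnr([255, 0], [0, 0]))
    print(ned_score("abcd", "abcf"))
```

In a benchmark like this, the text passed to `ned_score` would presumably come from running OCR on the original image and on the VAE reconstruction, then averaging over the dataset.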
LTX Director - All-In-One Timeline Editor. I2V, T2V, FLFF, Prompt Relay, Custom Audio, and more! Unlock LTX 2.3's full potential!
https://youtu.be/fZgtkRcu4_k
https://redd.it/1tczxqw
@rStableDiffusion
YouTube
LTX Director - The All-In-One Timeline Editor. I2V, T2V, FLFF, Prompt Relay, Custom Audio, and more!
A Complete Timeline Editor For LTX 2.3.
Download for free here: https://github.com/WhatDreamsCost/WhatDreamsCost-ComfyUI
Example workflows here:
https://github.com/WhatDreamsCost/WhatDreamsCost-ComfyUI/tree/main/example_workflows
Main Features:
- Fully…
AsymFLUX.2-klein-9B - Pixel Space Model.
Pixel-space text-to-image model AsymFLUX.2-klein finetuned from black-forest-labs/FLUX.2-klein-base-9B, using the AsymFlow method proposed in the paper:
https://preview.redd.it/moe2i7xjt51h1.png?width=3518&format=png&auto=webp&s=a56904867faa1523161bb71b4414939cfd9277a2
HF: Lakonik/AsymFLUX.2-klein-9B · Hugging Face
Paper: [\[2605.12964\] Asymmetric Flow Models](https://arxiv.org/abs/2605.12964)
Code: LakonLab/docs/AsymFlow.md at main · Lakonik/LakonLab
https://redd.it/1td9ojh
@rStableDiffusion
anima pv2 vs anima pv3 vs anima-base v1
[A close-up, high-contrast illustration depicts a terrifying embrace between two female figures against a dark, shadowy background. In the foreground, a young woman with long, messy blonde hair and pale skin sits in a state of distress. She wears a loose, white, long-sleeved dress or gown that appears slightly soiled. Her blue eyes are wide and filled with tears, and her mouth is slightly open in a grimace of fear as she looks forward. Looming directly behind her is a monstrous, demonic figure with long, disheveled black hair that blends into the darkness. This figure has glowing red eyes and a wide, menacing grin that reveals sharp teeth. She is embracing the blonde woman from behind, her body pressed close. Her left hand, which appears blackened or gloved, grips the blonde woman's chin and jaw forcefully, tilting her head slightly. The dark figure extends a long, pink tongue, licking the side of the blonde woman's face near her cheek, adding to the predatory and violating nature of the scene. The lighting is dramatic, highlighting the blonde woman's tears and the texture of her white dress while casting the attacker mostly in shadow, emphasizing the horror and intensity of the moment. The art style is painterly with visible brushstrokes, giving it a gritty, textured look reminiscent of dark anime or horror manga.](https://preview.redd.it/sk66rj8n861h1.png?width=2592&format=png&auto=webp&s=80c937ab94ad392e6cd621e87da4392ae88c79bd)
[A full-body, front-facing shot of a dark, multi-limbed silhouette figure rising from a mass of indistinct, shadowy forms at the bottom of the frame. The central figure has long, wild black hair flowing upward and outward as if caught in wind or supernatural force; its face is partially visible — pale with sharp features, eyes closed or downcast, expression serene yet ominous. Extending from its torso are eight elongated arms, each ending in clawed hands splayed in dynamic, reaching poses — some pointing upward, others outward or downward, creating a radial symmetry around the body. Behind the figure’s head glows a large, textured circular halo or sunburst pattern rendered in beige and ochre tones, radiating thin lines outward like rays of light or energy; within this circle, near the top center, appears a single black Japanese kanji character “神” \(kami\/god\). The background resembles aged parchment or canvas, stained with rust-colored smudges and faint vertical striations, enhancing the antique, ritualistic feel. Lighting is high-contrast: the figure is nearly pure black against the luminous backdrop, emphasizing form through negative space while leaving facial details and limb contours sharply defined. The atmosphere is mythic, divine, and terrifying — blending Eastern iconography with grotesque multiplicity to evoke a deity of chaos, power, or judgment emerging from primordial darkness within a sacred, weathered pictorial field.](https://preview.redd.it/dzs8x5wr861h1.png?width=2592&format=png&auto=webp&s=47538e7afc05ef14a2b42bee7a3122ae31b2b6f5)
[A full-body, side-profile shot of two individuals standing back-to-back against a stark white background. The taller figure on the left is a man with shoulder-length dark hair falling across his forehead and neck; he wears a black long-sleeved shirt that clings to his muscular torso, revealing defined shoulders and collarbones under dramatic lighting. His face is turned slightly downward, eyes half-lidded, expression somber or contemplative. Behind him and to the right stands a shorter individual — likely a woman — with short, spiky dark hair and sharp facial features; she wears a form-fitting black turtleneck dress or top, her body angled away but head turned toward the viewer’s left, gaze steady and intense. A small geometric earring or accessory glints at her left earlobe. Lighting originates from the front-right, casting deep shadows along their backs and sides while illuminating parts of their faces, necks, and arms in high contrast. The composition emphasizes physical proximity without touch,