ComfyUI Node: Unified Image + Mask Resize (LTX 2.3 ready, keeps BOTH sides divisible by 32, replaces Image Resize + Image Resize V2 + Mask mismatch issues)
https://redd.it/1tci23f
@rStableDiffusion
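The node title describes snapping both dimensions to multiples of 32 and applying the same target size to the image and its mask. A minimal sketch of that rounding logic (my own illustration; the actual node's behavior may differ):

```python
def snap_to_multiple(width, height, multiple=32):
    """Round each side to the nearest multiple, never below one multiple.
    Feeding the SAME (width, height) to both the image and mask resize
    is what prevents image/mask dimension mismatches downstream."""
    snap = lambda v: max(multiple, round(v / multiple) * multiple)
    return snap(width), snap(height)

# e.g. a 1023x575 frame snaps to 1024x576, valid for models that
# require both sides divisible by 32
```

The `max(multiple, ...)` guard keeps degenerate inputs (anything under half a multiple) from collapsing to zero.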
Last week in Generative Image & Video
I curate a weekly multimodal AI roundup. Here are the open-source image & video highlights from the last week:
- CausalCine — Interactive autoregressive framework for multi-shot video narratives. Content-Aware Memory Routing retrieves historical KV entries by attention relevance instead of temporal proximity, solving motion stagnation and semantic drift in long-rollout generation. Distilled to a few-step generator for real-time use.
https://reddit.com/link/1tcnpxj/video/tbryyz3s611h1/player
[Paper](http://arxiv.org/abs/2605.12496v1) | [GitHub](https://github.com/yihao-meng/CausalCine)
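The routing idea described above (retrieve cached entries by relevance, not recency) can be sketched in a few lines. This is my own simplification, not the paper's implementation:

```python
import math

def route_memory(query, keys, k=2):
    """Return indices of the k cached KV entries most relevant to the query
    (scaled dot-product score), rather than simply the k most recent."""
    scores = [sum(q * ki for q, ki in zip(query, key)) / math.sqrt(len(query))
              for key in keys]
    return sorted(range(len(keys)), key=lambda i: scores[i], reverse=True)[:k]

# A recency-based cache would always pick the last k entries; relevance
# routing instead surfaces whichever historical shots match the current query.
```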
- SwiftI2V — Efficient 2K image-to-video generation. Low-res motion drafting followed by high-res refinement while preserving source image detail.
https://reddit.com/link/1tcnpxj/video/8n6t3ust611h1/player
[Paper](https://arxiv.org/abs/2605.06356) | [GitHub](https://github.com/hkust-longgroup/SwiftI2V) | [Project Page](https://hkust-longgroup.github.io/SwiftI2V/)
- OmniGen2 — Unified image generation model handling text-to-image, editing, subject-driven generation, and visual conditions in one architecture. | [Paper](http://arxiv.org/abs/2605.07254v1)
https://preview.redd.it/iimjl0d2711h1.png?width=2772&format=png&auto=webp&s=21e30ab3ddf374f38b94c4b57498a870ae9a27ee
- HiDream-O1-Image — Natively unified image generative foundation model. Open weights and code (8B model). | [Paper](http://arxiv.org/abs/2605.11061v1) | [GitHub](https://github.com/HiDream-ai/HiDream-O1-Image) | [Hugging Face](https://huggingface.co/HiDream-ai/HiDream-O1-Image)
https://preview.redd.it/kj4px8mv711h1.png?width=1456&format=png&auto=webp&s=bdfd6297ff6ad0a52ff39188571a5d9230f1825c
- CDM — Continuous-time distribution matching for few-step diffusion distillation. High-quality images in fewer steps. Models released for SD3 Medium and Longcat.
https://preview.redd.it/bv980n9u711h1.png?width=1456&format=png&auto=webp&s=9e9a3695ab5153b3545bf913b9b9da87c37b08cf
[Paper](https://arxiv.org/abs/2605.06376) | [GitHub](https://github.com/byliutao/cdm) | [HF Models](https://huggingface.co/byliutao/stable-diffusion-3-medium-turbo)
- PhysForge — Generates physics-grounded 3D assets with parts, materials, joints, mass, and movement rules for simulation and games.
https://reddit.com/link/1tcnpxj/video/yr62agus711h1/player
[Paper](https://arxiv.org/abs/2605.05163) | [GitHub](https://github.com/HKU-MMLab/PhysForge) | [Project Page](https://hku-mmlab.github.io/PhysForge/)
- u/TensorForger built a Flux.2-Klein pipeline for real-time webcam stream processing at 30 FPS. | [Reddit](https://www.reddit.com/r/StableDiffusion/comments/1t7nd7e/flux2klein_pipeline_for_realtime_webcam_stream/?utm_source=share&utm_medium=web3x&utm_name=web3xcss&utm_term=1&utm_content=share_button)
https://reddit.com/link/1tcnpxj/video/opnfdkv7911h1/player
- u/aniki_kun shared a ZIT I2I “Character LORA Transformation” workflow. | [Reddit](https://www.reddit.com/r/StableDiffusion/comments/1tae2yl/zit_i2i_character_lora_transformation_workflow/?utm_source=share&utm_medium=web3x&utm_name=web3xcss&utm_term=1&utm_content=share_button)
https://preview.redd.it/yjuuhq27911h1.jpg?width=1080&format=pjpg&auto=webp&s=56b2df98f3d27029c7019e1ffe01f9b3db34f69f
- u/ThaJedi finetuned Qwen3-1.7B to imitate the original Z-Image text encoder, using 21% less VRAM. | [Reddit](https://www.reddit.com/r/StableDiffusion/comments/1t71hvm/i_finetuned_qwen317b_to_imitate_original_zimage/?utm_source=share&utm_medium=web3x&utm_name=web3xcss&utm_term=1&utm_content=share_button)
- Juggernaut Z dropped.
| [CivitAI](https://civitai.red/models/2600510/juggernaut-z?modelVersionId=2921151)
https://preview.redd.it/8u7gwjd5911h1.png?width=450&format=png&auto=webp&s=100a9e84a5c64cd2752423c8e6e619c6fb4fd820
- ltx_model released LipDub (Beta), an open-source lipsync IC-LoRA. | [Reddit](https://www.reddit.com/r/StableDiffusion/comments/1ta66f1/lipdub_beta_new_opensource_lipsync_iclora/?utm_source=share&utm_medium=web3x&utm_name=web3xcss&utm_term=1&utm_content=share_button)
- MiniMind-O — 0.1B speech-native omni model. Text/speech/image in, text + streaming speech out. Code, checkpoints, and training datasets released.
https://preview.redd.it/ay16yj3h811h1.png?width=1456&format=png&auto=webp&s=971899daee79f7dd9c7acd8bdb976ea2bfe78dda
[Paper](http://arxiv.org/abs/2605.03937v1) | [GitHub](https://github.com/jingyaogong/minimind-o)
Honorable Mentions:
WavCube — Unified speech representation matching WavLM on SUPERB with 8x compression. SOTA zero-shot TTS. Open weights. | [Paper](http://arxiv.org/abs/2605.06407v1) | [GitHub](https://github.com/yanghaha0908/WavCube) | [Hugging Face](https://huggingface.co/yhaha/WavCube)
[The overall architecture of the WavCube representation.](https://preview.redd.it/0hlfjhvq811h1.png?width=1456&format=png&auto=webp&s=9f18dbd14070d89b11500ddbccc3cd8db4295b00)
Check out the [full roundup](https://open.substack.com/pub/thelivingedge/p/last-week-in-multimodal-ai-56-from?r=12l7fk&utm_campaign=post-expanded-share&utm_medium=web) for more demos, papers, and resources.
https://redd.it/1tcnpxj
@rStableDiffusion
Where are Steps 2 and 3 in Qwen 2509 Image Edit?
I am using the Qwen 2509 Image Edit template from the ComfyUI templates section. When I enter the subgraph, I only see Step 1 - Load Models and Step 4 - Prompt. The tutorials I've seen online also have a Step 2 - Upload image for editing and a Step 3 - Image size. Where are these?
https://preview.redd.it/wt87c2ecv11h1.png?width=3600&format=png&auto=webp&s=cba9109379eab9216e10e7bd83a05ebf99e74f6f
https://redd.it/1tcq4y5
@rStableDiffusion
Guy posts a real painting, disguising it as a generated image. AI critics have a lot to critique.
https://x.com/SHL0MS/status/2054280631807316329
https://redd.it/1tcrjkf
@rStableDiffusion
𒐪 (@SHL0MS) on X: "i just generated an image in the style of a Monet painting using AI. please describe, in as much detail as possible, what makes this inferior to a real Monet painting"
Someone posted a real Monet to Twitter but said it was AI-generated. The replies are amazing: pretentious and confidently wrong
https://redd.it/1tcxmdy
@rStableDiffusion
Anima TrainFlow — Simple One-Page LoRA Trainer for Anima 2B (Portable, 6GB VRAM, Optimized Config)
https://redd.it/1tcxhoq
@rStableDiffusion
Qwen-Image-VAE-2.0 Technical Report
arxiv.org/pdf/2605.13565
"We present Qwen-Image-VAE-2.0, a suite of high-compression Variational Autoencoders (VAEs) that achieve significant advances in both reconstruction fidelity and diffusability. To address the reconstruction bottlenecks of high compression, we adopt an improved architecture featuring Global Skip Connections (GSC) and expanded latent channels. Moreover, we scale training to billions of images and incorporate a synthetic rendering engine to improve performance in text-rich scenarios. To tackle the convergence challenges of high-dimensional latent space, we implement an enhanced semantic alignment strategy to make the latent space highly amenable to diffusion modeling. To optimize computational efficiency, we leverage an asymmetric and attention-free encoder-decoder backbone to minimize encoding overhead. We present a comprehensive evaluation of Qwen-Image-VAE-2.0 on public reconstruction benchmarks. To evaluate performance in text-rich scenarios, we propose OmniDoc-TokenBench, a new benchmark comprising a diverse collection of real-world documents coupled with specialized OCR-based evaluation metrics. Qwen-Image-VAE-2.0 achieves state-of-the-art reconstruction performance, demonstrating exceptional capabilities in both general domains and text-rich scenarios at high compression ratio. Furthermore, downstream DiT experiments reveal our models possess superior diffusability, significantly accelerating convergence compared to existing high-compression baselines. These establish Qwen-Image-VAE-2.0 as a leading model with high compression, superior reconstruction, and exceptional diffusability."
Key innovations:
Global Skip Connections (GSC): This architectural change allows the model to "remember" fine details from the original image and pass them directly through the compression bottleneck, significantly improving the clarity of the final output.
Asymmetric & Attention-Free Backbone: They made the encoder (which processes the image) very lightweight and fast while keeping the decoder (which reconstructs the image) powerful. By removing "Attention" layers in the VAE itself, they drastically reduced the computational cost (FLOPs).
Semantic Alignment Strategy: To make the model better for generating images (diffusability), they forced the latent space to align more closely with visual "meaning." This helps downstream models learn much faster.
Synthetic Rendering for Text: They trained the model on billions of images, including a massive set of synthetically rendered documents. This makes the VAE exceptionally good at reconstructing OCR-rich images (documents, posters, covers, etc.) where most other VAEs fail.
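The global-skip idea above can be illustrated with a toy sketch (my own simplification; the paper's GSC operates on learned features across the network, not raw pixels): even if the bottleneck destroys everything, the skip path still preserves a coarse version of the input.

```python
def block_mean_upsample(x, f):
    """Downsample a 2D grid by averaging f x f blocks, then upsample by repetition."""
    h, w = len(x), len(x[0])
    coarse = [[sum(x[i * f + di][j * f + dj] for di in range(f) for dj in range(f)) / (f * f)
               for j in range(w // f)]
              for i in range(h // f)]
    return [[coarse[i // f][j // f] for j in range(w)] for i in range(h)]

def toy_global_skip(x, encode, decode, f=2):
    """Reconstruction = decoder(encoder(x)) plus a low-res copy of the input
    carried past the bottleneck, so coarse structure survives compression."""
    recon = decode(encode(x))
    skip = block_mean_upsample(x, f)
    return [[r + s for r, s in zip(rrow, srow)] for rrow, srow in zip(recon, skip)]
```

With a degenerate encoder/decoder that outputs all zeros, the result is exactly the block-averaged input — showing what the skip path alone preserves.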
alibaba/OmniDoc-TokenBench
"We conduct a comprehensive evaluation on OmniDoc-TokenBench (~3K text-rich images, 256×256 resolution). Models are grouped by spatial compression factor and sorted by NED within each group.
Our Qwen-Image-VAE-2.0 achieves state-of-the-art reconstruction across all compression ratios. The f16c128 variant attains SSIM 0.9706 and PSNR 30.45 dB, surpassing the best f8 baseline (FLUX.1-dev at 0.9364 / 26.24 dB) despite 2× higher spatial compression. In terms of text fidelity (NED), f16c128 reaches 0.9617, exceeding all evaluated VAEs. Even under extreme f32 compression, our f32c192 achieves NED 0.8555, surpassing multiple f16 baselines."
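In the f{N}c{M} naming used above, f is the spatial downsampling factor and c the latent channel count. A quick sketch (my own, for illustration) of what the variants mean for a 256×256 input:

```python
def latent_shape(h, w, f, c):
    """Latent grid for a VAE with spatial compression factor f and c latent channels."""
    assert h % f == 0 and w % f == 0, "input must be divisible by f"
    return (c, h // f, w // f)

# A 256x256 image under the variants discussed above
# (f8c16 approximates a typical 16-channel f8 baseline):
print(latent_shape(256, 256, f=8, c=16))    # (16, 32, 32)
print(latent_shape(256, 256, f=16, c=128))  # (128, 16, 16)
print(latent_shape(256, 256, f=32, c=192))  # (192, 8, 8)
```

Doubling f quarters the latent grid area, which is why the expanded channel counts (c128, c192) are needed to keep enough capacity for reconstruction.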
LTX Director - All-In-One Timeline Editor. I2V, T2V, FLFF, Prompt Relay, Custom Audio, and more! Unlock LTX 2.3's full potential!
https://youtu.be/fZgtkRcu4_k
https://redd.it/1tczxqw
@rStableDiffusion
A Complete Timeline Editor For LTX 2.3.
Download for free here: https://github.com/WhatDreamsCost/WhatDreamsCost-ComfyUI
Example workflows here:
https://github.com/WhatDreamsCost/WhatDreamsCost-ComfyUI/tree/main/example_workflows
Main Features:
- Fully…
AsymFLUX.2-klein-9B - Pixel Space Model.
AsymFLUX.2-klein is a pixel-space text-to-image model finetuned from black-forest-labs/FLUX.2-klein-base-9B using the AsymFlow method proposed in the paper:
https://preview.redd.it/moe2i7xjt51h1.png?width=3518&format=png&auto=webp&s=a56904867faa1523161bb71b4414939cfd9277a2
HF: Lakonik/AsymFLUX.2-klein-9B · Hugging Face
Paper: [2605.12964 Asymmetric Flow Models](https://arxiv.org/abs/2605.12964)
Code: LakonLab/docs/AsymFlow.md at main · Lakonik/LakonLab
https://redd.it/1td9ojh
@rStableDiffusion