ComfyUI Node: Unified Image + Mask Resize (LTX 2.3 ready, keeps BOTH sides divisible by 32, replaces Image Resize + Image Resize V2 + Mask mismatch issues)
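
A minimal sketch of the core behavior (a hypothetical helper, not the node's actual code): resize an image and its mask with the same transform, then snap both output sides down to the nearest multiple of 32 so latent video models such as LTX accept the dimensions.

```python
import torch
import torch.nn.functional as F

def resize_image_and_mask(image: torch.Tensor, mask: torch.Tensor,
                          target_w: int, target_h: int, multiple: int = 32):
    """Resize image (B,C,H,W) and mask (B,1,H,W) together, snapping both
    output sides down to the nearest multiple of `multiple`."""
    w = max(multiple, (target_w // multiple) * multiple)
    h = max(multiple, (target_h // multiple) * multiple)
    # Bilinear for the image; nearest for the mask so it stays hard-edged.
    image_out = F.interpolate(image, size=(h, w), mode="bilinear",
                              align_corners=False)
    mask_out = F.interpolate(mask, size=(h, w), mode="nearest")
    return image_out, mask_out
```

Because the image and mask go through one call with one computed size, they can never drift out of sync the way separate resize nodes can.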
https://redd.it/1tci23f
@rStableDiffusion
Wish I had gotten the 96GB DDR4 RAM when I had the chance.
https://redd.it/1tcns1x
@rStableDiffusion
Last week in Generative Image & Video

I curate a weekly multimodal AI roundup; here are the open-source image and video highlights from the past week:

- CausalCine — Interactive autoregressive framework for multi-shot video narratives. Content-Aware Memory Routing retrieves historical KV entries by attention relevance instead of temporal proximity, solving motion stagnation and semantic drift in long-rollout generation (a toy sketch of the routing idea follows below). Distilled to a few-step generator for real-time use.

https://reddit.com/link/1tcnpxj/video/tbryyz3s611h1/player

[Paper](http://arxiv.org/abs/2605.12496v1) | [GitHub](https://github.com/yihao-meng/CausalCine)
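
Here is that toy sketch of content-aware KV routing (shapes and names are mine, not the paper's): score cached keys against the current query and keep the top-k entries by attention relevance rather than by recency.

```python
import torch

def route_kv_by_relevance(query: torch.Tensor, keys: torch.Tensor,
                          values: torch.Tensor, k: int = 256):
    """Pick the k cached KV entries most relevant to the current query,
    regardless of how long ago they were generated.
    query: (d,)   keys: (N, d)   values: (N, d)
    """
    # Attention-style relevance score between the query and every cached key.
    scores = (keys @ query) / query.shape[-1] ** 0.5          # (N,)
    top = torch.topk(scores, min(k, keys.shape[0])).indices   # most relevant, not most recent
    return keys[top], values[top]
```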

- SwiftI2V — Efficient 2K image-to-video generation: low-res motion drafting followed by high-res refinement that preserves source-image detail (see the schematic sketch below).

https://reddit.com/link/1tcnpxj/video/8n6t3ust611h1/player

[Paper](https://arxiv.org/abs/2605.06356) | [GitHub](https://github.com/hkust-longgroup/SwiftI2V) | [Project Page](https://hkust-longgroup.github.io/SwiftI2V/)
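
The two-stage pattern is easy to sketch in the abstract (the callables below are placeholders, not SwiftI2V's API): draft motion cheaply at low resolution, upsample, then refine at full resolution while reinjecting source detail.

```python
import torch
import torch.nn.functional as F

def draft_then_refine(draft_fn, refine_fn, image: torch.Tensor,
                      lo=(256, 448), hi=(1088, 1920)) -> torch.Tensor:
    """Schematic coarse-to-fine I2V loop. `draft_fn` and `refine_fn` stand in
    for the low-res and high-res denoisers; frames are (B,C,H,W) with time
    stacked along the batch dimension for simplicity."""
    draft = draft_fn(F.interpolate(image, size=lo, mode="bilinear"))  # cheap motion pass
    upscaled = F.interpolate(draft, size=hi, mode="bilinear")         # carry motion to 2K
    return refine_fn(upscaled, detail_ref=image)                      # restore source detail
```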

- OmniGen2 — Unified image generation model handling text-to-image, editing, subject-driven generation, and visual conditions in one architecture. | [Paper](http://arxiv.org/abs/2605.07254v1)

https://preview.redd.it/iimjl0d2711h1.png?width=2772&format=png&auto=webp&s=21e30ab3ddf374f38b94c4b57498a870ae9a27ee

- HiDream-O1-Image — A natively unified image-generation foundation model. Open weights and code (8B model). | [Paper](http://arxiv.org/abs/2605.11061v1) | [GitHub](https://github.com/HiDream-ai/HiDream-O1-Image) | [Hugging Face](https://huggingface.co/HiDream-ai/HiDream-O1-Image)

https://preview.redd.it/kj4px8mv711h1.png?width=1456&format=png&auto=webp&s=bdfd6297ff6ad0a52ff39188571a5d9230f1825c

- CDM — Continuous-time distribution matching for few-step diffusion distillation: high-quality images in far fewer sampling steps. Models released for SD3 Medium and Longcat (a hedged usage sketch follows the links below).

https://preview.redd.it/bv980n9u711h1.png?width=1456&format=png&auto=webp&s=9e9a3695ab5153b3545bf913b9b9da87c37b08cf

[Paper](https://arxiv.org/abs/2605.06376) | [GitHub](https://github.com/byliutao/cdm) | [HF Models](https://huggingface.co/byliutao/stable-diffusion-3-medium-turbo)
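
If the released SD3-Medium distillation follows the usual turbo-checkpoint convention, usage should look roughly like a standard diffusers call with a very low step count (loading path and scheduler details are assumptions; check the project README):

```python
import torch
from diffusers import StableDiffusion3Pipeline

# Assumption: the checkpoint loads as a standard SD3 pipeline.
pipe = StableDiffusion3Pipeline.from_pretrained(
    "byliutao/stable-diffusion-3-medium-turbo", torch_dtype=torch.float16
).to("cuda")

image = pipe(
    "a photo of a red fox in the snow",
    num_inference_steps=4,   # few-step sampling is the point of the distillation
    guidance_scale=0.0,      # distilled few-step models typically run without CFG
).images[0]
image.save("fox.png")
```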

- PhysForge — Generates physics-grounded 3D assets with parts, materials, joints, mass, and movement rules for simulation and games.

https://reddit.com/link/1tcnpxj/video/yr62agus711h1/player

[Paper](https://arxiv.org/abs/2605.05163) | [GitHub](https://github.com/HKU-MMLab/PhysForge) | [Project Page](https://hku-mmlab.github.io/PhysForge/)

- u/TensorForger built a Flux.2-Klein pipeline for real-time webcam stream processing at 30 FPS. | [Reddit](https://www.reddit.com/r/StableDiffusion/comments/1t7nd7e/flux2klein_pipeline_for_realtime_webcam_stream/?utm_source=share&utm_medium=web3x&utm_name=web3xcss&utm_term=1&utm_content=share_button)

https://reddit.com/link/1tcnpxj/video/opnfdkv7911h1/player

- u/aniki_kun shared a ZIT I2I “Character LORA Transformation” workflow. | [Reddit](https://www.reddit.com/r/StableDiffusion/comments/1tae2yl/zit_i2i_character_lora_transformation_workflow/?utm_source=share&utm_medium=web3x&utm_name=web3xcss&utm_term=1&utm_content=share_button)

https://preview.redd.it/yjuuhq27911h1.jpg?width=1080&format=pjpg&auto=webp&s=56b2df98f3d27029c7019e1ffe01f9b3db34f69f

- u/ThaJedi finetuned Qwen3-1.7B to imitate the original Z-Image text encoder. 21% less VRAM. | [Reddit](https://www.reddit.com/r/StableDiffusion/comments/1t71hvm/i_finetuned_qwen317b_to_imitate_original_zimage/?utm_source=share&utm_medium=web3x&utm_name=web3xcss&utm_term=1&utm_content=share_button)
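
The general recipe for this kind of encoder swap is easy to sketch (a toy training step with made-up module names, not the author's code): regress the small model's hidden states onto the frozen original encoder's embeddings over a prompt corpus.

```python
import torch
import torch.nn.functional as F

def imitation_step(student, teacher, proj, tokens_s, tokens_t, opt):
    """One distillation step: match the student encoder's projected hidden
    states to the frozen teacher's embeddings. Assumes both tokenizations
    are padded to the same sequence length."""
    with torch.no_grad():
        target = teacher(**tokens_t).last_hidden_state     # frozen original encoder
    pred = proj(student(**tokens_s).last_hidden_state)     # small LM + linear projection
    loss = F.mse_loss(pred, target)                        # embedding-matching objective
    opt.zero_grad()
    loss.backward()
    opt.step()
    return loss.item()
```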

- Juggernaut Z dropped. | [CivitAI](https://civitai.red/models/2600510/juggernaut-z?modelVersionId=2921151)

https://preview.redd.it/8u7gwjd5911h1.png?width=450&format=png&auto=webp&s=100a9e84a5c64cd2752423c8e6e619c6fb4fd820

- ltx_model released LipDub (Beta), an open-source lipsync IC-LoRA. | [Reddit](https://www.reddit.com/r/StableDiffusion/comments/1ta66f1/lipdub_beta_new_opensource_lipsync_iclora/?utm_source=share&utm_medium=web3x&utm_name=web3xcss&utm_term=1&utm_content=share_button)

- MiniMind-O — 0.1B speech-native omni model. Text/speech/image in, text + streaming speech out. Code, checkpoints, and training datasets released.

https://preview.redd.it/ay16yj3h811h1.png?width=1456&format=png&auto=webp&s=971899daee79f7dd9c7acd8bdb976ea2bfe78dda

[Paper](http://arxiv.org/abs/2605.03937v1) | [GitHub](https://github.com/jingyaogong/minimind-o)

Honorable Mentions:

- WavCube — Unified speech representation matching WavLM on SUPERB with 8x compression. SOTA zero-shot TTS. Open weights. | [Paper](http://arxiv.org/abs/2605.06407v1) | [GitHub](https://github.com/yanghaha0908/WavCube) | [Hugging Face](https://huggingface.co/yhaha/WavCube)

[The overall architecture of the WavCube representation.](https://preview.redd.it/0hlfjhvq811h1.png?width=1456&format=png&auto=webp&s=9f18dbd14070d89b11500ddbccc3cd8db4295b00)



Check out the [full roundup](https://open.substack.com/pub/thelivingedge/p/last-week-in-multimodal-ai-56-from?r=12l7fk&utm_campaign=post-expanded-share&utm_medium=web) for more demos, papers, and resources.

https://redd.it/1tcnpxj
@rStableDiffusion
Where are Steps 2 and 3 in Qwen 2509 Image Edit?

I am using the Qwen 2509 Image Edit template from the ComfyUI templates section, and when I enter the subgraph I only see Step 1 - Load Models and Step 4 - Prompt. The tutorials I've seen online also show Step 2 - Upload image for editing and Step 3 - Image size. Where are these?

https://preview.redd.it/wt87c2ecv11h1.png?width=3600&format=png&auto=webp&s=cba9109379eab9216e10e7bd83a05ebf99e74f6f



https://redd.it/1tcq4y5
@rStableDiffusion
Someone posted a real Monet to Twitter but said it was AI generated. The replies are amazing: pretentious and confidently wrong
https://redd.it/1tcxmdy
@rStableDiffusion
Anima TrainFlow — Simple One-Page LoRA Trainer for Anima 2B (Portable, 6GB VRAM, Optimized Config)
https://redd.it/1tcxhoq
@rStableDiffusion
Qwen-Image-VAE-2.0 Technical Report

arxiv.org/pdf/2605.13565

"We present Qwen-Image-VAE-2.0, a suite of high-compression Variational Autoencoders (VAEs) that achieve significant advances in both reconstruction fidelity and diffusability. To address the reconstruction bottlenecks of high compression, we adopt an improved architecture featuring Global Skip Connections (GSC) and expanded latent channels. Moreover, we scale training to billions of images and incorporate a synthetic rendering engine to improve performance in text-rich scenarios. To tackle the convergence challenges of high-dimensional latent space, we implement an enhanced semantic alignment strategy to make the latent space highly amenable to diffusion modeling. To optimize computational efficiency, we leverage an asymmetric and attention-free encoder-decoder backbone to minimize encoding overhead. We present a comprehensive evaluation of Qwen-Image-VAE-2.0 on public reconstruction benchmarks. To evaluate performance in text-rich scenarios, we propose OmniDoc-TokenBench, a new benchmark comprising a diverse collection of real-world documents coupled with specialized OCR-based evaluation metrics. Qwen-Image-VAE-2.0 achieves state-of-the-art reconstruction performance, demonstrating exceptional capabilities in both general domains and text-rich scenarios at high compression ratio. Furthermore, downstream DiT experiments reveal our models possess superior diffusability, significantly accelerating convergence compared to existing high-compression baselines. These establish Qwen-Image-VAE-2.0 as a leading model with high compression, superior reconstruction, and exceptional diffusability."


Key innovations:

- Global Skip Connections (GSC): this architectural change lets the model carry fine detail from the original image directly past the compression bottleneck, significantly improving the clarity of the reconstruction (a schematic sketch follows this list).
- Asymmetric, attention-free backbone: the encoder (which processes the image) is kept lightweight and fast, while the decoder (which reconstructs the image) stays powerful. Removing attention layers from the VAE itself drastically cuts computational cost (FLOPs).
- Semantic alignment strategy: to improve diffusability, the latent space is pushed to align with the semantic content of the image, which helps downstream diffusion models converge much faster.
- Synthetic rendering for text: training covered billions of images, including a massive set of synthetically rendered documents, making this VAE exceptionally good at reconstructing text-rich images (documents, posters, covers, etc.) where most other VAEs fail.
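
A schematic of one way a global skip connection can be realized (my reading, inspired by residual autoencoder designs; the paper's exact mechanism may differ): route a lossless space-to-channel downsample of each block's input around the learned path, so the network only has to model the residual detail.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class GlobalSkipEncoderBlock(nn.Module):
    """Downsampling block with a detail-preserving skip: pixel_unshuffle is a
    lossless (C,H,W) -> (4C,H/2,W/2) rearrangement, so fine detail reaches the
    output without passing through the learned convolutions."""
    def __init__(self, c_in: int, c_out: int):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(c_in, c_out, 3, stride=2, padding=1),
            nn.SiLU(),
            nn.Conv2d(c_out, c_out, 3, padding=1),
        )
        self.skip_proj = nn.Conv2d(4 * c_in, c_out, 1)  # match channels after unshuffle

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        skip = F.pixel_unshuffle(x, 2)                  # lossless downsample
        return self.body(x) + self.skip_proj(skip)

block = GlobalSkipEncoderBlock(64, 128)
y = block(torch.randn(1, 64, 256, 256))                 # -> (1, 128, 128, 128)
```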

alibaba/OmniDoc-TokenBench

"We conduct a comprehensive evaluation on OmniDoc-TokenBench (\~3K text-rich images, 256×256 resolution). Models are grouped by spatial compression factor and sorted by NED within each group.

Our Qwen-Image-VAE-2.0 achieves state-of-the-art reconstruction across all compression ratios. The f16c128 variant attains SSIM 0.9706 and PSNR 30.45 dB, surpassing the best f8 baseline (FLUX.1-dev at 0.9364 / 26.24 dB) despite 2× higher spatial compression. In terms of text fidelity (NED), f16c128 reaches 0.9617, exceeding all evaluated VAEs. Even under extreme f32 compression, our f32c192 achieves NED 0.8555, surpassing multiple f16

Anima base v1.0 has been released.
https://redd.it/1td4lbn
@rStableDiffusion