Great news: the ERNIE editing model is expected to be released by the end of this month
https://redd.it/1sm05ml
@rStableDiffusion
Last week in Generative Image & Video
I curate a weekly multimodal AI roundup; here are the open-source image & video highlights from the past week:
Numina - Finally makes AI video generators count objects correctly. Ask for three cats, get three cats. Reads attention during generation, catches counting errors, corrects without retraining. [GitHub](https://github.com/H-EmbodVis/NUMINA) | [Project](https://h-embodvis.github.io/NUMINA/)
https://reddit.com/link/1slz1rq/video/t623pxnc2bvg1/player
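As a rough illustration of the kind of check NUMINA performs (not the authors' code; every name here is hypothetical), the core idea is comparing the numeral requested in the prompt against the number of object instances actually generated:

```python
# Hypothetical sketch: parse the requested count from a prompt and
# compare it to the number of instances detected in the output.
# A negative result means the generator produced too few objects.

NUMBER_WORDS = {"one": 1, "two": 2, "three": 3, "four": 4, "five": 5}

def requested_count(prompt, noun):
    """Return the numeral preceding `noun` in the prompt, if any."""
    for word, value in NUMBER_WORDS.items():
        if f"{word} {noun}" in prompt:
            return value
    return None

def counting_error(prompt, noun, detected_instances):
    """Difference between detected and requested instance counts."""
    want = requested_count(prompt, noun)
    return None if want is None else detected_instances - want

print(counting_error("three cats playing in a garden", "cats", 2))  # -1
```

NUMINA does this inside the diffusion process by reading cross-attention maps rather than running an external detector, but the contract is the same: detect the mismatch, then correct it without retraining.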
Prompt Relay - Training-free temporal control for multi-event video generation. Routes each prompt to a specific time segment with zero computational overhead. Plug-and-play with Wan2.2, CogVideo, HunyuanVideo. Project
https://preview.redd.it/j1mpwbgt3bvg1.jpg?width=1900&format=pjpg&auto=webp&s=905891a7d7397a6a9f83d74b9824f7d6aa7f8005
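To make the "routes each prompt to a specific time segment" idea concrete, here is a minimal sketch (my own illustration, not the Prompt Relay code; all function names are hypothetical): split the frame range across the prompts, then look up which prompt conditions each frame.

```python
# Hypothetical sketch of temporal prompt routing for multi-event video:
# each prompt owns a contiguous segment of frames, and the conditioning
# used at a given frame is the prompt whose segment contains it.

def build_schedule(prompts, total_frames):
    """Split the frame range evenly across the prompts."""
    seg = total_frames / len(prompts)
    return [(round(i * seg), round((i + 1) * seg), p)
            for i, p in enumerate(prompts)]

def prompt_for_frame(schedule, frame):
    """Return the prompt routed to this frame's segment."""
    for start, end, prompt in schedule:
        if start <= frame < end:
            return prompt
    return schedule[-1][2]  # clamp frames past the end to the last event

schedule = build_schedule(
    ["a cat sleeping", "the cat wakes up", "the cat jumps off the sofa"],
    total_frames=48,
)
print(prompt_for_frame(schedule, 0))   # a cat sleeping
print(prompt_for_frame(schedule, 47))  # the cat jumps off the sofa
```

Because the routing is just a per-frame lookup over conditioning that the model computes anyway, it adds no generation overhead, which is what makes it plug-and-play with models like Wan2.2 and HunyuanVideo.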
Inspatio World - Takes a normal video and reconstructs a 4D world you can explore. Walk around in 3D, scrub time forward and back, no visible drift. Runs on consumer GPUs. [GitHub](https://github.com/inspatio/inspatio-world) | [Demo](https://world.inspatio.com/)
https://reddit.com/link/1slz1rq/video/wn2lgoqy2bvg1/player
C-MET (Cross-Modal Emotion Transfer) - Emotion editing for talking-face video via text, audio, or video prompts. CLIP-based alignment. Beats SadTalker and EDTalk. Project | GitHub
https://reddit.com/link/1slz1rq/video/q1f3ewi73bvg1/player
LTX 2.3 IC-LoRA Outpaint - By oumoumad. Extends LTX Video with outpainting that actually holds up. [Hugging Face](https://huggingface.co/oumoumad/LTX-2.3-22b-IC-LoRA-Outpaint)
ComfyUI-Image-Conveyor - By xmarre. Sequential drag-and-drop image queuing, processes one image per prompt run, supports manual reordering. GitHub
https://preview.redd.it/nl092r753bvg1.png?width=538&format=png&auto=webp&s=6e0ac1ca2ea6a2429fa1ab29fc7c2fdd071f94bf
Honorable Mentions:
Alibaba HappyHorse - New text- and image-to-video model, currently on top of the Artificial Analysis rankings. Still in beta (not available yet). [Benchmark](https://artificialanalysis.ai/text-to-video)
https://reddit.com/link/1slz1rq/video/q1xew5o13bvg1/player
Google FIT - 1.13M-triplet dataset for fit-aware virtual try-on with body measurements and 3D physics-based draping. Built on FLUX.1 + LoRA. Beats IDM-VTON on fit metrics. Project
https://preview.redd.it/ge0zqa0f3bvg1.png?width=1456&format=png&auto=webp&s=b1e56c273442c9ac42412a44a9494c96d2c136c2
Check out the full roundup for more demos, papers, and resources.
https://redd.it/1slz1rq
@rStableDiffusion
GitHub - H-EmbodVis/NUMINA: [CVPR 2026] When Numbers Speak: Aligning Textual Numerals and Visual Instances in Text-to-Video Diffusion Models
Complex & Weird Prompt Test: ERNIE Turbo | Flux.2 Klein 4B | Z-Image Turbo
https://redd.it/1sm32pz
@rStableDiffusion