Great news: the ERNIE editing model is expected to be released by the end of this month
https://redd.it/1sm05ml
@rStableDiffusion
Last week in Generative Image & Video

I curate a weekly multimodal AI roundup; here are the open-source image & video highlights from the past week:

Numina - Finally makes AI video generators count objects correctly. Ask for three cats, get three cats. Reads attention during generation, catches counting errors, corrects without retraining. [GitHub](https://github.com/H-EmbodVis/NUMINA) | [Project](https://h-embodvis.github.io/NUMINA/)

Prompt Relay - Training-free temporal control for multi-event video generation. Routes each prompt to a specific time segment with zero computational overhead. Plug-and-play with Wan2.2, CogVideo, HunyuanVideo. Project

Inspatio World - Takes a normal video and reconstructs a 4D world you can explore. Walk around in 3D, scrub time forward and back, no visible drift. Runs on consumer GPUs. [GitHub](https://github.com/inspatio/inspatio-world) | [Demo](https://world.inspatio.com/)

C-MET (Cross-Modal Emotion Transfer) - Emotion editing for talking-face video via text, audio, or video prompts. CLIP-based alignment. Beats SadTalker and EDTalk. Project | GitHub

LTX 2.3 IC-LoRA Outpaint - By oumoumad. Extends LTX Video with outpainting that actually holds up. [Hugging Face](https://huggingface.co/oumoumad/LTX-2.3-22b-IC-LoRA-Outpaint)

ComfyUI-Image-Conveyor - By xmarre. Sequential drag-and-drop image queuing, processes one image per prompt run, supports manual reordering. GitHub

Honorable Mentions:

Alibaba HappyHorse - New text- and image-to-video model, currently on top of the Artificial Analysis rankings. Still in beta (not available yet). [Benchmark](https://artificialanalysis.ai/text-to-video)

Google FIT - 1.13M-triplet dataset for fit-aware virtual try-on with body measurements and 3D physics-based draping. Built on FLUX.1 + LoRA. Beats IDM-VTON on fit metrics. Project

Check out the full roundup for more demos, papers, and resources.

https://redd.it/1slz1rq
@rStableDiffusion
I built a real-time telemetry dashboard for LTX 2.3 and discovered that "clean" math kills cinematic motion

Test 1 and Test 2 videos (scheduler comparison).

Been doing controlled scheduler experiments and the results broke my assumptions completely.

Same prompt. Same seed. Same settings. Only the scheduler curve changed.

The scheduler graph is the top-left blue plot. The noisy video comes from the debug sampler's VAE preview.

Test 1 — steady decay curve (the "correct" math):

The video drifted. The model had too much time wandering in low-frequency noise. Character features warped. Background slowly lost coherence. The clean curve was the problem.

Test 2 — deliberate spike injected at the transition phase:

The spike forced the model to align with the prompt's kinetic requirements. The sob physics and flame flicker hit with near-perfect accuracy. "Shocking" the latent space prevented the drift entirely and locked the character into the high-velocity motion path.

The takeaway: a stable sigma curve in LTX 2.3 can be a recipe for identity loss. The model needs pressure at the right moment, not a smooth ride.
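
If you want to experiment with the same idea, here is a minimal sketch of the two curve shapes. This is not the author's actual schedule: the step count, sigma range, spike position, and spike scale are all illustrative assumptions, and the resulting sigmas would go into whatever custom-sigmas node or sampler hook your workflow exposes.

```python
import numpy as np

def sigma_curves(steps=30, sigma_max=1.0, sigma_min=0.002,
                 spike_step=12, spike_scale=1.6):
    """Illustrative only: a smooth log-linear decay (Test 1 analogue) and the
    same curve with one deliberate spike injected mid-trajectory (Test 2
    analogue). All numbers are assumptions, not the post's actual settings."""
    # "Clean" schedule: log-spaced sigmas from sigma_max down to sigma_min.
    smooth = np.exp(np.linspace(np.log(sigma_max), np.log(sigma_min), steps))
    # Spiked schedule: briefly raise sigma at the transition step so the
    # sampler re-adds noise and is forced to re-align with the prompt.
    spiked = smooth.copy()
    spiked[spike_step] = min(spiked[spike_step] * spike_scale, sigma_max)
    return smooth, spiked

smooth, spiked = sigma_curves()
```

The only structural difference is that single raised sigma value: it briefly re-injects noise mid-trajectory so the sampler has to re-commit to the prompt instead of coasting down the smooth curve.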

To actually see what was happening inside the sampler I built a debug dashboard that tracks sigma, SNR, velocity, cosine similarity, and high/mid/low frequency noise energy per step. That's what's shown in the image. Without it I would never have spotted the drift pattern.
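
The metrics themselves are cheap to compute if you can intercept the latent at each step. Here is a rough sketch of how such per-step telemetry could be derived; the SNR proxy, the frequency-band cutoffs, and the function shape are my assumptions, not the dashboard's actual implementation.

```python
import torch
import torch.nn.functional as F

def step_telemetry(latent_prev, latent_cur, sigma):
    """Per-step metrics in the spirit of the dashboard (sigma, SNR, velocity,
    cosine similarity, frequency-band noise energy). The exact formulas are
    illustrative assumptions."""
    snr = 1.0 / (float(sigma) ** 2 + 1e-8)               # simple SNR proxy from sigma
    velocity = (latent_cur - latent_prev).norm().item()  # magnitude of the latent update
    cosine = F.cosine_similarity(latent_prev.flatten(),
                                 latent_cur.flatten(), dim=0).item()

    # Split the latent's 2D spectrum into low/mid/high radial frequency bands.
    spec = torch.fft.fftshift(torch.fft.fft2(latent_cur.float()), dim=(-2, -1)).abs() ** 2
    h, w = spec.shape[-2:]
    yy, xx = torch.meshgrid(torch.arange(h), torch.arange(w), indexing="ij")
    radius = ((yy - h / 2) ** 2 + (xx - w / 2) ** 2).sqrt()
    radius = radius / radius.max()
    bands = {
        "low":  spec[..., radius < 0.25].sum().item(),
        "mid":  spec[..., (radius >= 0.25) & (radius < 0.6)].sum().item(),
        "high": spec[..., radius >= 0.6].sum().item(),
    }
    return {"sigma": float(sigma), "snr": snr, "velocity": velocity,
            "cosine": cosine, **bands}
```

Hooked into a per-step sampler callback (wherever your sampler exposes intermediate latents), this yields one dict per step that can be plotted live or dumped for later analysis.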

Full breakdown of the methodology and the dashboard build (still in development) here:

https://www.linkedin.com/pulse/developing-real-time-telemetry-dashboard-ltx-video-23-bezuidenhout-5laaf/

https://redd.it/1sm58vl
@rStableDiffusion