Great news: the ERNIE editing model is expected to be released by the end of this month
https://redd.it/1sm05ml
@rStableDiffusion
Last week in Generative Image & Video
I curate a weekly multimodal AI roundup; here are the open-source image & video highlights from the past week:
Numina - Finally makes AI video generators count objects correctly. Ask for three cats, get three cats. Reads attention during generation, catches counting errors, corrects without retraining. [GitHub](https://github.com/H-EmbodVis/NUMINA) | [Project](https://h-embodvis.github.io/NUMINA/)
https://reddit.com/link/1slz1rq/video/t623pxnc2bvg1/player
Prompt Relay - Training-free temporal control for multi-event video generation. Routes each prompt to a specific time segment with zero computational overhead. Plug-and-play with Wan2.2, CogVideo, HunyuanVideo. Project
https://preview.redd.it/j1mpwbgt3bvg1.jpg?width=1900&format=pjpg&auto=webp&s=905891a7d7397a6a9f83d74b9824f7d6aa7f8005
Inspatio World - Takes a normal video and reconstructs a 4D world you can explore. Walk around in 3D, scrub time forward and back, no visible drift. Runs on consumer GPUs. [GitHub](https://github.com/inspatio/inspatio-world) | [Demo](https://world.inspatio.com/)
https://reddit.com/link/1slz1rq/video/wn2lgoqy2bvg1/player
C-MET (Cross-Modal Emotion Transfer) - Emotion editing for talking-face video via text, audio, or video prompts. CLIP-based alignment. Beats SadTalker and EDTalk. Project | GitHub
https://reddit.com/link/1slz1rq/video/q1f3ewi73bvg1/player
LTX 2.3 IC-LoRA Outpaint - By oumoumad. Extends LTX Video with outpainting that actually holds up. [Hugging Face](https://huggingface.co/oumoumad/LTX-2.3-22b-IC-LoRA-Outpaint)
ComfyUI-Image-Conveyor - By xmarre. Sequential drag-and-drop image queuing, processes one image per prompt run, supports manual reordering. GitHub
https://preview.redd.it/nl092r753bvg1.png?width=538&format=png&auto=webp&s=6e0ac1ca2ea6a2429fa1ab29fc7c2fdd071f94bf
Honorable Mentions:
Alibaba HappyHorse - New text- and image-to-video model, currently on top of the Artificial Analysis rankings. Still in beta (not available yet). [Benchmark](https://artificialanalysis.ai/text-to-video)
https://reddit.com/link/1slz1rq/video/q1xew5o13bvg1/player
Google FIT - 1.13M-triplet dataset for fit-aware virtual try-on with body measurements and 3D physics-based draping. Built on FLUX.1 + LoRA. Beats IDM-VTON on fit metrics. Project
https://preview.redd.it/ge0zqa0f3bvg1.png?width=1456&format=png&auto=webp&s=b1e56c273442c9ac42412a44a9494c96d2c136c2
Check out the full roundup for more demos, papers, and resources.
https://redd.it/1slz1rq
@rStableDiffusion
NUMINA paper: [CVPR 2026] When Numbers Speak: Aligning Textual Numerals and Visual Instances in Text-to-Video Diffusion Models
Complex & Weird Prompt Test: ERNIE Turbo | Flux.2 Klein 4B | Z-Image Turbo
https://redd.it/1sm32pz
@rStableDiffusion
I built a real-time telemetry dashboard for LTX 2.3 and discovered that "clean" math kills cinematic motion
Test1
Test2
Been doing controlled scheduler experiments and the results broke my assumptions completely.
Same prompt. Same seed. Same settings. Only the scheduler curve changed.
The scheduler graph is the top-left blue graph. The noisy video comes from the debug sampler's VAE preview.
Test 1 — steady decay curve (the "correct" math):
The video drifted. The model had too much time wandering in low-frequency noise. Character features warped. Background slowly lost coherence. The clean curve was the problem.
Test 2 — deliberate spike injected at the transition phase:
The spike forced the model to align with the prompt's kinetic requirements. The sob physics and flame flicker hit with near-perfect accuracy. "Shocking" the latent space prevented the drift entirely and locked the character into the high-velocity motion path.
The takeaway: a stable sigma curve in LTX 2.3 can be a recipe for identity loss. The model needs pressure at the right moment, not a smooth ride.
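The two tested schedules can be sketched as follows. This is an illustrative reconstruction, not the author's actual code: the exact curve shape, spike position, and boost factor are assumptions, but it shows the idea of starting from a smooth decay and deliberately raising one sigma at the transition phase.

```python
import numpy as np

def sigma_schedule(steps: int, sigma_max: float = 1.0, sigma_min: float = 0.01) -> np.ndarray:
    # Smooth exponential decay -- the "clean" curve from Test 1.
    return np.geomspace(sigma_max, sigma_min, steps)

def inject_spike(sigmas: np.ndarray, at_step: int, boost: float = 1.5) -> np.ndarray:
    # Deliberately raise one sigma mid-schedule -- the "shock" from Test 2.
    # `at_step` and `boost` are hypothetical knobs you would tune per prompt.
    spiked = sigmas.copy()
    spiked[at_step] *= boost
    return spiked

sigmas = sigma_schedule(20)
spiked = inject_spike(sigmas, at_step=10)
```

The spiked schedule is identical everywhere except the one elevated step, which briefly pushes the sampler back toward higher-noise territory before the decay resumes.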
To actually see what was happening inside the sampler I built a debug dashboard that tracks sigma, SNR, velocity, cosine similarity, and high/mid/low frequency noise energy per step. That's what's shown in the image. Without it I would never have spotted the drift pattern.
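The per-step metrics listed above can be approximated like this. A minimal sketch only: the function name, the SNR convention (1/sigma² here), and the three-way radial split of the power spectrum are my assumptions, not the dashboard's actual definitions.

```python
import numpy as np

def step_telemetry(latent_prev: np.ndarray, latent_cur: np.ndarray, sigma: float) -> dict:
    # Velocity: how far the latent moved this step.
    a, b = latent_prev.ravel(), latent_cur.ravel()
    velocity = float(np.linalg.norm(b - a))
    # Cosine similarity between successive latents (drift indicator).
    cos_sim = float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-8))
    # One common SNR convention: inverse noise variance.
    snr = 1.0 / (sigma ** 2 + 1e-8)
    # Split the power spectrum into low/mid/high radial-frequency bands.
    spec = np.abs(np.fft.fftn(latent_cur)) ** 2
    freqs = np.meshgrid(*[np.fft.fftfreq(n) for n in latent_cur.shape], indexing="ij")
    r = np.sqrt(sum(f ** 2 for f in freqs))
    cut = r.max() / 3
    low = float(spec[r < cut].sum())
    mid = float(spec[(r >= cut) & (r < 2 * cut)].sum())
    high = float(spec[r >= 2 * cut].sum())
    return {"velocity": velocity, "cos_sim": cos_sim, "snr": snr,
            "low": low, "mid": mid, "high": high}

rng = np.random.default_rng(0)
prev = rng.standard_normal((16, 16))
cur = prev + 0.1 * rng.standard_normal((16, 16))
m = step_telemetry(prev, cur, sigma=0.5)
```

Logging a dict like this at every sampler step is enough to plot the drift pattern the post describes: a slowly falling `cos_sim` with energy pooling in the low band.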
Full breakdown of the methodology and the developing dashboard build here:
https://www.linkedin.com/pulse/developing-real-time-telemetry-dashboard-ltx-video-23-bezuidenhout-5laaf/
https://redd.it/1sm58vl
@rStableDiffusion
Dear mods, please care about this place. What currently happens is bullshit.
Today a lot of folks posted the GPUs they've bought. What the fuck has this to do with the essential core and initial purpose of this sub? Moderate it, for fuck's sake, that's your job. Otherwise please find mods who actually care and don't have a dozen subs to their names, I'm looking at you /u/dbzer0.
But what I find really disturbing: do you want to show off that you can buy something expensive in today's economy? Honestly, great for you! But what the fuck has this to do with this sub? Go to /r/pcmasterrace.
This space was so much more about open source models, sharing and workflow optimizations with each other. Please get back to that since your rules state this too.
Nowadays it's a gallery of images without workflows and now GPU flexing? Why is it like this today?
We got a lot of models; the past weeks were filled with new open-source releases all around the open space, but that information doesn't reach the top anymore. Nothing gets filtered anymore. People can post any sub-par image generations.
Do your job mods, please.
https://redd.it/1smj348
@rStableDiffusion
Comparison of low Steps, Klein 9b x Z image turbo x Ernie Turbo x Qwen 2512 8 Steps
https://redd.it/1sme1k0
@rStableDiffusion