Great news: the ERNIE editing model is expected to be released by the end of this month
https://redd.it/1sm05ml
@rStableDiffusion
Last week in Generative Image & Video
I curate a weekly multimodal AI roundup; here are the open-source image & video highlights from the past week:
Numina - Finally makes AI video generators count objects correctly. Ask for three cats, get three cats. Reads attention during generation, catches counting errors, corrects without retraining. [GitHub](https://github.com/H-EmbodVis/NUMINA) | [Project](https://h-embodvis.github.io/NUMINA/)
https://reddit.com/link/1slz1rq/video/t623pxnc2bvg1/player
Prompt Relay - Training-free temporal control for multi-event video generation. Routes each prompt to a specific time segment with zero computational overhead. Plug-and-play with Wan2.2, CogVideo, HunyuanVideo. Project
https://preview.redd.it/j1mpwbgt3bvg1.jpg?width=1900&format=pjpg&auto=webp&s=905891a7d7397a6a9f83d74b9824f7d6aa7f8005
Inspatio World - Takes a normal video and reconstructs a 4D world you can explore. Walk around in 3D, scrub time forward and back, no visible drift. Runs on consumer GPUs. [GitHub](https://github.com/inspatio/inspatio-world) | [Demo](https://world.inspatio.com/)
https://reddit.com/link/1slz1rq/video/wn2lgoqy2bvg1/player
C-MET (Cross-Modal Emotion Transfer) - Emotion editing for talking-face video via text, audio, or video prompts. CLIP-based alignment. Beats SadTalker and EDTalk. Project | GitHub
https://reddit.com/link/1slz1rq/video/q1f3ewi73bvg1/player
LTX 2.3 IC-LoRA Outpaint - By oumoumad. Extends LTX Video with outpainting that actually holds up. [Hugging Face](https://huggingface.co/oumoumad/LTX-2.3-22b-IC-LoRA-Outpaint)
ComfyUI-Image-Conveyor - By xmarre. Sequential drag-and-drop image queuing, processes one image per prompt run, supports manual reordering. GitHub
https://preview.redd.it/nl092r753bvg1.png?width=538&format=png&auto=webp&s=6e0ac1ca2ea6a2429fa1ab29fc7c2fdd071f94bf
Honorable Mentions:
Alibaba HappyHorse - New text- and image-to-video model, currently on top of the Artificial Analysis rankings. Still in beta (not available yet). [Benchmark](https://artificialanalysis.ai/text-to-video)
https://reddit.com/link/1slz1rq/video/q1xew5o13bvg1/player
Google FIT - 1.13M-triplet dataset for fit-aware virtual try-on with body measurements and 3D physics-based draping. Built on FLUX.1 + LoRA. Beats IDM-VTON on fit metrics. Project
https://preview.redd.it/ge0zqa0f3bvg1.png?width=1456&format=png&auto=webp&s=b1e56c273442c9ac42412a44a9494c96d2c136c2
Check out the full roundup for more demos, papers, and resources.
https://redd.it/1slz1rq
@rStableDiffusion
Complex & Weird Prompt Test: ERNIE Turbo | Flux.2 Klein 4B | Z-Image Turbo
https://redd.it/1sm32pz
@rStableDiffusion
I built a real-time telemetry dashboard for LTX 2.3 and discovered that "clean" math kills cinematic motion
Test 1
Test 2
Been doing controlled scheduler experiments and the results broke my assumptions completely.
Same prompt. Same seed. Same settings. Only the scheduler curve changed.
The scheduler graph is the top-left blue graph. The noisy video is from the debug sampler's VAE preview.
Test 1 — steady decay curve (the "correct" math):
The video drifted. The model had too much time wandering in low-frequency noise. Character features warped. Background slowly lost coherence. The clean curve was the problem.
Test 2 — deliberate spike injected at the transition phase:
The spike forced the model to align with the prompt's kinetic requirements. The sob physics and flame flicker hit with near-perfect accuracy. "Shocking" the latent space prevented the drift entirely and locked the character into the high-velocity motion path.
The takeaway: a stable sigma curve in LTX 2.3 can be a recipe for identity loss. The model needs pressure at the right moment, not a smooth ride.
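Here's a minimal sketch of the idea if you want to experiment. It is not the exact curve from my tests; the spike position and strength are placeholder values, and the output goes to any sampler that accepts explicit sigmas (e.g. a ComfyUI SamplerCustom SIGMAS input):

```python
import torch

def make_decay_sigmas(sigma_max=1.0, sigma_min=0.002, steps=30):
    """Plain monotonic log-linear decay (the 'clean' Test 1 style curve),
    with the trailing 0.0 most custom-sigma samplers expect."""
    t = torch.linspace(0, 1, steps)
    sig = sigma_max * (sigma_min / sigma_max) ** t
    return torch.cat([sig, sig.new_zeros(1)])

def inject_spike(sigmas, at_frac=0.5, boost=1.6):
    """Test 2 style perturbation: briefly raise sigma around the transition
    phase so the sampler has to re-commit to the prompt's motion.
    at_frac and boost are illustrative values, not tuned numbers."""
    sigmas = sigmas.clone()
    i = max(1, int(at_frac * (len(sigmas) - 2)))  # stay away from the final 0.0
    spiked = sigmas[i] * boost
    # keep the schedule valid: never exceed the previous (larger) sigma
    sigmas[i] = torch.minimum(spiked, sigmas[i - 1] * 0.999)
    return sigmas

sigmas = inject_spike(make_decay_sigmas(steps=30))  # pass as custom sigmas
```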
To actually see what was happening inside the sampler I built a debug dashboard that tracks sigma, SNR, velocity, cosine similarity, and high/mid/low frequency noise energy per step. That's what's shown in the image. Without it I would never have spotted the drift pattern.
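The dashboard itself is a custom build, but the per-step numbers are straightforward to approximate. A rough sketch of what such per-step logging can look like (not the actual dashboard code; the SNR proxy, the velocity definition, and the frequency-band cut-offs are arbitrary choices):

```python
import torch
import torch.nn.functional as F

def step_telemetry(latent, prev_latent, sigma):
    """Per-step metrics: sigma, a rough SNR proxy, latent 'velocity',
    cosine similarity to the previous latent (drift proxy), and the share
    of spectral energy in low/mid/high spatial-frequency bands."""
    stats = {"sigma": float(sigma)}
    stats["snr"] = float(latent.std() / max(float(sigma), 1e-8))  # rough proxy
    if prev_latent is not None:
        stats["velocity"] = float((latent - prev_latent).norm() / latent.numel() ** 0.5)
        stats["cos_sim"] = float(F.cosine_similarity(
            latent.flatten(), prev_latent.flatten(), dim=0))
    # radial FFT energy split over the spatial dims of a (C, H, W) latent
    spec = torch.fft.fftshift(torch.fft.fft2(latent.float()), dim=(-2, -1)).abs() ** 2
    h, w = spec.shape[-2:]
    yy, xx = torch.meshgrid(torch.arange(h), torch.arange(w), indexing="ij")
    r = ((yy - h / 2) ** 2 + (xx - w / 2) ** 2).sqrt()
    r = r / r.max()
    total = spec.sum()
    stats["low_freq"] = float(spec[..., r < 0.15].sum() / total)   # cut-offs are arbitrary
    stats["mid_freq"] = float(spec[..., (r >= 0.15) & (r < 0.5)].sum() / total)
    stats["high_freq"] = float(spec[..., r >= 0.5].sum() / total)
    return stats
```

Call it once per sampler step with the current latent and sigma, and plot the resulting series.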
Full breakdown of the methodology and the dashboard build (still in development) here:
https://www.linkedin.com/pulse/developing-real-time-telemetry-dashboard-ltx-video-23-bezuidenhout-5laaf/
https://redd.it/1sm58vl
@rStableDiffusion
Dear mods, please care about this place. What currently happens is bullshit.
Today a lot of folks posted the GPUs they've bought. What the fuck does this have to do with the essential core and original purpose of this sub? Moderate it, for fuck's sake, that's your job. Otherwise please find mods that actually care and don't have a dozen subs to their names, I'm looking at you /u/dbzer0.
But what I find really disturbing: you want to show off that you can buy something expensive in today's economy? Honestly, great for you! But what the fuck does this have to do with this sub? Go to /r/pcmasterrace.
This space used to be so much more about open-source models and sharing workflow optimizations with each other. Please get back to that, since your rules state this too.
Nowadays it's a gallery of images without workflows, and now GPU flexing? Why is it like this?
We got a lot of models; the past weeks were filled with new open-source releases all around the open space, but that information doesn't get to the top anymore. Nothing gets filtered anymore. People can post any sub-par image generation.
Do your job, mods, please.
https://redd.it/1smj348
@rStableDiffusion
Comparison of low Steps, Klein 9b x Z image turbo x Ernie Turbo x Qwen 2512 8 Steps
https://redd.it/1sme1k0
@rStableDiffusion
Tencent HY-World-2.0 is now public
HY-World 2.0: A Multi-Modal World Model for Reconstructing, Generating, and Simulating 3D Worlds
https://huggingface.co/tencent/HY-World-2.0
https://github.com/Tencent-Hunyuan/HY-World-2.0
https://preview.redd.it/x2nhoprmtfvg1.png?width=1920&format=png&auto=webp&s=e480c8bc65589154130efeaadfca70bb74d46b0e
https://3d-models.hunyuan.tencent.com/world/
https://3d-models.hunyuan.tencent.com/world/world2_0/HY_World_2_0.pdf
https://redd.it/1smmer5
@rStableDiffusion
WAI-ANIMA 1.0 released
https://civitai.red/models/2544636?modelVersionId=2859702
https://redd.it/1smnwjl
@rStableDiffusion
I tested Ernie Image Turbo (fp8, nvfp4, fp16 and INT8) with Nano Banana Pro 2 Prompts so you won't have to
https://redd.it/1smo359
@rStableDiffusion
Motif-Video-2B
https://huggingface.co/Motif-Technologies/Motif-Video-2B
https://motiftech.io/videoshowcase
Training strong video generation models usually requires massive datasets, large parameter counts, and substantial compute. Motif-Video 2B asks whether competitive text-to-video quality is reachable at a much smaller budget — fewer than 10M training clips and under 100,000 H200 GPU hours — and shows that the answer is yes, provided the model design explicitly separates objectives that scaling would otherwise leave entangled.
Our central observation is that prompt alignment, temporal consistency, and fine-detail recovery interfere with one another when handled through the same pathway. Motif-Video 2B addresses this objective interference architecturally rather than relying on scale alone, through two contributions:
Shared Cross-Attention. A residual cross-attention mechanism that reuses self-attention K/V weights to stabilize text–video alignment under long-context token sparsity, where standard joint attention dilutes text influence as the video token sequence grows.
Three-stage DDT-style backbone. 12 dual-stream + 16 single-stream + 8 DDT decoder layers, separating early modality fusion, joint representation learning, and high-frequency detail reconstruction into dedicated components. Per-block attention analysis shows that the DDT decoder spontaneously develops inter-frame attention structure absent from the encoder layers.
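As a rough illustration of the shared cross-attention idea above (a conceptual sketch only, not the actual Motif-Video 2B layer; the head layout, gating, and normalization are assumptions): the text branch reuses the self-attention K/V projection weights and contributes through a gated residual.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SharedCrossAttention(nn.Module):
    """Conceptual sketch: video self-attention plus a text cross-attention
    branch that reuses the SAME k/v projection weights, added as a gated
    residual. Details in the real model may differ."""
    def __init__(self, dim, num_heads=8):
        super().__init__()
        self.num_heads = num_heads
        self.q = nn.Linear(dim, dim)
        self.k = nn.Linear(dim, dim)   # shared by the video and text branches
        self.v = nn.Linear(dim, dim)   # shared as well
        self.out = nn.Linear(dim, dim)
        self.gate = nn.Parameter(torch.zeros(1))  # residual starts at zero

    def _heads(self, x):
        b, n, d = x.shape
        return x.view(b, n, self.num_heads, d // self.num_heads).transpose(1, 2)

    def forward(self, video_tokens, text_tokens):
        q = self._heads(self.q(video_tokens))
        # plain self-attention over the video tokens
        self_out = F.scaled_dot_product_attention(
            q, self._heads(self.k(video_tokens)), self._heads(self.v(video_tokens)))
        # cross-attention to text tokens through the same k/v projections
        cross_out = F.scaled_dot_product_attention(
            q, self._heads(self.k(text_tokens)), self._heads(self.v(text_tokens)))
        merged = self_out + self.gate * cross_out   # residual text contribution
        b, h, n, hd = merged.shape
        return self.out(merged.transpose(1, 2).reshape(b, n, h * hd))
```

The zero-initialized gate is one common way to add such a residual pathway without destabilizing the base attention early in training.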
"Training strong video generation models usually requires massive datasets, large parameter counts, and substantial compute. Motif-Video 2B asks whether competitive text-to-video quality is reachable at a much smaller budget — fewer than 10M training clips and under 100,000 H200 GPU hours — and shows that the answer is yes, provided the model design explicitly separates objectives that scaling would otherwise leave entangled.
Our central observation is that prompt alignment, temporal consistency, and fine-detail recovery interfere with one another when handled through the same pathway. Motif-Video 2B addresses this objective interference architecturally rather than relying on scale alone, through two contributions:
Shared Cross-Attention. A residual cross-attention mechanism that reuses self-attention K/V weights to stabilize text–video alignment under long-context token sparsity, where standard joint attention dilutes text influence as the video token sequence grows.
Three-stage DDT-style backbone. 12 dual-stream + 16 single-stream + 8 DDT decoder layers, separating early modality fusion, joint representation learning, and high-frequency detail reconstruction into dedicated components. Per-block attention analysis shows that the DDT decoder spontaneously develops inter-frame attention structure absent from the encoder layers."
https://redd.it/1smonvh
@rStableDiffusion