Great news: the ERNIE editing model is expected to be released by the end of this month
https://redd.it/1sm05ml
@rStableDiffusion
Last week in Generative Image & Video
I curate a weekly multimodal AI roundup; here are the open-source image & video highlights from the past week:
Numina - Finally makes AI video generators count objects correctly. Ask for three cats, get three cats. Reads attention during generation, catches counting errors, corrects without retraining. [GitHub](https://github.com/H-EmbodVis/NUMINA) | [Project](https://h-embodvis.github.io/NUMINA/)
https://reddit.com/link/1slz1rq/video/t623pxnc2bvg1/player
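As a rough illustration of the kind of check NUMINA performs (not the authors' code; every name here is hypothetical), the core idea is comparing the numeral requested in the prompt against the number of object instances actually generated:

```python
# Hypothetical sketch: parse the requested count from a prompt and
# compare it to the number of instances detected in the output.
# A negative result means the generator produced too few objects.

NUMBER_WORDS = {"one": 1, "two": 2, "three": 3, "four": 4, "five": 5}

def requested_count(prompt, noun):
    """Return the numeral preceding `noun` in the prompt, if any."""
    for word, value in NUMBER_WORDS.items():
        if f"{word} {noun}" in prompt:
            return value
    return None

def counting_error(prompt, noun, detected_instances):
    """Difference between detected and requested instance counts."""
    want = requested_count(prompt, noun)
    return None if want is None else detected_instances - want

print(counting_error("three cats playing in a garden", "cats", 2))  # -1
```

NUMINA does this inside the diffusion process by reading cross-attention maps rather than running an external detector, but the contract is the same: detect the mismatch, then correct it without retraining.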
Prompt Relay - Training-free temporal control for multi-event video generation. Routes each prompt to a specific time segment with zero computational overhead. Plug-and-play with Wan2.2, CogVideo, HunyuanVideo. Project
https://preview.redd.it/j1mpwbgt3bvg1.jpg?width=1900&format=pjpg&auto=webp&s=905891a7d7397a6a9f83d74b9824f7d6aa7f8005
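To make the "routes each prompt to a specific time segment" idea concrete, here is a minimal sketch (my own illustration, not the Prompt Relay code; all function names are hypothetical): split the frame range across the prompts, then look up which prompt conditions each frame.

```python
# Hypothetical sketch of temporal prompt routing for multi-event video:
# each prompt owns a contiguous segment of frames, and the conditioning
# used at a given frame is the prompt whose segment contains it.

def build_schedule(prompts, total_frames):
    """Split the frame range evenly across the prompts."""
    seg = total_frames / len(prompts)
    return [(round(i * seg), round((i + 1) * seg), p)
            for i, p in enumerate(prompts)]

def prompt_for_frame(schedule, frame):
    """Return the prompt routed to this frame's segment."""
    for start, end, prompt in schedule:
        if start <= frame < end:
            return prompt
    return schedule[-1][2]  # clamp frames past the end to the last event

schedule = build_schedule(
    ["a cat sleeping", "the cat wakes up", "the cat jumps off the sofa"],
    total_frames=48,
)
print(prompt_for_frame(schedule, 0))   # a cat sleeping
print(prompt_for_frame(schedule, 47))  # the cat jumps off the sofa
```

Because the routing is just a per-frame lookup over conditioning that the model computes anyway, it adds no generation overhead, which is what makes it plug-and-play with models like Wan2.2 and HunyuanVideo.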
Inspatio World - Takes a normal video and reconstructs a 4D world you can explore. Walk around in 3D, scrub time forward and back, no visible drift. Runs on consumer GPUs. [GitHub](https://github.com/inspatio/inspatio-world) | [Demo](https://world.inspatio.com/)
https://reddit.com/link/1slz1rq/video/wn2lgoqy2bvg1/player
C-MET (Cross-Modal Emotion Transfer) - Emotion editing for talking-face video via text, audio, or video prompts. CLIP-based alignment. Beats SadTalker and EDTalk. Project | GitHub
https://reddit.com/link/1slz1rq/video/q1f3ewi73bvg1/player
LTX 2.3 IC-LoRA Outpaint - By oumoumad. Extends LTX Video with outpainting that actually holds up. [Hugging Face](https://huggingface.co/oumoumad/LTX-2.3-22b-IC-LoRA-Outpaint)
ComfyUI-Image-Conveyor - By xmarre. Sequential drag-and-drop image queuing, processes one image per prompt run, supports manual reordering. GitHub
https://preview.redd.it/nl092r753bvg1.png?width=538&format=png&auto=webp&s=6e0ac1ca2ea6a2429fa1ab29fc7c2fdd071f94bf
Honorable Mentions:
Alibaba HappyHorse - New text- and image-to-video model, currently on top of the Artificial Analysis rankings. Still in beta (not available yet). [Benchmark](https://artificialanalysis.ai/text-to-video)
https://reddit.com/link/1slz1rq/video/q1xew5o13bvg1/player
Google FIT - 1.13M-triplet dataset for fit-aware virtual try-on with body measurements and 3D physics-based draping. Built on FLUX.1 + LoRA. Beats IDM-VTON on fit metrics. Project
https://preview.redd.it/ge0zqa0f3bvg1.png?width=1456&format=png&auto=webp&s=b1e56c273442c9ac42412a44a9494c96d2c136c2
Check out the full roundup for more demos, papers, and resources.
https://redd.it/1slz1rq
@rStableDiffusion
GitHub - H-EmbodVis/NUMINA: [CVPR 2026] When Numbers Speak: Aligning Textual Numerals and Visual Instances in Text-to-Video Diffusion Models
Complex & Weird Prompt Test: ERNIE Turbo | Flux.2 Klein 4B | Z-Image Turbo
https://redd.it/1sm32pz
@rStableDiffusion