LTX 2.3 adding unwanted subtitles in generated videos even when not mentioned in prompt
https://redd.it/1tbrsf7
@rStableDiffusion
Scenema Audio: Zero-shot expressive voice cloning and speech generation
https://redd.it/1tbzgi3
@rStableDiffusion
ComfyUI Pixaroma Nodes: New Load Image, Notify & Utility Nodes (Ep17)
https://www.youtube.com/watch?v=dXH7Qx9pzyc
https://redd.it/1tc2fuz
@rStableDiffusion
In this episode, I’ll show you the latest updates in the Pixaroma node pack for ComfyUI and Easy Install. We’ll look at the new Pixaroma Load Image node, new Copy and Open buttons, filename outputs, date-based save folders, smarter image resizing, width and…
LTX 2.3 video generation notes after testing H100, RTX 5090, A100, L40, FP8, BF16, and CPU offload
This community helped me a lot in my last post so here's my contribution back. If you're looking to generate LTX 2.3 videos, these notes might save you a few hundred dollars on wasted cloud rentals.
H100:
\- 5s distilled FP8, 704x1280, 121f: 48s
\- 5s distilled no-quant, 704x1280, 121f: 45s
\- 5s HQ/no-quant, 704x1280, 121f, 20 steps: 121s
\- 20s HQ/no-quant, 704x1280, 481f, 20 steps: 321s
\- 20s HQ/no-quant, 704x1280, 481f, 28 steps: 380-390s
RTX 5090:
\- 5s distilled FP8, 704x1280, 121f: 43s
\- 5s HQ FP8, 704x1280, 121f, 20 steps: 151s
\- 20s distilled FP8, 704x1280, 481f: failed/OOM after 55s
\- 20s distilled FP8, 576x1024, 481f: 104s
\- 20s distilled, no quantization, CPU offload, 704x1280, 481f: 299s
A100:
\- 5s image-conditioned, 704x1280: 401-425s
\- 20s HQ/no-quant, 704x1280, 481f, 20 steps, serverless render step: 608s
\- 20s HQ/no-quant, 704x1280, 481f, 20 steps, serverless remote total: 713s
\- 20s HQ/no-quant, 704x1280, 481f, 20 steps, serverless local wall time: 797s
L40:
(I left a note about this in the lessons paragraph below.)
\- 5s distilled, no quantization, CPU offload, 704x1280, 121f: 1199s
\- 5s distilled FP8, 704x1280, 121f: 197s
\- 20s distilled FP8, 704x1280, 481f, max batch 4: failed/OOM after 189s
\- 20s distilled FP8 low-memory, 704x1280, 481f, max batch 1: 365s
\- 20s distilled FP8 low-memory, 704x1280, 481f, repeated runs: 433-453s
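To compare these runs on equal footing, seconds per generated frame is a useful normalization (121 frames in 5s and 481 frames in 20s works out to roughly 24 fps). A quick sketch using a few of the timings from the tables above:

```python
# Normalize a few of the timings above to seconds per generated frame.
# Tuples are (wall-clock seconds, frame count), copied from the tables.
runs = {
    "H100 5s distilled FP8":          (48, 121),
    "H100 20s HQ 20 steps":           (321, 481),
    "RTX 5090 5s distilled FP8":      (43, 121),
    "RTX 5090 20s distilled+offload": (299, 481),
    "L40 20s distilled FP8 low-mem":  (365, 481),
}

for name, (seconds, frames) in runs.items():
    print(f"{name}: {seconds / frames:.2f} s/frame")
```

This makes it easy to see, for example, that the 20s HQ runs cost substantially more per frame than the 5s distilled runs, not just more in total.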
Some lessons:
\- For some reason, A100 output was worse than H100 output for the exact same setup. I generated around 20 videos on each GPU from the same cloud host, and the A100 results were consistently less realistic than the H100 ones.
\- I did not like the 5090 results on distilled + FP8. Distilled with CPU RAM offloading is better.
\- The L40 instance I rented could generate 20s 704x1280 clips, but only with the lower-memory FP8 setup for some reason. I suspect the rental device was not in the best state.
\- For spoken words, aim for roughly 45-52 words per 20 seconds.
\- Avoid ending with important words. The model sometimes cuts off the final syllable. A short final sentence helps.
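The pacing rule above is easy to check mechanically. A minimal sketch (the 45-52 words per 20s range comes from my notes above; the helper name is my own):

```python
def check_speech_pacing(script: str, clip_seconds: float) -> str:
    """Flag scripts outside the ~45-52 words per 20 s guideline."""
    words = len(script.split())
    lo = 45 / 20 * clip_seconds   # lower word budget for this clip length
    hi = 52 / 20 * clip_seconds   # upper word budget
    if words < lo:
        return f"{words} words: too sparse, consider padding the script"
    if words > hi:
        return f"{words} words: too dense, expect clipped or rushed audio"
    return f"{words} words: within the {lo:.0f}-{hi:.0f} word budget"

print(check_speech_pacing("word " * 48, 20))  # within budget for a 20s clip
```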
I am still exploring this so feel free to let me know if there's anything additional I can do. Happy to contribute to the community if you're looking for any generated samples or examples.
https://redd.it/1tc5s73
@rStableDiffusion
DramaBox - Most Expressive Voice model ever based on LTX 2.3
https://redd.it/1tc6i8w
@rStableDiffusion
SenseNova-U1 Technical Report: VAE-free Pixel-level Flow Matching with 32x Compression
https://redd.it/1tc2anx
@rStableDiffusion
PyTorch 2.12.0+cu132 (CUDA 13.2) — SA2/SA3 Attention Stability Benchmarks
With the release of PyTorch 2.12.0+cu132, I ran a full benchmark suite to verify that SA2 and SA3 attention backends are stable and working correctly in the new environment.
Tests were conducted on the following models:
* **flux1-krea-dev\_fp8\_scaled** — 20 steps, CFG 1, 1024×1024
* **flux-2-klein-base-9b-fp8** — 20 steps, CFG 5, 1280×1280
* **wan2.2\_t2v\_high/low\_noise\_14B\_fp16 + lightx2v\_4steps\_lora** — 2+2 steps, CFG 1, 640×640
All backends (fp8\_cuda, fp8pp\_cuda, triton, SA3 standard, SA3 per\_block\_mean) are confirmed stable. Results in the charts below.
The Krea model shows the largest variation when switching among the SA2/SA3 modes, but quality is almost the same everywhere.
https://preview.redd.it/8v3quwkfyy0h1.png?width=3840&format=png&auto=webp&s=a38dcff0c402d1102425ababcf7e7ec7693eee09
https://preview.redd.it/b6lkjbfz0z0h1.jpg?width=6000&format=pjpg&auto=webp&s=d047b2fffe7ff4b444dc795f1d638ed8ce972678
The Klein model looks almost the same when switching from SA2 to SA3, but the plastic skin remains, which is down to the model itself. Speed is also almost identical across all operating modes.
https://preview.redd.it/0ve393uoyy0h1.png?width=3840&format=png&auto=webp&s=107733601b7f0fe184b94d12d4677904df5273a5
https://preview.redd.it/21bfjzyv0z0h1.jpg?width=6000&format=pjpg&auto=webp&s=c4774218bd8b91e04ad4d04c2c1f27708f7213f7
The WAN 2.2 model behaved almost identically except in the sa3=standard and sa3=per\_block\_mean modes, where the video lost a little quality and changed slightly. The triton+standard mode slowed down in a strange way.
https://preview.redd.it/p5dr6dv8zy0h1.png?width=3840&format=png&auto=webp&s=3600b2892299c8b84b7258dc9cb1608da5d64495
https://reddit.com/link/1tcd718/video/vzevp45kzy0h1/player
The main goal was achieved: everything works with the new PyTorch 2.12.0. I did not test other nodes for compatibility, but the ones I created work.
Download the latest SA2/SA3 (Windows): [https://github.com/Rogala/AI\_Attention](https://github.com/Rogala/AI_Attention)
The ComfyUI node used for testing: [https://github.com/Rogala/ComfyUI-rogala](https://github.com/Rogala/ComfyUI-rogala)
Original node discussion thread: [https://www.reddit.com/r/StableDiffusion/comments/1ta0ewm/smartattentiondispatcher\_comfyui\_node\_that/](https://www.reddit.com/r/StableDiffusion/comments/1ta0ewm/smartattentiondispatcher_comfyui_node_that/)
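For readers unfamiliar with how a dispatcher node picks among backends like these, the core pattern can be sketched in plain Python (mode names mirror the benchmarks above; this is a toy registry with a fallback, not the actual code from the linked repos):

```python
# Toy attention-backend dispatcher: registered modes are looked up by
# name, and unknown/unavailable modes fall back to a default backend.
BACKENDS = {}

def register(name):
    def wrap(fn):
        BACKENDS[name] = fn
        return fn
    return wrap

@register("sa3_standard")
def sa3_standard(q, k, v):
    return "sa3_standard", q  # placeholder for the real attention kernel

@register("sa3_per_block_mean")
def sa3_per_block_mean(q, k, v):
    return "sa3_per_block_mean", q

def dispatch(mode, q, k, v, fallback="sa3_standard"):
    fn = BACKENDS.get(mode) or BACKENDS[fallback]
    return fn(q, k, v)

print(dispatch("triton", None, None, None)[0])  # not registered here, falls back
```

The real node additionally has to probe whether a backend's dependencies (CUDA, Triton, FP8 support) are actually present before registering it.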
https://redd.it/1tcd718
@rStableDiffusion
Is it possible to FEEL real acting with Open Source AI Tools? ( A little experiment)
I spent two weeks working on this at my company for learning and research purposes, trying to see whether you can create compelling shots. In my opinion you can, and better than Seedance (emotion, not action). But you be the judge; I'll wait and see, and if anyone wants I'll share my workflow.
Spaghetti Shortfilm by Arturo Pola
https://redd.it/1tcem8c
@rStableDiffusion
ComfyUI Node: Unified Image + Mask Resize (LTX 2.3 ready, keeps BOTH sides divisible by 32, replaces Image Resize + Image Resize V2 + Mask mismatch issues)
https://redd.it/1tci23f
@rStableDiffusion
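The divisibility constraint in that title is easy to get wrong when image and mask are resized independently. A minimal sketch of snapping both sides to multiples of 32 (my own helper, not the node's actual code; the key point is applying the same target size to image and mask):

```python
def snap_to_multiple(width: int, height: int, multiple: int = 32):
    """Round both sides down to the nearest multiple (minimum one block),
    so image and mask stay aligned for models that require /32 dimensions."""
    snap = lambda x: max(multiple, (x // multiple) * multiple)
    return snap(width), snap(height)

# Compute ONE target and apply it to both image and mask to avoid mismatches.
print(snap_to_multiple(704, 1280))  # already aligned: (704, 1280)
print(snap_to_multiple(1000, 562))  # snapped down:    (992, 544)
```

Rounding down slightly alters aspect ratio; a fuller implementation would crop or pad to compensate.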
Last week in Generative Image & Video
I curate a weekly multimodal AI roundup, here are the open-source image & video highlights from the last week:
\- CausalCine — Interactive autoregressive framework for multi-shot video narratives. Content-Aware Memory Routing retrieves historical KV entries by attention relevance instead of temporal proximity, solving motion stagnation and semantic drift in long-rollout generation. Distilled to a few-step generator for real-time use.
https://reddit.com/link/1tcnpxj/video/tbryyz3s611h1/player
[Paper](http://arxiv.org/abs/2605.12496v1) | [GitHub](https://github.com/yihao-meng/CausalCine)
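The "relevance instead of temporal proximity" idea in that summary can be illustrated with a toy retrieval step (a pure-Python sketch under my own simplifications; the paper's actual routing is learned and attention-based):

```python
def retrieve_memory(query, memory, k=2):
    """Pick the top-k memory entries by dot-product relevance to the query,
    rather than simply taking the k most recent entries."""
    def dot(a, b):
        return sum(x * y for x, y in zip(a, b))
    scored = sorted(memory, key=lambda e: dot(query, e["key"]), reverse=True)
    return [e["shot"] for e in scored[:k]]

memory = [
    {"shot": "shot_01", "key": [1.0, 0.0]},  # old shot, highly relevant
    {"shot": "shot_07", "key": [0.1, 0.9]},
    {"shot": "shot_08", "key": [0.2, 0.8]},  # recent shot, less relevant
]
# Relevance routing surfaces the old-but-relevant shot_01 first, which a
# purely recency-based window would have evicted.
print(retrieve_memory([1.0, 0.1], memory))
```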
\- SwiftI2V — Efficient 2K image-to-video generation. Low-res motion drafting followed by high-res refinement while preserving source image detail.
https://reddit.com/link/1tcnpxj/video/8n6t3ust611h1/player
[Paper](https://arxiv.org/abs/2605.06356) | [GitHub](https://github.com/hkust-longgroup/SwiftI2V) | [Project Page](https://hkust-longgroup.github.io/SwiftI2V/)
\- OmniGen2 — Unified image generation model handling text-to-image, editing, subject-driven generation, and visual conditions in one architecture. | [Paper](http://arxiv.org/abs/2605.07254v1)
https://preview.redd.it/iimjl0d2711h1.png?width=2772&format=png&auto=webp&s=21e30ab3ddf374f38b94c4b57498a870ae9a27ee
\- HiDream-O1-Image — Natively unified image generative foundation model. Open weights and code (8B model). | [Paper](http://arxiv.org/abs/2605.11061v1) | [GitHub](https://github.com/HiDream-ai/HiDream-O1-Image) | [Hugging Face](https://huggingface.co/HiDream-ai/HiDream-O1-Image)
https://preview.redd.it/kj4px8mv711h1.png?width=1456&format=png&auto=webp&s=bdfd6297ff6ad0a52ff39188571a5d9230f1825c
\- CDM — Continuous-time distribution matching for few-step diffusion distillation. High-quality images in fewer steps. Models released for SD3 Medium and Longcat.
https://preview.redd.it/bv980n9u711h1.png?width=1456&format=png&auto=webp&s=9e9a3695ab5153b3545bf913b9b9da87c37b08cf
[Paper](https://arxiv.org/abs/2605.06376) | [GitHub](https://github.com/byliutao/cdm) | [HF Models](https://huggingface.co/byliutao/stable-diffusion-3-medium-turbo)
\- PhysForge — Generates physics-grounded 3D assets with parts, materials, joints, mass, and movement rules for simulation and games.
https://reddit.com/link/1tcnpxj/video/yr62agus711h1/player
[Paper](https://arxiv.org/abs/2605.05163) | [GitHub](https://github.com/HKU-MMLab/PhysForge) | [Project Page](https://hku-mmlab.github.io/PhysForge/)
\- u/TensorForger built a Flux.2-Klein pipeline for real-time webcam stream processing at 30 FPS. | [Reddit](https://www.reddit.com/r/StableDiffusion/comments/1t7nd7e/flux2klein_pipeline_for_realtime_webcam_stream/?utm_source=share&utm_medium=web3x&utm_name=web3xcss&utm_term=1&utm_content=share_button)
https://reddit.com/link/1tcnpxj/video/opnfdkv7911h1/player
\- u/aniki_kun shared a ZIT I2I “Character LORA Transformation” workflow. | [Reddit](https://www.reddit.com/r/StableDiffusion/comments/1tae2yl/zit_i2i_character_lora_transformation_workflow/?utm_source=share&utm_medium=web3x&utm_name=web3xcss&utm_term=1&utm_content=share_button)
https://preview.redd.it/yjuuhq27911h1.jpg?width=1080&format=pjpg&auto=webp&s=56b2df98f3d27029c7019e1ffe01f9b3db34f69f
[](https://substackcdn.com/image/fetch/$s_!FE0C!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5722f795-5b1e-416b-9152-8970f2ac3bb8_1080x518.webp)
\- u/ThaJedi finetuned Qwen3-1.7B to imitate the original Z-Image text encoder. 21% less VRAM. | [Reddit](https://www.reddit.com/r/StableDiffusion/comments/1t71hvm/i_finetuned_qwen317b_to_imitate_original_zimage/?utm_source=share&utm_medium=web3x&utm_name=web3xcss&utm_term=1&utm_content=share_button)
\- Juggernaut Z dropped.
| [CivitAI](https://civitai.red/models/2600510/juggernaut-z?modelVersionId=2921151)
https://preview.redd.it/8u7gwjd5911h1.png?width=450&format=png&auto=webp&s=100a9e84a5c64cd2752423c8e6e619c6fb4fd820
[](https://substackcdn.com/image/fetch/$s_!uXeu!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7fdf28e6-fd71-432e-a540-848d7cafc1f5_450x675.webp)
\- ltx\_model released LipDub (Beta), an open-source lipsync IC-LoRA. | [Reddit](https://www.reddit.com/r/StableDiffusion/comments/1ta66f1/lipdub_beta_new_opensource_lipsync_iclora/?utm_source=share&utm_medium=web3x&utm_name=web3xcss&utm_term=1&utm_content=share_button)
\- MiniMind-O — 0.1B speech-native omni model. Text/speech/image in, text + streaming speech out. Code, checkpoints, and training datasets released.
https://preview.redd.it/ay16yj3h811h1.png?width=1456&format=png&auto=webp&s=971899daee79f7dd9c7acd8bdb976ea2bfe78dda
[Paper](http://arxiv.org/abs/2605.03937v1) | [GitHub](https://github.com/jingyaogong/minimind-o)
Honorable Mentions:
WavCube — Unified speech representation matching WavLM on SUPERB with 8x compression. SOTA zero-shot TTS. Open weights. | [Paper](http://arxiv.org/abs/2605.06407v1) | [GitHub](https://github.com/yanghaha0908/WavCube) | [Hugging Face](https://huggingface.co/yhaha/WavCube)
[The overall architecture of the WavCube representation.](https://preview.redd.it/0hlfjhvq811h1.png?width=1456&format=png&auto=webp&s=9f18dbd14070d89b11500ddbccc3cd8db4295b00)
Check out the [full roundup](https://open.substack.com/pub/thelivingedge/p/last-week-in-multimodal-ai-56-from?r=12l7fk&utm_campaign=post-expanded-share&utm_medium=web) for more demos, papers, and resources.
https://redd.it/1tcnpxj
@rStableDiffusion