LTX 2.3 audio as standalone speech model.
User @wildmindai from X posted about this new model. Has anyone here tried it yet?
Emotional TTS with Scenema Audio.
- Zero-shot expressive voice cloning and speech generation
- 8-step distilled, with Gemma 3 12B text encoding
- Stage directions via <action> tags
- Runs at 1.5x real-time on an RTX 4090
- Fits in 16GB VRAM
- 13 languages, 48kHz stereo output
It also generates matching environment sounds.
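For anyone sizing this up before downloading, here is a rough back-of-the-envelope sketch of what the speed and output-format figures would mean in practice. It assumes "1.5x real-time" means 1.5 seconds of audio produced per second of compute (the post does not say which way the ratio goes), and it assumes 16-bit PCM samples for the size estimate, which the post also does not state. The function names are made up for illustration and are not part of any Scenema or LTX API.

```python
def generation_seconds(audio_seconds: float, speed_factor: float = 1.5) -> float:
    """Estimated wall-clock time to synthesize a clip.

    Assumption: speed_factor is seconds of audio produced per second of
    compute, so 1.5 means faster than real time.
    """
    if speed_factor <= 0:
        raise ValueError("speed_factor must be positive")
    return audio_seconds / speed_factor


def raw_pcm_bytes(audio_seconds: float,
                  sample_rate: int = 48_000,
                  channels: int = 2,
                  bytes_per_sample: int = 2) -> int:
    """Uncompressed size of the output audio.

    48 kHz stereo comes from the post; 2 bytes per sample (16-bit PCM)
    is an assumption.
    """
    return int(audio_seconds * sample_rate * channels * bytes_per_sample)


if __name__ == "__main__":
    clip = 60.0  # one minute of speech
    print(f"estimated generation time: {generation_seconds(clip):.1f} s")
    print(f"raw PCM size: {raw_pcm_bytes(clip) / 1e6:.2f} MB")
```

If the ratio actually runs the other way (1.5 seconds of compute per second of audio), multiply instead of divide.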
https://huggingface.co/ScenemaAI/scenema-audio
https://redd.it/1tab0tb
@rStableDiffusion