LTX 2.3 audio as a standalone speech model.

User @wildmindai from X posted about this new model. Has anyone here tried it yet?

Emotional TTS with Scenema Audio.

- Zero-shot expressive voice cloning and speech generation
- 8-step distilled, with Gemma 3 12B text encoding
- Stage directions via <action> tags
- Runs at 1.5x real-time on an RTX 4090
- Fits in 16GB VRAM
- 13 languages, 48kHz stereo output

It also generates matching environment sounds.
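
For a rough sense of what the speed and output numbers mean in practice, here is a back-of-the-envelope sketch. It assumes "1.5x real-time" means 1.5 seconds of audio produced per second of compute, and it assumes 16-bit PCM for the size estimate (the post does not state a bit depth). The function names are made up for illustration and are not part of the model's API.

```python
# Figures quoted in the post
SAMPLE_RATE_HZ = 48_000   # 48kHz output
CHANNELS = 2              # stereo
REALTIME_FACTOR = 1.5     # assumed: seconds of audio per wall-clock second

# Assumption, not stated in the post: 16-bit PCM samples
BYTES_PER_SAMPLE = 2


def generation_seconds(audio_seconds: float, rtf: float = REALTIME_FACTOR) -> float:
    """Estimated wall-clock time to generate a clip of the given length."""
    return audio_seconds / rtf


def pcm_bytes(audio_seconds: float) -> int:
    """Uncompressed size of the clip under the 16-bit PCM assumption."""
    return int(audio_seconds * SAMPLE_RATE_HZ * CHANNELS * BYTES_PER_SAMPLE)


if __name__ == "__main__":
    clip = 60.0  # a one-minute clip
    print(f"{clip:.0f} s clip: ~{generation_seconds(clip):.0f} s to generate")
    print(f"raw PCM size: ~{pcm_bytes(clip) / 1e6:.2f} MB")
```

Under those assumptions, a one-minute clip would take about 40 seconds on the quoted hardware and come to roughly 11.5 MB uncompressed.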

https://huggingface.co/ScenemaAI/scenema-audio

https://redd.it/1tab0tb
@rStableDiffusion