ComfyUI-DramaBox now supports LoRAs, and Voice-Clone-Studio-DramaBox can generate them.
Hey guys, a couple of days ago u/manmaynakhashi released DramaBox.
A really cool TTS model based on LTX.
I made a **ComfyUI node** for it, and today I've added LoRA support.
Some of you might be familiar with my TTS tool, Voice-Clone-Studio.
I made a stripped-down version called **Voice-Clone-Studio-DramaBox**, specifically for DramaBox, both for using it as a TTS and for LoRA generation.
I've stripped out most of the models, only keeping Qwen-TTS for its Voice Design option. This makes it a bit more focused and easier to install.
In it you will find a Prep Sample tab that lets you generate a complete dataset from one long audio clip: it cuts the clip into phrases and auto-transcribes them.
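Just to illustrate the idea, here's a minimal sketch of that kind of dataset prep (not the tool's actual code): silence-based splitting with pydub plus Whisper transcription. The file name and thresholds are purely illustrative.

```python
# Minimal sketch of the Prep Sample idea (not the tool's actual code):
# split one long clip on silences, then auto-transcribe each phrase.
# Assumes pydub and openai-whisper are installed and ffmpeg is on PATH.
import os

import whisper
from pydub import AudioSegment
from pydub.silence import split_on_silence

os.makedirs("dataset", exist_ok=True)

audio = AudioSegment.from_file("speaker_long.wav")  # hypothetical input file
phrases = split_on_silence(
    audio,
    min_silence_len=400,  # ms of silence that ends a phrase (illustrative)
    silence_thresh=-40,   # dBFS threshold (illustrative)
    keep_silence=100,     # keep a little padding so cuts sound natural
)

model = whisper.load_model("base")
for i, clip in enumerate(phrases):
    path = f"dataset/clip_{i:03d}.wav"
    clip.export(path, format="wav")
    text = model.transcribe(path)["text"].strip()
    with open(f"dataset/clip_{i:03d}.txt", "w", encoding="utf-8") as f:
        f.write(text)
```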
https://preview.redd.it/gpqqywzkol1h1.png?width=1901&format=png&auto=webp&s=a7418431c0ba0ff1399fdd13585ee4b02cb119a3
I've had better success with 10 clips than with 80, with clips ranging from 5 to 10 seconds.
Because DramaBox is VERY prone to hallucination, I'm not adding it to Voice-Clone-Studio. It serves a different use case. This is much more experimental 🤣
https://redd.it/1tfbjfo
@rStableDiffusion
LTX 2.3 is now supported in Comfyui-Mesh for splitting models across Ethernet or multi-GPU machines with the NVENC codec. Major VRAM fixes are included for the Flux.2/LTX model implementations in the node.
https://redd.it/1tfcj56
@rStableDiffusion
Lifestyle/everyday scenes have been harder for me than glamour shots. The foam on her hands took the most prompt iteration.
https://redd.it/1tf9e68
@rStableDiffusion
LTX 2.3 Experimental Music Video
https://www.youtube.com/watch?v=8PDmOIgKAFk
https://redd.it/1tfk3tq
@rStableDiffusion
YouTube
Rainbow Connection - Kermit/Jim Henson's song (1979)
This experiment uses the latest local AI technology to create a music video; I'm playing with the framing and camera angles.
All of my AI demos:
https://www.youtube.com/playlist?list=PLe3OBqR7FeRhZM6SNoIWibQ1PA2JREYtL
Prompting Tips Flux.2-Klein
For Klein 9B using the qwen_3_8b text encoder, the prompt path is basically:
your prompt:
1. wrapped in the Qwen chat template
2. Qwen2 tokenizer
3. Qwen3 8B text encoder
4. hidden layers [9, 18, 27] stacked into conditioning
5. the Flux.2/Klein transformer cross-attends to that
The local wrapper applies this template:
<|im_start|>user
YOUR PROMPT<|im_end|>
<|im_start|>assistant
<think>
</think>
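As a rough illustration, here's what that path looks like with plain Hugging Face transformers. This is a sketch under the assumptions above: the checkpoint name and output shape are mine, and ComfyUI's wrapper differs in detail, but the layer indices follow the post.

```python
# Rough sketch of the described prompt path using Hugging Face transformers.
import torch
from transformers import AutoModel, AutoTokenizer

MODEL_ID = "Qwen/Qwen3-8B"  # assumed checkpoint
tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
encoder = AutoModel.from_pretrained(MODEL_ID, torch_dtype=torch.bfloat16)

def encode_prompt(prompt: str) -> torch.Tensor:
    # 1. Wrap the raw prompt in the Qwen chat template (empty think block).
    text = (
        "<|im_start|>user\n"
        f"{prompt}<|im_end|>\n"
        "<|im_start|>assistant\n"
        "<think>\n\n</think>\n"
    )
    # 2-3. Tokenize and run the Qwen3 8B encoder, keeping hidden states.
    ids = tokenizer(text, return_tensors="pt").input_ids
    with torch.no_grad():
        out = encoder(ids, output_hidden_states=True)
    # 4. Stack hidden layers 9, 18 and 27 into one conditioning tensor of
    #    shape [batch, 3, seq_len, hidden]. Step 5 (cross-attention) then
    #    happens inside the Flux.2/Klein transformer.
    layers = [out.hidden_states[i] for i in (9, 18, 27)]
    return torch.stack(layers, dim=1)
```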
So it is not reading your prompt like CLIP tags. It is reading it like an instruction/message.
What It Accepts Well:
**It should respond best to natural language with clear relationships:**
A woman sitting on a beachfront, looking at the camera, wearing a black dress. The camera is at eye level. Her body is seated facing slightly left. The beach and ocean are behind her.
**Strong prompt concepts:**
- subject type: woman, man, dog, car
- action/pose: sitting, standing, walking, looking at camera
- location: on a beach, inside a kitchen
- spatial relations: behind her, to her left, in the foreground
- clothing/object attribution: she is wearing, holding, beside
- camera/framing: close-up, full body, eye-level, three-quarter view
- style if phrased plainly: photo, natural lighting, soft shadows
**What It Throws Away Or Weakens**
The big one: Comfy prompt weighting is disabled for this TE.
**So this does not mean much:**
((face:1.4)), [body:0.6], (((identity)))
The tokenizer still sees the punctuation/text, but the encoder wrapper passes disable_weights=True, so classic CLIP-style emphasis is not applied as weights.
Also weak:
- giant comma tag soups
- repeated words as fake emphasis
- abstract junk like masterpiece, best quality, ultra detailed
- contradictions: sitting, standing, walking
- vague modifiers not attached to a noun: beautiful, perfect, cinematic
- negative prompt logic, unless the sampler/model path explicitly uses it well
- overly long prompts where important instructions are buried
What Matters Most
Because this is Qwen-style chat encoding, write prompt chunks as sentences with ownership:
Bad:
beach, woman, camera, sitting, black dress, looking, ocean, realistic
Better:
A realistic photo of a woman sitting on a beach. She is looking at the camera. She is wearing a black dress. The ocean is behind her.
For identity/reference workflows ("identity feature transfer"), avoid asking the TE to redefine the subject too much. Let the node carry identity, and let the prompt carry scene/action:
Keep the same woman. Change only the location: she is sitting on a beachfront, looking at the camera. Natural daylight photo.
Best Prompt Shape For Your Use:
Use this structure:
[identity constraint].
[scene/location change].
[pose/action].
[clothing/body constraint].
[camera/framing].
[lighting/style].
Example:
Keep the same woman from the reference image.
Move her to a sunny beachfront.
She is sitting and looking directly at the camera.
Preserve her face, body proportions, hairstyle, and clothing shape.
Eye-level photo, natural daylight, realistic beach background.
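If you're templating this in a script, something as simple as the following works (the function and field names are mine, purely illustrative):

```python
# Tiny builder for the structure above (field names are illustrative).
def build_prompt(identity: str, scene: str, pose: str,
                 body: str, camera: str, style: str) -> str:
    # Each field should already be a full sentence with clear ownership.
    return " ".join([identity, scene, pose, body, camera, style])

print(build_prompt(
    identity="Keep the same woman from the reference image.",
    scene="Move her to a sunny beachfront.",
    pose="She is sitting and looking directly at the camera.",
    body="Preserve her face, body proportions, hairstyle, and clothing shape.",
    camera="Eye-level photo.",
    style="Natural daylight, realistic beach background.",
))
```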
The TE will not literally “obey” every clause, but this format gives Qwen the best chance to encode relationships instead of treating the prompt as a bag of tags.
https://redd.it/1tflqso
@rStableDiffusion
GitHub: capitan01R/ComfyUI-Flux2Klein-Enhancer (Flux.2 Klein 9B Enhancement Nodes Suite)
Dream Wan + LTX combination
Given that Wan2.2 is much better at learning movement and physics, while LTX is better with audio and lipsync, the dream would be to define the desired motion with a generated Wan clip and let LTX continue it.
There are workflows, such as RuneXX's, that try to achieve this, but I've not managed to make LTX replicate and continue Wan's movements; it only goes off on its own tangent.
Has anyone achieved this? I know Sulphur is impressive, but it's still a long way behind some of the Wan checkpoints, especially in terms of physics and prompt adherence.
https://huggingface.co/RuneXX/LTX-2.3-Workflows/tree/main/Video-2-Video/Extend-Any-Video
https://redd.it/1tfktgi
@rStableDiffusion
My local workflow for turning SDXL character generations into game-ready 3D assets
https://redd.it/1tfnlr8
@rStableDiffusion