ComfyUI-DramaBox now supports LoRAs, and Voice-Clone-Studio-DramaBox can generate them.

Hey guys, a couple of days ago u/manmaynakhashi released DramaBox, a really cool TTS model based on LTX.

I made a **ComfyUI node** for it, and today I've added LoRA support.


Some of you might be familiar with my TTS tool, Voice-Clone-Studio.
I made a stripped-down version called **Voice-Clone-Studio-DramaBox**, specifically for DramaBox, both for using it as a TTS and for LoRA generation.

I've stripped out most of the models, only keeping Qwen-TTS for its Voice Design option. This makes it a bit more focused and easier to install.

In it you will find a Prep Sample tab that allows generating complete datasets from one long audio clip: it cuts the clip down by phrases and auto-transcribes each segment.

https://preview.redd.it/gpqqywzkol1h1.png?width=1901&format=png&auto=webp&s=a7418431c0ba0ff1399fdd13585ee4b02cb119a3

I've had better success with 10 clips than with 80, with clips ranging between 5 and 10 seconds.
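
The post doesn't show the tab's internals, but a minimal sketch of the idea — silence-split one long clip, keep the phrase-length pieces, auto-transcribe each — could look like this (pydub and openai-whisper are stand-ins I've assumed, not necessarily what the tool uses):

```python
# Minimal sketch of the Prep Sample idea, not the tool's actual code:
# split one long recording into phrase-sized clips and auto-transcribe them.
# pydub + openai-whisper are assumed stand-ins for the real implementation.
import os
import whisper
from pydub import AudioSegment
from pydub.silence import split_on_silence

audio = AudioSegment.from_file("long_clip.wav")
# Cut wherever the signal drops well below average loudness for ~0.5 s.
chunks = split_on_silence(audio, min_silence_len=500,
                          silence_thresh=audio.dBFS - 14, keep_silence=200)

model = whisper.load_model("base")
os.makedirs("dataset", exist_ok=True)
for i, chunk in enumerate(chunks):
    if not 5_000 <= len(chunk) <= 10_000:   # keep 5-10 s clips, per above
        continue
    path = f"dataset/sample_{i:03d}.wav"
    chunk.export(path, format="wav")
    text = model.transcribe(path)["text"].strip()
    with open(f"dataset/sample_{i:03d}.txt", "w") as f:
        f.write(text)                        # clip + matching transcript
```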

As DramaBox is VERY prone to hallucination, I'm not adding it to Voice-Clone-Studio. It serves a different use case. This is much more experimental 🤣

https://redd.it/1tfbjfo
@rStableDiffusion
LTX 2.3 is now supported in Comfyui-Mesh for splitting models across Ethernet or multi-GPU machines with the NVENC codec. Major VRAM fixes included for the Flux2/LTX model implementations in the node.
https://redd.it/1tfcj56
@rStableDiffusion
Lifestyle/everyday scenes have been harder for me than glamour shots. The foam on her hands took the most prompt iteration.
https://redd.it/1tf9e68
@rStableDiffusion
Prompting Tips for Flux.2-Klein

For Klein 9B using the qwen_3_8b text encoder, the prompt path is basically:

your prompt:

1. wrapped in the Qwen chat template
2. Qwen2 tokenizer
3. Qwen3 8B text encoder
4. hidden layers [9, 18, 27] stacked into conditioning
5. Flux2/Klein transformer cross-attends to that

The local wrapper does this template:

<|im_start|>user
YOUR PROMPT<|im_end|>
<|im_start|>assistant
<think>

</think>

So it is not reading your prompt like CLIP tags. It is reading it like an instruction/message.
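
To make the path concrete, here is a rough sketch of those five steps with Hugging Face transformers. The model id "Qwen/Qwen3-8B" (with its own tokenizer standing in for the Qwen2 one), the enable_thinking template kwarg, and concatenating the three layers along the feature axis are my assumptions, not the Comfy wrapper's actual code:

```python
# Hedged sketch of the Klein text-encoding path using HF transformers.
# Model id, tokenizer choice, and the torch.cat over layers [9, 18, 27]
# are assumptions; ComfyUI's wrapper may combine things differently.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("Qwen/Qwen3-8B")
model = AutoModelForCausalLM.from_pretrained(
    "Qwen/Qwen3-8B", torch_dtype=torch.bfloat16)

# Steps 1-2: wrap the prompt in the Qwen chat template and tokenize.
# enable_thinking=False (a Qwen3 template kwarg) reproduces the empty
# <think></think> block shown above.
ids = tok.apply_chat_template(
    [{"role": "user", "content": "A woman sitting on a beach."}],
    add_generation_prompt=True, enable_thinking=False, return_tensors="pt")

# Step 3: run the Qwen3 8B text encoder, keeping every hidden layer.
with torch.no_grad():
    out = model(ids, output_hidden_states=True)

# Step 4: stack hidden layers [9, 18, 27] into one conditioning tensor.
cond = torch.cat([out.hidden_states[i] for i in (9, 18, 27)], dim=-1)

# Step 5: `cond` is what the Flux2/Klein transformer cross-attends to.
print(cond.shape)  # (1, seq_len, 3 * hidden_size)
```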

What It Accepts Well:

**It should respond best to natural language with clear relationships:**

A woman sitting on a beachfront, looking at the camera, wearing a black dress. The camera is at eye level. Her body is seated facing slightly left. The beach and ocean are behind her.

**Strong prompt concepts:**

- subject type: woman, man, dog, car
- action/pose: sitting, standing, walking, looking at camera
- location: on a beach, inside a kitchen
- spatial relations: behind her, to her left, in the foreground
- clothing/object attribution: she is wearing, holding, beside
- camera/framing: close-up, full body, eye-level, three-quarter view
- style, if phrased plainly: photo, natural lighting, soft shadows

**What It Throws Away Or Weakens**

The big one: Comfy prompt weighting is disabled for this TE.

**So this does not mean much:**

((face:1.4)), [body:0.6], (((identity)))

The tokenizer still sees the punctuation/text, but the encoder wrapper passes disable_weights=True, so classic CLIP-style emphasis is not applied as weights.
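
For illustration only, this toy parser shows the kind of (text:weight) handling that classic CLIP emphasis relies on; with disable_weights=True that parsing step is skipped, and the whole string — parentheses and numbers included — is encoded verbatim:

```python
# Toy illustration (not Comfy's actual parser) of CLIP-style emphasis:
# a weighting pass would split "(text:1.4)" into (text, weight) pairs and
# scale embeddings. With disable_weights=True nothing like this runs, so
# the weight syntax reaches the encoder as plain text.
import re

def parse_clip_weights(prompt: str):
    """Extract (text, weight) pairs from (text:weight) emphasis syntax."""
    pairs = []
    for m in re.finditer(r"\(([^():]+):([\d.]+)\)|([^()]+)", prompt):
        if m.group(1):                      # weighted chunk
            pairs.append((m.group(1).strip(), float(m.group(2))))
        elif m.group(3) and m.group(3).strip():
            pairs.append((m.group(3).strip(), 1.0))
    return pairs

print(parse_clip_weights("(face:1.4) portrait, (hands:0.6)"))
# [('face', 1.4), ('portrait,', 1.0), ('hands', 0.6)]
```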

Also weak:

- giant comma tag soups
- repeated words as fake emphasis
- abstract junk like masterpiece, best quality, ultra detailed
- contradictions: sitting, standing, walking
- vague modifiers not attached to a noun: beautiful, perfect, cinematic
- negative prompt logic, unless the sampler/model path explicitly uses it well
- overly long prompts where important instructions are buried

What Matters Most

Because this is Qwen-style chat encoding, write prompt chunks as sentences with ownership:

Bad:

beach, woman, camera, sitting, black dress, looking, ocean, realistic

Better:

A realistic photo of a woman sitting on a beach. She is looking at the camera. She is wearing a black dress. The ocean is behind her.

For identity/reference workflows ("identity feature transfer"), avoid asking the TE to redefine the subject too much. Let the node carry identity, and let the prompt carry scene/action:

Keep the same woman. Change only the location: she is sitting on a beachfront, looking at the camera. Natural daylight photo.

Best Prompt Shape For Your Use:

Use this structure:

[identity constraint].
[scene/location change].
[pose/action].
[clothing/body constraint].
[camera/framing].
[lighting/style].

Example:

Keep the same woman from the reference image.
Move her to a sunny beachfront.
She is sitting and looking directly at the camera.
Preserve her face, body proportions, hairstyle, and clothing shape.
Eye-level photo, natural daylight, realistic beach background.

The TE will not literally “obey” every clause, but this format gives Qwen the best chance to encode relationships instead of treating the prompt as a bag of tags.

https://redd.it/1tflqso
@rStableDiffusion
Dream Wan + LTX combination

Given that Wan2.2 is much better at learning movement and physics, while LTX is better with audio and lipsync, the dream would be to define the desired motion with a generated Wan clip and let LTX continue it.

There exist workflows, such as RuneXX's, that try to achieve this, but I've not managed to make LTX replicate and continue Wan's movements; it only goes off on its own tangent.

Has anyone achieved this? I know Sulphur is impressive, but it's still a long way behind some of the Wan checkpoints, especially in terms of physics and prompt adherence.

https://huggingface.co/RuneXX/LTX-2.3-Workflows/tree/main/Video-2-Video/Extend-Any-Video

https://redd.it/1tfktgi
@rStableDiffusion
My local workflow for turning SDXL character generations into game-ready 3D assets

https://redd.it/1tfnlr8
@rStableDiffusion
Check out a free prompt writing site I made
https://redd.it/1tfrykt
@rStableDiffusion