End-of-January LTX-2 Drop: More Control, Faster Iteration

We just shipped a new LTX-2 drop focused on one thing: making video generation easier to iterate on without killing VRAM, consistency, or sync.

If you’ve been frustrated by LTX because prompt iteration was slow or outputs felt brittle, this update is aimed directly at that.

Here are the highlights; the full details are here.

# What’s New

Faster prompt iteration (Gemma text encoding nodes)
Why you should care: no more constant VRAM loading and unloading on consumer GPUs.

New ComfyUI nodes let you save and reuse text encodings, or run Gemma encoding through our free API when running LTX locally.

This makes Detailer and iterative flows much faster and less painful.
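The exact node names and API are ComfyUI-specific, but the underlying idea — encode a prompt once, reuse the embedding on every iteration — is simple to sketch. Below is a minimal, hypothetical illustration of prompt-encoding reuse (the `fake_encoder` stand-in and cache layout are assumptions, not the actual LTX/Gemma API):

```python
import hashlib

class EncodingCache:
    """Cache text encodings so the (large) text encoder only runs once per prompt."""
    def __init__(self, encoder):
        self.encoder = encoder   # expensive function: prompt -> embedding
        self.store = {}          # prompt hash -> cached embedding
        self.misses = 0

    def encode(self, prompt):
        key = hashlib.sha256(prompt.encode("utf-8")).hexdigest()
        if key not in self.store:
            self.misses += 1
            self.store[key] = self.encoder(prompt)  # only hit the encoder on a miss
        return self.store[key]

# Hypothetical stand-in for a heavyweight Gemma forward pass.
def fake_encoder(prompt):
    return [float(len(word)) for word in prompt.split()]

cache = EncodingCache(fake_encoder)
e1 = cache.encode("a cat surfing at sunset")
e2 = cache.encode("a cat surfing at sunset")  # served from cache, encoder not rerun
```

The win on consumer GPUs comes from never having to reload the text encoder into VRAM between iterations: the embedding is tiny compared to the model that produced it.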

Independent control over prompt accuracy, stability, and sync (Multimodal Guider)
Why you should care: you can now tune quality without breaking something else.

The new Multimodal Guider lets you control:

* Prompt adherence
* Visual stability over time
* Audio-video synchronization

Each can be tuned independently, per modality. No more choosing between “follows the prompt” and “doesn’t fall apart.”
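The release notes don't publish the Guider's exact math, but "independent per-modality control" typically means a separate guidance scale per conditioning signal, in the spirit of classifier-free guidance. A toy sketch of that idea (the formula and values here are assumptions for illustration, not LTX-2's actual implementation):

```python
def guided_prediction(uncond, conds, scales):
    """Combine an unconditional prediction with several conditional ones,
    each weighted by its own guidance scale (one scale per modality):

        out = uncond + sum_i scale_i * (cond_i - uncond)
    """
    out = list(uncond)
    for cond, scale in zip(conds, scales):
        for i, (c, u) in enumerate(zip(cond, uncond)):
            out[i] += scale * (c - u)
    return out

# Toy 3-value "prediction": strong text guidance, mild sync guidance.
uncond = [0.0, 0.0, 0.0]
text_cond = [1.0, 0.0, 0.0]
sync_cond = [0.0, 0.5, 0.0]
pred = guided_prediction(uncond, [text_cond, sync_cond], [4.0, 1.5])
```

Because each scale multiplies its own difference term, raising prompt adherence doesn't touch the stability or sync terms — which is exactly the "tune one thing without breaking another" property described above.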

More practical fine-tuning + faster inference
Why you should care: better behavior on real hardware.

Trainer updates improve memory usage and make fine-tuning more predictable on constrained GPUs.

Inference is also faster for video-to-video: the reference video is downscaled before cross-attention, reducing compute cost. (Speedups depend on resolution and clip length.)
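To see why downscaling the reference helps: cross-attention cost scales with the number of key/value tokens, and halving each spatial side of the reference cuts its token count by roughly 4x. A back-of-the-envelope estimate (the resolution, frame count, and patch size below are illustrative assumptions, not LTX-2's real tokenization):

```python
def ref_tokens(width, height, frames, patch=16):
    """Token count for a video under an assumed square spatial patch layout."""
    return (width // patch) * (height // patch) * frames

def cross_attn_cost(query_tokens, kv_tokens):
    """Cross-attention cost is roughly proportional to queries x keys/values."""
    return query_tokens * kv_tokens

q = ref_tokens(1280, 704, 121)        # tokens of the generated video (example size)
kv_full = ref_tokens(1280, 704, 121)  # reference at full resolution
kv_half = ref_tokens(640, 352, 121)   # reference downscaled 2x per side
speedup = cross_attn_cost(q, kv_full) / cross_attn_cost(q, kv_half)
```

The saving applies only to the cross-attention layers, which is consistent with the note above that the real-world speedup depends on resolution and clip length.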

We’ve also shipped new ComfyUI nodes and a unified LoRA to support these changes.

# What’s Next

This drop isn’t a one-off. The next LTX-2 version is already in progress, focused on:

* Better fine detail and visual fidelity (new VAE)
* Improved consistency to conditioning inputs
* Cleaner, more reliable audio
* Stronger image-to-video behavior
* Better prompt understanding and color handling

More on what's coming up here.

# Try It and Stress It!

If you’re pushing LTX-2 in real workflows, your feedback directly shapes what we build next. Try the update, break it, and tell us what still feels off in our Discord.

# **A primer on the most important concepts to train a LoRA**

The other day I put together a list of all the concepts I think people would benefit from understanding before they decide to train a LoRA. In the interest of the community, here are those concepts, at least an ELI10 version of them - just enough to understand how all those parameters interact with your dataset and captions.



NOTE: English is my 2nd language and I am not doing this with an LLM, so bear with me for possible mistakes.



# **What is a LoRA?**



"LoRA" stands for "Low-Rank Adaptation". It's an adaptor that you train to fit onto a model in order to modify its output.

Think of a USB-C port on your PC. If you don't have a USB-C cable, you can't connect to it. If you want to connect a device that has a USB-A, you'd need an adaptor, or a cable, that "adapts" the USB-C into a USB-A.

A LoRA is the same: it's an adaptor for a model (like flux, or qwen, or z-image).
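The "low-rank" part of the name is literal. Instead of retraining a big weight matrix W, a LoRA learns two small matrices B and A whose product is an update dW added on top of the frozen W. A minimal sketch, assuming a single linear layer (dimensions below are illustrative, not any real model's):

```python
import random

def lora_delta(rank, d_out, d_in):
    """A LoRA stores two small matrices B (d_out x rank) and A (rank x d_in);
    their product is the low-rank update dW added to the frozen weight W."""
    B = [[random.random() for _ in range(rank)] for _ in range(d_out)]
    A = [[random.random() for _ in range(d_in)] for _ in range(rank)]
    # dW[i][j] = sum_k B[i][k] * A[k][j] -- same shape as W, but far fewer params
    dW = [[sum(B[i][k] * A[k][j] for k in range(rank)) for j in range(d_in)]
          for i in range(d_out)]
    return B, A, dW

full_params = 4096 * 4096         # training a 4096x4096 layer directly
lora_params = 16 * (4096 + 4096)  # a rank-16 adapter for the same layer
```

This is why LoRA files are small and training is cheap: at rank 16 the adapter above has 128x fewer parameters than the layer it modifies.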



In this text I am going to assume we are talking mostly about character LoRAs, even though most of these concepts also work for other types of LoRAs.



***Can I use a LoRA I found on civitAI for SDXL on a Flux Model?***



No. A LoRA generally cannot work on a different model than the one it was trained for. You can't use a USB-C-to-something adaptor on a completely different interface. It only fits USB-C.



***My character LoRA is 70% good, is that normal?***



No. A character LoRA, if done correctly, should have 95% consistency. In fact, it is the only truly consistent way to generate the same character, if that character is not already known by the base model. If your LoRA only "sort of" works, it means something is wrong.



***Can a LoRA work with other LoRAs?***



Not really, at least not for character LoRAs. When two LoRAs are applied to a model, they *add* their weights, meaning that the result will be something new. There are ways to go around this, but that's an advanced topic for another day.





# **How does a LoRA "learn"?**



A LoRA learns by looking at everything that repeats across your dataset. If something repeats that you don't want bleeding into image generation, you have a problem and need to adjust your dataset. For example, if your whole dataset is on a white background, then the white background will most likely be "learned" into the LoRA and you will have a hard time generating other kinds of backgrounds with it.



So you need to consider your dataset very carefully. Are you providing multiple angles of the same thing that must be learned? Are you making sure everything else is diverse and not repeating?



***How many images do I need in my dataset?***



It can work with as few as a handful of images, or as many as 100. What matters is that what should repeat truly repeats consistently across the dataset, and everything else remains as variable as possible. For this reason, you'll often get better results for character LoRAs when you use fewer images - high-definition, crisp, ideal ones - rather than a lot of lower-quality images.

For synthetic characters, if your character's facial features aren't fully consistent, you'll get a blend of all those faces, which may not end up exactly like your ideal target - but that's not as critical as it is for a real person.



In many cases for character LoRAs, about 15 portraits and about 10 full-body poses will easily get you the best results.





# **The importance of clarifying your LoRA Goal**



To produce a high quality LoRA it is essential to be clear on what your goals are. You need to be clear on:



* The art style: realistic vs anime style, etc.
* Type of LoRA: I am assuming a character LoRA here, but different kinds (style LoRA, pose LoRA, product LoRA, multi-concept LoRA) may require different settings
* What is part of your character's identity and should NEVER change? Same hair color and hairstyle, or variable? Same outfit all the time, or variable? Same backgrounds all the time, or variable? Same body type all the time, or variable? Do you want that tattoo to be part of the character's identity, or can it change at generation time? Do you want her glasses to be part of her identity, or a variable? etc.
* Will the LoRA need to teach the model a new concept, or will it only specialize concepts the model already knows (like a specific face)?





# **Carefully building your dataset**



Based on the above answers, you should carefully build your dataset. Each single image has to bring something new to learn:



* Front facing portraits
* Profile portraits
* Three-quarter portraits
* Three-quarter rear portraits
* Seen from a higher elevation
* Seen from a lower elevation
* Zoomed on eyes
* Zoomed on specific features like moles, tattoos, etc.
* Zoomed on specific body parts like toes and fingers
* Full body poses showing body proportions
* Full body poses in relation to other items (like doors) to teach relative height



In each image of the dataset, the subject that must be learned has to be consistent and repeat on all images. So if there is a tattoo that should be PART of the character, it has to be present everywhere at the proper place. If the anime character is always in blue hair, all your dataset should show that character with blue hair.

Everything else should never repeat! Change the background on each image. Change the outfit on each image. etc.
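The coverage checklist above is easy to sanity-check mechanically. Here's a small sketch that assumes a hypothetical filename convention where each image is tagged with its shot type (e.g. `char_front-portrait_03.png`) - real datasets may track this differently:

```python
from collections import Counter

# Hypothetical convention: shot type embedded in the filename.
REQUIRED_SHOTS = {"front-portrait", "profile", "three-quarter", "full-body"}

def coverage_report(filenames):
    """Count how many images cover each required shot type, and flag gaps."""
    counts = Counter()
    for name in filenames:
        for shot in REQUIRED_SHOTS:
            if shot in name:
                counts[shot] += 1
    missing = sorted(REQUIRED_SHOTS - set(counts))
    return counts, missing

files = ["char_front-portrait_01.png", "char_profile_01.png",
         "char_full-body_01.png", "char_front-portrait_02.png"]
counts, missing = coverage_report(files)  # flags the missing three-quarter shots
```

A gap in coverage (here, no three-quarter views) is exactly the kind of problem that is cheap to fix before training and expensive to discover after.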





# **How to carefully caption your dataset**



Captioning is ***essential***. During training, captioning does several things for your LoRA:

* It's giving context to what is being learned (especially important when you add extreme close-ups)
* It's telling the training software what is variable and should be ignored and not learned (like background and outfit)
* It's providing a unique trigger word for everything that will be learned, allowing differentiation when more than one concept is being learned
* It's telling the model what concept it already knows that this LoRA is refining
* It's countering the training tendency to overtrain



For each image, your caption should use natural language (except for older models like SD) but should also be kept short and factual.

It should say:

* The trigger word
* The expression / emotion
* The camera angle, height angle, and zoom level
* The light
* The pose and background (only very short, no detailed description)
* The outfit (unless you want the outfit to be learned with the LoRA, like for an anime superhero)
* The accessories
* The hairstyle and color (unless you want the same hairstyle and color to be part of the LoRA)
* The action



Example:



*Portrait of Lora1234 standing in a garden, smiling, seen from the front at eye-level, natural light, soft shadows. She is wearing a beige cardigan and jeans. Blurry plants are visible in the background.*
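Since every caption follows the same field list, it can help to assemble captions from a template so nothing is forgotten and optional fields (outfit, hair) are dropped when they should be learned instead. A small sketch - the function and field names are my own invention, not part of any trainer:

```python
def build_caption(trigger, expression, angle, light, pose_bg,
                  outfit=None, hair=None):
    """Assemble a short, factual caption from the checklist fields.
    Pass outfit/hair as None when they should be learned as part of the LoRA."""
    parts = [f"{trigger} {pose_bg}, {expression}, {angle}, {light}."]
    if outfit:
        parts.append(f"She is wearing {outfit}.")
    if hair:
        parts.append(f"Her hair is {hair}.")
    return " ".join(parts)

caption = build_caption(
    trigger="Portrait of Lora1234",
    expression="smiling",
    angle="seen from the front at eye-level",
    light="natural light, soft shadows",
    pose_bg="standing in a garden",
    outfit="a beige cardigan and jeans",
)
```

Leaving `hair=None` here mirrors the rule above: anything you want baked into the LoRA stays out of the caption.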





***Can I just avoid captioning at all for character LoRAs?***



That's a bad idea. If your dataset is perfect (nothing unwanted repeats, there are no extreme close-ups, and everything that should repeat is consistent), then you may still get good results. Otherwise, you'll get average or bad results at first, or a rigid, overtrained model after enough steps.





***Can I just run auto captions using some LLM like JoyCaption?***



It should never be done entirely by automation (unless you have thousands upon thousands of images), because auto-captioning doesn't know the exact purpose of your LoRA, and therefore it can't carefully choose which parts to caption to mitigate overtraining while leaving uncaptioned the core things being learned.





# **What is the LoRA rank (network dim) and how to set it**



The rank of a LoRA represents the space we are allocating for details.

Use a high rank when you have a lot of things to learn.

Use a low rank when you have something simple to learn.



Typically, a rank of 32 is enough for most tasks.

Large models like Qwen already produce big LoRA files, so you don't need a very high rank on those models.
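Why large base models mean large LoRA files at the same rank: each adapted layer contributes roughly `rank * (d_in + d_out)` parameters, and bigger models have both more layers and wider ones. A rough size estimate in fp16 (all dimensions below are illustrative assumptions, not real model specs):

```python
def lora_size_mb(n_adapted_layers, rank, d_model, bytes_per_param=2):
    """Rough fp16 size of a LoRA: each adapted (square, d_model x d_model)
    layer adds rank * 2 * d_model parameters. Illustrative only."""
    params = n_adapted_layers * rank * 2 * d_model
    return params * bytes_per_param / 1e6

small = lora_size_mb(n_adapted_layers=120, rank=32, d_model=2048)
large = lora_size_mb(n_adapted_layers=240, rank=32, d_model=5120)  # bigger base, same rank
```

Same rank 32, roughly 5x the file size - which is why you can often drop the rank on a large model and still keep plenty of capacity.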



This is important because...



* If you use too high a rank, your LoRA will start learning additional details from your dataset that may clutter or even make it rigid and bleed during generation as