Inpainting with LTXV 2.3. Results after two weeks of R&D.
Hello!
I am a designer at DOGMA. We do AI work for TV ads, shows, and movies; a Netflix show we worked on recently came out on Netflix Italy, and the company had its first meeting in Hollywood last month.
50% of our work is video inpainting, and 100% of our work for Netflix was inpainting, so I've spent the last few weeks doing R&D with LTXV 2.3 to see if and how the tool can meet the practical needs of the movie business. We strongly believe in the sociocultural importance of open source.
First of all, huge thanks to u/ltxmodel for becoming the main champion of the democratization of open-source video generation tools and for the constant improvements to their model. The incredible HDR LoRA is something we were not expecting so soon; please keep up the amazing work. From our tests, LTXV 2.3 T2V and I2V can be pushed locally up to 5K resolution, with results that have very little to envy from the closed-source Seedance 2. Congratulations also to u/RoundAwareness5490 for his outstanding experimental work and effort in creating LoRAs that extend the capabilities of the main model.
Here is the recap of the R&D (translated from Italian to English).
---
Method 1 / No inpainting LoRA:
You use Add Guide Multi with two reference frames (first and last), while the original video goes into VAE Encode. Then you apply an LTXV latent mask to the area that needs to be modified.
Problems: as always when using multiple guide inputs for inpainting, some parts flicker and do not match the original video, especially in the frames close to the first and last reference frames. There is no other way to provide reference frames with this method except by adding more entries in Add Guide Multi. In practice, it is a kind of masked denoise. It works very well if you do not need precision and can skip reference frames, relying only on the prompt/LoRA.
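To make the "kind of denoise" point concrete, here is a minimal sketch of what a latent-mask inpaint conceptually does at every sampling step. This is the generic latent-inpainting idea, not the actual LTXV node internals; all names and shapes are illustrative.

```python
import numpy as np

def masked_denoise_step(denoised_latent, original_latent, mask):
    """Blend the freshly denoised latent with the original video latent.

    mask == 1 where the model is allowed to repaint,
    mask == 0 where the original latent is kept frozen.
    Shapes are illustrative: (frames, channels, height, width).
    """
    return mask * denoised_latent + (1.0 - mask) * original_latent

# Toy example: only the right half of each latent frame is regenerated.
frames, ch, h, w = 8, 16, 32, 32
original = np.random.randn(frames, ch, h, w).astype(np.float32)
denoised = np.random.randn(frames, ch, h, w).astype(np.float32)
mask = np.zeros((frames, 1, h, w), dtype=np.float32)
mask[..., w // 2:] = 1.0

blended = masked_denoise_step(denoised, original, mask)
assert blended.shape == original.shape
```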
---
Method 2 / Inpainting with the model ltx23_inpaint_masked_r2v_rank32_v1_3000steps.safetensors:
The 3000-step version seems to be the only one that works most of the time.
This LoRA is trained to take as input a video where the original video is on the right, with the part to be inpainted marked in magenta, and a small reference frame on the left. As output, it provides the final inpainted video using that reference. It also sometimes works if you send as input the whole video with no reference and a white overlay on the masked area (similar to VACE).
Problems: it is excellent if you put Trump’s face in the small reference frame, but terrible if you need something precise, because the mini-frame is not even 200px wide, so it has no way to capture precise information. Adding Add Guide Multi partly solves this, but then you are back to the Add Guide Multi problem: flickering and, above all, a mismatch with the original video close to the reference frames. Sending as input only the video with the magenta masked area, with the first and last frames already set the way you want them, often, but not always, results in videos where the magenta or white artifacts come back in the form of smoke or a solid color.
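For context, this is roughly how we prepare the conditioning frames for this LoRA; a minimal sketch with Pillow, assuming the layout described above (narrow reference strip on the left, magenta fill over the masked area on the right). The exact strip width and the flat-magenta fill value are our own guesses from experimentation, not documented constants.

```python
from PIL import Image

MAGENTA = (255, 0, 255)  # assumed flat fill color for the area to inpaint

def build_conditioning_frame(original_frame: Image.Image,
                             mask: Image.Image,
                             reference: Image.Image,
                             ref_width: int = 192) -> Image.Image:
    """Compose one input frame: small reference strip on the left,
    original frame with the masked area filled in magenta on the right.
    `mask` must match `original_frame` in size, white where to inpaint."""
    frame = original_frame.copy()

    # Paint the area to be inpainted in flat magenta.
    magenta_layer = Image.new("RGB", frame.size, MAGENTA)
    frame.paste(magenta_layer, (0, 0), mask.convert("L"))

    # Squeeze the reference into a narrow strip of the same height.
    ref = reference.resize((ref_width, frame.height))

    # Concatenate: [reference | magenta-marked original].
    canvas = Image.new("RGB", (ref_width + frame.width, frame.height))
    canvas.paste(ref, (0, 0))
    canvas.paste(frame, (ref_width, 0))
    return canvas
```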
---
Method 3 / Inpainting with the model ltx23_inpaint_rank128_v1_02500steps.safetensors or ltx23_inpaint_rank128_v1_10000steps.safetensors:
This LoRA does in fact take the area to be inpainted the same way VACE did; here the masked area apparently needs to be white instead of magenta. It does not support any kind of reference, so it is only useful for prompt-driven inpainting. Here too, Add Guide Multi can be used to force start and end reference frames, with all the problems and inconsistencies of the previous method.
I tried many variations of each method. For example, I tried passing only the video with the mask applied to all frames except the first and last. I tried using a KSampler Advanced to apply denoise only during the final steps. I tried raising the CFG up to 2.5. All these variations sometimes produce decent results, but never consistently.
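The first of those variations, leaving the first and last frames completely unmasked so they act as anchors, is simple to express. A minimal sketch, assuming the inpaint mask is a single-channel array with 1 where the model may repaint:

```python
import numpy as np

def build_framewise_masks(inpaint_mask: np.ndarray, num_frames: int) -> np.ndarray:
    """Stack a spatial inpaint mask over time, but leave the first and
    last frames untouched (mask = 0) so they anchor the generation."""
    masks = np.repeat(inpaint_mask[None, ...], num_frames, axis=0)
    masks[0] = 0.0    # first frame kept exactly as in the original video
    masks[-1] = 0.0   # last frame kept exactly as in the original video
    return masks

# Example: a 97-frame clip with a rectangular region to repaint.
spatial_mask = np.zeros((480, 832), dtype=np.float32)
spatial_mask[100:300, 200:500] = 1.0
masks = build_framewise_masks(spatial_mask, num_frames=97)
assert masks.shape == (97, 480, 832)
```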
The video that came out well yesterday was a complete fluke. Shift the mask by 1px and a run may suddenly, randomly, come out well; change the seed or shift the mask by 1px again, and the little white or magenta clouds may come back.
---
Besides, the author of the inpainting LoRA himself added a huge number of clarifications on the project page, which basically means it does not always work without fiddling with parameters. We can use it ourselves, but we can hardly hand a general workflow to a junior at the company to speed up production.
None of the official or unofficial workflows I found does the exact kind of work we need: replacing only one part of a video with something for which we provide an exact visual reference, optionally combined with depth/canny maps, while keeping and matching the original input video exactly, both in resolution and in spatiotemporal coherence.
In all these cases, the only way to get back the original video with only the inpainted part changed is still to recomposite the model output over the original video using the mask. This happens because even if you run inference only on a masked part of the latent, your video still passes through the VAE and is therefore modified. We knew this already, but we keep hoping someone will make an ad hoc model or nodes for this.
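The recompositing step itself is trivial; a minimal per-frame alpha-blend sketch, assuming the mask is 1 where the model was allowed to paint (feathering the mask edge before blending usually helps hide the seam):

```python
import numpy as np

def recomposite(original: np.ndarray, generated: np.ndarray, mask: np.ndarray) -> np.ndarray:
    """Put the generated pixels back over the untouched source footage.

    original, generated: (frames, H, W, 3) uint8, with the generated video
    already scaled back to the source resolution.
    mask: (frames, H, W) float in [0, 1], 1 where the model was allowed to paint.
    Wherever the mask is exactly 0 the source pixels survive untouched,
    which is what the VAE round-trip alone cannot guarantee.
    """
    alpha = mask[..., None].astype(np.float32)
    blended = alpha * generated.astype(np.float32) + (1.0 - alpha) * original.astype(np.float32)
    return blended.round().astype(np.uint8)
```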
There are ways to solve it, and as you saw yesterday, somehow, sooner or later, you can get a result that works. But it requires too much time and too many attempts, at least based on what I have tested so far. What we need is an easy, fast, stable, consistent, and precisely customizable solution.
---
Today I will start re-testing VACE 2.1 and the experimental 2.2 merge to see how they compare. VACE 2.1 felt almost magical: you could feed it very complex videos with depth maps, reference frames, pose maps, and masks, all nested in a single guiding video, and with zero prompt you would get exactly what you were expecting. But its generation capabilities are too old for May 2026.
https://redd.it/1t77h3n
@rStableDiffusion
Revisiting WAN 2.2 for real-person realism, consented LoRA, retuned settings
https://redd.it/1t7cnaj
@rStableDiffusion
Z-Image Turbo for character LoRAs — honest comparison vs Flux after training the same character on both
https://redd.it/1t7g8de
@rStableDiffusion
FLUX, Open Research, and the Future of Visual AI — Stephen Batifol, Black Forest Labs
https://youtu.be/x8Yb4RidLgM?si=rRA-QvBXt4aUWu5k
https://redd.it/1t7ekxn
@rStableDiffusion
FLUX started as an image model story, but this talk makes the larger ambition clear: visual intelligence, not just image generation. From FLUX.1 through Kontext, FLUX.2, and FLUX.2 Klein, Black Forest Labs has been pushing fast, open releases while building…
AI tooling is starting to feel like PC modding culture
I think local AI setups are about to split into two completely different communities.
One side cares about actual production workflows:
* agents
* automation
* APIs
* inference efficiency
* data quality
* reproducibility
The other side mostly treats it like PC modding:
* model collecting
* benchmark screenshots
* “look how many params I run”
* endless UI tweaking
* generating the same test prompts forever
Honestly, I'm not judging either side. I just think it explains why AI discussions online feel so weird lately. Two people can both be “into local AI” and barely be talking about the same thing anymore.
https://redd.it/1t7fm79
@rStableDiffusion
Spent 3 training rounds trying to get a Jean-Léon Gérôme lora to retain fini surfaces
https://redd.it/1t7kpic
@rStableDiffusion