LoRA training: is more than 30 images helpful for a character LoRA if they cover a wide variety of actions?
Noob question, but a lot of the tutorials I read or watch say that about 30 images is good for a character LoRA.
But would something like 50 to 100 help if the character is doing a wide range of things, rather than 100 copies of the same generic portrait? At first I assumed the base model would cover generic actions, but how do I actually know how much the model has learned about, say, a person riding a bike?
Like, what if I did:
- 30 general images
- 70 actions or fringe situations (jumping jacks, running, sitting, unique poses)
Is that still too many images? I want my LoRAs to be useful beyond a bunch of portrait-style pictures, for example if someone wanted to put the character in a comic where they have to do a wide variety of things.
https://redd.it/1s87roe
@rStableDiffusion
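If you do split the dataset into a general group and an action group, one common way to weight them is a kohya-ss/sd-scripts-style dataset config, where `num_repeats` controls how often each folder is sampled per epoch. The paths and counts below are illustrative assumptions, not a recommendation:

```toml
[general]
caption_extension = ".txt"

[[datasets]]
resolution = 1024

  [[datasets.subsets]]
  image_dir = "train/general"   # ~30 portrait/general identity images
  num_repeats = 2               # sample the identity set a bit more often

  [[datasets.subsets]]
  image_dir = "train/actions"   # ~70 action/pose images
  num_repeats = 1
```

With these repeats the two groups contribute roughly equally per epoch (60 vs 70 sampled images), so the action shots don't drown out the identity set.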
LongCat-AudioDiT: High-Fidelity Diffusion Text-to-Speech in the Waveform Latent Space
>LongCat-TTS, a novel, non-autoregressive diffusion-based text-to-speech (TTS) model that achieves state-of-the-art (SOTA) performance. Unlike previous methods that rely on intermediate acoustic representations such as mel-spectrograms, the core innovation of LongCat-TTS lies in operating directly within the waveform latent space. This approach effectively mitigates compounding errors and drastically simplifies the TTS pipeline, requiring only a waveform variational autoencoder (Wav-VAE) and a diffusion backbone. Furthermore, we introduce two critical improvements to the inference process: first, we identify and rectify a long-standing training-inference mismatch; second, we replace traditional classifier-free guidance with adaptive projection guidance to elevate generation quality. Experimental results demonstrate that, despite the absence of complex multi-stage training pipelines or high-quality human-annotated datasets, LongCat-TTS achieves SOTA zero-shot voice cloning performance on the Seed benchmark while maintaining competitive intelligibility. Specifically, our largest variant, LongCat-TTS-3.5B, outperforms the previous SOTA model (Seed-TTS), improving the speaker similarity (SIM) scores from 0.809 to 0.818 on Seed-ZH, and from 0.776 to 0.797 on Seed-Hard. Finally, through comprehensive ablation studies and systematic analysis, we validate the effectiveness of our proposed modules. Notably, we investigate the interplay between the Wav-VAE and the TTS backbone, revealing the counterintuitive finding that superior reconstruction fidelity in the Wav-VAE does not necessarily lead to better overall TTS performance. Code and model weights are released to foster further research within the speech community.
https://huggingface.co/meituan-longcat/LongCat-AudioDiT-3.5B
https://huggingface.co/meituan-longcat/LongCat-AudioDiT-1B
https://github.com/meituan-longcat/LongCat-AudioDiT
ComfyUI: https://github.com/Saganaki22/ComfyUI-LongCat-AudioDIT-TTS
Models are auto-downloaded from HuggingFace on first use:
[meituan-longcat/LongCat-AudioDiT-1B](https://huggingface.co/meituan-longcat/LongCat-AudioDiT-1B) — 1B params model
[meituan-longcat/LongCat-AudioDiT-3.5B](https://huggingface.co/meituan-longcat/LongCat-AudioDiT-3.5B) — original FP32 model
[drbaph/LongCat-AudioDiT-3.5B-bf16](https://huggingface.co/drbaph/LongCat-AudioDiT-3.5B-bf16) — BF16 quantized
drbaph/LongCat-AudioDiT-3.5B-fp8 — FP8 quantized
https://redd.it/1s89p16
@rStableDiffusion
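The abstract's "adaptive projection guidance" isn't spelled out here, but the classifier-free guidance it replaces has a well-known form: the model is run with and without conditioning, and the conditional prediction is extrapolated away from the unconditional one by the guidance scale. A minimal sketch of that baseline (list-based for clarity; real implementations operate on tensors):

```python
def cfg_combine(uncond, cond, scale):
    """Classifier-free guidance: extrapolate the conditional prediction
    away from the unconditional one by the guidance scale."""
    return [u + scale * (c - u) for u, c in zip(uncond, cond)]

uncond = [0.0, 1.0, 2.0]   # prediction without the text condition
cond = [1.0, 1.0, 1.0]     # prediction with the text condition
print(cfg_combine(uncond, cond, 2.0))  # -> [2.0, 1.0, 0.0]
```

At `scale = 1.0` this reduces to the plain conditional prediction; larger scales trade diversity for adherence to the condition.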
Open-source tool for running full-precision models on 16GB GPUs — compressed GPU memory paging for ComfyUI
If you've ever wished you could run the full FP16 model instead of GGUF Q4 on your 16GB card, this might help. It compresses weights for the PCIe transfer and decompresses them on the GPU. Tested on Wan 2.2 14B; works with LoRAs.
Not useful if GGUF Q4 already gives you the quality you need, since Q4 is faster. But if you want higher fidelity on limited hardware, this is a new option.
https://github.com/willjriley/vram-pager
https://redd.it/1s8cjb9
@rStableDiffusion
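The approach described in the post is to compress weights on the host, ship the smaller buffer over PCIe, and decompress on the GPU. The actual project presumably uses a GPU-side codec; the toy sketch below only illustrates the lossless round-trip idea, using zlib on the CPU with made-up weights:

```python
import struct
import zlib

# Toy weight tensor, packed as raw float64 bytes.
weights = [0.1 * i for i in range(1024)]
raw = struct.pack(f"{len(weights)}d", *weights)

# Host side: compress before the (simulated) PCIe transfer.
# level=1 favors speed, which matters when paging during inference.
compressed = zlib.compress(raw, level=1)

# Device side: decompress and unpack; the round-trip is lossless.
restored = list(struct.unpack(f"{len(weights)}d", zlib.decompress(compressed)))

assert restored == weights
print(f"transfer size: {len(compressed)} vs {len(raw)} bytes")
```

Because the compression is lossless, quality is identical to running the uncompressed FP16 model; the trade-off is only compression/decompression time against reduced transfer volume.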
GitHub: willjriley/vram-pager — Compressed GPU Memory Paging for Diffusion & Video Models, 3.4x faster inference on consumer GPUs
Use Qwen3.5 as an AI Assistant, Captioner, or Image Analyzer inside of ComfyUI!
https://huggingface.co/Winnougan/Qwen-3.5-Abliterated-Comfyui-nvfp4
https://redd.it/1s8jhyj
@rStableDiffusion