Segment Anything (SAM) ControlNet for Z-Image
https://huggingface.co/neuralvfx/Z-Image-SAM-ControlNet
https://redd.it/1s7r1ly
@rStableDiffusion
What are your thoughts on LTX 2.3 now?
In my personal experience, it's a big improvement over the previous version: prompt following is far better, sound is far better, and there are fewer unprompted sounds and music.
I2V is still pretty hit and miss, keeping only about 30% likeness to the original source image. Any movement that isn't talking causes the model to fall apart and produce body horror, and I'm finding myself throwing away more gens due to just terrible results.
It's great for talking heads in my opinion, but I've gone back to Wan 2.2 for now. Hopefully LTX can improve the movement and animation in coming updates.
What are your thoughts on the model so far?
https://redd.it/1s7srxg
@rStableDiffusion
Do you use LLMs to expand on your prompts?
I've just switched to Klein 9B, and I've been told that it handles extremely detailed prompts very well.
So I tried to install the Human Detail LLM today to let it expand on my prompts, and failed miserably at setting it up. Now I'm wondering if it's worth the frustration.
Maybe there's a better option than the Human Detail LLM anyway? Maybe even Gemini can do the job well enough? Or maybe it's all hype and not worth spending time on?
I'd love to hear your opinions and tips on the topic.
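Whichever model you settle on, the mechanics are simple: prompt expansion is one chat call with a system instruction. A minimal sketch, assuming a local OpenAI-compatible server; the endpoint URL, model name, and system prompt below are placeholders, not anything from the post:

```python
# Minimal prompt-expansion sketch against a local OpenAI-compatible endpoint
# (e.g. whatever llama.cpp / LM Studio-style servers expose).
# base_url, model name, and the system instruction are placeholders.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:1234/v1", api_key="not-needed")

SYSTEM = (
    "You expand short image prompts into detailed ones. "
    "Describe subject, setting, lighting, camera, and style. "
    "Return only the expanded prompt."
)

def expand(prompt: str) -> str:
    resp = client.chat.completions.create(
        model="local-model",  # whichever model your server serves
        messages=[
            {"role": "system", "content": SYSTEM},
            {"role": "user", "content": prompt},
        ],
        temperature=0.7,
    )
    return resp.choices[0].message.content.strip()

print(expand("a knight resting by a campfire at night"))
```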
https://redd.it/1s7zcw2
@rStableDiffusion
Is there a list of AI services that advertise with fake posts and comments? Should one be made?
I think those services should be boycotted as a whole, because lying does no good for the AI community.
I just answered a post today asking for help, and it was another planted ad for some scam service (a scam because they lie to get customers).
Edit: Downvotes... Sorry for stepping on your business, but it's about morals.
https://redd.it/1s844x8
@rStableDiffusion
Mugen - Modernized Anime SDXL Base, or how to make Bluvoll a tiny bit less sane
https://redd.it/1s86i0v
@rStableDiffusion
LoRA training: is more than 30 images for a character LoRA helpful if it's a wide variety of actions?
Noob question, but a lot of the tutorials I read or watch mention that about 30 images is good for a character LoRA.
However, would something like 50 to 100 images be helpful if the character is doing a wide range of things, rather than 100 of the same generic portrait? I thought at first the base model would cover generic actions, but how do I know how much the model actually learned about, say, a person riding a bike?
Like, what if I did:
- 30 general images
- 70 action or fringe situations (jumping jacks, running, sitting, unique poses)
Is that still too many images? I guess I want my LoRAs to be useful beyond a bunch of portrait-style pictures, like if the user wanted the character in a comic and it had to do a wide variety of things.
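One way to think about the 30 + 70 split: total image count matters less than how often each subset is seen per epoch, and that can be rebalanced with per-folder repeat counts (the mechanism kohya-style trainers expose). A toy sketch of the arithmetic; the 30/70 split mirrors the question, while the 50/50 target share and ~200 effective images per epoch are illustrative assumptions, not recommendations:

```python
# Toy arithmetic: balance how often each subset is seen per epoch by
# assigning per-folder repeat counts. The 30/70 split comes from the
# question; the 50/50 target share and ~200 effective images per epoch
# are illustrative choices, not rules.

subsets = {"general": 30, "actions": 70}        # images on disk
target_share = {"general": 0.5, "actions": 0.5}  # desired share of each epoch
effective_per_epoch = 200                        # rough total images seen per epoch

for name, count in subsets.items():
    repeats = max(1, round(effective_per_epoch * target_share[name] / count))
    print(f"{name}: {count} images x {repeats} repeats "
          f"= {count * repeats} effective images/epoch")

# Prints:
# general: 30 images x 3 repeats = 90 effective images/epoch
# actions: 70 images x 1 repeats = 70 effective images/epoch
```

The point being: a larger, more varied dataset isn't "too many images" as long as the small general subset isn't drowned out, which repeats can compensate for.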
https://redd.it/1s87roe
@rStableDiffusion
LongCat-AudioDiT: High-Fidelity Diffusion Text-to-Speech in the Waveform Latent Space
> LongCat-TTS is a novel, non-autoregressive diffusion-based text-to-speech (TTS) model that achieves state-of-the-art (SOTA) performance. Unlike previous methods that rely on intermediate acoustic representations such as mel-spectrograms, the core innovation of LongCat-TTS lies in operating directly within the waveform latent space. This approach effectively mitigates compounding errors and drastically simplifies the TTS pipeline, requiring only a waveform variational autoencoder (Wav-VAE) and a diffusion backbone. Furthermore, we introduce two critical improvements to the inference process: first, we identify and rectify a long-standing training-inference mismatch; second, we replace traditional classifier-free guidance with adaptive projection guidance to elevate generation quality. Experimental results demonstrate that, despite the absence of complex multi-stage training pipelines or high-quality human-annotated datasets, LongCat-TTS achieves SOTA zero-shot voice cloning performance on the Seed benchmark while maintaining competitive intelligibility. Specifically, our largest variant, LongCat-TTS-3.5B, outperforms the previous SOTA model (Seed-TTS), improving the speaker similarity (SIM) scores from 0.809 to 0.818 on Seed-ZH, and from 0.776 to 0.797 on Seed-Hard. Finally, through comprehensive ablation studies and systematic analysis, we validate the effectiveness of our proposed modules. Notably, we investigate the interplay between the Wav-VAE and the TTS backbone, revealing the counterintuitive finding that superior reconstruction fidelity in the Wav-VAE does not necessarily lead to better overall TTS performance. Code and model weights are released to foster further research within the speech community.
https://huggingface.co/meituan-longcat/LongCat-AudioDiT-3.5B
https://huggingface.co/meituan-longcat/LongCat-AudioDiT-1B
https://github.com/meituan-longcat/LongCat-AudioDiT
ComfyUI: https://github.com/Saganaki22/ComfyUI-LongCat-AudioDIT-TTS
Models are auto-downloaded from HuggingFace on first use:
- [meituan-longcat/LongCat-AudioDiT-1B](https://huggingface.co/meituan-longcat/LongCat-AudioDiT-1B) — 1B-param model
- [meituan-longcat/LongCat-AudioDiT-3.5B](https://huggingface.co/meituan-longcat/LongCat-AudioDiT-3.5B) — original FP32 model
- [drbaph/LongCat-AudioDiT-3.5B-bf16](https://huggingface.co/drbaph/LongCat-AudioDiT-3.5B-bf16) — BF16 quantized
- drbaph/LongCat-AudioDiT-3.5B-fp8 — FP8 quantized
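If you'd rather pre-fetch the weights than rely on the node's first-use auto-download, huggingface_hub can pull the same repos; the local directory below is just an example path, not something the node requires:

```python
# Pre-download one of the LongCat-AudioDiT repos instead of waiting for
# the ComfyUI node's first-use auto-download. local_dir is an example
# path; point it wherever your setup keeps model weights.
from huggingface_hub import snapshot_download

snapshot_download(
    repo_id="meituan-longcat/LongCat-AudioDiT-1B",
    local_dir="models/longcat-audiodit-1b",
)
```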
https://redd.it/1s89p16
@rStableDiffusion
Open-source tool for running full-precision models on 16GB GPUs — compressed GPU memory paging for ComfyUI
If you've ever wished you could run the full FP16 model instead of GGUF Q4 on your 16GB card, this might help. It compresses weights for the PCIe transfer and decompresses them on the GPU. Tested on Wan 2.2 14B, and it works with LoRAs.
Not useful if GGUF Q4 already gives you the quality you need, since Q4 is faster. But if you want higher fidelity on limited hardware, this is a new option.
https://github.com/willjriley/vram-pager
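The underlying idea (shrink the bytes that cross PCIe, then expand them on the device) can be sketched in a few lines. The real tool uses its own format and decompresses on the GPU; this toy version decompresses on the CPU purely to show the paging flow:

```python
# Toy illustration of compressed weight paging: keep a layer's weights
# compressed in host RAM and only expand + upload them when the layer runs.
# vram-pager decompresses on the GPU; this sketch decompresses on the CPU
# just to show the page-out / page-in flow.
import zlib
import torch

def page_out(weight: torch.Tensor) -> tuple[bytes, torch.Size, torch.dtype]:
    """Compress a weight tensor into host memory."""
    data = zlib.compress(weight.contiguous().cpu().numpy().tobytes(), level=1)
    return data, weight.shape, weight.dtype

def page_in(data: bytes, shape: torch.Size, dtype: torch.dtype,
            device: str = "cuda") -> torch.Tensor:
    """Decompress and move the weight back onto the GPU on demand."""
    raw = zlib.decompress(data)
    return torch.frombuffer(bytearray(raw), dtype=dtype).reshape(shape).to(device)

w = torch.randn(4096, 4096, dtype=torch.float16)
blob, shape, dtype = page_out(w)
# Random noise barely compresses; real model weights typically do better.
print(f"{w.numel() * w.element_size() / 1e6:.1f} MB -> {len(blob) / 1e6:.1f} MB compressed")
# w_gpu = page_in(blob, shape, dtype)  # uncomment on a CUDA machine
```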
https://redd.it/1s8cjb9
@rStableDiffusion
Use Qwen3.5 as an AI Assistant, Captioner, or Image Analyzer inside ComfyUI!
https://huggingface.co/Winnougan/Qwen-3.5-Abliterated-Comfyui-nvfp4
https://redd.it/1s8jhyj
@rStableDiffusion