Chatterbox-TTS fork updated to include Voice Conversion, per-generation JSON settings export, and more.

After seeing this community post here:
https://www.reddit.com/r/StableDiffusion/comments/1ldn88o/chatterbox_audiobook_and_podcast_studio_all_local/

And this other community post:
https://www.reddit.com/r/StableDiffusion/comments/1ldu8sf/video_guide_how_to_sync_chatterbox_tts_with/

Here is my latest updated fork of Chatterbox-TTS.
NEW FEATURES:
It remembers your last settings and they will be reloaded when you restart the script.

Saves a JSON file for each audio generation containing all of your configuration data, including the seed. When you want to reuse the same settings for other generations, load that JSON file into the upload/drag-and-drop box and every setting it contains will be applied automatically.
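A minimal sketch of that save/reload round trip. The field names below are illustrative, not the fork's actual schema:

```python
import json
from pathlib import Path

# Hypothetical per-generation settings -- keys are assumptions,
# not the fork's real export format.
settings = {
    "seed": 42,
    "temperature": 0.8,
    "exaggeration": 0.5,
    "cfg_weight": 0.5,
    "reference_audio": "voices/sample.wav",
}

def save_settings(path: str, cfg: dict) -> None:
    """Write one generation's settings next to its audio file."""
    Path(path).write_text(json.dumps(cfg, indent=2))

def load_settings(path: str) -> dict:
    """Reload a previously exported settings file."""
    return json.loads(Path(path).read_text())

save_settings("gen_0001.json", settings)
restored = load_settings("gen_0001.json")
assert restored == settings  # same seed -> same generation can be reproduced
```

Because the seed is part of the file, loading it back should let you reproduce a generation exactly, then tweak one setting at a time.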

You can now select an alternate Whisper sync validation model (faster-whisper) for faster validation and lower VRAM use. For example, with the largest models: large (~10–13 GB OpenAI / ~4.5–6.5 GB faster-whisper).

Added the VOICE CONVERSION feature that some had asked for, which is already included in the original repo. You can record yourself saying whatever you like, then take another voice and convert yours to theirs saying the same thing in the same way: same intonation, timing, etc.

|Category|Features|
|:-|:-|
|Input|Text, multi-file upload, reference audio, load/save settings|
|Output|WAV/MP3/FLAC, per-gen .json/.csv settings, downloadable & previewable in UI|
|Generation|Multi-gen, multi-candidate, random/fixed seed, voice conditioning|
|Batching|Sentence batching, smart merge, parallel chunk processing, split by punctuation/length|
|Text Preproc|Lowercase, spacing normalization, dot-letter fix, inline ref number removal, sound word edit|
|Audio Postproc|Auto-editor silence trim, threshold/margin, keep original, normalization (ebu/peak)|
|Whisper Sync|Model selection, faster-whisper, bypass, per-chunk validation, retry logic|
|Voice Conversion|Input+target voice, watermark disabled, chunked processing, crossfade, WAV output|
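The "chunked processing, crossfade" entry in the table means long inputs are converted in pieces and blended at the seams. A minimal sketch of a linear crossfade between two chunks of samples (plain lists here; a real pipeline would use audio arrays):

```python
def crossfade(a, b, overlap):
    """Blend the tail of chunk `a` into the head of chunk `b`.

    a, b    : lists of float samples (two adjacent converted chunks)
    overlap : number of samples linearly blended at the seam
    """
    assert 0 < overlap <= len(a) and overlap <= len(b)
    head = a[:-overlap]          # untouched part of the first chunk
    tail = b[overlap:]           # untouched part of the second chunk
    # Ramp a down from 1 -> 0 while ramping b up from 0 -> 1.
    blended = [
        a[len(a) - overlap + i] * (1 - i / overlap) + b[i] * (i / overlap)
        for i in range(overlap)
    ]
    return head + blended + tail

# Two constant chunks: the seam ramps smoothly instead of clicking.
out = crossfade([1.0] * 8, [0.0] * 8, 4)
# len(out) == 12 (8 + 8 - 4 overlapped samples)
```

The blend removes the audible click you would otherwise get at each chunk boundary.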

https://redd.it/1le0194
@rStableDiffusion
NVidia Cosmos Predict2! New txt2img model at 2B and 14B!

ComfyUI Guide for local use

https://docs.comfy.org/tutorials/image/cosmos/cosmos-predict2-t2i

This model just dropped out of the blue, and I have run a few tests:

1) SPEED TEST on an RTX 3090 @ 1MP (unless indicated otherwise)

|Model|Resolution|Speed|
|:-|:-|:-|
|FLUX.1-Dev FP16|1MP|1.45 sec/it|
|Cosmos Predict2 2B|1MP & 1.5MP|1.2 sec/it|
|Cosmos Predict2 2B|2MP|1.8 sec/it|
|HiDream Full FP16|1MP|4.5 sec/it|
|Cosmos Predict2 14B|1MP|4.9 sec/it|
|Cosmos Predict2 14B|1.5MP|7.7 sec/it|
|Cosmos Predict2 14B|2MP|10.65 sec/it|

The thing to note here is that the 2B model produces images at an impressive speed even @ 2MP, while the 14B one becomes atrociously slow.
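To turn the per-iteration figures above into wall-clock time, multiply by the step count. A quick sketch assuming a 20-step sampler (the post does not state the step count used):

```python
# Seconds per iteration measured above, at 2MP.
S_IT_2B_2MP = 1.8
S_IT_14B_2MP = 10.65

def wall_clock(s_per_it: float, steps: int) -> float:
    """Total sampling time in seconds (ignores model load and VAE decode)."""
    return s_per_it * steps

steps = 20  # assumed, for illustration only
t_2b = wall_clock(S_IT_2B_2MP, steps)    # 36 s
t_14b = wall_clock(S_IT_14B_2MP, steps)  # 213 s
print(f"2B: {t_2b:.0f}s  14B: {t_14b:.0f}s  ratio: {t_14b / t_2b:.1f}x")
```

At those speeds the 14B model takes roughly 6x longer per 2MP image than the 2B model.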

Prompt: A photograph of a Russian woman with natural blue eyes and blonde hair is walking on the beach at dusk while wearing a red bikini. She is making the peace sign with one hand and winking.

2B Model

14B Model

2) PROMPT TEST:

Prompt: An ethereal elven woman stands poised in a vibrant springtime valley, draped in an ornate, skimpy armor adorned with one magical gemstone embedded in its chest. A regal cloak flows behind her, lined with pristine white fur at the neck, adding to her striking presence. She wields a mystical spear pulsating with arcane energy, its luminous aura casting shifting colors across the landscape. Western Anime Style

2B Model

Prompt: A muscled Orc stands poised in a springtime valley, draped in an ornate, leather armor adorned with a small animal skulls. A regal black cloak flows behind him, lined with matted brown fur at the neck, adding to his menacing presence. He wields a rustic large Axe with both hands

2B Model

14B Model

Prompt: A massive spaceship glides silently through the void, approaching the curvature of a distant planet. Its sleek metallic hull reflects the light of a distant star as it prepares for orbital entry. The ship’s thrusters emit a faint, glowing trail, creating a mesmerizing contrast against the deep, inky blackness of space. Wisps of atmospheric haze swirl around its edges as it crosses into the planet’s gravitational pull, the moment captured in a cinematic, hyper-realistic style, emphasizing the grand scale and futuristic elegance of the vessel.

2B Model

Prompt: Under the soft pink canopy of a blooming Sakura tree, a man and a woman stand together, immersed in an intimate exchange. The gentle breeze stirs the delicate petals, causing a flurry of blossoms to drift around them like falling snow. The man, dressed in elegant yet casual attire, gazes at the woman with a warm, knowing smile, while she responds with a shy, delighted laugh, her long hair catching the light. Their interaction is subtle yet deeply expressive—an unspoken understanding conveyed through fleeting touches and lingering glances. The setting is painted in a dreamy, semi-realistic style, emphasizing the poetic beauty of the moment, where nature and emotion intertwine in perfect harmony.

2B Model

PERSONAL CONCLUSIONS FROM THE (PRELIMINARY) TEST:

Cosmos-Predict2-2B-Text2Image: a bit weak at understanding styles (maybe it was not trained on them?), but relatively fast even at 2MP, and with good prompt adherence (I'll have to test more).

Cosmos-Predict2-14B-Text2Image doesn't seem to be "better" at first glance than its 2B "mini-me", and it is HiDream-sloooow.

Also, it has a text-to-video brother! But I am not testing it here yet.

The MEME:

Just don't prompt a woman laying on the grass!

Prompt: Photograph of a woman laying on the grass and eating a banana

https://preview.redd.it/9qipubalok7f1.jpg?width=1088&format=pjpg&auto=webp&s=3b7502d820964911e1ec807713ef3014d3d0a417

https://redd.it/1le28bw
@rStableDiffusion
I'm desperate, please help me understand LoRA training

Hello, two weeks ago I created my own realistic AI model (an "influencer"). Since then, I've trained about 8 LoRAs and none of them are good. The only LoRA that gives me the face I want is unable to give me any hairstyles other than those in the training pictures. So I obviously tried to train another one with better pictures, more hairstyles, emotions, shots from every angle; I had about 150 pictures, and it's complete bulls*it. The face resembles her maybe 4 out of 10 times.

Since I'm completely new to the AI world, I've used ChatGPT for everything, and it told me the more pics, the better for training. What I've noticed, though, is that content creators on YouTube usually use only about 20–30 pics, so I'm now confused.

At this point I don't even care if it's Flux or SDXL; I have programs for both. Can someone please give me a definite answer on how many training pics I need? And do I train only the face, or the body too? Or should it be done separately, in two LoRAs?

Thank you so much🙈🙈❤️

https://redd.it/1le961p
@rStableDiffusion
Let's Benchmark! Your GPU against others - Wan Edition

Welcome to Let's Benchmark! Your GPU against others, where we share our generation times to see if we are on the right track compared to the rest of the community!

To do that, please always include at least the following (mine for reference):

Generation time : 4:01min
GPU : RTX 3090 24GB VRAM
RAM : 128GB
Model : Wan2.1 14B 720P GGUF Q8
Speedup Lora(s) : Kijai Self Forcing 14B (https://huggingface.co/Kijai/WanVideo_comfy/blob/main/Wan21_T2V_14B_lightx2v_cfg_step_distill_lora_rank32.safetensors)
Steps : 4
Frames : 81 (5sec video)
Resolution : 720x1280
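Since posts will report different frame counts and clip lengths, normalizing the reported time to seconds per frame makes runs comparable. A small sketch using the reference run above:

```python
def seconds_per_frame(minutes: int, seconds: int, frames: int) -> float:
    """Normalize a reported generation time to seconds per frame,
    so runs with different frame counts can be compared fairly."""
    return (minutes * 60 + seconds) / frames

# The reference run above: 4:01 for 81 frames at 720x1280.
spf = seconds_per_frame(4, 1, 81)
print(f"{spf:.2f} s/frame")  # ~2.98 s/frame
```

Resolution still matters, of course, so comparisons are most meaningful between runs at the same resolution and step count.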

I think I'm average, but I'm not sure! That's why I'm creating this post, so everyone can compare and share together!

https://redd.it/1lee9sh
@rStableDiffusion
Qwen2VL-Flux ControlNet has been available since Nov 2024, but most people missed it. Fully compatible with Flux Dev and ComfyUI. Works with Depth and Canny (kinda works with Tile and Realistic Lineart).

https://redd.it/1lefv07
@rStableDiffusion
What is the best video upscaler besides Topaz?

Based on my research, Topaz seems to be the best video upscaler currently. It has been around for several years now, and I am wondering why no newcomer with better quality has appeared yet.

Is your experience the same with video upscaler software, and what is the best open-source video upscaler?

https://redd.it/1ledzsc
@rStableDiffusion
Which UI is better: ComfyUI, Automatic1111, or Forge?

I'm going to start working with AI soon, and I'd like to know which one is the most recommended.

https://redd.it/1lekbm7
@rStableDiffusion
Sources vs Output Comparison: trying to use 3D references from Blender, some with camera motion, to see if I can control the output

https://redd.it/1lensll
@rStableDiffusion