Chatterbox-TTS fork updated to include Voice Conversion, per-generation JSON settings export, and more.
After seeing this community post here:
https://www.reddit.com/r/StableDiffusion/comments/1ldn88o/chatterbox_audiobook_and_podcast_studio_all_local/
And this other community post:
https://www.reddit.com/r/StableDiffusion/comments/1ldu8sf/video_guide_how_to_sync_chatterbox_tts_with/
Here is my latest updated fork of Chatterbox-TTS.
NEW FEATURES:
It remembers your last settings, and they will be reloaded when you restart the script.
It saves a JSON file for each audio generation containing all of your configuration data, including the seed. When you want to reuse the same settings for other generations, load that JSON file into the JSON upload/drag-and-drop box and every setting it contains is applied automatically (see the sketch after this list).
You can now select an alternate Whisper sync-validation model (faster-whisper) for faster validation and lower VRAM use. For example, with the largest models: large (~10–13 GB OpenAI / ~4.5–6.5 GB faster-whisper). A validation sketch follows the feature table below.
Added the VOICE CONVERSION feature that some had asked for, which is already included in the original repo: record yourself saying whatever you like, then take another voice and convert yours to theirs saying the same thing in the same way, with the same intonation, timing, etc.
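Below is a minimal sketch of what that per-generation settings round trip could look like. The key names (seed, temperature, exaggeration, reference_audio) are illustrative assumptions, not the fork's actual schema:

```python
# Hypothetical sketch: export the settings used for a generation, then
# reload them later to reproduce it. Key names are illustrative only;
# the fork's real JSON schema may differ.
import json

def save_generation_settings(path, settings):
    """Write all generation parameters (including the seed) to a JSON file."""
    with open(path, "w", encoding="utf-8") as f:
        json.dump(settings, f, indent=2)

def load_generation_settings(path):
    """Read a previously exported settings file so it can be re-applied."""
    with open(path, encoding="utf-8") as f:
        return json.load(f)

settings = {
    "seed": 1234567,                           # fixed seed for reproducibility
    "temperature": 0.8,                        # assumed sampling parameter
    "exaggeration": 0.5,                       # assumed expressiveness knob
    "reference_audio": "voices/narrator.wav",  # assumed voice-conditioning input
}
save_generation_settings("gen_0001.json", settings)
assert load_generation_settings("gen_0001.json") == settings
```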
|Category|Features|
|:-|:-|
|Input|Text, multi-file upload, reference audio, load/save settings|
|Output|WAV/MP3/FLAC, per-gen .json/.csv settings, downloadable & previewable in UI|
|Generation|Multi-gen, multi-candidate, random/fixed seed, voice conditioning|
|Batching|Sentence batching, smart merge, parallel chunk processing, split by punctuation/length|
|Text Preproc|Lowercase, spacing normalization, dot-letter fix, inline ref number removal, sound word edit|
|Audio Postproc|Auto-editor silence trim, threshold/margin, keep original, normalization (ebu/peak)|
|Whisper Sync|Model selection, faster-whisper, bypass, per-chunk validation, retry logic|
|Voice Conversion|Input+target voice, watermark disabled, chunked processing, crossfade, WAV output|
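For the Whisper Sync row above, here is a rough sketch of how a per-chunk validation pass with faster-whisper might work. The similarity threshold and retry policy are assumptions for illustration, not the fork's exact logic, and regenerate_chunk is a hypothetical helper:

```python
# Sketch of per-chunk sync validation with faster-whisper
# (pip install faster-whisper). Transcribe each generated chunk and
# compare it to the intended text; retry when the match is too weak.
from difflib import SequenceMatcher
from faster_whisper import WhisperModel

model = WhisperModel("large-v3", device="cuda", compute_type="float16")

def chunk_matches(wav_path: str, expected: str, threshold: float = 0.85) -> bool:
    segments, _info = model.transcribe(wav_path)
    heard = " ".join(seg.text.strip() for seg in segments).lower()
    return SequenceMatcher(None, heard, expected.lower()).ratio() >= threshold

# Assumed retry loop; regenerate_chunk() is hypothetical, standing in for
# whatever the actual TTS call is.
# for attempt in range(3):
#     wav = regenerate_chunk(text_chunk, seed=base_seed + attempt)
#     if chunk_matches(wav, text_chunk):
#         break
```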
https://redd.it/1le0194
@rStableDiffusion
NVIDIA Cosmos Predict2! New txt2img model at 2B and 14B!
ComfyUI Guide for local use
https://docs.comfy.org/tutorials/image/cosmos/cosmos-predict2-t2i
This model just dropped out of the blue, and I have been running a few tests:
1) SPEED TEST on a RTX 3090 @ 1MP (unless indicated otherwise)
FLUX.1-Dev FP16 = 1.45 sec/it
Cosmos Predict2 2B = 1.2 sec/it @ 1MP & 1.5MP
Cosmos Predict2 2B = 1.8 sec/it @ 2MP
HiDream Full FP16 = 4.5 sec/it
Cosmos Predict2 14B = 4.9 sec/it
Cosmos Predict2 14B = 7.7 sec/it @ 1.5MP
Cosmos Predict2 14B = 10.65 sec/it @ 2MP
The thing to note here is that the 2B model can produce images at an impressive speed even @ 2MP, while the 14B one becomes atrociously slow.
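To put those per-iteration numbers into wall-clock terms, here is a quick back-of-the-envelope conversion; the 30-step count is an assumption for illustration, not the setting used in the test:

```python
# Convert sec/iteration into total sampling time, assuming 30 steps
# (illustrative step count only).
timings = {"Cosmos 2B @ 2MP": 1.8, "Cosmos 14B @ 2MP": 10.65}
steps = 30
for name, sec_per_it in timings.items():
    total = sec_per_it * steps
    print(f"{name}: {total:.0f} s (~{total / 60:.1f} min)")
# Cosmos 2B @ 2MP: 54 s (~0.9 min)
# Cosmos 14B @ 2MP: 320 s (~5.3 min)
```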
Prompt: A Photograph of a russian woman with natural blue eyes and blonde hair is walking on the beach at dusk while wearing a red bikini. She is making the peace sign with one hand and winking
2B Model
14B Model
2) PROMPT TEST:
Prompt: An ethereal elven woman stands poised in a vibrant springtime valley, draped in an ornate, skimpy armor adorned with one magical gemstone embedded in its chest. A regal cloak flows behind her, lined with pristine white fur at the neck, adding to her striking presence. She wields a mystical spear pulsating with arcane energy, its luminous aura casting shifting colors across the landscape. Western Anime Style
2B Model
Prompt: A muscled Orc stands poised in a springtime valley, draped in an ornate, leather armor adorned with a small animal skulls. A regal black cloak flows behind him, lined with matted brown fur at the neck, adding to his menacing presence. He wields a rustic large Axe with both hands
2B Model
14B Model
Prompt: A massive spaceship glides silently through the void, approaching the curvature of a distant planet. Its sleek metallic hull reflects the light of a distant star as it prepares for orbital entry. The ship’s thrusters emit a faint, glowing trail, creating a mesmerizing contrast against the deep, inky blackness of space. Wisps of atmospheric haze swirl around its edges as it crosses into the planet’s gravitational pull, the moment captured in a cinematic, hyper-realistic style, emphasizing the grand scale and futuristic elegance of the vessel.
2B Model
Prompt: Under the soft pink canopy of a blooming Sakura tree, a man and a woman stand together, immersed in an intimate exchange. The gentle breeze stirs the delicate petals, causing a flurry of blossoms to drift around them like falling snow. The man, dressed in elegant yet casual attire, gazes at the woman with a warm, knowing smile, while she responds with a shy, delighted laugh, her long hair catching the light. Their interaction is subtle yet deeply expressive—an unspoken understanding conveyed through fleeting touches and lingering glances. The setting is painted in a dreamy, semi-realistic style, emphasizing the poetic beauty of the moment, where nature and emotion intertwine in perfect harmony.
2B Model
PERSONAL CONCLUSIONS FROM THE (PRELIMINARY) TEST:
Cosmos-Predict2-2B-Text2Image: a bit weak at understanding styles (maybe it was not trained on them?), but relatively fast even at 2MP and with good prompt adherence (I'll have to test more).
Cosmos-Predict2-14B-Text2Image doesn't seem to be "better" at first glance than its 2B "mini-me", and it is as slow as HiDream.
Also, it has a text-to-video brother, but I am not testing that here yet.
The MEME:
Just don't prompt a woman laying on the grass!
Prompt: Photograph of a woman laying on the grass and eating a banana
https://preview.redd.it/9qipubalok7f1.jpg?width=1088&format=pjpg&auto=webp&s=3b7502d820964911e1ec807713ef3014d3d0a417
https://redd.it/1le28bw
@rStableDiffusion
I'm desperate, please help me understand LoRA training
Hello, 2 weeks ago I created my own realistic AI model ("influencer"). Since then, I've trained about 8 LoRAs and none of them are good. The only LoRA that gives me the face I want is unable to give me any hairstyles other than those in the training pictures. So I obviously tried to train another one with better pictures, more hairstyles, emotions, shots from every angle; I had about 150 pictures, and it's complete bulls*it. The face resembles her maybe 4 out of 10 times.
Since I'm completely new to the AI world, I've used ChatGPT for everything, and it told me the more pics, the better for training. What I've noticed, though, is that content creators on YouTube usually use only about 20-30 pics, so I'm now confused.
At this point I don't even care if it's Flux or SDXL, I have programs for both, but can someone please give me a definite answer on how many training pics I need? And do I train only the face or also the body? Or should it be done separately in 2 LoRAs?
Thank you so much🙈🙈❤️
https://redd.it/1le961p
@rStableDiffusion
Let's Benchmark! Your GPU against others - Wan Edition
Welcome to Let's Benchmark! Your GPU against others - where we share our generation times to see if we are on the right track compared to others in the community!
To do that, please always include at least the following (mine for reference):
Generation time: 4:01 min
GPU: RTX 3090 24GB VRAM
RAM: 128GB
Model: Wan2.1 14B 720P GGUF Q8
Speedup LoRA(s): Kijai Self Forcing 14B (https://huggingface.co/Kijai/WanVideo_comfy/blob/main/Wan21_T2V_14B_lightx2v_cfg_step_distill_lora_rank32.safetensors)
Steps: 4
Frames: 81 (5 sec video)
Resolution: 720x1280
I think I'm average, but I'm not sure! That's why I'm creating this post, so everyone can compare and share together!
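One way to compare runs across different resolutions, frame counts, and step counts is to normalize to seconds per step per megapixel-frame. This metric is my own suggestion for reading the numbers, not part of the template above:

```python
# Normalize a run to seconds per step per megapixel-frame so runs with
# different resolutions, frame counts, and step counts can be compared.
def norm_cost(total_sec: float, steps: int, frames: int, w: int, h: int) -> float:
    mp_frames = frames * (w * h) / 1e6  # total megapixel-frames generated
    return total_sec / (steps * mp_frames)

# Reference run above: 4:01 min = 241 s, 4 steps, 81 frames at 720x1280.
print(f"{norm_cost(241, 4, 81, 720, 1280):.3f} s per step per MP-frame")
# -> 0.807 s per step per MP-frame
```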
https://redd.it/1lee9sh
@rStableDiffusion
Krea co-founder is considering open-sourcing their new model trained in collaboration with Black Forest Labs - maybe go there and leave an encouraging comment?
https://preview.redd.it/j6qshjdiao7f1.jpg?width=1182&format=pjpg&auto=webp&s=9f5da751e086c7c3a8cd882f5b7648211daae50c
https://reddit.com/link/1leexi9/video/bs096nikao7f1/player
Link to the post: https://x.com/viccpoes/status/1934983545233277428
https://redd.it/1leexi9
@rStableDiffusion
Flux Uncensored in ComfyUI | Master Full Body & Ultra-Realistic AI Workflow
https://youtu.be/N7GbJ97vJow
https://redd.it/1leh4tm
@rStableDiffusion
Qwen2VL-Flux ControlNet has been available since Nov 2024, but most people missed it. Fully compatible with Flux Dev and ComfyUI. Works with Depth and Canny (kinda works with Tile and Realistic Lineart)
https://redd.it/1lefv07
@rStableDiffusion
What is the best video upscaler besides Topaz?
Based on my research, Topaz seems to be the best video upscaler currently available. Topaz has been around for several years now, and I am wondering why no newcomer has appeared with better quality.
Is your experience with video upscaler software the same, and what is the best open-source video upscaler?
https://redd.it/1ledzsc
@rStableDiffusion
Which UI is better: ComfyUI, Automatic1111, or Forge?
I'm going to start working with AI soon, and I'd like to know which one is the most recommended.
https://redd.it/1lekbm7
@rStableDiffusion
Sources vs. Output Comparison: Trying to use 3D references, some with camera motion from Blender, to see if I can control the output
https://redd.it/1lensll
@rStableDiffusion
Chroma - Diffusers released!
I look at the Chroma site and what do I see? It is now available in diffusers format!
https://huggingface.co/lodestones/Chroma/tree/main
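For anyone who wants to try it from Python, here is a minimal loading sketch using diffusers' generic pipeline loader. It assumes the repo exposes a standard diffusers layout; the concrete pipeline class, dtype, and VRAM requirements may differ:

```python
# Hedged sketch: load the Chroma diffusers weights with the generic
# DiffusionPipeline loader. Untested; the repo may require a specific
# pipeline class or dtype.
import torch
from diffusers import DiffusionPipeline

pipe = DiffusionPipeline.from_pretrained(
    "lodestones/Chroma", torch_dtype=torch.bfloat16
)
pipe.to("cuda")

image = pipe("a watercolor fox in a misty forest").images[0]
image.save("chroma_test.png")
```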
https://redd.it/1lepqtg
@rStableDiffusion
Wan2.1 VACE Video Masking using Florence2 and SAM2 Segmentation
https://youtu.be/QON-XxE9r50?si=0-aHFMwARIId6jdY
In this tutorial I attempt to give a complete walkthrough of what it takes to use video masking to swap out one object for another using a reference image, SAM2 segmentation, and Florence2Run in Wan 2.1 VACE.
Free workflows can be found at: https://pat…
https://redd.it/1ler7zz
@rStableDiffusion