A nice approach with pre-alignment and great speed
https://arxiv.org/abs/2406.08835
EffectiveASR: A Single-Step Non-Autoregressive Mandarin Speech Recognition Architecture with High Accuracy and Inference Speed
Ziyang Zhuang, Chenfeng Miao, Kun Zou, Shuai Gong, Ming Fang, Tao Wei, Zijian Li, Wei Hu, Shaojun Wang, Jing Xiao
Non-autoregressive (NAR) automatic speech recognition (ASR) models predict tokens independently and simultaneously, bringing high inference speed. However, there is still a gap in the accuracy of the NAR models compared to the autoregressive (AR) models. In this paper, we propose a single-step NAR ASR architecture with high accuracy and inference speed, called EffectiveASR. It uses an Index Mapping Vector (IMV) based alignment generator to generate alignments during training, and an alignment predictor to learn the alignments for inference. It can be trained end-to-end (E2E) with cross-entropy loss combined with alignment loss. The proposed EffectiveASR achieves competitive results on the AISHELL-1 and AISHELL-2 Mandarin benchmarks compared to the leading models. Specifically, it achieves character error rates (CER) of 4.26%/4.62% on the AISHELL-1 dev/test dataset, which outperforms the AR Conformer with about 30x inference speedup.
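For intuition, here is a minimal PyTorch-style sketch of how a combined cross-entropy plus alignment loss of this kind could be wired; the function and argument names are placeholders of mine, not the authors' code.

```python
import torch.nn.functional as F

def nar_asr_loss(token_logits, target_tokens, predicted_alignment,
                 reference_alignment, alignment_weight=1.0, pad_id=0):
    # Cross-entropy over all output positions at once (single-step NAR decoding).
    ce = F.cross_entropy(
        token_logits.transpose(1, 2),   # (B, V, T) layout expected by cross_entropy
        target_tokens,                  # (B, T)
        ignore_index=pad_id,
    )
    # Regression term pushing the alignment predictor towards the IMV-based
    # alignments produced by the alignment generator during training.
    align = F.l1_loss(predicted_alignment, reference_alignment)
    return ce + alignment_weight * align
```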
This is actually an important paper; I'll come back to it later
https://arxiv.org/abs/2407.16370
Evolutionary Prompt Design for LLM-Based Post-ASR Error Correction
Rithik Sachdev, Zhong-Qiu Wang, Chao-Han Huck Yang
Building upon the strength of modern large language models (LLMs), generative error correction (GEC) has emerged as a promising paradigm that can elevate the performance of modern automatic speech recognition (ASR) systems. One representative approach is to leverage in-context learning to prompt LLMs so that a better hypothesis can be generated by the LLMs based on a carefully-designed prompt and an N-best list of hypotheses produced by ASR systems. However, it is yet unknown whether the existing prompts are the most effective ones for the task of post-ASR error correction. In this context, this paper first explores alternative prompts to identify an initial set of effective prompts, and then proposes to employ an evolutionary prompt optimization algorithm to refine the initial prompts. Evaluation results on the CHiME-4 subset of the Task 1 of the SLT 2024 GenSEC challenge show the effectiveness and potential of the proposed algorithms.
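To make the evolutionary part concrete, here is a generic selection-and-mutation loop over prompt strings; `mutate` (e.g. an LLM paraphraser) and `score` (e.g. negative WER on a dev set) are hypothetical callbacks, and the paper's actual operators are more refined.

```python
import random

def evolve_prompts(seed_prompts, mutate, score, generations=10, population=8, top_k=4):
    """Generic evolutionary search over prompt strings.

    mutate(prompt) -> new prompt string (e.g. a paraphrase produced by an LLM)
    score(prompt)  -> fitness, e.g. negative WER of the corrected hypotheses on a dev set
    Both callbacks are assumptions of this sketch, not the paper's code.
    """
    pool = list(seed_prompts)
    for _ in range(generations):
        survivors = sorted(pool, key=score, reverse=True)[:top_k]          # selection
        children = [mutate(random.choice(survivors)) for _ in range(population - top_k)]
        pool = survivors + children                                         # next generation
    return max(pool, key=score)
```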
An implementation is available in WeSpeaker:
https://github.com/wenet-e2e/wespeaker/pull/356
https://www.arxiv.org/abs/2408.15585
Whisper-PMFA: Partial Multi-Scale Feature Aggregation for Speaker Verification using Whisper Models
Yiyang Zhao, Shuai Wang, Guangzhi Sun, Zehua Chen, Chao Zhang, Mingxing Xu, Thomas Fang Zheng
In this paper, Whisper, a large-scale pre-trained model for automatic speech recognition, is proposed to apply to speaker verification. A partial multi-scale feature aggregation (PMFA) approach is proposed based on a subset of Whisper encoder blocks to derive highly discriminative speaker embeddings. Experimental results demonstrate that using the middle to later blocks of the Whisper encoder keeps more speaker information. On the VoxCeleb1 and CN-Celeb1 datasets, our system achieves 1.42% and 8.23% equal error rates (EERs) respectively, receiving 0.58% and 1.81% absolute EER reductions over the ECAPA-TDNN baseline, and 0.46% and 0.97% over the ResNet34 baseline. Furthermore, our results indicate that using Whisper models trained on multilingual data can effectively enhance the model's robustness across languages. Finally, the low-rank adaptation approach is evaluated, which reduces the trainable model parameters by approximately 45 times while only slightly increasing EER by 0.2%.
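A rough sketch of the partial aggregation idea using the Hugging Face transformers Whisper encoder; the block range, the simple mean pooling, and the missing speaker back-end are my simplifications, not the WeSpeaker recipe.

```python
import torch
from transformers import WhisperModel, WhisperFeatureExtractor

model = WhisperModel.from_pretrained("openai/whisper-small")
fe = WhisperFeatureExtractor.from_pretrained("openai/whisper-small")

def partial_block_embedding(waveform, sr=16000, blocks=range(6, 12)):
    feats = fe(waveform, sampling_rate=sr, return_tensors="pt").input_features
    with torch.no_grad():
        out = model.encoder(feats, output_hidden_states=True)
    # hidden_states[0] is the input embedding; pick middle-to-late blocks,
    # concatenate along the feature dimension, then mean-pool over time.
    selected = torch.cat([out.hidden_states[i] for i in blocks], dim=-1)
    return selected.mean(dim=1)  # a trained speaker back-end would replace this pooling
```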
Support Whisper-PMFA by Aurora1818 · Pull Request #356 · wenet-e2e/wespeaker
The PR adds examples/v1/Whisper-PMFA and slightly modifies wespeaker/bin/train.py and wespeaker/utils/excuter.py to support more front-end models.
Good talk
https://twitter.com/HungyiLee2/status/1830698181757411769
Spoken Language Models at INTERSPEECH 2024
https://drive.google.com/file/d/1gPjnjGKxeCF72gisPVuQlDvogXQCtNk4/view
I'll give an overview talk on Spoken Language Models at INTERSPEECH 2024! Join me tomorrow, September 3rd, from 13:30 to 14:10 in the "Lasso" room.
Nice results (accepted at AAAI 2025), and finally a proper evaluation of the TTS
https://github.com/zhenye234/xcodec
https://arxiv.org/abs/2408.17175
Codec Does Matter: Exploring the Semantic Shortcoming of Codec for Audio Language Model
Zhen Ye, Peiwen Sun, Jiahe Lei, Hongzhan Lin, Xu Tan, Zheqi Dai, Qiuqiang Kong, Jianyi Chen, Jiahao Pan, Qifeng Liu, Yike Guo, Wei Xue
Recent advancements in audio generation have been significantly propelled by the capabilities of Large Language Models (LLMs). The existing research on audio LLM has primarily focused on enhancing the architecture and scale of audio language models, as well as leveraging larger datasets, and generally, acoustic codecs, such as EnCodec, are used for audio tokenization. However, these codecs were originally designed for audio compression, which may lead to suboptimal performance in the context of audio LLM. Our research aims to address the shortcomings of current audio LLM codecs, particularly their challenges in maintaining semantic integrity in generated audio. For instance, existing methods like VALL-E, which condition acoustic token generation on text transcriptions, often suffer from content inaccuracies and elevated word error rates (WER) due to semantic misinterpretations of acoustic tokens, resulting in word skipping and errors. To overcome these issues, we propose a straightforward yet effective approach called X-Codec. X-Codec incorporates semantic features from a pre-trained semantic encoder before the Residual Vector Quantization (RVQ) stage and introduces a semantic reconstruction loss after RVQ. By enhancing the semantic ability of the codec, X-Codec significantly reduces WER in speech synthesis tasks and extends these benefits to non-speech applications, including music and sound generation. Our experiments in text-to-speech, music continuation, and text-to-sound tasks demonstrate that integrating semantic information substantially improves the overall performance of language models in audio generation. Our code and demo are available (Demo: this https URL Code: this https URL)
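A heavily simplified sketch of where the two ingredients sit in a codec training step; every module name here is a placeholder, not the released X-Codec API.

```python
import torch
import torch.nn.functional as F

def xcodec_style_losses(acoustic_enc, semantic_enc, rvq, decoder, semantic_head, wav):
    # Illustrative placeholders only; not the released X-Codec code.
    a = acoustic_enc(wav)                                # acoustic latent (B, T, Da)
    s = semantic_enc(wav).detach()                       # frozen pre-trained semantic features (B, T, Ds)
    z, codes, vq_loss = rvq(torch.cat([a, s], dim=-1))   # inject semantics *before* RVQ
    wav_hat = decoder(z)
    s_hat = semantic_head(z)                             # semantic reconstruction *after* RVQ
    recon = F.l1_loss(wav_hat, wav)
    semantic_recon = F.mse_loss(s_hat, s)                # the extra loss proposed in the paper
    return recon + vq_loss + semantic_recon
```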
This is a very reasonable architecture for many reasons; I'll write more on it later
https://bytedancespeech.github.io/seedtts_tech_report/
No open-source release it seems, but there is a new repo you can try. Voice conversion is quite good, even in the cross-lingual case
https://huggingface.co/spaces/Plachta/Seed-VC
https://github.com/Plachtaa/seed-vc
https://github.com/nyrahealth/CrisperWhisper
https://huggingface.co/nyrahealth/CrisperWhisper
https://arxiv.org/abs/2408.16589
CrisperWhisper: Accurate Timestamps on Verbatim Speech Transcriptions
Laurin Wagner, Bernhard Thallinger, Mario Zusag
We demonstrate that carefully adjusting the tokenizer of the Whisper speech recognition model significantly improves the precision of word-level timestamps when applying dynamic time warping to the decoder's cross-attention scores. We fine-tune the model to produce more verbatim speech transcriptions and employ several techniques to increase robustness against multiple speakers and background noise. These adjustments achieve state-of-the-art performance on benchmarks for verbatim speech transcription, word segmentation, and the timed detection of filler events, and can further mitigate transcription hallucinations. The code is openly available at this https URL.
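The underlying alignment step is dynamic time warping over a cost matrix derived from the decoder's cross-attention; a generic DTW sketch follows, with the cost construction assumed rather than taken from the CrisperWhisper code.

```python
import numpy as np

def dtw_path(cost):
    """Monotonic DTW path through a (tokens x audio_frames) cost matrix,
    e.g. cost = 1 - averaged decoder cross-attention (an assumption here)."""
    T, F = cost.shape
    acc = np.full((T + 1, F + 1), np.inf)
    acc[0, 0] = 0.0
    for i in range(1, T + 1):
        for j in range(1, F + 1):
            acc[i, j] = cost[i - 1, j - 1] + min(acc[i - 1, j - 1], acc[i - 1, j], acc[i, j - 1])
    # Backtrace to recover the token-to-frame alignment path.
    i, j, path = T, F, []
    while i > 0 and j > 0:
        path.append((i - 1, j - 1))
        step = np.argmin([acc[i - 1, j - 1], acc[i - 1, j], acc[i, j - 1]])
        i, j = (i - 1, j - 1) if step == 0 else (i - 1, j) if step == 1 else (i, j - 1)
    return path[::-1]  # (token index, frame index) pairs; frame * hop_length gives timestamps
```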
You can automatically convert between creative forms now: paper to podcast; next, paper to artwork
https://illuminate.google.com
Illuminate is an experimental AI application by Google that converts a paper from PDF to a podcast. The synthetic voice sounds pretty natural. What a great time to be a learner!
Google is introducing "think" tokens (emulating the brain) everywhere
Contemplative Mechanism for Speech Recognition: Speech Encoders can Think
Tien-Ju Yang, Andrew Rosenberg, Bhuvana Ramabhadran
https://www.isca-archive.org/interspeech_2024/yang24g_interspeech.pdf
Also related:
Think before you speak: Training Language Models With Pause Tokens
Sachin Goyal, Ziwei Ji, Ankit Singh Rawat, Aditya Krishna Menon, Sanjiv Kumar, Vaishnavh Nagarajan
https://arxiv.org/abs/2310.02226
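A toy sketch of the pause/think-token idea under my own simplification (learnable embeddings appended after the prompt); in the papers above the pauses are actual vocabulary tokens used during pretraining and finetuning, with the answer loss computed only on tokens after the pauses.

```python
import torch
import torch.nn as nn

class PausePrepender(nn.Module):
    """Appends N learnable <pause> embeddings after the prompt so the model
    gets extra computation steps before it starts emitting the answer."""
    def __init__(self, d_model, n_pause=10):
        super().__init__()
        self.pause = nn.Parameter(torch.randn(n_pause, d_model) * 0.02)

    def forward(self, prompt_emb):                      # prompt_emb: (B, T, D)
        b = prompt_emb.size(0)
        pauses = self.pause.unsqueeze(0).expand(b, -1, -1)
        return torch.cat([prompt_emb, pauses], dim=1)   # (B, T + n_pause, D)
```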
The CHiME Challenge is way more dense than Interspeech. The CHiME-8 workshop just ended
https://www.chimechallenge.org/current/workshop/index
Congrats to the STC team; as usual, they demonstrate top performance on CHiME tasks.
No publications yet, but even the keynote talk is interesting:
Teaching New Skills to Foundation Models: Insights and Experiences
Speaker: Hung-yi Lee
National Taiwan University (NTU)
https://www.chimechallenge.org/current/workshop/CHiME2024_Lee.pdf
And there is an interesting NOTSOFAR task, which should be informative for the many startups trying to implement meeting transcription
To illustrate, on our newly released NOTSOFAR meeting benchmark, Whisper large-v3 with head-mounted mics achieves 9.3% WER (word-error-rate), yet on audio from a distant mic it climbs to 37.4% WER. The culprits are reverberation, noise, and overlapping speech, which interfere with the source signal.
https://arxiv.org/abs/2401.08887
NOTSOFAR-1 Challenge: New Datasets, Baseline, and Tasks for Distant Meeting Transcription
Alon Vinnikov, Amir Ivry, Aviv Hurvitz, Igor Abramovski, Sharon Koubi, Ilya Gurvich, Shai Pe`er, Xiong Xiao, Benjamin Martinez Elizalde, Naoyuki Kanda, Xiaofei Wang, Shalev Shaer, Stav Yagev, Yossi Asher, Sunit Sivasankaran, Yifan Gong, Min Tang, Huaming Wang, Eyal Krupka
We introduce the first Natural Office Talkers in Settings of Far-field Audio Recordings (``NOTSOFAR-1'') Challenge alongside datasets and baseline system. The challenge focuses on distant speaker diarization and automatic speech recognition (DASR) in far-field meeting scenarios, with single-channel and known-geometry multi-channel tracks, and serves as a launch platform for two new datasets: First, a benchmarking dataset of 315 meetings, averaging 6 minutes each, capturing a broad spectrum of real-world acoustic conditions and conversational dynamics. It is recorded across 30 conference rooms, featuring 4-8 attendees and a total of 35 unique speakers. Second, a 1000-hour simulated training dataset, synthesized with enhanced authenticity for real-world generalization, incorporating 15,000 real acoustic transfer functions. The tasks focus on single-device DASR, where multi-channel devices always share the same known geometry. This is aligned with common setups in actual conference rooms, and avoids technical complexities associated with multi-device tasks. It also allows for the development of geometry-specific solutions. The NOTSOFAR-1 Challenge aims to advance research in the field of distant conversational speech recognition, providing key resources to unlock the potential of data-driven methods, which we believe are currently constrained by the absence of comprehensive high-quality training and benchmarking datasets.
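For reference, word error rates like the 9.3% vs. 37.4% figures quoted above can be computed with the jiwer package; the sentences below are just a toy example, not challenge data.

```python
# pip install jiwer
import jiwer

reference = "the quick brown fox jumps over the lazy dog"
hypothesis = "the quick brown fox jumped over a lazy dog"

# Two substitutions out of nine reference words -> about 22.2% WER.
print(f"WER: {jiwer.wer(reference, hypothesis):.1%}")
```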
Fun fact: the amount of fake content keeps growing. For example, a user just sent me an article that mentions a Vosk Indonesian model and even gives a link to it. The problem is we never had one! The article is clearly auto-generated.
https://x.com/FishAudio/status/1833787529595912531
Excited to introduce Fish Speech 1.4 - now open-source and more powerful than ever! Our mission is to make cutting-edge voice tech accessible to everyone.
What's new:
- Trained on 700k hours of multilingual data (up from 200k)
- Now supports 8 languages: English, Chinese, German, Japanese, French, Spanish, Korean, and Arabic
- Fully open-source, empowering developers and researchers worldwide
Key features:
- Lightning-fast TTS with ultra-low latency
- Instant voice cloning
- Self-host or use our cloud service
- Simple, flat-rate pricing
Try it out:
- Playground: https://fish.audio
- GitHub: https://github.com/fishaudio/fish-speech
- HuggingFace Model: https://huggingface.co/fishaudio/fish-speech-1.4
- Demo: https://huggingface.co/spaces/fishaudio/fish-speech-1
- Product Hunt: https://producthunt.com/posts/fish-speech-1-4
Small LLMs are the right way to go; here is a Chinese LLM with 26M parameters
https://github.com/jingyaogong/minimind/blob/master/README_en.md
The repo shows how to train a 26M-parameter GPT completely from scratch in just 2 hours.
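Back-of-the-envelope arithmetic shows how a GPT lands in this size range; the config below is illustrative, not MiniMind's exact hyperparameters.

```python
def gpt_params(vocab=6400, d_model=512, n_layers=8, d_ff=None):
    """Rough parameter count for a small decoder-only LLM (ignores norms and biases)."""
    d_ff = d_ff or 4 * d_model
    embed = vocab * d_model                                  # token embeddings (tied LM head assumed)
    per_layer = 4 * d_model * d_model + 2 * d_model * d_ff   # attention (QKVO) + MLP
    return embed + n_layers * per_layer

print(f"{gpt_params() / 1e6:.1f}M parameters")  # ≈ 28.4M with these illustrative settings
```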
A new paper from the StyleTTS authors. The metrics look good, and finally a proper comparison between systems! But I wonder whether these algorithms are too focused on read speech; it is hard to believe such great metrics would hold for a conversational dataset with the proposed complex algorithms.
https://arxiv.org/abs/2409.10058
StyleTTS-ZS: Efficient High-Quality Zero-Shot Text-to-Speech Synthesis with Distilled Time-Varying Style Diffusion
Yinghao Aaron Li, Xilin Jiang, Cong Han, Nima Mesgarani
The rapid development of large-scale text-to-speech (TTS) models has led to significant advancements in modeling diverse speaker prosody and voices. However, these models often face issues such as slow inference speeds, reliance on complex pre-trained neural codec representations, and difficulties in achieving naturalness and high similarity to reference speakers. To address these challenges, this work introduces StyleTTS-ZS, an efficient zero-shot TTS model that leverages distilled time-varying style diffusion to capture diverse speaker identities and prosodies. We propose a novel approach that represents human speech using input text and fixed-length time-varying discrete style codes to capture diverse prosodic variations, trained adversarially with multi-modal discriminators. A diffusion model is then built to sample this time-varying style code for efficient latent diffusion. Using classifier-free guidance, StyleTTS-ZS achieves high similarity to the reference speaker in the style diffusion process. Furthermore, to expedite sampling, the style diffusion model is distilled with perceptual loss using only 10k samples, maintaining speech quality and similarity while reducing inference speed by 90%. Our model surpasses previous state-of-the-art large-scale zero-shot TTS models in both naturalness and similarity, offering a 10-20x faster sampling speed, making it an attractive alternative for efficient large-scale zero-shot TTS systems. The audio demo, code and models are available at this https URL.
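The classifier-free guidance mentioned in the abstract follows the usual recipe of mixing conditional and unconditional predictions; below is a generic sketch with placeholder names, not the StyleTTS-ZS sampler.

```python
def cfg_step(model, x_t, t, ref_style, guidance_scale=3.0):
    """One classifier-free-guidance step for a style-diffusion sampler.
    `model`, its signature, and `guidance_scale` are placeholders of mine."""
    eps_cond = model(x_t, t, cond=ref_style)
    eps_uncond = model(x_t, t, cond=None)   # the model is trained with condition dropout
    # Push the prediction away from the unconditional one, towards the reference style.
    return eps_uncond + guidance_scale * (eps_cond - eps_uncond)
```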
Everyone is doing DiT in TTS. Honestly, I have failed to understand the point of it, but the metrics show good values. Examples (a generic flow-matching sketch follows the links):
https://github.com/KdaiP/StableTTS
https://github.com/7Xin/DPI-TTS
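For context, the training objective these DiT-based TTS systems typically pair with the transformer backbone is conditional flow matching; here is a generic sketch, not the StableTTS or DPI-TTS code.

```python
import torch
import torch.nn.functional as F

def flow_matching_loss(vector_field, x1, cond):
    """Conditional flow-matching objective over a mel/latent target x1 of shape (B, T, D).
    `vector_field(x_t, t, cond)` is a placeholder for the DiT backbone predicting velocity."""
    x0 = torch.randn_like(x1)                              # noise sample
    t = torch.rand(x1.size(0), 1, 1, device=x1.device)     # random time per example
    x_t = (1 - t) * x0 + t * x1                            # linear interpolation path
    target_v = x1 - x0                                     # constant velocity along that path
    return F.mse_loss(vector_field(x_t, t, cond), target_v)
```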