SANE 2024 Videos, interesting things
https://www.youtube.com/playlist?list=PLBJWRPcgwk7vVzKLPnTrqm831VohoLMmy
SANE 2024 @ Google Cambridge — a one-day event gathering researchers and students in speech and audio from the Northeast of the American continent, held in October 2024.
https://github.com/jishengpeng/WavChat
https://arxiv.org/abs/2411.13577
WavChat: A Survey of Spoken Dialogue Models
Shengpeng Ji, Yifu Chen, Minghui Fang, Jialong Zuo, Jingyu Lu, Hanting Wang, Ziyue Jiang, Long Zhou, Shujie Liu, Xize Cheng, Xiaoda Yang, Zehan Wang, Qian Yang, Jian Li, Yidi Jiang, Jingzhen He, Yunfei Chu, Jin Xu, Zhou Zhao
Recent advancements in spoken dialogue models, exemplified by systems like GPT-4o, have captured significant attention in the speech domain. Compared to traditional three-tier cascaded spoken dialogue models that comprise speech recognition (ASR), large language models (LLMs), and text-to-speech (TTS), modern spoken dialogue models exhibit greater intelligence. These advanced spoken dialogue models not only comprehend audio, music, and other speech-related features, but also capture stylistic and timbral characteristics in speech. Moreover, they generate high-quality, multi-turn speech responses with low latency, enabling real-time interaction through simultaneous listening and speaking. Despite the progress in spoken dialogue systems, there is a lack of comprehensive surveys that systematically organize and analyze these systems and the underlying technologies. To address this, we have first compiled existing spoken dialogue systems in chronological order and categorized them into the cascaded and end-to-end paradigms. We then provide an in-depth overview of the core technologies in spoken dialogue models, covering aspects such as speech representation, training paradigm, streaming, duplex, and interaction capabilities. Each section discusses the limitations of these technologies and outlines considerations for future research. Additionally, we present a thorough review of relevant datasets, evaluation metrics, and benchmarks from the perspectives of training and evaluating spoken dialogue systems. We hope this survey will contribute to advancing both academic research and industrial applications in the field of spoken dialogue systems. The related material is available in the linked GitHub repository.
Small is always nice
https://arxiv.org/abs/2408.13920
Wav2Small: Distilling Wav2Vec2 to 72K parameters for Low-Resource Speech emotion recognition
Dionyssos Kounadis-Bastian, Oliver Schrüfer, Anna Derington, Hagen Wierstorf, Florian Eyben, Felix Burkhardt, Björn Schuller
Speech Emotion Recognition (SER) needs high computational resources to overcome the challenge of substantial annotator disagreement. Today SER is shifting towards dimensional annotations of arousal, dominance, and valence (A/D/V). Universal metrics such as the L2 distance prove unsuitable for evaluating A/D/V accuracy due to the non-converging consensus of annotator opinions. However, the Concordance Correlation Coefficient (CCC) arose as an alternative metric for A/D/V, where a model's output is evaluated against a whole dataset's CCC rather than the L2 distances of individual audios. Recent studies have shown that wav2vec2 / wavLM architectures outputting a float value for each A/D/V dimension achieve today's state-of-the-art (SotA) CCC on A/D/V. The Wav2Vec2.0 / WavLM family has a high computational footprint, but training small models using human annotations has been unsuccessful. In this paper we use a large Transformer SotA A/D/V model as Teacher/Annotator to train 5 student models: 4 MobileNets and our proposed Wav2Small, using only the Teacher's A/D/V outputs instead of human annotations. The Teacher model we propose also sets a new SotA on the MSP Podcast dataset with a valence CCC of 0.676. We choose MobileNetV4 / MobileNet-V3 as students, as MobileNet has been designed for fast execution times. We also propose Wav2Small, an architecture designed for minimal parameters and RAM consumption. Wav2Small, with a quantised .onnx of only 120 KB, is a potential solution for A/D/V on low-resource hardware, having only 72K parameters vs 3.12M parameters for MobileNet-V4-Small.
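As a reference for the CCC metric mentioned above: it is computed once over the whole evaluation set rather than averaged per utterance. A minimal NumPy sketch (not taken from the paper's code; the toy data below is made up):

```python
import numpy as np

def ccc(pred: np.ndarray, target: np.ndarray) -> float:
    """Concordance Correlation Coefficient between predictions and labels."""
    pred_mean, target_mean = pred.mean(), target.mean()
    covariance = ((pred - pred_mean) * (target - target_mean)).mean()
    return float(2 * covariance / (pred.var() + target.var() + (pred_mean - target_mean) ** 2))

# Toy example: a student's arousal predictions vs. the teacher's labels.
rng = np.random.default_rng(0)
teacher_labels = rng.uniform(0.0, 1.0, 500)
student_preds = teacher_labels + rng.normal(0.0, 0.1, 500)
print(ccc(student_preds, teacher_labels))
```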
Just a reminder that BEST-RQ is a good self-supervised method
https://arxiv.org/abs/2202.01855
Recently added to SpeechBrain too
https://github.com/speechbrain/speechbrain/releases/tag/v1.0.2
Also
https://github.com/HarunoriKawano/BEST-RQ
Self-supervised Learning with Random-projection Quantizer for Speech Recognition — a simple and effective self-supervised learning approach for speech recognition that learns to predict masked speech signals in the form of discrete labels.
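The core trick of BEST-RQ is that both the projection and the codebook are random and frozen, so the prediction targets are cheap to compute. A rough sketch of the random-projection quantizer (dimensions and codebook size here are illustrative, not the paper's exact configuration):

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
feat_dim, code_dim, codebook_size = 80, 16, 8192

# Randomly initialized and never trained.
projection = torch.randn(feat_dim, code_dim)
codebook = F.normalize(torch.randn(codebook_size, code_dim), dim=-1)

def quantize(features: torch.Tensor) -> torch.Tensor:
    """Map (time, feat_dim) speech features to discrete target labels."""
    projected = F.normalize(features @ projection, dim=-1)
    # Nearest-neighbour lookup against the frozen codebook.
    return (projected @ codebook.T).argmax(dim=-1)

labels = quantize(torch.randn(100, feat_dim))  # prediction targets for masked frames
```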
A big review paper
https://www.sciencedirect.com/science/article/pii/S088523082400130X?ssrnid=4870649&dgcid=SSRN_redirect_SD
Refining the evaluation of speech synthesis: A summary of the Blizzard Challenge 2023
The Blizzard Challenge has benchmarked progress in Text-to-Speech (TTS) since 2005. The Challenge has seen important milestones passed, with results suggesting that synthetic speech was indistinguishable from natural speech in terms of intelligibility in 2021 and that by that same year it was perhaps even indistinguishable in naturalness. The high quality of synthetic speech generated by the latest TTS systems has thus revealed limitations with ITU-T P.800.1 Mean Opinion Score (MOS) in detecting the remaining differences between synthetic and natural speech. Yet, it was the only method used in previous Challenges and is still the most popular method in the field for speech synthesis evaluation. In the 2023 Challenge, we addressed observed limitations of past Challenges by incorporating state-of-the-art speech synthesis evaluation techniques to refine the evaluation of speech quality, speaker similarity and intelligibility. For speech quality, a relative comparison of the systems receiving the best MOS was able to discover a greater number of significant differences between systems. Regarding speaker similarity, we demonstrated that there is a strong bias depending on whether the listeners are familiar with the target voice or not. As for intelligibility, the evaluation of language-specific phenomena, such as the pronunciation of homographs, better highlighted system limits compared to global transcription tasks of synthesised utterances. In addition to reporting results for the 18 entries to the 2023 Challenge, we extend the results analysis to type of TTS module to provide some insights on the most recent advances in model design. Overall, this year’s results demonstrate the need for a shift towards new methods for refining TTS evaluation to shed light on increasingly smaller and localised differences between synthesised and natural speech.
Wavehax: Aliasing-Free Neural Waveform Synthesis Based on 2D Convolution and Harmonic Prior for Reliable Complex Spectrogram Estimation
arXiv: https://arxiv.org/abs/2411.06807
Demo: https://chomeyama.github.io/wavehax-demo/
An approach to significantly improve codec generation
Neural vocoders often struggle with aliasing in latent feature spaces, caused by time-domain nonlinear operations and resampling layers.
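For intuition about the harmonic prior: the idea is to hand the network a signal that already contains the right harmonic structure for the target F0, so it does not have to synthesise it with aliasing-prone nonlinearities. A toy sum-of-sinusoids prior driven by a frame-level F0 contour might look like this (sample rate, hop size, and harmonic count are assumptions, not Wavehax's actual configuration):

```python
import numpy as np

def harmonic_prior(f0: np.ndarray, sr: int = 24000, hop: int = 240, n_harmonics: int = 8) -> np.ndarray:
    """Toy harmonic prior: sum of sinusoids at multiples of a frame-level F0."""
    f0_up = np.repeat(f0, hop)                     # frame-level F0 -> sample level
    voiced = (f0_up > 0).astype(np.float64)
    phase = 2 * np.pi * np.cumsum(f0_up / sr)      # instantaneous phase of the fundamental
    signal = np.zeros_like(f0_up)
    for k in range(1, n_harmonics + 1):
        harmonic = np.sin(k * phase)
        harmonic[k * f0_up >= sr / 2] = 0.0        # keep harmonics below Nyquist
        signal += harmonic
    return voiced * signal / n_harmonics

prior = harmonic_prior(np.full(100, 220.0))        # 100 voiced frames at 220 Hz
```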
https://www.youtube.com/watch?v=TGlfK0lwjgw
#13 Titouan Parcollet
Titouan Parcollet is a Research Scientist at the Samsung AI Center Cambridge and an adjunct researcher at the Cambridge Machine Learning Systems Lab of the University of Cambridge. He is also an Associate Professor on leave from the Laboratoire Informatique d'Avignon (LIA) and Avignon Université (FR). His current research focuses on self-supervised / representation learning and on continual learning. He played an instrumental part in the development of SpeechBrain and PyTorch-Kaldi.
ML-SUPERB 2.0 Challenge at #Interspeech2025
154 languages & 200+ accents/dialects
Live leaderboard & online evaluation! Join now: multilingual.superbbenchmark.org
https://multilingual.superbbenchmark.org/
ML-SUPERB: Multilingual Speech processing Universal PERformance Benchmark
A multilingual benchmark for Self-supervised Speech Representation Learning
For crypto guys, projects to finetune different TTS models
https://github.com/impel-intelligence/dippy-speech-subnet
https://github.com/myshell-ai/MyShell-TTS-Subnet
https://x.com/LiuXub/status/1863622470709690575
TAAE — the first Transformer-based Audio AutoEncoder scaled to 1B parameters for neural speech coding! 🔥
TAAE achieves state-of-the-art speech quality at ultra-low bitrates of 400 or 700 bits-per-second, delivering reconstruction quality remarkably close to real audio. It sets a new benchmark for efficient and high-quality speech tokenization.
📖 Paper: https://arxiv.org/abs/2411.19842v1
👂 Demos: https://stability-ai.github.io/stable-codec-demo/
💻 GitHub: https://github.com/Stability-AI/stable-codec
Code and pre-trained models will be released to empower the community!
Paper title: Scaling Transformers for Low-Bitrate High-Quality Speech Coding
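To put 400–700 bps in perspective, here is a back-of-the-envelope relation between token rate, codebook size, and bitrate (illustrative numbers only, not TAAE's actual configuration):

```python
import math

def codec_bitrate(tokens_per_second: float, codebook_size: int, n_codebooks: int = 1) -> float:
    """Bits per second for a codec emitting discrete tokens: each token carries
    log2(codebook_size) bits per codebook stream."""
    return tokens_per_second * n_codebooks * math.log2(codebook_size)

# e.g. a single 65536-entry codebook at 25 tokens/s would give 400 bits/s.
print(codec_bitrate(25, 65536))  # 400.0
```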
Indic Parler-TTS is a multilingual Indic extension of Parler-TTS Mini.
It is a fine-tuned version of Indic Parler-TTS Pretrained, trained on 1,806 hours of multilingual Indic and English data.
Indic Parler-TTS Mini can officially speak in 20 Indic languages, making it comprehensive for regional language technologies, and in English. The 21 languages supported are: Assamese, Bengali, Bodo, Dogri, English, Gujarati, Hindi, Kannada, Konkani, Maithili, Malayalam, Manipuri, Marathi, Nepali, Odia, Sanskrit, Santali, Sindhi, Tamil, Telugu, and Urdu.
Thanks to its better prompt tokenizer, it can easily be extended to other languages. This tokenizer has a larger vocabulary and handles byte fallback, which simplifies multilingual training.
https://huggingface.co/ai4bharat/indic-parler-tts
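A rough usage sketch following the standard Parler-TTS API (the prompt and description strings are made-up examples; the exact tokenizer handling for this checkpoint should be taken from the model card, which, as noted above, uses a dedicated prompt tokenizer with byte fallback):

```python
import torch
import soundfile as sf
from transformers import AutoTokenizer
from parler_tts import ParlerTTSForConditionalGeneration

device = "cuda:0" if torch.cuda.is_available() else "cpu"

model = ParlerTTSForConditionalGeneration.from_pretrained("ai4bharat/indic-parler-tts").to(device)
prompt_tokenizer = AutoTokenizer.from_pretrained("ai4bharat/indic-parler-tts")
# The description goes through the text encoder's own tokenizer (assumption based
# on the usual Parler-TTS recipe; check the model card for the exact setup).
description_tokenizer = AutoTokenizer.from_pretrained(model.config.text_encoder._name_or_path)

prompt = "अरे, आप आज कैसे हैं?"  # Hindi text to synthesise (made-up example)
description = "A female speaker delivers a slightly expressive speech at a moderate pace, recorded very close up."

input_ids = description_tokenizer(description, return_tensors="pt").input_ids.to(device)
prompt_input_ids = prompt_tokenizer(prompt, return_tensors="pt").input_ids.to(device)

audio = model.generate(input_ids=input_ids, prompt_input_ids=prompt_input_ids)
sf.write("indic_parler_out.wav", audio.cpu().numpy().squeeze(), model.config.sampling_rate)
```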
Introducing Fish Speech 1.5 🎉 - Making state-of-the-art TTS accessible to everyone!
Highlights:
- #2 ranked on TTS-Arena (as "Anonymous Sparkle")
- 1M hours of multilingual training data
- 13 languages supported, including English, Chinese, Japanese & more
- <150ms latency with high-quality instant voice cloning
- Pretrained model now open source
- Cost-effective self-hosting or cloud options
Let's check out the details 🧵⬇️
https://x.com/FishAudio/status/1864370933496205728
Supported languages:
English (en) >300k hours
Chinese (zh) >300k hours
Japanese (ja) >100k hours
German (de) ~20k hours
French (fr) ~20k hours
Spanish (es) ~20k hours
Korean (ko) ~20k hours
Arabic (ar) ~20k hours
Russian (ru) ~20k hours
Dutch (nl) <10k hours
Italian (it) <10k hours
Polish (pl) <10k hours
Portuguese (pt) <10k hours
Discrete methods, while widely used, have disadvantages (there are advantages too), and there are attempts to replace them with continuous models. This paper has been getting quite a bit of attention:
https://x.com/marco_ppasini/status/1864330701530644835
https://arxiv.org/abs/2411.18447
Continuous Autoregressive Models with Noise Augmentation Avoid Error Accumulation
Marco Pasini, Javier Nistal, Stefan Lattner, George Fazekas
Autoregressive models are typically applied to sequences of discrete tokens, but recent research indicates that generating sequences of continuous embeddings in an autoregressive manner is also feasible. However, such Continuous Autoregressive Models (CAMs) can suffer from a decline in generation quality over extended sequences due to error accumulation during inference. We introduce a novel method to address this issue by injecting random noise into the input embeddings during training. This procedure makes the model robust against varying error levels at inference. We further reduce error accumulation through an inference procedure that introduces low-level noise. Experiments on musical audio generation show that CAM substantially outperforms existing autoregressive and non-autoregressive approaches while preserving audio quality over extended sequences. This work paves the way for generating continuous embeddings in a purely autoregressive setting, opening new possibilities for real-time and interactive generative applications.
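The core idea is simple enough to sketch: during training, the ground-truth embeddings fed back as autoregressive context get Gaussian noise of a random level added, so at inference the model tolerates its own imperfect predictions. A minimal sketch (the noise schedule and scaling are assumptions, not the paper's exact recipe):

```python
import torch

def noise_augment(context_embeddings: torch.Tensor, max_noise_std: float = 0.5) -> torch.Tensor:
    """Perturb teacher-forced context embeddings with a random noise level per sequence."""
    noise_std = torch.rand(context_embeddings.size(0), 1, 1, device=context_embeddings.device) * max_noise_std
    return context_embeddings + noise_std * torch.randn_like(context_embeddings)

# Training step sketch: condition on noisy past embeddings, regress the clean next ones.
targets = torch.randn(8, 128, 64)            # (batch, time, embedding_dim), made-up shapes
noisy_context = noise_augment(targets[:, :-1])
# prediction = model(noisy_context); loss = mse(prediction, targets[:, 1:])
```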
An example of an issue copied from repo to repo:
https://github.com/jaywalnut310/vits/issues/11
In VITS we predict a float duration and then convert it to attention steps, so the floats have to be rounded. VITS applies ceil, which results in a longer duration than the original (the scale is usually around 0.9). As a result, you need to scale back to match the original length:
https://github.com/jaywalnut310/vits/blob/main/models.py#L511
In GlowTTS there is an extra clamp:
https://github.com/coqui-ai/TTS/blob/main/TTS/tts/models/glow_tts.py#L351
This thing is copied from repo to repo; a fun thing happens in Matcha, where we multiply by the length factor after ceil has already been applied:
https://github.com/shivammehta25/Matcha-TTS/blob/main/matcha/models/matcha_tts.py#L122
About ceiling for calculating phoneme duration · Issue #11 · jaywalnut310/vits
Is there any reason to use torch.ceil instead of torch.round or other algorithms for calculating phoneme duration? Thank you.
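A simplified illustration of the ordering issue described above (not the exact code from the linked repositories; the duration values are made up):

```python
import torch

log_durations = torch.tensor([0.2, 1.1, -0.3])   # made-up duration predictor outputs (log domain)
length_scale = 0.9                               # <1.0 speeds the speech up

durations = torch.exp(log_durations)

# VITS-style: scale first, then ceil. Every phoneme gets rounded up, so the
# total length still tends to overshoot slightly.
steps_vits = torch.ceil(durations * length_scale)

# Matcha-style ordering mentioned above: ceil first, then multiply by the
# length factor, so the "integer" attention steps become fractional again.
steps_matcha_like = torch.ceil(durations) * length_scale

print(steps_vits, steps_matcha_like)
```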
Very good ideas here: dirty-data training, joint ASR/TTS, and so on.
https://arxiv.org/abs/2412.08237
TouchTTS: An Embarrassingly Simple TTS Framework that Everyone Can Touch
Xingchen Song, Mengtao Xing, Changwei Ma, Shengqiang Li, Di Wu, Binbin Zhang, Fuping Pan, Dinghao Zhou, Yuekai Zhang, Shun Lei, Zhendong Peng, Zhiyong Wu
It is well known that LLM-based systems are data-hungry. Recent LLM-based TTS works typically employ complex data processing pipelines to obtain high-quality training data. These sophisticated pipelines require excellent models at each stage (e.g., speech denoising, speech enhancement, speaker diarization, and punctuation models), which themselves demand high-quality training data and are rarely open-sourced. Even with state-of-the-art models, issues persist, such as incomplete background noise removal and misalignment between punctuation and actual speech pauses. Moreover, the stringent filtering strategies often retain only 10-30% of the original data, significantly impeding data scaling efforts. In this work, we leverage a noise-robust audio tokenizer (S3Tokenizer) to design a simplified yet effective TTS data processing pipeline that maintains data quality while substantially reducing data acquisition costs, achieving a data retention rate of over 50%. Beyond data scaling challenges, LLM-based TTS systems also incur higher deployment costs compared to conventional approaches. Current systems typically use LLMs solely for text-to-token generation, while requiring separate models (e.g., flow matching models) for token-to-waveform generation, which cannot be directly executed by LLM inference engines, further complicating deployment. To address these challenges, we eliminate redundant modules in both LLM and flow components, replacing the flow model backbone with an LLM architecture. Building upon this simplified flow backbone, we propose a unified architecture for both streaming and non-streaming inference, significantly reducing deployment costs. Finally, we explore the feasibility of unifying TTS and ASR tasks using the same data for training, thanks to the simplified pipeline and the S3Tokenizer that reduces the quality requirements for TTS training data.
The talks from Codec-SUPERB@SLT 2024 about neural audio codecs and speech language models are up on YouTube:
https://www.youtube.com/playlist?list=PLJV_el3uVTsNnC37JYD8kBcNDI7CNJgum
Keynote Speeches for the Codec-SUPERB Special Session @ SLT 2024