VioLA: Unified Codec Language Models for Speech Recognition, Synthesis, and Translation
Is one decoder-only generative model all you need for speech recognition, synthesis, and translation?
https://arxiv.org/abs/2305.16107
[IJCAI'23] Learning to Speak from Text for Low-Resource TTS
https://github.com/Takaaki-Saeki/zm-text-tts
https://arxiv.org/abs/2301.12596
While neural text-to-speech (TTS) has achieved human-like natural synthetic speech, multilingual TTS systems are limited to resource-rich languages due to the need for paired text and studio-quality audio data. This paper proposes a method for zero-shot multilingual TTS using text-only data for the target language. The use of text-only data allows the development of TTS systems for low-resource languages for which only textual resources are available, making TTS accessible to thousands of languages. Inspired by the strong cross-lingual transferability of multilingual language models, our framework first performs masked language model pretraining with multilingual text-only data. Then we train this model on paired data in a supervised manner, while freezing a language-aware embedding layer. This allows inference even for languages not included in the paired data but present in the text-only data. Evaluation results demonstrate highly intelligible zero-shot TTS with a character error rate of less than 12% for an unseen language. All experiments were conducted using public datasets and the implementation will be made available for reproducibility.
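Roughly, the recipe looks like this (a minimal PyTorch sketch with made-up module names, not the zm-text-tts code): pretrain the text encoder with a masked-LM objective on multilingual text-only data, then train on paired data with the language-aware embeddings frozen, so languages seen only as text remain usable at inference.

```python
import torch
import torch.nn as nn

class TextEncoder(nn.Module):
    def __init__(self, vocab_size, n_langs, d_model=256):
        super().__init__()
        self.token_emb = nn.Embedding(vocab_size, d_model)
        self.lang_emb = nn.Embedding(n_langs, d_model)   # language-aware embedding layer
        layer = nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=4)
        self.mlm_head = nn.Linear(d_model, vocab_size)   # used only in Stage 1 (masked LM)

    def forward(self, tokens, lang_ids):
        x = self.token_emb(tokens) + self.lang_emb(lang_ids).unsqueeze(1)
        return self.encoder(x)

enc = TextEncoder(vocab_size=500, n_langs=50)
# Stage 1: train enc + mlm_head with a masked-LM loss on multilingual text.
# Stage 2: attach the TTS decoder and train on paired data, freezing lang_emb:
for p in enc.lang_emb.parameters():
    p.requires_grad = False
```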
New paper from @GoogleResearch & @GoogleDeepMind
Translatotron 3: Unsupervised Speech-to-Speech Translation
Paper: https://arxiv.org/abs/2305.17547
Audio Samples: https://google-research.github.io/lingvo-lab/translatotron3
Text-to-speech synthesis from dark data with evaluation-in-the-loop data selection
Kentaro Seki, Shinnosuke Takamichi, Takaaki Saeki, Hiroshi Saruwatari
This paper proposes a method for selecting training data for text-to-speech (TTS) synthesis from dark data. TTS models are typically trained on high-quality speech corpora that cost much time and money to collect, which makes it very challenging to increase speaker variation. In contrast, there is a large amount of data whose availability is unknown (a.k.a. "dark data"), such as YouTube videos. To utilize data other than TTS corpora, previous studies have selected speech data from the corpora on the basis of acoustic quality. However, considering that TTS models robust to data noise have been proposed, we should select data on the basis of its importance as training data to the given TTS model, not the quality of the speech itself. Our method, with a loop of training and evaluation, selects training data on the basis of the automatically predicted quality of the synthetic speech of a given TTS model. Results of evaluations using YouTube data reveal that our method outperforms the conventional acoustic-quality-based method.
https://arxiv.org/abs/2210.14850
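The point is that data is kept by the predicted quality of the synthetic speech a given TTS model produces from it, not by the acoustic quality of the raw recording. A rough sketch of such an evaluation-in-the-loop selection (train_on, synthesize and quality_predictor are hypothetical names, not the paper's API):

```python
def select_training_data(candidates, tts_model, quality_predictor, n_rounds=3, keep_ratio=0.5):
    selected = list(candidates)
    for _ in range(n_rounds):
        tts_model.train_on(selected)                        # (re)train the TTS model on current selection
        scored = []
        for utt in selected:
            synth = tts_model.synthesize(utt.text)          # synthesize from the utterance text
            scored.append((quality_predictor(synth), utt))  # automatic quality prediction of the synthetic speech
        scored.sort(key=lambda pair: pair[0], reverse=True)
        selected = [utt for _, utt in scored[:int(len(scored) * keep_ratio)]]
    return selected
```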
New studio-quality & large-scale speech dataset🎙️
LibriTTS-R is a sound-quality-improved version of LibriTTS.
Dataset is freely available: http://openslr.org/141/
Speech samples and TTS outputs on our demo page: https://google.github.io/df-conformer/librittsr/index.html
Paper: https://arxiv.org/abs/2305.18802
Spoken dataset of books read in French, initially collected from audiocite.net by the GETALP team for the LeBenchmark project.
http://openslr.org/139/
Audiocite.net is a corpus of read French speech downloaded in November 2021 from the Audiocite.net website.
With a total duration of 6682 hours of audio recordings, this corpus is the result of the voluntary work of 130 speakers. The metadata is divided into four .json files (all (100%), train (80%), dev (10%), and test (10%)) to be used in NLP models.
The corpus and its metadata were uploaded through a script distributing the information in a .csv file. The use of these audio and metadata files is intended for pre-trained speech models.
Reinforcement learning in speech from Google
Edit Distance based RL for RNNT decoding
https://arxiv.org/abs/2306.01789
Dongseong Hwang, Changwan Ryu, Khe Chai Sim
RNN-T is currently considered the industry standard in ASR due to its exceptional WERs in various benchmark tests and its ability to support seamless streaming and long-form transcription. However, its biggest drawback lies in the significant discrepancy between its training and inference objectives. During training, RNN-T maximizes all alignment probabilities with teacher forcing, while during inference it uses beam search, which may not necessarily find the most probable alignment. Additionally, because RNN-T never experiences its own mistakes during teacher-forced training, errors become more problematic when they occur at inference. To address this issue, this paper proposes a Reinforcement Learning method that minimizes the gap between training and inference time. Our Edit Distance based RL (EDRL) approach computes rewards based on the edit distance and trains the network at every action level. The proposed approach yielded SoTA WERs on LibriSpeech for the 600M Conformer RNN-T model.
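The reward is simply negative edit distance between a sampled hypothesis and the reference. A hedged sketch of the generic REINFORCE version (the paper assigns rewards at every action; this simplification spreads one sequence-level reward over all sampled emissions):

```python
import torch

def edit_distance(a, b):
    # standard dynamic-programming Levenshtein distance between two token lists
    d = [[i + j if i * j == 0 else 0 for j in range(len(b) + 1)] for i in range(len(a) + 1)]
    for i in range(1, len(a) + 1):
        for j in range(1, len(b) + 1):
            d[i][j] = min(d[i - 1][j] + 1,
                          d[i][j - 1] + 1,
                          d[i - 1][j - 1] + (a[i - 1] != b[j - 1]))
    return d[-1][-1]

def edrl_loss(sampled_log_probs, hypothesis, reference, baseline=0.0):
    # sampled_log_probs: 1-D tensor of log-probabilities of the sampled emissions
    reward = -float(edit_distance(hypothesis, reference)) - baseline
    return -(reward * sampled_log_probs.sum())   # REINFORCE gradient estimator
```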
Nice paper on Whisper adaptation to word lists
Code: https://github.com/BriansIDP/WhisperBiasing
https://arxiv.org/abs/2306.01942
Can Contextual Biasing Remain Effective with Whisper and GPT-2?
Guangzhi Sun, Xianrui Zheng, Chao Zhang, Philip C. Woodland
End-to-end automatic speech recognition (ASR) and large language models, such as Whisper and GPT-2, have recently been scaled to use vast amounts of training data. Despite the large amount of training data, infrequent content words that occur in a particular task may still exhibit poor ASR performance, with contextual biasing a possible remedy. This paper investigates the effectiveness of neural contextual biasing for Whisper combined with GPT-2. Specifically, this paper proposes integrating an adapted tree-constrained pointer generator (TCPGen) component for Whisper and a dedicated training scheme to dynamically adjust the final output without modifying any Whisper model parameters. Experiments across three datasets show a considerable reduction in errors on biasing words with a biasing list of 1000 words. Contextual biasing was more effective when applied to domain-specific data and can boost the performance of Whisper and GPT-2 without losing their generality.
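TCPGen steers decoding with a prefix tree (trie) built over the biasing list. A much simplified sketch of just that trie part (my own illustration; the actual Whisper/GPT-2 integration lives in the repo above):

```python
# Simplified illustration: a trie over the biasing list restricts which subword
# tokens the pointer generator is allowed to copy at each decoding step.
class TrieNode:
    def __init__(self):
        self.children = {}
        self.is_end = False

def build_trie(biasing_words, tokenize):
    root = TrieNode()
    for word in biasing_words:
        node = root
        for tok in tokenize(word):                        # subword tokens of the biasing word
            node = node.children.setdefault(tok, TrieNode())
        node.is_end = True
    return root

def allowed_next_tokens(node):
    # at decoding time, the pointer distribution is masked to these continuations
    return set(node.children.keys())
```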
Speech-to-Text Adapter and Speech-to-Entity Retriever Augmented LLMs for Speech Understanding
paper page: https://huggingface.co/papers/2306.07944
Large Language Models (LLMs) have been applied in the speech domain, often incurring a performance drop due to misalignment between speech and language representations. To bridge this gap, we propose a joint speech and language model (SLM) using a Speech2Text adapter, which maps speech into the text token embedding space without speech information loss. Additionally, using CTC-based blank filtering, we can reduce the speech sequence length to that of the text. On the speech MultiWOZ dataset (DSTC11 challenge), SLM largely improves the dialog state tracking (DST) performance (24.7% to 28.4% accuracy). Further, to address errors on rare entities, we augment SLM with a Speech2Entity retriever, which uses speech to retrieve relevant entities and then adds them to the original SLM input as a prefix. With this retrieval-augmented SLM (ReSLM), the DST performance jumps to 34.6% accuracy. Moreover, augmenting the ASR task with the dialog understanding task improves the ASR performance from 9.4% to 8.5% WER.
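The CTC-based blank filtering is what shrinks the speech sequence to roughly text length before it reaches the LLM. A minimal sketch under my assumptions about shapes and names (not the authors' code):

```python
import torch

def blank_filter(encoder_frames, ctc_logits, blank_id=0):
    # encoder_frames: (T, D) speech encoder outputs; ctc_logits: (T, V) CTC posteriors
    keep = ctc_logits.argmax(dim=-1) != blank_id   # True for frames predicting a non-blank token
    return encoder_frames[keep]                    # (T', D) with T' close to the text length
```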
https://github.com/gweltou/vosk-br
A nice Breton model implemented for Vosk. A very valuable contribution! Please don't hesitate to add a star to that project!
https://arxiv.org/abs/2306.07691
https://styletts2.github.io/
StyleTTS 2: Towards Human-Level Text-to-Speech through Style Diffusion and Adversarial Training with Large Speech Language Models
Yinghao Aaron Li, Cong Han, Vinay S. Raghavan, Gavin Mischler, Nima Mesgarani
In this paper, we present StyleTTS 2, a text-to-speech (TTS) model that leverages style diffusion and adversarial training with large speech language models (SLMs) to achieve human-level TTS synthesis. StyleTTS 2 differs from its predecessor by modeling styles as a latent random variable through diffusion models to generate the most suitable style for the text without requiring reference speech, achieving efficient latent diffusion while benefiting from the diverse speech synthesis offered by diffusion models. Furthermore, we employ large pre-trained SLMs, such as WavLM, as discriminators with our novel differentiable duration modeling for end-to-end training, resulting in improved speech naturalness. StyleTTS 2 surpasses human recordings on the single-speaker LJSpeech dataset and matches them on the multi-speaker VCTK dataset as judged by native English speakers. Moreover, when trained on the LibriTTS dataset, our model outperforms previous publicly available models for zero-shot speaker adaptation. This work achieves the first human-level TTS on both single- and multi-speaker datasets, showcasing the potential of style diffusion and adversarial training with large SLMs. The audio demos and source code are available at https://styletts2.github.io/.
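Structure-wise, inference needs no reference utterance because the style is a latent sampled from a text-conditioned diffusion model. A placeholder sketch of that flow (object names are mine, not the StyleTTS 2 API):

```python
# Placeholder sketch of the described inference flow, not the real implementation.
def styletts2_style_inference(text, style_diffusion, decoder, n_steps=5):
    style = style_diffusion.sample(condition=text, steps=n_steps)  # latent style vector for this text
    return decoder(text, style)                                    # synthesized waveform
```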
Voicebox: Text-Guided Multilingual Universal Speech Generation at Scale
https://ai.facebook.com/blog/voicebox-generative-ai-model-speech/
Introducing Voicebox: The first generative AI model for speech to generalize across tasks with state-of-the-art performance
Voicebox is a state-of-the-art speech generative model based on a new method proposed by Meta AI called Flow Matching. By learning to solve a text-guided speech infilling task with a large scale of data, Voicebox outperforms single-purpose AI models across…
GPT-4 is an ensemble
https://twitter.com/soumithchintala/status/1671267150101721090
we shall see llama ensembles soon
Soumith Chintala (@soumithchintala): "i might have heard the same 😃 -- I guess info like this is passed around but no one wants to say it out loud. GPT-4: 8 x 220B experts trained with different data/task distributions and 16-iter inference. Glad that Geohot said it out loud."
https://twitter.com/forthshinji/status/1672082306239176706
demo: https://aria-k-alethia.github.io/2023laughter-demo/
corpus: https://sites.google.com/site/shinnosuketakamichi/research-topics/laughter_corpus
source: https://github.com/Aria-K-Alethia/laughter-synthesis/
Have you ever heard predicted laughter voices? Check it!
We have released a demo, data, and code for artificial laughter synthesis!
https://arxiv.org/abs/2306.13114
https://github.com/aixplain/NoRefER
A Reference-less Quality Metric for Automatic Speech Recognition via Contrastive-Learning of a Multi-Language Model with Self-Supervision
Kamer Ali Yuksel, Thiago Ferreira, Ahmet Gunduz, Mohamed Al-Badrashiny, Golara Javadi
The common standard for quality evaluation of automatic speech recognition (ASR) systems is reference-based metrics such as the Word Error Rate (WER), computed using manual ground-truth transcriptions that are time-consuming and expensive to obtain. This work proposes a multi-language referenceless quality metric, which allows comparing the performance of different ASR models on a speech dataset without ground-truth transcriptions. To estimate the quality of ASR hypotheses, a pre-trained language model (LM) is fine-tuned with contrastive learning in a self-supervised manner. In experiments conducted on several unseen test datasets consisting of outputs from top commercial ASR engines in various languages, the proposed referenceless metric obtains a much higher correlation with WER scores and their ranks than the perplexity metric from the state-of-the-art multilingual LM in all experiments, and also reduces WER by more than 7% when used for ensembling hypotheses. The fine-tuned model and experiments are made available for reproducibility: https://github.com/aixplain/NoRefER
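The training signal is pairwise: the model should score the better of two hypotheses for the same audio higher. A hedged sketch of such a contrastive ranking loss (the exact loss and how the self-supervised pairs are built are my assumptions; the real code is in the repo above):

```python
import torch.nn.functional as F

def pairwise_ranking_loss(score_better, score_worse, margin=0.1):
    # score_*: scalar quality scores from the fine-tuned LM for two ASR hypotheses of
    # the same audio (e.g. an original hypothesis vs. a deliberately corrupted copy);
    # the metric is trained so the better hypothesis scores higher by at least `margin`.
    return F.relu(margin - (score_better - score_worse)).mean()
```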
Another semi-supervised approach, this time from Amazon: better teacher ensembling than ROVER
https://arxiv.org/abs/2306.12012
Learning When to Trust Which Teacher for Weakly Supervised ASR
Aakriti Agrawal, Milind Rao, Anit Kumar Sahu, Gopinath Chennupati, Andreas Stolcke
Automatic speech recognition (ASR) training can utilize multiple experts as teacher models, each trained on a specific domain or accent. Teacher models may be opaque in nature since their architecture may not be known or their training cadence is different from that of the student ASR model. Still, the student models are updated incrementally using the pseudo-labels generated independently by the expert teachers. In this paper, we exploit supervision from multiple domain experts in training student ASR models. This training strategy is especially useful in scenarios where few or no human transcriptions are available. To that end, we propose a Smart-Weighter mechanism that selects an appropriate expert based on the input audio, and then trains the student model in an unsupervised setting. We show the efficacy of our approach using the LibriSpeech and LibriLight benchmarks and find an improvement of 4 to 25% over baselines that uniformly weight all the experts, use a single expert model, or combine experts using ROVER.
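A toy sketch of what a Smart-Weighter-style mechanism could look like (module and function names are mine; the paper's exact architecture and training objective may differ): a small network maps the input audio to weights over the expert teachers, and the student trains on the pseudo-label of the highest-weighted teacher.

```python
import torch
import torch.nn as nn

class SmartWeighter(nn.Module):
    def __init__(self, audio_dim, n_teachers):
        super().__init__()
        self.scorer = nn.Sequential(nn.Linear(audio_dim, 128), nn.ReLU(),
                                    nn.Linear(128, n_teachers))

    def forward(self, audio_embedding):
        # weights over the expert teachers for this utterance
        return torch.softmax(self.scorer(audio_embedding), dim=-1)

def pick_pseudo_label(weighter, audio_embedding, teacher_hypotheses):
    weights = weighter(audio_embedding)
    return teacher_hypotheses[int(weights.argmax())]   # transcript used to train the student
```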
UnitSpeech: Speaker-adaptive Speech Synthesis with Untranscribed Data (INTERSPEECH 2023)
https://github.com/gmltmd789/UnitSpeech
Demo
https://unitspeech.github.io/