Very good TTS, diffusion is a thing
https://resgrad1.github.io/
https://arxiv.org/abs/2212.14518
ResGrad: Residual Denoising Diffusion Probabilistic Models for Text to Speech
Zehua Chen, Yihan Wu, Yichong Leng, Jiawei Chen, Haohe Liu, Xu Tan, Yang Cui, Ke Wang, Lei He, Sheng Zhao, Jiang Bian, Danilo Mandic
Denoising Diffusion Probabilistic Models (DDPMs) are emerging in text-to-speech (TTS) synthesis because of their strong capability of generating high-fidelity samples. However, their iterative refinement process in high-dimensional data space results in slow inference speed, which restricts their application in real-time systems. Previous works have explored speeding up inference by minimizing the number of steps, but at the cost of sample quality. In this work, to improve the inference speed of DDPM-based TTS models while achieving high sample quality, we propose ResGrad, a lightweight diffusion model which learns to refine the output spectrogram of an existing TTS model (e.g., FastSpeech 2) by predicting the residual between the model output and the corresponding ground-truth speech. ResGrad has several advantages: 1) Compared with other acceleration methods for DDPMs, which need to synthesize speech from scratch, ResGrad reduces the complexity of the task by changing the generation target from the ground-truth mel-spectrogram to the residual, resulting in a more lightweight model and thus a smaller real-time factor. 2) ResGrad is employed in the inference process of the existing TTS model in a plug-and-play way, without re-training this model. We verify ResGrad on the single-speaker dataset LJSpeech and two more challenging datasets with multiple speakers (LibriTTS) and a high sampling rate (VCTK). Experimental results show that, in comparison with other speed-up methods for DDPMs: 1) ResGrad achieves better sample quality at the same inference speed measured by real-time factor; 2) at similar speech quality, ResGrad synthesizes speech more than 10 times faster than baseline methods. Audio samples are available at https://resgrad1.github.io/.
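The core idea — generate the residual rather than the full spectrogram — can be sketched as follows. This is a toy illustration with random arrays standing in for mel-spectrograms, not the paper's implementation; the shapes and noise level are made up.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy stand-ins for a ground-truth mel-spectrogram and the output of an
# existing TTS model such as FastSpeech 2 (80 mel bins x 100 frames, made up).
gt_mel = rng.normal(size=(80, 100))
tts_mel = gt_mel + 0.3 * rng.normal(size=(80, 100))  # imperfect prediction

# ResGrad's generation target: the residual, which carries much less energy
# than the full spectrogram.
residual = gt_mel - tts_mel

# Standard DDPM forward process, applied to the residual:
#   x_t = sqrt(alpha_bar_t) * x_0 + sqrt(1 - alpha_bar_t) * eps
def diffuse(x0, alpha_bar, eps):
    return np.sqrt(alpha_bar) * x0 + np.sqrt(1.0 - alpha_bar) * eps

x_t = diffuse(residual, alpha_bar=0.5, eps=rng.normal(size=residual.shape))

# At inference, a trained denoiser would sample a residual estimate from noise;
# the refined output is just the TTS output plus that estimate (plug-and-play).
refined = tts_mel + residual  # with a perfect estimate this recovers gt_mel
```

Because the residual has smaller magnitude than the full spectrogram, a smaller denoiser and fewer steps suffice, which is where the real-time-factor gain comes from.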
Memory is a thing too
https://arxiv.org/abs/2301.00066v1
Memory Augmented Lookup Dictionary based Language Modeling for Automatic Speech Recognition
Yukun Feng, Ming Tu, Rui Xia, Chuanzeng Huang, Yuxuan Wang (Bytedance)
Recent studies have shown that using an external Language Model (LM) benefits end-to-end Automatic Speech Recognition (ASR). However, predicting tokens that appear less frequently in the training set is still quite challenging. Long-tail prediction problems have been widely studied in many applications, but have only been addressed by a few studies for ASR and LMs. In this paper, we propose a new memory-augmented, lookup-dictionary-based Transformer architecture for LMs. The newly introduced lookup dictionary incorporates rich contextual information from the training set, which is vital to correctly predict long-tail tokens. With intensive experiments on Chinese and English datasets, our proposed method is shown to outperform the baseline Transformer LM by a large margin on both word/character error rate and tail-token error rate. This is achieved without impacting decoding efficiency. Overall, we demonstrate the effectiveness of our proposed method in boosting ASR decoding performance, especially for long-tail tokens.
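A minimal sketch of the lookup-dictionary idea, in the spirit of kNN-LM-style retrieval: key the recent context, store next-token counts seen in training, and interpolate the retrieved distribution with the base LM. All names and the interpolation weight below are hypothetical, and a real implementation keys on neural context embeddings rather than raw n-grams.

```python
from collections import Counter, defaultdict

class LookupDictionary:
    """Maps a short context window to next-token counts seen in training."""

    def __init__(self, order=2):
        self.order = order
        self.table = defaultdict(Counter)

    def add(self, tokens):
        # Record every (context -> next token) pair in the training sequence.
        for i in range(self.order, len(tokens)):
            key = tuple(tokens[i - self.order:i])
            self.table[key][tokens[i]] += 1

    def probs(self, context):
        # Normalized next-token distribution for the current context, if any.
        counts = self.table.get(tuple(context[-self.order:]), Counter())
        total = sum(counts.values())
        return {tok: c / total for tok, c in counts.items()} if total else {}

def interpolate(lm_probs, dict_probs, lam=0.5):
    # Blend base-LM and dictionary distributions; fall back to the LM alone.
    if not dict_probs:
        return lm_probs
    vocab = set(lm_probs) | set(dict_probs)
    return {t: (1 - lam) * lm_probs.get(t, 0.0) + lam * dict_probs.get(t, 0.0)
            for t in vocab}

d = LookupDictionary(order=2)
d.add(["a", "b", "c", "a", "b", "d"])
blended = interpolate({"c": 1.0}, d.probs(["a", "b"]))
```

Rare tokens may be well represented in the dictionary even when the parametric LM underweights them, which is how the blend helps long-tail predictions.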
SLT2022 starts tomorrow, here is the technical program:
https://slt2022.org/technical-program.php#ASR
We have released a small Uzbek model for Vosk:
https://alphacephei.com/vosk/models/vosk-model-small-uz-0.22.zip
WER
13.54 (CommonVoice Test)
12.92 (IS2AI USC test)
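For reference, WER here is the standard word error rate: the word-level edit distance between hypothesis and reference, divided by the reference length (reported above as a percentage). A minimal implementation:

```python
def wer(ref: str, hyp: str) -> float:
    """Word error rate: Levenshtein distance over reference length."""
    r, h = ref.split(), hyp.split()
    # DP table of edit distances between prefixes of r and h.
    d = [[0] * (len(h) + 1) for _ in range(len(r) + 1)]
    for i in range(len(r) + 1):
        d[i][0] = i  # deleting all reference words
    for j in range(len(h) + 1):
        d[0][j] = j  # inserting all hypothesis words
    for i in range(1, len(r) + 1):
        for j in range(1, len(h) + 1):
            sub = d[i - 1][j - 1] + (r[i - 1] != h[j - 1])
            d[i][j] = min(sub, d[i - 1][j] + 1, d[i][j - 1] + 1)
    return d[len(r)][len(h)] / len(r)
```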
Speech Technology
S4 is promising for text; for audio, long-term dependencies are not that important. Or at least you need a special use case to demonstrate their importance (like a speaker change after 10 seconds). https://arxiv.org/abs/2210.17098 Structured State Space Decoder…
Implementation for the above
https://github.com/espnet/espnet/pull/4845
The LibriSpeech model trains on 4x Tesla A100 (40 GB) GPUs and takes about 2.5 days.
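For background, the heart of an S4-style layer is a discretized linear state-space recurrence. The sketch below is a naive per-step scan with made-up matrices; real S4 uses a structured, HiPPO-initialized state matrix and computes the same result as a convolution for speed.

```python
import numpy as np

# Discretized linear SSM: x_k = A x_{k-1} + B u_k,  y_k = C x_k.
# The state x lets information persist across many steps, which is what
# gives S4 its long-range modeling ability.
def ssm_scan(A, B, C, u):
    x = np.zeros(A.shape[0])
    ys = []
    for u_k in u:
        x = A @ x + B * u_k  # state update carries the past forward
        ys.append(C @ x)     # linear readout of the state
    return np.array(ys)

# Toy 2-state system driven by an impulse: the output decays geometrically,
# showing how a past input keeps influencing later steps.
ys = ssm_scan(np.diag([0.5, 0.9]), np.ones(2), np.ones(2), [1.0, 0.0, 0.0])
```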
Audio
Sounds pretty good and very fast to synthesize. Finally not an LJSpeech voice.
Looking at Hugging Face models. The impression is that the intention is to bury good models under thousands of bad ones.
Looking more at Hugging Face models. Most of them are fine-tuned without SpecAugment...
You see, one can tune a simple model better than a much more advanced Whisper model. Whisper is actually not great for non-English languages, and moreover it does not fine-tune very well.
https://alphacephei.com/nsh/2023/01/15/whisper-finetuning.html
Whisper is very popular these days, so here are some more observations on it.
Alibaba's FunASR recently had a very good Paraformer model release:
https://modelscope.cn/models/damo/speech_paraformer-large-vad-punc_asr_nat-zh-cn-16k-common-vocab8404-pytorch/summary
On semi-supervised models:
https://twitter.com/huckiyang/status/1615659564606656512
> My take is that joint supervised training and SSL loss is blooming now (e.g., JUST and Maestro) - it is essential to have a supervised encoder also.
https://arxiv.org/abs/2111.08137
https://arxiv.org/abs/2204.03409
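The joint objective mentioned in the quote amounts to a weighted sum of a supervised loss and an SSL loss over a shared encoder. A schematic (the weight and names are made up, not taken from JUST or Maestro):

```python
# Schematic joint objective for supervised + self-supervised training,
# in the spirit of JUST/Maestro-style recipes; ssl_weight is hypothetical.
def joint_loss(supervised_loss: float, ssl_loss: float,
               ssl_weight: float = 0.3) -> float:
    """Weighted sum optimized end-to-end over a shared encoder."""
    return supervised_loss + ssl_weight * ssl_loss
```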
https://twitter.com/lesterphv/status/1615990752403918850
> We are proud to announce the first Singing Voice Conversion Challenge (SVCC2023)! Building on the success of the previous VCC events, we plan to further push the limits of voice conversion by now focusing on singing voices, which is more difficult to model than speech.
http://www.vc-challenge.org/
ReazonSpeech
ReazonSpeech is a labeled Japanese speech corpus consisting of approximately 19,000 hours of broadcast speech. It was built to promote research on Japanese speech recognition technology.
In addition to the speech corpus, they have released a toolkit for building the corpus and a pre-trained model under a free license.
https://research.reazon.jp/projects/ReazonSpeech/index.html
Trained ESPnet model
Apache-2.0
https://huggingface.co/reason-research/reasonspeech-espnet-v1
Corpus building toolkit
Apache-2.0
https://github.com/reason-research/ReasonSpeech
Japanese speech corpus
CDLA-Sharing-1.0 (However, the purpose of use is limited to information analysis specified in Article 30-4 of the Copyright Act)
https://huggingface.co/datasets/reason-research/reasonspeech
Research paper
https://research.reazon.jp/_static/reazonspeech_nlp2023.pdf
https://twitter.com/huckiyang/status/1616343651344343046
From English to More Languages: Parameter-Efficient Model Reprogramming for Cross-Lingual Speech Recognition
https://arxiv.org/abs/2301.07851
Chao-Han Huck Yang, Bo Li, Yu Zhang, Nanxin Chen, Rohit Prabhavalkar, Tara N. Sainath, Trevor Strohman
In this work, we propose a new parameter-efficient learning framework based on neural model reprogramming for cross-lingual speech recognition, which can re-purpose well-trained English automatic speech recognition (ASR) models to recognize other languages. We design different auxiliary neural architectures focusing on learnable pre-trained feature enhancement that, for the first time, empowers model reprogramming on ASR. Specifically, we investigate how to select trainable components (i.e., the encoder) of a Conformer-based RNN-Transducer as a frozen pre-trained backbone. Experiments on a seven-language Multilingual LibriSpeech (MLS) task show that model reprogramming requires only 4.2% (11M out of 270M) to 6.8% (45M out of 660M) of the original trainable parameters of a full ASR model to achieve competitive results, in a range of 11.9% to 8.1% WER averaged across languages. In addition, we discover different setups to make large-scale pre-trained ASR succeed in both monolingual and multilingual speech recognition. Our methods outperform existing ASR tuning architectures and their extensions with self-supervised losses (e.g., w2v-BERT) in terms of lower WER and better training efficiency.
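The parameter-efficiency argument can be illustrated with a toy sketch: freeze a large pretrained encoder and train only a small input-side transform. Everything below (class names, the additive prompt, the sizes) is hypothetical; the paper reprograms a Conformer RNN-Transducer, not this toy.

```python
import numpy as np

rng = np.random.default_rng(0)

class FrozenEncoder:
    """Stand-in for a large pretrained backbone; weights are never updated."""

    def __init__(self, dim):
        self.W = rng.normal(size=(dim, dim))  # pretrained, frozen

    def __call__(self, feats):
        return np.tanh(feats @ self.W)

class Reprogrammer:
    """Small trainable input transform learned for the new language."""

    def __init__(self, dim):
        self.delta = np.zeros(dim)  # tiny additive prompt (the only trainables)

    def __call__(self, feats):
        return feats + self.delta   # learnable feature enhancement

dim = 8
encoder, reprog = FrozenEncoder(dim), Reprogrammer(dim)
out = encoder(reprog(rng.normal(size=(4, dim))))

# Only the reprogrammer's parameters would receive gradients.
frozen_params = encoder.W.size
trainable_params = reprog.delta.size
```

The trainable fraction here is delta's 8 parameters against the 64 frozen ones, mirroring (at toy scale) the paper's 4–7% trainable-parameter budgets.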
https://twitter.com/alex_conneau/status/1614014965496811520
FLEURS paper won the best paper award at SLT 2022!
@ieee_slt
SLT: https://slt2022.org/best-papers.php
arXiv: https://arxiv.org/abs/2205.12446
Recently I came across an ASR SaaS service, Soniox. Overall pretty nice: fast and clean UI, good transcription accuracy and features. 5 free hours a month per user.
I've also read their whitepaper, which has very cool results.
https://soniox.com/media/SonioxSpeechToTextBenchmarksNov2022.pdf
Judging from the whitepaper, every service is more or less the same: some better, some worse. I quickly ran a test with an audio broadcast file. Here are the results.
Service / mode / WER, %:
AssemblyAI stream 14.79
AWS stream 17.20
Azure stream 11.47
Deepgram stream 18.23
Google stream 15.48
Rev stream 17.09
Speechmatics stream 9.75
Soniox stream 12.73
AssemblyAI async 11.01
Rev async 15.25
Soniox async 11.81
Whisper Large-v2 async 8.94
Whisper Med.En async 9.29
NeMo RNNT async 19.61
Whisper really shines for English. As for the others, they are all more or less the same. Whitepapers are not very meaningful.
1ST DUTCH SPEECH TECH DAY
Monday, 20 February 2023
Location: Netherlands Institute for Sound & Vision, Hilversum
https://sites.google.com/view/dutchspeechtechday/home
We made a big test of available Russian models (in Russian)
https://alphacephei.com/nsh/2023/01/22/russian-models.html
In short: NeMo RNNT is good, Whisper is not very good for Russian (even adapted), Vosk is still not bad, and we are working to improve it.
Feels like Zipformer and the other -formers.
Paper:
https://arxiv.org/abs/2210.00077
E-Branchformer: Branchformer with Enhanced merging for speech recognition
Kwangyoun Kim, Felix Wu, Yifan Peng, Jing Pan, Prashant Sridhar, Kyu J. Han, Shinji Watanabe
Conformer, combining convolution and self-attention sequentially to capture both local and global information, has shown remarkable performance and is currently regarded as the state-of-the-art for automatic speech recognition (ASR). Several other studies have explored integrating convolution and self-attention but they have not managed to match Conformer's performance. The recently introduced Branchformer achieves comparable performance to Conformer by using dedicated branches of convolution and self-attention and merging local and global context from each branch. In this paper, we propose E-Branchformer, which enhances Branchformer by applying an effective merging method and stacking additional point-wise modules. E-Branchformer sets new state-of-the-art word error rates (WERs) of 1.81% and 3.65% on the LibriSpeech test-clean and test-other sets without using any external training data.
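A toy sketch of the two-branch idea, with simple stand-ins for the attention and convolution branches and a plain concat-plus-projection merge. E-Branchformer's actual enhanced merge additionally applies a depth-wise convolution over the concatenated branches before the projection; the stand-ins and shapes here are made up.

```python
import numpy as np

rng = np.random.default_rng(0)

def global_branch(x):
    # Stand-in for self-attention: mix information across all time steps
    # (uniform attention, so every output frame is the sequence mean).
    attn = np.ones((x.shape[0], x.shape[0])) / x.shape[0]
    return attn @ x

def local_branch(x):
    # Stand-in for the convolutional (cgMLP) branch: mix neighbours only.
    y = np.copy(x)
    y[1:] += x[:-1]
    return y / 2

def merge(x, W):
    # Branchformer-style merge: concatenate both branches along the feature
    # axis and project back to the model dimension with a learned matrix.
    cat = np.concatenate([global_branch(x), local_branch(x)], axis=-1)
    return cat @ W

T, D = 5, 4
x = rng.normal(size=(T, D))
W = rng.normal(size=(2 * D, D))  # learned projection (random here)
y = merge(x, W)
```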