Speech Technology
Very good TTS, diffusion is a thing

https://resgrad1.github.io/

https://arxiv.org/abs/2212.14518

ResGrad: Residual Denoising Diffusion Probabilistic Models for Text to Speech

Zehua Chen, Yihan Wu, Yichong Leng, Jiawei Chen, Haohe Liu, Xu Tan, Yang Cui, Ke Wang, Lei He, Sheng Zhao, Jiang Bian, Danilo Mandic

Denoising Diffusion Probabilistic Models (DDPMs) are emerging in text-to-speech (TTS) synthesis because of their strong capability of generating high-fidelity samples. However, their iterative refinement process in high-dimensional data space results in slow inference speed, which restricts their application in real-time systems. Previous works have explored speeding up inference by minimizing the number of inference steps, but at the cost of sample quality. In this work, to improve the inference speed of DDPM-based TTS models while achieving high sample quality, we propose ResGrad, a lightweight diffusion model which learns to refine the output spectrogram of an existing TTS model (e.g., FastSpeech 2) by predicting the residual between the model output and the corresponding ground-truth speech. ResGrad has several advantages: 1) Compared with other acceleration methods for DDPMs, which need to synthesize speech from scratch, ResGrad reduces the complexity of the task by changing the generation target from the ground-truth mel-spectrogram to the residual, resulting in a more lightweight model and thus a smaller real-time factor. 2) ResGrad is employed in the inference process of the existing TTS model in a plug-and-play way, without re-training this model. We verify ResGrad on the single-speaker dataset LJSpeech and two more challenging datasets with multiple speakers (LibriTTS) and a high sampling rate (VCTK). Experimental results show that, in comparison with other speed-up methods for DDPMs: 1) ResGrad achieves better sample quality at the same inference speed measured by real-time factor; 2) with similar speech quality, ResGrad synthesizes speech more than 10 times faster than baseline methods. Audio samples are available at this https URL.
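The residual idea is simple enough to sketch. A minimal illustration (my own, not the authors' code; the function names are hypothetical), assuming element-wise residuals over mel-spectrogram values:

```python
# Sketch of the ResGrad residual-refinement idea: the diffusion model's
# training target is the residual between the ground-truth mel-spectrogram
# and an existing TTS model's output; at inference, the predicted residual
# is added back onto the coarse spectrogram.

def residual_target(ground_truth_mel, tts_output_mel):
    """Training target: residual = ground truth - TTS output (element-wise)."""
    return [gt - out for gt, out in zip(ground_truth_mel, tts_output_mel)]

def refine(tts_output_mel, predicted_residual):
    """Inference: refined spectrogram = coarse output + predicted residual."""
    return [out + res for out, res in zip(tts_output_mel, predicted_residual)]

# Toy 1-D "spectrogram" values for illustration.
gt = [0.9, 0.5, 0.2]
coarse = [0.8, 0.6, 0.1]
target = residual_target(gt, coarse)   # what the diffusion model learns
refined = refine(coarse, target)       # a perfect prediction recovers gt
```

Because the residual is much closer to zero than a full spectrogram, the refinement model can stay small, which is where the smaller real-time factor comes from.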
Memory is a thing too

https://arxiv.org/abs/2301.00066v1

Memory Augmented Lookup Dictionary based Language Modeling for Automatic Speech Recognition

Yukun Feng, Ming Tu, Rui Xia, Chuanzeng Huang, Yuxuan Wang (Bytedance)

Recent studies have shown that using an external Language Model (LM) benefits end-to-end Automatic Speech Recognition (ASR). However, predicting tokens that appear less frequently in the training set is still quite challenging. Long-tail prediction problems have been widely studied in many applications, but have only been addressed by a few studies for ASR and LMs. In this paper, we propose a new memory augmented lookup dictionary based Transformer architecture for LM. The newly introduced lookup dictionary incorporates rich contextual information from the training set, which is vital to correctly predicting long-tail tokens. With extensive experiments on Chinese and English data sets, our proposed method is shown to outperform the baseline Transformer LM by a large margin on both word/character error rate and tail token error rate. This is achieved without impact on decoding efficiency. Overall, we demonstrate the effectiveness of our proposed method in boosting ASR decoding performance, especially for long-tail tokens.
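The paper's exact architecture isn't reproduced here, but the general lookup-dictionary idea can be sketched as a count table keyed by recent context, interpolated with the base LM's distribution (a kNN-LM-style simplification of my own; the class and parameter names are hypothetical):

```python
# Sketch of a context-keyed lookup dictionary interpolated with a base LM.
# Retrieval from training-set counts can boost long-tail tokens that a
# parametric LM under-predicts.
from collections import defaultdict

class LookupDictionaryLM:
    def __init__(self, base_lm_probs, alpha=0.5, context_len=2):
        self.base = base_lm_probs   # callable: context -> {token: prob}
        self.alpha = alpha          # interpolation weight for the lookup
        self.context_len = context_len
        self.table = defaultdict(lambda: defaultdict(int))

    def build(self, corpus_tokens):
        """Populate the dictionary from the training corpus."""
        n = self.context_len
        for i in range(n, len(corpus_tokens)):
            key = tuple(corpus_tokens[i - n:i])
            self.table[key][corpus_tokens[i]] += 1

    def prob(self, context, token):
        key = tuple(context[-self.context_len:])
        counts = self.table.get(key)
        lookup_p = counts.get(token, 0) / sum(counts.values()) if counts else 0.0
        base_p = self.base(context).get(token, 0.0)
        return self.alpha * lookup_p + (1 - self.alpha) * base_p

uniform = lambda context: {"a": 1 / 3, "b": 1 / 3, "c": 1 / 3}
lm = LookupDictionaryLM(uniform, alpha=0.5)
lm.build(["a", "b", "c", "a", "b", "c"])
p_tail = lm.prob(["a", "b"], "c")  # dictionary boosts the seen continuation
```

In the real model the keys would be learned context embeddings rather than exact token tuples, but the interpolation principle is the same.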
SLT2022 starts tomorrow, here is the technical program:

https://slt2022.org/technical-program.php#ASR
We have released a small Uzbek model for Vosk

https://alphacephei.com/vosk/models/vosk-model-small-uz-0.22.zip

WER

13.54 (CommonVoice Test)
12.92 (IS2AI USC test)
[Audio attachment]
Sounds pretty good and very fast to synthesize. Finally, not an LJSpeech voice.
Looking at Huggingface models. The impression is that the intention is to bury good models under thousands of bad ones.
Looking more at Huggingface models. Most of them are fine-tuned without SpecAugment...
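For reference, SpecAugment itself is cheap to apply during fine-tuning. A minimal sketch of the time/frequency masking (no time warping; written from the published recipe, not from any particular fine-tuning codebase):

```python
# Minimal SpecAugment-style masking: zero out random frequency bands and
# random runs of time frames in a mel-spectrogram, given as a list of
# frames (time axis), each a list of mel-bin values (frequency axis).
import random

def spec_augment(spec, freq_mask=8, time_mask=10, n_masks=2, rng=random):
    """Return a copy of `spec` with random frequency and time bands zeroed."""
    out = [frame[:] for frame in spec]
    n_time, n_freq = len(out), len(out[0])
    for _ in range(n_masks):
        # Frequency mask: zero a band of mel bins across all frames.
        f = rng.randint(0, min(freq_mask, n_freq))
        f0 = rng.randint(0, n_freq - f)
        for frame in out:
            for j in range(f0, f0 + f):
                frame[j] = 0.0
        # Time mask: zero a band of whole frames.
        t = rng.randint(0, min(time_mask, n_time))
        t0 = rng.randint(0, n_time - t)
        for i in range(t0, t0 + t):
            out[i] = [0.0] * n_freq
    return out
```

Mask widths here are arbitrary placeholders; real recipes tune them to the model and data.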
You see, one can tune a simple model better than a much more advanced Whisper model. Whisper is actually not great for non-English languages, and moreover it does not fine-tune very well.

https://alphacephei.com/nsh/2023/01/15/whisper-finetuning.html
on semi-supervised models

https://twitter.com/huckiyang/status/1615659564606656512

> My take is that joint supervised training and SSL loss is blooming now (e.g., JUST and Maestro) - it is essential to have a supervised encoder also.

https://arxiv.org/abs/2111.08137

https://arxiv.org/abs/2204.03409
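Illustratively, "joint supervised training and SSL loss" just means one encoder optimized on a weighted sum of the two objectives (a sketch of the general idea only; JUST and Maestro each define their own specific losses and weighting schemes):

```python
# One encoder, one total objective: a supervised ASR loss plus a weighted
# self-supervised loss computed on the same (or additional unlabeled) audio.

def joint_loss(supervised_loss, ssl_loss, ssl_weight=0.5):
    """Total training objective: L = L_sup + w * L_ssl."""
    return supervised_loss + ssl_weight * ssl_loss

total = joint_loss(supervised_loss=1.2, ssl_loss=0.8)
```

The point of the quoted take is that keeping the supervised term in the mix, rather than pre-training on SSL alone, is what matters.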
https://twitter.com/lesterphv/status/1615990752403918850

> We are proud to announce the first Singing Voice Conversion Challenge (SVCC2023)! Building on the success of the previous VCC events, we plan to further push the limits of voice conversion by now focusing on singing voices, which is more difficult to model than speech.

http://www.vc-challenge.org/
ReazonSpeech

ReazonSpeech is a labeled Japanese speech corpus consisting of approximately 19,000 hours of broadcast speech. It was built for the purpose of promoting research on Japanese speech recognition technology.

In addition to the speech corpus, we have released a toolkit for building the corpus and a pre-trained model under a free license.

https://research.reazon.jp/projects/ReazonSpeech/index.html

Trained ESPnet model

Apache-2.0

https://huggingface.co/reazon-research/reazonspeech-espnet-v1

Corpus building toolkit

Apache-2.0

https://github.com/reazon-research/ReazonSpeech

Japanese speech corpus

CDLA-Sharing-1.0 (However, the purpose of use is limited to information analysis specified in Article 30-4 of the Copyright Act)

https://huggingface.co/datasets/reazon-research/reazonspeech

Research paper

https://research.reazon.jp/_static/reazonspeech_nlp2023.pdf
https://twitter.com/huckiyang/status/1616343651344343046

https://arxiv.org/abs/2301.07851

From English to More Languages: Parameter-Efficient Model Reprogramming for Cross-Lingual Speech Recognition

Chao-Han Huck Yang, Bo Li, Yu Zhang, Nanxin Chen, Rohit Prabhavalkar, Tara N. Sainath, Trevor Strohman

In this work, we propose a new parameter-efficient learning framework based on neural model reprogramming for cross-lingual speech recognition, which can re-purpose well-trained English automatic speech recognition (ASR) models to recognize other languages. We design different auxiliary neural architectures focusing on learnable pre-trained feature enhancement that, for the first time, empowers model reprogramming on ASR. Specifically, we investigate how to select trainable components (i.e., the encoder) of a conformer-based RNN-Transducer, as a frozen pre-trained backbone. Experiments on a seven-language multilingual LibriSpeech (MLS) task show that model reprogramming only requires 4.2% (11M out of 270M) to 6.8% (45M out of 660M) of the original trainable parameters of a full ASR model to achieve competitive results, in a range of 11.9% to 8.1% WER averaged across different languages. In addition, we discover different setups to make large-scale pre-trained ASR succeed in both monolingual and multilingual speech recognition. Our methods outperform existing ASR tuning architectures and their extension with self-supervised losses (e.g., w2v-bert) in terms of lower WER and better training efficiency.
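The reprogramming setup can be sketched in a few lines (my own simplification, not the paper's code): the pre-trained English backbone is frozen, and only a small input-side module is trained, which is why the trainable fraction stays around 4-7% of the full model.

```python
# Sketch of model reprogramming for ASR: a trainable input transformation
# feeds a frozen pre-trained backbone. Here the transformation is a simple
# learned additive perturbation; real designs use small neural modules.

def reprogram(features, theta):
    """Trainable pre-processing applied before the frozen backbone."""
    return [x + t for x, t in zip(features, theta)]

def frozen_backbone(features):
    """Stand-in for the frozen English ASR encoder (never updated)."""
    return [2 * x for x in features]  # dummy transform for illustration

# Only `theta` would receive gradient updates during cross-lingual adaptation.
theta = [0.1, -0.2, 0.0]
out = frozen_backbone(reprogram([1.0, 2.0, 3.0], theta))
```

With, say, 11M trainable parameters against a 270M-parameter backbone, the adapted part is about 4% of the model, matching the fractions quoted in the abstract.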
Recently I came across an ASR SaaS service, Soniox. Overall pretty nice: fast and clean UI, good transcription accuracy and features. 5 hours a month free per user.

I've also read their whitepaper, which has very cool results.

https://soniox.com/media/SonioxSpeechToTextBenchmarksNov2022.pdf

Well, from the whitepaper every service is more or less the same, some better, some worse. I quickly made a test with an audio broadcast file. Here are the results (WER):

AssemblyAI stream 14.79
AWS stream 17.20
Azure stream 11.47
Deepgram stream 18.23
Google stream 15.48
Rev stream 17.09
Speechmatics stream 9.75
Soniox stream 12.73

Assembly async 11.01
Rev async 15.25
Soniox async 11.81
Whisper Largev2 async 8.94
Whisper Med.En async 9.29
Nemo RNNT async 19.61

Whisper really shines for English. As for the others, they are all more or less the same. Whitepapers are not very meaningful.
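The numbers above were presumably produced with a standard scoring tool; for reference, word error rate is just word-level edit distance (substitutions + insertions + deletions) over the reference length, in percent:

```python
# Minimal reference WER implementation via word-level Levenshtein distance.

def wer(reference, hypothesis):
    ref, hyp = reference.split(), hypothesis.split()
    # DP table: d[i][j] = edit distance between ref[:i] and hyp[:j].
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1,         # insertion
                          d[i - 1][j - 1] + cost)  # substitution / match
    return 100.0 * d[-1][-1] / len(ref)

example = wer("the cat sat on the mat", "the cat sat mat")  # two deletions
```

Production scoring additionally normalizes text (casing, punctuation, numbers) before comparison, which can shift results by a point or two.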
We made a big test of available Russian models (in Russian)

https://alphacephei.com/nsh/2023/01/22/russian-models.html

In short: Nemo RNNT is good; Whisper is not very good for Russian, even when adapted; Vosk is still not bad, and we are working to improve it.
Feels like Zipformer and other formers

Paper:

https://arxiv.org/abs/2210.00077

E-Branchformer: Branchformer with Enhanced merging for speech recognition

Kwangyoun Kim, Felix Wu, Yifan Peng, Jing Pan, Prashant Sridhar, Kyu J. Han, Shinji Watanabe

Conformer, combining convolution and self-attention sequentially to capture both local and global information, has shown remarkable performance and is currently regarded as the state-of-the-art for automatic speech recognition (ASR). Several other studies have explored integrating convolution and self-attention, but they have not managed to match Conformer's performance. The recently introduced Branchformer achieves comparable performance to Conformer by using dedicated branches of convolution and self-attention and merging local and global context from each branch. In this paper, we propose E-Branchformer, which enhances Branchformer by applying an effective merging method and stacking additional point-wise modules. E-Branchformer sets new state-of-the-art word error rates (WERs) of 1.81% and 3.65% on the LibriSpeech test-clean and test-other sets without using any external training data.
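The basic branch-merging step is easy to sketch (my own illustration, not the authors' code): each frame has a local (convolution) branch output and a global (attention) branch output, which are concatenated and projected back to the model dimension. E-Branchformer's enhanced merge additionally applies a depth-wise convolution in this step, which is omitted here.

```python
# Per-frame ("point-wise") merge of two branch outputs via concatenation
# followed by a linear projection back to the model dimension d.

def pointwise_merge(local_branch, global_branch, weight, bias):
    """local_branch, global_branch: lists of frames, each a list of d floats.
    weight: d rows of length 2d; bias: d floats."""
    merged = []
    for loc, glo in zip(local_branch, global_branch):
        cat = loc + glo  # concatenate along the feature dimension
        merged.append([
            sum(w * x for w, x in zip(row, cat)) + b
            for row, b in zip(weight, bias)
        ])
    return merged

# One frame with d=2: local features [1, 2], global features [3, 4].
out = pointwise_merge([[1.0, 2.0]], [[3.0, 4.0]],
                      weight=[[1, 0, 1, 0], [0, 1, 0, 1]], bias=[0.0, 0.0])
```

In the real model the projection weights are learned; the example weight simply sums the two branches to keep the arithmetic obvious.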