Very good TTS, diffusion is a thing
https://resgrad1.github.io/
https://arxiv.org/abs/2212.14518
ResGrad: Residual Denoising Diffusion Probabilistic Models for Text to Speech
Zehua Chen, Yihan Wu, Yichong Leng, Jiawei Chen, Haohe Liu, Xu Tan, Yang Cui, Ke Wang, Lei He, Sheng Zhao, Jiang Bian, Danilo Mandic
Denoising Diffusion Probabilistic Models (DDPMs) are emerging in text-to-speech (TTS) synthesis because of their strong capability of generating high-fidelity samples. However, their iterative refinement process in high-dimensional data space results in slow inference speed, which restricts their application in real-time systems. Previous works have explored speeding up inference by minimizing the number of steps, but at the cost of sample quality. In this work, to improve the inference speed of DDPM-based TTS models while achieving high sample quality, we propose ResGrad, a lightweight diffusion model which learns to refine the output spectrogram of an existing TTS model (e.g., FastSpeech 2) by predicting the residual between the model output and the corresponding ground-truth speech. ResGrad has several advantages: 1) Compared with other acceleration methods for DDPMs, which need to synthesize speech from scratch, ResGrad reduces the complexity of the task by changing the generation target from the ground-truth mel-spectrogram to the residual, resulting in a more lightweight model and thus a smaller real-time factor. 2) ResGrad is employed in the inference process of the existing TTS model in a plug-and-play way, without re-training this model. We verify ResGrad on the single-speaker dataset LJSpeech and two more challenging datasets with multiple speakers (LibriTTS) and a high sampling rate (VCTK). Experimental results show that, in comparison with other speed-up methods for DDPMs: 1) ResGrad achieves better sample quality at the same inference speed measured by real-time factor; 2) at similar speech quality, ResGrad synthesizes speech more than 10 times faster than baseline methods. Audio samples are available at https://resgrad1.github.io/.
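The core idea — generate the residual rather than the full spectrogram — can be sketched as follows. This is a toy illustration with random arrays standing in for mel-spectrograms, not the paper's implementation; the shapes and noise level are made up.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy stand-ins for a ground-truth mel-spectrogram and the output of an
# existing TTS model such as FastSpeech 2 (80 mel bins x 100 frames, made up).
gt_mel = rng.normal(size=(80, 100))
tts_mel = gt_mel + 0.3 * rng.normal(size=(80, 100))  # imperfect prediction

# ResGrad's generation target: the residual, which carries much less energy
# than the full spectrogram.
residual = gt_mel - tts_mel

# Standard DDPM forward process, applied to the residual:
#   x_t = sqrt(alpha_bar_t) * x_0 + sqrt(1 - alpha_bar_t) * eps
def diffuse(x0, alpha_bar, eps):
    return np.sqrt(alpha_bar) * x0 + np.sqrt(1.0 - alpha_bar) * eps

x_t = diffuse(residual, alpha_bar=0.5, eps=rng.normal(size=residual.shape))

# At inference, a trained denoiser would sample a residual estimate from noise;
# the refined output is just the TTS output plus that estimate (plug-and-play).
refined = tts_mel + residual  # with a perfect estimate this recovers gt_mel
```

Because the residual has smaller magnitude than the full spectrogram, a smaller denoiser and fewer steps suffice, which is where the real-time-factor gain comes from.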
Memory is a thing too
https://arxiv.org/abs/2301.00066v1
Memory Augmented Lookup Dictionary based Language Modeling for Automatic Speech Recognition
Yukun Feng, Ming Tu, Rui Xia, Chuanzeng Huang, Yuxuan Wang (Bytedance)
Recent studies have shown that using an external Language Model (LM) benefits end-to-end Automatic Speech Recognition (ASR). However, predicting tokens that appear less frequently in the training set is still quite challenging. Long-tail prediction problems have been widely studied in many applications, but have only been addressed by a few studies for ASR and LMs. In this paper, we propose a new memory-augmented, lookup-dictionary-based Transformer architecture for LMs. The newly introduced lookup dictionary incorporates rich contextual information from the training set, which is vital to correctly predict long-tail tokens. With intensive experiments on Chinese and English datasets, our proposed method is shown to outperform the baseline Transformer LM by a large margin on both word/character error rate and tail-token error rate. This is achieved without impacting decoding efficiency. Overall, we demonstrate the effectiveness of our proposed method in boosting ASR decoding performance, especially for long-tail tokens.
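A minimal sketch of the lookup-dictionary idea, in the spirit of kNN-LM-style retrieval: key the recent context, store next-token counts seen in training, and interpolate the retrieved distribution with the base LM. All names and the interpolation weight below are hypothetical, and a real implementation keys on neural context embeddings rather than raw n-grams.

```python
from collections import Counter, defaultdict

class LookupDictionary:
    """Maps a short context window to next-token counts seen in training."""

    def __init__(self, order=2):
        self.order = order
        self.table = defaultdict(Counter)

    def add(self, tokens):
        # Record every (context -> next token) pair in the training sequence.
        for i in range(self.order, len(tokens)):
            key = tuple(tokens[i - self.order:i])
            self.table[key][tokens[i]] += 1

    def probs(self, context):
        # Normalized next-token distribution for the current context, if any.
        counts = self.table.get(tuple(context[-self.order:]), Counter())
        total = sum(counts.values())
        return {tok: c / total for tok, c in counts.items()} if total else {}

def interpolate(lm_probs, dict_probs, lam=0.5):
    # Blend base-LM and dictionary distributions; fall back to the LM alone.
    if not dict_probs:
        return lm_probs
    vocab = set(lm_probs) | set(dict_probs)
    return {t: (1 - lam) * lm_probs.get(t, 0.0) + lam * dict_probs.get(t, 0.0)
            for t in vocab}

d = LookupDictionary(order=2)
d.add(["a", "b", "c", "a", "b", "d"])
blended = interpolate({"c": 1.0}, d.probs(["a", "b"]))
```

Rare tokens may be well represented in the dictionary even when the parametric LM underweights them, which is how the blend helps long-tail predictions.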
SLT2022 starts tomorrow, here is the technical program:
https://slt2022.org/technical-program.php#ASR
We have released a small Uzbek model for Vosk:
https://alphacephei.com/vosk/models/vosk-model-small-uz-0.22.zip
WER
13.54 (CommonVoice Test)
12.92 (IS2AI USC test)
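For reference, WER here is the standard word error rate: the word-level edit distance between hypothesis and reference, divided by the reference length (reported above as a percentage). A minimal implementation:

```python
def wer(ref: str, hyp: str) -> float:
    """Word error rate: Levenshtein distance over reference length."""
    r, h = ref.split(), hyp.split()
    # DP table of edit distances between prefixes of r and h.
    d = [[0] * (len(h) + 1) for _ in range(len(r) + 1)]
    for i in range(len(r) + 1):
        d[i][0] = i  # deleting all reference words
    for j in range(len(h) + 1):
        d[0][j] = j  # inserting all hypothesis words
    for i in range(1, len(r) + 1):
        for j in range(1, len(h) + 1):
            sub = d[i - 1][j - 1] + (r[i - 1] != h[j - 1])
            d[i][j] = min(sub, d[i - 1][j] + 1, d[i][j - 1] + 1)
    return d[len(r)][len(h)] / len(r)
```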
Speech Technology
S4 is promising for text; for audio, long-term dependencies are not that important. Or at least you need a special use case to demonstrate their importance (like a speaker change after 10 seconds). https://arxiv.org/abs/2210.17098 Structured State Space Decoder…
Implementation for the above
https://github.com/espnet/espnet/pull/4845
The LibriSpeech model trains on 4x Tesla A100 (40 GB) GPUs and takes about 2.5 days.
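For background, the heart of an S4-style layer is a discretized linear state-space recurrence. The sketch below is a naive per-step scan with made-up matrices; real S4 uses a structured, HiPPO-initialized state matrix and computes the same result as a convolution for speed.

```python
import numpy as np

# Discretized linear SSM: x_k = A x_{k-1} + B u_k,  y_k = C x_k.
# The state x lets information persist across many steps, which is what
# gives S4 its long-range modeling ability.
def ssm_scan(A, B, C, u):
    x = np.zeros(A.shape[0])
    ys = []
    for u_k in u:
        x = A @ x + B * u_k  # state update carries the past forward
        ys.append(C @ x)     # linear readout of the state
    return np.array(ys)

# Toy 2-state system driven by an impulse: the output decays geometrically,
# showing how a past input keeps influencing later steps.
ys = ssm_scan(np.diag([0.5, 0.9]), np.ones(2), np.ones(2), [1.0, 0.0, 0.0])
```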
Audio
Sounds pretty good and very fast to synthesize. Finally not an LJSpeech voice.
Looking at Hugging Face models. The impression is that the intention is to bury good models under thousands of bad ones.
Looking more at Hugging Face models. Most of them are fine-tuned without SpecAugment...
You see, one can tune a simple model better than a much more advanced Whisper model. Whisper is actually not great for non-English languages, and moreover it does not fine-tune very well.
https://alphacephei.com/nsh/2023/01/15/whisper-finetuning.html
Whisper is very popular these days, so here are some more observations on it.
Alibaba's FunASR recently had a very good Paraformer model release:
https://modelscope.cn/models/damo/speech_paraformer-large-vad-punc_asr_nat-zh-cn-16k-common-vocab8404-pytorch/summary
On semi-supervised models:
https://twitter.com/huckiyang/status/1615659564606656512
> My take is that joint supervised training and SSL loss is blooming now (e.g., JUST and Maestro) - it is essential to have a supervised encoder also.
https://arxiv.org/abs/2111.08137
https://arxiv.org/abs/2204.03409
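The joint objective mentioned in the quote amounts to a weighted sum of a supervised loss and an SSL loss over a shared encoder. A schematic (the weight and names are made up, not taken from JUST or Maestro):

```python
# Schematic joint objective for supervised + self-supervised training,
# in the spirit of JUST/Maestro-style recipes; ssl_weight is hypothetical.
def joint_loss(supervised_loss: float, ssl_loss: float,
               ssl_weight: float = 0.3) -> float:
    """Weighted sum optimized end-to-end over a shared encoder."""
    return supervised_loss + ssl_weight * ssl_loss
```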
https://twitter.com/lesterphv/status/1615990752403918850
> We are proud to announce the first Singing Voice Conversion Challenge (SVCC2023)! Building on the success of the previous VCC events, we plan to further push the limits of voice conversion by now focusing on singing voices, which is more difficult to model than speech.
http://www.vc-challenge.org/
ReazonSpeech
ReazonSpeech is a labeled Japanese speech corpus consisting of approximately 19,000 hours of broadcast speech. It was built to promote research on Japanese speech recognition technology.
In addition to the speech corpus, they have released a toolkit for building the corpus and a pre-trained model under a free license.
https://research.reazon.jp/projects/ReazonSpeech/index.html
Trained ESPnet model
Apache-2.0
https://huggingface.co/reason-research/reasonspeech-espnet-v1
Corpus building toolkit
Apache-2.0
https://github.com/reason-research/ReasonSpeech
Japanese speech corpus
CDLA-Sharing-1.0 (However, the purpose of use is limited to information analysis specified in Article 30-4 of the Copyright Act)
https://huggingface.co/datasets/reason-research/reasonspeech
Research paper
https://research.reazon.jp/_static/reazonspeech_nlp2023.pdf
https://twitter.com/huckiyang/status/1616343651344343046
From English to More Languages: Parameter-Efficient Model Reprogramming for Cross-Lingual Speech Recognition
https://arxiv.org/abs/2301.07851
Chao-Han Huck Yang, Bo Li, Yu Zhang, Nanxin Chen, Rohit Prabhavalkar, Tara N. Sainath, Trevor Strohman
In this work, we propose a new parameter-efficient learning framework based on neural model reprogramming for cross-lingual speech recognition, which can re-purpose well-trained English automatic speech recognition (ASR) models to recognize other languages. We design different auxiliary neural architectures focusing on learnable pre-trained feature enhancement that, for the first time, empowers model reprogramming on ASR. Specifically, we investigate how to select trainable components (i.e., the encoder) of a Conformer-based RNN-Transducer as a frozen pre-trained backbone. Experiments on a seven-language Multilingual LibriSpeech (MLS) task show that model reprogramming requires only 4.2% (11M out of 270M) to 6.8% (45M out of 660M) of the original trainable parameters of a full ASR model to achieve competitive results, in a range of 11.9% to 8.1% WER averaged across languages. In addition, we discover different setups to make large-scale pre-trained ASR succeed in both monolingual and multilingual speech recognition. Our methods outperform existing ASR tuning architectures and their extensions with self-supervised losses (e.g., w2v-BERT) in terms of lower WER and better training efficiency.
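The parameter-efficiency argument can be illustrated with a toy sketch: freeze a large pretrained encoder and train only a small input-side transform. Everything below (class names, the additive prompt, the sizes) is hypothetical; the paper reprograms a Conformer RNN-Transducer, not this toy.

```python
import numpy as np

rng = np.random.default_rng(0)

class FrozenEncoder:
    """Stand-in for a large pretrained backbone; weights are never updated."""

    def __init__(self, dim):
        self.W = rng.normal(size=(dim, dim))  # pretrained, frozen

    def __call__(self, feats):
        return np.tanh(feats @ self.W)

class Reprogrammer:
    """Small trainable input transform learned for the new language."""

    def __init__(self, dim):
        self.delta = np.zeros(dim)  # tiny additive prompt (the only trainables)

    def __call__(self, feats):
        return feats + self.delta   # learnable feature enhancement

dim = 8
encoder, reprog = FrozenEncoder(dim), Reprogrammer(dim)
out = encoder(reprog(rng.normal(size=(4, dim))))

# Only the reprogrammer's parameters would receive gradients.
frozen_params = encoder.W.size
trainable_params = reprog.delta.size
```

The trainable fraction here is delta's 8 parameters against the 64 frozen ones, mirroring (at toy scale) the paper's 4–7% trainable-parameter budgets.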
https://twitter.com/alex_conneau/status/1614014965496811520
FLEURS paper won the best paper award at SLT 2022!
@ieee_slt
SLT: https://slt2022.org/best-papers.php
arXiv: https://arxiv.org/abs/2205.12446
Recently I came across an ASR SaaS service, Soniox. Overall pretty nice: fast and clean UI, good transcription accuracy and features. 5 free hours a month per user.
I've also read their whitepaper, which has very cool results.
https://soniox.com/media/SonioxSpeechToTextBenchmarksNov2022.pdf
Judging from the whitepaper, every service is more or less the same: some better, some worse. I quickly ran a test with an audio broadcast file. Here are the results.
Service / mode / WER, %:
AssemblyAI stream 14.79
AWS stream 17.20
Azure stream 11.47
Deepgram stream 18.23
Google stream 15.48
Rev stream 17.09
Speechmatics stream 9.75
Soniox stream 12.73
AssemblyAI async 11.01
Rev async 15.25
Soniox async 11.81
Whisper Large-v2 async 8.94
Whisper Med.En async 9.29
NeMo RNNT async 19.61
Whisper really shines for English. As for the others, they are all more or less the same. Whitepapers are not very meaningful.
1ST DUTCH SPEECH TECH DAY
Monday, 20 February 2023
Location: Netherlands Institute for Sound & Vision, Hilversum
https://sites.google.com/view/dutchspeechtechday/home
We made a big test of available Russian models (in Russian)
https://alphacephei.com/nsh/2023/01/22/russian-models.html
In short: NeMo RNNT is good, Whisper is not very good for Russian (even adapted), Vosk is still not bad, and we are working to improve it.
Feels like Zipformer and the other -formers.
Paper:
https://arxiv.org/abs/2210.00077
E-Branchformer: Branchformer with Enhanced merging for speech recognition
Kwangyoun Kim, Felix Wu, Yifan Peng, Jing Pan, Prashant Sridhar, Kyu J. Han, Shinji Watanabe
Conformer, combining convolution and self-attention sequentially to capture both local and global information, has shown remarkable performance and is currently regarded as the state-of-the-art for automatic speech recognition (ASR). Several other studies have explored integrating convolution and self-attention but they have not managed to match Conformer's performance. The recently introduced Branchformer achieves comparable performance to Conformer by using dedicated branches of convolution and self-attention and merging local and global context from each branch. In this paper, we propose E-Branchformer, which enhances Branchformer by applying an effective merging method and stacking additional point-wise modules. E-Branchformer sets new state-of-the-art word error rates (WERs) of 1.81% and 3.65% on the LibriSpeech test-clean and test-other sets without using any external training data.
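A toy sketch of the two-branch idea, with simple stand-ins for the attention and convolution branches and a plain concat-plus-projection merge. E-Branchformer's actual enhanced merge additionally applies a depth-wise convolution over the concatenated branches before the projection; the stand-ins and shapes here are made up.

```python
import numpy as np

rng = np.random.default_rng(0)

def global_branch(x):
    # Stand-in for self-attention: mix information across all time steps
    # (uniform attention, so every output frame is the sequence mean).
    attn = np.ones((x.shape[0], x.shape[0])) / x.shape[0]
    return attn @ x

def local_branch(x):
    # Stand-in for the convolutional (cgMLP) branch: mix neighbours only.
    y = np.copy(x)
    y[1:] += x[:-1]
    return y / 2

def merge(x, W):
    # Branchformer-style merge: concatenate both branches along the feature
    # axis and project back to the model dimension with a learned matrix.
    cat = np.concatenate([global_branch(x), local_branch(x)], axis=-1)
    return cat @ W

T, D = 5, 4
x = rng.normal(size=(T, D))
W = rng.normal(size=(2 * D, D))  # learned projection (random here)
y = merge(x, W)
```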