Audio
Sounds pretty good and is very fast to synthesize. Finally not an LJSpeech voice.
Looking at Hugging Face models. The impression is that the intention is to bury good models under thousands of bad ones.
Looking more at Hugging Face models. Most of them are fine-tuned without SpecAugment...
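For reference, SpecAugment is just random frequency and time masking applied to the input spectrogram during training. A minimal sketch of what these fine-tunes skip (mask widths and counts here are illustrative defaults, not anyone's exact recipe):

import torch

def spec_augment(spec: torch.Tensor,
                 freq_mask: int = 27,    # max width of one frequency mask
                 time_mask: int = 100,   # max width of one time mask
                 n_freq_masks: int = 2,
                 n_time_masks: int = 2) -> torch.Tensor:
    """spec: (freq_bins, time_steps) log-mel spectrogram."""
    spec = spec.clone()
    n_freq, n_time = spec.shape
    for _ in range(n_freq_masks):
        f = torch.randint(0, freq_mask + 1, (1,)).item()
        f0 = torch.randint(0, max(1, n_freq - f), (1,)).item()
        spec[f0:f0 + f, :] = 0.0        # zero out a band of mel bins
    for _ in range(n_time_masks):
        t = torch.randint(0, time_mask + 1, (1,)).item()
        t0 = torch.randint(0, max(1, n_time - t), (1,)).item()
        spec[:, t0:t0 + t] = 0.0        # zero out a span of frames
    return spec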
Whisper is very popular these days, so here are some more observations on it. You see, one can tune a simple model better than a much more advanced Whisper model. Whisper is actually not great for non-English languages, and moreover it does not fine-tune very well.
https://alphacephei.com/nsh/2023/01/15/whisper-finetuning.html
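If you still want to try fine-tuning it yourself, here is a minimal sketch of a single training step with Hugging Face transformers (model size, learning rate, and the data feeding are placeholders for your own setup):

import torch
from transformers import WhisperProcessor, WhisperForConditionalGeneration

processor = WhisperProcessor.from_pretrained("openai/whisper-small")
model = WhisperForConditionalGeneration.from_pretrained("openai/whisper-small")
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-5)

def train_step(audio_array, sampling_rate, transcript):
    # Convert raw audio to log-mel input features.
    inputs = processor(audio_array, sampling_rate=sampling_rate,
                       return_tensors="pt")
    # Tokenized transcript serves as decoder labels.
    labels = processor.tokenizer(transcript, return_tensors="pt").input_ids
    loss = model(input_features=inputs.input_features, labels=labels).loss
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()
    return loss.item()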
Alibaba's FunASR recently had a very good Paraformer model release:
https://modelscope.cn/models/damo/speech_paraformer-large-vad-punc_asr_nat-zh-cn-16k-common-vocab8404-pytorch/summary
On semi-supervised models:
https://twitter.com/huckiyang/status/1615659564606656512
> My take is that joint supervised training and SSL loss is blooming now (e.g., JUST and Maestro) - it is essential to have a supervised encoder also.
https://arxiv.org/abs/2111.08137
https://arxiv.org/abs/2204.03409
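The idea, as I read it (this is a toy rendering, not the exact JUST/Maestro recipe): one shared encoder gets a supervised loss on labeled batches and an SSL loss on unlabeled ones in the same step.

import torch
import torch.nn.functional as F

def train_step(encoder, ctc_head, ssl_head, labeled, unlabeled, alpha=0.3):
    # Supervised branch: CTC on labeled audio (shapes assumed batch-first).
    feats, targets, feat_lens, target_lens = labeled
    hidden = encoder(feats)                           # (B, T, H)
    log_probs = ctc_head(hidden).log_softmax(-1)      # (B, T, vocab)
    sup_loss = F.ctc_loss(log_probs.transpose(0, 1),  # ctc_loss wants (T, B, V)
                          targets, feat_lens, target_lens)

    # SSL branch: reconstruct masked frames of unlabeled audio
    # with the very same encoder.
    u_feats, mask = unlabeled                         # mask: (B, T) bool
    masked = u_feats.masked_fill(mask.unsqueeze(-1), 0.0)
    pred = ssl_head(encoder(masked))                  # (B, T, feat_dim)
    ssl_loss = F.mse_loss(pred[mask], u_feats[mask])

    return sup_loss + alpha * ssl_loss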
https://twitter.com/lesterphv/status/1615990752403918850
> We are proud to announce the first Singing Voice Conversion Challenge (SVCC2023)! Building on the success of the previous VCC events, we plan to further push the limits of voice conversion by now focusing on singing voices, which is more difficult to model than speech.
http://www.vc-challenge.org/
ReazonSpeech
ReazonSpeech is a labeled Japanese speech corpus consisting of approximately 19,000 hours of broadcast speech. It was built to promote research on Japanese speech recognition technology.
In addition to the speech corpus, we have released a toolkit for building the corpus and a pre-trained model under a free license.
https://research.reazon.jp/projects/ReazonSpeech/index.html
Trained ESPnet model
Apache-2.0
https://huggingface.co/reazon-research/reazonspeech-espnet-v1
Corpus building toolkit
Apache-2.0
https://github.com/reazon-research/ReazonSpeech
Japanese speech corpus
CDLA-Sharing-1.0 (however, use is limited to the information analysis specified in Article 30-4 of the Japanese Copyright Act)
https://huggingface.co/datasets/reazon-research/reazonspeech
Research paper
https://research.reazon.jp/_static/reazonspeech_nlp2023.pdf
https://twitter.com/huckiyang/status/1616343651344343046
https://arxiv.org/abs/2301.07851
From English to More Languages: Parameter-Efficient Model Reprogramming for Cross-Lingual Speech Recognition
Chao-Han Huck Yang, Bo Li, Yu Zhang, Nanxin Chen, Rohit Prabhavalkar, Tara N. Sainath, Trevor Strohman
In this work, we propose a new parameter-efficient learning framework based on neural model reprogramming for cross-lingual speech recognition, which can re-purpose well-trained English automatic speech recognition (ASR) models to recognize the other languages. We design different auxiliary neural architectures focusing on learnable pre-trained feature enhancement that, for the first time, empowers model reprogramming on ASR. Specifically, we investigate how to select trainable components (i.e., encoder) of a conformer-based RNN-Transducer, as a frozen pre-trained backbone. Experiments on a seven-language multilingual LibriSpeech speech (MLS) task show that model reprogramming only requires 4.2% (11M out of 270M) to 6.8% (45M out of 660M) of its original trainable parameters from a full ASR model to perform competitive results in a range of 11.9% to 8.1% WER averaged across different languages. In addition, we discover different setups to make large-scale pre-trained ASR succeed in both monolingual and multilingual speech recognition. Our methods outperform existing ASR tuning architectures and their extension with self-supervised losses (e.g., w2v-bert) in terms of lower WER and better training efficiency.
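A rough sketch of the reprogramming idea as I read the abstract (the module names are hypothetical, not the paper's): freeze a pre-trained English encoder and train only a small input "enhancement" network for the new language.

import torch.nn as nn

class ReprogrammedASR(nn.Module):
    def __init__(self, pretrained_encoder: nn.Module, feat_dim: int = 80):
        super().__init__()
        # Small trainable front-end; the only part that gets gradients.
        self.enhance = nn.Sequential(
            nn.Conv1d(feat_dim, feat_dim, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.Conv1d(feat_dim, feat_dim, kernel_size=3, padding=1),
        )
        self.encoder = pretrained_encoder
        for p in self.encoder.parameters():   # frozen backbone
            p.requires_grad = False

    def forward(self, feats):                 # feats: (B, feat_dim, T)
        # Residual tweak of the input features, then the frozen encoder.
        return self.encoder(self.enhance(feats) + feats)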
https://twitter.com/alex_conneau/status/1614014965496811520
> Our FLEURS paper won the best paper award at SLT 2022! @ieee_slt
SLT: https://slt2022.org/best-papers.php
arXiv: https://arxiv.org/abs/2205.12446
Recently I came across an ASR SaaS service, Soniox. Overall pretty nice: fast and clean UI, good transcription accuracy and features. 5 hours a month free per user.
I also read their whitepaper, which has very cool results.
https://soniox.com/media/SonioxSpeechToTextBenchmarksNov2022.pdf
Well, judging from the whitepaper every service is more or less the same: some better, some worse. I quickly ran a test with an audio broadcast file. Here are the results (WER, %); a minimal WER sketch follows below.
AssemblyAI stream 14.79
AWS stream 17.20
Azure stream 11.47
Deepgram stream 18.23
Google stream 15.48
Rev stream 17.09
Speechmatics stream 9.75
Soniox stream 12.73
Assembly async 11.01
Rev async 15.25
Soniox async 11.81
Whisper Large-v2 async 8.94
Whisper Medium.en async 9.29
Nemo RNNT async 19.61
Whisper really shines for English. As for the others, they are all more or less the same. Whitepapers are not very meaningful.
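For anyone who wants to reproduce numbers like the ones above on their own transcripts, WER is just word-level edit distance over reference length; a minimal implementation:

def wer(reference: str, hypothesis: str) -> float:
    ref, hyp = reference.split(), hypothesis.split()
    # dp[i][j] = edit distance between ref[:i] and hyp[:j]
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i
    for j in range(len(hyp) + 1):
        dp[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            sub = dp[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1])
            dp[i][j] = min(sub, dp[i - 1][j] + 1, dp[i][j - 1] + 1)
    return 100.0 * dp[len(ref)][len(hyp)] / len(ref)

print(wer("the cat sat", "the cat sat down"))  # 33.3 (one insertion)

Real evaluations also normalize text (casing, punctuation, numbers) before scoring, which can move these numbers by several points.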
1st Dutch Speech Tech Day
Monday, 20 February 2023
Location: Netherlands Institute for Sound & Vision, Hilversum
https://sites.google.com/view/dutchspeechtechday/home
We made a big test of the available Russian models (in Russian):
https://alphacephei.com/nsh/2023/01/22/russian-models.html
In short: Nemo RNNT is good; Whisper is not very good for Russian, even adapted; Vosk is still not bad, and we are working to improve it.
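For completeness, this is the kind of minimal Vosk decoding loop used for such comparisons (the model directory is a placeholder for whichever Russian model you test; input is expected to be 16 kHz mono PCM WAV):

import wave, json
from vosk import Model, KaldiRecognizer

wf = wave.open("test.wav", "rb")
model = Model("vosk-model-ru-0.42")          # placeholder model directory
rec = KaldiRecognizer(model, wf.getframerate())

while True:
    data = wf.readframes(4000)               # feed audio in chunks
    if len(data) == 0:
        break
    rec.AcceptWaveform(data)

print(json.loads(rec.FinalResult())["text"])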
Feels like Zipformer and the other -formers.
Paper:
https://arxiv.org/abs/2210.00077
E-Branchformer: Branchformer with Enhanced merging for speech recognition
Kwangyoun Kim, Felix Wu, Yifan Peng, Jing Pan, Prashant Sridhar, Kyu J. Han, Shinji Watanabe
Conformer, combining convolution and self-attention sequentially to capture both local and global information, has shown remarkable performance and is currently regarded as the state-of-the-art for automatic speech recognition (ASR). Several other studies have explored integrating convolution and self-attention but they have not managed to match Conformer's performance. The recently introduced Branchformer achieves comparable performance to Conformer by using dedicated branches of convolution and self-attention and merging local and global context from each branch. In this paper, we propose E-Branchformer, which enhances Branchformer by applying an effective merging method and stacking additional point-wise modules. E-Branchformer sets new state-of-the-art word error rates (WERs) 1.81% and 3.65% on LibriSpeech test-clean and test-other sets without using any external training data.
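A very reduced sketch of the parallel-branch idea (not the paper's exact cgMLP and merge modules): a self-attention branch for global context and a depthwise-convolution branch for local context, merged by concatenation and a projection.

import torch
import torch.nn as nn

class BranchBlock(nn.Module):
    def __init__(self, d_model: int = 256, n_heads: int = 4):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.conv = nn.Sequential(
            nn.Conv1d(d_model, d_model, kernel_size=31, padding=15,
                      groups=d_model),        # depthwise conv, local context
            nn.GELU(),
        )
        self.merge = nn.Linear(2 * d_model, d_model)
        self.norm = nn.LayerNorm(d_model)

    def forward(self, x):                     # x: (B, T, d_model)
        g, _ = self.attn(x, x, x)             # global branch
        l = self.conv(x.transpose(1, 2)).transpose(1, 2)  # local branch
        return self.norm(x + self.merge(torch.cat([g, l], dim=-1)))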
That's something new:
https://arxiv.org/abs/2301.08730
Novel-View Acoustic Synthesis
Changan Chen, Alexander Richard, Roman Shapovalov, Vamsi Krishna Ithapu, Natalia Neverova, Kristen Grauman, Andrea Vedaldi
We introduce the novel-view acoustic synthesis (NVAS) task: given the sight and sound observed at a source viewpoint, can we synthesize the sound of that scene from an unseen target viewpoint? We propose a neural rendering approach: Visually-Guided Acoustic Synthesis (ViGAS) network that learns to synthesize the sound of an arbitrary point in space by analyzing the input audio-visual cues. To benchmark this task, we collect two first-of-their-kind large-scale multi-view audio-visual datasets, one synthetic and one real. We show that our model successfully reasons about the spatial cues and synthesizes faithful audio on both datasets. To our knowledge, this work represents the very first formulation, dataset, and approach to solve the novel-view acoustic synthesis task, which has exciting potential applications ranging from AR/VR to art and design. Unlocked by this work, we believe that the future of novel-view synthesis is in multi-modal learning from videos.
https://sites.google.com/view/merlion-ccs-challenge/
The inaugural MERLIon CCS Challenge focuses on developing robust language identification and language diarization systems that are reliable for non-standard, accented, spontaneous, code-switched, child-directed speech collected via Zoom.