Speech Technology

https://twitter.com/alex_conneau/status/1614014965496811520

FLEURS paper won the best paper award at SLT 2022!
@ieee_slt

SLT: https://slt2022.org/best-papers.php
arXiv: https://arxiv.org/abs/2205.12446

Twitter

Our FLEURS paper won the best paper award at SLT 2022! @ieee_slt

SLT: https://t.co/lxkpjfOQJa
arXiv: https://t.co/L9W3wiNpUu

Thanks to the organizers! Grateful for the collaboration with many great colleagues 🙂

592 views14:53

Speech Technology

Recently I've came around one ASR SaaS service Soniox. Overall, pretty nice, fast and clean UI, good transcription accuracy and features. 5 hours a month for free for user.

I've read their whitepaper too with very cool results.

https://soniox.com/media/SonioxSpeechToTextBenchmarksNov2022.pdf

Well, from the whitepaper every service is more or less the same, some are better, some worse. I've quickly made a test with a audio broadcast file. Here are the results.

AssemblyAI      stream  14.79
AWS             stream  17.20
Azure           stream  11.47 
Deepgram        stream  18.23
Google          stream  15.48
Rev             stream  17.09
Speechmatics    stream   9.75
Soniox          stream  12.73

Assembly        async   11.01
Rev             async   15.25
Soniox          async   11.81
Whisper Largev2 async    8.94
Whisper Med.En  async    9.29
Nemo RNNT       async   19.61

Whisper really shines for English. As for others, they all are more or less the same. Whitepapers not very meaningful.

679 viewsedited 21:02

Speech Technology

1ST DUTCH SPEECH TECH DAY
Monday, 20 February 2023

Location: Netherlands Institute for Sound & Vision, Hilversum

https://sites.google.com/view/dutchspeechtechday/home

Google

Dutch Speech Tech Day

The 2nd Dutch Speech Tech Day will take place on Monday February 19, 2024 at Beeld & Geluid in Hilversum. The 2nd Dutch Speech Tech Day follows the first, very successful Dutch Speech Tech Day which took place at the same location in February 2023, and which…

618 views19:51

Speech Technology

We made a big test of available Russian models (in Russian)

https://alphacephei.com/nsh/2023/01/22/russian-models.html

In short: Nemo RNNT is good, Whisper is not very good for Russian, even adapted, Vosk still not bad, we are working to improve.

Speech Recognition With Vosk

Открытые модели для распознавания русской речи

Обновлено 15.04.2024:

704 views23:22

Speech Technology

https://mobile.twitter.com/shinjiw_at_cmu/status/1618083172918644736

E-Branchformer from ASAPP

Twitter

E-Branchformer is very good. Now, it's in ESPnet, and we tested it with various tasks, but it almost always got better performance than conformer (not a very large gain but a steady improvement).

ASAPP tops ASR leaderboard with E‑Branchformer https://t.co/SOnwqb9by9…

607 views08:44

Speech Technology

Feels like Zipformer and other formers

Paper:

https://arxiv.org/abs/2210.00077

E-Branchformer: Branchformer with Enhanced merging for speech recognition

Kwangyoun Kim, Felix Wu, Yifan Peng, Jing Pan, Prashant Sridhar, Kyu J. Han, Shinji Watanabe

Conformer, combining convolution and self-attention sequentially to capture both local and global information, has shown remarkable performance and is currently regarded as the state-of-the-art for automatic speech recognition (ASR). Several other studies have explored integrating convolution and self-attention but they have not managed to match Conformer's performance. The recently introduced Branchformer achieves comparable performance to Conformer by using dedicated branches of convolution and self-attention and merging local and global context from each branch. In this paper, we propose E-Branchformer, which enhances Branchformer by applying an effective merging method and stacking additional point-wise modules. E-Branchformer sets new state-of-the-art word error rates (WERs) 1.81% and 3.65% on LibriSpeech test-clean and test-other sets without using any external training data.

639 views08:47

Speech Technology

Thats something new:

https://arxiv.org/abs/2301.08730

Novel-View Acoustic Synthesis

Changan Chen, Alexander Richard, Roman Shapovalov, Vamsi Krishna Ithapu, Natalia Neverova, Kristen Grauman, Andrea Vedaldi

We introduce the novel-view acoustic synthesis (NVAS) task: given the sight and sound observed at a source viewpoint, can we synthesize the sound of that scene from an unseen target viewpoint? We propose a neural rendering approach: Visually-Guided Acoustic Synthesis (ViGAS) network that learns to synthesize the sound of an arbitrary point in space by analyzing the input audio-visual cues. To benchmark this task, we collect two first-of-their-kind large-scale multi-view audio-visual datasets, one synthetic and one real. We show that our model successfully reasons about the spatial cues and synthesizes faithful audio on both datasets. To our knowledge, this work represents the very first formulation, dataset, and approach to solve the novel-view acoustic synthesis task, which has exciting potential applications ranging from AR/VR to art and design. Unlocked by this work, we believe that the future of novel-view synthesis is in multi-modal learning from videos.

698 views20:45

Speech Technology

https://voicebot.ai/2023/01/24/soundhound-raises-25m-weeks-after-major-layoffs/

Voicebot.ai

SoundHound Raises $25M Weeks After Major Layoffs

SoundHound has raised $25 million in equity financing from an unknown set of investors. The funding comes only a couple..

725 views20:57

Speech Technology

speechbrain funded by OVH

https://twitter.com/mirco_ravanelli/status/1618345249675542528

Twitter

My team is using @OVHcloud to expand #SpeechBrain and explore innovative #research ideas. A big thank you to @OVHcloud for helping us!

787 views21:04

Speech Technology

https://sites.google.com/view/merlion-ccs-challenge/

About
The inaugural MERLIon CCS Challenge focuses on developing robust language identification and language diarization systems that are reliable for non-standard, accented, spontaneous code-switched, child-directed speech collected via Zoom

705 views22:26

Speech Technology

https://twitter.com/chrisdonahuey/status/1620232090066497536

595 views12:00

Speech Technology

https://github.com/yangdongchao/InstructTTS

GitHub

GitHub - yangdongchao/InstructTTS: The deme page of InstructTTS

The deme page of InstructTTS. Contribute to yangdongchao/InstructTTS development by creating an account on GitHub.

599 views14:48

Speech Technology

https://twitter.com/shinjiw_at_cmu/status/1620766409390448641

Twitter

ESPnet v.202301 has been released! https://t.co/Fe42dl41sE

We have too many updates (s4, Hubert, e-branchformer, mSUPERB, EURO UASR, VISInger, Time-Sync Decoding, MFA, Rawnet x-vector, Whisper, AphasiaBank, reazonspeech, Aux. CTC conditioning).

Thanks to…

586 views17:02

Speech Technology

IWSLT also has many speech translation tracks

https://iwslt.org/2023/#shared-tasks

818 views23:04

Speech Technology

IWSLT has nice lecture channel too

https://www.youtube.com/@sigslt

551 views23:05

Speech Technology

For example

https://www.youtube.com/watch?v=xKNAa8ihE7g

580 views23:08

Speech Technology

https://zenodo.org/record/7389996

Zenodo

Data Repository for MYRiAD: A Multi-Array Room Acoustic Database

In the development of acoustic signal processing algorithms, their evaluation in various acoustic environments is of utmost importance. In order to advance evaluation in realistic and reproducible scenarios, several high-quality acoustic databases have been…

578 views20:53

Speech Technology

Respected guys

https://arxiv.org/abs/2301.13341

Neural Target Speech Extraction: An Overview

Katerina Zmolikova, Marc Delcroix, Tsubasa Ochiai, Keisuke Kinoshita, Jan Černocký, Dong Yu

Humans can listen to a target speaker even in challenging acoustic conditions that have noise, reverberation, and interfering speakers. This phenomenon is known as the cocktail-party effect. For decades, researchers have focused on approaching the listening ability of humans. One critical issue is handling interfering speakers because the target and non-target speech signals share similar characteristics, complicating their discrimination. Target speech/speaker extraction (TSE) isolates the speech signal of a target speaker from a mixture of several speakers with or without noises and reverberations using clues that identify the speaker in the mixture. Such clues might be a spatial clue indicating the direction of the target speaker, a video of the speaker's lips, or a pre-recorded enrollment utterance from which their voice characteristics can be derived. TSE is an emerging field of research that has received increased attention in recent years because it offers a practical approach to the cocktail-party problem and involves such aspects of signal processing as audio, visual, array processing, and deep learning. This paper focuses on recent neural-based approaches and presents an in-depth overview of TSE. We guide readers through the different major approaches, emphasizing the similarities among frameworks and discussing potential future directions.

632 views22:05

Speech Technology

Some Zipformer ideas (multistream is nice):

https://github.com/k2-fsa/icefall/issues/837#issuecomment-1412312846

GitHub

Zipformer explanation · Issue #837 · k2-fsa/icefall

Hello guys. Could you explain in several words what is zipformer. Thanks a lot.

586 views09:50

Speech Technology

https://twitter.com/alphacep/status/1621612504840273928

NeMo 1.15 is out now! Theres a whole bunch of powerful ASR features added in this release including Hybrid CTC-RNNT models, Multiblank Transducer, Multi Head Attention Adapters, Conformer longformer inference, and a Beam Search API!

First, we dicuss Hybrid CTC-RNNT models. We can train a single model with both losses, and then perform inference with either decoder. It turns out, we can attain better CTC results, and converge 40-50% faster for CTC head when jointly trained.

Next up, we have Multiblank Transducers supported in NeMo. It is an extension of RNNT loss - in which tokens can jump multiple timesteps per predicted token, allowing for highly efficient inference - even at sample level ! Refer to the paper here
With this change, you can now easily train a multi blank RNNT model and obtain better WER but also much faster inference than regular RNNT models.

Next up, we now support Multi Head Attention Adapters in NeMo ASR. With this approach, now any NeMo module can be retrofitted into an adapter module. We see significant parameter efficiency when compared to Houlsby Adapter. With the newly updated scripts for adapter training, we can now easily train either Linear adapters or MHA adapters from the same script. More details can be found in the PR

Long form audio transcription has long been a challange for Conformer based ASR models, because of the attention component. So we now support Longformer based transcriptions - even for pre-trained models ! You can use the transcribe_speech script for this! We find that if you further finetune the model after conversion to Longformer attention, you can recover most of the WER and still get excellent long audio transcription of up to 30-40 minutes in one shot forward pass.

A long-asked feature is to support beam search in NeMo ASR in a easy to use way. So we unified the way we do CTC beam search with external libraries with the simple model.transcribe() method! You can simply update the config, and then transcribe !

We also begin support for AIStore as a framework for terabyte-scale datasets as a scalable solution to train ASR models on enormous real world datasets.

Twitter

RT @HaseoX94: NeMo 1.15 is out now! Theres a whole bunch of powerful ASR features added in this release including Hybrid CTC-RNNT models, M…

627 viewsedited 23:58

Speech Technology

It is interesting that for things like NER for latest research Google returned to structured prediction instead of pure transformers

https://github.com/lyutyuh/ASP

https://arxiv.org/abs/2210.14698

Autoregressive Structured Prediction with Language Models

Tianyu Liu, Yuchen Jiang, Nicholas Monath, Ryan Cotterell, Mrinmaya Sachan

Recent years have seen a paradigm shift in NLP towards using pretrained language models ({PLM}) for a wide range of tasks.
However, there are many difficult design decisions to represent structures (e.g. tagged text, coreference chains) in a way such that they can be captured by PLMs. Prior work on structured prediction with PLMs typically flattens the structured output into a sequence, which limits the quality of structural information being learned and leads to inferior performance compared to classic discriminative models. In this work, we describe an approach to model structures as sequences of actions in an autoregressive manner with PLMs, allowing in-structure dependencies to be learned without any loss.
Our approach achieves the new state-of-the-art on all the structured prediction tasks we looked at, namely, named entity recognition, end-to-end relation extraction, and coreference resolution.

GitHub

GitHub - lyutyuh/ASP: PyTorch implementation and pre-trained models for ASP - Autoregressive Structured Prediction with Language…

PyTorch implementation and pre-trained models for ASP - Autoregressive Structured Prediction with Language Models, EMNLP 22. https://arxiv.org/pdf/2210.14698.pdf - lyutyuh/ASP

602 views23:04

About

Blog

Apps

Platform