https://arxiv.org/abs/2306.13114
https://github.com/aixplain/NoRefER
A Reference-less Quality Metric for Automatic Speech Recognition via Contrastive-Learning of a Multi-Language Model with Self-Supervision
Kamer Ali Yuksel, Thiago Ferreira, Ahmet Gunduz, Mohamed Al-Badrashiny, Golara Javadi
The common standard for quality evaluation of automatic speech recognition (ASR) systems is reference-based metrics such as the Word Error Rate (WER), computed using manual ground-truth transcriptions that are time-consuming and expensive to obtain. This work proposes a multi-language referenceless quality metric, which allows comparing the performance of different ASR models on a speech dataset without ground truth transcriptions. To estimate the quality of ASR hypotheses, a pre-trained language model (LM) is fine-tuned with contrastive learning in a self-supervised learning manner. In experiments conducted on several unseen test datasets consisting of outputs from top commercial ASR engines in various languages, the proposed referenceless metric obtains a much higher correlation with WER scores and their ranks than the perplexity metric from the state-of-the-art multi-lingual LM in all experiments, and also reduces WER by more than 7% when used for ensembling hypotheses. The fine-tuned model and experiments are made available for reproducibility: https://github.com/aixplain/NoRefER
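A minimal sketch of how such contrastive fine-tuning could look, assuming a Siamese setup: hypothesis pairs with a quality ordering known by construction (no references needed, e.g. outputs of differently compressed versions of the same ASR model) are scored by a pooled multilingual encoder and trained with a margin ranking loss. The checkpoint name, pooling, and head below are placeholders, not the paper's exact recipe:

```python
# Hedged sketch of a pairwise contrastive quality scorer; NoRefER's actual
# head and training details may differ.
import torch
import torch.nn as nn
from transformers import AutoModel, AutoTokenizer

MODEL = "microsoft/Multilingual-MiniLM-L12-H384"  # placeholder encoder

class QualityScorer(nn.Module):
    def __init__(self, name=MODEL):
        super().__init__()
        self.encoder = AutoModel.from_pretrained(name)
        self.head = nn.Linear(self.encoder.config.hidden_size, 1)

    def forward(self, **enc):
        h = self.encoder(**enc).last_hidden_state          # [B, T, H]
        mask = enc["attention_mask"].unsqueeze(-1)
        pooled = (h * mask).sum(1) / mask.sum(1)           # mean pooling
        return self.head(pooled).squeeze(-1)               # scalar quality score

tok = AutoTokenizer.from_pretrained(MODEL)
model = QualityScorer()
loss_fn = nn.MarginRankingLoss(margin=0.5)

# `better`/`worse` are hypotheses for the same utterance whose ranking is
# known by construction, so no reference transcript is involved.
better = tok(["hello world"], return_tensors="pt", padding=True)
worse = tok(["hello word"], return_tensors="pt", padding=True)
s_pos, s_neg = model(**better), model(**worse)
loss = loss_fn(s_pos, s_neg, torch.ones_like(s_pos))
loss.backward()
```

At inference time the scalar score alone ranks hypotheses, which is what enables the WER-free model comparison the abstract describes.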
Another semi-supervised approach, this one from Amazon, with better ensembling than ROVER
https://arxiv.org/abs/2306.12012
Learning When to Trust Which Teacher for Weakly Supervised ASR
Aakriti Agrawal, Milind Rao, Anit Kumar Sahu, Gopinath Chennupati, Andreas Stolcke
Automatic speech recognition (ASR) training can utilize multiple experts as teacher models, each trained on a specific domain or accent. Teacher models may be opaque in nature since their architecture may not be known or their training cadence is different from that of the student ASR model. Still, the student models are updated incrementally using the pseudo-labels generated independently by the expert teachers. In this paper, we exploit supervision from multiple domain experts in training student ASR models. This training strategy is especially useful in scenarios where few or no human transcriptions are available. To that end, we propose a Smart-Weighter mechanism that selects an appropriate expert based on the input audio, and then trains the student model in an unsupervised setting. We show the efficacy of our approach using LibriSpeech and LibriLight benchmarks and find an improvement of 4 to 25% over baselines that uniformly weight all the experts, use a single expert model, or combine experts using ROVER.
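A hedged sketch of the general idea behind such a mechanism: a small gating network, conditioned on the input audio, produces a weight per teacher, and the student trains on the pseudo-label of the highest-weighted teacher (or a weighted mixture). The gate architecture and features below are illustrative, not the paper's Smart-Weighter specifics:

```python
# Illustrative audio-conditioned teacher gate; the paper's training signal
# for the gate is not reproduced here.
import torch
import torch.nn as nn

class TeacherGate(nn.Module):
    def __init__(self, feat_dim: int, n_teachers: int):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(feat_dim, 128), nn.ReLU(),
            nn.Linear(128, n_teachers),
        )

    def forward(self, audio_feats):              # [B, T, F] e.g. log-mels
        pooled = audio_feats.mean(dim=1)         # utterance-level summary
        return self.net(pooled).softmax(dim=-1)  # one weight per teacher

gate = TeacherGate(feat_dim=80, n_teachers=3)
weights = gate(torch.randn(4, 200, 80))          # [B, 3]
best_teacher = weights.argmax(dim=-1)            # whose pseudo-label to trust
```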
UnitSpeech: Speaker-adaptive Speech Synthesis with Untranscribed Data (INTERSPEECH 2023)
https://github.com/gmltmd789/UnitSpeech
Demo
https://unitspeech.github.io/
In the INTERSPEECH 2023 program, Daniel Povey has a Johns Hopkins University affiliation (again)
https://interspeech2023.org/wp-content/uploads/2023/06/INTERSPEECH_2023_Booklet_v1.pdf
A useful effort by https://github.com/DmitryRyumin to collect INTERSPEECH paper repos.
Please star/share and help fill in the remaining parts; it is a huge effort:
https://github.com/DmitryRyumin/INTERSPEECH-2023-Papers
One could probably automate it.
https://arxiv.org/abs/2306.17103
LyricWhiz: Robust Multilingual Zero-shot Lyrics Transcription by Whispering to ChatGPT
Le Zhuo, Ruibin Yuan, Jiahao Pan, Yinghao Ma, Yizhi LI, Ge Zhang, Si Liu, Roger Dannenberg, Jie Fu, Chenghua Lin, Emmanouil Benetos, Wenhu Chen, Wei Xue, Yike Guo
We introduce LyricWhiz, a robust, multilingual, and zero-shot automatic lyrics transcription method achieving state-of-the-art performance on various lyrics transcription datasets, even in challenging genres such as rock and metal. Our novel, training-free approach utilizes Whisper, a weakly supervised robust speech recognition model, and GPT-4, today's most performant chat-based large language model. In the proposed method, Whisper functions as the "ear" by transcribing the audio, while GPT-4 serves as the "brain," acting as an annotator with a strong performance for contextualized output selection and correction. Our experiments show that LyricWhiz significantly reduces Word Error Rate compared to existing methods in English and can effectively transcribe lyrics across multiple languages. Furthermore, we use LyricWhiz to create the first publicly available, large-scale, multilingual lyrics transcription dataset with a CC-BY-NC-SA copyright license, based on MTG-Jamendo, and offer a human-annotated subset for noise level estimation and evaluation. We anticipate that our proposed method and dataset will advance the development of multilingual lyrics transcription, a challenging and emerging task.
Prompt to combine ASR results with GPT-4
Task: As a GPT-4 based lyrics transcription post-processor, your task is to analyze multiple ASR model-generated versions of a song’s lyrics and determine the most accurate version closest to the true lyrics. Also filter out invalid lyrics when all predictions are nonsense.
Input: The input is in JSON format:
{“prediction_1”: “line1;line2;...”, ...}
Output: Your output must be strictly in readable JSON format without any extra text:
{
“reasons”: “reason1;reason2;...”,
“closest_prediction”: <key_of_prediction>
“output”: “line1;line2...”
}
Requirements: For the "reasons" field, you have to provide a reason for the choice of the "closest_prediction" field. For the "closest_prediction" field, choose the prediction key that is closest to the true lyrics. Only when all predictions greatly differ from each other or are completely nonsense or meaningless, which means that none of the predictions is valid, fill in "None" in this field. For the "output" field, you need to output the final lyrics of closest_prediction. If the "closest_prediction" field is "None", you should also output "None" in this field. The language of the input lyrics is English.
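For illustration, a minimal sketch of feeding this prompt to the 2023-era OpenAI chat API; the predictions dictionary (e.g. outputs of several Whisper runs or model sizes) and the model name are placeholders:

```python
# Hedged usage sketch; assumes the pre-1.0 openai package (ChatCompletion API)
# and that the model honors the strict-JSON output requirement in the prompt.
import json
import openai

PROMPT = "Task: As a GPT-4 based lyrics transcription post-processor, ..."  # full prompt text above

predictions = {  # placeholder hypotheses from multiple ASR runs
    "prediction_1": "line1;line2;...",
    "prediction_2": "line1;line2;...",
}

resp = openai.ChatCompletion.create(
    model="gpt-4",
    messages=[
        {"role": "system", "content": PROMPT},
        {"role": "user", "content": json.dumps(predictions)},
    ],
    temperature=0,  # deterministic selection
)
result = json.loads(resp["choices"][0]["message"]["content"])
print(result["closest_prediction"], result["output"])
```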
Another similar one, with LLaMA
https://arxiv.org/abs/2306.16007
Prompting Large Language Models for Zero-Shot Domain Adaptation in Speech Recognition
Yuang Li, Yu Wu, Jinyu Li, Shujie Liu
The integration of Language Models (LMs) has proven to be an effective way to address domain shifts in speech recognition. However, these approaches usually require a significant amount of target domain text data for the training of LMs. Different from these methods, in this work, with only a domain-specific text prompt, we propose two zero-shot ASR domain adaptation methods using LLaMA, a 7-billion-parameter large language model (LLM). LLM is used in two ways: 1) second-pass rescoring: reranking N-best hypotheses of a given ASR system with LLaMA; 2) deep LLM-fusion: incorporating LLM into the decoder of an encoder-decoder based ASR system. Experiments show that, with only one domain prompt, both methods can effectively reduce word error rates (WER) on out-of-domain TedLium-2 and SPGISpeech datasets. Especially, the deep LLM-fusion has the advantage of better recall of entity and out-of-vocabulary words.
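A hedged sketch of the second-pass rescoring variant: score each N-best hypothesis by its LLaMA log-likelihood conditioned on the domain prompt, then interpolate with the first-pass ASR score. The checkpoint name, interpolation weight, and the token-alignment shortcut are approximations, not the paper's setup:

```python
# Sketch of prompt-conditioned N-best rescoring with a causal LM.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

NAME = "huggyllama/llama-7b"  # placeholder checkpoint
tok = AutoTokenizer.from_pretrained(NAME)
lm = AutoModelForCausalLM.from_pretrained(NAME).eval()

def llm_logprob(prompt: str, hyp: str) -> float:
    ids = tok(prompt + " " + hyp, return_tensors="pt").input_ids
    n_prompt = tok(prompt, return_tensors="pt").input_ids.shape[1]
    with torch.no_grad():
        logits = lm(ids).logits[:, :-1]                    # predicts token t+1
    logp = logits.log_softmax(-1).gather(-1, ids[:, 1:, None]).squeeze(-1)
    return logp[0, n_prompt - 1:].sum().item()             # hypothesis tokens only (approximate alignment)

domain_prompt = "The following is a transcript of a financial earnings call:"
nbest = [("revenue grew ten percent", -12.3),              # (hypothesis, ASR score)
         ("revenue grew ten per cent", -12.9)]
lam = 0.3                                                  # illustrative weight
best = max(nbest, key=lambda h: h[1] + lam * llm_logprob(domain_prompt, h[0]))
```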
Are you skilled at generating synthesized or converted speech samples? Are you concerned about the potential implications of deepfake speech? Are you interested to contribute to advancing technology for detecting such 'fake' speech using machine learning?
If yes, you are warmly invited to contribute to the fifth edition of the ASVspoof (Automatic Speaker Verification and Spoofing Countermeasures) challenge! ASVspoof is centered around the challenges to design spoofing-robust automatic speaker verification solutions and application-agnostic speech deepfake detectors.
You may join us either as a data provider (phase 1) or as a challenge participant (phase 2). We are now inviting expressions of interest from potential data contributors.
For further details, please refer to the ASVspoof 5 Evaluation Plan which can be downloaded from our website at: https://www.asvspoof.org/
Kind regards,
On behalf of the ASVspoof 5 organising committee
organisers@lists.asvspoof.org
July 1, 2023 - Phase 1 registration opens
July 1, 2023 - training and development data available
July 1, 2023 - TTS/VC adaptation and input data available
July 1, 2023 - surrogate ASV/CM available
July 15, 2023 - Phase 1 CodaLab platform opens
July 15 to September 15, 2023 - submit TTS/VC spoofed data
The Cambridge team is always doing nice research
https://arxiv.org/abs/2307.03088
Label-Synchronous Neural Transducer for End-to-End ASR
Keqi Deng, Philip C. Woodland
Neural transducers provide a natural approach to streaming ASR. However, they augment output sequences with blank tokens which leads to challenges for domain adaptation using text data. This paper proposes a label-synchronous neural transducer (LS-Transducer), which extracts a label-level encoder representation before combining it with the prediction network output. Hence blank tokens are no longer needed and the prediction network can be easily adapted using text data. An Auto-regressive Integrate-and-Fire (AIF) mechanism is proposed to generate the label-level encoder representation while retaining the streaming property. In addition, a streaming joint decoding method is designed to improve ASR accuracy. Experiments show that compared to standard neural transducers, the proposed LS-Transducer gave a 10% relative WER reduction (WERR) for intra-domain Librispeech-100h data, as well as 17% and 19% relative WERRs on cross-domain TED-LIUM 2 and AESRC2020 data with an adapted prediction network.
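A rough sketch of the integrate-and-fire mechanism this builds on (closely related to CIF): per-frame weights are accumulated until they cross a threshold, at which point a label-level vector "fires" as the weighted sum of the frames since the last firing. The paper's autoregressive weight prediction is omitted here:

```python
# Illustrative integrate-and-fire aggregation of encoder frames.
import torch

def integrate_and_fire(enc, alpha, threshold=1.0):
    """enc: [T, H] encoder frames; alpha: [T] non-negative weights."""
    outputs, acc = [], 0.0
    accum_vec = torch.zeros(enc.shape[1])
    for t in range(enc.shape[0]):
        a = alpha[t].item()
        if acc + a < threshold:
            acc += a
            accum_vec = accum_vec + a * enc[t]
        else:
            need = threshold - acc                 # portion that closes this label
            outputs.append(accum_vec + need * enc[t])
            acc = a - need                         # remainder opens the next label
            accum_vec = acc * enc[t]
    return torch.stack(outputs) if outputs else torch.empty(0, enc.shape[1])

# small weights so each label spans several frames
labels = integrate_and_fire(torch.randn(50, 256), torch.rand(50) * 0.3)
```

Because firing is driven by accumulated weights rather than per-frame blank decisions, the resulting representation is label-synchronous, which is what lets the prediction network be adapted with text alone.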
Besides diarization with tinydiarize, Whisper can do audio tagging well
https://arxiv.org/abs/2307.03183
Whisper-AT: Noise-Robust Automatic Speech Recognizers are Also Strong General Audio Event Taggers
Yuan Gong, Sameer Khurana, Leonid Karlinsky, James Glass
In this paper, we focus on Whisper, a recent automatic speech recognition model trained with a massive 680k hour labeled speech corpus recorded in diverse conditions. We first show an interesting finding that while Whisper is very robust against real-world background sounds (e.g., music), its audio representation is actually not noise-invariant, but is instead highly correlated to non-speech sounds, indicating that Whisper recognizes speech conditioned on the noise type. With this finding, we build a unified audio tagging and speech recognition model Whisper-AT by freezing the backbone of Whisper, and training a lightweight audio tagging model on top of it. With <1% extra computational cost, Whisper-AT can recognize audio events, in addition to spoken text, in a single forward pass.
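A simplified sketch of the frozen-backbone recipe, assuming the openai-whisper package; the paper's actual time- and layer-wise head is more elaborate than the mean-pooled linear classifier below:

```python
# Freeze Whisper's encoder and train only a lightweight tagging head.
import torch
import torch.nn as nn
import whisper  # pip install openai-whisper

backbone = whisper.load_model("base")
for p in backbone.parameters():
    p.requires_grad = False                       # keep Whisper's ASR intact

n_audio_classes = 527                             # e.g. the AudioSet ontology
head = nn.Linear(backbone.dims.n_audio_state, n_audio_classes)

audio = torch.randn(16000 * 10)                   # placeholder 10 s waveform
mel = whisper.log_mel_spectrogram(whisper.pad_or_trim(audio))
with torch.no_grad():
    feats = backbone.encoder(mel.unsqueeze(0))    # [1, T', H] frozen features
logits = head(feats.mean(dim=1))                  # utterance-level tag logits
```

Since the backbone is shared and frozen, the tags come essentially for free alongside the transcript, matching the paper's <1% extra compute claim in spirit.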
https://twitter.com/RoshanSSharma2/status/1678523240472358912
Interested in Spoken Language? Our new paper "SLUE Phase-2: A Benchmark Suite of Diverse Spoken Language Understanding Tasks" at #ACL2023 introduces open-source data, tools, and benchmarks for 4 SLU tasks.
https://lnkd.in/ePiUjTiU
Presentation: 11AM on July 11
See you there!
https://arxiv.org/abs/2212.10525
SLUE Phase-2: A Benchmark Suite of Diverse Spoken Language Understanding Tasks
Suwon Shon, Siddhant Arora, Chyi-Jiunn Lin, Ankita Pasad, Felix Wu, Roshan Sharma, Wei-Lun Wu, Hung-Yi Lee, Karen Livescu, Shinji Watanabe
Spoken language understanding (SLU) tasks have been studied for many decades in the speech research community, but have not received as much attention as lower-level tasks like speech and speaker recognition. In particular, there are not nearly as many SLU task benchmarks, and many of the existing ones use data that is not freely available to all researchers. Recent work has begun to introduce such benchmark datasets for several tasks. In this work, we introduce several new annotated SLU benchmark tasks based on freely available speech data, which complement existing benchmarks and address gaps in the SLU evaluation landscape. We contribute four tasks: question answering and summarization involve inference over longer speech sequences; named entity localization addresses the speech-specific task of locating the targeted content in the signal; dialog act classification identifies the function of a given speech utterance. We follow the blueprint of the Spoken Language Understanding Evaluation (SLUE) benchmark suite. In order to facilitate the development of SLU models that leverage the success of pre-trained speech representations, we will be publishing for each task (i) annotations for a relatively small fine-tuning set, (ii) annotated development and test sets, and (iii) baseline models for easy reproducibility and comparisons. In this work, we present the details of data collection and annotation and the performance of the baseline models. We also perform sensitivity analysis of pipeline models' performance (speech recognizer + text model) to the speech recognition accuracy, using more than 20 state-of-the-art speech recognition models.
While the experimental test sets are questionable, the overall direction is somewhat interesting
https://arxiv.org/abs/2307.04172
Can Generative Large Language Models Perform ASR Error Correction?
Rao Ma, Mengjie Qian, Potsawee Manakul, Mark Gales, Kate Knill
ASR error correction continues to serve as an important part of post-processing for speech recognition systems. Traditionally, these models are trained with supervised training using the decoding results of the underlying ASR system and the reference text. This approach is computationally intensive and the model needs to be re-trained when switching the underlying ASR model. Recent years have seen the development of large language models and their ability to perform natural language processing tasks in a zero-shot manner. In this paper, we take ChatGPT as an example to examine its ability to perform ASR error correction in the zero-shot or 1-shot settings. We use the ASR N-best list as model input and propose unconstrained error correction and N-best constrained error correction methods. Results on a Conformer-Transducer model and the pre-trained Whisper model show that we can largely improve the ASR system performance with error correction using the powerful ChatGPT model.
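A hedged sketch of how the two prompting modes could be set up (not the paper's exact prompts): unconstrained correction asks for a free-form corrected transcript, while the N-best constrained variant restricts the model to selecting one of the given hypotheses:

```python
# Illustrative prompt construction for zero-shot N-best error correction.
def build_prompt(nbest: list, constrained: bool) -> str:
    hyps = "\n".join(f"{i + 1}. {h}" for i, h in enumerate(nbest))
    if constrained:
        task = ("Pick the hypothesis closest to the true transcript. "
                "Answer with its number only.")
    else:
        task = ("Output the corrected transcript, using the hypotheses "
                "as evidence of likely errors.")
    return (f"These are N-best hypotheses from a speech recognizer:\n"
            f"{hyps}\n{task}")

print(build_prompt(["i scream for ice cream",
                    "eye scream for ice cream"], constrained=False))
```

The constrained mode trades correction power for safety: the model can never hallucinate words outside the recognizer's own candidates.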
This talk focuses on some foundational problems in practical speech recognition and discusses some solutions for each of these problems.
https://www.youtube.com/watch?v=Y6s4EzDTAwA
Speech restoration method Miipher (used to generate LibriTTS-R) has been accepted to WASPAA!! It converts degraded speech to studio quality, and generates almost inexhaustible training data for speech generation.
Demo: https://google.github.io/df-conformer/miipher/
Paper: https://arxiv.org/abs/2303.01664
The original work is also nice:
LibriTTS-R: A Restored Multi-Speaker Text-to-Speech Corpus
Yuma Koizumi, Heiga Zen, Shigeki Karita, Yifan Ding, Kohei Yatabe, Nobuyuki Morioka, Michiel Bacchiani, Yu Zhang, Wei Han, Ankur Bapna
This paper introduces a new speech dataset called ``LibriTTS-R'' designed for text-to-speech (TTS) use. It is derived by applying speech restoration to the LibriTTS corpus, which consists of 585 hours of speech data at 24 kHz sampling rate from 2,456 speakers and the corresponding texts. The constituent samples of LibriTTS-R are identical to those of LibriTTS, with only the sound quality improved. Experimental results show that the LibriTTS-R ground-truth samples showed significantly improved sound quality compared to those in LibriTTS. In addition, neural end-to-end TTS trained with LibriTTS-R achieved speech naturalness on par with that of the ground-truth samples. The corpus is freely available for download from http://www.openslr.org/141/.
https://arxiv.org/abs/2307.03917
On decoder-only architecture for speech-to-text and large language model integration
Jian Wu, Yashesh Gaur, Zhuo Chen, Long Zhou, Yimeng Zhu, Tianrui Wang, Jinyu Li, Shujie Liu, Bo Ren, Linquan Liu, Yu Wu
Microsoft
Large language models (LLMs) have achieved remarkable success in the field of natural language processing, enabling better human-computer interaction using natural language. However, the seamless integration of speech signals into LLMs has not been explored well. The "decoder-only" architecture has also not been well studied for speech processing tasks. In this research, we introduce Speech-LLaMA, a novel approach that effectively incorporates acoustic information into text-based large language models. Our method leverages Connectionist Temporal Classification and a simple audio encoder to map the compressed acoustic features to the continuous semantic space of the LLM. In addition, we further probe the decoder-only architecture for speech-to-text tasks by training a smaller scale randomly initialized speech-LLaMA model from speech-text paired data alone. We conduct experiments on multilingual speech-to-text translation tasks and demonstrate a significant improvement over strong baselines, highlighting the potential advantages of decoder-only models for speech-to-text conversion.
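A sketch of the CTC-based compression idea as described: drop blank frames and merge repeats along a greedy CTC path, then project the surviving acoustic vectors into the LLM's embedding space and prepend them to the text embeddings. Dimensions and the projection layer are illustrative:

```python
# Illustrative CTC compression + projection into an LLM's embedding space.
import torch
import torch.nn as nn

def ctc_compress(enc, ctc_logits, blank=0):
    """enc: [T, H] encoder frames; ctc_logits: [T, V].
    Keep one frame per non-blank CTC segment."""
    ids = ctc_logits.argmax(-1)                   # greedy CTC path
    keep, prev = [], blank
    for t, i in enumerate(ids.tolist()):
        if i != blank and i != prev:              # start of a new segment
            keep.append(t)
        prev = i
    return enc[keep] if keep else enc[:1]

H, D = 512, 4096                                  # encoder dim, LLM dim (illustrative)
proj = nn.Linear(H, D)                            # map into LLM token space
enc, ctc_logits = torch.randn(200, H), torch.randn(200, 1000)
audio_emb = proj(ctc_compress(enc, ctc_logits))   # [T_compressed, D]
# prefix = torch.cat([audio_emb, text_token_embeddings], dim=0)  # decoder-only input
```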
In line with this work, we're open-sourcing a new dataset to help the broader community improve fairness of speech recognition models. The dataset includes ~27K utterances in recorded speech from 595 paid participants.
Dataset ➡️ https://ai.meta.com/datasets/speech-fairness-dataset/
https://twitter.com/MetaAI/status/1679525451667238913
The talk is really nice and touches on many hot problems in modern tech:
1. RNN-T models are fast but don't really work for rare words; deeper integration of an LM is needed. LODR-like integration helps (see the sketch after these notes). A rare-word WER metric is needed too.
2. Modern transducers are very bad at finding the true alignment; they win in accuracy by pushing everything to the end.
3. Streaming speech recognition is 2 times less accurate. Google hopes to recover more than 50% of that gap with more advanced neural network architectures.
4. Self-supervised training does not really work, as Google sees it. They propose their own loss, more focused on ASR than the contrastive loss.
Some extra points discussed:
1. Are blank states harmful?
2. Is it possible to include intonation and other emotional cues in a lattice representation?
3. WER is not the right metric for streaming either.
Afterthought: a lot of the things we are doing now could fundamentally change in the future.
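On point 1, a minimal sketch of LODR-style scoring during beam search, with illustrative weights: add an external LM and subtract a low-order LM (e.g. a bigram trained on the ASR training text) to cancel the transducer's implicit internal LM:

```python
# Illustrative LODR-style shallow-fusion score; weights are placeholders
# that would normally be tuned on a dev set.
def lodr_score(asr_logp: float, ext_lm_logp: float, low_lm_logp: float,
               lam_ext: float = 0.4, lam_low: float = 0.2) -> float:
    return asr_logp + lam_ext * ext_lm_logp - lam_low * low_lm_logp

# per-token use inside beam search:
score = lodr_score(asr_logp=-1.2, ext_lm_logp=-2.0, low_lm_logp=-1.5)
```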