Forwarded from Nick Fisher
https://petewarden.com/2024/10/21/introducing-moonshine-the-new-state-of-the-art-for-speech-to-text/
Pete Warden's blog
Introducing Moonshine, the new state of the art for speech to text
Can you imagine using a keyboard where it took a key press two seconds to show up on screen? That’s the typical latency for most voice interfaces, so it’s no wonder they’ve failed…
A good Chinese MLLM
https://github.com/westlake-baichuan-mllm/bc-omni
https://arxiv.org/abs/2410.08565
Baichuan-Omni Technical Report
The salient multimodal capabilities and interactive experience of GPT-4o highlight its critical role in practical applications, yet it lacks a high-performing open-source counterpart. In this paper, we introduce Baichuan-Omni, the first open-source 7B Multimodal Large Language Model (MLLM) adept at concurrently processing and analyzing modalities of image, video, audio, and text, while delivering an advanced multimodal interactive experience and strong performance. We propose an effective multimodal training schema starting with 7B model...
Quite an in-depth paper on continuous vs. discrete representations.
https://arxiv.org/abs/2410.16048
Continuous Speech Synthesis using per-token Latent Diffusion
Arnon Turetzky, Nimrod Shabtay, Slava Shechtman, Hagai Aronowitz, David Haws, Ron Hoory, Avihu Dekel
The success of autoregressive transformer models with discrete tokens has inspired quantization-based approaches for continuous modalities, though these often limit reconstruction quality. We therefore introduce SALAD, a per-token latent diffusion model for zero-shot text-to-speech, that operates on continuous representations. SALAD builds upon the recently proposed expressive diffusion head for image generation, and extends it to generate variable-length outputs. Our approach utilizes semantic tokens for providing contextual information and determining the stopping condition. We suggest three continuous variants for our method, extending popular discrete speech synthesis techniques. Additionally, we implement discrete baselines for each variant and conduct a comparative analysis of discrete versus continuous speech modeling techniques. Our results demonstrate that both continuous and discrete approaches are highly competent, and that SALAD achieves a superior intelligibility score while obtaining speech quality and speaker similarity on par with the ground-truth audio.
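To make the decoding loop concrete, here is a minimal sketch of per-token latent diffusion: the transformer still runs left-to-right, but instead of sampling a discrete token at each step, a small diffusion head denoises a continuous latent conditioned on the hidden state. All interfaces below are illustrative assumptions, not the paper's code, and the stopping head is a stand-in for the paper's semantic-token-based stopping condition.

```python
import torch

@torch.no_grad()
def salad_style_decode(transformer, diff_head, alpha_bar, stop_head,
                       text_emb, latent_dim=64, steps=50, max_len=500):
    """Hypothetical per-token latent diffusion decoding.

    transformer: (text_emb, latents so far) -> one hidden state per next position
    diff_head:   predicts noise eps from (noisy latent, step, hidden state)
    alpha_bar:   1-D tensor of cumulative noise-schedule products
    stop_head:   end-of-speech logit from the hidden state (stand-in for the
                 paper's semantic-token stopping condition)
    """
    generated = torch.zeros(1, 0, latent_dim)            # no latents yet
    for _ in range(max_len):
        h = transformer(text_emb, generated)[:, -1]      # state for next token
        z = torch.randn(1, latent_dim)                   # start from pure noise
        for t in reversed(range(steps)):                 # reverse diffusion
            eps = diff_head(z, t, h)
            x0 = (z - (1 - alpha_bar[t]).sqrt() * eps) / alpha_bar[t].sqrt()
            if t > 0:                                    # re-noise (crude DDPM-style step)
                z = alpha_bar[t - 1].sqrt() * x0 \
                    + (1 - alpha_bar[t - 1]).sqrt() * torch.randn_like(x0)
            else:
                z = x0
        generated = torch.cat([generated, z[:, None]], dim=1)
        if stop_head(h).sigmoid() > 0.5:                 # variable-length output
            break
    return generated  # continuous latents; a decoder/vocoder turns them into audio
```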
F5 made a splash. This one is a bit more complicated, but it is also a better version (it uses a more reasonable audio codec, for example).
https://maskgct.github.io
MaskGCT: Zero-Shot Text-to-Speech with Masked Generative Codec Transformer
Yuancheng Wang, Haoyue Zhan, Liwei Liu, Ruihong Zeng, Haotian Guo, Jiachen Zheng, Qiang Zhang, Xueyao Zhang, Shunsi Zhang, Zhizheng Wu
The recent large-scale text-to-speech (TTS) systems are usually grouped as autoregressive and non-autoregressive systems. The autoregressive systems implicitly model duration but exhibit certain deficiencies in robustness and lack of duration controllability. Non-autoregressive systems require explicit alignment information between text and speech during training and predict durations for linguistic units (e.g. phone), which may compromise their naturalness. In this paper, we introduce Masked Generative Codec Transformer (MaskGCT), a fully non-autoregressive TTS model that eliminates the need for explicit alignment information between text and speech supervision, as well as phone-level duration prediction. MaskGCT is a two-stage model: in the first stage, the model uses text to predict semantic tokens extracted from a speech self-supervised learning (SSL) model, and in the second stage, the model predicts acoustic tokens conditioned on these semantic tokens. MaskGCT follows the mask-and-predict learning paradigm. During training, MaskGCT learns to predict masked semantic or acoustic tokens based on given conditions and prompts. During inference, the model generates tokens of a specified length in a parallel manner. Experiments with 100K hours of in-the-wild speech demonstrate that MaskGCT outperforms the current state-of-the-art zero-shot TTS systems in terms of quality, similarity, and intelligibility. Audio samples are available at this https URL. We release our code and model checkpoints at this https URL.
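The mask-and-predict inference described above is essentially MaskGIT-style iterative parallel decoding, applied in both the semantic and the acoustic stage. A hedged sketch of that loop; the model interface and the cosine re-masking schedule are my assumptions, not the released MaskGCT code:

```python
import math
import torch

@torch.no_grad()
def mask_predict_decode(model, cond, seq_len, mask_id, n_iters=20):
    """Iterative mask-and-predict decoding over a fixed, specified length.
    model(tokens, cond) is assumed to return per-position logits."""
    tokens = torch.full((1, seq_len), mask_id, dtype=torch.long)  # fully masked
    for it in range(n_iters):
        logits = model(tokens, cond)                      # (1, seq_len, vocab)
        conf, pred = logits.softmax(-1).max(-1)           # confidence + argmax
        still_masked = tokens.eq(mask_id)
        tokens = torch.where(still_masked, pred, tokens)  # fill masked slots
        conf = conf.masked_fill(~still_masked, float("inf"))  # keep fixed tokens
        # Cosine schedule: fraction of positions re-masked for the next round.
        n_mask = int(math.cos(math.pi / 2 * (it + 1) / n_iters) * seq_len)
        if n_mask > 0:                                    # re-mask least confident
            idx = conf.topk(n_mask, largest=False).indices
            tokens[0, idx[0]] = mask_id
    return tokens
```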
"We don't want 200ms latency, that's just not useful"
Will Williams is CTO of Speechmatics in Cambridge. In this sponsored episode, he shares deep technical insights into modern speech recognition technology and system architecture. The episode covers several key technical areas:
Speechmatics' hybrid approach to ASR, which focuses on unsupervised learning methods, achieving comparable results with 100x less data than fully supervised approaches. Williams explains why this is more efficient and generalizable than end-to-end models like Whisper.
Their production architecture implementing multiple operating points for different latency-accuracy trade-offs, with careful latency padding (up to 1.8 seconds) to ensure a consistent user experience (see the toy sketch below). The system uses lattice-based decoding with language model integration for improved accuracy.
The challenges and solutions in real-time ASR, including their approach to diarization (speaker identification), handling cross-talk, and implicit source separation. Williams explains why these problems remain difficult even with modern deep learning approaches.
Their testing and deployment infrastructure, including the use of mirrored environments for catching edge cases in production, and their strategy of maintaining global models rather than allowing customer-specific fine-tuning.
Technical evolution in ASR, from early days of custom CUDA kernels and manual memory management to modern frameworks, with Williams offering interesting critiques of current PyTorch memory management approaches and arguing for more efficient direct memory allocation in production systems.
https://www.youtube.com/watch?v=k6eXkBtYIHg
YouTube
One Step Closer to the Star Trek Voice AI Assistant!
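The latency-padding point from the summary above is easy to show in code. A toy sketch, not Speechmatics' implementation: partials can still stream immediately, but a word is only finalized a fixed interval after its audio end time, so the perceived delay stays consistent instead of jittery.

```python
class PaddedFinalizer:
    """Finalize words a fixed `padding` seconds after their audio end time."""
    def __init__(self, padding=1.8):
        self.padding = padding
        self.pending = []                      # list of (word, audio_end_time)

    def push(self, word, audio_end_time):
        self.pending.append((word, audio_end_time))

    def finalized(self, now):
        done = [w for w, t in self.pending if now - t >= self.padding]
        self.pending = [(w, t) for w, t in self.pending if now - t < self.padding]
        return done
```

An operating point then just becomes a choice of `padding` (and model) per latency-accuracy trade-off.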
https://twitter.com/SamueleCornell/status/1849115845516984758
https://arxiv.org/abs/2408.09215
Generating Data with Text-to-Speech and Large-Language Models for Conversational Speech Recognition
Samuele Cornell, Jordan Darefsky, Zhiyao Duan, Shinji Watanabe
Currently, a common approach in many speech processing tasks is to leverage large scale pre-trained models by fine-tuning them on in-domain data for a particular application. Yet obtaining even a small amount of such data can be problematic, especially for sensitive domains and conversational speech scenarios, due to both privacy issues and annotation costs. To address this, synthetic data generation using single speaker datasets has been employed. Yet, for multi-speaker cases, such an approach often requires extensive manual effort and is prone to domain mismatches. In this work, we propose a synthetic data generation pipeline for multi-speaker conversational ASR, leveraging a large language model (LLM) for content creation and a conversational multi-speaker text-to-speech (TTS) model for speech synthesis. We conduct evaluation by fine-tuning the Whisper ASR model for telephone and distant conversational speech settings, using both in-domain data and generated synthetic data. Our results show that the proposed method is able to significantly outperform classical multi-speaker generation approaches that use external, non-conversational speech datasets.
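The pipeline shape is simple enough to sketch. Everything below is a hypothetical skeleton: `chat_llm` and `conv_tts` are placeholders for an LLM and a conversational multi-speaker TTS, not the authors' actual tooling; the resulting pairs would then be used to fine-tune Whisper as in the paper.

```python
def make_synthetic_conversation(chat_llm, conv_tts, topic):
    # 1) LLM writes the content: a two-speaker dialogue with turn markers.
    script = chat_llm(
        f"Write a natural two-speaker phone conversation about {topic}. "
        "Prefix every turn with 'A:' or 'B:' and include backchannels."
    )
    turns = [ln.strip() for ln in script.splitlines()
             if ln.strip()[:2] in ("A:", "B:")]
    # 2) A conversational multi-speaker TTS renders the whole script at once,
    #    so prosody and turn-taking (pauses, overlaps) stay coherent.
    audio = conv_tts(turns, speakers=("A", "B"))
    reference = " ".join(t[2:].strip() for t in turns)
    return audio, reference                    # (waveform, transcript) pair
```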
Some notes on Speechmatics interview:
Latency should be dynamic: modern marketing around ever-smaller latency is not reasonable, but dynamic, context-dependent latency is a real thing. Audio LLMs enable that.
Lattices are not the optimal representation of the search space once you care about many aspects of speech (emotion, etc.). Vectorized representations suit GPUs better: they are more compact and learnable. With lattices we get some control over the results, but we restrict ourselves at the same time.
The wav2vec-like learning Speechmatics uses is 100x faster, but at the same time it is very hard to learn the long tail of the distribution from audio alone, without lexical information. Semi-supervised learning or a full end-to-end approach definitely has an advantage here.
Continuous learning (active inference) is something to think about more actively; it is very important for the future.
We released new Vosk models for Persian; WER improved significantly.
https://alphacephei.com/vosk/models/vosk-model-fa-0.42.zip
https://alphacephei.com/vosk/models/vosk-model-small-fa-0.42.zip
For more details see
https://github.com/alphacep/awesome-speech/blob/main/persian.md#asr-results
https://arxiv.org/abs/2410.18908
A Survey on Speech Large Language Models
Jing Peng, Yucheng Wang, Yu Xi, Xu Li, Xizhuo Zhang, Kai Yu
Large Language Models (LLMs) exhibit strong contextual understanding and remarkable multi-task performance. Therefore, researchers have been seeking to integrate LLMs in the broad sense of Spoken Language Understanding (SLU) field. Different from the traditional method of cascading LLMs to process text generated by Automatic Speech Recognition(ASR), new efforts have focused on designing architectures centered around Audio Feature Extraction - Multimodal Information Fusion - LLM Inference(Speech LLMs). This approach enables richer audio feature extraction while simultaneously facilitating end-to-end fusion of audio and text modalities, thereby achieving deeper understanding and reasoning from audio data. This paper elucidates the development of Speech LLMs, offering an in-depth analysis of system architectures and training strategies. Through extensive research and a series of targeted experiments, the paper assesses Speech LLMs' advancements in Rich Audio Transcription and its potential for Cross-task Integration within the SLU field. Additionally, it indicates key challenges uncovered through experimentation, such as the Dormancy of LLMs under certain conditions. The paper further delves into the training strategies for Speech LLMs, proposing potential solutions based on these findings, and offering valuable insights and references for future research in this domain, as well as LLM applications in multimodal contexts.
A nice paper with a few interesting details. The extra CTC head for stabilizing Whisper is interesting, for example.
https://arxiv.org/abs/2409.09543
Target Speaker ASR with Whisper
Alexander Polok, Dominik Klement, Matthew Wiesner, Sanjeev Khudanpur, Jan Černocký, Lukáš Burget
We propose a novel approach to enable the use of large, single speaker ASR models, such as Whisper, for target speaker ASR. The key insight of this method is that it is much easier to model relative differences among speakers by learning to condition on frame-level diarization outputs than to learn the space of all speaker embeddings. We find that adding even a single bias term per diarization output type before the first transformer block can transform single speaker ASR models into target speaker ASR models. Our target-speaker ASR model can be used for speaker attributed ASR by producing, in sequence, a transcript for each hypothesized speaker in a diarization output. This simplified model for speaker attributed ASR using only a single microphone outperforms cascades of speech separation and diarization by 11% absolute ORC-WER on the NOTSOFAR-1 dataset.
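The key trick is small enough to write down. A hedged reading of the abstract: one learned bias vector per frame-level diarization label, added to the encoder features before the first transformer block; the label set below is my assumption.

```python
import torch
import torch.nn as nn

class DiarizationBias(nn.Module):
    """One learned bias vector per diarization output type (e.g. target,
    non-target, overlap, silence), added to the encoder input frames."""
    def __init__(self, d_model, n_labels=4):
        super().__init__()
        self.bias = nn.Embedding(n_labels, d_model)
        nn.init.zeros_(self.bias.weight)       # starts as a no-op on Whisper

    def forward(self, frames, diar_labels):
        # frames: (B, T, d_model) features; diar_labels: (B, T) integer labels
        return frames + self.bias(diar_labels)
```

Decoding once per hypothesized speaker, with that speaker marked as the target, then yields speaker-attributed transcripts.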
Even with our new speech codec, producing a 2-minute dialogue requires generating over 5000 tokens. To model these long sequences, we developed a specialized Transformer architecture that can efficiently handle hierarchies of information, matching the structure of our acoustic tokens.
https://deepmind.google/discover/blog/pushing-the-frontiers-of-audio-generation/
Google DeepMind
Pushing the frontiers of audio generation
Our pioneering speech generation technologies are helping people around the world interact with more natural, conversational and intuitive digital assistants and AI tools.
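Quick arithmetic on the quote above: 5000 tokens for a 2-minute (120 s) dialogue is roughly 42 tokens per second of audio, so a flat autoregressive transformer would have to attend over thousands of positions even for short clips; that is the motivation for the hierarchy-aware architecture.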
Fish Agent V0.1 3B is a groundbreaking Voice-to-Voice model capable of capturing and generating environmental audio information with unprecedented accuracy. What sets it apart is its semantic-token-free architecture, eliminating the need for traditional semantic encoders/decoders like Whisper and CosyVoice.
Additionally, it stands as a state-of-the-art text-to-speech (TTS) model, trained on an extensive dataset of 700,000 hours of multilingual audio content.
This model is a continued pretraining of Qwen-2.5-3B-Instruct on 200B voice & text tokens.
https://huggingface.co/fishaudio/fish-agent-v0.1-3b
Overall, we find no evidence that multiscale aspects of MR-HuBERT lead to improved acquisition of high level concepts. The question now is how to build an architecture that does leverage this hierarchy?🤔 (4/5)
https://twitter.com/theo_clark_/status/1852299593272131874
https://arxiv.org/abs/2410.23955
It is simply bad
https://arxiv.org/abs/2411.03866
Performance evaluation of SLAM-ASR: The Good, the Bad, the Ugly, and the Way Forward
Shashi Kumar, Iuliia Thorbecke, Sergio Burdisso, Esaú Villatoro-Tello, Manjunath K E, Kadri Hacioğlu, Pradeep Rangappa, Petr Motlicek, Aravind Ganapathiraju, Andreas Stolcke
Recent research has demonstrated that training a linear connector between speech foundation encoders and large language models (LLMs) enables this architecture to achieve strong ASR capabilities. Despite the impressive results, it remains unclear whether these simple approaches are robust enough across different scenarios and speech conditions, such as domain shifts and different speech perturbations. In this paper, we address these questions by conducting various ablation experiments using a recent and widely adopted approach called SLAM-ASR. We present novel empirical findings that offer insights on how to effectively utilize the SLAM-ASR architecture across a wide range of settings. Our main findings indicate that the SLAM-ASR exhibits poor performance in cross-domain evaluation settings. Additionally, speech perturbations within in-domain data, such as changes in speed or the presence of additive noise, can significantly impact performance. Our findings offer critical insights for fine-tuning and configuring robust LLM-based ASR models, tailored to different data characteristics and computational resources.
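For reference, the architecture being stress-tested is tiny to write down: a frozen speech encoder, a trainable connector (frame stacking plus a linear projection), and a frozen LLM trained with the usual next-token loss. A sketch with illustrative dimensions, not the exact SLAM-ASR configuration:

```python
import torch
import torch.nn as nn

class LinearConnector(nn.Module):
    """Stack k encoder frames and project them into the LLM embedding space."""
    def __init__(self, enc_dim=1280, llm_dim=4096, k=5):
        super().__init__()
        self.k = k
        self.proj = nn.Linear(enc_dim * k, llm_dim)

    def forward(self, enc_out):                # (B, T, enc_dim), frozen encoder
        B, T, D = enc_out.shape
        T = T - T % self.k                     # drop the ragged tail
        x = enc_out[:, :T].reshape(B, T // self.k, D * self.k)
        return self.proj(x)                    # (B, T/k, llm_dim) speech "tokens"
```

The projected speech tokens are prepended to the prompt embeddings and only the connector is trained; the paper's point is that this recipe, while strong in-domain, degrades sharply under domain shift and perturbation.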
Apple's papers are always very practical. This one is also good, with many in-depth experiments and practical cases. Note that the biasing effect is minimal (usually WER only goes down a little, e.g. 17% -> 15%).
https://arxiv.org/abs/2411.00664
Optimizing Contextual Speech Recognition Using Vector Quantization for Efficient Retrieval
Nikolaos Flemotomos, Roger Hsiao, Pawel Swietojanski, Takaaki Hori, Dogan Can, Xiaodan Zhuang
Neural contextual biasing allows speech recognition models to leverage contextually relevant information, leading to improved transcription accuracy. However, the biasing mechanism is typically based on a cross-attention module between the audio and a catalogue of biasing entries, which means computational complexity can pose severe practical limitations on the size of the biasing catalogue and consequently on accuracy improvements. This work proposes an approximation to cross-attention scoring based on vector quantization and enables compute- and memory-efficient use of large biasing catalogues. We propose to use this technique jointly with a retrieval based contextual biasing approach. First, we use an efficient quantized retrieval module to shortlist biasing entries by grounding them on audio. Then we use retrieved entries for biasing. Since the proposed approach is agnostic to the biasing method, we investigate using full cross-attention, LLM prompting, and a combination of the two. We show that retrieval based shortlisting allows the system to efficiently leverage biasing catalogues of several thousands of entries, resulting in up to 71% relative error rate reduction in personal entity recognition. At the same time, the proposed approximation algorithm reduces compute time by 20% and memory usage by 85-95%, for lists of up to one million entries, when compared to standard dot-product cross-attention.
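The retrieval approximation can be sketched compactly: pre-quantize each biasing entry's key vector to a small codebook, score the audio against the codebook once, and read per-entry scores off by table lookup; only the shortlist then goes through exact biasing (full cross-attention or LLM prompting). Shapes and the scoring rule below are my simplification of the paper's method.

```python
import torch

def vq_shortlist(audio_q, entry_codes, codebook, top_k=100):
    """audio_q:     (T, d) audio query vectors
       entry_codes: (N,)   codebook index assigned offline to each entry
       codebook:    (K, d) centroids, with K much smaller than N"""
    centroid_scores = audio_q @ codebook.T               # (T, K), not (T, N)
    best = centroid_scores.max(dim=0).values             # (K,) per-centroid max
    entry_scores = best[entry_codes]                     # (N,) via table lookup
    k = min(top_k, entry_codes.numel())
    return entry_scores.topk(k).indices                  # shortlisted entries
```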