Speech Technology
As we advocate for prosody evaluations in TTS systems, this paper is important.

The metric itself is questionable though, and so are the results (I'd experiment with the CFG value in flow matching systems)

https://arxiv.org/abs/2509.19928

Measuring Prosody Diversity in Zero-Shot TTS: A New Metric, Benchmark, and Exploration

Yifan Yang, Bing Han, Hui Wang, Long Zhou, Wei Wang, Mingyu Cui, Xu Tan, Xie Chen

Prosody diversity is essential for achieving naturalness and expressiveness in zero-shot text-to-speech (TTS). However, frequently used acoustic metrics capture only partial views of prosodic variation and correlate poorly with human perception, leaving the problem of reliably quantifying prosody diversity underexplored. To bridge this gap, we introduce ProsodyEval, a prosody diversity assessment dataset that provides Prosody Mean Opinion Score (PMOS) alongside conventional acoustic metrics. ProsodyEval comprises 1000 speech samples derived from 7 mainstream TTS systems, with 2000 human ratings. Building on this, we propose the Discretized Speech Weighted Edit Distance (DS-WED), a new objective diversity metric that quantifies prosodic variation via weighted edit distance over semantic tokens. Experiments on ProsodyEval show that DS-WED achieves substantially higher correlation with human judgments than existing acoustic metrics, while remaining highly robust in speech tokenization from HuBERT and WavLM. Leveraging DS-WED, we benchmark state-of-the-art open-source TTS systems on LibriSpeech test-clean and Seed-TTS test-en, and further explorations uncover several factors that influence prosody diversity, including generative modeling paradigms, duration control, and reinforcement learning. Moreover, we find that current large audio language models (LALMs) remain limited in capturing prosodic variations. Audio samples are available at this https URL.
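The core of DS-WED is a weighted edit distance over discrete (semantic) token sequences. The paper's actual token weights aren't given in the abstract, so the following is only a minimal sketch: a standard Levenshtein DP with per-operation weights, plus a hypothetical `pairwise_diversity` aggregation over multiple synthesized samples of the same text.

```python
from itertools import combinations

def weighted_edit_distance(a, b, w_sub=1.0, w_ins=1.0, w_del=1.0):
    """Levenshtein-style DP with per-operation weights over token sequences."""
    m, n = len(a), len(b)
    d = [[0.0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        d[i][0] = i * w_del
    for j in range(1, n + 1):
        d[0][j] = j * w_ins
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            sub_cost = 0.0 if a[i - 1] == b[j - 1] else w_sub
            d[i][j] = min(d[i - 1][j - 1] + sub_cost,  # substitute / match
                          d[i - 1][j] + w_del,         # delete from a
                          d[i][j - 1] + w_ins)         # insert from b
    return d[m][n]

def pairwise_diversity(token_seqs):
    """Mean length-normalized distance over all pairs of samples of one text."""
    pairs = list(combinations(token_seqs, 2))
    return sum(weighted_edit_distance(a, b) / max(len(a), len(b))
               for a, b in pairs) / len(pairs)
```

Identical samples score 0, fully different samples score near 1, so the number behaves like a diversity measure; the real DS-WED presumably learns or tunes the weights per token.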
We consider resonance a core physical principle of signal processing, so we're really interested in research like this recent one.
https://github.com/alexandrefrancois/noFFT

https://alexandrefrancois.org/Resonate/

https://alexandrefrancois.org/assets/publications/FrancoisARJ-ICMC2025.pdf

This paper describes Resonate, an original low latency, low memory footprint, and low computational cost algorithm to evaluate perceptually relevant spectral information from audio signals. The fundamental building block is a resonator model that accumulates the signal contribution around its resonant frequency in the time domain, using the Exponentially Weighted Moving Average (EWMA). A compact, iterative formulation of the model affords computing an update at each signal input sample, requiring no buffering and involving only a handful of arithmetic operations. Consistently with on-line perceptual signal analysis, the EWMA gives more weight to recent input values, whereas the contributions of older values decay exponentially. A single parameter governs the dynamics of the system. Banks of such resonators, independently tuned to geometrically spaced resonant frequencies, compute an instantaneous, perceptually relevant estimate of the spectral content of an input signal in real-time. Both memory and per-sample computational complexity of such a bank are linear in the number of resonators, and independent of the number of input samples processed, or duration of processed signal. Furthermore, since the resonators are independent, there is no constraint on the tuning of their resonant frequencies or time constants, and all per sample computations can be parallelized across resonators. The cumulative computational cost for a given duration increases linearly with the number of input samples processed. The low latency afforded by Resonate opens the door to real-time music and speech applications that are out of the reach of FFT-based methods. The efficiency of the approach could reduce computational costs and inspire new designs for low-level audio processing layers in machine learning systems.
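The per-sample update is simple enough to sketch. Below is a minimal, unoptimized illustration of one EWMA resonator and a geometrically spaced bank; the specific constants (alpha, semitone spacing, 48 resonators) are my choices for illustration, not the paper's.

```python
import cmath
import math

class Resonator:
    """EWMA resonator: accumulates the signal contribution around its
    resonant frequency f (Hz) in the time domain, sample by sample."""
    def __init__(self, f, sr, alpha):
        self.step = cmath.exp(-2j * math.pi * f / sr)  # per-sample phasor
        self.alpha = alpha        # single parameter governing the dynamics
        self.phase = 1.0 + 0.0j
        self.z = 0.0 + 0.0j       # accumulated complex amplitude

    def process(self, x):
        # Demodulate the sample at the resonant frequency, then fold it
        # into the EWMA: recent samples weigh more, old ones decay.
        self.phase *= self.step
        self.z += self.alpha * (x * self.phase - self.z)
        return abs(self.z)        # instantaneous magnitude estimate

# A bank: independent resonators at geometrically spaced frequencies.
sr = 16000
freqs = [55.0 * 2 ** (k / 12) for k in range(48)]   # semitone spacing
bank = [Resonator(f, sr, alpha=0.005) for f in freqs]
```

Per sample, each resonator does a handful of complex multiply-adds with no buffering, and the bank's memory and per-sample cost are linear in the number of resonators, which matches the complexity claims above.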
The principle itself is applicable not just to signal processing, but to upper layers too, something in line with https://en.wikipedia.org/wiki/Adaptive_resonance_theory
One more reminder that VAE latents are better than mel-spectrograms

https://github.com/ZhikangNiu/Semantic-VAE

Good and simple improvement over F5TTS

https://arxiv.org/abs/2509.22167

Zhikang Niu, Shujie Hu, Jeongsoo Choi, Yushen Chen, Peining Chen, Pengcheng Zhu, Yunting Yang, Bowen Zhang, Jian Zhao, Chunhui Wang, Xie Chen

While mel-spectrograms have been widely utilized as intermediate representations in zero-shot text-to-speech (TTS), their inherent redundancy leads to inefficiency in learning text-speech alignment. Compact VAE-based latent representations have recently emerged as a stronger alternative, but they also face a fundamental optimization dilemma: higher-dimensional latent spaces improve reconstruction quality and speaker similarity, but degrade intelligibility, while lower-dimensional spaces improve intelligibility at the expense of reconstruction fidelity. To overcome this dilemma, we propose Semantic-VAE, a novel VAE framework that utilizes semantic alignment regularization in the latent space. This design alleviates the reconstruction-generation trade-off by capturing semantic structure in high-dimensional latent representations. Extensive experiments demonstrate that Semantic-VAE significantly improves synthesis quality and training efficiency. When integrated into F5-TTS, our method achieves 2.10% WER and 0.64 speaker similarity on LibriSpeech-PC, outperforming mel-based systems (2.23%, 0.60) and vanilla acoustic VAE baselines (2.65%, 0.59). We also release the code and models to facilitate further research.
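The abstract only names the mechanism, so the following is a guess at the shape of the objective: a standard VAE loss plus an alignment penalty between a projection of the latent and a semantic (text-derived) embedding. The cosine form, the projection, and `lam` are all assumptions on my part.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

def semantic_vae_loss(recon, kl, z_proj, text_emb, lam=1.0):
    """recon: reconstruction term; kl: KL term; z_proj: projected latent;
    text_emb: semantic target embedding for the same utterance."""
    align = 1.0 - cosine(z_proj, text_emb)  # 0 when perfectly aligned
    return recon + kl + lam * align
```

The point of such a regularizer is that the high-dimensional latent can keep its reconstruction fidelity while the alignment term pulls it toward semantic structure, easing the intelligibility-vs-fidelity dilemma the abstract describes.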
We released 4 new models for the Kazakh and Kyrgyz languages. The models are trained for the old Vosk; they still have a lot of value for applications where you need to quickly update the LM.

https://alphacephei.com/vosk/models

vosk-model-small-ky-0.42

WER fleurs 18.95
WER cv 16.96

vosk-model-ky-0.42

WER fleurs 13.45
WER cv 8.75

vosk-model-small-kz-0.42

WER fleurs 21.10
WER cv 30.00
WER ksc 9.70
WER ksc-other 24.86

vosk-model-kz-0.42

WER fleurs 13.09
WER cv 12.50
WER ksc 4.49
WER ksc-other 18.51
BUT continues the work on Whisper instead of OpenAI

https://github.com/BUTSpeechFIT/SOT-DiCoW

https://arxiv.org/abs/2510.03723

Adapting Diarization-Conditioned Whisper for End-to-End Multi-Talker Speech Recognition

Martin Kocour, Martin Karafiat, Alexander Polok, Dominik Klement, Lukáš Burget, Jan Černocký

We propose a speaker-attributed (SA) Whisper-based model for multi-talker speech recognition that combines target-speaker modeling with serialized output training (SOT). Our approach leverages a Diarization-Conditioned Whisper (DiCoW) encoder to extract target-speaker embeddings, which are concatenated into a single representation and passed to a shared decoder. This enables the model to transcribe overlapping speech as a serialized output stream with speaker tags and timestamps. In contrast to target-speaker ASR systems such as DiCoW, which decode each speaker separately, our approach performs joint decoding, allowing the decoder to condition on the context of all speakers simultaneously. Experiments show that the model outperforms existing SOT-based approaches and surpasses DiCoW on multi-talker mixtures (e.g., LibriMix).
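Serialized output training flattens overlapping speech into a single stream ordered by start time, with speaker tags and timestamps. A toy illustration of that serialization step (the tag format here is made up for readability, not the model's actual vocabulary):

```python
def serialize(segments):
    """segments: iterable of (start_sec, speaker_id, text) tuples,
    possibly overlapping in time. Returns one serialized stream
    sorted by start time, tagged with speaker and timestamp."""
    parts = []
    for start, spk, text in sorted(segments):
        parts.append(f"<t:{start:.2f}> <spk:{spk}> {text}")
    return " ".join(parts)
```

The decoder then sees all speakers' context in one stream, which is what lets SOT-DiCoW condition jointly instead of decoding each target speaker separately as DiCoW does.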
Everyone talks about https://github.com/neuphonic/neucodec

Neuphonic itself is celebrating a funding round as a promising British startup.

Not sure why: the codec is really huge at 800M parameters, so it must be very context-dependent.
Meta's take on audio LLMs

https://arxiv.org/abs/2510.06195

Latent Speech-Text Transformer

Yen-Ju Lu, Yashesh Gaur, Wei Zhou, Benjamin Muller, Jesus Villalba, Najim Dehak, Luke Zettlemoyer, Gargi Ghosh, Mike Lewis, Srinivasan Iyer, Duc Le

Auto-regressive speech-text models are typically pre-trained on a large number of interleaved sequences of text tokens and raw speech encoded as speech tokens using vector quantization. These models have demonstrated state-of-the-art performance in speech-to-speech understanding and generation benchmarks, together with promising scaling laws, primarily enabled by the representational alignment between text and speech. Nevertheless, they suffer from shortcomings, partly owing to the disproportionately longer sequences of speech tokens in contrast to textual tokens. This results in a large compute imbalance between modalities during pre-training as well as during inference, and a potential hindrance to effectively aligning speech and text, ultimately translating to several orders of magnitude slower scaling laws. We introduce the Latent Speech-Text Transformer (LST), which makes pre-training speech-text models more data-efficient by dynamically and inexpensively aggregating speech tokens into latent speech patches. These patches serve as higher-level units that can either align with corresponding textual units to aid capability transfer or even encapsulate common speech sequences like silences to be more compute-efficient. We show that LST outperforms vanilla approaches on speech-to-speech as well as text-to-text benchmarks in both data- and compute-controlled settings, the former indicating more effective representational alignment and the latter indicating steeper scaling laws for speech-text models. On HellaSwag story completion, LST achieves 6.5% absolute gain in speech accuracy under compute-controlled training and 5.3% under data-controlled training, while also improving text performance. We will release our models, code, and the evaluation data to facilitate further research.
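LST's aggregation into latent patches is learned and dynamic; as a fixed-rule stand-in, here's a sketch that collapses runs of a silence token and groups speech tokens into small patches, just to show how the speech stream gets shorter relative to text. The silence id and patch size are illustrative assumptions.

```python
def patchify(tokens, silence_id=0, max_patch=4):
    """Greedy stand-in for LST's learned aggregation: collapse runs of a
    silence token into one patch and group consecutive speech tokens
    into patches of at most max_patch tokens."""
    patches, run = [], []
    for t in tokens:
        if t == silence_id:
            if run:                       # flush any pending speech patch
                patches.append(tuple(run))
                run = []
            if not patches or patches[-1] != (silence_id,):
                patches.append((silence_id,))  # one patch per silence run
        else:
            run.append(t)
            if len(run) == max_patch:
                patches.append(tuple(run))
                run = []
    if run:
        patches.append(tuple(run))
    return patches
```

Even this crude rule shortens the token stream, which is the compute-imbalance problem the paper is attacking; the learned version additionally tries to make patches align with textual units.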
CoT has to come

https://arxiv.org/abs/2510.07497

Can Speech LLMs Think while Listening?
Yi-Jen Shih, Desh Raj, Chunyang Wu, Wei Zhou, SK Bong, Yashesh Gaur, Jay Mahadeokar, Ozlem Kalinli, Mike Seltzer
Recent advances in speech large language models (speech LLMs) have enabled seamless spoken interactions, but these systems still struggle with complex reasoning tasks. Previously, chain-of-thought (CoT) prompting or fine-tuning has been shown to significantly improve the reasoning abilities of text-based LLMs. In this work, we investigate the effect of CoT fine-tuning for multi-stream speech LLMs, demonstrating that reasoning in text space improves the accuracy of speech LLMs by 2.4x, on average, over a suite of spoken reasoning tasks. Beyond accuracy, the latency of the spoken response is a crucial factor for interacting with voice-based agents. Inspired by the human behavior of "thinking while listening," we propose methods to reduce the additional latency from reasoning by allowing the model to start reasoning before the user query has ended. To achieve this, we introduce an entropy-based metric, "question completeness," which acts as an indicator to guide the model on the optimal time to start reasoning. This method provides greater control over the accuracy-latency trade-off compared with heuristic-based approaches and, under equivalent latency conditions, yields a 4% accuracy gain on ARC-Easy. Finally, we use Direct Preference Optimization (DPO) on preference data created using rejection sampling to push the accuracy-latency Pareto frontier further, resulting in a 70% reduction in latency without loss in accuracy.
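The "question completeness" signal is entropy-based. A minimal sketch of the idea (the threshold value and exactly which distribution gets measured are assumptions on my part, not the paper's recipe):

```python
import math

def entropy(probs):
    """Shannon entropy (nats) of a next-token distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0)

def should_start_reasoning(probs, threshold=1.0):
    # Low entropy = the model is confident about how the query will end,
    # so it can start reasoning before the user finishes speaking.
    return entropy(probs) < threshold
```

A peaked distribution (entropy near 0) triggers early reasoning; a flat one (entropy near log of the vocabulary size) means the query could still go many ways, so the model keeps listening.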
Next talk is on 17 Oct at 1pm (UTC+0). Alexander Polok from Brno University of Technology is going to talk about multi-talker ASR!

Please also note that the talk is slightly earlier than usual.

Below is the link to the talk.

https://ed-ac-uk.zoom.us/j/88650204315
Meeting ID: 886 5020 4315
Passcode: sigml2011

✉️ Don't forget to subscribe to our mailing list https://groups.google.com/g/isca-sigml
🎦 Previous talks and recordings can be found at https://homepages.inf.ed.ac.uk/htang2/sigml/seminar/

Adapting Single-Speaker ASR to Handle Conversations

State-of-the-art ASR systems perform exceptionally well in single-speaker scenarios, but they often struggle with conversations that feature significant speech overlap. Traditional target-speaker ASR methods, which rely on speaker embeddings or enrollment, face challenges in generalization and typically require prior knowledge of the speakers. To overcome these limitations, this talk introduces DiCoW (Diarization-Conditioned Whisper), which conditions ASR on diarization outputs to achieve robust multi-talker transcription with minimal training data. DiCoW has already powered the award-winning CHiME-8 and MLC-SLM systems.

Building on this success, I will present SE-DiCoW (Self-Enrolled DiCoW), an improved version that automatically resolves speaker ambiguities by selecting enrollments from long-form recordings. The talk will also highlight EMMA MT-ASR, the first unified benchmark for multi-talker ASR, alongside recent DiCoW extensions developed during JSALT 2025, demonstrating the evolving capabilities of diarization-conditioned approaches.

Bio: Alexander Polok is a Junior Researcher and PhD student at the Faculty of Information Technology, Brno University of Technology (BUT). His research focuses on speech recognition, with an emphasis on practical and efficient methods for applying ASR models in conversational settings. He has received several honors, including the Brno PhD Talent Scholarship, the Jury Award for CHiME 8, and the MLC-SLM Best Reproducibility Award. He also participated in the JSALT workshops in 2023 and 2025.
Interesting repo of the day: Whisper adaptation using text only

https://github.com/hon9kon9ize/whistle

https://arxiv.org/abs/2509.10452

WhisTLE: Deeply Supervised, Text-Only Domain Adaptation for Pretrained Speech Recognition Transformers

Akshat Pandey, Karun Kumar, Raphael Tang

Pretrained automatic speech recognition (ASR) models such as Whisper perform well but still need domain adaptation to handle unseen vocabulary and parlance. In many real-world settings, collecting speech data is impractical, necessitating text-only adaptation. We propose WhisTLE, a deeply supervised, text-only adaptation method for pretrained encoder-decoder ASR models. WhisTLE trains a variational autoencoder (VAE) to model encoder outputs from text and fine-tunes the decoder using the learned text-to-latent encoder, optionally combined with text-to-speech (TTS) adaptation. At inference, the original encoder is restored, incurring no extra runtime cost. Across four out-of-domain datasets and four ASR models, WhisTLE with TTS reduces word error rate (WER) by 12.3% relative to TTS-only adaptation and outperforms all non-WhisTLE baselines in 27 of 32 scenarios.
As technology advances, proper evaluation becomes more and more complex. This is a great example.

https://arxiv.org/abs/2510.16567

Hallucination Benchmark for Speech Foundation Models

Alkis Koudounas, Moreno La Quatra, Manuel Giollo, Sabato Marco Siniscalchi, Elena Baralis

Hallucinations in automatic speech recognition (ASR) systems refer to fluent and coherent transcriptions produced by neural ASR models that are completely unrelated to the underlying acoustic input (i.e., the speech signal). While similar to conventional decoding errors in potentially compromising the usability of transcriptions for downstream applications, hallucinations can be more detrimental due to their preservation of syntactically and semantically plausible structure. This apparent coherence can mislead subsequent processing stages and introduce serious risks, particularly in critical domains such as healthcare and law. Conventional evaluation metrics are primarily centered on error-based metrics and fail to distinguish between phonetic inaccuracies and hallucinations. Consequently, there is a critical need for new evaluation frameworks that can effectively identify and assess models with a heightened propensity for generating hallucinated content. To this end, we introduce SHALLOW, the first benchmark framework that systematically categorizes and quantifies hallucination phenomena in ASR along four complementary axes: lexical, phonetic, morphological, and semantic. We define targeted metrics within each category to produce interpretable profiles of model behavior. Through evaluation across various architectures and speech domains, we have found that SHALLOW metrics correlate strongly with word error rate (WER) when recognition quality is high (i.e., low WER). Still, this correlation weakens substantially as WER increases. SHALLOW, therefore, captures fine-grained error patterns that WER fails to distinguish under degraded and challenging conditions. Our framework supports specific diagnosis of model weaknesses and provides feedback for model improvement beyond what aggregate error rates can offer.