Speech Technology
As we advocate for prosody evaluations in TTS systems, this paper is important.

The metric itself is questionable though, and so are the results (I'd experiment with the CFG value in flow matching systems)

https://arxiv.org/abs/2509.19928

Measuring Prosody Diversity in Zero-Shot TTS: A New Metric, Benchmark, and Exploration

Yifan Yang, Bing Han, Hui Wang, Long Zhou, Wei Wang, Mingyu Cui, Xu Tan, Xie Chen

Prosody diversity is essential for achieving naturalness and expressiveness in zero-shot text-to-speech (TTS). However, frequently used acoustic metrics capture only partial views of prosodic variation and correlate poorly with human perception, leaving the problem of reliably quantifying prosody diversity underexplored. To bridge this gap, we introduce ProsodyEval, a prosody diversity assessment dataset that provides Prosody Mean Opinion Score (PMOS) alongside conventional acoustic metrics. ProsodyEval comprises 1000 speech samples derived from 7 mainstream TTS systems, with 2000 human ratings. Building on this, we propose the Discretized Speech Weighted Edit Distance (DS-WED), a new objective diversity metric that quantifies prosodic variation via weighted edit distance over semantic tokens. Experiments on ProsodyEval show that DS-WED achieves substantially higher correlation with human judgments than existing acoustic metrics, while remaining highly robust in speech tokenization from HuBERT and WavLM. Leveraging DS-WED, we benchmark state-of-the-art open-source TTS systems on LibriSpeech test-clean and Seed-TTS test-en, and further explorations uncover several factors that influence prosody diversity, including generative modeling paradigms, duration control, and reinforcement learning. Moreover, we find that current large audio language models (LALMs) remain limited in capturing prosodic variations. Audio samples are available at this https URL.
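The core of DS-WED is a weighted edit distance over discrete (semantic) token sequences. The paper's actual token weights aren't given in the abstract, so the following is only a minimal sketch: a standard Levenshtein DP with per-operation weights, plus a hypothetical `pairwise_diversity` aggregation over multiple synthesized samples of the same text.

```python
from itertools import combinations

def weighted_edit_distance(a, b, w_sub=1.0, w_ins=1.0, w_del=1.0):
    """Levenshtein-style DP with per-operation weights over token sequences."""
    m, n = len(a), len(b)
    d = [[0.0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        d[i][0] = i * w_del
    for j in range(1, n + 1):
        d[0][j] = j * w_ins
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            sub_cost = 0.0 if a[i - 1] == b[j - 1] else w_sub
            d[i][j] = min(d[i - 1][j - 1] + sub_cost,  # substitute / match
                          d[i - 1][j] + w_del,         # delete from a
                          d[i][j - 1] + w_ins)         # insert from b
    return d[m][n]

def pairwise_diversity(token_seqs):
    """Mean length-normalized distance over all pairs of samples of one text."""
    pairs = list(combinations(token_seqs, 2))
    return sum(weighted_edit_distance(a, b) / max(len(a), len(b))
               for a, b in pairs) / len(pairs)
```

Identical samples score 0, fully different samples score near 1, so the number behaves like a diversity measure; the real DS-WED presumably learns or tunes the weights per token.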
We consider resonance a core physical principle of signal processing, so we're really interested in research like this recent one.
https://github.com/alexandrefrancois/noFFT

https://alexandrefrancois.org/Resonate/

https://alexandrefrancois.org/assets/publications/FrancoisARJ-ICMC2025.pdf

This paper describes Resonate, an original low latency, low memory footprint, and low computational cost algorithm to evaluate perceptually relevant spectral information from audio signals. The fundamental building block is a resonator model that accumulates the signal contribution around its resonant frequency in the time domain, using the Exponentially Weighted Moving Average (EWMA). A compact, iterative formulation of the model affords computing an update at each signal input sample, requiring no buffering and involving only a handful of arithmetic operations. Consistently with on-line perceptual signal analysis, the EWMA gives more weight to recent input values, whereas the contributions of older values decay exponentially. A single parameter governs the dynamics of the system. Banks of such resonators, independently tuned to geometrically spaced resonant frequencies, compute an instantaneous, perceptually relevant estimate of the spectral content of an input signal in real-time. Both memory and per-sample computational complexity of such a bank are linear in the number of resonators, and independent of the number of input samples processed, or duration of processed signal. Furthermore, since the resonators are independent, there is no constraint on the tuning of their resonant frequencies or time constants, and all per sample computations can be parallelized across resonators. The cumulative computational cost for a given duration increases linearly with the number of input samples processed. The low latency afforded by Resonate opens the door to real-time music and speech applications that are out of the reach of FFT-based methods. The efficiency of the approach could reduce computational costs and inspire new designs for low-level audio processing layers in machine learning systems.
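The per-sample update is simple enough to sketch. Below is a minimal, unoptimized illustration of one EWMA resonator and a geometrically spaced bank; the specific constants (alpha, semitone spacing, 48 resonators) are my choices for illustration, not the paper's.

```python
import cmath
import math

class Resonator:
    """EWMA resonator: accumulates the signal contribution around its
    resonant frequency f (Hz) in the time domain, sample by sample."""
    def __init__(self, f, sr, alpha):
        self.step = cmath.exp(-2j * math.pi * f / sr)  # per-sample phasor
        self.alpha = alpha        # single parameter governing the dynamics
        self.phase = 1.0 + 0.0j
        self.z = 0.0 + 0.0j       # accumulated complex amplitude

    def process(self, x):
        # Demodulate the sample at the resonant frequency, then fold it
        # into the EWMA: recent samples weigh more, old ones decay.
        self.phase *= self.step
        self.z += self.alpha * (x * self.phase - self.z)
        return abs(self.z)        # instantaneous magnitude estimate

# A bank: independent resonators at geometrically spaced frequencies.
sr = 16000
freqs = [55.0 * 2 ** (k / 12) for k in range(48)]   # semitone spacing
bank = [Resonator(f, sr, alpha=0.005) for f in freqs]
```

Per sample, each resonator does a handful of complex multiply-adds with no buffering, and the bank's memory and per-sample cost are linear in the number of resonators, which matches the complexity claims above.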
The principle itself is applicable not just to signal processing, but to upper layers too, something in line with https://en.wikipedia.org/wiki/Adaptive_resonance_theory
One more reminder that VAE latents are better than mel-spectrograms

https://github.com/ZhikangNiu/Semantic-VAE

Good and simple improvement over F5TTS

https://arxiv.org/abs/2509.22167

Zhikang Niu, Shujie Hu, Jeongsoo Choi, Yushen Chen, Peining Chen, Pengcheng Zhu, Yunting Yang, Bowen Zhang, Jian Zhao, Chunhui Wang, Xie Chen

While mel-spectrograms have been widely utilized as intermediate representations in zero-shot text-to-speech (TTS), their inherent redundancy leads to inefficiency in learning text-speech alignment. Compact VAE-based latent representations have recently emerged as a stronger alternative, but they also face a fundamental optimization dilemma: higher-dimensional latent spaces improve reconstruction quality and speaker similarity, but degrade intelligibility, while lower-dimensional spaces improve intelligibility at the expense of reconstruction fidelity. To overcome this dilemma, we propose Semantic-VAE, a novel VAE framework that utilizes semantic alignment regularization in the latent space. This design alleviates the reconstruction-generation trade-off by capturing semantic structure in high-dimensional latent representations. Extensive experiments demonstrate that Semantic-VAE significantly improves synthesis quality and training efficiency. When integrated into F5-TTS, our method achieves 2.10% WER and 0.64 speaker similarity on LibriSpeech-PC, outperforming mel-based systems (2.23%, 0.60) and vanilla acoustic VAE baselines (2.65%, 0.59). We also release the code and models to facilitate further research.
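The abstract only names the mechanism, so the following is a guess at the shape of the objective: a standard VAE loss plus an alignment penalty between a projection of the latent and a semantic (text-derived) embedding. The cosine form, the projection, and `lam` are all assumptions on my part.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

def semantic_vae_loss(recon, kl, z_proj, text_emb, lam=1.0):
    """recon: reconstruction term; kl: KL term; z_proj: projected latent;
    text_emb: semantic target embedding for the same utterance."""
    align = 1.0 - cosine(z_proj, text_emb)  # 0 when perfectly aligned
    return recon + kl + lam * align
```

The point of such a regularizer is that the high-dimensional latent can keep its reconstruction fidelity while the alignment term pulls it toward semantic structure, easing the intelligibility-vs-fidelity dilemma the abstract describes.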
We released 4 new models for the Kazakh and Kyrgyz languages. The models are trained for the old Vosk; they still have a lot of value for applications where you need to quickly update the LM.

https://alphacephei.com/vosk/models

vosk-model-small-ky-0.42

WER fleurs 18.95
WER cv 16.96

vosk-model-ky-0.42

WER fleurs 13.45
WER cv 8.75

vosk-model-small-kz-0.42

WER fleurs 21.10
WER cv 30.00
WER ksc 9.70
WER ksc-other 24.86

vosk-model-kz-0.42

WER fleurs 13.09
WER cv 12.50
WER ksc 4.49
WER ksc-other 18.51
BUT continues the work on Whisper instead of OpenAI

https://github.com/BUTSpeechFIT/SOT-DiCoW

https://arxiv.org/abs/2510.03723

Adapting Diarization-Conditioned Whisper for End-to-End Multi-Talker Speech Recognition

Martin Kocour, Martin Karafiat, Alexander Polok, Dominik Klement, Lukáš Burget, Jan Černocký

We propose a speaker-attributed (SA) Whisper-based model for multi-talker speech recognition that combines target-speaker modeling with serialized output training (SOT). Our approach leverages a Diarization-Conditioned Whisper (DiCoW) encoder to extract target-speaker embeddings, which are concatenated into a single representation and passed to a shared decoder. This enables the model to transcribe overlapping speech as a serialized output stream with speaker tags and timestamps. In contrast to target-speaker ASR systems such as DiCoW, which decode each speaker separately, our approach performs joint decoding, allowing the decoder to condition on the context of all speakers simultaneously. Experiments show that the model outperforms existing SOT-based approaches and surpasses DiCoW on multi-talker mixtures (e.g., LibriMix).
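Serialized output training flattens overlapping speech into a single stream ordered by start time, with speaker tags and timestamps. A toy illustration of that serialization step (the tag format here is made up for readability, not the model's actual vocabulary):

```python
def serialize(segments):
    """segments: iterable of (start_sec, speaker_id, text) tuples,
    possibly overlapping in time. Returns one serialized stream
    sorted by start time, tagged with speaker and timestamp."""
    parts = []
    for start, spk, text in sorted(segments):
        parts.append(f"<t:{start:.2f}> <spk:{spk}> {text}")
    return " ".join(parts)
```

The decoder then sees all speakers' context in one stream, which is what lets SOT-DiCoW condition jointly instead of decoding each target speaker separately as DiCoW does.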
Everyone talks about https://github.com/neuphonic/neucodec

Neuphonic itself is celebrating a funding round as a promising British startup.

Not sure why: the codec is really huge at 800M parameters, so it must be very context-dependent.
Meta's take on audio LLMs

https://arxiv.org/abs/2510.06195

Latent Speech-Text Transformer

Yen-Ju Lu, Yashesh Gaur, Wei Zhou, Benjamin Muller, Jesus Villalba, Najim Dehak, Luke Zettlemoyer, Gargi Ghosh, Mike Lewis, Srinivasan Iyer, Duc Le

Auto-regressive speech-text models are typically pre-trained on a large number of interleaved sequences of text tokens and raw speech encoded as speech tokens using vector quantization. These models have demonstrated state-of-the-art performance in speech-to-speech understanding and generation benchmarks, together with promising scaling laws, primarily enabled by the representational alignment between text and speech. Nevertheless, they suffer from shortcomings, partly owing to the disproportionately longer sequences of speech tokens in contrast to textual tokens. This results in a large compute imbalance between modalities during pre-training as well as during inference, and a potential hindrance to effectively aligning speech and text, ultimately translating to several orders of magnitude slower scaling laws. We introduce the Latent Speech-Text Transformer (LST), which makes pre-training speech-text models more data-efficient by dynamically and inexpensively aggregating speech tokens into latent speech patches. These patches serve as higher-level units that can either align with corresponding textual units to aid capability transfer or even encapsulate common speech sequences like silences to be more compute-efficient. We show that LST outperforms vanilla approaches on speech-to-speech as well as text-to-text benchmarks in both data- and compute-controlled settings, the former indicating more effective representational alignment and the latter indicating steeper scaling laws for speech-text models. On HellaSwag story completion, LST achieves 6.5% absolute gain in speech accuracy under compute-controlled training and 5.3% under data-controlled training, while also improving text performance. We will release our models, code, and the evaluation data to facilitate further research.
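LST's aggregation into latent patches is learned and dynamic; as a fixed-rule stand-in, here's a sketch that collapses runs of a silence token and groups speech tokens into small patches, just to show how the speech stream gets shorter relative to text. The silence id and patch size are illustrative assumptions.

```python
def patchify(tokens, silence_id=0, max_patch=4):
    """Greedy stand-in for LST's learned aggregation: collapse runs of a
    silence token into one patch and group consecutive speech tokens
    into patches of at most max_patch tokens."""
    patches, run = [], []
    for t in tokens:
        if t == silence_id:
            if run:                       # flush any pending speech patch
                patches.append(tuple(run))
                run = []
            if not patches or patches[-1] != (silence_id,):
                patches.append((silence_id,))  # one patch per silence run
        else:
            run.append(t)
            if len(run) == max_patch:
                patches.append(tuple(run))
                run = []
    if run:
        patches.append(tuple(run))
    return patches
```

Even this crude rule shortens the token stream, which is the compute-imbalance problem the paper is attacking; the learned version additionally tries to make patches align with textual units.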
CoT has to come

https://arxiv.org/abs/2510.07497

Can Speech LLMs Think while Listening?
Yi-Jen Shih, Desh Raj, Chunyang Wu, Wei Zhou, SK Bong, Yashesh Gaur, Jay Mahadeokar, Ozlem Kalinli, Mike Seltzer
Recent advances in speech large language models (speech LLMs) have enabled seamless spoken interactions, but these systems still struggle with complex reasoning tasks. Previously, chain-of-thought (CoT) prompting or fine-tuning has been shown to significantly improve the reasoning abilities of text-based LLMs. In this work, we investigate the effect of CoT fine-tuning for multi-stream speech LLMs, demonstrating that reasoning in text space improves the accuracy of speech LLMs by 2.4x, on average, over a suite of spoken reasoning tasks. Beyond accuracy, the latency of the spoken response is a crucial factor for interacting with voice-based agents. Inspired by the human behavior of "thinking while listening," we propose methods to reduce the additional latency from reasoning by allowing the model to start reasoning before the user query has ended. To achieve this, we introduce an entropy-based metric, "question completeness," which acts as an indicator to guide the model on the optimal time to start reasoning. This method provides greater control over the accuracy-latency trade-off compared with heuristic-based approaches and, under equivalent latency conditions, yields a 4% accuracy gain on ARC-Easy. Finally, we use Direct Preference Optimization (DPO) on preference data created using rejection sampling to push the accuracy-latency Pareto frontier further, resulting in a 70% reduction in latency without loss in accuracy.
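The "question completeness" signal is entropy-based. A minimal sketch of the idea (the threshold value and exactly which distribution gets measured are assumptions on my part, not the paper's recipe):

```python
import math

def entropy(probs):
    """Shannon entropy (nats) of a next-token distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0)

def should_start_reasoning(probs, threshold=1.0):
    # Low entropy = the model is confident about how the query will end,
    # so it can start reasoning before the user finishes speaking.
    return entropy(probs) < threshold
```

A peaked distribution (entropy near 0) triggers early reasoning; a flat one (entropy near log of the vocabulary size) means the query could still go many ways, so the model keeps listening.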
Next talk is on 17 Oct at 1pm (UTC+0). Alexander Polok from Brno University of Technology is going to talk about multi-talker ASR!

Please also note that the talk is slightly earlier than usual.

Below is the link to the talk.

https://ed-ac-uk.zoom.us/j/88650204315
Meeting ID: 886 5020 4315
Passcode: sigml2011

✉️ Don't forget to subscribe to our mailing list https://groups.google.com/g/isca-sigml
🎦 Previous talks and recordings can be found at https://homepages.inf.ed.ac.uk/htang2/sigml/seminar/

Adapting Single-Speaker ASR to Handle Conversations

State-of-the-art ASR systems perform exceptionally well in single-speaker scenarios, but they often struggle with conversations that feature significant speech overlap. Traditional target-speaker ASR methods, which rely on speaker embeddings or enrollment, face challenges in generalization and typically require prior knowledge of the speakers. To overcome these limitations, this talk introduces DiCoW (Diarization-Conditioned Whisper), which conditions ASR on diarization outputs to achieve robust multi-talker transcription with minimal training data. DiCoW has already powered the award-winning CHiME-8 and MLC-SLM systems.

Building on this success, I will present SE-DiCoW (Self-Enrolled DiCoW), an improved version that automatically resolves speaker ambiguities by selecting enrollments from long-form recordings. The talk will also highlight EMMA MT-ASR, the first unified benchmark for multi-talker ASR, alongside recent DiCoW extensions developed during JSALT 2025, demonstrating the evolving capabilities of diarization-conditioned approaches.

Bio: Alexander Polok is a Junior Researcher and PhD student at the Faculty of Information Technology, Brno University of Technology (BUT). His research focuses on speech recognition, with an emphasis on practical and efficient methods for applying ASR models in conversational settings. He has received several honors, including the Brno PhD Talent Scholarship, the Jury Award for CHiME 8, and the MLC-SLM Best Reproducibility Award. He also participated in the JSALT workshops in 2023 and 2025.
Interesting repo of the day: Whisper adaptation using text only

https://github.com/hon9kon9ize/whistle

https://arxiv.org/abs/2509.10452

WhisTLE: Deeply Supervised, Text-Only Domain Adaptation for Pretrained Speech Recognition Transformers

Akshat Pandey, Karun Kumar, Raphael Tang

Pretrained automatic speech recognition (ASR) models such as Whisper perform well but still need domain adaptation to handle unseen vocabulary and parlance. In many real-world settings, collecting speech data is impractical, necessitating text-only adaptation. We propose WhisTLE, a deeply supervised, text-only adaptation method for pretrained encoder-decoder ASR models. WhisTLE trains a variational autoencoder (VAE) to model encoder outputs from text and fine-tunes the decoder using the learned text-to-latent encoder, optionally combined with text-to-speech (TTS) adaptation. At inference, the original encoder is restored, incurring no extra runtime cost. Across four out-of-domain datasets and four ASR models, WhisTLE with TTS reduces word error rate (WER) by 12.3% relative to TTS-only adaptation and outperforms all non-WhisTLE baselines in 27 of 32 scenarios.
As technology advances, proper evaluation becomes more and more complex. This is a great example.

https://arxiv.org/abs/2510.16567

Hallucination Benchmark for Speech Foundation Models

Alkis Koudounas, Moreno La Quatra, Manuel Giollo, Sabato Marco Siniscalchi, Elena Baralis

Hallucinations in automatic speech recognition (ASR) systems refer to fluent and coherent transcriptions produced by neural ASR models that are completely unrelated to the underlying acoustic input (i.e., the speech signal). While similar to conventional decoding errors in potentially compromising the usability of transcriptions for downstream applications, hallucinations can be more detrimental due to their preservation of syntactically and semantically plausible structure. This apparent coherence can mislead subsequent processing stages and introduce serious risks, particularly in critical domains such as healthcare and law. Conventional evaluation metrics are primarily centered on error-based metrics and fail to distinguish between phonetic inaccuracies and hallucinations. Consequently, there is a critical need for new evaluation frameworks that can effectively identify and assess models with a heightened propensity for generating hallucinated content. To this end, we introduce SHALLOW, the first benchmark framework that systematically categorizes and quantifies hallucination phenomena in ASR along four complementary axes: lexical, phonetic, morphological, and semantic. We define targeted metrics within each category to produce interpretable profiles of model behavior. Through evaluation across various architectures and speech domains, we have found that SHALLOW metrics correlate strongly with word error rate (WER) when recognition quality is high (i.e., low WER). Still, this correlation weakens substantially as WER increases. SHALLOW, therefore, captures fine-grained error patterns that WER fails to distinguish under degraded and challenging conditions. Our framework supports specific diagnosis of model weaknesses and provides feedback for model improvement beyond what aggregate error rates can offer.