Speech Technology
We like reviews. People still use n-gram rescoring + LSTM rescoring for best accuracy. The most effective systems just ensemble everything, Kaggle-style.

https://arxiv.org/abs/2507.18161

Recent Trends in Distant Conversational Speech Recognition: A Review of CHiME-7 and 8 DASR Challenges

Samuele Cornell, Christoph Boeddeker, Taejin Park, He Huang, Desh Raj, Matthew Wiesner, Yoshiki Masuyama, Xuankai Chang, Zhong-Qiu Wang, Stefano Squartini, Paola Garcia, Shinji Watanabe

The CHiME-7 and 8 distant speech recognition (DASR) challenges focus on multi-channel, generalizable, joint automatic speech recognition (ASR) and diarization of conversational speech. With participation from 9 teams submitting 32 diverse systems, these challenges have contributed to state-of-the-art research in the field. This paper outlines the challenges' design, evaluation metrics, datasets, and baseline systems while analyzing key trends from participant submissions. From this analysis it emerges that: 1) Most participants use end-to-end (e2e) ASR systems, whereas hybrid systems were prevalent in previous CHiME challenges. This transition is mainly due to the availability of robust large-scale pre-trained models, which lowers the data burden for e2e-ASR. 2) Despite recent advances in neural speech separation and enhancement (SSE), all teams still heavily rely on guided source separation, suggesting that current neural SSE techniques are still unable to reliably deal with complex scenarios and different recording setups. 3) All best systems employ diarization refinement via target-speaker diarization techniques. Accurate speaker counting in the first diarization pass is thus crucial to avoid compounding errors and CHiME-8 DASR participants especially focused on this part. 4) Downstream evaluation via meeting summarization can correlate weakly with transcription quality due to the remarkable effectiveness of large-language models in handling errors. On the NOTSOFAR-1 scenario, even systems with over 50% time-constrained minimum permutation WER can perform roughly on par with the most effective ones (around 11%). 5) Despite recent progress, accurately transcribing spontaneous speech in challenging acoustic environments remains difficult, even when using computationally intensive system ensembles.
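
For readers unfamiliar with the rescoring trick mentioned above, here is a minimal sketch of n-best rescoring that interpolates acoustic, n-gram LM, and LSTM LM scores. All weights and scores are made up for illustration; real CHiME systems tune them on a dev set and further ensemble many such models.

```python
# Hypothetical n-best rescoring: combine acoustic, n-gram LM, and neural (LSTM) LM
# log-scores with interpolation weights, then pick the best hypothesis. Weights and
# scores below are invented; real systems tune them on development data.

def rescore_nbest(nbest, w_ac=1.0, w_ngram=0.3, w_lstm=0.7, len_bonus=0.5):
    """nbest: list of dicts with 'text', 'ac_score', 'ngram_lm', 'lstm_lm' (log-probs)."""
    def total(h):
        n_words = len(h["text"].split())
        return (w_ac * h["ac_score"]
                + w_ngram * h["ngram_lm"]
                + w_lstm * h["lstm_lm"]
                + len_bonus * n_words)          # length bonus to counter short-hypothesis bias
    return max(nbest, key=total)

nbest = [
    {"text": "turn the lights of",  "ac_score": -12.1, "ngram_lm": -9.8, "lstm_lm": -11.4},
    {"text": "turn the lights off", "ac_score": -12.4, "ngram_lm": -8.1, "lstm_lm": -7.9},
]
print(rescore_nbest(nbest)["text"])  # -> "turn the lights off"
```
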
We like the in-depth evaluations in this research.

https://github.com/Anuttacon/speech_drame

https://arxiv.org/abs/2511.01261

Speech-DRAME: A Framework for Human-Aligned Benchmarks in Speech Role-Play

Jiatong Shi, Jionghao Han, Yichen Lu, Santiago Pascual, Pengfei Wu, Chenye Cui, Shinji Watanabe, Chao Weng, Cong Zhou

Role-play has become a key testbed for generative models, expanding from text-only dialogue to multimodal interaction. Extending role-play to speech captures prosody, emotion, and delivery, but also poses new evaluation challenges. Current pipelines often use audio large language models (ALLMs) as zero-shot judges, which miss paralinguistic cues, collapse multiple aspects into coarse scores, and rely on synthetic speech references that fail to reflect real-world roles. We present Speech-DRAME, a unified framework that contributes at three levels: (i) Speech-DRAME-EvalBench, an evaluation benchmark with bilingual human-annotated data and protocols for training and testing speech evaluation models (SEMs), (ii) DRAME-Eval, a fine-tuned evaluation model, which substantially outperforms zero-shot and few-shot ALLMs, and (iii) Speech-DRAME-RoleBench, a speech role-play benchmark that leverages DRAME-Eval as an automatic judge to compare speech foundation models (SFMs). Speech-DRAME distinguishes between two complementary evaluation strategies: Archetype Evaluation, a top-down approach measuring adherence to broad role archetypes, and Realism Evaluation, a bottom-up approach grounded in real human speech that emphasizes nuanced role quality. Compared to zero-shot ALLM judges, DRAME-Eval achieves stronger agreement with human ratings (Pearson correlation from 0.480 to 0.629 in archetypes, and 0.390 to 0.625 in realism). By integrating transparent benchmark resources, modeling approaches, and system-level evaluation, Speech-DRAME provides the first comprehensive, reproducible foundation for assessing spoken role-play.
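
For context on the reported agreement numbers, here is a tiny sketch of how the Pearson correlation between judge scores and human ratings is computed. The ratings below are invented, and scipy is assumed to be available.

```python
# Toy example of measuring judge/human agreement with Pearson correlation,
# as reported in the abstract (numbers below are invented, not from the paper).
from scipy.stats import pearsonr

human_ratings = [4.0, 2.5, 3.0, 5.0, 1.5, 4.5]   # human MOS-style scores per clip
judge_scores  = [3.8, 2.9, 3.2, 4.7, 2.0, 4.1]   # automatic evaluation model scores

r, p_value = pearsonr(human_ratings, judge_scores)
print(f"Pearson r = {r:.3f} (p = {p_value:.3g})")
```
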
Greetings from Voice Tech For All team!

We are pleased to announce the launch of the Voice Tech for All Challenge — a Text-to-Speech (TTS) innovation challenge hosted by IISc and SPIRE Lab, powered by Bhashini, GIZ’s FAIR Forward, ARMMAN, and ARTPARK, along with Google for Developers as our Community Partner.

This challenge invites startups, developers, researchers, students and faculty members to build the next generation of multilingual, expressive Text-to-Speech (TTS) systems, making voice technology accessible to community health workers, especially for low-resource Indian languages.

Why Join?

Access high-quality open datasets in 11 Indian languages (SYSPIN + SPICOR)
Build SOTA open-source multi-speaker, multilingual TTS with accent & style transfer
Winning model to be deployed in maternal health assistant (ARMMAN)
🏆 Prizes worth ₹8.5 Lakhs await!
🔗 Registration link: https://syspin.iisc.ac.in/register
🌐Learn more: https://syspin.iisc.ac.in/voicetechforall

Warm regards,
Team Voice Tech For All
IISc (Indian Institute of Science)
This should have nice properties

https://huggingface.co/aiola/drax-v1

https://github.com/aiola-lab/drax

https://arxiv.org/abs/2510.04162

Drax: Speech Recognition with Discrete Flow Matching

Aviv Navon, Aviv Shamsian, Neta Glazer, Yael Segal-Feldman, Gill Hetz, Joseph Keshet, Ethan Fetaya

Diffusion and flow-based non-autoregressive (NAR) models have shown strong promise in large language modeling; however, their potential for automatic speech recognition (ASR) remains largely unexplored. We propose Drax, a discrete flow matching framework for ASR that enables efficient parallel decoding. To better align training with inference, we construct an audio-conditioned probability path that guides the model through trajectories resembling likely intermediate inference errors, rather than direct random noise to target transitions. Our theoretical analysis links the generalization gap to divergences between training and inference occupancies, controlled by cumulative velocity errors, thereby motivating our design choice. Empirical evaluation demonstrates that our approach attains recognition accuracy on par with state-of-the-art speech models while offering improved accuracy-efficiency trade-offs, highlighting discrete flow matching as a promising direction for advancing NAR ASR.
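
The audio-conditioned probability path is the paper's core contribution and is not reproduced here; below is only a rough sketch of the generic NAR idea of iteratively refining a token sequence in parallel, conditioned on audio, with a placeholder model.

```python
import torch

def parallel_refine(model, audio_feats, seq_len, steps=4, mask_id=0):
    """Very rough sketch of NAR iterative decoding: start from an all-mask sequence,
    re-predict every position in parallel at each step, and keep only the most
    confident tokens. `model(tokens, audio_feats)` -> logits [1, seq_len, vocab]
    is a placeholder; Drax's discrete flow matching path is more sophisticated."""
    tokens = torch.full((1, seq_len), mask_id, dtype=torch.long)
    for step in range(steps):
        logits = model(tokens, audio_feats)            # parallel prediction of all positions
        probs, preds = logits.softmax(-1).max(-1)
        keep_masked = 1.0 - (step + 1) / steps         # fraction of positions left masked
        n_mask = int(keep_masked * seq_len)
        if n_mask > 0:
            low_conf = probs[0].argsort()[:n_mask]     # re-mask the least confident positions
            preds[0, low_conf] = mask_id
        tokens = preds
    return tokens
```
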
Sounds reasonable for TTS

https://github.com/auspicious3000/ProsodyLM

ProsodyLM — a speech language model
→ With novel prosody tokenization (not audio tokenization)
→ Achieves superior prosody capabilities with pre-training only (no alignment)

https://arxiv.org/abs/2507.20091

ProsodyLM: Uncovering the Emerging Prosody Processing Capabilities in Speech Language Models

Kaizhi Qian, Xulin Fan, Junrui Ni, Slava Shechtman, Mark Hasegawa-Johnson, Chuang Gan, Yang Zhang

Speech language models refer to language models with speech processing and understanding capabilities. One key desirable capability for speech language models is the ability to capture the intricate interdependency between content and prosody. The existing mainstream paradigm of training speech language models, which converts speech into discrete tokens before feeding them into LLMs, is sub-optimal in learning prosody information -- we find that the resulting LLMs do not exhibit obvious emerging prosody processing capabilities via pre-training alone. To overcome this, we propose ProsodyLM, which introduces a simple tokenization scheme amenable to learning prosody. Each speech utterance is first transcribed into text, followed by a sequence of word-level prosody tokens. Compared with conventional speech tokenization schemes, the proposed tokenization scheme retains more complete prosody information, and is more understandable to text-based LLMs. We find that ProsodyLM can learn surprisingly diverse emerging prosody processing capabilities through pre-training alone, ranging from harnessing the prosody nuances in generated speech, such as contrastive focus, understanding emotion and stress in an utterance, to maintaining prosody consistency in long contexts.
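
A toy sketch of the general idea of word-level prosody tokens follows: each word is tagged with coarse quantized prosody values. The bin ranges and token names below are invented, not ProsodyLM's actual inventory.

```python
import numpy as np

def prosody_tokens(words, f0_means, durations, n_bins=8,
                   f0_range=(80.0, 300.0), dur_range=(0.05, 0.8)):
    """Toy word-level prosody tokenization: each word is followed by coarse pitch
    and duration bin tokens. Bin ranges and token names are illustrative only;
    ProsodyLM's actual scheme is described in the paper."""
    def bin_of(x, lo, hi):
        return int(np.clip((x - lo) / (hi - lo) * n_bins, 0, n_bins - 1))
    out = []
    for w, f0, d in zip(words, f0_means, durations):
        out += [w, f"<p{bin_of(f0, *f0_range)}>", f"<d{bin_of(d, *dur_range)}>"]
    return " ".join(out)

print(prosody_tokens(["I", "REALLY", "mean", "it"],
                     [120.0, 210.0, 150.0, 140.0],
                     [0.10, 0.35, 0.20, 0.30]))
```
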
Also


Combining Autoregressive Models and Phonological Knowledge Bases for Improved Accuracy in Korean Grapheme-to-Phoneme Conversion
https://ieeexplore.ieee.org/document/11045935
Real-Time Speech AI just got faster with Parakeet-Realtime-EOU-120m.
This NVIDIA streaming ASR model is designed specifically for Voice AI agents requiring low-latency interactions.

* Ultra-Low Latency: Achieves streaming recognition with latency as low as 80ms.
* Smart EOU Detection: Automatically signals "End-of-Utterance" with a dedicated <EOU> token, allowing agents to know exactly when a user stops speaking without long pauses.
* Efficient Architecture: Built on the cache-aware FastConformer-RNNT architecture with 120M parameters, optimized for edge deployment.

🤗 Try the model on Hugging Face: https://huggingface.co/nvidia/parakeet_realtime_eou_120m-v1
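
How an agent consumes the dedicated <EOU> token is up to the application; here is a minimal sketch with a hypothetical stream_tokens generator standing in for the actual NeMo streaming inference call, which is not shown.

```python
# Hypothetical consumption of a streaming ASR token stream that emits "<EOU>"
# when the user stops speaking. `stream_tokens` stands in for the real
# NeMo/Parakeet streaming inference loop, which is not reproduced here.

def handle_stream(stream_tokens, on_utterance):
    buffer = []
    for tok in stream_tokens:
        if tok == "<EOU>":                 # model signals end of utterance
            on_utterance(" ".join(buffer)) # hand the finished utterance to the agent
            buffer = []
        else:
            buffer.append(tok)

# Toy usage with a fake token stream:
fake_stream = ["turn", "off", "the", "lights", "<EOU>", "thanks", "<EOU>"]
handle_stream(iter(fake_stream), on_utterance=lambda text: print("AGENT got:", text))
```
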
Supertone released their models: https://huggingface.co/spaces/Supertone/supertonic. Fast and well-tuned NAR TTS with flow matching. Sounds a bit uniform, but overall very nice.

No code, just an ONNX model.

Paper here:

https://arxiv.org/abs/2503.23108

SupertonicTTS: Towards Highly Efficient and Streamlined Text-to-Speech System

Hyeongju Kim, Jinhyeok Yang, Yechan Yu, Seunghun Ji, Jacob Morton, Frederik Bous, Joon Byun, Juheon Lee

We introduce SupertonicTTS, a novel text-to-speech (TTS) system designed for efficient and streamlined speech synthesis. SupertonicTTS comprises three components: a speech autoencoder for continuous latent representation, a text-to-latent module leveraging flow-matching for text-to-latent mapping, and an utterance-level duration predictor. To enable a lightweight architecture, we employ a low-dimensional latent space, temporal compression of latents, and ConvNeXt blocks. The TTS pipeline is further simplified by operating directly on raw character-level text and employing cross-attention for text-speech alignment, thus eliminating the need for grapheme-to-phoneme (G2P) modules and external aligners. In addition, we propose context-sharing batch expansion that accelerates loss convergence and stabilizes text-speech alignment with minimal memory and I/O overhead. Experimental results demonstrate that SupertonicTTS delivers performance comparable to contemporary zero-shot TTS models with only 44M parameters, while significantly reducing architectural complexity and computational cost. Audio samples are available on the project page.
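
A minimal sketch of a conditional flow-matching training step on latents, using a plain linear interpolation path; SupertonicTTS's actual conditioning, duration predictor, and context-sharing batch expansion are not reproduced here, and velocity_model is a placeholder.

```python
import torch

def flow_matching_loss(velocity_model, z1, text_cond):
    """One conditional flow-matching step: sample noise z0 and time t, build the
    interpolant z_t = (1 - t) * z0 + t * z1, and regress the model's predicted
    velocity onto the target velocity (z1 - z0). `velocity_model(z_t, t, cond)`
    is a placeholder for the text-to-latent network."""
    z0 = torch.randn_like(z1)                   # prior sample
    t = torch.rand(z1.shape[0], 1, 1)           # per-example time in [0, 1]
    z_t = (1 - t) * z0 + t * z1                 # straight-line probability path
    target_v = z1 - z0                          # constant velocity along that path
    pred_v = velocity_model(z_t, t.view(-1), text_cond)
    return torch.mean((pred_v - target_v) ** 2)
```
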
Everyone plays with FocalCodec today

https://lucadellalib.github.io/focalcodec-web/
Interspeech 2026 challenges are about to start

* NeckVibe Challenge: Voice Disorder Detection via Real-World Monitoring of Neck-Surface Vibration
* TidyVoice Challenge: Cross-Lingual Speaker Verification
* Transfer of Pragmatic Intent in Speech-to-Speech Translation
* Audio Encoder Capability Challenge for Large Audio Language Models
* IQRA: Arabic Mispronunciation Detection and Diagnosis Challenge
* Audio Reasoning Challenge
* Unsupervised Speech in the Wild Challenge https://upschallenge.org/
This tech was once very strictly protected.

https://github.com/dywsy21/STCTS

https://arxiv.org/abs/2512.00451

STCTS: Generative Semantic Compression for Ultra-Low Bitrate Speech via Explicit Text-Prosody-Timbre Decomposition

Siyu Wang, Haitao Li

Voice communication in bandwidth-constrained environments--maritime, satellite, and tactical networks--remains prohibitively expensive. Traditional codecs struggle below 1 kbps, while existing semantic approaches (STT-TTS) sacrifice prosody and speaker identity. We present STCTS, a generative semantic compression framework enabling natural voice communication at approximately 80 bps. STCTS explicitly decomposes speech into linguistic content, prosodic expression, and speaker timbre, applying tailored compression: context-aware text encoding (approximately 70 bps), sparse prosody transmission via TTS interpolation (less than 14 bps at 0.1-1 Hz), and amortized speaker embedding.
Evaluations on LibriSpeech demonstrate a 75x bitrate reduction versus Opus (6 kbps) and 12x versus EnCodec (1 kbps), while maintaining perceptual quality (NISQA MOS greater than 4.26). We also discover a bimodal quality distribution with prosody sampling rate: sparse and dense updates both achieve high quality, while mid-range rates degrade due to perceptual discontinuities--guiding optimal configuration design. Beyond efficiency, our modular architecture supports privacy-preserving encryption, human-interpretable transmission, and flexible deployment on edge devices, offering a robust solution for ultra-low bandwidth scenarios.
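
The bitrate claims are easy to sanity-check; here is a small arithmetic sketch reproducing the ratios quoted in the abstract.

```python
# Sanity-checking the bitrate ratios quoted in the abstract.
stcts_bps   = 80      # approximate STCTS total (~70 bps text + <14 bps prosody, amortized timbre)
opus_bps    = 6000    # Opus reference point used in the paper
encodec_bps = 1000    # EnCodec reference point used in the paper

print(opus_bps / stcts_bps)     # 75.0  -> "75x bitrate reduction versus Opus"
print(encodec_bps / stcts_bps)  # 12.5  -> "~12x versus EnCodec"
```
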
Recent research focuses more on dialogue models


Joint Speech and Text Training for LLM-Based End-to-End Spoken Dialogue State Tracking

https://arxiv.org/abs/2511.22503

Katia Vendrame, Bolaji Yusuf, Santosh Kesiraju, Šimon Sedláček, Oldřich Plchot, Jan Černocký

End-to-end spoken dialogue state tracking (DST) is made difficult by the combination of having to handle speech input and data scarcity. Combining speech foundation encoders and large language models has been proposed in recent work to alleviate some of this difficulty. Although this approach has been shown to result in strong spoken DST models, achieving state-of-the-art performance in realistic multi-turn DST, it struggles to generalize across domains and requires annotated spoken DST training data for each domain of interest. However, collecting such data for every target domain is both costly and difficult. Noting that textual DST data is more easily obtained for various domains, in this work, we propose jointly training on available spoken DST data and written textual data from other domains as a way to achieve cross-domain generalization. We conduct experiments which show the efficacy of our proposed method for getting good cross-domain DST performance without relying on spoken training data from the target domains.
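
A rough sketch of the general recipe of mixing spoken DST batches with text-only DST batches from other domains; model.loss and speech_encoder are placeholders, not the paper's actual interfaces or loss weighting.

```python
import random

def joint_training_step(batch_speech, batch_text, model, speech_encoder, p_text=0.5):
    """Toy joint-training step: with probability p_text, train on a text-only DST
    example from another domain; otherwise encode speech and train on a spoken DST
    example. All object interfaces here are invented placeholders."""
    if random.random() < p_text:
        return model.loss(inputs=batch_text["dialogue_text"],
                          targets=batch_text["dialogue_state"])
    feats = speech_encoder(batch_speech["audio"])   # speech foundation encoder features
    return model.loss(inputs=feats,
                      targets=batch_speech["dialogue_state"])
```
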




Voila: Voice-Language Foundation Models for Real-Time Autonomous Interaction and Voice Role-Play

https://arxiv.org/abs/2505.02707

Yemin Shi, Yu Shu, Siwei Dong, Guangyi Liu, Jaward Sesay, Jingwen Li, Zhiting Hu

A voice AI agent that blends seamlessly into daily life would interact with humans in an autonomous, real-time, and emotionally expressive manner. Rather than merely reacting to commands, it would continuously listen, reason, and respond proactively, fostering fluid, dynamic, and emotionally resonant interactions. We introduce Voila, a family of large voice-language foundation models that make a step towards this vision. Voila moves beyond traditional pipeline systems by adopting a new end-to-end architecture that enables full-duplex, low-latency conversations while preserving rich vocal nuances such as tone, rhythm, and emotion. It achieves a response latency of just 195 milliseconds, surpassing the average human response time. Its hierarchical multi-scale Transformer integrates the reasoning capabilities of large language models (LLMs) with powerful acoustic modeling, enabling natural, persona-aware voice generation -- where users can simply write text instructions to define the speaker's identity, tone, and other characteristics. Moreover, Voila supports over one million pre-built voices and efficient customization of new ones from brief audio samples as short as 10 seconds. Beyond spoken dialogue, Voila is designed as a unified model for a wide range of voice-based applications, including automatic speech recognition (ASR), Text-to-Speech (TTS), and, with minimal adaptation, multilingual speech translation. Voila is fully open-sourced to support open research and accelerate progress toward next-generation human-machine interactions.


SALM-Duplex: Efficient and Direct Duplex Modeling for Speech-to-Speech Language Model

https://arxiv.org/abs/2505.15670

Ke Hu, Ehsan Hosseini-Asl, Chen Chen, Edresson Casanova, Subhankar Ghosh, Piotr Żelasko, Zhehuai Chen, Jason Li, Jagadeesh Balam, Boris Ginsburg

Spoken dialogue is an intuitive form of human-computer interaction, yet current speech language models often remain constrained to turn-based exchanges, lacking real-time adaptability such as user barge-in. We propose a novel duplex speech-to-speech (S2S) architecture featuring continuous user inputs and codec agent outputs with channel fusion that directly models simultaneous user and agent streams. Using a pretrained streaming encoder for user input enables the first duplex S2S model without requiring speech pretraining. Separate architectures for agent and user modeling facilitate codec fine-tuning for better agent voices and halve the bitrate (0.6 kbps) compared to previous works. Experimental results show that the proposed model outperforms previous duplex models in reasoning, turn-taking, and barge-in abilities. The model requires significantly less speech data, as speech pretraining is skipped, which markedly simplifies the process of building a duplex S2S model from any LLM. Finally, it is the first openly available duplex S2S model with training and inference code to foster reproducibility.
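
A toy sketch of the channel-fusion idea: frame-aligned user encoder features and agent codec embeddings are projected and merged into one stream for the backbone LLM. The dimensions and the fusion operator below are invented for illustration.

```python
import torch
import torch.nn as nn

class ChannelFusion(nn.Module):
    """Toy duplex channel fusion: project frame-aligned user (streaming encoder)
    features and agent (codec-embedding) features to a shared width and sum them,
    producing one fused stream per frame for the backbone LLM. Sizes are illustrative."""
    def __init__(self, user_dim=512, agent_dim=256, model_dim=1024):
        super().__init__()
        self.user_proj = nn.Linear(user_dim, model_dim)
        self.agent_proj = nn.Linear(agent_dim, model_dim)

    def forward(self, user_feats, agent_feats):
        # user_feats:  [batch, frames, user_dim]  (continuous user input stream)
        # agent_feats: [batch, frames, agent_dim] (embedded agent codec tokens)
        return self.user_proj(user_feats) + self.agent_proj(agent_feats)

fusion = ChannelFusion()
fused = fusion(torch.randn(1, 100, 512), torch.randn(1, 100, 256))
print(fused.shape)  # torch.Size([1, 100, 1024])
```
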



SLAM-Omni: Timbre-Controllable Voice Interaction System with Single-Stage Training

https://arxiv.org/abs/2412.15649

Wenxi Chen, Ziyang Ma, Ruiqi Yan, Yuzhe Liang, Xiquan Li, Ruiyang Xu, Zhikang Niu, Yanqiao Zhu, Yifan Yang, Zhanxun Liu, Kai Yu, Yuxuan Hu, Jinyu Li, Yan Lu, Shujie Liu, Xie Chen

Recent advancements highlight the potential of end-to-end real-time spoken dialogue systems, showcasing their low latency and high quality. In this paper, we introduce SLAM-Omni, a timbre-controllable, end-to-end voice interaction system with single-stage training. SLAM-Omni achieves zero-shot timbre control by modeling spoken language with semantic tokens and decoupling speaker information to a vocoder. By predicting grouped speech semantic tokens at each step, our method significantly reduces the sequence length of audio tokens, accelerating both training and inference. Additionally, we propose historical text prompting to compress dialogue history, facilitating efficient multi-round interactions. Comprehensive evaluations reveal that SLAM-Omni outperforms prior models of similar scale, requiring only 15 hours of training on 4 GPUs with limited data. Notably, it is the first spoken dialogue system to achieve competitive performance with a single-stage training approach, eliminating the need for pre-training on TTS or ASR tasks. Further experiments validate its multilingual and multi-turn dialogue capabilities on larger datasets.
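
The sequence-length saving from grouped semantic tokens is easy to picture; a toy sketch below reshapes a token stream so the model predicts a fixed group of tokens per decoding step (group size and token values are made up).

```python
import torch

def group_tokens(semantic_tokens, group_size=3, pad_id=0):
    """Toy grouping of speech semantic tokens: pad to a multiple of group_size and
    reshape so the model predicts `group_size` tokens per step, shrinking the
    decoded sequence length by that factor. Values are illustrative only."""
    t = torch.as_tensor(semantic_tokens)
    pad = (-len(t)) % group_size
    t = torch.cat([t, torch.full((pad,), pad_id, dtype=t.dtype)])
    return t.view(-1, group_size)          # [steps, group_size]

print(group_tokens([11, 42, 7, 93, 15, 8, 27]))
# 7 tokens -> 3 decoding steps of 3 grouped tokens each
```
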
One more reminder that supervised is usually better than unsupervised. This applies to many cases in structure learning.

https://arxiv.org/abs/2512.03301

Comparing Unsupervised and Supervised Semantic Speech Tokens: A Case Study of Child ASR

Mohan Shi, Natarajan Balaji Shankar, Kaiyuan Zhang, Zilai Wang, Abeer Alwan

Discrete speech tokens have gained attention for their storage efficiency and integration with Large Language Models (LLMs). They are commonly categorized into acoustic and semantic tokens, with the latter being more advantageous for Automatic Speech Recognition (ASR). Traditionally, unsupervised K-means clustering has been used to extract semantic speech tokens from Speech Foundation Models (SFMs). Recently, supervised methods, such as finite scalar quantization (FSQ) trained with ASR loss, have emerged for speech generation. Both approaches leverage pre-trained SFMs, benefiting low-resource tasks such as child ASR.
This paper systematically compares supervised and unsupervised semantic speech tokens for child ASR. Results show that supervised methods not only outperform unsupervised ones but even unexpectedly surpass continuous representations, and they perform well even in ultra-low bitrate settings. These findings highlight the advantages of supervised semantic tokens and offer insights for improving discrete speech tokenization.
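
FSQ itself is simple to sketch: bound each latent dimension and round it to a small number of levels with a straight-through gradient. The version below is a generic illustration, not the configuration used in the paper.

```python
import torch

def fsq(z, levels=(8, 8, 8, 5, 5)):
    """Generic finite scalar quantization: squash each latent dimension into a
    bounded range with tanh, then round to a fixed number of levels per dimension;
    the straight-through estimator keeps gradients flowing through the rounding.
    Level counts here are illustrative, not the paper's configuration."""
    L = torch.tensor(levels, dtype=z.dtype)            # levels per latent dimension
    half = (L - 1) / 2
    bounded = torch.tanh(z) * half                     # each dim now in [-half, half]
    quantized = torch.round(bounded)
    return bounded + (quantized - bounded).detach()    # straight-through rounding

z = torch.randn(2, 5, requires_grad=True)              # [batch, latent_dim]
print(fsq(z))
```
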