Speech Technology – Telegram

Speech Technology

1.6K subscribers

122 photos

4 videos

1 file

2.12K links

Download Telegram

About

Blog

Apps

Platform

Speech Technology

1.6K subscribers

Speech Technology

We like some in-depth evaluations in this research

https://github.com/Anuttacon/speech_drame

https://arxiv.org/abs/2511.01261

Speech-DRAME: A Framework for Human-Aligned Benchmarks in Speech Role-Play

Jiatong Shi, Jionghao Han, Yichen Lu, Santiago Pascual, Pengfei Wu, Chenye Cui, Shinji Watanabe, Chao Weng, Cong Zhou

Role-play has become a key testbed for generative models, expanding from text-only dialogue to multimodal interaction. Extending role-play to speech captures prosody, emotion, and delivery, but also poses new evaluation challenges. Current pipelines often use audio large language models (ALLMs) as zero-shot judges, which miss paralinguistic cues, collapse multiple aspects into coarse scores, and rely on synthetic speech references that fail to reflect real-world roles. We present Speech-DRAME, a unified framework that contributes at three levels: (i) Speech-DRAME-EvalBench, an evaluation benchmark with bilingual human-annotated data and protocols for training and testing speech evaluation models (SEMs), (ii) DRAME-Eval, a fine-tuned evaluation model, which substantially outperforms zero-shot and few-shot ALLMs, and (iii) Speech-DRAME-RoleBench, a speech role-play benchmark that leverages DRAME-Eval as an automatic judge to compare speech foundation models (SFMs). Speech-DRAME distinguishes between two complementary evaluation strategies: Archetype Evaluation, a top-down approach measuring adherence to broad role archetypes, and Realism Evaluation, a bottom-up approach grounded in real human speech that emphasizes nuanced role quality. Compared to zero-shot ALLM judges, DRAME-Eval achieves stronger agreement with human ratings (Pearson correlation from 0.480 to 0.629 in archetypes, and 0.390 to 0.625 in realism). By integrating transparent benchmark resources, modeling approaches, and system-level evaluation, Speech-DRAME provides the first comprehensive, reproducible foundation for assessing spoken role-play.

GitHub - Anuttacon/speech_drame

Contribute to Anuttacon/speech_drame development by creating an account on GitHub.

945 views01:29

Speech Technology

Greetings from Voice Tech For All team!

We are pleased to announce the launch of the Voice Tech for All Challenge — a Text-to-Speech (TTS) innovation challenge hosted by IISc and SPIRE Lab, powered by Bhashini, GIZ’s FAIR Forward, ARMMAN, and ARTPARK, along with Google for Developers as our Community Partner.

This challenge invites startups, developers, researchers, students and faculty members to build the next generation of multilingual, expressive Text-to-Speech (TTS) systems, making voice technology accessible to community health workers, especially for low-resource Indian languages.

Why Join?

Access high-quality open datasets in 11 Indian languages (SYSPIN + SPICOR)
Build the SOTA open source multi-speaker, multilingual TTS with accent & style transfer
Winning model to be deployed in maternal health assistant (ARMMAN)
🏆 Prizes worth ₹8.5 Lakhs await!
🔗 Registration link: https://syspin.iisc.ac.in/register
🌐Learn more: https://syspin.iisc.ac.in/voicetechforall

Warm regards,
Team Voice Tech For All
IISc (Indian Institute of Science)

907 views13:35

Speech Technology

This should have nice properties

https://huggingface.co/aiola/drax-v1

https://github.com/aiola-lab/drax

https://arxiv.org/abs/2510.04162

Drax: Speech Recognition with Discrete Flow Matching

Aviv Navon, Aviv Shamsian, Neta Glazer, Yael Segal-Feldman, Gill Hetz, Joseph Keshet, Ethan Fetaya

Diffusion and flow-based non-autoregressive (NAR) models have shown strong promise in large language modeling, however, their potential for automatic speech recognition (ASR) remains largely unexplored. We propose Drax, a discrete flow matching framework for ASR that enables efficient parallel decoding. To better align training with inference, we construct an audio-conditioned probability path that guides the model through trajectories resembling likely intermediate inference errors, rather than direct random noise to target transitions. Our theoretical analysis links the generalization gap to divergences between training and inference occupancies, controlled by cumulative velocity errors, thereby motivating our design choice. Empirical evaluation demonstrates that our approach attains recognition accuracy on par with state-of-the-art speech models while offering improved accuracy-efficiency trade-offs, highlighting discrete flow matching as a promising direction for advancing NAR ASR.

960 views13:43

Speech Technology

Sounds reasonable for TTS

https://github.com/auspicious3000/ProsodyLM

ProsodyLM — a speech language model
→ With novel prosody tokenization (not audio tokenization)
→ Achieves superior prosody capabilities with pre-training only (no alignment)

https://arxiv.org/abs/2507.20091

ProsodyLM: Uncovering the Emerging Prosody Processing Capabilities in Speech Language Models

Kaizhi Qian, Xulin Fan, Junrui Ni, Slava Shechtman, Mark Hasegawa-Johnson, Chuang Gan, Yang Zhang

Speech language models refer to language models with speech processing and understanding capabilities. One key desirable capability for speech language models is the ability to capture the intricate interdependency between content and prosody. The existing mainstream paradigm of training speech language models, which converts speech into discrete tokens before feeding them into LLMs, is sub-optimal in learning prosody information -- we find that the resulting LLMs do not exhibit obvious emerging prosody processing capabilities via pre-training alone. To overcome this, we propose ProsodyLM, which introduces a simple tokenization scheme amenable to learning prosody. Each speech utterance is first transcribed into text, followed by a sequence of word-level prosody tokens. Compared with conventional speech tokenization schemes, the proposed tokenization scheme retains more complete prosody information, and is more understandable to text-based LLMs. We find that ProsodyLM can learn surprisingly diverse emerging prosody processing capabilities through pre-training alone, ranging from harnessing the prosody nuances in generated speech, such as contrastive focus, understanding emotion and stress in an utterance, to maintaining prosody consistency in long contexts.

GitHub - auspicious3000/ProsodyLM: ProsodyLM: Uncovering the Emerging Prosody Processing Capabilities in Speech Language Models

ProsodyLM: Uncovering the Emerging Prosody Processing Capabilities in Speech Language Models - auspicious3000/ProsodyLM

1.09K viewsedited 16:50

Speech Technology

It's important to have the means to adjust network behaviour, so methods like below are very interesting

https://arxiv.org/abs/2505.12973

Fast, Not Fancy: Rethinking G2P with Rich Data and Rule-Based Models

Homograph disambiguation remains a significant challenge in grapheme-to-phoneme (G2P) conversion, especially for low-resource languages. This challenge is twofold: (1) creating balanced and...

1.27K views19:21

Speech Technology

Also

Combining Autoregressive Models and Phonological Knowledge Bases for Improved Accuracy in Korean Grapheme-to-Phoneme Conversion
https://ieeexplore.ieee.org/document/11045935

1.32K views19:23

Speech Technology

Looks interesting

https://github.com/videosdk-live/NAMO-Turn-Detector-v1

GitHub - videosdk-live/NAMO-Turn-Detector-v1: High-performance, semantic turn detection for conversational AI

High-performance, semantic turn detection for conversational AI - videosdk-live/NAMO-Turn-Detector-v1

1.59K views14:49

Speech Technology

Real-Time Speech AI just got faster with Parakeet-Realtime-EOU-120m.
This NVIDIA streaming ASR model is designed specifically for Voice AI agents requiring low-latency interactions.

* Ultra-Low Latency: Achieves streaming recognition with latency as low as 80ms.
* Smart EOU Detection: Automatically signals "End-of-Utterance" with a dedicated <EOU> token, allowing agents to know exactly when a user stops speaking without long pauses.
* Efficient Architecture: Built on the cache-aware FastConformer-RNNT architecture with 120M parameters, optimized for edge deployment.

🤗 Try the model on Hugging Face: https://huggingface.co/nvidia/parakeet_realtime_eou_120m-v1

1.86K views16:45

Speech Technology

https://huggingface.co/spaces/Supertone/supertonic released their models. Fast and well tuned NAR TTS with flow matching. Sound a bit uniform, but overall very nice.

No code, just ONNX model.

Paper here:

https://arxiv.org/abs/2503.23108

SupertonicTTS: Towards Highly Efficient and Streamlined Text-to-Speech System

Hyeongju Kim, Jinhyeok Yang, Yechan Yu, Seunghun Ji, Jacob Morton, Frederik Bous, Joon Byun, Juheon Lee

We introduce SupertonicTTS, a novel text-to-speech (TTS) system designed for efficient and streamlined speech synthesis. SupertonicTTS comprises three components: a speech autoencoder for continuous latent representation, a text-to-latent module leveraging flow-matching for text-to-latent mapping, and an utterance-level duration predictor. To enable a lightweight architecture, we employ a low-dimensional latent space, temporal compression of latents, and ConvNeXt blocks. The TTS pipeline is further simplified by operating directly on raw character-level text and employing cross-attention for text-speech alignment, thus eliminating the need for grapheme-to-phoneme (G2P) modules and external aligners. In addition, we propose context-sharing batch expansion that accelerates loss convergence and stabilizes text-speech alignment with minimal memory and I/O overhead. Experimental results demonstrate that SupertonicTTS delivers performance comparable to contemporary zero-shot TTS models with only 44M parameters, while significantly reducing architectural complexity and computational cost. Audio samples are available at: this https URL.

Supertonic (TTS) - a Hugging Face Space by Supertone

Lightning-Fast, On-Device TTS

1.94K viewsedited 18:08

Speech Technology

Everyone plays with FocalCodec today

https://lucadellalib.github.io/focalcodec-web/

1.18K views02:15

Speech Technology

So no more Kuytai? Gradium is out of stealth to solve voice includes Laurent Mazare and Alexandre Défossez

https://x.com/mattturck/status/1995899063175155852

X (formerly Twitter)

Matt Turck (@mattturck) on X

A major new entrant in voice AI: @GradiumAI

If we were designing computers from scratch today, the default interface probably wouldn’t be a keyboard. It would be voice.

Voice is the most natural interface we have, and probably the most underserved modality…

1.08K views23:19

Speech Technology

Interspeech 2026 challenges are about to start

* NeckVibe Challenge: Voice Disorder Detection via Real-World Monitoring of Neck-Surface Vibration
* TidyVoice Challenge: Cross-Lingual Speaker Verification
* Transfer of Pragmatic Intent in Speech-to-Speech Translation
* Audio Encoder Capability Challenge for Large Audio Language Models
* IQRA: Arabic Mispronunciation Detection and Diagnosis Challenge
* Audio Reasoning Challenge
* Unsupervised Speech in the Wild Challenge https://upschallenge.org/

1.12K views23:10

Speech Technology

These tech was once very strictly proteced

https://github.com/dywsy21/STCTS

https://arxiv.org/abs/2512.00451

STCTS: Generative Semantic Compression for Ultra-Low Bitrate Speech via Explicit Text-Prosody-Timbre Decomposition

Siyu Wang, Haitao Li

Voice communication in bandwidth-constrained environments--maritime, satellite, and tactical networks--remains prohibitively expensive. Traditional codecs struggle below 1 kbps, while existing semantic approaches (STT-TTS) sacrifice prosody and speaker identity. We present STCTS, a generative semantic compression framework enabling natural voice communication at approximately 80 bps. STCTS explicitly decomposes speech into linguistic content, prosodic expression, and speaker timbre, applying tailored compression: context-aware text encoding (approximately 70 bps), sparse prosody transmission via TTS interpolation (less than 14 bps at 0.1-1 Hz), and amortized speaker embedding.
Evaluations on LibriSpeech demonstrate a 75x bitrate reduction versus Opus (6 kbps) and 12x versus EnCodec (1 kbps), while maintaining perceptual quality (NISQA MOS greater than 4.26). We also discover a bimodal quality distribution with prosody sampling rate: sparse and dense updates both achieve high quality, while mid-range rates degrade due to perceptual discontinuities--guiding optimal configuration design. Beyond efficiency, our modular architecture supports privacy-preserving encryption, human-interpretable transmission, and flexible deployment on edge devices, offering a robust solution for ultra-low bandwidth scenarios.

GitHub - dywsy21/STCTS: A voice communication system that achieves ~80 bps while maintaining near-natural quality

A voice communication system that achieves ~80 bps while maintaining near-natural quality - dywsy21/STCTS

1.13K viewsedited 00:40

Speech Technology

https://huggingface.co/microsoft/VibeVoice-Realtime-0.5B

microsoft/VibeVoice-Realtime-0.5B · Hugging Face

We’re on a journey to advance and democratize artificial intelligence through open source and open science.

1.08K views19:09

Speech Technology

Recent research focuses more on dialogue models

Joint Speech and Text Training for LLM-Based End-to-End Spoken Dialogue State Tracking

https://arxiv.org/abs/2511.22503

Katia Vendrame, Bolaji Yusuf, Santosh Kesiraju, Šimon Sedláček, Oldřich Plchot, Jan Černocký

End-to-end spoken dialogue state tracking (DST) is made difficult by the tandem of having to handle speech input and data scarcity. Combining speech foundation encoders and large language models has been proposed in recent work as to alleviate some of this difficulty. Although this approach has been shown to result in strong spoken DST models, achieving state-of-the-art performance in realistic multi-turn DST, it struggles to generalize across domains and requires annotated spoken DST training data for each domain of interest. However, collecting such data for every target domain is both costly and difficult. Noting that textual DST data is more easily obtained for various domains, in this work, we propose jointly training on available spoken DST data and written textual data from other domains as a way to achieve cross-domain generalization. We conduct experiments which show the efficacy of our proposed method for getting good cross-domain DST performance without relying on spoken training data from the target domains.

Voila: Voice-Language Foundation Models for Real-Time Autonomous Interaction and Voice Role-Play

https://arxiv.org/abs/2505.02707

Yemin Shi, Yu Shu, Siwei Dong, Guangyi Liu, Jaward Sesay, Jingwen Li, Zhiting Hu

A voice AI agent that blends seamlessly into daily life would interact with humans in an autonomous, real-time, and emotionally expressive manner. Rather than merely reacting to commands, it would continuously listen, reason, and respond proactively, fostering fluid, dynamic, and emotionally resonant interactions. We introduce Voila, a family of large voice-language foundation models that make a step towards this vision. Voila moves beyond traditional pipeline systems by adopting a new end-to-end architecture that enables full-duplex, low-latency conversations while preserving rich vocal nuances such as tone, rhythm, and emotion. It achieves a response latency of just 195 milliseconds, surpassing the average human response time. Its hierarchical multi-scale Transformer integrates the reasoning capabilities of large language models (LLMs) with powerful acoustic modeling, enabling natural, persona-aware voice generation -- where users can simply write text instructions to define the speaker's identity, tone, and other characteristics. Moreover, Voila supports over one million pre-built voices and efficient customization of new ones from brief audio samples as short as 10 seconds. Beyond spoken dialogue, Voila is designed as a unified model for a wide range of voice-based applications, including automatic speech recognition (ASR), Text-to-Speech (TTS), and, with minimal adaptation, multilingual speech translation. Voila is fully open-sourced to support open research and accelerate progress toward next-generation human-machine interactions.

SALM-Duplex: Efficient and Direct Duplex Modeling for Speech-to-Speech Language Model

https://arxiv.org/abs/2505.15670

Ke Hu, Ehsan Hosseini-Asl, Chen Chen, Edresson Casanova, Subhankar Ghosh, Piotr Żelasko, Zhehuai Chen, Jason Li, Jagadeesh Balam, Boris Ginsburg

947 viewsedited 00:22

Speech Technology

Spoken dialogue is an intuitive form of human-computer interaction, yet current speech language models often remain constrained to turn-based exchanges, lacking real-time adaptability such as user barge-in. We propose a novel duplex speech to speech (S2S) architecture featuring continuous user inputs and codec agent outputs with channel fusion that directly models simultaneous user and agent streams. Using a pretrained streaming encoder for user input enables the first duplex S2S model without requiring speech pretrain. Separate architectures for agent and user modeling facilitate codec fine-tuning for better agent voices and halve the bitrate (0.6 kbps) compared to previous works. Experimental results show that the proposed model outperforms previous duplex models in reasoning, turn-taking, and barge-in abilities. The model requires significantly less speech data, as speech pretrain is skipped, which markedly simplifies the process of building a duplex S2S model from any LLMs. Finally, it is the first openly available duplex S2S model with training and inference code to foster reproducibility.

SLAM-Omni: Timbre-Controllable Voice Interaction System with Single-Stage Training

https://arxiv.org/abs/2412.15649

Wenxi Chen, Ziyang Ma, Ruiqi Yan, Yuzhe Liang, Xiquan Li, Ruiyang Xu, Zhikang Niu, Yanqiao Zhu, Yifan Yang, Zhanxun Liu, Kai Yu, Yuxuan Hu, Jinyu Li, Yan Lu, Shujie Liu, Xie Chen

Recent advancements highlight the potential of end-to-end real-time spoken dialogue systems, showcasing their low latency and high quality. In this paper, we introduce SLAM-Omni, a timbre-controllable, end-to-end voice interaction system with single-stage training. SLAM-Omni achieves zero-shot timbre control by modeling spoken language with semantic tokens and decoupling speaker information to a vocoder. By predicting grouped speech semantic tokens at each step, our method significantly reduces the sequence length of audio tokens, accelerating both training and inference. Additionally, we propose historical text prompting to compress dialogue history, facilitating efficient multi-round interactions. Comprehensive evaluations reveal that SLAM-Omni outperforms prior models of similar scale, requiring only 15 hours of training on 4 GPUs with limited data. Notably, it is the first spoken dialogue system to achieve competitive performance with a single-stage training approach, eliminating the need for pre-training on TTS or ASR tasks. Further experiments validate its multilingual and multi-turn dialogue capabilities on larger datasets.

Joint Speech and Text Training for LLM-Based End-to-End Spoken...

End-to-end spoken dialogue state tracking (DST) is made difficult by the tandem of having to handle speech input and data scarcity. Combining speech foundation encoders and large language models...

1.14K views00:22

Speech Technology

Just a big DiT TTS (2.4B)

https://jordandarefsky.com/blog/2025/echo/

Echo | Jordan Darefsky

Diffusion-based text-to-speech with fast, high-fidelity voice cloning

1.14K viewsedited 02:54

Speech Technology

Nice collection of TTS datasets

https://huggingface.co/datasets/malaysia-ai/Multilingual-TTS

malaysia-ai/Multilingual-TTS · Datasets at Hugging Face

We’re on a journey to advance and democratize artificial intelligence through open source and open science.

1.2K views10:10

Speech Technology

One more reminder that supervised is usually better than unsupervised. This applies to many cases in structure learning

https://arxiv.org/abs/2512.03301

Comparing Unsupervised and Supervised Semantic Speech Tokens: A Case Study of Child ASR

Mohan Shi, Natarajan Balaji Shankar, Kaiyuan Zhang, Zilai Wang, Abeer Alwan

Discrete speech tokens have gained attention for their storage efficiency and integration with Large Language Models (LLMs). They are commonly categorized into acoustic and semantic tokens, with the latter being more advantageous for Automatic Speech Recognition (ASR). Traditionally, unsupervised K-means clustering has been used to extract semantic speech tokens from Speech Foundation Models (SFMs). Recently, supervised methods, such as finite scalar quantization (FSQ) trained with ASR loss, have emerged for speech generation. Both approaches leverage pre-trained SFMs, benefiting low-resource tasks such as child ASR.
This paper systematically compares supervised and unsupervised semantic speech tokens for child ASR. Results show that supervised methods not only outperform unsupervised ones but even unexpectedly surpass continuous representations, and they perform well even in ultra-low bitrate settings. These findings highlight the advantages of supervised semantic tokens and offer insights for improving discrete speech tokenization.

Comparing Unsupervised and Supervised Semantic Speech Tokens: A...

Discrete speech tokens have gained attention for their storage efficiency and integration with Large Language Models (LLMs). They are commonly categorized into acoustic and semantic tokens, with...

1.34K views00:08

Speech Technology

https://arxiv.org/abs/2508.07315

great speedups

1.97K views23:12

Speech Technology

Some interesting ideas, but zero tests for context consistency. Without tests regressions are expected. Talk on this on Dec 11

https://poonehmousavi.github.io/rg.html

https://arxiv.org/abs/2509.00078

ChipChat: Low-Latency Cascaded Conversational Agent in MLX

Tatiana Likhomanenko, Luke Carlson, Richard He Bai, Zijin Gu, Han Tran, Zakaria Aldeneh, Yizhe Zhang, Ruixiang Zhang, Huangjie Zheng, Navdeep Jaitly

The emergence of large language models (LLMs) has transformed spoken dialog systems, yet the optimal architecture for real-time on-device voice agents remains an open question. While end-to-end approaches promise theoretical advantages, cascaded systems (CSs) continue to outperform them in language understanding tasks, despite being constrained by sequential processing latency. In this work, we introduce ChipChat, a novel low-latency CS that overcomes traditional bottlenecks...

1.53K viewsedited 23:17