Speech Technology
Spoken dialogue is an intuitive form of human-computer interaction, yet current speech language models often remain constrained to turn-based exchanges, lacking real-time adaptability such as user barge-in. We propose a novel duplex speech-to-speech (S2S) architecture featuring continuous user inputs and codec agent outputs with channel fusion that directly models simultaneous user and agent streams. Using a pretrained streaming encoder for user input enables the first duplex S2S model that requires no speech pretraining. Separate architectures for agent and user modeling facilitate codec fine-tuning for better agent voices and halve the bitrate (0.6 kbps) compared to previous works. Experimental results show that the proposed model outperforms previous duplex models in reasoning, turn-taking, and barge-in abilities. The model requires significantly less speech data, as speech pretraining is skipped, which markedly simplifies the process of building a duplex S2S model from any LLM. Finally, it is the first openly available duplex S2S model with training and inference code to foster reproducibility.
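A minimal sketch of what the channel fusion between a continuous user stream and discrete agent codec tokens could look like, assuming per-frame alignment between the two streams; module names and dimensions are illustrative assumptions, not the paper's code.

# Hypothetical sketch of channel fusion in a duplex S2S model (not the paper's implementation).
import torch
import torch.nn as nn

class DuplexFusion(nn.Module):
    """Fuse a continuous user stream with discrete agent codec tokens
    frame by frame before the LLM backbone (illustrative only)."""
    def __init__(self, user_dim=512, codec_vocab=1024, d_model=1024):
        super().__init__()
        self.user_proj = nn.Linear(user_dim, d_model)        # streaming encoder features
        self.agent_emb = nn.Embedding(codec_vocab, d_model)   # agent codec tokens
        self.fuse = nn.Linear(2 * d_model, d_model)           # channel fusion

    def forward(self, user_feats, agent_tokens):
        # user_feats: (B, T, user_dim), agent_tokens: (B, T), aligned per frame
        u = self.user_proj(user_feats)
        a = self.agent_emb(agent_tokens)
        return self.fuse(torch.cat([u, a], dim=-1))           # (B, T, d_model) fed to the LLM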



SLAM-Omni: Timbre-Controllable Voice Interaction System with Single-Stage Training

https://arxiv.org/abs/2412.15649

Wenxi Chen, Ziyang Ma, Ruiqi Yan, Yuzhe Liang, Xiquan Li, Ruiyang Xu, Zhikang Niu, Yanqiao Zhu, Yifan Yang, Zhanxun Liu, Kai Yu, Yuxuan Hu, Jinyu Li, Yan Lu, Shujie Liu, Xie Chen

Recent advancements highlight the potential of end-to-end real-time spoken dialogue systems, showcasing their low latency and high quality. In this paper, we introduce SLAM-Omni, a timbre-controllable, end-to-end voice interaction system with single-stage training. SLAM-Omni achieves zero-shot timbre control by modeling spoken language with semantic tokens and decoupling speaker information to a vocoder. By predicting grouped speech semantic tokens at each step, our method significantly reduces the sequence length of audio tokens, accelerating both training and inference. Additionally, we propose historical text prompting to compress dialogue history, facilitating efficient multi-round interactions. Comprehensive evaluations reveal that SLAM-Omni outperforms prior models of similar scale, requiring only 15 hours of training on 4 GPUs with limited data. Notably, it is the first spoken dialogue system to achieve competitive performance with a single-stage training approach, eliminating the need for pre-training on TTS or ASR tasks. Further experiments validate its multilingual and multi-turn dialogue capabilities on larger datasets.
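The grouped-token idea is easy to make concrete: predicting G semantic tokens per decoding step shrinks the autoregressive sequence by a factor of G. The group size below (3) and vocabulary size are assumptions for illustration, not SLAM-Omni's exact settings.

# Illustrative sketch of grouped semantic-token prediction.
import torch

def group_tokens(semantic_tokens, group_size=3):
    """Reshape (B, T) semantic tokens into (B, T // group_size, group_size)
    so the LM predicts one group per decoding step, shortening the sequence."""
    B, T = semantic_tokens.shape
    T = (T // group_size) * group_size             # drop the ragged tail for simplicity
    return semantic_tokens[:, :T].reshape(B, -1, group_size)

tokens = torch.randint(0, 4096, (1, 150))          # 150 semantic tokens
groups = group_tokens(tokens)                      # -> (1, 50, 3): 3x fewer decoding steps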
One more reminder that supervised is usually better than unsupervised. This applies to many cases in structure learning.

https://arxiv.org/abs/2512.03301

Comparing Unsupervised and Supervised Semantic Speech Tokens: A Case Study of Child ASR

Mohan Shi, Natarajan Balaji Shankar, Kaiyuan Zhang, Zilai Wang, Abeer Alwan

Discrete speech tokens have gained attention for their storage efficiency and integration with Large Language Models (LLMs). They are commonly categorized into acoustic and semantic tokens, with the latter being more advantageous for Automatic Speech Recognition (ASR). Traditionally, unsupervised K-means clustering has been used to extract semantic speech tokens from Speech Foundation Models (SFMs). Recently, supervised methods, such as finite scalar quantization (FSQ) trained with ASR loss, have emerged for speech generation. Both approaches leverage pre-trained SFMs, benefiting low-resource tasks such as child ASR.
This paper systematically compares supervised and unsupervised semantic speech tokens for child ASR. Results show that supervised methods not only outperform unsupervised ones but even unexpectedly surpass continuous representations, and they perform well even in ultra-low bitrate settings. These findings highlight the advantages of supervised semantic tokens and offer insights for improving discrete speech tokenization.
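A toy contrast of the two tokenizer families the paper compares. Feature dimensions, cluster count, and FSQ levels here are arbitrary, and the FSQ bottleneck in the paper is trained end-to-end with an ASR loss rather than applied post hoc as in this sketch.

# Toy contrast: unsupervised K-means tokens vs an FSQ-style quantizer.
import torch
from sklearn.cluster import KMeans

feats = torch.randn(500, 768).numpy()              # stand-in for SFM layer features

# Unsupervised: K-means cluster IDs serve as semantic tokens.
kmeans = KMeans(n_clusters=100, n_init=10).fit(feats)
km_tokens = kmeans.predict(feats)                  # (500,) discrete IDs

# FSQ-style: project to a few dims and round each to a fixed number of levels.
def fsq(z, levels=(8, 8, 8, 5)):
    z = torch.tanh(z)                              # bound to (-1, 1)
    codes = [torch.round((z[:, i] + 1) / 2 * (l - 1)) for i, l in enumerate(levels)]
    return torch.stack(codes, dim=-1)              # per-dimension integer codes

fsq_codes = fsq(torch.randn(500, 4))               # (500, 4) integer codes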
Some interesting ideas, but zero tests for context consistency. Without tests, regressions are expected. Talk on this on Dec 11

https://poonehmousavi.github.io/rg.html

https://arxiv.org/abs/2509.00078

ChipChat: Low-Latency Cascaded Conversational Agent in MLX

Tatiana Likhomanenko, Luke Carlson, Richard He Bai, Zijin Gu, Han Tran, Zakaria Aldeneh, Yizhe Zhang, Ruixiang Zhang, Huangjie Zheng, Navdeep Jaitly

The emergence of large language models (LLMs) has transformed spoken dialog systems, yet the optimal architecture for real-time on-device voice agents remains an open question. While end-to-end approaches promise theoretical advantages, cascaded systems (CSs) continue to outperform them in language understanding tasks, despite being constrained by sequential processing latency. In this work, we introduce ChipChat, a novel low-latency CS that overcomes traditional bottlenecks...
Diffusion ASR is something that has come up frequently lately. Another one is Drax:
https://t.me/speechtech/2215

https://arxiv.org/abs/2509.16622

Audio-Conditioned Diffusion LLMs for ASR and Deliberation Processing

Mengqi Wang, Zhan Liu, Zengrui Jin, Guangzhi Sun, Chao Zhang, Philip C. Woodland

Diffusion-based large language models (DLLMs) have recently attracted growing interest as an alternative to autoregressive decoders. In this work, we present an empirical study on using the diffusion-based large language model LLaDA for automatic speech recognition (ASR). We first investigate its use as an external deliberation-based processing module for Whisper-LLaMA transcripts. By leveraging the bidirectional attention and denoising capabilities of LLaDA, we explore random masking, low-confidence masking, and semi-autoregressive strategies, showing that Whisper-LLaDA substantially reduces WER compared with the baseline. On LibriSpeech, the best cascade system achieves 2.25%/4.94% WER on test-clean/test-other, representing a 12.3% relative improvement over the Whisper-LLaMA baseline on the test-other split. In contrast, a plain-text LLaDA without acoustic features fails to improve accuracy, highlighting the importance of audio-conditioned embeddings. We further evaluate Whisper-LLaDA as a standalone decoder for ASR with diffusion-based and semi-autoregressive decoding. Most experimental configurations achieve faster inference than the Whisper-LLaMA baseline, although recognition accuracy is slightly lower. These findings offer an empirical view of diffusion-based LLMs for ASR and point to promising directions for improvements.
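A schematic of the low-confidence masking strategy mentioned above: iteratively fill masked positions and re-mask the least confident predictions for the next pass. The model interface and the re-masking schedule are assumptions, not the paper's exact recipe.

# Sketch of low-confidence masking decoding for a masked/diffusion LM (illustrative only).
import torch

def low_confidence_decode(model, seq, mask_id, steps=8):
    """Iteratively fill masked positions of a 1-D token sequence,
    re-masking the lowest-confidence fills at each step."""
    for step in range(steps):
        logits = model(seq)                               # (T, V), assumed interface
        probs, preds = logits.softmax(-1).max(-1)          # confidence and argmax per position
        masked = seq == mask_id
        seq = torch.where(masked, preds, seq)              # fill the current masks
        remaining = int(masked.sum()) * (steps - 1 - step) // steps
        if remaining > 0:
            conf = torch.where(masked, probs, torch.full_like(probs, float("inf")))
            idx = conf.topk(remaining, largest=False).indices
            seq[idx] = mask_id                             # re-mask least confident fills
    return seq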
Introducing Chatterbox Turbo, the fastest open source Voice AI model with emotions.

Our gift to the dev community this holiday season!

• ~6x faster RTF
• Expressive sound tags: sighs, laughs, coughs
• PerTh watermarking on every output

Here's everything you need to know 👇

https://huggingface.co/ResembleAI/chatterbox-turbo
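A usage sketch assuming the Turbo checkpoint keeps the Python API of the original Chatterbox release; the class name, generate() arguments, and the sound-tag syntax in the text are assumptions to verify against the model card.

# Hypothetical usage, assuming the original Chatterbox API carries over to Turbo.
import torchaudio
from chatterbox.tts import ChatterboxTTS

model = ChatterboxTTS.from_pretrained(device="cuda")
# Tag syntax for sighs/laughs/coughs is a guess here; see the model card.
wav = model.generate("That joke was terrible... (laughs) okay, let's move on.")
torchaudio.save("turbo_demo.wav", wav, model.sr)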
Google released an open-source medical dictation model

https://huggingface.co/google/medasr

The counter-intuitive thing is that this relatively small model beats Gemini 2.5 Pro by a large margin. Probably the test is just biased; it is hard to imagine that an advanced Gemini model can't sort out the important things.
This model seems interesting. It is trained on 1.3M hours of data.

> Through this meticulous two-stage process, we obtain approximately 1,000 hours of high-quality speech with detailed paralinguistic annotations, providing a robust foundation for expressive and context-aware speech synthesis.

https://huggingface.co/Soul-AILab/SoulX-Podcast-1.7B

https://arxiv.org/abs/2510.23541

SoulX-Podcast: Towards Realistic Long-form Podcasts with Dialectal and Paralinguistic Diversity

Hanke Xie, Haopeng Lin, Wenxiao Cao, Dake Guo, Wenjie Tian, Jun Wu, Hanlin Wen, Ruixuan Shang, Hongmei Liu, Zhiqi Jiang, Yuepeng Jiang, Wenxi Chen, Ruiqi Yan, Jiale Qian, Yichao Yan, Shunshun Yin, Ming Tao, Xie Chen, Lei Xie, Xinsheng Wang

Recent advances in text-to-speech (TTS) synthesis have significantly improved speech expressiveness and naturalness. However, most existing systems are tailored for single-speaker synthesis and fall short in generating coherent multi-speaker conversational speech. This technical report presents SoulX-Podcast, a system designed for podcast-style multi-turn, multi-speaker dialogic speech generation, while also achieving state-of-the-art performance in conventional TTS tasks.
To meet the higher naturalness demands of multi-turn spoken dialogue, SoulX-Podcast integrates a range of paralinguistic controls and supports both Mandarin and English, as well as several Chinese dialects, including Sichuanese, Henanese, and Cantonese, enabling more personalized podcast-style speech generation. Experimental results demonstrate that SoulX-Podcast can continuously produce over 90 minutes of conversation with stable speaker timbre and smooth speaker transitions. Moreover, speakers exhibit contextually adaptive prosody, reflecting natural rhythm and intonation changes as dialogues progress. Across multiple evaluation metrics, SoulX-Podcast achieves state-of-the-art performance in both monologue TTS and multi-turn conversational speech synthesis.
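One hypothetical way to structure a multi-turn podcast script with speaker, dialect, and paralinguistic annotations before it is rendered into a prompt; this is not SoulX-Podcast's actual prompt format, just a sketch of the kind of control surface the abstract describes.

# Hypothetical dialogue script structure for multi-speaker, tag-controlled synthesis.
script = [
    {"speaker": "S1", "dialect": "Mandarin",   "tags": ["laugh"], "text": "Welcome back to the show."},
    {"speaker": "S2", "dialect": "Sichuanese", "tags": [],        "text": "Glad to be here, let's dive in."},
    {"speaker": "S1", "dialect": "Mandarin",   "tags": ["sigh"],  "text": "It has been a long week for TTS news."},
]

def render(script):
    """Flatten structured turns into a single tagged prompt string (illustrative only)."""
    lines = []
    for turn in script:
        tags = "".join(f"<{t}>" for t in turn["tags"])
        lines.append(f"[{turn['speaker']}|{turn['dialect']}] {tags}{turn['text']}")
    return "\n".join(lines)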
Long-term modeling in speech is a closely related problem.

https://www.linkedin.com/posts/longshen-ou_phrasevae-and-phraseldm-latent-diffusion-activity-7408804594631270400-a_Ur/

We solved the long-sequence problem in symbolic music generation 🎶 Here is how we reduce sequence length from 10k+ to just 512, and directly model an entire song, not musical excerpts.

🎧 Demo: https://lnkd.in/g6Y7V9_8

Full-song symbolic music generation has long been constrained by extremely long token sequences, limited context length, and weak support for global structure. Most existing models still operate at the note-attribute level and generate music autoregressively, note by note, and segment by segment.

In our new technical report, we introduce PhraseVAE and PhraseLDM—a phrase-level latent diffusion framework for full-song multitrack symbolic music generation.

Key ideas:
🔹 Shift the modeling unit from note-attribute tokens to musically meaningful phrases, reducing full-song context from 10k+ tokens to 512 latents.
🔹 PhraseVAE compresses variable-length polyphonic note sequences (with instrument identity) into compact 64-D phrase-level latents, achieving near-perfect reconstruction (99.0% F1_op).
🔹 PhraseVAE introduces multi-query compression and a progressive bottleneck training strategy for high-fidelity yet compact representations.
🔹 Built on this latent space, PhraseLDM generates an entire multitrack song in a single pass—without any autoregressive components.
🔹 The framework supports up to 128 bars (~8 minutes at 64 BPM) and produces complete songs with coherent local texture, idiomatic instrument usage, and clear global structure.

Both PhraseVAE and PhraseLDM inherit the REMI-z symbolic grammar introduced in our previous NeurIPS work, which makes phrase-level compression and full-song modeling intuitive and musically grounded.

With only 45M parameters, the system can generate a full multitrack song within seconds, offering a practical and scalable alternative to note-attribute autoregressive models.
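A schematic of the two-stage pipeline described above. Module internals, shapes, and the sampler update are stand-ins; only the 64-D phrase latents and the up-to-512-latent song budget come from the post.

# Stand-in for the PhraseVAE decoder + PhraseLDM sampler (not the released code).
import torch
import torch.nn as nn

class PhraseLDMSketch(nn.Module):
    def __init__(self, latent_dim=64, d_model=256):
        super().__init__()
        # PhraseVAE decoder stand-in: phrase latent -> note-level features
        self.phrase_decoder = nn.Linear(latent_dim, d_model)
        # Latent denoiser stand-in operating over the whole song (<= 512 latents)
        layer = nn.TransformerEncoderLayer(d_model=latent_dim, nhead=4, batch_first=True)
        self.denoiser = nn.TransformerEncoder(layer, num_layers=2)

    @torch.no_grad()
    def generate(self, n_phrases=512, steps=50):
        z = torch.randn(1, n_phrases, 64)          # one 64-D latent per phrase, full song at once
        for _ in range(steps):                      # toy update, not a real diffusion sampler
            z = z - 0.1 * self.denoiser(z)
        return self.phrase_decoder(z)               # decode each latent back toward a phrase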

📄 Paper (arXiv): https://arxiv.org/abs/2512.11348
🎧 Demo & samples: https://www.oulongshen.xyz/midi_ldm

“Beginners learn pitch and rhythm.
Intermediates learn how notes are arranged.
Masters express meaning through phrases — and models should do the same.”
Cascaded systems remain the most reliable

Hearing to Translate: The Effectiveness of Speech Modality Integration into LLMs

https://arxiv.org/abs/2512.16378

Sara Papi, Javier Garcia Gilabert, Zachary Hopton, Vilém Zouhar, Carlos Escolano, Gerard I. Gállego, Jorge Iranzo-Sánchez, Ahrii Kim, Dominik Macháček, Patricia Schmidtova, Maike Züfle

As Large Language Models (LLMs) expand beyond text, integrating speech as a native modality has given rise to SpeechLLMs, which aim to translate spoken language directly, thereby bypassing traditional transcription-based pipelines. Whether this integration improves speech-to-text translation quality over established cascaded architectures, however, remains an open question. We present Hearing to Translate, the first comprehensive test suite rigorously benchmarking 5 state-of-the-art SpeechLLMs against 16 strong direct and cascade systems that couple leading speech foundation models (SFMs) with multilingual LLMs. Our analysis spans 16 benchmarks, 13 language pairs, and 9 challenging conditions, including disfluent, noisy, and long-form speech. Across this extensive evaluation, we find that cascaded systems remain the most reliable overall, while current SpeechLLMs only match cascades in selected settings and SFMs lag behind both, highlighting that integrating an LLM, either within the model or in a pipeline, is essential for high-quality speech translation.
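A minimal cascade of the kind benchmarked here, just to make the pipeline concrete; the model choices are examples for illustration, not the systems evaluated in the paper.

# Toy cascaded speech translation: SFM transcription followed by text translation.
from transformers import pipeline

asr = pipeline("automatic-speech-recognition", model="openai/whisper-large-v3")
mt = pipeline("translation", model="facebook/nllb-200-distilled-600M",
              src_lang="eng_Latn", tgt_lang="deu_Latn")

def cascade_translate(audio_path):
    transcript = asr(audio_path)["text"]            # speech foundation model transcribes
    return mt(transcript)[0]["translation_text"]    # text model translates the transcript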
NVIDIA just released Nemotron Speech ASR:
🤖0.6B streaming cache-aware transducer
📉low latency (down to 80ms)
📈high throughput (up to 900 concurrent streams on H100)
🎮adjustable latency-throughput-accuracy trade-off without re-training
🌎English ASR
🔗https://huggingface.co/nvidia/nemotron-speech-streaming-en-0.6b
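A hedged loading sketch for the model above, assuming the checkpoint loads through NeMo's ASRModel.from_pretrained like other NVIDIA ASR releases; the streaming latency/attention-context knobs mentioned in the post are configured per the model card and are not shown here.

# Assumed loading path via NeMo; verify the identifier and streaming setup on the model card.
import nemo.collections.asr as nemo_asr

model = nemo_asr.models.ASRModel.from_pretrained("nvidia/nemotron-speech-streaming-en-0.6b")
hyps = model.transcribe(["sample.wav"])             # offline transcription of a local file
print(hyps[0])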
LFM2.5-Audio released

https://huggingface.co/LiquidAI/LFM2.5-Audio-1.5B

LFM2.5-Audio-1.5B is Liquid AI's updated end-to-end audio foundation model. Key improvements include a custom, LFM-based audio detokenizer, llama.cpp-compatible GGUFs for CPU inference, and better ASR and TTS performance.

LFM2.5-Audio is an end-to-end multimodal speech and text language model, and as such does not require separate ASR and TTS components. Designed with low latency and real-time conversation in mind, at only 1.5 billion parameters LFM2.5-Audio enables seamless conversational interaction, achieving capabilities on par with much larger models. Our model consists of a pretrained LFM2.5 model as its multimodal backbone, along with a FastConformer-based audio encoder to handle continuous audio inputs, and an RQ-transformer generating discrete tokens coupled with a lightweight audio detokenizer for audio output.
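A schematic of the component layout described in the card; every module below is a stub standing in for the real one, so this only shows the dataflow, not the released implementation.

# Stub dataflow: FastConformer encoder -> LFM backbone -> RQ token head -> audio detokenizer.
import torch.nn as nn

class AudioLMSketch(nn.Module):
    def __init__(self, d_model=1024, codec_vocab=2048):
        super().__init__()
        self.audio_encoder = nn.Identity()                 # stands in for the FastConformer encoder
        self.backbone = nn.Identity()                      # stands in for the LFM2.5 backbone
        self.rq_head = nn.Linear(d_model, codec_vocab)     # stands in for the RQ-transformer head
        self.detokenizer = nn.Identity()                   # stands in for the audio detokenizer

    def forward(self, audio_feats):
        # audio_feats: (B, T, d_model) continuous features from the input audio
        h = self.backbone(self.audio_encoder(audio_feats))
        tokens = self.rq_head(h).argmax(-1)                # discrete audio tokens (B, T)
        return self.detokenizer(tokens)                    # the real detokenizer emits a waveform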