Speech Technology
Spoken dialogue is an intuitive form of human-computer interaction, yet current speech language models often remain constrained to turn-based exchanges, lacking real-time adaptability such as user barge-in. We propose a novel duplex speech-to-speech (S2S) architecture featuring continuous user inputs and codec agent outputs with channel fusion that directly models simultaneous user and agent streams. Using a pretrained streaming encoder for user input enables the first duplex S2S model that requires no speech pretraining. Separate architectures for agent and user modeling facilitate codec fine-tuning for better agent voices and halve the bitrate (0.6 kbps) compared to previous works. Experimental results show that the proposed model outperforms previous duplex models in reasoning, turn-taking, and barge-in abilities. The model requires significantly less speech data, as speech pretraining is skipped, which markedly simplifies the process of building a duplex S2S model from any LLM. Finally, it is the first openly available duplex S2S model with training and inference code to foster reproducibility.
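A minimal sketch of what the channel fusion between a continuous user stream and discrete agent codec tokens could look like, assuming per-frame alignment between the two streams; module names and dimensions are illustrative assumptions, not the paper's code.

# Hypothetical sketch of channel fusion in a duplex S2S model (not the paper's implementation).
import torch
import torch.nn as nn

class DuplexFusion(nn.Module):
    """Fuse a continuous user stream with discrete agent codec tokens
    frame by frame before the LLM backbone (illustrative only)."""
    def __init__(self, user_dim=512, codec_vocab=1024, d_model=1024):
        super().__init__()
        self.user_proj = nn.Linear(user_dim, d_model)        # streaming encoder features
        self.agent_emb = nn.Embedding(codec_vocab, d_model)   # agent codec tokens
        self.fuse = nn.Linear(2 * d_model, d_model)           # channel fusion

    def forward(self, user_feats, agent_tokens):
        # user_feats: (B, T, user_dim), agent_tokens: (B, T), aligned per frame
        u = self.user_proj(user_feats)
        a = self.agent_emb(agent_tokens)
        return self.fuse(torch.cat([u, a], dim=-1))           # (B, T, d_model) fed to the LLM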



SLAM-Omni: Timbre-Controllable Voice Interaction System with Single-Stage Training

https://arxiv.org/abs/2412.15649

Wenxi Chen, Ziyang Ma, Ruiqi Yan, Yuzhe Liang, Xiquan Li, Ruiyang Xu, Zhikang Niu, Yanqiao Zhu, Yifan Yang, Zhanxun Liu, Kai Yu, Yuxuan Hu, Jinyu Li, Yan Lu, Shujie Liu, Xie Chen

Recent advancements highlight the potential of end-to-end real-time spoken dialogue systems, showcasing their low latency and high quality. In this paper, we introduce SLAM-Omni, a timbre-controllable, end-to-end voice interaction system with single-stage training. SLAM-Omni achieves zero-shot timbre control by modeling spoken language with semantic tokens and decoupling speaker information to a vocoder. By predicting grouped speech semantic tokens at each step, our method significantly reduces the sequence length of audio tokens, accelerating both training and inference. Additionally, we propose historical text prompting to compress dialogue history, facilitating efficient multi-round interactions. Comprehensive evaluations reveal that SLAM-Omni outperforms prior models of similar scale, requiring only 15 hours of training on 4 GPUs with limited data. Notably, it is the first spoken dialogue system to achieve competitive performance with a single-stage training approach, eliminating the need for pre-training on TTS or ASR tasks. Further experiments validate its multilingual and multi-turn dialogue capabilities on larger datasets.
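The grouped-token idea is easy to make concrete: predicting G semantic tokens per decoding step shrinks the autoregressive sequence by a factor of G. The group size below (3) and vocabulary size are assumptions for illustration, not SLAM-Omni's exact settings.

# Illustrative sketch of grouped semantic-token prediction.
import torch

def group_tokens(semantic_tokens, group_size=3):
    """Reshape (B, T) semantic tokens into (B, T // group_size, group_size)
    so the LM predicts one group per decoding step, shortening the sequence."""
    B, T = semantic_tokens.shape
    T = (T // group_size) * group_size             # drop the ragged tail for simplicity
    return semantic_tokens[:, :T].reshape(B, -1, group_size)

tokens = torch.randint(0, 4096, (1, 150))          # 150 semantic tokens
groups = group_tokens(tokens)                      # -> (1, 50, 3): 3x fewer decoding steps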
One more reminder that supervised is usually better than unsupervised. This applies to many cases in structure learning.

https://arxiv.org/abs/2512.03301

Comparing Unsupervised and Supervised Semantic Speech Tokens: A Case Study of Child ASR

Mohan Shi, Natarajan Balaji Shankar, Kaiyuan Zhang, Zilai Wang, Abeer Alwan

Discrete speech tokens have gained attention for their storage efficiency and integration with Large Language Models (LLMs). They are commonly categorized into acoustic and semantic tokens, with the latter being more advantageous for Automatic Speech Recognition (ASR). Traditionally, unsupervised K-means clustering has been used to extract semantic speech tokens from Speech Foundation Models (SFMs). Recently, supervised methods, such as finite scalar quantization (FSQ) trained with ASR loss, have emerged for speech generation. Both approaches leverage pre-trained SFMs, benefiting low-resource tasks such as child ASR.
This paper systematically compares supervised and unsupervised semantic speech tokens for child ASR. Results show that supervised methods not only outperform unsupervised ones but even unexpectedly surpass continuous representations, and they perform well even in ultra-low bitrate settings. These findings highlight the advantages of supervised semantic tokens and offer insights for improving discrete speech tokenization.
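A toy contrast of the two tokenizer families the paper compares. Feature dimensions, cluster count, and FSQ levels here are arbitrary, and the FSQ bottleneck in the paper is trained end-to-end with an ASR loss rather than applied post hoc as in this sketch.

# Toy contrast: unsupervised K-means tokens vs an FSQ-style quantizer.
import torch
from sklearn.cluster import KMeans

feats = torch.randn(500, 768).numpy()              # stand-in for SFM layer features

# Unsupervised: K-means cluster IDs serve as semantic tokens.
kmeans = KMeans(n_clusters=100, n_init=10).fit(feats)
km_tokens = kmeans.predict(feats)                  # (500,) discrete IDs

# FSQ-style: project to a few dims and round each to a fixed number of levels.
def fsq(z, levels=(8, 8, 8, 5)):
    z = torch.tanh(z)                              # bound to (-1, 1)
    codes = [torch.round((z[:, i] + 1) / 2 * (l - 1)) for i, l in enumerate(levels)]
    return torch.stack(codes, dim=-1)              # per-dimension integer codes

fsq_codes = fsq(torch.randn(500, 4))               # (500, 4) integer codes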
Some interesting ideas, but zero tests for context consistency. Without tests, regressions are expected. Talk on this on Dec 11

https://poonehmousavi.github.io/rg.html

https://arxiv.org/abs/2509.00078

ChipChat: Low-Latency Cascaded Conversational Agent in MLX

Tatiana Likhomanenko, Luke Carlson, Richard He Bai, Zijin Gu, Han Tran, Zakaria Aldeneh, Yizhe Zhang, Ruixiang Zhang, Huangjie Zheng, Navdeep Jaitly

The emergence of large language models (LLMs) has transformed spoken dialog systems, yet the optimal architecture for real-time on-device voice agents remains an open question. While end-to-end approaches promise theoretical advantages, cascaded systems (CSs) continue to outperform them in language understanding tasks, despite being constrained by sequential processing latency. In this work, we introduce ChipChat, a novel low-latency CS that overcomes traditional bottlenecks...
Diffusion ASR is something that has come up frequently lately. Another one is Drax:
https://t.me/speechtech/2215

https://arxiv.org/abs/2509.16622

Audio-Conditioned Diffusion LLMs for ASR and Deliberation Processing

Mengqi Wang, Zhan Liu, Zengrui Jin, Guangzhi Sun, Chao Zhang, Philip C. Woodland

Diffusion-based large language models (DLLMs) have recently attracted growing interest as an alternative to autoregressive decoders. In this work, we present an empirical study on using the diffusion-based large language model LLaDA for automatic speech recognition (ASR). We first investigate its use as an external deliberation-based processing module for Whisper-LLaMA transcripts. By leveraging the bidirectional attention and denoising capabilities of LLaDA, we explore random masking, low-confidence masking, and semi-autoregressive strategies, showing that Whisper-LLaDA substantially reduces WER compared with the baseline. On LibriSpeech, the best cascade system achieves 2.25%/4.94% WER on test-clean/test-other, representing a 12.3% relative improvement over the Whisper-LLaMA baseline on the test-other split. In contrast, a plain-text LLaDA without acoustic features fails to improve accuracy, highlighting the importance of audio-conditioned embeddings. We further evaluate Whisper-LLaDA as a standalone decoder for ASR with diffusion-based and semi-autoregressive decoding. Most experimental configurations achieve faster inference than the Whisper-LLaMA baseline, although recognition accuracy is slightly lower. These findings offer an empirical view of diffusion-based LLMs for ASR and point to promising directions for improvements.
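A schematic of the low-confidence masking strategy mentioned above: iteratively fill masked positions and re-mask the least confident predictions for the next pass. The model interface and the re-masking schedule are assumptions, not the paper's exact recipe.

# Sketch of low-confidence masking decoding for a masked/diffusion LM (illustrative only).
import torch

def low_confidence_decode(model, seq, mask_id, steps=8):
    """Iteratively fill masked positions of a 1-D token sequence,
    re-masking the lowest-confidence fills at each step."""
    for step in range(steps):
        logits = model(seq)                               # (T, V), assumed interface
        probs, preds = logits.softmax(-1).max(-1)          # confidence and argmax per position
        masked = seq == mask_id
        seq = torch.where(masked, preds, seq)              # fill the current masks
        remaining = int(masked.sum()) * (steps - 1 - step) // steps
        if remaining > 0:
            conf = torch.where(masked, probs, torch.full_like(probs, float("inf")))
            idx = conf.topk(remaining, largest=False).indices
            seq[idx] = mask_id                             # re-mask least confident fills
    return seq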
Introducing Chatterbox Turbo, the fastest open source Voice AI model with emotions.

Our gift to the dev community this holiday season!

• ~6x faster RTF
• Expressive sound tags: sighs, laughs, coughs
• PerTh watermarking on every output

Here's everything you need to know 👇

https://huggingface.co/ResembleAI/chatterbox-turbo
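A usage sketch assuming the Turbo checkpoint keeps the Python API of the original Chatterbox release; the class name, generate() arguments, and the sound-tag syntax in the text are assumptions to verify against the model card.

# Hypothetical usage, assuming the original Chatterbox API carries over to Turbo.
import torchaudio
from chatterbox.tts import ChatterboxTTS

model = ChatterboxTTS.from_pretrained(device="cuda")
# Tag syntax for sighs/laughs/coughs is a guess here; see the model card.
wav = model.generate("That joke was terrible... (laughs) okay, let's move on.")
torchaudio.save("turbo_demo.wav", wav, model.sr)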
Google released an open-source medical dictation model

https://huggingface.co/google/medasr

The counter-intuitive thing is that this relatively small model beats Gemini 2.5 Pro by a large margin. Probably the test is just biased; it is hard to imagine that an advanced Gemini model can't sort out the important things.
This model seems interesting. It is trained on 1.3M hours of data.

> Through this meticulous two-stage process, we obtain approximately 1,000 hours of high-quality speech with detailed paralinguistic annotations, providing a robust foundation for expressive and context-aware speech synthesis.

https://huggingface.co/Soul-AILab/SoulX-Podcast-1.7B

https://arxiv.org/abs/2510.23541

SoulX-Podcast: Towards Realistic Long-form Podcasts with Dialectal and Paralinguistic Diversity

Hanke Xie, Haopeng Lin, Wenxiao Cao, Dake Guo, Wenjie Tian, Jun Wu, Hanlin Wen, Ruixuan Shang, Hongmei Liu, Zhiqi Jiang, Yuepeng Jiang, Wenxi Chen, Ruiqi Yan, Jiale Qian, Yichao Yan, Shunshun Yin, Ming Tao, Xie Chen, Lei Xie, Xinsheng Wang

Recent advances in text-to-speech (TTS) synthesis have significantly improved speech expressiveness and naturalness. However, most existing systems are tailored for single-speaker synthesis and fall short in generating coherent multi-speaker conversational speech. This technical report presents SoulX-Podcast, a system designed for podcast-style multi-turn, multi-speaker dialogic speech generation, while also achieving state-of-the-art performance in conventional TTS tasks.
To meet the higher naturalness demands of multi-turn spoken dialogue, SoulX-Podcast integrates a range of paralinguistic controls and supports both Mandarin and English, as well as several Chinese dialects, including Sichuanese, Henanese, and Cantonese, enabling more personalized podcast-style speech generation. Experimental results demonstrate that SoulX-Podcast can continuously produce over 90 minutes of conversation with stable speaker timbre and smooth speaker transitions. Moreover, speakers exhibit contextually adaptive prosody, reflecting natural rhythm and intonation changes as dialogues progress. Across multiple evaluation metrics, SoulX-Podcast achieves state-of-the-art performance in both monologue TTS and multi-turn conversational speech synthesis.
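One hypothetical way to structure a multi-turn podcast script with speaker, dialect, and paralinguistic annotations before it is rendered into a prompt; this is not SoulX-Podcast's actual prompt format, just a sketch of the kind of control surface the abstract describes.

# Hypothetical dialogue script structure for multi-speaker, tag-controlled synthesis.
script = [
    {"speaker": "S1", "dialect": "Mandarin",   "tags": ["laugh"], "text": "Welcome back to the show."},
    {"speaker": "S2", "dialect": "Sichuanese", "tags": [],        "text": "Glad to be here, let's dive in."},
    {"speaker": "S1", "dialect": "Mandarin",   "tags": ["sigh"],  "text": "It has been a long week for TTS news."},
]

def render(script):
    """Flatten structured turns into a single tagged prompt string (illustrative only)."""
    lines = []
    for turn in script:
        tags = "".join(f"<{t}>" for t in turn["tags"])
        lines.append(f"[{turn['speaker']}|{turn['dialect']}] {tags}{turn['text']}")
    return "\n".join(lines)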
Long-term modeling in speech is a closely related problem.

https://www.linkedin.com/posts/longshen-ou_phrasevae-and-phraseldm-latent-diffusion-activity-7408804594631270400-a_Ur/

We solved the long-sequence problem in symbolic music generation 🎶 Here is how we reduce sequence length from 10k+ to just 512, and directly model an entire song, not musical excerpts.

🎧 Demo: https://lnkd.in/g6Y7V9_8

Full-song symbolic music generation has long been constrained by extremely long token sequences, limited context length, and weak support for global structure. Most existing models still operate at the note-attribute level and generate music autoregressively, note by note, and segment by segment.

In our new technical report, we introduce PhraseVAE and PhraseLDM—a phrase-level latent diffusion framework for full-song multitrack symbolic music generation.

Key ideas:
🔹 Shift the modeling unit from note-attribute tokens to musically meaningful phrases, reducing full-song context from 10k+ tokens to 512 latents.
🔹 PhraseVAE compresses variable-length polyphonic note sequences (with instrument identity) into compact 64-D phrase-level latents, achieving near-perfect reconstruction (99.0% F1_op).
🔹 PhraseVAE introduces multi-query compression and a progressive bottleneck training strategy for high-fidelity yet compact representations.
🔹 Built on this latent space, PhraseLDM generates an entire multitrack song in a single pass—without any autoregressive components.
🔹 The framework supports up to 128 bars (~8 minutes at 64 BPM) and produces complete songs with coherent local texture, idiomatic instrument usage, and clear global structure.

Both PhraseVAE and PhraseLDM inherit the REMI-z symbolic grammar introduced in our previous NeurIPS work, which makes phrase-level compression and full-song modeling intuitive and musically grounded.

With only 45M parameters, the system can generate a full multitrack song within seconds, offering a practical and scalable alternative to note-attribute autoregressive models.
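A schematic of the two-stage pipeline described above. Module internals, shapes, and the sampler update are stand-ins; only the 64-D phrase latents and the up-to-512-latent song budget come from the post.

# Stand-in for the PhraseVAE decoder + PhraseLDM sampler (not the released code).
import torch
import torch.nn as nn

class PhraseLDMSketch(nn.Module):
    def __init__(self, latent_dim=64, d_model=256):
        super().__init__()
        # PhraseVAE decoder stand-in: phrase latent -> note-level features
        self.phrase_decoder = nn.Linear(latent_dim, d_model)
        # Latent denoiser stand-in operating over the whole song (<= 512 latents)
        layer = nn.TransformerEncoderLayer(d_model=latent_dim, nhead=4, batch_first=True)
        self.denoiser = nn.TransformerEncoder(layer, num_layers=2)

    @torch.no_grad()
    def generate(self, n_phrases=512, steps=50):
        z = torch.randn(1, n_phrases, 64)          # one 64-D latent per phrase, full song at once
        for _ in range(steps):                      # toy update, not a real diffusion sampler
            z = z - 0.1 * self.denoiser(z)
        return self.phrase_decoder(z)               # decode each latent back toward a phrase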

📄 Paper (arXiv): https://arxiv.org/abs/2512.11348
🎧 Demo & samples: https://www.oulongshen.xyz/midi_ldm

“Beginners learn pitch and rhythm.
Intermediates learn how notes are arranged.
Masters express meaning through phrases — and models should do the same.”
Cascaded systems remain the most reliable

Hearing to Translate: The Effectiveness of Speech Modality Integration into LLMs

https://arxiv.org/abs/2512.16378

Sara Papi, Javier Garcia Gilabert, Zachary Hopton, Vilém Zouhar, Carlos Escolano, Gerard I. Gállego, Jorge Iranzo-Sánchez, Ahrii Kim, Dominik Macháček, Patricia Schmidtova, Maike Züfle

As Large Language Models (LLMs) expand beyond text, integrating speech as a native modality has given rise to SpeechLLMs, which aim to translate spoken language directly, thereby bypassing traditional transcription-based pipelines. Whether this integration improves speech-to-text translation quality over established cascaded architectures, however, remains an open question. We present Hearing to Translate, the first comprehensive test suite rigorously benchmarking 5 state-of-the-art SpeechLLMs against 16 strong direct and cascade systems that couple leading speech foundation models (SFMs) with multilingual LLMs. Our analysis spans 16 benchmarks, 13 language pairs, and 9 challenging conditions, including disfluent, noisy, and long-form speech. Across this extensive evaluation, we find that cascaded systems remain the most reliable overall, while current SpeechLLMs only match cascades in selected settings and SFMs lag behind both, highlighting that integrating an LLM, either within the model or in a pipeline, is essential for high-quality speech translation.
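A minimal cascade of the kind benchmarked here, just to make the pipeline concrete; the model choices are examples for illustration, not the systems evaluated in the paper.

# Toy cascaded speech translation: SFM transcription followed by text translation.
from transformers import pipeline

asr = pipeline("automatic-speech-recognition", model="openai/whisper-large-v3")
mt = pipeline("translation", model="facebook/nllb-200-distilled-600M",
              src_lang="eng_Latn", tgt_lang="deu_Latn")

def cascade_translate(audio_path):
    transcript = asr(audio_path)["text"]            # speech foundation model transcribes
    return mt(transcript)[0]["translation_text"]    # text model translates the transcript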
NVIDIA just released Nemotron Speech ASR:
🤖0.6B streaming cache-aware transducer
📉low latency (down to 80ms)
📈high throughput (up to 900 concurrent streams on H100)
🎮adjustable latency-throughput-accuracy trade-off without re-training
🌎English ASR
🔗https://huggingface.co/nvidia/nemotron-speech-streaming-en-0.6b
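A hedged loading sketch for the model above, assuming the checkpoint loads through NeMo's ASRModel.from_pretrained like other NVIDIA ASR releases; the streaming latency/attention-context knobs mentioned in the post are configured per the model card and are not shown here.

# Assumed loading path via NeMo; verify the identifier and streaming setup on the model card.
import nemo.collections.asr as nemo_asr

model = nemo_asr.models.ASRModel.from_pretrained("nvidia/nemotron-speech-streaming-en-0.6b")
hyps = model.transcribe(["sample.wav"])             # offline transcription of a local file
print(hyps[0])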
LFM2.5-Audio released

https://huggingface.co/LiquidAI/LFM2.5-Audio-1.5B

LFM2.5-Audio-1.5B is Liquid AI's updated end-to-end audio foundation model. Key improvements include a custom, LFM-based audio detokenizer, llama.cpp-compatible GGUFs for CPU inference, and better ASR and TTS performance.

LFM2.5-Audio is an end-to-end multimodal speech and text language model, and as such does not require separate ASR and TTS components. Designed with low latency and real-time conversation in mind, at only 1.5 billion parameters LFM2.5-Audio enables seamless conversational interaction, achieving capabilities on par with much larger models. Our model consists of a pretrained LFM2.5 model as its multimodal backbone, along with a FastConformer-based audio encoder to handle continuous audio inputs, and an RQ-transformer generating discrete tokens coupled with a lightweight audio detokenizer for audio output.
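A schematic of the component layout described in the card; every module below is a stub standing in for the real one, so this only shows the dataflow, not the released implementation.

# Stub dataflow: FastConformer encoder -> LFM backbone -> RQ token head -> audio detokenizer.
import torch.nn as nn

class AudioLMSketch(nn.Module):
    def __init__(self, d_model=1024, codec_vocab=2048):
        super().__init__()
        self.audio_encoder = nn.Identity()                 # stands in for the FastConformer encoder
        self.backbone = nn.Identity()                      # stands in for the LFM2.5 backbone
        self.rq_head = nn.Linear(d_model, codec_vocab)     # stands in for the RQ-transformer head
        self.detokenizer = nn.Identity()                   # stands in for the audio detokenizer

    def forward(self, audio_feats):
        # audio_feats: (B, T, d_model) continuous features from the input audio
        h = self.backbone(self.audio_encoder(audio_feats))
        tokens = self.rq_head(h).argmax(-1)                # discrete audio tokens (B, T)
        return self.detokenizer(tokens)                    # the real detokenizer emits a waveform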