Diffusion in ASR too. No code yet, hopefully it will be there soon. Nice benchmarks; Gemini is on top for speech (confirmed by our tests too).
https://arxiv.org/abs/2507.18452
DIFFA: Large Language Diffusion Models Can Listen and Understand
Jiaming Zhou, Hongjie Chen, Shiwan Zhao, Jian Kang, Jie Li, Enzhi Wang, Yujie Guo, Haoqin Sun, Hui Wang, Aobo Kong, Yong Qin, Xuelong Li
Recent advances in large language models (LLMs) have shown remarkable capabilities across textual and multimodal domains. In parallel, diffusion-based language models have emerged as a promising alternative to the autoregressive paradigm, offering improved controllability, bidirectional context modeling, and robust generation. However, their application to the audio modality remains underexplored. In this work, we introduce DIFFA, the first diffusion-based large audio-language model designed to perform spoken language understanding. DIFFA integrates a frozen diffusion language model with a lightweight dual-adapter architecture that bridges speech understanding and natural language reasoning. We employ a two-stage training pipeline: first, aligning semantic representations via an ASR objective; then, learning instruction-following abilities through synthetic audio-caption pairs automatically generated by prompting LLMs. Despite being trained on only 960 hours of ASR and 127 hours of synthetic instruction data, DIFFA demonstrates competitive performance on major benchmarks, including MMSU, MMAU, and VoiceBench, outperforming several autoregressive open-source baselines. Our results reveal the potential of diffusion-based language models for efficient and scalable audio understanding, opening a new direction for speech-driven AI. Our code will be available at this https URL.
While some things are questionable, the return to phonemes is nice
https://github.com/tabahi/contexless-phonemes-CUPE
https://github.com/tabahi/bournemouth-forced-aligner
GitHub - tabahi/contexless-phonemes-CUPE: pytorch model for contexless-phoneme prediction from speech audio
For us flow matching guys
https://github.com/primepake/F5-TTS-meanflow
https://arxiv.org/abs/2505.13447
Mean Flows for One-step Generative Modeling
Zhengyang Geng, Mingyang Deng, Xingjian Bai, J. Zico Kolter, Kaiming He
We propose a principled and effective framework for one-step generative modeling. We introduce the notion of average velocity to characterize flow fields, in contrast to instantaneous velocity modeled by Flow Matching methods. A well-defined identity between average and instantaneous velocities is derived and used to guide neural network training. Our method, termed the MeanFlow model, is self-contained and requires no pre-training, distillation, or curriculum learning. MeanFlow demonstrates strong empirical performance: it achieves an FID of 3.43 with a single function evaluation (1-NFE) on ImageNet 256x256 trained from scratch, significantly outperforming previous state-of-the-art one-step diffusion/flow models. Our study substantially narrows the gap between one-step diffusion/flow models and their multi-step predecessors, and we hope it will motivate future research to revisit the foundations of these powerful models.
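To make the difference between average and instantaneous velocity concrete, here is a toy sketch, not the paper's training code. It assumes a hand-picked 1-D linear velocity field v(z, t) = A * z, for which the average velocity has a closed form, and it assumes time runs from t=1 (noise) down to t=0 (data). The constant `A` and all function names are illustrative.

```python
import math

A = 1.0  # assumed constant of the toy velocity field v(z, t) = A * z


def instantaneous_velocity(z: float, t: float) -> float:
    """Toy velocity field (the quantity plain Flow Matching models)."""
    return A * z


def average_velocity(z_t: float, r: float, t: float) -> float:
    """Closed-form average velocity over [r, t] for the toy field.

    Defined as displacement / elapsed time: (z_t - z_r) / (t - r),
    where z_r = z_t * exp(-A * (t - r)) is the exact ODE solution.
    """
    z_r = z_t * math.exp(-A * (t - r))
    return (z_t - z_r) / (t - r)


def euler_sample(z1: float, steps: int) -> float:
    """Integrate dz/dt = v(z, t) from t=1 down to t=0 with Euler steps."""
    z, dt = z1, 1.0 / steps
    for i in range(steps):
        t = 1.0 - i * dt
        z -= dt * instantaneous_velocity(z, t)
    return z


def one_step_sample(z1: float) -> float:
    """Single jump from t=1 to t=0 using the average velocity (1-NFE)."""
    return z1 - (1.0 - 0.0) * average_velocity(z1, 0.0, 1.0)


z1 = 2.0
print(f"exact          {z1 * math.exp(-A):.4f}")
print(f"euler, 1 step  {euler_sample(z1, 1):.4f}")
print(f"euler, 1000    {euler_sample(z1, 1000):.4f}")
print(f"average, 1     {one_step_sample(z1):.4f}")
```

With the instantaneous velocity, a single Euler step misses the target badly and many steps are needed; with the average velocity, one evaluation lands on the exact endpoint. In MeanFlow the average velocity is learned by a network rather than given in closed form.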
CoLMbo is a Speaker Language Model (SLM) designed to go beyond traditional speaker recognition. While most systems stop at identifying “who” the speaker is, CoLMbo answers “what is this speaker like?” by generating context-rich, descriptive captions from speaker embeddings including gender, age, personality, and dialect.
https://github.com/massabaali7/CoLMbo
Someone proposed a model for the HF Open ASR Leaderboard. Average WER 3.1% compared to the previous best of 6.1%
https://github.com/huggingface/open_asr_leaderboard/pull/92#issuecomment-3239312224
WER of 0.71 on LibriSpeech test-clean, quite a bold claim
This suggests the importance of closed-source tests.
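For reference, the WER numbers discussed here come down to a word-level edit distance divided by the reference length. Below is a minimal sketch of that computation; it is not the leaderboard's scoring code, and it assumes plain whitespace tokenization with no text normalization. The function name is illustrative.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = (substitutions + deletions + insertions) / reference words."""
    ref, hyp = reference.split(), hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] is the edit distance between the reference prefix
    # processed so far and hyp[:j]
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr[j] = min(
                prev[j] + 1,         # deletion of a reference word
                curr[j - 1] + 1,     # insertion of a hypothesis word
                prev[j - 1] + cost,  # substitution or exact match
            )
        prev = curr
    return prev[-1] / len(ref)


# One substitution (sat -> sit) and one deletion (the) over 6 words.
print(word_error_rate("the cat sat on the mat", "the cat sit on mat"))
```

If the 0.71 figure is a percentage, it corresponds to roughly one word error per 140 reference words, which is why the claim deserves a check on data the model could not have seen.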
Add Whisper-based SOTA model (record-breaking WER) by vivek-shunyalabs · Pull Request #92 · huggingface/open_asr_leaderboard
Hello Open-ASR team,
This PR adds my Whisper-based ASR model to the leaderboard. The modification is minimal yet it represents a model that has achieved record-breaking WER in evaluation.
This mode...
Everyone talks about smart VAD these days. Backchannel actions are also important
https://github.com/Linyx1125/MM-F2F
GitHub - Linyx1125/MM-F2F: [ACL 2025] Predicting Turn-Taking and Backchannel in Human-Machine Conversations Using Linguistic, Acoustic…
https://arxiv.org/abs/2506.21619
IndexTTS2: A Breakthrough in Emotionally Expressive and Duration-Controlled Auto-Regressive Zero-Shot Text-to-Speech
Siyi Zhou, Yiquan Zhou, Yi He, Xun Zhou, Jinchao Wang, Wei Deng, Jingchen Shu
Existing autoregressive large-scale text-to-speech (TTS) models have advantages in speech naturalness, but their token-by-token generation mechanism makes it difficult to precisely control the duration of synthesized speech. This becomes a significant limitation in applications requiring strict audio-visual synchronization, such as video dubbing. This paper introduces IndexTTS2, which proposes a novel, general, and autoregressive model-friendly method for speech duration control. The method supports two generation modes: one explicitly specifies the number of generated tokens to precisely control speech duration; the other freely generates speech in an autoregressive manner without specifying the number of tokens, while faithfully reproducing the prosodic features of the input prompt. Furthermore, IndexTTS2 achieves disentanglement between emotional expression and speaker identity, enabling independent control over timbre and emotion. In the zero-shot setting, the model can accurately reconstruct the target timbre (from the timbre prompt) while perfectly reproducing the specified emotional tone (from the style prompt). To enhance speech clarity in highly emotional expressions, we incorporate GPT latent representations and design a novel three-stage training paradigm to improve the stability of the generated speech. Additionally, to lower the barrier for emotional control, we designed a soft instruction mechanism based on text descriptions by fine-tuning Qwen3, effectively guiding the generation of speech with the desired emotional orientation. Finally, experimental results on multiple datasets show that IndexTTS2 outperforms state-of-the-art zero-shot TTS models in terms of word error rate, speaker similarity, and emotional fidelity. Audio samples are available at: this https URL
Meet Chatterbox Multilingual! 🔥
Production grade. Open source. Voice Cloning in 23 languages. Emotion and intensity control. PerTh watermarking on by default. MIT license. Free forever.
You asked for this, we delivered.
Chatterbox Multilingual adds zero-shot voice cloning in 23 languages from Arabic and Hindi to Chinese and Swahili.
https://github.com/resemble-ai/chatterbox
Arabic (ar) • Danish (da) • German (de) • Greek (el) • English (en) • Spanish (es) • Finnish (fi) • French (fr) • Hebrew (he) • Hindi (hi) • Italian (it) • Japanese (ja) • Korean (ko) • Malay (ms) • Dutch (nl) • Norwegian (no) • Polish (pl) • Portuguese (pt) • Russian (ru) • Swedish (sv) • Swahili (sw) • Turkish (tr) • Chinese (zh)
https://github.com/Tobertz-max/DiFlow-TTS
DiFlow-TTS delivers low-latency, zero-shot text-to-speech through discrete flow matching and factorized speech tokens. It combines a compact token representation with a flow-based sampler to produce natural speech quickly, even for unseen speakers and languages
From DeepMind
https://www.arxiv.org/abs/2509.05256
Recomposer: Event-roll-guided generative audio editing
Daniel P. W. Ellis, Eduardo Fonseca, Ron J. Weiss, Kevin Wilson, Scott Wisdom, Hakan Erdogan, John R. Hershey, Aren Jansen, R. Channing Moore, Manoj Plakal
Editing complex real-world sound scenes is difficult because individual sound sources overlap in time. Generative models can fill-in missing or corrupted details based on their strong prior understanding of the data domain. We present a system for editing individual sound events within complex scenes able to delete, insert, and enhance individual sound events based on textual edit descriptions (e.g., "enhance Door") and a graphical representation of the event timing derived from an "event roll" transcription. We present an encoder-decoder transformer working on SoundStream representations, trained on synthetic (input, desired output) audio example pairs formed by adding isolated sound events to dense, real-world backgrounds. Evaluation reveals the importance of each part of the edit descriptions -- action, class, timing. Our work demonstrates "recomposition" is an important and practical application.
Nice interview with some details on 11labs
https://www.youtube.com/watch?v=whVdDLtkiKs
"Narration has plateaued" they say
ElevenLabs CEO/Co-Founder, Mati Staniszewski:The Untold Story of Europe’s Fastest Growing AI Startup
Mati Staniszewski is the Co-Founder and CEO of ElevenLabs, the world’s leading AI voice platform. Since launching in 2022, ElevenLabs has raised over $350M, most recently at a $3.3BN valuation, making it one of Europe’s fastest AI unicorns. The company counts…
https://github.com/OpenBMB/VoxCPM
VoxCPM is a novel tokenizer-free Text-to-Speech (TTS) system that redefines realism in speech synthesis. By modeling speech in a continuous space, it overcomes the limitations of discrete tokenization and enables two flagship capabilities: context-aware speech generation and true-to-life zero-shot voice cloning.
Unlike mainstream approaches that convert speech to discrete tokens, VoxCPM uses an end-to-end diffusion autoregressive architecture that directly generates continuous speech representations from text. Built on MiniCPM-4 backbone, it achieves implicit semantic-acoustic decoupling through hierarchical language modeling and FSQ constraints, greatly enhancing both expressiveness and generation stability.
Conversational AI Reading Group (led by MousaviPooneh) resumes tomorrow!
https://poonehmousavi.github.io/rg
[Sep 18th, 2025]
Discrete Audio Tokens: More Than a Survey!
Presenter: Pooneh Mousavi, Mila - Concordia
https://poonehmousavi.github.io/dates-website/
https://github.com/lavendery/UUG
https://arxiv.org/abs/2508.08961
DualSpeechLM: Towards Unified Speech Understanding and Generation via Dual Speech Token Modeling with Large Language Models
Yuanyuan Wang, Dongchao Yang, Yiwen Shao, Hangting Chen, Jiankun Zhao, Zhiyong Wu, Helen Meng, Xixin Wu
Extending pre-trained Large Language Models (LLMs)'s speech understanding or generation abilities by introducing various effective speech tokens has attracted great attention in the speech community. However, building a unified speech understanding and generation model still faces the following challenges: (1) Due to the huge modality gap between speech tokens and text tokens, extending text LLMs to unified speech LLMs relies on large-scale paired data for fine-tuning, and (2) Generation and understanding tasks prefer information at different levels, e.g., generation benefits from detailed acoustic features, while understanding favors high-level semantics. This divergence leads to difficult performance optimization in one unified model. To solve these challenges, in this paper, we present two key insights in speech tokenization and speech language modeling. Specifically, we first propose an Understanding-driven Speech Tokenizer (USTokenizer), which extracts high-level semantic information essential for accomplishing understanding tasks using text LLMs. In this way, USToken enjoys better modality commonality with text, which reduces the difficulty of modality alignment in adapting text LLMs to speech LLMs. Secondly, we present DualSpeechLM, a dual-token modeling framework that concurrently models USToken as input and acoustic token as output within a unified, end-to-end framework, seamlessly integrating speech understanding and generation capabilities. Furthermore, we propose a novel semantic supervision loss and a Chain-of-Condition (CoC) strategy to stabilize model training and enhance speech generation performance. Experimental results demonstrate that our proposed approach effectively fosters a complementary relationship between understanding and generation tasks, highlighting the promising strategy of mutually enhancing both tasks in one unified model.
Chat/Supervisor model for voice agents from OpenAI
https://github.com/openai/openai-realtime-agents
https://x.com/noahmacca/status/1927014156152058075
Basically, the real-time model produces fillers while the slow model thinks
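A minimal sketch of that filler pattern, using only `asyncio`. It is not the OpenAI Realtime API or the code from the repo: `slow_supervisor`, `respond`, the filler phrase, and the timings are all hypothetical stand-ins.

```python
import asyncio


async def slow_supervisor(question: str) -> str:
    """Stand-in for the slower, smarter model (hypothetical)."""
    await asyncio.sleep(0.05)  # simulated thinking time
    return f"Answer to: {question}"


async def respond(question: str, filler_after: float = 0.01) -> list[str]:
    """Return utterances in the order the user would hear them."""
    spoken: list[str] = []
    task = asyncio.create_task(slow_supervisor(question))
    # Wait briefly; if the supervisor is not done yet, speak a filler first.
    done, _ = await asyncio.wait({task}, timeout=filler_after)
    if not done:
        spoken.append("Let me check that for you.")
    spoken.append(await task)
    return spoken


print(asyncio.run(respond("What is my order status?")))
```

The point of the design is that the user never hears silence: the fast path always has something to say within `filler_after` seconds, and the real answer follows when the slow path finishes.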
This was a big challenge with interesting results
https://arxiv.org/abs/2509.13785
Summary on The Multilingual Conversational Speech Language Model Challenge: Datasets, Tasks, Baselines, and Methods
Bingshen Mu, Pengcheng Guo, Zhaokai Sun, Shuai Wang, Hexin Liu, Mingchen Shao, Lei Xie, Eng Siong Chng, Longshuai Xiao, Qiangze Feng, Daliang Wang
This paper summarizes the Interspeech2025 Multilingual Conversational Speech Language Model (MLC-SLM) challenge, which aims to advance the exploration of building effective multilingual conversational speech LLMs (SLLMs). We provide a detailed description of the task settings for the MLC-SLM challenge, the released real-world multilingual conversational speech dataset totaling approximately 1,604 hours, and the baseline systems for participants. The MLC-SLM challenge attracts 78 teams from 13 countries to participate, with 489 valid leaderboard results and 14 technical reports for the two tasks. We distill valuable insights on building multilingual conversational SLLMs based on submissions from participants, aiming to contribute to the advancement of the community.
One of the best systems
https://arxiv.org/abs/2507.18051
The TEA-ASLP System for Multilingual Conversational Speech Recognition and Speech Diarization in MLC-SLM 2025 Challenge
Hongfei Xue, Kaixun Huang, Zhikai Zhou, Shen Huang, Shidong Shang
This paper presents the TEA-ASLP's system submitted to the MLC-SLM 2025 Challenge, addressing multilingual conversational automatic speech recognition (ASR) in Task I and speech diarization ASR in Task II. For Task I, we enhance Ideal-LLM model by integrating known language identification and a multilingual MOE LoRA structure, along with using CTC-predicted tokens as prompts to improve autoregressive generation. The model is trained on approximately 180k hours of multilingual ASR data. In Task II, we replace the baseline English-Chinese speaker diarization model with a more suitable English-only version. Our approach achieves a 30.8% reduction in word error rate (WER) compared to the baseline speech language model, resulting in a final WER of 9.60% in Task I and a time-constrained minimum-permutation WER of 17.49% in Task II, earning first and second place in the respective challenge tasks.
People recommended this to me
https://herimor.github.io/voxtream/
https://arxiv.org/abs/2509.15969
https://github.com/herimor/voxtream
VoXtream: Full-Stream Text-to-Speech with Extremely Low Latency
Nikita Torgashov, Gustav Eje Henter, Gabriel Skantze
We present VoXtream, a fully autoregressive, zero-shot streaming text-to-speech (TTS) system for real-time use that begins speaking from the first word. VoXtream directly maps incoming phonemes to audio tokens using a monotonic alignment scheme and a dynamic look-ahead that does not delay onset. Built around an incremental phoneme transformer, a temporal transformer predicting semantic and duration tokens, and a depth transformer producing acoustic tokens, VoXtream achieves, to our knowledge, the lowest initial delay among publicly available streaming TTS: 102 ms on GPU. Despite being trained on a mid-scale 9k-hour corpus, it matches or surpasses larger baselines on several metrics, while delivering competitive quality in both output- and full-streaming settings. Demo and code are available at this https URL.
I read the paper and listened to the samples on the webpage.
The idea of modeling discrete codec tokens at a low frame rate (12.5 frames/second here, with the Mimi codec) is actively used in recent systems; however, I think this rate is too coarse to model a proper voice. One can easily demonstrate that with duration metrics. You can also hear it in the samples: the calm voice is OK, but any emotional voice will be bad. Too uniform for real speech. Again, it would be nice to test systems with intonation/duration metrics, at least pitch correlation / FAD / duration distance. Very sad that most modern systems report only WER and speaker similarity. WER will of course be good, as the speech is very clean. By the way, the speaker similarity of this system is also lower, mostly due to that uniformity issue, I think. Most LLM-TTS systems based on coarse tokens should exhibit this too.
At least 40 frames per second is required for a proper speech model, maybe in a hierarchical way (coarse/fine tokens). https://github.com/hubertsiuzdak/snac makes sense here. Overall, the absence of hierarchy in tokens is the weakest point of modern LLMs too.
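The frame-rate argument is easy to check with arithmetic. The sketch below compares a coarse rate with 40 frames per second; 12.5 is used as the assumed Mimi rate, and the 80 ms phoneme duration is only an illustrative assumption, not a measurement.

```python
def frame_ms(frames_per_second: float) -> float:
    """Duration covered by one codec frame, in milliseconds."""
    return 1000.0 / frames_per_second


def frames_per_phoneme(frames_per_second: float, phoneme_ms: float) -> float:
    """How many codec frames span a phoneme of the given duration."""
    return phoneme_ms / frame_ms(frames_per_second)


PHONEME_MS = 80.0  # assumed phoneme duration, for illustration only

for fps in (12.5, 40.0):
    print(f"{fps:>5} fps: {frame_ms(fps):.0f} ms/frame, "
          f"{frames_per_phoneme(fps, PHONEME_MS):.1f} frames per phoneme")
```

Under these assumptions a coarse frame covers a whole phoneme, so duration can only change in phoneme-sized jumps, while 40 frames per second leaves a few frames per phoneme to shape timing and pitch.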
Too bad there is no systematic evaluation of TTS systems: so many systems to evaluate, and very questionable reports.
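As a rough idea of what the intonation/duration metrics mentioned above could look like, here is a small sketch of pitch correlation and duration distance. These are simple assumed definitions, not a standard toolkit: F0 contours are taken as already time-aligned lists with 0.0 marking unvoiced frames, and durations are per-phoneme values in seconds from some aligner.

```python
import math


def pitch_correlation(f0_ref: list[float], f0_syn: list[float]) -> float:
    """Pearson correlation of F0 over frames voiced in both contours."""
    pairs = [(a, b) for a, b in zip(f0_ref, f0_syn) if a > 0 and b > 0]
    if len(pairs) < 2:
        raise ValueError("need at least two frames voiced in both contours")
    mean_a = sum(a for a, _ in pairs) / len(pairs)
    mean_b = sum(b for _, b in pairs) / len(pairs)
    cov = sum((a - mean_a) * (b - mean_b) for a, b in pairs)
    var_a = sum((a - mean_a) ** 2 for a, _ in pairs)
    var_b = sum((b - mean_b) ** 2 for _, b in pairs)
    if var_a == 0 or var_b == 0:
        return 0.0  # a perfectly flat contour has no intonation to match
    return cov / math.sqrt(var_a * var_b)


def duration_distance(dur_ref: list[float], dur_syn: list[float]) -> float:
    """Mean absolute difference of per-phoneme durations, in seconds."""
    if len(dur_ref) != len(dur_syn) or not dur_ref:
        raise ValueError("duration lists must be non-empty and equal length")
    return sum(abs(a - b) for a, b in zip(dur_ref, dur_syn)) / len(dur_ref)


# Made-up contours: a lively reference and a flat synthetic voice.
reference_f0 = [0.0, 110.0, 130.0, 180.0, 150.0, 0.0]
flat_f0 = [0.0, 140.0, 140.0, 140.0, 140.0, 0.0]
print(pitch_correlation(reference_f0, reference_f0))
print(pitch_correlation(reference_f0, flat_f0))
print(duration_distance([0.08, 0.12, 0.20], [0.10, 0.10, 0.10]))
```

A voice that is too uniform would show up here as low pitch correlation and a large duration distance against expressive references, even when its WER looks excellent.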