https://github.com/lavendery/UUG
https://arxiv.org/abs/2508.08961
DualSpeechLM: Towards Unified Speech Understanding and Generation via Dual Speech Token Modeling with Large Language Models
Yuanyuan Wang, Dongchao Yang, Yiwen Shao, Hangting Chen, Jiankun Zhao, Zhiyong Wu, Helen Meng, Xixin Wu
Extending pre-trained Large Language Models (LLMs)'s speech understanding or generation abilities by introducing various effective speech tokens has attracted great attention in the speech community. However, building a unified speech understanding and generation model still faces the following challenges: (1) Due to the huge modality gap between speech tokens and text tokens, extending text LLMs to unified speech LLMs relies on large-scale paired data for fine-tuning, and (2) Generation and understanding tasks prefer information at different levels, e.g., generation benefits from detailed acoustic features, while understanding favors high-level semantics. This divergence leads to difficult performance optimization in one unified model. To solve these challenges, in this paper, we present two key insights in speech tokenization and speech language modeling. Specifically, we first propose an Understanding-driven Speech Tokenizer (USTokenizer), which extracts high-level semantic information essential for accomplishing understanding tasks using text LLMs. In this way, USToken enjoys better modality commonality with text, which reduces the difficulty of modality alignment in adapting text LLMs to speech LLMs. Secondly, we present DualSpeechLM, a dual-token modeling framework that concurrently models USToken as input and acoustic token as output within a unified, end-to-end framework, seamlessly integrating speech understanding and generation capabilities. Furthermore, we propose a novel semantic supervision loss and a Chain-of-Condition (CoC) strategy to stabilize model training and enhance speech generation performance. Experimental results demonstrate that our proposed approach effectively fosters a complementary relationship between understanding and generation tasks, highlighting the promising strategy of mutually enhancing both tasks in one unified model.
Chat/Supervisor model for voice agents from OpenAI
https://github.com/openai/openai-realtime-agents
https://x.com/noahmacca/status/1927014156152058075
Basically, the real-time model produces fillers while the slow model thinks.
This was a big challenge with interesting results
https://arxiv.org/abs/2509.13785
Summary on The Multilingual Conversational Speech Language Model Challenge: Datasets, Tasks, Baselines, and Methods
Bingshen Mu, Pengcheng Guo, Zhaokai Sun, Shuai Wang, Hexin Liu, Mingchen Shao, Lei Xie, Eng Siong Chng, Longshuai Xiao, Qiangze Feng, Daliang Wang
This paper summarizes the Interspeech2025 Multilingual Conversational Speech Language Model (MLC-SLM) challenge, which aims to advance the exploration of building effective multilingual conversational speech LLMs (SLLMs). We provide a detailed description of the task settings for the MLC-SLM challenge, the released real-world multilingual conversational speech dataset totaling approximately 1,604 hours, and the baseline systems for participants. The MLC-SLM challenge attracts 78 teams from 13 countries to participate, with 489 valid leaderboard results and 14 technical reports for the two tasks. We distill valuable insights on building multilingual conversational SLLMs based on submissions from participants, aiming to contribute to the advancement of the community.
One of the best systems
https://arxiv.org/abs/2507.18051
The TEA-ASLP System for Multilingual Conversational Speech Recognition and Speech Diarization in MLC-SLM 2025 Challenge
Hongfei Xue, Kaixun Huang, Zhikai Zhou, Shen Huang, Shidong Shang
This paper presents the TEA-ASLP's system submitted to the MLC-SLM 2025 Challenge, addressing multilingual conversational automatic speech recognition (ASR) in Task I and speech diarization ASR in Task II. For Task I, we enhance Ideal-LLM model by integrating known language identification and a multilingual MOE LoRA structure, along with using CTC-predicted tokens as prompts to improve autoregressive generation. The model is trained on approximately 180k hours of multilingual ASR data. In Task II, we replace the baseline English-Chinese speaker diarization model with a more suitable English-only version. Our approach achieves a 30.8% reduction in word error rate (WER) compared to the baseline speech language model, resulting in a final WER of 9.60% in Task I and a time-constrained minimum-permutation WER of 17.49% in Task II, earning first and second place in the respective challenge tasks.
People recommended this one to me
https://herimor.github.io/voxtream/
https://arxiv.org/abs/2509.15969
https://github.com/herimor/voxtream
VoXtream: Full-Stream Text-to-Speech with Extremely Low Latency
Nikita Torgashov, Gustav Eje Henter, Gabriel Skantze
We present VoXtream, a fully autoregressive, zero-shot streaming text-to-speech (TTS) system for real-time use that begins speaking from the first word. VoXtream directly maps incoming phonemes to audio tokens using a monotonic alignment scheme and a dynamic look-ahead that does not delay onset. Built around an incremental phoneme transformer, a temporal transformer predicting semantic and duration tokens, and a depth transformer producing acoustic tokens, VoXtream achieves, to our knowledge, the lowest initial delay among publicly available streaming TTS: 102 ms on GPU. Despite being trained on a mid-scale 9k-hour corpus, it matches or surpasses larger baselines on several metrics, while delivering competitive quality in both output- and full-streaming settings. Demo and code are available at this https URL.
I read the paper and listened to the samples on the webpage.
The idea of modeling discrete codec tokens at a low framerate (12 frames/second here, with the Mimi codec) is likely to be actively used in upcoming systems; however, I think this rate is too coarse to model voice properly. One can easily demonstrate that with duration metrics, and you can hear it in the samples too: a calm voice is ok, but any emotional voice will sound bad, too uniform for real speech. Again, it would be nice to test systems with intonation/duration metrics, at least pitch correlation / FAD / duration distance. It is very sad that most modern systems just report WER and speaker similarity. WER, of course, will be good because the speech is very clean. By the way, the speaker similarity of this system is also lower, mostly due to that uniformity issue, I think. Most LLM-based TTS built on coarse tokens should expose this problem too.
At least 40 frames per second is required for a proper speech model, maybe in a hierarchical way (coarse/fine tokens). https://github.com/hubertsiuzdak/snac makes sense here. Overall, the absence of hierarchy in tokens is the weakest point of modern LLMs too.
Too bad no systematic evaluation of TTS systems is performed: there are so many systems to evaluate, and the reports are very questionable.
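To make the point about prosody metrics concrete, here is a minimal sketch of the two simplest ones mentioned above, pitch correlation and duration distance. This is my own illustrative formulation, not a standard implementation: it assumes F0 tracks are already extracted and frame-aligned (e.g. by DTW), with 0 marking unvoiced frames.

```python
import numpy as np

def pitch_correlation(f0_ref, f0_syn):
    """Pearson correlation of two F0 tracks over frames voiced in both.
    Assumes the tracks are already frame-aligned; 0 marks unvoiced."""
    n = min(len(f0_ref), len(f0_syn))
    a, b = np.asarray(f0_ref[:n], float), np.asarray(f0_syn[:n], float)
    voiced = (a > 0) & (b > 0)
    if voiced.sum() < 2:
        return 0.0
    return float(np.corrcoef(a[voiced], b[voiced])[0, 1])

def duration_distance(dur_ref, dur_syn):
    """Relative difference of total utterance durations (in seconds)."""
    return abs(dur_ref - dur_syn) / max(dur_ref, 1e-9)
```

Even metrics this crude would expose the uniformity problem: a system that flattens intonation gets low pitch correlation against expressive references, and one that paces speech too evenly drifts in total duration.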
There is still a big gap between user expectations for smart devices and homes and open-source software capabilities. Proper multichannel recognition is not yet popular, so all the toys built with an RPi 4 remain toys. Even the semi-open systems built on Home Assistant are far from useful due to weak software.
A paper like the following one is a good direction
https://arxiv.org/abs/2509.14430v1
Multi-Channel Differential ASR for Robust Wearer Speech Recognition on Smart Glasses
Yufeng Yang, Yiteng Huang, Yong Xu, Li Wan, Suwon Shon, Yang Liu, Yifeng Fan, Zhaojun Yang, Olivier Siohan, Yue Liu, Ming Sun, Florian Metze
With the growing adoption of wearable devices such as smart glasses for AI assistants, wearer speech recognition (WSR) is becoming increasingly critical to next-generation human-computer interfaces. However, in real environments, interference from side-talk speech remains a significant challenge to WSR and may cause accumulated errors for downstream tasks such as natural language processing. In this work, we introduce a novel multi-channel differential automatic speech recognition (ASR) method for robust WSR on smart glasses. The proposed system takes differential inputs from different frontends that complement each other to improve the robustness of WSR, including a beamformer, microphone selection, and a lightweight side-talk detection model. Evaluations on both simulated and real datasets demonstrate that the proposed system outperforms the traditional approach, achieving up to an 18.0% relative reduction in word error rate.
As we advocate for prosody evaluations in TTS systems, this paper is important.
The metric itself is questionable though, and so are the results (I'd experiment with the CFG value in flow-matching systems)
https://arxiv.org/abs/2509.19928
Measuring Prosody Diversity in Zero-Shot TTS: A New Metric, Benchmark, and Exploration
Yifan Yang, Bing Han, Hui Wang, Long Zhou, Wei Wang, Mingyu Cui, Xu Tan, Xie Chen
Prosody diversity is essential for achieving naturalness and expressiveness in zero-shot text-to-speech (TTS). However, frequently used acoustic metrics capture only partial views of prosodic variation and correlate poorly with human perception, leaving the problem of reliably quantifying prosody diversity underexplored. To bridge this gap, we introduce ProsodyEval, a prosody diversity assessment dataset that provides Prosody Mean Opinion Score (PMOS) alongside conventional acoustic metrics. ProsodyEval comprises 1000 speech samples derived from 7 mainstream TTS systems, with 2000 human ratings. Building on this, we propose the Discretized Speech Weighted Edit Distance (DS-WED), a new objective diversity metric that quantifies prosodic variation via weighted edit distance over semantic tokens. Experiments on ProsodyEval show that DS-WED achieves substantially higher correlation with human judgments than existing acoustic metrics, while remaining highly robust in speech tokenization from HuBERT and WavLM. Leveraging DS-WED, we benchmark state-of-the-art open-source TTS systems on LibriSpeech test-clean and Seed-TTS test-en, and further explorations uncover several factors that influence prosody diversity, including generative modeling paradigms, duration control, and reinforcement learning. Moreover, we find that current large audio language models (LALMs) remain limited in capturing prosodic variations. Audio samples are available at this https URL.
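The core of DS-WED, as I read the abstract, is a weighted edit distance over discrete semantic token sequences. Here is a sketch of that building block; the operation weights below are hypothetical placeholders, and the paper's actual weighting scheme may differ.

```python
import numpy as np

def weighted_edit_distance(a, b, w_sub=1.0, w_ins=1.0, w_del=1.0):
    """Weighted edit distance between two token sequences via the
    classic prefix DP. D[i][j] is the cheapest way to turn a[:i]
    into b[:j] using weighted substitution/insertion/deletion."""
    D = np.zeros((len(a) + 1, len(b) + 1))
    D[:, 0] = np.arange(len(a) + 1) * w_del
    D[0, :] = np.arange(len(b) + 1) * w_ins
    for i in range(1, len(a) + 1):
        for j in range(1, len(b) + 1):
            sub = D[i - 1, j - 1] + (0.0 if a[i - 1] == b[j - 1] else w_sub)
            D[i, j] = min(sub, D[i - 1, j] + w_del, D[i, j - 1] + w_ins)
    return float(D[len(a), len(b)])
```

The appeal of operating on semantic tokens (HuBERT/WavLM units) rather than raw acoustics is that two prosodically different renditions of the same text yield different token sequences, so their pairwise distances can serve as a diversity score.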
https://github.com/alexandrefrancois/noFFT
https://alexandrefrancois.org/Resonate/
https://alexandrefrancois.org/assets/publications/FrancoisARJ-ICMC2025.pdf
This paper describes Resonate, an original low latency, low memory footprint, and low computational cost algorithm to evaluate perceptually relevant spectral information from audio signals. The fundamental building block is a resonator model that accumulates the signal contribution around its resonant frequency in the time domain, using the Exponentially Weighted Moving Average (EWMA). A compact, iterative formulation of the model affords computing an update at each signal input sample, requiring no buffering and involving only a handful of arithmetic operations. Consistently with on-line perceptual signal analysis, the EWMA gives more weight to recent input values, whereas the contributions of older values decay exponentially. A single parameter governs the dynamics of the system. Banks of such resonators, independently tuned to geometrically spaced resonant frequencies, compute an instantaneous, perceptually relevant estimate of the spectral content of an input signal in real-time. Both memory and per-sample computational complexity of such a bank are linear in the number of resonators, and independent of the number of input samples processed, or duration of processed signal. Furthermore, since the resonators are independent, there is no constraint on the tuning of their resonant frequencies or time constants, and all per sample computations can be parallelized across resonators. The cumulative computational cost for a given duration increases linearly with the number of input samples processed. The low latency afforded by Resonate opens the door to real-time music and speech applications that are out of the reach of FFT-based methods. The efficiency of the approach could reduce computational costs and inspire new designs for low-level audio processing layers in machine learning systems.
The principle itself is applicable not just to signal processing, but to upper layers too, something in line with https://en.wikipedia.org/wiki/Adaptive_resonance_theory
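My reading of the resonator model described above, sketched as code (this is not the reference implementation from the noFFT repo): each resonator demodulates the input at its resonant frequency and accumulates it with an EWMA, one cheap update per sample, no buffering.

```python
import numpy as np

def resonate(signal, sr, freqs, tau=0.02):
    """Bank of EWMA resonators. For each resonant frequency f_k, the
    per-sample update is z_k <- (1-a)*z_k + a*x[n]*exp(-2j*pi*f_k*n/sr);
    |z_k| is an instantaneous estimate of energy near f_k. The EWMA
    weight a is derived from the time constant tau (a modeling choice
    of mine, the paper exposes a single dynamics parameter)."""
    alpha = 1.0 - np.exp(-1.0 / (tau * sr))
    n = np.arange(len(signal))
    phasors = np.exp(-2j * np.pi * np.outer(n, freqs) / sr)
    z = np.zeros(len(freqs), dtype=complex)
    out = np.empty((len(signal), len(freqs)))
    for i, x in enumerate(signal):
        z = (1 - alpha) * z + alpha * x * phasors[i]  # EWMA accumulation
        out[i] = np.abs(z)                            # magnitude per band
    return out
```

Feeding a pure 440 Hz tone into a three-resonator bank at 220/440/880 Hz, the 440 Hz resonator's magnitude dominates once the EWMA settles, which is the behavior the paper relies on. Since the resonators are independent, the inner update parallelizes trivially across bands.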
One more reminder that a VAE is better than mel-spectrograms
https://github.com/ZhikangNiu/Semantic-VAE
A good and simple improvement over F5-TTS
https://arxiv.org/abs/2509.22167
Semantic-VAE: Semantic-Alignment Latent Representation for Better Speech Synthesis
Zhikang Niu, Shujie Hu, Jeongsoo Choi, Yushen Chen, Peining Chen, Pengcheng Zhu, Yunting Yang, Bowen Zhang, Jian Zhao, Chunhui Wang, Xie Chen
While mel-spectrograms have been widely utilized as intermediate representations in zero-shot text-to-speech (TTS), their inherent redundancy leads to inefficiency in learning text-speech alignment. Compact VAE-based latent representations have recently emerged as a stronger alternative, but they also face a fundamental optimization dilemma: higher-dimensional latent spaces improve reconstruction quality and speaker similarity, but degrade intelligibility, while lower-dimensional spaces improve intelligibility at the expense of reconstruction fidelity. To overcome this dilemma, we propose Semantic-VAE, a novel VAE framework that utilizes semantic alignment regularization in the latent space. This design alleviates the reconstruction-generation trade-off by capturing semantic structure in high-dimensional latent representations. Extensive experiments demonstrate that Semantic-VAE significantly improves synthesis quality and training efficiency. When integrated into F5-TTS, our method achieves 2.10% WER and 0.64 speaker similarity on LibriSpeech-PC, outperforming mel-based systems (2.23%, 0.60) and vanilla acoustic VAE baselines (2.65%, 0.59). We also release the code and models to facilitate further research.
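A sketch of what semantic-alignment regularization of VAE latents might look like. The loss form below is my assumption from the abstract (reconstruction + KL + a cosine-alignment term pulling a projection of the latent toward semantic embeddings); the paper's exact formulation, weights, and projection may differ.

```python
import numpy as np

def semantic_vae_loss(recon, target, mu, logvar, z_proj, sem_emb,
                      beta=1e-2, lam=1.0):
    """Assumed Semantic-VAE training objective (illustrative only).
    recon/target: reconstructed vs. ground-truth features;
    mu/logvar: Gaussian posterior parameters of the latent;
    z_proj: latent projected into the semantic space;
    sem_emb: frame-level semantic embeddings (e.g. from an SSL model)."""
    rec = np.mean((recon - target) ** 2)                       # reconstruction
    kl = -0.5 * np.mean(1 + logvar - mu**2 - np.exp(logvar))   # KL to N(0, I)
    cos = np.sum(z_proj * sem_emb, -1) / (
        np.linalg.norm(z_proj, axis=-1) * np.linalg.norm(sem_emb, axis=-1)
        + 1e-8)
    align = np.mean(1.0 - cos)          # cosine distance to semantic targets
    return rec + beta * kl + lam * align
```

The alignment term is what lets a high-dimensional latent keep its reconstruction fidelity while still carrying the semantic structure that intelligibility needs, which is the trade-off the abstract describes.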
https://huggingface.co/Atotti/Qwen3-Omni-AudioTransformer
The audio encoder extracted from Qwen3-Omni, presumably trained on about 20M hours of data
We released 4 new models for the Kazakh and Kyrgyz languages. The models are trained for the old Vosk, but they still have a lot of value for applications where you need to quickly update the LM.
https://alphacephei.com/vosk/models
vosk-model-small-ky-0.42
WER fleurs 18.95
WER cv 16.96
vosk-model-ky-0.42
WER fleurs 13.45
WER cv 8.75
vosk-model-small-kz-0.42
WER fleurs 21.10
WER cv 30.00
WER ksc 9.70
WER ksc-other 24.86
vosk-model-kz-0.42
WER fleurs 13.09
WER cv 12.50
WER ksc 4.49
WER ksc-other 18.51
Since everyone has already understood that discrete tokens don't work, here is a continuous variant
https://github.com/inclusionAI/Ming-UniAudio
BUT (Brno University of Technology) continues the work on Whisper instead of OpenAI
https://github.com/BUTSpeechFIT/SOT-DiCoW
https://arxiv.org/abs/2510.03723
Adapting Diarization-Conditioned Whisper for End-to-End Multi-Talker Speech Recognition
Martin Kocour, Martin Karafiat, Alexander Polok, Dominik Klement, Lukáš Burget, Jan Černocký
We propose a speaker-attributed (SA) Whisper-based model for multi-talker speech recognition that combines target-speaker modeling with serialized output training (SOT). Our approach leverages a Diarization-Conditioned Whisper (DiCoW) encoder to extract target-speaker embeddings, which are concatenated into a single representation and passed to a shared decoder. This enables the model to transcribe overlapping speech as a serialized output stream with speaker tags and timestamps. In contrast to target-speaker ASR systems such as DiCoW, which decode each speaker separately, our approach performs joint decoding, allowing the decoder to condition on the context of all speakers simultaneously. Experiments show that the model outperforms existing SOT-based approaches and surpasses DiCoW on multi-talker mixtures (e.g., LibriMix).
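Serialized output training, schematically: all speakers' utterances are merged into a single target stream ordered by start time, with speaker tags marking changes. The tag convention below is illustrative, not the paper's exact tokens.

```python
def serialize_sot(utterances):
    """Build an SOT-style target from speaker-attributed utterances.
    utterances: iterable of (start_time, speaker, text) tuples.
    Emits a speaker-change tag (my own '<sc:...>' convention) whenever
    the active speaker switches in the time-ordered stream."""
    out, prev = [], None
    for start, speaker, text in sorted(utterances):
        if speaker != prev:
            out.append(f"<sc:{speaker}>")
            prev = speaker
        out.append(text)
    return " ".join(out)
```

The contrast with per-speaker decoding (DiCoW) is then clear: a single decoder emits this one stream and can condition on every speaker's context at once, instead of transcribing each target speaker in isolation.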
Very true. Good read for those who chase 100ms response time
https://www.speechmatics.com/company/articles-and-news/why-fastest-voice-tech-is-a-trap
Why “fastest” voice tech is a trap
Why chasing the 'fastest' speech-to-text breaks voice agents. Discover how Speechmatics balances speed and accuracy for real-world conversations.
Somewhat advanced recent TTS: basically a modern F5 with VibeVoice latents at 7.5 Hz and distillation from a diffusion model
https://github.com/smallbraineng/smalltts
Everyone talks about https://github.com/neuphonic/neucodec
Neuphonic itself celebrates a funding round as a promising British startup.
Not sure why: the codec is really huge at 800M parameters, so it must be very context-dependent.
A package for NeuCodec: a 50hz, 0.8kbps, 24kHz audio codec.
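The headline numbers line up neatly: at 50 Hz and 0.8 kbps the codec spends 16 bits per 20 ms frame, i.e. one token from a 65536-entry codebook if it uses a single token per frame (the single-codebook reading is my guess, not a claim from the repo). A quick sanity check:

```python
frame_rate_hz = 50
bitrate_bps = 800  # 0.8 kbps

bits_per_frame = bitrate_bps // frame_rate_hz
codebook_size = 2 ** bits_per_frame

print(bits_per_frame)  # 16 bits per frame
print(codebook_size)   # 65536 entries if one token per frame
```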
Meta's take on audio LLMs
https://arxiv.org/abs/2510.06195
Latent Speech-Text Transformer
Yen-Ju Lu, Yashesh Gaur, Wei Zhou, Benjamin Muller, Jesus Villalba, Najim Dehak, Luke Zettlemoyer, Gargi Ghosh, Mike Lewis, Srinivasan Iyer, Duc Le
Auto-regressive speech-text models are typically pre-trained on a large number of interleaved sequences of text tokens and raw speech encoded as speech tokens using vector quantization. These models have demonstrated state-of-the-art performance in speech-to-speech understanding and generation benchmarks, together with promising scaling laws, primarily enabled by the representational alignment between text and speech. Nevertheless, they suffer from shortcomings, partly owing to the disproportionately longer sequences of speech tokens in contrast to textual tokens. This results in a large compute imbalance between modalities during pre-training as well as during inference, and a potential hindrance to effectively aligning speech and text, ultimately translating to several orders of magnitude slower scaling laws. We introduce the Latent Speech-Text Transformer (LST), which makes pre-training speech-text models more data-efficient by dynamically and inexpensively aggregating speech tokens into latent speech patches. These patches serve as higher-level units that can either align with corresponding textual units to aid capability transfer or even encapsulate common speech sequences like silences to be more compute-efficient. We show that LST outperforms vanilla approaches on speech-to-speech as well as text-to-text benchmarks in both data- and compute-controlled settings, the former indicating more effective representational alignment and the latter indicating steeper scaling laws for speech-text models. On HellaSwag story completion, LST achieves 6.5% absolute gain in speech accuracy under compute-controlled training and 5.3% under data-controlled training, while also improving text performance. We will release our models, code, and the evaluation data to facilitate further research.
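The patching idea can be caricatured as pooling runs of speech-token embeddings into fewer, higher-level units. Here is a fixed-size mean-pooling toy version; the actual LST does the aggregation dynamically and cheaply, which this sketch does not attempt:

```python
def mean_pool_patches(embs, patch):
    """Aggregate a sequence of embedding vectors into patches of
    `patch` consecutive frames by mean pooling.  Fixed-size toy
    version of LST's dynamic latent speech patches."""
    out = []
    for i in range(0, len(embs), patch):
        chunk = embs[i:i + patch]
        out.append([sum(v) / len(chunk) for v in zip(*chunk)])
    return out

seq = [[1.0, 0.0], [3.0, 0.0], [5.0, 2.0], [7.0, 2.0]]
print(mean_pool_patches(seq, 2))  # [[2.0, 0.0], [6.0, 2.0]]
```

Even this crude version shows the compute win: halving the sequence length quarters the attention cost, which is the imbalance the paper targets.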
CoT had to come to speech LLMs too
https://arxiv.org/abs/2510.07497
Can Speech LLMs Think while Listening?
Yi-Jen Shih, Desh Raj, Chunyang Wu, Wei Zhou, SK Bong, Yashesh Gaur, Jay Mahadeokar, Ozlem Kalinli, Mike Seltzer
Recent advances in speech large language models (speech LLMs) have enabled seamless spoken interactions, but these systems still struggle with complex reasoning tasks. Previously, chain-of-thought (CoT) prompting or fine-tuning has been shown to significantly improve the reasoning abilities of text-based LLMs. In this work, we investigate the effect of CoT fine-tuning for multi-stream speech LLMs, demonstrating that reasoning in text space improves the accuracy of speech LLMs by 2.4x, on average, over a suite of spoken reasoning tasks. Beyond accuracy, the latency of the spoken response is a crucial factor for interacting with voice-based agents. Inspired by the human behavior of "thinking while listening," we propose methods to reduce the additional latency from reasoning by allowing the model to start reasoning before the user query has ended. To achieve this, we introduce an entropy-based metric, "question completeness," which acts as an indicator to guide the model on the optimal time to start reasoning. This method provides greater control over the accuracy-latency trade-off compared with heuristic-based approaches and, under equivalent latency conditions, yields a 4% accuracy gain on ARC-Easy. Finally, we use Direct Preference Optimization (DPO) on preference data created using rejection sampling to push the accuracy-latency Pareto frontier further, resulting in a 70% reduction in latency without loss in accuracy.
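The "question completeness" signal is entropy-based; a plausible toy reading is that once the model's prediction about how the query ends becomes confident (low entropy), it is safe to start reasoning early. A sketch under that assumption (the gate, threshold, and distributions are illustrative, not the paper's actual metric):

```python
import math

def entropy(probs):
    """Shannon entropy in bits of a discrete probability distribution."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

def should_start_reasoning(next_token_probs, threshold=1.0):
    """Toy completeness gate: start reasoning once the model's
    next-token distribution becomes confident (entropy below threshold)."""
    return entropy(next_token_probs) < threshold

print(should_start_reasoning([0.9, 0.05, 0.05]))        # confident -> True
print(should_start_reasoning([0.25, 0.25, 0.25, 0.25]))  # uncertain -> False
```

The appeal over a fixed heuristic (e.g. "start after N seconds of audio") is that the threshold gives a single knob for the accuracy-latency trade-off.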