Improved Matcha. Adversarial learning really helps FM to improve MOS.
https://github.com/naver-ai/RapFlow-TTS
https://www.arxiv.org/abs/2506.16741
RapFlow-TTS: Rapid and High-Fidelity Text-to-Speech with Improved Consistency Flow Matching
Hyun Joon Park, Jeongmin Liu, Jin Sob Kim, Jeong Yeol Yang, Sung Won Han, Eunwoo Song
We introduce RapFlow-TTS, a rapid and high-fidelity TTS acoustic model that leverages velocity consistency constraints in flow matching (FM) training. Although ordinary differential equation (ODE)-based TTS generation achieves natural-quality speech, it typically requires a large number of generation steps, resulting in a trade-off between quality and inference speed. To address this challenge, RapFlow-TTS enforces consistency in the velocity field along the FM-straightened ODE trajectory, enabling consistent synthetic quality with fewer generation steps. Additionally, we introduce techniques such as time interval scheduling and adversarial learning to further enhance the quality of the few-step synthesis. Experimental results show that RapFlow-TTS achieves high-fidelity speech synthesis with a 5-
Given that adversarial learning really helps with MOS, a similar alternative would probably be to use a different latent space than plain mels. Encodec latents as in StableTTS? Sad that the authors didn't explore that path.
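To make the steps-versus-quality trade-off from the RapFlow-TTS abstract concrete, here is a toy 1-D sketch of fixed-step Euler sampling. It is not code from the RapFlow-TTS repo; `euler_sample`, `straight_velocity`, `curved_velocity` and `TARGET` are hypothetical names picked for illustration. On a straight trajectory a single Euler step already lands on the target, while a curved one needs many steps.

```python
import math


def euler_sample(velocity, x0, n_steps):
    """Integrate dx/dt = velocity(x, t) from t=0 to t=1 with fixed-step Euler."""
    x, dt = x0, 1.0 / n_steps
    for k in range(n_steps):
        t = k * dt
        x = x + dt * velocity(x, t)
    return x


TARGET = 3.0  # hypothetical data point that the straight path should reach


def straight_velocity(x, t):
    # Velocity of the straight line from the current point to TARGET.
    # Along x_t = (1 - t) * x0 + t * TARGET this equals TARGET - x0 (constant).
    return (TARGET - x) / (1.0 - t)


def curved_velocity(x, t):
    # dx/dt = x has the exact solution x(1) = e * x(0): a curved trajectory.
    return x


one_step_straight = euler_sample(straight_velocity, 0.0, 1)
one_step_curved = euler_sample(curved_velocity, 1.0, 1)
many_step_curved = euler_sample(curved_velocity, 1.0, 1000)

print(one_step_straight)                  # straight path: exact after 1 step
print(one_step_curved, many_step_curved)  # curved path: 1 step is far from e
```

This is only the intuition behind straightened trajectories; the consistency constraint, time interval scheduling and adversarial loss from the paper are not modelled here.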
ZipVoice-Dialog is released, much better than Dia. Less publicity, though.
https://arxiv.org/abs/2507.09318
ZipVoice-Dialog: Non-Autoregressive Spoken Dialogue Generation with Flow Matching
Han Zhu, Wei Kang, Liyong Guo, Zengwei Yao, Fangjun Kuang, Weiji Zhuang, Zhaoqing Li, Zhifeng Han, Dong Zhang, Xin Zhang, Xingchen Song, Long Lin, Daniel Povey
Generating spoken dialogue is more challenging than monologue text-to-speech (TTS) due to the need for realistic turn-taking and distinct speaker timbres. Existing spoken dialogue generation models, being auto-regressive, suffer from slow and unstable inference. To overcome these limitations, we introduce ZipVoice-Dialog, a non-autoregressive zero-shot spoken dialogue generation model built upon flow matching. Key designs include: 1) speaker-turn embeddings for precise speaker turn-taking; 2) a curriculum learning strategy for stable speech-text alignment; 3) specialized strategies to enable stereo dialogue generation. Additionally, recognizing the lack of open-source large-scale spoken dialogue datasets, we curated OpenDialog, a 6.8k-hour spoken dialogue dataset from in-the-wild speech data. Furthermore, we established a benchmark to comprehensively evaluate various models. Experimental results demonstrate that ZipVoice-Dialog achieves superior performance in intelligibility, speaker turn-taking accuracy, speaker similarity, and inference speed. Our codes, model checkpoints, demo samples, and the OpenDialog dataset are all publicly available at this https URL.
Introducing STITCH: our new method to make Spoken Language Models (SLMs) think and talk at the same time.
http://arxiv.org/abs/2507.15375
https://x.com/dcml0714/status/1947493948358070783
MegaTTS 3 voice cloning is here!
For context: a while back, ByteDance released MegaTTS 3 (with exceptional voice cloning capabilities), but for various reasons, they decided not to release the WavVAE encoder necessary for voice cloning to work.
Recently, a WavVAE encoder compatible with MegaTTS 3 was released by ACoderPassBy on ModelScope: https://modelscope.cn/models/ACoderPassBy/MegaTTS-SFT with quite promising results.
I reuploaded the weights to Hugging Face: https://huggingface.co/mrfakename/MegaTTS3-VoiceCloning
And put up a quick Gradio demo to try it out: https://huggingface.co/spaces/mrfakename/MegaTTS3-Voice-Cloning
Overall looks quite impressive - excited to see that we can finally do voice cloning with MegaTTS 3!
h/t to MysteryShack on the StyleTTS 2 Discord for info about the WavVAE encoder
https://www.reddit.com/r/LocalLLaMA/comments/1m641zg/megatts_3_voice_cloning_is_here/
This part is interesting, objective evaluation (well, with Gemini 2.5) of modern TTS
https://github.com/boson-ai/EmergentTTS-Eval-public
[NeurIPS' 25] Benchmark for evaluating TTS models on complex prosodic, expressiveness, and linguistic challenges. - boson-ai/EmergentTTS-Eval-public
Operating on long-form speech is a valid task statement. It highlights the issues of default transformers, although SSMs might not be the optimal solution.
https://arxiv.org/abs/2412.18603v2
Long-Form Speech Generation with Spoken Language Models
Se Jin Park, Julian Salazar, Aren Jansen, Keisuke Kinoshita, Yong Man Ro, RJ Skerry-Ryan
We consider the generative modeling of speech over multiple minutes, a requirement for long-form multimedia generation and audio-native voice assistants. However, textless spoken language models struggle to generate plausible speech past tens of seconds, due to high temporal resolution of speech tokens causing loss of coherence, architectural issues with long-sequence training or extrapolation, and memory costs at inference time. From these considerations we derive SpeechSSM, the first speech language model family to learn from and sample long-form spoken audio (e.g., 16 minutes of read or extemporaneous speech) in a single decoding session without text intermediates. SpeechSSMs leverage recent advances in linear-time sequence modeling to greatly surpass current Transformer spoken LMs in coherence and efficiency on multi-minute generations while still matching them at the utterance level. As we found current spoken language evaluations uninformative, especially in this new long-form setting, we also introduce: LibriSpeech-Long, a benchmark for long-form speech evaluation; new embedding-based and LLM-judged metrics; and quality measurements over length and time. Speech samples, the LibriSpeech-Long dataset, and any future code or model releases can be found at this https URL.
Streaming diarization with Sortformer lands in NeMo
https://www.arxiv.org/abs/2507.18446
https://github.com/NVIDIA/NeMo/pull/13201
Streaming Sortformer: Speaker Cache-Based Online Speaker Diarization with Arrival-Time Ordering
Ivan Medennikov, Taejin Park, Weiqing Wang, He Huang, Kunal Dhawan, Jinhan Wang, Jagadeesh Balam, Boris Ginsburg
This paper presents a streaming extension for the Sortformer speaker diarization framework, whose key property is the arrival-time ordering of output speakers. The proposed approach employs an Arrival-Order Speaker Cache (AOSC) to store frame-level acoustic embeddings of previously observed speakers. Unlike conventional speaker-tracing buffers, AOSC orders embeddings by speaker index corresponding to their arrival time order, and is dynamically updated by selecting frames with the highest scores based on the model's past predictions. Notably, the number of stored embeddings per speaker is determined dynamically by the update mechanism, ensuring efficient cache utilization and precise speaker tracking. Experiments on benchmark datasets confirm the effectiveness and flexibility of our approach, even in low-latency setups. These results establish Streaming Sortformer as a robust solution for real-time multi-speaker tracking and a foundation for streaming multi-talker speech processing.
Matcha and follow-up repos combine duration network training with flow matching decoder training. See the DeepSeek comment on it above.
The RapFlow paper correctly suggests freezing the encoder while training the FM
https://arxiv.org/abs/2506.16741
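A minimal PyTorch sketch of what "freeze the encoder while training the FM" can look like in practice. This is a generic pattern, not the RapFlow-TTS training code: `ToyTTS`, `freeze`, and the two linear layers standing in for the encoder and the FM decoder are hypothetical, and the squared-output loss is only a placeholder for a real FM loss.

```python
import torch
from torch import nn

torch.manual_seed(0)


class ToyTTS(nn.Module):
    """Hypothetical stand-in: 'encoder' = text encoder + duration predictor,
    'decoder' = flow matching decoder."""

    def __init__(self, dim=8):
        super().__init__()
        self.encoder = nn.Linear(dim, dim)
        self.decoder = nn.Linear(dim, dim)

    def forward(self, x):
        return self.decoder(self.encoder(x))


def freeze(module):
    """Stop gradient updates and switch off dropout/batch-norm updates."""
    module.eval()
    for p in module.parameters():
        p.requires_grad_(False)


model = ToyTTS()
freeze(model.encoder)

# Only hand the still-trainable (decoder) parameters to the optimizer.
trainable = [p for p in model.parameters() if p.requires_grad]
optimizer = torch.optim.Adam(trainable, lr=1e-3)

encoder_before = model.encoder.weight.detach().clone()
decoder_before = model.decoder.weight.detach().clone()

x = torch.randn(4, 8)
loss = model(x).pow(2).mean()  # placeholder loss, not a real FM objective
optimizer.zero_grad()
loss.backward()
optimizer.step()

print(torch.equal(model.encoder.weight, encoder_before))  # encoder untouched
print(torch.equal(model.decoder.weight, decoder_before))  # decoder updated
```

Note that a later `model.train()` call would put the encoder back into training mode, so in a real loop `model.encoder.eval()` has to be re-applied after it.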
Funny that DeepSeek cites a non-existent paper as a confirmation source, though:
"Gradient Conflicts in Multi-Objective Generative Modeling" (Zhang et al., ICML 2023) [arXiv:2302.08954]
NonVerbalSpeech-38K: A Scalable Pipeline for Enabling Non-Verbal Speech Generation and Understanding
Anonymous submission
https://huggingface.co/datasets/nonverbalspeech/nonverbalspeech38k
Abstract: Human spoken communication involves not only lexical content but also non-verbal vocalizations (NVs) such as laughter, sighs, and coughs, which convey emotions, intentions, and social signals. However, most existing speech systems focus solely on verbal content and lack the ability to understand and generate such non-verbal cues, reducing the emotional intelligence and communicative richness of spoken interfaces. In this work, we introduce NonVerbalSpeech-38K, a large and diverse dataset for non-verbal speech generation and understanding, collected from real-world media and annotated using an automatic pipeline. The dataset contains 38,718 samples (about 131 hours) with 10 categories of non-verbal cues, such as laughter, sniff, and throat clearing. We further validate the dataset by fine-tuning state-of-the-art models, including F5-TTS and Qwen2-Audio, demonstrating its effectiveness in non-verbal speech generation and understanding tasks. Our contributions are threefold: (1) We propose a practical pipeline for building natural and diverse non-verbal speech datasets; (2) We release a large-scale dataset to advance research on non-verbal speech generation and understanding; (3) We validate the dataset’s effectiveness by demonstrating improvements in both non-verbal speech synthesis and captioning, thereby facilitating richer human-computer interaction.
https://nonverbalspeech38k.github.io/nonverspeech38k/
Somewhat interesting tech
liquid-audio-nets implements Liquid Neural Networks (LNNs) optimized for ultra-low-power audio processing on edge devices. Based on 2025 field tests showing 10× power reduction compared to CNNs, this library enables always-on audio sensing for battery-powered IoT devices.
https://github.com/danieleschmidt/liquid-audio-nets
Key Innovations
Continuous-Time Dynamics: ODEs instead of discrete layers
Adaptive Computation: Timestep scales with signal complexity
Sparse Activation: Only necessary neurons fire
State Persistence: Temporal memory without explicit recurrence
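For readers unfamiliar with the "ODEs instead of discrete layers" idea, here is a rough single-neuron sketch of a liquid time-constant (LTC) style update using a fused semi-implicit Euler step. It is not taken from liquid-audio-nets; `ltc_step`, `run`, the weights, and the non-uniform `dts` (a crude stand-in for "timestep scales with signal complexity") are all assumptions made for illustration.

```python
import math


def ltc_step(x, u, dt, tau=1.0, w_in=1.0, w_rec=0.5, bias=0.0, a=1.0):
    """One fused Euler step of dx/dt = -x / tau + f(x, u) * (a - x).

    f is a sigmoid gate, so the effective time constant depends on the input.
    """
    f = 1.0 / (1.0 + math.exp(-(w_in * u + w_rec * x + bias)))
    return (x + dt * f * a) / (1.0 + dt * (1.0 / tau + f))


def run(signal, dts, x0=0.0):
    """Carry the hidden state across samples; each sample has its own dt."""
    x, states = x0, []
    for u, dt in zip(signal, dts):
        x = ltc_step(x, u, dt)
        states.append(x)
    return states


signal = [0.0, 0.2, 1.5, -0.7, 0.1]  # toy input samples
dts = [0.1, 0.1, 0.02, 0.02, 0.1]    # hypothetical: smaller steps on the burst
states = run(signal, dts)
print(states)
```

With `a = 1.0` and a start state in [0, 1], this update keeps the state inside [0, 1] for any positive `dt`, which is one reason the fused form is convenient for variable step sizes.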
So the situation in the LLM world is that they have basically indexed all of the available internet and are now trying to maximize the effect with test-time compute.
People say GPT-5, for example, traded fewer layers for more test-time tokens. A paper on the subject from DeepMind:
Scaling LLM Test-Time Compute Optimally can be More Effective than Scaling Model Parameters
https://arxiv.org/abs/2408.03314
Speech is a few years behind as usual, and there are not many test-time compute papers yet (although MAP adaptation was a thing a long time ago). But it is sure to become popular soon.
Chain-of-Thought Training for Open E2E Spoken Dialogue Systems
https://arxiv.org/abs/2506.00722
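As a reminder of what spending test-time compute means in its simplest form, here is a toy best-of-N sketch: sample several candidates and keep the one a scorer likes best. It is not code from either paper; `best_of_n`, `toy_generate` and `toy_score` are hypothetical stand-ins for a model sampler and a verifier.

```python
import random


def best_of_n(generate, score, n, rng):
    """Draw n candidates and return the one with the highest score."""
    candidates = [generate(rng) for _ in range(n)]
    return max(candidates, key=score)


def toy_generate(rng):
    # Stand-in for sampling one answer from a model.
    return rng.gauss(0.0, 1.0)


def toy_score(candidate):
    # Stand-in for a verifier: answers closer to 2.0 score higher.
    return -abs(candidate - 2.0)


# Same seed for both runs, so the single sample is also the first of the 64.
single = best_of_n(toy_generate, toy_score, 1, random.Random(0))
many = best_of_n(toy_generate, toy_score, 64, random.Random(0))
print(toy_score(single), toy_score(many))
```

More samples can only keep or improve the best score here, at a cost that grows linearly with N; that is the basic compute-for-quality trade being discussed.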
https://seed.bytedance.com/en/seed_liveinterpret
https://arxiv.org/abs/2507.17527
Seed LiveInterpret 2.0: End-to-end Simultaneous Speech-to-speech Translation with Your Voice
Shanbo Cheng et al.
Simultaneous Interpretation (SI) represents one of the most daunting frontiers in the translation industry, with product-level automatic systems long plagued by intractable challenges: subpar transcription and translation quality, lack of real-time speech generation, multi-speaker confusion, and translated speech inflation, especially in long-form discourses. In this study, we introduce Seed-LiveInterpret 2.0, an end-to-end SI model that delivers high-fidelity, ultra-low-latency speech-to-speech generation with voice cloning capabilities. As a fully operational product-level solution, Seed-LiveInterpret 2.0 tackles these challenges head-on through our novel duplex speech-to-speech understanding-generating framework...
Interspeech 2025 starts tomorrow; I have yet to read the papers.
Interesting that some people are leaving speech; the Mesolitica developer, for example, said he has released his last model
https://x.com/huseinzol05/status/1956638778367578265
Just learned Alan Black retired to Alaska some time ago:
https://www.cs.cmu.edu/~awb/
Not many familiar names in the IS papers either; so many people are gone.
Comprehensive Google survey on lightweight keyword spotting
https://github.com/google-research/google-research/tree/master/kws_streaming#streamable-and-non-streamable-models
This model is recommended on our Reddit. Just 10k params:
https://github.com/Qualcomm-AI-research/bcresnet
From our Reddit:
https://www.reddit.com/r/speechtech/comments/1mmrc3b/comment/n93hm1h/
Diffusion in ASR too. No code yet; hopefully it will be there soon. Nice benchmarks: Gemini tops on speech (confirmed by our tests too).
https://arxiv.org/abs/2507.18452
DIFFA: Large Language Diffusion Models Can Listen and Understand
Jiaming Zhou, Hongjie Chen, Shiwan Zhao, Jian Kang, Jie Li, Enzhi Wang, Yujie Guo, Haoqin Sun, Hui Wang, Aobo Kong, Yong Qin, Xuelong Li
Recent advances in large language models (LLMs) have shown remarkable capabilities across textual and multimodal domains. In parallel, diffusion-based language models have emerged as a promising alternative to the autoregressive paradigm, offering improved controllability, bidirectional context modeling, and robust generation. However, their application to the audio modality remains underexplored. In this work, we introduce DIFFA, the first diffusion-based large audio-language model designed to perform spoken language understanding. DIFFA integrates a frozen diffusion language model with a lightweight dual-adapter architecture that bridges speech understanding and natural language reasoning. We employ a two-stage training pipeline: first, aligning semantic representations via an ASR objective; then, learning instruction-following abilities through synthetic audio-caption pairs automatically generated by prompting LLMs. Despite being trained on only 960 hours of ASR and 127 hours of synthetic instruction data, DIFFA demonstrates competitive performance on major benchmarks, including MMSU, MMAU, and VoiceBench, outperforming several autoregressive open-source baselines. Our results reveal the potential of diffusion-based language models for efficient and scalable audio understanding, opening a new direction for speech-driven AI. Our code will be available at this https URL.
While some things are questionable, the return to phonemes is nice
https://github.com/tabahi/contexless-phonemes-CUPE
https://github.com/tabahi/bournemouth-forced-aligner