Interesting repo of the day: Whisper adaptation using text only
https://github.com/hon9kon9ize/whistle
https://arxiv.org/abs/2509.10452
WhisTLE: Deeply Supervised, Text-Only Domain Adaptation for Pretrained Speech Recognition Transformers
Akshat Pandey, Karun Kumar, Raphael Tang
Pretrained automatic speech recognition (ASR) models such as Whisper perform well but still need domain adaptation to handle unseen vocabulary and parlance. In many real-world settings, collecting speech data is impractical, necessitating text-only adaptation. We propose WhisTLE, a deeply supervised, text-only adaptation method for pretrained encoder-decoder ASR models. WhisTLE trains a variational autoencoder (VAE) to model encoder outputs from text and fine-tunes the decoder using the learned text-to-latent encoder, optionally combined with text-to-speech (TTS) adaptation. At inference, the original encoder is restored, incurring no extra runtime cost. Across four out-of-domain datasets and four ASR models, WhisTLE with TTS reduces word error rate (WER) by 12.3% relative to TTS-only adaptation and outperforms all non-WhisTLE baselines in 27 of 32 scenarios.
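To make the "12.3% relative" figure concrete, here is a minimal sketch of how a relative WER reduction is usually computed. The function name and the numbers in the example are our own illustration, not taken from the paper.

```python
def relative_wer_reduction(baseline_wer: float, adapted_wer: float) -> float:
    """Return the relative WER reduction in percent.

    Both arguments are WER values on the same scale (e.g. percent).
    """
    if baseline_wer <= 0:
        raise ValueError("baseline WER must be positive")
    return 100.0 * (baseline_wer - adapted_wer) / baseline_wer


# Hypothetical numbers for illustration only (not results from the paper):
# going from 20.0% WER to 17.54% WER is a 12.3% relative reduction,
# although the absolute gain is only 2.46 points.
print(round(relative_wer_reduction(20.0, 17.54), 1))
```

Worth keeping in mind when reading such abstracts: a relative reduction says nothing about the absolute WER level.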
https://www.linkedin.com/posts/jlqueguiner_life-update-ive-officially-moved-to-new-activity-7386778452181405696-BlEw
We recently learned that Gladia's CEO @JiliJeanlouis has moved to NYC. Congratulations!
I think it's a fairly important move, and it says a lot about Europe.
*Life update* I’ve officially moved to New York City to lead Gladia Inc. — and I can’t wait to celebrate this new chapter with you all next week! 🇺🇸🎉
What a ride it’s been. My wife and I packed up our three kids (losing their favorite toys in the process…
As technology advances, proper evaluation becomes more and more complex. This is a great example:
https://arxiv.org/abs/2510.16567
Hallucination Benchmark for Speech Foundation Models
Alkis Koudounas, Moreno La Quatra, Manuel Giollo, Sabato Marco Siniscalchi, Elena Baralis
Hallucinations in automatic speech recognition (ASR) systems refer to fluent and coherent transcriptions produced by neural ASR models that are completely unrelated to the underlying acoustic input (i.e., the speech signal). While similar to conventional decoding errors in potentially compromising the usability of transcriptions for downstream applications, hallucinations can be more detrimental due to their preservation of syntactically and semantically plausible structure. This apparent coherence can mislead subsequent processing stages and introduce serious risks, particularly in critical domains such as healthcare and law. Conventional evaluation metrics are primarily centered on error-based metrics and fail to distinguish between phonetic inaccuracies and hallucinations. Consequently, there is a critical need for new evaluation frameworks that can effectively identify and assess models with a heightened propensity for generating hallucinated content. To this end, we introduce SHALLOW, the first benchmark framework that systematically categorizes and quantifies hallucination phenomena in ASR along four complementary axes: lexical, phonetic, morphological, and semantic. We define targeted metrics within each category to produce interpretable profiles of model behavior. Through evaluation across various architectures and speech domains, we have found that SHALLOW metrics correlate strongly with word error rate (WER) when recognition quality is high (i.e., low WER). Still, this correlation weakens substantially as WER increases. SHALLOW, therefore, captures fine-grained error patterns that WER fails to distinguish under degraded and challenging conditions. Our framework supports specific diagnosis of model weaknesses and provides feedback for model improvement beyond what aggregate error rates can offer.
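The point that WER cannot tell a phonetic slip from a hallucination is easy to show with a toy case. The sketch below is our own illustration: the sentences are invented, and the character-level similarity from `difflib` is only a crude stand-in for a phonetic axis, not one of the SHALLOW metrics.

```python
from difflib import SequenceMatcher


def word_edit_distance(ref: list, hyp: list) -> int:
    """Levenshtein distance between two word lists."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: word edit distance divided by reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    return word_edit_distance(ref, hyp) / len(ref)


def char_similarity(reference: str, hypothesis: str) -> float:
    """Crude surface similarity in [0, 1]; a rough proxy, not a phonetic metric."""
    return SequenceMatcher(None, reference, hypothesis).ratio()


reference = "please take two pills"
near_miss = "pleas tak too pils"                 # every word wrong, but close in sound
hallucination = "thanks for watching everyone"   # fluent and unrelated

for name, hyp in [("near miss", near_miss), ("hallucination", hallucination)]:
    print(name, wer(reference, hyp), round(char_similarity(reference, hyp), 2))
```

Both hypotheses get the same WER because every word is substituted, while the surface similarity separates them clearly.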
Played with Qwen3-Omni a bit. The full version requires 90 GB of RAM; the 4-bit quantization fits in 24 GB. The 4-bit version only runs with vLLM and doesn't support audio output yet.
Speech recognition accuracy in the HF space is OK, but intelligence is below expectations. Video understanding is not really required for us.
My impression is that the video part makes this model too big for practical speech cases, as it requires huge compute. A pure audio model might be lighter and more accurate.
People still use WhisperX for speaker separation and recognition; the pyannote 4 patch is pending
https://github.com/m-bain/whisperX/pull/1243
Upgrade to pyannote-audio 4 by borgoat · Pull Request #1243 · m-bain/whisperX
There's a couple of new pyannote models: pyannote/speaker-diarization-community-1 (offline) and pyannote/speaker-diarization-precision-2 (hosted by pyannote)
I did a minimal upgrade to pya...
This is an interesting talk. We also recommend participating online, since Google and DeepMind frequently don't allow recordings; there have been many cases like that.
[Oct 30th, 2025]
Gemini Voice Agent: A Natively Multimodal Dialog Model with Advanced Reasoning and Tool Use
Presenter: Michael Han, Google DeepMind
https://poonehmousavi.github.io/rg.html
https://concordia-ca.zoom.us/j/81004805542
Some emotion work from LAION: the Emolia dataset, with fine-grained emotion annotation for Emilia data
https://huggingface.co/datasets/laion/Emolia
EmoNet-Voice: A Fine-Grained, Expert-Verified Benchmark for Speech Emotion Detection
https://arxiv.org/abs/2506.09827
Christoph Schuhmann, Robert Kaczmarczyk, Gollam Rabby, Felix Friedrich, Maurice Kraus, Kourosh Nadi, Huu Nguyen, Kristian Kersting, Sören Auer
The advancement of text-to-speech and audio generation models necessitates robust benchmarks for evaluating the emotional understanding capabilities of AI systems. Current speech emotion recognition (SER) datasets often exhibit limitations in emotional granularity, privacy concerns, or reliance on acted portrayals. This paper introduces EmoNet-Voice, a new resource for speech emotion detection, which includes EmoNet-Voice Big, a large-scale pre-training dataset (featuring over 4,500 hours of speech across 11 voices, 40 emotions, and 4 languages), and EmoNet-Voice Bench, a novel benchmark dataset with human expert annotations. EmoNet-Voice is designed to evaluate SER models on a fine-grained spectrum of 40 emotion categories with different levels of intensities. Leveraging state-of-the-art voice generation, we curated synthetic audio snippets simulating actors portraying scenes designed to evoke specific emotions. Crucially, we conducted rigorous validation by psychology experts who assigned perceived intensity labels. This synthetic, privacy-preserving approach allows for the inclusion of sensitive emotional states often absent in existing datasets. Lastly, we introduce Empathic Insight Voice models that set a new standard in speech emotion recognition with high agreement with human experts. Our evaluations across the current model landscape exhibit valuable findings, such as high-arousal emotions like anger being much easier to detect than low-arousal states like concentration.
From the comments on the KaniTTS release
https://www.reddit.com/r/LocalLLaMA/comments/1oitanf/just_dropped_kani_tts_english_a_400m_tts_model/
Nice quick evaluation of TTS engines. Kokoro leads thanks to its stability; many other systems show issues
https://paper2audio.com/posts/review-of-text-to-speech-models-for-reading-research-papers
https://github.com/pykeio/earshot
Very fast voice activity detection in Rust, 10 times faster than TEN VAD
The attention patterns in speech definitely have potential
https://github.com/smulelabs/windowed-roformer
Efficient Vocal Source Separation Through Windowed Sink Attention
State-of-the-art vocal separation models like Mel-Band-Roformer rely on full temporal self-attention mechanisms, where each temporal frame interacts with every other frame. This incurs heavy computational costs that scale quadratically with input audio length, motivating chunking and windowing approaches. Through analysis of a pre-trained vocal separation model, we discovered that temporal attention patterns are highly localized. Building on this insight, we replaced full attention with windowed sink attention (WSA) with a small temporal attention window and attention sinks. We show empirically that fine-tuning from the original checkpoint recovers 92% of the original SDR performance while reducing FLOPs by 44.5x.
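A rough sketch of what a windowed attention mask with sinks can look like. This is our own reading of the abstract, assuming a symmetric (non-causal) window and sinks placed at the first frames; the function names and parameters are hypothetical and not the repository's API.

```python
def windowed_sink_mask(num_frames: int, window: int, num_sinks: int) -> list:
    """Build a boolean mask where mask[i][j] is True if frame i may attend to frame j.

    Assumptions (ours, not confirmed by the paper): the window is symmetric
    around each frame, and the sink positions are the first `num_sinks` frames.
    """
    return [
        [abs(i - j) <= window or j < num_sinks for j in range(num_frames)]
        for i in range(num_frames)
    ]


def attended_pairs(mask: list) -> int:
    """Count the (query, key) pairs that are actually computed."""
    return sum(sum(row) for row in mask)


mask = windowed_sink_mask(num_frames=8, window=1, num_sinks=1)
full = 8 * 8
print(attended_pairs(mask), "of", full, "pairs kept")
```

With a fixed window the number of kept pairs grows linearly with the number of frames instead of quadratically. The toy count here only covers attention scores and is not meant to reproduce the 44.5x FLOPs figure.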
Related is
https://github.com/SamsungLabs/SummaryMixing
SummaryMixing is a linear-time alternative to self-attention (SA) for speech processing models such as Transformers, Conformers or Branchformers. Instead of computing pair-wise scores between tokens (leading to quadratic-time complexity for SA), it summarises a whole utterance with mean over vectors for all time steps. SummaryMixing is based on the recent findings demonstrating that self-attention could be useless for speech recognition as the attention weights of trained ASR systems are almost uniformly distributed across the tokens composing a sequence. SummaryMixing also is a generalisation of the recent HyperMixer and HyperConformer to better and simpler mixing functions. In a SummaryMixing cell, which takes the same inputs and produces the same outputs as self-attention, contributions from each time step are first transformed and then averaged globally before being fed back to each time step. This is visible in Figure 1 in the article. Therefore, the time-complexity is reduced to linear.
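The "transform, average globally, feed back" idea can be sketched in a few lines of plain Python. This is only a toy illustration under our own assumptions: in the real model the three functions are learned networks, while here they are simple hypothetical placeholders passed in as arguments.

```python
from typing import Callable, List

Vector = List[float]


def summary_mixing(
    frames: List[Vector],
    local_fn: Callable[[Vector], Vector],
    summary_fn: Callable[[Vector], Vector],
    combine_fn: Callable[[Vector, Vector], Vector],
) -> List[Vector]:
    """One pass over the sequence to build a global summary, one pass to mix it back.

    Cost is linear in the number of frames: no pairwise scores are computed.
    """
    if not frames:
        return []
    # Transform every time step, then average over time to get one summary vector.
    transformed = [summary_fn(x) for x in frames]
    dim = len(transformed[0])
    summary = [sum(v[d] for v in transformed) / len(transformed) for d in range(dim)]
    # Feed the same summary back to every time step alongside its local transform.
    return [combine_fn(local_fn(x), summary) for x in frames]


# Toy placeholders (hypothetical): scale locally, identity for the summary,
# and concatenate local and global parts.
frames = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
out = summary_mixing(
    frames,
    local_fn=lambda x: [2.0 * v for v in x],
    summary_fn=lambda x: list(x),
    combine_fn=lambda local, summary: local + summary,
)
print(out)
```

Every output frame carries the same global part, which matches the observation that trained attention weights are close to uniform anyway.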
News from another universe
LongCat-Flash-Omni is open sourced: Multimodal + Low-Latency
* ScMoE architecture on LongCat-Flash: 560B Parameters, 27B Active
* Leading Performance among Open-Source Omni-modal models
* Training: Novel Early-Fusion Omni-modal training paradigm -> No Single Modality Left Behind
* Real-time Spoken Interaction: Millisecond-level E2E latency
* 128K context + Supports > 8min real-time AV interaction
* Multimodal I/O: Arbitrary Combination of Text/Image/Audio/Video Input → Text/Speech Output (w/ LongCat-Audio-Codec)
* Efficient Infrastructure: With optimized modality-decoupled parallel training, Omni sustains >90% throughput of pure-text training efficiency.
https://github.com/meituan-longcat/LongCat-Flash-Omni
We like reviews. People still use n-gram rescoring + LSTM for best accuracy. The most effective systems just ensemble everything, Kaggle-style.
https://arxiv.org/abs/2507.18161
Recent Trends in Distant Conversational Speech Recognition: A Review of CHiME-7 and 8 DASR Challenges
Samuele Cornell, Christoph Boeddeker, Taejin Park, He Huang, Desh Raj, Matthew Wiesner, Yoshiki Masuyama, Xuankai Chang, Zhong-Qiu Wang, Stefano Squartini, Paola Garcia, Shinji Watanabe
The CHiME-7 and 8 distant speech recognition (DASR) challenges focus on multi-channel, generalizable, joint automatic speech recognition (ASR) and diarization of conversational speech. With participation from 9 teams submitting 32 diverse systems, these challenges have contributed to state-of-the-art research in the field. This paper outlines the challenges' design, evaluation metrics, datasets, and baseline systems while analyzing key trends from participant submissions. From this analysis it emerges that: 1) Most participants use end-to-end (e2e) ASR systems, whereas hybrid systems were prevalent in previous CHiME challenges. This transition is mainly due to the availability of robust large-scale pre-trained models, which lowers the data burden for e2e-ASR. 2) Despite recent advances in neural speech separation and enhancement (SSE), all teams still heavily rely on guided source separation, suggesting that current neural SSE techniques are still unable to reliably deal with complex scenarios and different recording setups. 3) All best systems employ diarization refinement via target-speaker diarization techniques. Accurate speaker counting in the first diarization pass is thus crucial to avoid compounding errors and CHiME-8 DASR participants especially focused on this part. 4) Downstream evaluation via meeting summarization can correlate weakly with transcription quality due to the remarkable effectiveness of large-language models in handling errors. On the NOTSOFAR-1 scenario, even systems with over 50% time-constrained minimum permutation WER can perform roughly on par with the most effective ones (around 11%). 5) Despite recent progress, accurately transcribing spontaneous speech in challenging acoustic environments remains difficult, even when using computationally intensive system ensembles.
We like the in-depth evaluations in this research
https://github.com/Anuttacon/speech_drame
https://arxiv.org/abs/2511.01261
Speech-DRAME: A Framework for Human-Aligned Benchmarks in Speech Role-Play
Jiatong Shi, Jionghao Han, Yichen Lu, Santiago Pascual, Pengfei Wu, Chenye Cui, Shinji Watanabe, Chao Weng, Cong Zhou
Role-play has become a key testbed for generative models, expanding from text-only dialogue to multimodal interaction. Extending role-play to speech captures prosody, emotion, and delivery, but also poses new evaluation challenges. Current pipelines often use audio large language models (ALLMs) as zero-shot judges, which miss paralinguistic cues, collapse multiple aspects into coarse scores, and rely on synthetic speech references that fail to reflect real-world roles. We present Speech-DRAME, a unified framework that contributes at three levels: (i) Speech-DRAME-EvalBench, an evaluation benchmark with bilingual human-annotated data and protocols for training and testing speech evaluation models (SEMs), (ii) DRAME-Eval, a fine-tuned evaluation model, which substantially outperforms zero-shot and few-shot ALLMs, and (iii) Speech-DRAME-RoleBench, a speech role-play benchmark that leverages DRAME-Eval as an automatic judge to compare speech foundation models (SFMs). Speech-DRAME distinguishes between two complementary evaluation strategies: Archetype Evaluation, a top-down approach measuring adherence to broad role archetypes, and Realism Evaluation, a bottom-up approach grounded in real human speech that emphasizes nuanced role quality. Compared to zero-shot ALLM judges, DRAME-Eval achieves stronger agreement with human ratings (Pearson correlation from 0.480 to 0.629 in archetypes, and 0.390 to 0.625 in realism). By integrating transparent benchmark resources, modeling approaches, and system-level evaluation, Speech-DRAME provides the first comprehensive, reproducible foundation for assessing spoken role-play.
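The agreement numbers in the abstract are Pearson correlations between judge scores and human ratings. A minimal sketch of that computation follows; the ratings are made-up placeholders, not data from the benchmark.

```python
import math
from typing import Sequence


def pearson(xs: Sequence[float], ys: Sequence[float]) -> float:
    """Pearson correlation coefficient between two equally long score lists."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant scores")
    return cov / math.sqrt(var_x * var_y)


# Hypothetical ratings on a 1-5 scale for five role-play samples.
human_ratings = [1, 2, 3, 4, 5]
judge_ratings = [2, 1, 4, 3, 5]
print(round(pearson(human_ratings, judge_ratings), 3))
```

A judge that swaps neighbouring ranks, as in this toy case, still correlates fairly well, so it helps to look at the per-aspect scores and not only at a single coefficient.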
Greetings from Voice Tech For All team!
We are pleased to announce the launch of the Voice Tech for All Challenge — a Text-to-Speech (TTS) innovation challenge hosted by IISc and SPIRE Lab, powered by Bhashini, GIZ’s FAIR Forward, ARMMAN, and ARTPARK, along with Google for Developers as our Community Partner.
This challenge invites startups, developers, researchers, students and faculty members to build the next generation of multilingual, expressive Text-to-Speech (TTS) systems, making voice technology accessible to community health workers, especially for low-resource Indian languages.
Why Join?
Access high-quality open datasets in 11 Indian languages (SYSPIN + SPICOR)
Build the SOTA open source multi-speaker, multilingual TTS with accent & style transfer
Winning model to be deployed in maternal health assistant (ARMMAN)
🏆 Prizes worth ₹8.5 Lakhs await!
🔗 Registration link: https://syspin.iisc.ac.in/register
🌐Learn more: https://syspin.iisc.ac.in/voicetechforall
Warm regards,
Team Voice Tech For All
IISc (Indian Institute of Science)
This should have nice properties
https://huggingface.co/aiola/drax-v1
https://github.com/aiola-lab/drax
https://arxiv.org/abs/2510.04162
Drax: Speech Recognition with Discrete Flow Matching
Aviv Navon, Aviv Shamsian, Neta Glazer, Yael Segal-Feldman, Gill Hetz, Joseph Keshet, Ethan Fetaya
Diffusion and flow-based non-autoregressive (NAR) models have shown strong promise in large language modeling, however, their potential for automatic speech recognition (ASR) remains largely unexplored. We propose Drax, a discrete flow matching framework for ASR that enables efficient parallel decoding. To better align training with inference, we construct an audio-conditioned probability path that guides the model through trajectories resembling likely intermediate inference errors, rather than direct random noise to target transitions. Our theoretical analysis links the generalization gap to divergences between training and inference occupancies, controlled by cumulative velocity errors, thereby motivating our design choice. Empirical evaluation demonstrates that our approach attains recognition accuracy on par with state-of-the-art speech models while offering improved accuracy-efficiency trade-offs, highlighting discrete flow matching as a promising direction for advancing NAR ASR.
Sounds reasonable for TTS
https://github.com/auspicious3000/ProsodyLM
ProsodyLM — a speech language model
→ With novel prosody tokenization (not audio tokenization)
→ Achieves superior prosody capabilities with pre-training only (no alignment)
https://arxiv.org/abs/2507.20091
ProsodyLM: Uncovering the Emerging Prosody Processing Capabilities in Speech Language Models
Kaizhi Qian, Xulin Fan, Junrui Ni, Slava Shechtman, Mark Hasegawa-Johnson, Chuang Gan, Yang Zhang
Speech language models refer to language models with speech processing and understanding capabilities. One key desirable capability for speech language models is the ability to capture the intricate interdependency between content and prosody. The existing mainstream paradigm of training speech language models, which converts speech into discrete tokens before feeding them into LLMs, is sub-optimal in learning prosody information -- we find that the resulting LLMs do not exhibit obvious emerging prosody processing capabilities via pre-training alone. To overcome this, we propose ProsodyLM, which introduces a simple tokenization scheme amenable to learning prosody. Each speech utterance is first transcribed into text, followed by a sequence of word-level prosody tokens. Compared with conventional speech tokenization schemes, the proposed tokenization scheme retains more complete prosody information, and is more understandable to text-based LLMs. We find that ProsodyLM can learn surprisingly diverse emerging prosody processing capabilities through pre-training alone, ranging from harnessing the prosody nuances in generated speech, such as contrastive focus, understanding emotion and stress in an utterance, to maintaining prosody consistency in long contexts.
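As a rough illustration of the tokenization idea in the abstract (text first, then a sequence of word-level prosody tokens), here is a minimal sketch. It is not the authors' implementation: the features (mean pitch and duration per word), the value ranges, the bin count, and the token names such as `<prosody>` and `<pitch_3>` are all assumptions made up for the example.

```python
from dataclasses import dataclass


@dataclass
class WordProsody:
    word: str
    pitch_hz: float    # mean F0 over the word (hypothetical feature)
    duration_s: float  # word duration in seconds (hypothetical feature)


def quantize(value: float, lo: float, hi: float, n_bins: int) -> int:
    """Map a value to an integer bin in [0, n_bins - 1], clamping out-of-range values."""
    if hi <= lo or n_bins < 1:
        raise ValueError("need hi > lo and n_bins >= 1")
    idx = int((value - lo) / (hi - lo) * n_bins)
    return max(0, min(n_bins - 1, idx))


def tokenize_utterance(words: list[WordProsody], n_bins: int = 8) -> list[str]:
    """Emit the transcript words, a separator, then per-word prosody tokens.

    The ranges (50-400 Hz, 0-1 s) and the token format are illustrative
    assumptions, not values taken from the paper.
    """
    text_tokens = [w.word for w in words]
    prosody_tokens = []
    for w in words:
        pitch_bin = quantize(w.pitch_hz, 50.0, 400.0, n_bins)
        dur_bin = quantize(w.duration_s, 0.0, 1.0, n_bins)
        prosody_tokens.append(f"<pitch_{pitch_bin}>")
        prosody_tokens.append(f"<dur_{dur_bin}>")
    return text_tokens + ["<prosody>"] + prosody_tokens


if __name__ == "__main__":
    utterance = [WordProsody("I", 120.0, 0.1), WordProsody("did", 300.0, 0.5)]
    print(tokenize_utterance(utterance))
```

The point of such a layout is that the text part stays ordinary text a text-based LLM already understands, while prosody is carried by a small separate vocabulary aligned to words.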
It's important to have the means to adjust network behaviour, so methods like the one below are very interesting
https://arxiv.org/abs/2505.12973
Fast, Not Fancy: Rethinking G2P with Rich Data and Rule-Based Models
Homograph disambiguation remains a significant challenge in grapheme-to-phoneme (G2P) conversion, especially for low-resource languages. This challenge is twofold: (1) creating balanced and...
Also
Combining Autoregressive Models and Phonological Knowledge Bases for Improved Accuracy in Korean Grapheme-to-Phoneme Conversion
https://ieeexplore.ieee.org/document/11045935