BUT (Brno University of Technology) continues work on Whisper where OpenAI left off
https://github.com/BUTSpeechFIT/SOT-DiCoW
https://arxiv.org/abs/2510.03723
Adapting Diarization-Conditioned Whisper for End-to-End Multi-Talker Speech Recognition
Martin Kocour, Martin Karafiat, Alexander Polok, Dominik Klement, Lukáš Burget, Jan Černocký
We propose a speaker-attributed (SA) Whisper-based model for multi-talker speech recognition that combines target-speaker modeling with serialized output training (SOT). Our approach leverages a Diarization-Conditioned Whisper (DiCoW) encoder to extract target-speaker embeddings, which are concatenated into a single representation and passed to a shared decoder. This enables the model to transcribe overlapping speech as a serialized output stream with speaker tags and timestamps. In contrast to target-speaker ASR systems such as DiCoW, which decode each speaker separately, our approach performs joint decoding, allowing the decoder to condition on the context of all speakers simultaneously. Experiments show that the model outperforms existing SOT-based approaches and surpasses DiCoW on multi-talker mixtures (e.g., LibriMix).
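A toy sketch of the difference between per-speaker decoding (DiCoW) and SOT-style joint decoding; the encoder and all shapes are made-up stubs, not the actual DiCoW implementation:

```python
import numpy as np

rng = np.random.default_rng(0)
T, D, S = 100, 8, 2  # frames, embedding dim, speakers (made up)

def dicow_encoder(audio, speaker_mask):
    """Stub for a diarization-conditioned encoder: returns one
    target-speaker representation [T, D] per speaker."""
    return rng.standard_normal((T, D)) * speaker_mask[:, None]

audio = rng.standard_normal(16000)
masks = [rng.integers(0, 2, T).astype(float) for _ in range(S)]

# Per-speaker decoding (DiCoW): S separate decoder passes.
per_speaker_inputs = [dicow_encoder(audio, m) for m in masks]

# SOT-DiCoW: concatenate the target-speaker representations into one
# sequence and run a single decoder pass that sees all speakers at once.
joint_input = np.concatenate(per_speaker_inputs, axis=0)  # [S*T, D]

assert per_speaker_inputs[0].shape == (T, D)
assert joint_input.shape == (S * T, D)
```

The decoder conditioned on `joint_input` can then emit one serialized stream with speaker tags, instead of S independent transcripts.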
Very true. A good read for those chasing 100 ms response times
https://www.speechmatics.com/company/articles-and-news/why-fastest-voice-tech-is-a-trap
Why “fastest” voice tech is a trap
Why chasing the 'fastest' speech-to-text breaks voice agents. Discover how Speechmatics balances speed and accuracy for real-world conversations.
Somewhat advanced recent TTS: basically a modern F5 with VibeVoice latents at 7.5 Hz and distillation from a diffusion model
https://github.com/smallbraineng/smalltts
Everyone talks about https://github.com/neuphonic/neucodec, a 50 Hz, 0.8 kbps, 24 kHz audio codec.
Neuphonic itself celebrates a funding round as a promising British startup.
Not sure why: the codec is really huge at 800M parameters, so it must be very context-dependent.
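The headline numbers are internally consistent; a quick check of what the bitrate implies per frame:

```python
# NeuCodec's advertised figures: 50 frames per second at 0.8 kbps.
frame_rate_hz = 50
bitrate_bps = 800  # 0.8 kbps

# Bits available to encode each codec frame.
bits_per_frame = bitrate_bps / frame_rate_hz
assert bits_per_frame == 16.0  # i.e. a 2**16-entry code space per frame
```

So each 20 ms frame is squeezed into 16 bits, which helps explain why a large, context-heavy model is needed to reconstruct audio from it.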
Meta's take on audio LLMs
https://arxiv.org/abs/2510.06195
Latent Speech-Text Transformer
Yen-Ju Lu, Yashesh Gaur, Wei Zhou, Benjamin Muller, Jesus Villalba, Najim Dehak, Luke Zettlemoyer, Gargi Ghosh, Mike Lewis, Srinivasan Iyer, Duc Le
Auto-regressive speech-text models are typically pre-trained on a large number of interleaved sequences of text tokens and raw speech encoded as speech tokens using vector quantization. These models have demonstrated state-of-the-art performance in speech-to-speech understanding and generation benchmarks, together with promising scaling laws, primarily enabled by the representational alignment between text and speech. Nevertheless, they suffer from shortcomings, partly owing to the disproportionately longer sequences of speech tokens in contrast to textual tokens. This results in a large compute imbalance between modalities during pre-training as well as during inference, and a potential hindrance to effectively aligning speech and text, ultimately translating to several orders of magnitude slower scaling laws. We introduce the Latent Speech-Text Transformer (LST), which makes pre-training speech-text models more data-efficient by dynamically and inexpensively aggregating speech tokens into latent speech patches. These patches serve as higher-level units that can either align with corresponding textual units to aid capability transfer or even encapsulate common speech sequences like silences to be more compute-efficient. We show that LST outperforms vanilla approaches on speech-to-speech as well as text-to-text benchmarks in both data- and compute-controlled settings, the former indicating more effective representational alignment and the latter indicating steeper scaling laws for speech-text models. On HellaSwag story completion, LST achieves 6.5% absolute gain in speech accuracy under compute-controlled training and 5.3% under data-controlled training, while also improving text performance. We will release our models, code, and the evaluation data to facilitate further research.
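LST's aggregation into latent patches is learned and dynamic; as a crude illustration of why patching shortens the speech sequence, here is a made-up run-length version that merges repeated tokens (e.g. long silences) into single units:

```python
def runlength_patches(tokens):
    """Collapse runs of identical speech tokens into (token, length)
    patches -- a crude stand-in for LST's learned latent patches."""
    patches = []
    for t in tokens:
        if patches and patches[-1][0] == t:
            patches[-1] = (t, patches[-1][1] + 1)
        else:
            patches.append((t, 1))
    return patches

# 12 speech tokens containing a long run of "silence" token 0
seq = [7, 7, 0, 0, 0, 0, 0, 3, 3, 5, 0, 0]
patches = runlength_patches(seq)
assert patches == [(7, 2), (0, 5), (3, 2), (5, 1), (0, 2)]
assert len(patches) < len(seq)  # fewer units -> less decoder compute
```

The real model learns where to merge rather than relying on exact repeats, but the compute argument is the same: fewer higher-level units bring the speech stream closer to text length.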
CoT has to come to speech LLMs too
https://arxiv.org/abs/2510.07497
Can Speech LLMs Think while Listening?
Yi-Jen Shih, Desh Raj, Chunyang Wu, Wei Zhou, SK Bong, Yashesh Gaur, Jay Mahadeokar, Ozlem Kalinli, Mike Seltzer
Recent advances in speech large language models (speech LLMs) have enabled seamless spoken interactions, but these systems still struggle with complex reasoning tasks. Previously, chain-of-thought (CoT) prompting or fine-tuning has been shown to significantly improve the reasoning abilities of text-based LLMs. In this work, we investigate the effect of CoT fine-tuning for multi-stream speech LLMs, demonstrating that reasoning in text space improves the accuracy of speech LLMs by 2.4x, on average, over a suite of spoken reasoning tasks. Beyond accuracy, the latency of the spoken response is a crucial factor for interacting with voice-based agents. Inspired by the human behavior of "thinking while listening," we propose methods to reduce the additional latency from reasoning by allowing the model to start reasoning before the user query has ended. To achieve this, we introduce an entropy-based metric, "question completeness," which acts as an indicator to guide the model on the optimal time to start reasoning. This method provides greater control over the accuracy-latency trade-off compared with heuristic-based approaches and, under equivalent latency conditions, yields a 4% accuracy gain on ARC-Easy. Finally, we use Direct Preference Optimization (DPO) on preference data created using rejection sampling to push the accuracy-latency Pareto frontier further, resulting in a 70% reduction in latency without loss in accuracy.
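The "question completeness" signal is entropy-based; a minimal sketch of the idea (the threshold, distributions, and decision rule here are made up for illustration) of using next-token entropy to decide when the query is likely complete and reasoning can start:

```python
import math

def entropy(p):
    """Shannon entropy (in nats) of a next-token distribution."""
    return -sum(q * math.log(q) for q in p if q > 0)

def should_start_reasoning(next_token_dist, threshold=1.0):
    """Low entropy -> the model is confident about what comes next
    (e.g. end of query), so reasoning can start early."""
    return entropy(next_token_dist) < threshold

mid_question = [0.25, 0.25, 0.25, 0.25]  # uncertain: many continuations
near_end     = [0.9, 0.05, 0.03, 0.02]   # confident: query almost done

assert not should_start_reasoning(mid_question)
assert should_start_reasoning(near_end)
```

The paper calibrates this properly against labeled completion points; the point of the sketch is only that a scalar confidence signal gives finer control over the accuracy-latency trade-off than fixed heuristics.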
Next talk is on 17 Oct at 1pm (UTC+0). Alexander Polok from Brno University of Technology is going to talk about multi-talker ASR!
Please also note that the talk is slightly earlier than usual.
Below is the link to the talk.
https://ed-ac-uk.zoom.us/j/88650204315
Meeting ID: 886 5020 4315
Passcode: sigml2011
✉️ Don't forget to subscribe to our mailing list https://groups.google.com/g/isca-sigml
🎦 Previous talks and recordings can be found at https://homepages.inf.ed.ac.uk/htang2/sigml/seminar/
Adapting Single-Speaker ASR to Handle Conversations
State-of-the-art ASR systems perform exceptionally well in single-speaker scenarios, but they often struggle with conversations that feature significant speech overlap. Traditional target-speaker ASR methods, which rely on speaker embeddings or enrollment, face challenges in generalization and typically require prior knowledge of the speakers. To overcome these limitations, this talk introduces DiCoW (Diarization-Conditioned Whisper), which conditions ASR on diarization outputs to achieve robust multi-talker transcription with minimal training data. DiCoW has already powered the award-winning CHiME-8 and MLC-SLM systems.
Building on this success, I will present SE-DiCoW (Self-Enrolled DiCoW), an improved version that automatically resolves speaker ambiguities by selecting enrollments from long-form recordings. The talk will also highlight EMMA MT-ASR, the first unified benchmark for multi-talker ASR, alongside recent DiCoW extensions developed during JSALT 2025, demonstrating the evolving capabilities of diarization-conditioned approaches.
Bio: Alexander Polok is a Junior Researcher and PhD student at the Faculty of Information Technology, Brno University of Technology (BUT). His research focuses on speech recognition, with an emphasis on practical and efficient methods for applying ASR models in conversational settings. He has received several honors, including the Brno PhD Talent Scholarship, the Jury Award for CHiME 8, and the MLC-SLM Best Reproducibility Award. He also participated in the JSALT workshops in 2023 and 2025.
Interesting repo of the day: Whisper adaptation on text only
https://github.com/hon9kon9ize/whistle
https://arxiv.org/abs/2509.10452
WhisTLE: Deeply Supervised, Text-Only Domain Adaptation for Pretrained Speech Recognition Transformers
Akshat Pandey, Karun Kumar, Raphael Tang
Pretrained automatic speech recognition (ASR) models such as Whisper perform well but still need domain adaptation to handle unseen vocabulary and parlance. In many real-world settings, collecting speech data is impractical, necessitating text-only adaptation. We propose WhisTLE, a deeply supervised, text-only adaptation method for pretrained encoder-decoder ASR models. WhisTLE trains a variational autoencoder (VAE) to model encoder outputs from text and fine-tunes the decoder using the learned text-to-latent encoder, optionally combined with text-to-speech (TTS) adaptation. At inference, the original encoder is restored, incurring no extra runtime cost. Across four out-of-domain datasets and four ASR models, WhisTLE with TTS reduces word error rate (WER) by 12.3% relative to TTS-only adaptation and outperforms all non-WhisTLE baselines in 27 of 32 scenarios.
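A hedged sketch of the WhisTLE training/inference split, with all components stubbed out (dimensions, token IDs, and the averaging "decoder" are invented; only the wiring follows the abstract):

```python
import numpy as np

rng = np.random.default_rng(0)
D = 8  # encoder-state dimension (made up)

def speech_encoder(audio):
    """Stub for the frozen pretrained audio encoder."""
    return rng.standard_normal((len(audio) // 160, D))

def text_to_latent(token_ids):
    """Stub for WhisTLE's learned text-to-latent encoder:
    maps text tokens to pseudo encoder states."""
    return rng.standard_normal((len(token_ids) * 4, D))

def decode(encoder_states):
    """Stub decoder; in WhisTLE only this part is fine-tuned."""
    return encoder_states.mean(axis=0)

# Adaptation: the decoder trains on text-derived latents, no audio needed.
train_states = text_to_latent([5, 9, 2])
_ = decode(train_states)

# Inference: the original audio encoder is restored, so there is
# no extra runtime cost versus the unadapted model.
test_states = speech_encoder(np.zeros(3200))
out = decode(test_states)
assert test_states.shape == (20, D)
assert out.shape == (D,)
```

The key property the sketch shows is interface compatibility: the text-to-latent module produces states of the same shape the decoder already expects, so swapping encoders between training and inference is free.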
https://www.linkedin.com/posts/jlqueguiner_life-update-ive-officially-moved-to-new-activity-7386778452181405696-BlEw
We recently learned that Gladia's CEO @JiliJeanlouis has moved to NYC. Congratulations!
I think it's quite an important move, and it says something about Europe.
As technology advances proper evaluation becomes more and more complex. This is a great example
https://arxiv.org/abs/2510.16567
Hallucination Benchmark for Speech Foundation Models
Alkis Koudounas, Moreno La Quatra, Manuel Giollo, Sabato Marco Siniscalchi, Elena Baralis
Hallucinations in automatic speech recognition (ASR) systems refer to fluent and coherent transcriptions produced by neural ASR models that are completely unrelated to the underlying acoustic input (i.e., the speech signal). While similar to conventional decoding errors in potentially compromising the usability of transcriptions for downstream applications, hallucinations can be more detrimental due to their preservation of syntactically and semantically plausible structure. This apparent coherence can mislead subsequent processing stages and introduce serious risks, particularly in critical domains such as healthcare and law. Conventional evaluation metrics are primarily centered on error-based metrics and fail to distinguish between phonetic inaccuracies and hallucinations. Consequently, there is a critical need for new evaluation frameworks that can effectively identify and assess models with a heightened propensity for generating hallucinated content. To this end, we introduce SHALLOW, the first benchmark framework that systematically categorizes and quantifies hallucination phenomena in ASR along four complementary axes: lexical, phonetic, morphological, and semantic. We define targeted metrics within each category to produce interpretable profiles of model behavior. Through evaluation across various architectures and speech domains, we have found that SHALLOW metrics correlate strongly with word error rate (WER) when recognition quality is high (i.e., low WER). Still, this correlation weakens substantially as WER increases. SHALLOW, therefore, captures fine-grained error patterns that WER fails to distinguish under degraded and challenging conditions. Our framework supports specific diagnosis of model weaknesses and provides feedback for model improvement beyond what aggregate error rates can offer.
https://arxiv.org/abs/2510.16567
Hallucination Benchmark for Speech Foundation Models
Alkis Koudounas, Moreno La Quatra, Manuel Giollo, Sabato Marco Siniscalchi, Elena Baralis
Hallucinations in automatic speech recognition (ASR) systems refer to fluent and coherent transcriptions produced by neural ASR models that are completely unrelated to the underlying acoustic input (i.e., the speech signal). While similar to conventional decoding errors in potentially compromising the usability of transcriptions for downstream applications, hallucinations can be more detrimental due to their preservation of syntactically and semantically plausible structure. This apparent coherence can mislead subsequent processing stages and introduce serious risks, particularly in critical domains such as healthcare and law. Conventional evaluation metrics are primarily centered on error-based metrics and fail to distinguish between phonetic inaccuracies and hallucinations. Consequently, there is a critical need for new evaluation frameworks that can effectively identify and assess models with a heightened propensity for generating hallucinated content. To this end, we introduce SHALLOW, the first benchmark framework that systematically categorizes and quantifies hallucination phenomena in ASR along four complementary axes: lexical, phonetic, morphological, and semantic. We define targeted metrics within each category to produce interpretable profiles of model behavior. Through evaluation across various architectures and speech domains, we have found that SHALLOW metrics correlate strongly with word error rate (WER) when recognition quality is high (i.e., low WER). Still, this correlation weakens substantially as WER increases. SHALLOW, therefore, captures fine-grained error patterns that WER fails to distinguish under degraded and challenging conditions. Our framework supports specific diagnosis of model weaknesses and provides feedback for model improvement beyond what aggregate error rates can offer.
arXiv.org
Hallucination Benchmark for Speech Foundation Models
Hallucinations in automatic speech recognition (ASR) systems refer to fluent and coherent transcriptions produced by neural ASR models that are completely unrelated to the underlying acoustic...
Played with Qwen3-Omni a bit. The full version requires 90 GB of RAM; the 4-bit quantization fits in 24 GB. The 4-bit version only runs with vLLM and doesn't support audio output yet.
Speech recognition accuracy in the HF space is OK, but intelligence is below expectations. Video understanding is not really required for us.
My impression is that the video part makes this model too big for practical speech use cases, as it requires huge compute. A pure audio model might be lighter and more accurate.
People still use whisperX for speaker separation and recognition; the pyannote-audio 4 patch, which adds pyannote/speaker-diarization-community-1 (offline) and pyannote/speaker-diarization-precision-2 (hosted by pyannote), is pending
https://github.com/m-bain/whisperX/pull/1243
This is an interesting talk. We also recommend participating online, since Google DeepMind frequently doesn't allow recordings; there have been many cases like that.
[Oct 30th, 2025]
Gemini Voice Agent: A Natively Multimodal Dialog Model with Advanced Reasoning and Tool Use
Presenter: Michael Han (Google DeepMind)
https://poonehmousavi.github.io/rg.html
https://concordia-ca.zoom.us/j/81004805542
Some emotion work from LAION: the Emolia dataset, with fine-grained emotion annotation for Emilia data
https://huggingface.co/datasets/laion/Emolia
https://arxiv.org/abs/2506.09827
EmoNet-Voice: A Fine-Grained, Expert-Verified Benchmark for Speech Emotion Detection
Christoph Schuhmann, Robert Kaczmarczyk, Gollam Rabby, Felix Friedrich, Maurice Kraus, Kourosh Nadi, Huu Nguyen, Kristian Kersting, Sören Auer
The advancement of text-to-speech and audio generation models necessitates robust benchmarks for evaluating the emotional understanding capabilities of AI systems. Current speech emotion recognition (SER) datasets often exhibit limitations in emotional granularity, privacy concerns, or reliance on acted portrayals. This paper introduces EmoNet-Voice, a new resource for speech emotion detection, which includes EmoNet-Voice Big, a large-scale pre-training dataset (featuring over 4,500 hours of speech across 11 voices, 40 emotions, and 4 languages), and EmoNet-Voice Bench, a novel benchmark dataset with human expert annotations. EmoNet-Voice is designed to evaluate SER models on a fine-grained spectrum of 40 emotion categories with different levels of intensities. Leveraging state-of-the-art voice generation, we curated synthetic audio snippets simulating actors portraying scenes designed to evoke specific emotions. Crucially, we conducted rigorous validation by psychology experts who assigned perceived intensity labels. This synthetic, privacy-preserving approach allows for the inclusion of sensitive emotional states often absent in existing datasets. Lastly, we introduce Empathic Insight Voice models that set a new standard in speech emotion recognition with high agreement with human experts. Our evaluations across the current model landscape exhibit valuable findings, such as high-arousal emotions like anger being much easier to detect than low-arousal states like concentration.
From the comments to the KaniTTS release
https://www.reddit.com/r/LocalLLaMA/comments/1oitanf/just_dropped_kani_tts_english_a_400m_tts_model/
A nice quick evaluation of TTS engines. Kokoro leads due to stability; many other systems show issues
https://paper2audio.com/posts/review-of-text-to-speech-models-for-reading-research-papers
https://github.com/pykeio/earshot
Very fast voice activity detection in Rust, 10 times faster than TEN VAD
The attention patterns in speech definitely have potential
https://github.com/smulelabs/windowed-roformer
Efficient Vocal Source Separation Through Windowed Sink Attention
State-of-the-art vocal separation models like Mel-Band-Roformer rely on full temporal self-attention mechanisms, where each temporal frame interacts with every other frame. This incurs heavy computational costs that scale quadratically with input audio length, motivating chunking and windowing approaches. Through analysis of a pre-trained vocal separation model, we discovered that temporal attention patterns are highly localized. Building on this insight, we replaced full attention with windowed sink attention (WSA) with a small temporal attention window and attention sinks. We show empirically that fine-tuning from the original checkpoint recovers 92% of the original SDR performance while reducing FLOPs by 44.5x.
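The mechanics of windowed sink attention can be illustrated with a boolean attention mask: each frame attends only to a local window plus a few global "sink" positions. Window size and sink count below are made up, not the paper's settings:

```python
import numpy as np

def wsa_mask(T, window=2, n_sinks=1):
    """Boolean [T, T] mask: True where frame i may attend to frame j.
    Local window of +/- `window` frames, plus `n_sinks` sink positions
    at the start of the sequence that every frame can attend to."""
    i = np.arange(T)[:, None]
    j = np.arange(T)[None, :]
    local = np.abs(i - j) <= window
    sinks = j < n_sinks
    return local | sinks

m = wsa_mask(T=6, window=1, n_sinks=1)
assert m[5, 4] and m[5, 5]  # local neighbours visible
assert m[5, 0]              # sink always visible
assert not m[5, 2]          # distant non-sink frame masked out
assert m.sum() < 6 * 6      # far fewer pairs than full attention
```

With a fixed window, the number of attended pairs grows linearly with sequence length rather than quadratically, which is where the FLOP savings come from.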
Related is
https://github.com/SamsungLabs/SummaryMixing
SummaryMixing is a linear-time alternative to self-attention (SA) for speech processing models such as Transformers, Conformers or Branchformers. Instead of computing pair-wise scores between tokens (leading to quadratic-time complexity for SA), it summarises a whole utterance with a mean over vectors for all time steps. SummaryMixing is based on recent findings demonstrating that self-attention can be useless for speech recognition, as the attention weights of trained ASR systems are almost uniformly distributed across the tokens composing a sequence. SummaryMixing is also a generalisation of the recent HyperMixer and HyperConformer to better and simpler mixing functions. In a SummaryMixing cell, which takes the same inputs and produces the same outputs as self-attention, contributions from each time step are first transformed and then averaged globally before being fed back to each time step. This is visible in Figure 1 in the article. Therefore, the time complexity is reduced to linear.
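The cell described above can be sketched in a few lines; the linear projections below stand in for the small per-branch networks of the actual method, and all dimensions are invented:

```python
import numpy as np

rng = np.random.default_rng(0)
T, D = 50, 16  # time steps, feature dim (made up)
X = rng.standard_normal((T, D))

# Stand-ins for the learned transformations (MLPs in the real model).
Wf, Ws, Wc = (rng.standard_normal((D, D)) * 0.1 for _ in range(3))

def summary_mixing(X):
    """Linear-time token mixing: a per-step local transform, one global
    mean summary of the utterance, then the summary is combined back
    into every time step."""
    local = X @ Wf                    # [T, D] per-time-step branch
    summary = (X @ Ws).mean(axis=0)   # [D] one vector for the utterance
    return (local + summary) @ Wc     # summary broadcast to all steps

Y = summary_mixing(X)
assert Y.shape == (T, D)  # same interface as self-attention, O(T) cost
```

Every operation is a matrix product or a mean over T, so cost is linear in sequence length; there is no T-by-T score matrix anywhere.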
News from another universe
LongCat-Flash-Omni is open sourced: Multimodal + Low-Latency
* ScMoE architecture on LongCat-Flash: 560B Parameters, 27B Active
* Leading Performance among Open-Source Omni-modal models
* Training: Novel Early-Fusion Omni-modal training paradigm -> No Single Modality Left Behind
* Real-time Spoken Interaction: Millisecond-level E2E latency
* 128K context; supports >8 min of real-time AV interaction
* Multimodal I/O: Arbitrary Combination of Text/Image/Audio/Video Input → Text/Speech Output (w/ LongCat-Audio-Codec)
* Efficient Infrastructure: with optimized modality-decoupled parallel training, Omni sustains >90% of the throughput of pure-text training.
https://github.com/meituan-longcat/LongCat-Flash-Omni
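The "560B parameters, 27B active" bullet is the usual sparse mixture-of-experts trade-off: every token is routed to only a few experts, so total capacity is large while per-token compute stays small. A generic top-k routing sketch (not LongCat's ScMoE; names and the tanh expert are illustrative):

```python
import numpy as np

def moe_forward(x, expert_weights, gate_W, top_k=2):
    """Toy MoE layer: route each token to its top_k experts.
    x: (T, d); expert_weights: (E, d, d); gate_W: (d, E)."""
    logits = x @ gate_W                         # router scores, (T, E)
    out = np.zeros_like(x)
    for t in range(x.shape[0]):
        top = np.argsort(logits[t])[-top_k:]    # indices of the chosen experts
        g = np.exp(logits[t][top])
        g /= g.sum()                            # renormalized gate weights
        for w, e in zip(g, top):
            out[t] += w * np.tanh(x[t] @ expert_weights[e])
    return out
```

With E experts and top_k active per token, only a top_k/E fraction of expert parameters is touched per token, which is how a 560B-parameter model can run with ~27B active.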