The CHiME Challenge is way more dense than Interspeech. The CHiME-8 workshop just ended:
https://www.chimechallenge.org/current/workshop/index
Congrats to the STC team: as usual, they demonstrated top performance on the CHiME tasks.
No publications yet, but even the keynote talk is interesting:
Teaching New Skills to Foundation Models: Insights and Experiences
Speaker: Hung-yi Lee
National Taiwan University (NTU)
https://www.chimechallenge.org/current/workshop/CHiME2024_Lee.pdf
CHiME 2024 Workshop: September 6, 2024, Kos International Convention Centre (KICC), Kos Island, Greece
Also, the NOTSOFAR task is interesting and directly relevant for the many startups trying to implement meeting transcription:
To illustrate, on our newly released NOTSOFAR meeting benchmark, Whisper large-v3 with head-mounted mics achieves 9.3% WER (word-error-rate), yet on audio from a distant mic it climbs to 37.4% WER. The culprits are reverberation, noise, and overlapping speech, which interfere with the source signal.
https://arxiv.org/abs/2401.08887
NOTSOFAR-1 Challenge: New Datasets, Baseline, and Tasks for Distant Meeting Transcription
Alon Vinnikov, Amir Ivry, Aviv Hurvitz, Igor Abramovski, Sharon Koubi, Ilya Gurvich, Shai Pe`er, Xiong Xiao, Benjamin Martinez Elizalde, Naoyuki Kanda, Xiaofei Wang, Shalev Shaer, Stav Yagev, Yossi Asher, Sunit Sivasankaran, Yifan Gong, Min Tang, Huaming Wang, Eyal Krupka
We introduce the first Natural Office Talkers in Settings of Far-field Audio Recordings (``NOTSOFAR-1'') Challenge alongside datasets and baseline system. The challenge focuses on distant speaker diarization and automatic speech recognition (DASR) in far-field meeting scenarios, with single-channel and known-geometry multi-channel tracks, and serves as a launch platform for two new datasets: First, a benchmarking dataset of 315 meetings, averaging 6 minutes each, capturing a broad spectrum of real-world acoustic conditions and conversational dynamics. It is recorded across 30 conference rooms, featuring 4-8 attendees and a total of 35 unique speakers. Second, a 1000-hour simulated training dataset, synthesized with enhanced authenticity for real-world generalization, incorporating 15,000 real acoustic transfer functions. The tasks focus on single-device DASR, where multi-channel devices always share the same known geometry. This is aligned with common setups in actual conference rooms, and avoids technical complexities associated with multi-device tasks. It also allows for the development of geometry-specific solutions. The NOTSOFAR-1 Challenge aims to advance research in the field of distant conversational speech recognition, providing key resources to unlock the potential of data-driven methods, which we believe are currently constrained by the absence of comprehensive high-quality training and benchmarking datasets.
Fun fact: the amount of fake content keeps growing. For example, a user just sent me an article which mentions a Vosk Indonesian model and even gives a link to it. The problem is we never had one! The article is clearly auto-generated.
https://x.com/FishAudio/status/1833787529595912531
Excited to introduce Fish Speech 1.4 - now open-source and more powerful than ever! Our mission is to make cutting-edge voice tech accessible to everyone.
What's new:
- Trained on 700k hours of multilingual data (up from 200k)
- Now supports 8 languages: English, Chinese, German, Japanese, French, Spanish, Korean, and Arabic
- Fully open-source, empowering developers and researchers worldwide
Key features:
- Lightning-fast TTS with ultra-low latency
- Instant voice cloning
- Self-host or use our cloud service
- Simple, flat-rate pricing
Try it out:
- Playground: https://fish.audio
- GitHub: https://github.com/fishaudio/fish-speech
- HuggingFace Model: https://huggingface.co/fishaudio/fish-speech-1.4
- Demo: https://huggingface.co/spaces/fishaudio/fish-speech-1
- Product Hunt: https://producthunt.com/posts/fish-speech-1-4
Small LLMs are the right way to go; here is a Chinese LLM with 26M parameters:
https://github.com/jingyaogong/minimind/blob/master/README_en.md
The repo's tagline: "Train a 26M-parameter GPT completely from scratch in just 2 hours!"
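As a sanity check on the parameter count, here is a rough back-of-the-envelope sketch in Python. The dimensions below (vocab 6400, d_model 512, 8 layers, plain 4x MLP) are illustrative assumptions, not necessarily minimind's exact config:

```python
# Back-of-the-envelope parameter count for a small decoder-only transformer.
# The dimensions are illustrative guesses, not necessarily minimind's config.
def gpt_param_count(vocab=6400, d_model=512, n_layers=8, d_ff=2048):
    embed = vocab * d_model            # token embedding (often tied with the LM head)
    attn = 4 * d_model * d_model       # Q, K, V and output projections
    mlp = 2 * d_model * d_ff           # up and down projections
    norms = 4 * d_model                # two layer norms per block
    return embed + n_layers * (attn + mlp + norms)

print(f"{gpt_param_count() / 1e6:.1f}M")  # ~28M, in the ballpark of the claimed 26M
```

The exact figure depends on weight tying and the FFN variant (SwiGLU vs. plain MLP), but it lands in the same ballpark.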
A new paper from the StyleTTS authors. The metrics look good, and finally a proper comparison between systems! But I kind of wonder whether these algorithms are too focused on read speech; it is hard to believe such great metrics would hold on a conversational dataset with the proposed complex algorithms.
https://arxiv.org/abs/2409.10058
StyleTTS-ZS: Efficient High-Quality Zero-Shot Text-to-Speech Synthesis with Distilled Time-Varying Style Diffusion
Yinghao Aaron Li, Xilin Jiang, Cong Han, Nima Mesgarani
The rapid development of large-scale text-to-speech (TTS) models has led to significant advancements in modeling diverse speaker prosody and voices. However, these models often face issues such as slow inference speeds, reliance on complex pre-trained neural codec representations, and difficulties in achieving naturalness and high similarity to reference speakers. To address these challenges, this work introduces StyleTTS-ZS, an efficient zero-shot TTS model that leverages distilled time-varying style diffusion to capture diverse speaker identities and prosodies. We propose a novel approach that represents human speech using input text and fixed-length time-varying discrete style codes to capture diverse prosodic variations, trained adversarially with multi-modal discriminators. A diffusion model is then built to sample this time-varying style code for efficient latent diffusion. Using classifier-free guidance, StyleTTS-ZS achieves high similarity to the reference speaker in the style diffusion process. Furthermore, to expedite sampling, the style diffusion model is distilled with perceptual loss using only 10k samples, maintaining speech quality and similarity while reducing inference speed by 90%. Our model surpasses previous state-of-the-art large-scale zero-shot TTS models in both naturalness and similarity, offering a 10-20× faster sampling speed, making it an attractive alternative for efficient large-scale zero-shot TTS systems. The audio demo, code and models are available at this https URL.
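For readers unfamiliar with the term, the classifier-free guidance mentioned above is the standard trick of blending conditional and unconditional diffusion predictions. In its usual generic form (not a detail taken from this paper):

```latex
\hat{\epsilon}_\theta(x_t, c) = \epsilon_\theta(x_t, \varnothing)
  + w \left( \epsilon_\theta(x_t, c) - \epsilon_\theta(x_t, \varnothing) \right)
```

where c is the conditioning (here the reference-speaker style) and w > 1 pushes the sample toward the condition at some cost in diversity.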
Everyone is doing DiT in TTS. Honestly, I fail to see the point of it, but the metrics show good values. Examples (a minimal flow-matching sketch follows the links):
https://github.com/KdaiP/StableTTS
https://github.com/7Xin/DPI-TTS
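For context, the flow-matching objective these repos build on is simple to state. Below is a minimal PyTorch-style sketch of one conditional flow-matching training step, assuming a model(x_t, t, cond) that predicts velocity; this illustrates the general technique and is not code from either repo:

```python
import torch
import torch.nn.functional as F

def flow_matching_step(model, x1, text_cond):
    """One conditional flow-matching step: regress the velocity that moves
    noise x0 toward data x1 along a straight path (illustrative sketch)."""
    x0 = torch.randn_like(x1)                       # noise sample
    t = torch.rand(x1.shape[0], device=x1.device)   # random time in [0, 1]
    t_ = t.view(-1, *([1] * (x1.dim() - 1)))        # broadcast t over feature dims
    xt = (1 - t_) * x0 + t_ * x1                    # point on the straight path
    v_target = x1 - x0                              # constant velocity of that path
    v_pred = model(xt, t, text_cond)                # DiT predicts the velocity
    return F.mse_loss(v_pred, v_target)
```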
It's interesting how non-optimal decisions require much more compute. For example, speech events are clearly non-uniformly arranged in time, so one has to model time offsets. Given that, it is strange that modern discrete codecs use large uniform frame shifts. A proper encoder should have blank tokens or something like that, and a higher sampling rate.
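To make the point concrete, here is a toy sketch (purely illustrative, not any existing codec) of collapsing a uniform-rate token stream into non-uniform events using an explicit blank symbol, CTC-style:

```python
BLANK = 0

def collapse_uniform_tokens(frame_tokens):
    """Collapse a uniform-rate token stream into (token, duration) events,
    dropping blanks -- a toy illustration of non-uniform event coding."""
    events = []
    prev = BLANK
    for tok in frame_tokens:
        if tok != BLANK:
            if tok == prev:
                events[-1] = (tok, events[-1][1] + 1)  # extend the running event
            else:
                events.append((tok, 1))                # new event, duration 1
        prev = tok
    return events

# 12 uniform frames collapse to 3 events with explicit durations
print(collapse_uniform_tokens([0, 0, 7, 7, 7, 0, 3, 3, 0, 0, 9, 9]))
# -> [(7, 3), (3, 2), (9, 2)]
```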
Interesting research on ASR robustness: the NeMo Canary model is the most accurate but far from the most robust. Whisper variants and MMS are somewhat better. Training data size matters.
https://github.com/ahmedshah1494/speech_robust_bench
https://arxiv.org/pdf/2403.07937
Speech Robust Bench: A Robustness Benchmark For Speech Recognition
Muhammad A. Shah, David Solans Noguero, Mikko A. Heikkila, Bhiksha Raj, Nicolas Kourtellis
As Automatic Speech Recognition (ASR) models become ever more pervasive, it is important to ensure that they make reliable predictions under corruptions present in the physical and digital world. We propose Speech Robust Bench (SRB), a comprehensive benchmark for evaluating the robustness of ASR models to diverse corruptions. SRB is composed of 114 input perturbations which simulate an heterogeneous range of corruptions that ASR models may encounter when deployed in the wild. We use SRB to evaluate the robustness of several state-of-the-art ASR models and observe that model size and certain modeling choices such as the use of discrete representations, or self-training appear to be conducive to robustness. We extend this analysis to measure the robustness of ASR models on data from various demographic subgroups, namely English and Spanish speakers, and males and females. Our results revealed noticeable disparities in the model's robustness across subgroups. We believe that SRB will significantly facilitate future research towards robust ASR models, by making it easier to conduct comprehensive and comparable robustness evaluations.
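As an illustration of the kind of measurement such a benchmark automates (this is not SRB's actual code), a single perturbation-robustness check might look like this; asr_transcribe(), wav and reference are hypothetical placeholders for the model under test and a test utterance:

```python
import numpy as np
import jiwer  # pip install jiwer

def add_white_noise(wav, snr_db):
    """Add white noise to a waveform at a target SNR (one toy perturbation)."""
    signal_power = np.mean(wav ** 2)
    noise_power = signal_power / (10 ** (snr_db / 10))
    return wav + np.sqrt(noise_power) * np.random.randn(len(wav))

# asr_transcribe(), wav and reference are hypothetical placeholders.
clean_wer = jiwer.wer(reference, asr_transcribe(wav))
noisy_wer = jiwer.wer(reference, asr_transcribe(add_white_noise(wav, snr_db=5)))
print(f"absolute WER degradation at 5 dB SNR: {noisy_wer - clean_wer:.3f}")
```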
https://arxiv.org/abs/2410.01036
MOSEL: 950,000 Hours of Speech Data for Open-Source Speech Foundation Model Training on EU Languages
Marco Gaido, Sara Papi, Luisa Bentivogli, Alessio Brutti, Mauro Cettolo, Roberto Gretter, Marco Matassoni, Mohamed Nabih, Matteo Negri
The rise of foundation models (FMs), coupled with regulatory efforts addressing their risks and impacts, has sparked significant interest in open-source models. However, existing speech FMs (SFMs) fall short of full compliance with the open-source principles, even if claimed otherwise, as no existing SFM has model weights, code, and training data publicly available under open-source terms. In this work, we take the first step toward filling this gap by focusing on the 24 official languages of the European Union (EU). We collect suitable training data by surveying automatic speech recognition datasets and unlabeled speech corpora under open-source compliant licenses, for a total of 950k hours. Additionally, we release automatic transcripts for 441k hours of unlabeled data under the permissive CC-BY license, thereby facilitating the creation of open-source SFMs for the EU languages.
Rev released new, very accurate English models:
https://www.rev.com/blog/speech-to-text-technology/introducing-reverb-open-source-asr-diarization
Somewhat interesting
https://arxiv.org/abs/2407.15835
dMel: Speech Tokenization made Simple
He Bai, Tatiana Likhomanenko, Ruixiang Zhang, Zijin Gu, Zakaria Aldeneh, Navdeep Jaitly
Large language models have revolutionized natural language processing by leveraging self-supervised pretraining on vast textual data. Inspired by this success, researchers have investigated complicated speech tokenization methods to discretize continuous speech signals so that language modeling techniques can be applied to speech data. However, existing approaches either model semantic (content) tokens, potentially losing acoustic information, or model acoustic tokens, risking the loss of semantic (content) information. Having multiple token types also complicates the architecture and requires additional pretraining. Here we show that discretizing mel-filterbank channels into discrete intensity bins produces a simple representation (dMel), that performs better than other existing speech tokenization methods. Using an LM-style transformer architecture for speech-text modeling, we comprehensively evaluate different speech tokenization methods on speech recognition (ASR) and speech synthesis (TTS). Our results demonstrate the effectiveness of dMel in achieving high performance on both tasks within a unified framework, paving the way for efficient and effective joint modeling of speech and text.
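The core idea is simple enough to sketch in a few lines. This is a rough approximation of the recipe, not the authors' code; the bin count and min-max normalization are my assumptions:

```python
import numpy as np

def dmel_tokens(log_mel, n_bins=16):
    """Quantize each value of a log-mel spectrogram into discrete intensity bins
    (rough sketch of the dMel idea; bin count and normalization are assumptions)."""
    edges = np.linspace(log_mel.min(), log_mel.max(), n_bins + 1)[1:-1]  # interior bin edges
    return np.digitize(log_mel, edges)  # (n_mels, n_frames) integer tokens in [0, n_bins)

# log_mel can be any (n_mels, n_frames) log-mel spectrogram, e.g. from librosa or torchaudio.
```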
https://arxiv.org/abs/2408.13106
NEST: Self-supervised Fast Conformer as All-purpose Seasoning to Speech Processing Tasks
He Huang, Taejin Park, Kunal Dhawan, Ivan Medennikov, Krishna C. Puvvada, Nithin Rao Koluguri, Weiqing Wang, Jagadeesh Balam, Boris Ginsburg
Self-supervised learning has been proved to benefit a wide range of speech processing tasks, such as speech recognition/translation, speaker verification and diarization, etc. However, most of current approaches are computationally expensive. In this paper, we propose a simplified and more efficient self-supervised learning framework termed as NeMo Encoder for Speech Tasks (NEST). Specifically, we adopt the FastConformer architecture with 8x sub-sampling rate, which is faster than Transformer or Conformer architectures. Instead of clustering-based quantization, we use fixed random projection for its simplicity and effectiveness. We also implement a generalized noisy speech augmentation that teaches the model to disentangle the main speaker from noise or other speakers. Experiments show that NEST improves over existing self-supervised models and achieves new state-of-the-art performance on a variety of speech processing tasks, such as speech recognition/translation, speaker diarization, spoken language understanding, etc. Code and checkpoints will be publicly available via NVIDIA NeMo framework.
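The "fixed random projection" quantization mentioned above is in the spirit of BEST-RQ. A toy sketch of such a quantizer for producing discrete SSL targets (illustrative only, not NeMo's implementation):

```python
import torch
import torch.nn.functional as F

class RandomProjectionQuantizer:
    """Frozen random projection + frozen random codebook: features are projected
    and assigned to the nearest codebook entry to produce discrete SSL targets
    (toy sketch in the spirit of BEST-RQ, not NeMo's actual implementation)."""
    def __init__(self, feat_dim, code_dim=16, codebook_size=8192, seed=0):
        g = torch.Generator().manual_seed(seed)
        self.proj = torch.randn(feat_dim, code_dim, generator=g)             # fixed, never trained
        self.codebook = F.normalize(
            torch.randn(codebook_size, code_dim, generator=g), dim=-1)       # fixed, never trained

    def __call__(self, feats):  # feats: (time, feat_dim)
        z = F.normalize(feats @ self.proj, dim=-1)
        return torch.cdist(z, self.codebook).argmin(dim=-1)                  # target token ids

quantizer = RandomProjectionQuantizer(feat_dim=80)
targets = quantizer(torch.randn(100, 80))  # 100 frames -> 100 discrete targets
```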
The more LLMs are out there, the more data leaks. A good example: LibriSpeech ASR accuracy of LLM-based systems is usually very good, given that the LLM has definitely seen all the book texts before. Another example here:
https://arxiv.org/abs/2409.04927
Just ASR + LLM? A Study on Speech Large Language Models' Ability to Identify and Understand Speaker in Spoken Dialogue
Junkai Wu, Xulin Fan, Bo-Ru Lu, Xilin Jiang, Nima Mesgarani, Mark Hasegawa-Johnson, Mari Ostendorf
In recent years, we have observed a rapid advancement in speech language models (SpeechLLMs), catching up with humans' listening and reasoning abilities. SpeechLLMs have demonstrated impressive spoken dialog question-answering (SQA) performance in benchmarks like Gaokao, the English listening test of the college entrance exam in China, which seemingly requires understanding both the spoken content and voice characteristics of speakers in a conversation. However, after carefully examining Gaokao's questions, we find the correct answers to many questions can be inferred from the conversation transcript alone, i.e.\ without speaker segmentation and identification. Our evaluation of state-of-the-art models Qwen-Audio and WavLLM on both Gaokao and our proposed "What Do You Like?" dataset shows a significantly higher accuracy in these context-based questions than in identity-critical questions, which can only be answered reliably with correct speaker identification. The results and analysis suggest that when solving SQA, the current SpeechLLMs exhibit limited speaker awareness from the audio and behave similarly to an LLM reasoning from the conversation transcription without sound. We propose that tasks focused on identity-critical questions could offer a more accurate evaluation framework of SpeechLLMs in SQA.