Many releases recently, including Qwen-TTS and others. Another one, a spoken dialogue LLM:
https://github.com/FlashLabs-AI-Corp/FlashLabs-Chroma
World's first open-source real-time end-to-end spoken dialogue model with personalized voice cloning.
A VibeVoice finetune for European languages, with good results compared to the baseline:
https://huggingface.co/kugelaudio/kugelaudio-0-open
https://www.assemblyai.com/universal-3-pro is a new LLM-based model by AssemblyAI. It is supposed to be free for February, so a good chance to test it.
Universal-3 Pro by AssemblyAI: a first-of-its-kind promptable speech language model. Control transcription using natural language prompting and domain context.
An interesting effort from Shinji Watanabe on phone recognition:
https://huggingface.co/espnet/powsm
https://arxiv.org/abs/2510.24992
POWSM: A Phonetic Open Whisper-Style Speech Foundation Model
Chin-Jou Li, Kalvin Chang, Shikhar Bharadwaj, Eunjung Yeo, Kwanghee Choi, Jian Zhu, David Mortensen, Shinji Watanabe
Recent advances in spoken language processing have led to substantial progress in phonetic tasks such as automatic speech recognition (ASR), phone recognition (PR), grapheme-to-phoneme conversion (G2P), and phoneme-to-grapheme conversion (P2G). Despite their conceptual similarity, these tasks have largely been studied in isolation, each relying on task-specific architectures and datasets. In this paper, we introduce POWSM (Phonetic Open Whisper-style Speech Model), the first unified framework capable of jointly performing multiple phone-related tasks. POWSM enables seamless conversion between audio, text (graphemes), and phones, opening up new possibilities for universal and low-resource speech processing. Our model outperforms or matches specialized PR models of similar size (Wav2Vec2Phoneme and ZIPA) while jointly supporting G2P, P2G, and ASR. Our training data, code and models are released to foster open science.
Some great results in phone recognition; no code yet, but it will probably appear soon:
https://www.arxiv.org/abs/2602.01634
HuPER: A Human-Inspired Framework for Phonetic Perception
Chenxu Guo, Jiachen Lian, Yisi Liu, Baihe Huang, Shriyaa Narayanan, Cheol Jun Cho, Gopala Anumanchipalli
We propose HuPER, a human-inspired framework that models phonetic perception as adaptive inference over acoustic-phonetics evidence and linguistic knowledge. With only 100 hours of training data, HuPER achieves state-of-the-art phonetic error rates on five English benchmarks and strong zero-shot transfer to 95 unseen languages. HuPER is also the first framework to enable adaptive, multi-path phonetic perception under diverse acoustic conditions. All training data, models, and code are open-sourced. Code and demo available at this https URL.
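Both papers above report phone error rate (PER), which is just edit distance over phone sequences normalized by reference length. A minimal sketch of the metric (not code from either repo):

```python
def phone_error_rate(ref, hyp):
    """Levenshtein distance between phone sequences, normalized by reference length."""
    m, n = len(ref), len(hyp)
    prev = list(range(n + 1))  # distances for the empty reference prefix
    for i in range(1, m + 1):
        cur = [i] + [0] * n
        for j in range(1, n + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            cur[j] = min(prev[j] + 1,         # deletion
                         cur[j - 1] + 1,      # insertion
                         prev[j - 1] + cost)  # substitution
        prev = cur
    return prev[n] / max(m, 1)

# /k ae t/ recognized as /k ae d ax/: one substitution plus one insertion
print(phone_error_rate(["k", "ae", "t"], ["k", "ae", "d", "ax"]))  # → 0.6666666666666666
```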
https://github.com/FireRedTeam/FireRedASR2S
Interesting things:
FireRedVAD 100+ languages, 20+ Chinese dialects/accents
FireRedLID 100+ languages, 20+ Chinese dialects/accents
FLEURS-VAD-102: We randomly selected ~100 audio files per language from FLEURS test set, resulting in 9,443 audio files with manually annotated binary VAD labels (speech=1, silence=0). This VAD testset will be open sourced (coming soon).
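Scoring against binary file-level VAD labels like these reduces to ordinary classification metrics. A minimal sketch, assuming one speech/silence decision per file (not FireRed's actual evaluation code):

```python
def vad_scores(labels, preds):
    """Precision, recall, and F1 for binary VAD decisions (speech=1, silence=0)."""
    tp = sum(1 for y, p in zip(labels, preds) if y == 1 and p == 1)
    fp = sum(1 for y, p in zip(labels, preds) if y == 0 and p == 1)
    fn = sum(1 for y, p in zip(labels, preds) if y == 1 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

# Two speech files and two silence files; the detector misses one of each
print(vad_scores([1, 1, 0, 0], [1, 0, 1, 0]))  # → (0.5, 0.5, 0.5)
```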
Very true
https://x.com/KaitlynZhou/status/2023800965535789511
https://arxiv.org/abs/2602.12249
"Sorry, I Didn't Catch That": How Speech Models Miss What Matters Most
Kaitlyn Zhou, Martijn Bartelds, Federico Bianchi, James Zou
Despite speech recognition systems achieving low word error rates on standard benchmarks, they often fail on short, high-stakes utterances in real-world deployments. Here, we study this failure mode in a high-stakes task: the transcription of U.S. street names as spoken by U.S. participants. We evaluate 15 models from OpenAI, Deepgram, Google, and Microsoft on recordings from linguistically diverse U.S. speakers and find an average transcription error rate of 44%. We quantify the downstream impact of failed transcriptions by geographic locations and show that mis-transcriptions systematically cause errors for all speakers, but that routing distance errors are twice as large for non-English primary speakers compared to English primary speakers. To mitigate this harm, we introduce a synthetic data generation approach that produces diverse pronunciations of named entities using open-source text-to-speech models. Fine-tuning with less than 1,000 synthetic samples improves street name transcription accuracy by nearly 60% (relative to base models) for non-English primary speakers. Our results highlight a critical gap between benchmark performance and real-world reliability in speech systems and demonstrate a simple, scalable path to reducing high-stakes transcription errors.
Audio Reasoning Challenge results
https://audio-reasoning-challenge.github.io/leaderboard/
some info about winner Taltech entry
https://www.linkedin.com/posts/aivo-olev-73944965_its-official-i-built-an-ai-agent-that-outperformed-ugcPost-7429801097202069504-G3U8
The task was to build an agent that can reason about audio using any open-source tools. My solution basically taught a deaf LLM (Kimi K2) to answer questions about 1000 audio files (music, speech, other sounds), which would be hard for a human as well. It had input from other LLMs and 35 tools that could pick up some unreliable info (often incorrect or even hallucinated) from the audio, and that is what made this challenge the most exciting and why I basically worked non-stop for the 4 weeks. A normal AI agent can be pretty sure that when it reads a file or gets some other tool input, the information is correct. It might be irrelevant for the task, but mostly LLMs trust input (which is a problem in the real world with input from web search, malicious input, another agent's opinion, etc.). They also reason quite linearly, which is a problem when you have unreliable info.
Audio Reasoning Challenge leaderboard (Interspeech 2026)
Somehow one can create multimodal embeddings from speech and text and make them useful. Some projects I've seen around recently:
https://github.com/facebookresearch/SONAR
Used for ASR WER approximation
On the Robust Approximation of ASR Metrics
Abdul Waheed, Hanin Atwany, Rita Singh, Bhiksha Raj
https://arxiv.org/abs/2502.12408
Another one to detect dataset quality issues
https://huggingface.co/yuriyvnv/WAVe-1B-Multimodal-PT
SONAR: a multilingual and multimodal fixed-size sentence embedding space, with a full suite of speech and text encoders and decoders.
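The WER-approximation idea, roughly: since SONAR maps speech and text into one shared fixed-size embedding space, the cosine similarity between an utterance's speech embedding and a transcript's text embedding can serve as a reference-free quality proxy. A toy sketch with placeholder vectors standing in for real SONAR embeddings:

```python
import numpy as np

def cosine(a, b):
    """Cosine similarity between two embedding vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Placeholder vectors: in practice these come from SONAR's speech encoder and
# text encoder, which map into the same fixed-size embedding space.
speech_emb = np.array([0.90, 0.10, 0.20])  # embedding of the audio
good_hyp = np.array([0.88, 0.12, 0.19])    # embedding of an accurate transcript
bad_hyp = np.array([0.10, 0.90, 0.30])     # embedding of a wrong transcript

# Higher speech-to-text similarity predicts lower WER, with no reference needed
assert cosine(speech_emb, good_hyp) > cosine(speech_emb, bad_hyp)
```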
No model weights, but somewhat interesting ideas.
Transfusion (Zhou et al., 2025) was originally proposed in computer vision to develop a model that can jointly perform generation and understanding tasks.
https://arxiv.org/abs/2602.17097
AudioChat: Unified Audio Storytelling, Editing, and Understanding with Transfusion Forcing
William Chen, Prem Seetharaman, Rithesh Kumar, Oriol Nieto, Shinji Watanabe, Justin Salamon, Zeyu Jin
Despite recent breakthroughs, audio foundation models struggle in processing complex multi-source acoustic scenes. We refer to this challenging domain as audio stories, which can have multiple speakers and background/foreground sound effects. Compared to traditional audio processing tasks, audio stories introduce new layers of semantic, temporal, and physical complexity. To address this challenge, we propose AudioChat, a framework for developing audio foundation models that can generate, edit, and understand audio stories. AudioChat introduces a new paradigm in which LLM-based toolcalling agents simulate interactions between users and the system, and these simulated dialogues are used as training data. We also introduce a novel Audio Transfusion Forcing objective to train the AudioChat model, allowing it to simultaneously decompose high-level instructions via structured chain-of-thought reasoning and perform interactive multi-turn audio understanding/generation. To evaluate generation and editing performance, we develop three new metrics that directly measure task performance instead of relying upon distribution-based scoring. We highly encourage readers to visit our demo to better understand the capabilities of AudioChat: this https URL.
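The Transfusion recipe, in a nutshell, trains one sequence model with two losses: next-token cross-entropy on discrete (text) positions and a denoising regression loss on continuous (latent) positions. A toy sketch of the combined objective; the shapes, names, and 1.0 weight are illustrative, not from the paper:

```python
import numpy as np

def cross_entropy(logits, target):
    """Next-token loss for one discrete (text) position."""
    z = logits - logits.max()                # stabilized log-softmax
    log_probs = z - np.log(np.exp(z).sum())
    return float(-log_probs[target])

def denoise_loss(pred_noise, true_noise):
    """Regression loss for one continuous (audio latent) position."""
    return float(np.mean((pred_noise - true_noise) ** 2))

rng = np.random.default_rng(0)
text_loss = cross_entropy(rng.normal(size=32), target=7)          # pretend model logits
audio_loss = denoise_loss(rng.normal(size=8), rng.normal(size=8))  # pretend denoiser output
total = text_loss + 1.0 * audio_loss  # modality weighting is a hyperparameter
```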
Modern flow matching
https://github.com/Aratako/Irodori-TTS
Rectified flow + DAC-VAE + a text encoder with emojis. The voice cloning demo samples have noticeable noise, by the way; it seems DAC-VAE is not that great.
Irodori-TTS: a flow matching-based text-to-speech model with emoji-driven style control.
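For context, the rectified flow objective this kind of TTS builds on is simple: interpolate linearly between noise and data and regress a velocity network onto the constant direction of that straight path. A minimal sketch of the standard formulation (not this repo's actual training code; `model` is a placeholder):

```python
import numpy as np

def rectified_flow_loss(model, x1, rng):
    """One training example's loss under the rectified flow objective.

    Sample noise x0 and a time t, form the straight-line interpolant
    x_t = (1 - t) * x0 + t * x1, and regress the model's predicted
    velocity onto the constant target x1 - x0.
    """
    x0 = rng.normal(size=x1.shape)   # noise endpoint
    t = rng.uniform()                # time in [0, 1]
    xt = (1 - t) * x0 + t * x1       # point on the straight path
    v_target = x1 - x0               # velocity of the straight path
    return float(np.mean((model(xt, t) - v_target) ** 2))

rng = np.random.default_rng(0)
data = rng.normal(size=4)                     # stand-in for an audio latent
zero_model = lambda xt, t: np.zeros_like(xt)  # untrained placeholder network
loss = rectified_flow_loss(zero_model, data, rng)
```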
An interesting job posting; those are rare nowadays.
Bland.ai builds AI voice agents that handle real phone calls for some of the largest companies in the world. Our software runs inside critical workflows at companies like Samsara, Gallup, TripAdvisor, Snapchat, Signant Health, Better.com, and others. We have raised $65 million from top Silicon Valley investors including Emergence Capital, Scale Venture Partners, Y Combinator, and the founders of Twilio, Affirm, and ElevenLabs.
We are expanding our research team as we train and deploy our own TTS and STT models in production. We are also investing heavily in next generation speech to speech and speech inference systems.
We are currently hiring for two roles:
Research
If you have designed and trained your own models, published papers or in depth technical writing, and are working at the leading edge of audio research, we would love to hear from you:
https://jobs.ashbyhq.com/bland/d2e08077-61f0-4810-bc72-3efd7944647b
You might be a strong fit if you have experience with:
- Large scale TTS, STT, or neural audio codec systems
- Self supervised learning, generative modeling, or multimodal modeling
- Neural audio codecs, discrete or continuous latent representations, and compression tradeoffs
- Running tight ablations and controlled experiments that move ideas from hypothesis to validation quickly
- Optimizing inference for real time, low latency production systems
Machine Learning Engineer
If you are a strong programmer who enjoys building terabyte scale datasets, designing training pipelines, and working on model inference and deployment, while staying closely connected to research, apply here:
https://jobs.ashbyhq.com/bland/05906608-0628-412c-8b01-a050d87986c5
If you have any questions please feel free to shoot me a DM!
Our friend @vancheeck recently pushed a new generation of an outstanding speaker identification architecture:
https://github.com/PalabraAI/redimnet2
It is great this project continues in Palabra https://www.palabra.ai
IWSLT 2026 has some interesting competitions (like subtitling) with data available for download
https://iwslt.org/2026/subtitling
Evaluation period starts April 1st
Two talks uploaded, interesting information in both:
State of the art in AudioLLMs (still far behind text-based models)
https://www.youtube.com/watch?v=BJ3L0Kmz7Jw
Meeting transcription. LLMs are still bad at diarization; specialized systems (DiariZen + SE-DiCoW) are much better.
https://www.youtube.com/watch?v=2iIXUEnVkAA
Auden: Where is the “GPT moment” for audio? - Yiwen Shao
Talk 41 of the Conversational AI Reading Group "Auden: Where is the “GPT moment” for audio?" by Yiwen Shao - Tencent AI Lab.
For further information about the Reading Group, please check out https://poonehmousavi.github.io/rg