https://huggingface.co/datasets/ai-coustics/dawn_chorus_en
dawn_chorus_en
An open-source evaluation dataset for accurate foreground speaker transcription.
The dataset targets mixture conditions where foreground speech remains generally transcribable by speech-to-text systems, while background speech is distinctly perceived as background. It provides around 90 minutes of foreground–background speech mixtures composed of recorded and synthesized foreground speech, along with ground truth foreground speech and corresponding transcripts.
Inspired by DAPS, which frames speech enhancement as a direct transformation from real-world device recordings to professionally produced studio speech via aligned input–output pairs, we design this dataset around an equally application-driven mapping: from realistic foreground–background speech mixtures to isolated primary-speaker speech that remains robustly transcribable by downstream STT systems. Like DAPS, our approach emphasizes time-aligned references and real recording / transmission conditions rather than purely synthetic degradations, enabling evaluation of suppression strength versus foreground speech distortion.
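The dataset ships ready-made mixtures, but for intuition, here is a minimal sketch of combining a foreground and a background signal at a target SNR (illustrative only, not the dataset's actual mixing pipeline; `mix_at_snr` is a made-up helper):

```python
import numpy as np

def mix_at_snr(foreground, background, snr_db):
    """Scale `background` so the foreground-to-background power ratio
    of the mixture equals `snr_db`, then sum the two signals."""
    fg_power = np.mean(foreground ** 2)
    bg_power = np.mean(background ** 2)
    # Gain that places the background snr_db below the foreground.
    gain = np.sqrt(fg_power / (bg_power * 10 ** (snr_db / 10)))
    return foreground + gain * background

# Toy signals standing in for foreground and background speech.
rng = np.random.default_rng(0)
fg = rng.standard_normal(16000)
bg = rng.standard_normal(16000)
mix = mix_at_snr(fg, bg, snr_db=10.0)
```

Varying `snr_db` is how one would sweep from "background barely audible" to "foreground barely transcribable", which is exactly the regime the dataset targets.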
Nice upsampler: trained for music, supports upsampling from 8 kHz (important).
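Going from 8 kHz to 48 kHz means synthesizing 6x the samples. The trivial baseline is plain interpolation, which only stretches the existing 0–4 kHz band and adds no new high-frequency content, which is what learned super-resolution models improve on. A toy sketch of that baseline (illustrative only, not the model's method):

```python
import numpy as np

def naive_upsample(x, factor=6):
    """Linear interpolation, e.g. 8 kHz -> 48 kHz. Adds no new
    high-frequency content; just the trivial baseline."""
    n = len(x)
    t_in = np.arange(n)
    t_out = np.arange((n - 1) * factor + 1) / factor
    return np.interp(t_out, t_in, x)

# 0.1 s of a 440 Hz tone sampled at 8 kHz.
x8k = np.sin(2 * np.pi * 440 * np.arange(800) / 8000)
x48k = naive_upsample(x8k, factor=6)
```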
https://github.com/woongzip1/UniverSR
GitHub - woongzip1/UniverSR: Official implementation of UniverSR (ICASSP 2026)
DiTs are powering modern TTS systems, yet their issues are rarely mentioned: longer training times and higher data requirements. Convolutions still make sense given that speech data is locally uniform, so research like this still matters for us GPU-poor folks.
https://arxiv.org/abs/2603.09408v1
Reviving ConvNeXt for Efficient Convolutional Diffusion Models
Just another reminder that there is no point in ONNX:
https://github.com/eschmidbauer/moonshine-c
The source is pure C, 825 lines of code; the executable is 40 KB. It runs ASR just fine.
Interesting community on Reddit
https://www.reddit.com/r/VoiceAutomationAI/
It will host an AMA session with Tony Robinson, one of the most knowledgeable people I know.
Upcoming AMA with Dr Tony Robinson (Founder Speechmatics)
Excited to announce that Dr Tony Robinson will be joining Unio - The Voice AI Community powered by SLNG for a live AMA with builders & founders.
If you’re building voice AI, you already know this:
it works in demos… and breaks in production.
Dr Tony has spent 36+ years in Voice AI, starting in 1989 at Cambridge where he built one of the earliest neural network based speech recognition systems, long before deep learning became mainstream.
Today, Speechmatics powers voice AI across 50+ languages, with customers seeing 9x growth in voice agent adoption in 2025.
📅 Date: 27 March
⏰ Time: 10:30 AM PST / 11:00 PM IST
📍 Location: Reddit (r/VoiceAutomationAI)
For the next 24 hours, he’ll be answering questions about:
• What actually breaks in production voice AI (and how to fix it)
• Accents, noise, latency & real-world edge cases
• Designing reliable STT-LLM-TTS pipelines
• Lessons from 36+ years building speech systems
• Where voice AI is really heading (beyond the hype)
• What he’d do differently if starting today
If you're building in Voice AI, AI agents, or conversational automation, this is a rare opportunity to learn from someone who has been solving these problems for decades.
Join the reddit community to drop questions👇
Link in the first comment.
r/VoiceAutomationAI
Welcome to r/VoiceAutomationAI - Unio, the Voice AI Community, powered by SLNG AI.
A community for builders, founders, engineers, product teams, and enterprises working on real world AI Agents and Voice AI systems.
Join weekly AMAs with funded founders…
Good talk on SpeechLMs
https://www.youtube.com/watch?v=m65SiSnsZ3g
It explains the paper below. Basically, at different points in time one has to pick different layers from the text LM for the adapters: word boundaries require more linguistic knowledge, the middles of words more acoustic knowledge. Adapters adjusted accordingly yield big improvements.
https://arxiv.org/abs/2503.06211
Late Fusion and Multi-Level Fission Amplify Cross-Modal Transfer in Text-Speech LMs
Santiago Cuervo, Adel Moumen, Yanis Labrak, Sameer Khurana, Antoine Laurent, Mickael Rouvier, Phil Woodland, Ricard Marxer
Text-Speech Language Models (TSLMs) -- language models trained to jointly process and generate text and speech -- are commonly trained through an early modality fusion/fission approach, in which both modalities are fed and predicted from a shared backbone via linear layers. We hypothesize that this approach limits cross-modal transfer by neglecting feature compositionality -- specifically, the finer-grained nature of speech representations compared to text -- preventing the emergence of a shared feature hierarchy within model layers. In this paper, we argue that this limitation can be addressed through late fusion and fission, with a fission process that accesses both high- and low-level features for speech generation. Our models implementing these principles, SmolTolk, rival or surpass state-of-the-art TSLMs trained with orders of magnitude more compute, and achieve significantly improved cross-modal performance relative to early fusion/fission baselines. Representation analyses further suggest that our method enhances the model's ability to abstract higher-level, more semantic features from speech, and leads to increasingly shared representation spaces across layers.
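As a toy illustration of the idea that different frames should draw on different LM layers, one can gate stacked layer outputs with a per-frame softmax over layers (a numpy sketch of the concept only, not the SmolTolk architecture; all names are invented):

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def fuse_layers(hidden_states, frame_logits):
    """hidden_states: (num_layers, T, D) stacked LM layer outputs.
    frame_logits:  (T, num_layers) per-frame layer-selection scores,
    e.g. favoring deep (linguistic) layers near word boundaries and
    shallow (acoustic) layers mid-word. Returns a fused (T, D) array."""
    weights = softmax(frame_logits, axis=-1)        # (T, L)
    # For each frame t: sum over layers l of weights[t, l] * h[l, t, :].
    return np.einsum("tl,ltd->td", weights, hidden_states)

L, T, D = 4, 10, 8
rng = np.random.default_rng(0)
h = rng.standard_normal((L, T, D))
logits = rng.standard_normal((T, L))
fused = fuse_layers(h, logits)
```

With zero logits (uniform weights) this reduces to a plain average over layers; learned, frame-dependent logits are what let the model shift between acoustic and linguistic features over time.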
Advancing the Linguistic Capabilities of Speech Language Models - Ricard Marxer - ILLS, CNRS
Talk 42 of the Conversational AI Reading Group "Advancing the Linguistic Capabilities of Speech Language Models" by Ricard Marxer - Université de Toulon, ILLS, CNRS
For further information about the Reading Group, please check out https://poonehmousavi.github.io/rg
Ultra-Sortformer: Extending NVIDIA Sortformer to N Speakers
https://github.com/LilDevsy0117/Ultra-Sortformer
GitHub - LilDevsy0117/Ultra-Sortformer: Ultra-Sortformer for Scalable Speaker Diarization
VoxCPM2 is the latest major release: a 2B-parameter model trained on over 2 million hours of multilingual speech data, now supporting 30 languages, Voice Design, Controllable Voice Cloning, and 48 kHz studio-quality audio output. Built on a MiniCPM-4 backbone.
https://github.com/OpenBMB/VoxCPM
GitHub - OpenBMB/VoxCPM: VoxCPM2: Tokenizer-Free TTS for Multilingual Speech Generation, Creative Voice Design, and True-to-Life Cloning
Recently I was fighting OOM errors in an ASR service, so this came just in time, very relevant:
Omni Model Inference: How We Move Tensors Between Stages
https://x.com/GenAI_is_real/status/2041723531357196471
Chayenne Zhao (@GenAI_is_real) on X
A friend once asked me: what's the fundamental difference between serving Omni multimodal models and serving plain LLMs? I thought about it and the simplest way to put it is this — a regular LLM handles…
Rissa Cao, FishAudio CEO. A bit of marketing, but a very valid point on the importance of data and the lack of real high-quality data for speech systems.
https://www.linkedin.com/feed/update/urn:li:activity:7448399470251356160/
Early on, we made a mistake. We trained our TTS model on whatever voice data we could find online.
It sounded great on podcasts. But terrible for creation, companionship, anime dubbing. Everything fell apart.
The data distribution was wrong.
We're investing $2M+ in voice data this year.
https://github.com/OpenMOSS/MOSS-TTS-Nano
Only 0.1B params, yet still many languages. The Russian quality is awful, though.
GitHub - OpenMOSS/MOSS-TTS-Nano: MOSS-TTS-Nano is an open-source multilingual tiny speech generation model from MOSI.AI and the OpenMOSS team. With only 0.1B parameters, it is designed for realtime speech generation…
The Beyond Transcription Challenge, an IEEE SLT 2026 shared task tackling a foundational question in audio AI: can a model reason over speech without first converting it to text?
https://betrac.github.io
The research question: Current speech models still struggle to extract meaning directly from audio, especially when the signal includes overlapping speakers, ambient sounds, and room acoustics. Clinical note generation from doctor-patient conversations is an ideal stress test for this: it demands that a model attend to who said what, filter environmental noise, and produce faithful structured output. Yet on the Synth-DoPaCo dataset, end-to-end models hallucinate at alarming rates, with 99–100% of clinical claims unsupported by the source audio, compared to just 21–23% for traditional transcribe-then-summarize pipelines. BeTraC is a shared evaluation challenge aimed at closing this gap by advancing the technology.
Two competition tracks:
- Lightweight (≤ 6B params): Single end-to-end model, one invocation. Audio in, SOAP note out.
- Heavyweight (≤ 36B params): Tools and agents allowed. Only the final model generates text from audio.
The Synth-DoPaCo dataset: 8,800 synthetic doctor-patient conversations (~1,329 hrs), 66 ambient sound classes, room reverberation, Opus compression. Available now on Hugging Face.
Key dates:
- May 4, 2026: Open-Source Inclusion Proposals Deadline
- June 24, 2026: System submission deadline
- July 8, 2026: Challenge paper due
Data is live. Baselines are posted. Team registration is open.
If you work on speech, audio understanding, or multimodal AI, we'd love to have you compete.
https://www.deepl.com/en/press-release/deepl_launches_voice_api_for_real_time_speech_transcription_and_translation
DeepL, a global AI product and research company, today announced the general availability of DeepL Voice API. This innovative product empowers developers to integrate real-time voice transcription and translation capabilities into their applications, significantly enhancing multilingual support for businesses.
For a long time AudioSet was a big pain to download; it's finally available on HF.
https://huggingface.co/datasets/agkphysics/AudioSet
Overall, even small speech models need to understand non-speech sounds better. More on this later.
The code for Facebook's LST is finally available:
https://github.com/facebookresearch/lst
https://t.me/speechtech/2195
GitHub - facebookresearch/lst: Code for Latent Speech-Text Transformer (LST)