Speech Technology – Telegram

Speech Technology

1.6K subscribers

Speech Technology

https://github.com/FireRedTeam/FireRedASR2S

Interesting things:

FireRedVAD 100+ languages, 20+ Chinese dialects/accents
FireRedLID 100+ languages, 20+ Chinese dialects/accents

FLEURS-VAD-102: We randomly selected ~100 audio files per language from FLEURS test set, resulting in 9,443 audio files with manually annotated binary VAD labels (speech=1, silence=0). This VAD testset will be open sourced (coming soon).
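Binary labels like these are usually scored frame by frame. A minimal sketch of such scoring is below; it assumes both label sequences are already aligned at the same frame rate, and the function name `vad_scores` is made up for illustration (the FireRedASR2S repo may score differently).

```python
def vad_scores(reference, predicted):
    """Frame-level precision, recall and F1 for binary VAD labels (speech=1, silence=0)."""
    if len(reference) != len(predicted):
        raise ValueError("label sequences must have the same number of frames")
    pairs = list(zip(reference, predicted))
    tp = sum(1 for r, p in pairs if r == 1 and p == 1)  # speech detected as speech
    fp = sum(1 for r, p in pairs if r == 0 and p == 1)  # silence detected as speech
    fn = sum(1 for r, p in pairs if r == 1 and p == 0)  # speech missed
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1


# Toy example: one false alarm before the speech segment, one missed frame at its end.
reference = [0, 0, 1, 1, 1, 1, 0, 0]
predicted = [0, 1, 1, 1, 1, 0, 0, 0]
print(vad_scores(reference, predicted))  # (0.75, 0.75, 0.75)
```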

1.27K views, edited 17:23

Speech Technology

8B TTS model claims to support many languages

https://github.com/OpenMOSS/MOSS-TTS

MOSS‑TTS Family is an open‑source speech and sound generation model family from MOSI.AI and the OpenMOSS team. It is designed for high‑fidelity, high‑expressiveness, and complex real‑world scenario...

1.56K views, 23:15

Speech Technology

Very true

https://x.com/KaitlynZhou/status/2023800965535789511

https://arxiv.org/abs/2602.12249

"Sorry, I Didn't Catch That": How Speech Models Miss What Matters Most

Kaitlyn Zhou, Martijn Bartelds, Federico Bianchi, James Zou

Despite speech recognition systems achieving low word error rates on standard benchmarks, they often fail on short, high-stakes utterances in real-world deployments. Here, we study this failure mode in a high-stakes task: the transcription of U.S. street names as spoken by U.S. participants. We evaluate 15 models from OpenAI, Deepgram, Google, and Microsoft on recordings from linguistically diverse U.S. speakers and find an average transcription error rate of 44%. We quantify the downstream impact of failed transcriptions by geographic locations and show that mis-transcriptions systematically cause errors for all speakers, but that routing distance errors are twice as large for non-English primary speakers compared to English primary speakers. To mitigate this harm, we introduce a synthetic data generation approach that produces diverse pronunciations of named entities using open-source text-to-speech models. Fine-tuning with less than 1,000 synthetic samples improves street name transcription accuracy by nearly 60% (relative to base models) for non-English primary speakers. Our results highlight a critical gap between benchmark performance and real-world reliability in speech systems and demonstrate a simple, scalable path to reducing high-stakes transcription errors.
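A quick way to see why low benchmark WER can hide this failure: the same single-word mistake barely moves WER on a long sentence but ruins a short one. The sketch below is a plain word-level edit-distance WER, not the paper's evaluation code, and the sentences are made-up examples.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two token lists."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        cur = [i]
        for j, h in enumerate(hyp, start=1):
            cur.append(min(prev[j] + 1,                 # deletion
                           cur[j - 1] + 1,              # insertion
                           prev[j - 1] + (r != h)))     # substitution or match
        prev = cur
    return prev[-1]


def wer(reference, hypothesis):
    """Word error rate: word-level edit distance divided by reference length."""
    ref_words = reference.lower().split()
    hyp_words = hypothesis.lower().split()
    if not ref_words:
        raise ValueError("reference must contain at least one word")
    return edit_distance(ref_words, hyp_words) / len(ref_words)


long_ref = "please send a car to the corner of main street and fifth avenue right now"
long_hyp = "please send a car to the corner of maine street and fifth avenue right now"
print(f"{wer(long_ref, long_hyp):.3f}")      # 0.067
print(wer("main street", "maine street"))    # 0.5
```

In both cases the address is wrong, so an utterance-level error rate on the named entity is the number that matters for routing.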

X (formerly Twitter)

Kaitlyn Zhou (@KaitlynZhou) on X

Text-to-speech models can’t get your address right? Turns out you’re not the only one.

📢New preprint! State-of-the-art speech models get 44% of street names wrong — and non-English primary speakers suffer twice the error impact!

1.24K views, 22:35

Speech Technology

Audio Reasoning Challenge results

https://audio-reasoning-challenge.github.io/leaderboard/

Some info about the winning TalTech entry

https://www.linkedin.com/posts/aivo-olev-73944965_its-official-i-built-an-ai-agent-that-outperformed-ugcPost-7429801097202069504-G3U8

The task was to build an agent that can reason about audio using any open-source tools, and my unique solution basically taught a deaf LLM (Kimi K2) to answer questions about 1000 audio files (music, speech, other sounds). That would be hard for a human as well. It had input from other LLMs and 35 tools that were able to pick up some unreliable info (often incorrect or even hallucinated) from the audio, and that is what made this challenge the most exciting and why I basically worked non-stop for the 4 weeks. A normal AI agent can be pretty sure that when it reads a file or gets some other tool input, the information is correct. It might be irrelevant for the task, but mostly LLMs trust input (which is a problem in the real world with input from web search, malicious input, another agent's opinion, etc.). They also reason quite linearly, which is a problem when you have unreliable info.

Audio Reasoning Challenge

Audio Reasoning Challenge - Interspeech 2026

1.34K views, 13:08

Speech Technology

Somehow one can create multimodal embeddings from speech and text and make them useful. Some projects I've come across recently:

https://github.com/facebookresearch/SONAR

Used for ASR WER approximation

On the Robust Approximation of ASR Metrics
Abdul Waheed, Hanin Atwany, Rita Singh, Bhiksha Raj
https://arxiv.org/abs/2502.12408

Another one to detect dataset quality issues

https://huggingface.co/yuriyvnv/WAVe-1B-Multimodal-PT
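The common idea behind both uses can be sketched roughly: if speech and text land in one embedding space, the similarity between an audio embedding and its transcript embedding is a cheap signal of how well they match. The snippet below is only an illustration with made-up 3-dimensional vectors; it does not call SONAR or WAVe, and the helper names and the 0.5 threshold are assumptions.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same dimension")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("zero vector has no direction")
    return dot / (norm_a * norm_b)


def flag_suspect_pairs(pairs, threshold=0.5):
    """Return ids whose speech and text embeddings disagree (similarity below threshold)."""
    return [utt_id for utt_id, speech_vec, text_vec in pairs
            if cosine_similarity(speech_vec, text_vec) < threshold]


# Hypothetical embeddings: utt1 matches its transcript, utt2 does not.
pairs = [
    ("utt1", [1.0, 0.0, 0.0], [0.9, 0.1, 0.0]),
    ("utt2", [1.0, 0.0, 0.0], [0.0, 1.0, 0.0]),
]
print(flag_suspect_pairs(pairs))  # ['utt2']
```

In practice the vectors would come from real speech and text encoders, and a WER estimator would more likely be a regressor trained on such features than a fixed threshold.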

SONAR, a new multilingual and multimodal fixed-size sentence embedding space, with a full suite of speech and text encoders and decoders. - facebookresearch/SONAR

1.47K views, 19:32

Speech Technology

https://alphacephei.com/nsh/2026/02/23/am-lm-factor.html

Speech Recognition With Vosk

Factorizing E2E on acoustic and language models

While end-to-end speech recognition systems are dominating leaderboards, it’s still valuable to consider the separate acoustic and language models. This separation is present in the network, as the lower layers of the network handle acoustic information, filtering…

1.09K views, 23:27

Speech Technology

No model weights, but somewhat interesting ideas.

Transfusion: Transfusion (Zhou et al., 2025) was originally proposed in computer vision to develop a model that can jointly perform generation and understanding tasks.

https://arxiv.org/abs/2602.17097

AudioChat: Unified Audio Storytelling, Editing, and Understanding with Transfusion Forcing

William Chen, Prem Seetharaman, Rithesh Kumar, Oriol Nieto, Shinji Watanabe, Justin Salamon, Zeyu Jin

Despite recent breakthroughs, audio foundation models struggle in processing complex multi-source acoustic scenes. We refer to this challenging domain as audio stories, which can have multiple speakers and background/foreground sound effects. Compared to traditional audio processing tasks, audio stories introduce new layers of semantic, temporal, and physical complexity. To address this challenge, we propose AudioChat, a framework for developing audio foundation models that can generate, edit, and understand audio stories. AudioChat introduces a new paradigm in which LLM-based toolcalling agents simulate interactions between users and the system, and these simulated dialogues are used as training data. We also introduce a novel Audio Transfusion Forcing objective to train the AudioChat model, allowing it to simultaneously decompose high-level instructions via structured chain-of-thought reasoning and perform interactive multi-turn audio understanding/generation. To evaluate generation and editing performance, we develop three new metrics that directly measure task performance instead of relying upon distribution-based scoring. We highly encourage readers to visit our demo to better understand the capabilities of AudioChat: this https URL.

1.08K views, 22:08

Speech Technology

Modern flow matching

https://github.com/Aratako/Irodori-TTS

rectified flow + dacvae + text encoder with emojis

Samples in the cloning demo have noticeable noise, by the way; it seems DACVAE is not that great.
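For anyone unfamiliar with rectified flow sampling, here is a toy sketch of the idea: start from noise and integrate a velocity field from t=0 to t=1 with Euler steps. A real TTS model predicts the velocity with a trained network over latents; below, a closed-form velocity pointing at a known target stands in for it, so this is not Irodori-TTS code and all names are made up.

```python
def make_straight_velocity(target):
    """Toy stand-in for a trained network: velocity that moves x straight to `target` by t=1."""
    def velocity(x, t):
        return [(ti - xi) / (1.0 - t) for ti, xi in zip(target, x)]
    return velocity


def euler_sample(velocity, x0, steps=10):
    """Integrate dx/dt = velocity(x, t) from t=0 to t=1 with fixed-size Euler steps."""
    x = list(x0)
    dt = 1.0 / steps
    for k in range(steps):
        t = k * dt  # t stays below 1, so the toy velocity never divides by zero
        v = velocity(x, t)
        x = [xi + dt * vi for xi, vi in zip(x, v)]
    return x


noise = [0.3, -1.2, 0.8]    # stand-in for a Gaussian noise sample
target = [1.0, 2.0, -0.5]   # stand-in for a data (latent) sample
velocity = make_straight_velocity(target)

print(euler_sample(velocity, noise, steps=10))
print(euler_sample(velocity, noise, steps=1))
```

Because the path is a straight line, even a single Euler step reaches the target in this toy case, which is the intuition behind few-step sampling with rectified flows; a learned velocity field is only approximately straight.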

A Flow Matching-based Text-to-Speech Model with Emoji-driven Style Control - Aratako/Irodori-TTS

1.35K views, 07:55

Speech Technology

Good TTS speedups

https://github.com/andimarafioti/faster-qwen3-tts

1.42K views, edited 13:30

Speech Technology

Interesting job, those are rare nowadays

Bland.ai builds AI voice agents that handle real phone calls for some of the largest companies in the world. Our software runs inside critical workflows at companies like Samsara, Gallup, TripAdvisor, Snapchat, Signant Health, Better.com, and others. We have raised $65 million from top Silicon Valley investors including Emergence Capital, Scale Venture Partners, Y Combinator, and the founders of Twilio, Affirm, and ElevenLabs.

We are expanding our research team as we train and deploy our own TTS and STT models in production. We are also investing heavily in next generation speech to speech and speech inference systems.

We are currently hiring for two roles:

Research
If you have designed and trained your own models, published papers or in depth technical writing, and are working at the leading edge of audio research, we would love to hear from you:
https://jobs.ashbyhq.com/bland/d2e08077-61f0-4810-bc72-3efd7944647b

You might be a strong fit if you have experience with:
- Large scale TTS, STT, or neural audio codec systems
- Self supervised learning, generative modeling, or multimodal modeling
- Neural audio codecs, discrete or continuous latent representations, and compression tradeoffs
- Running tight ablations and controlled experiments that move ideas from hypothesis to validation quickly
- Optimizing inference for real time, low latency production systems

Machine Learning Engineer
If you are a strong programmer who enjoys building terabyte scale datasets, designing training pipelines, and working on model inference and deployment, while staying closely connected to research, apply here:
https://jobs.ashbyhq.com/bland/05906608-0628-412c-8b01-a050d87986c5

If you have any questions please feel free to shoot me a DM!

Machine Learning Researcher, Audio

1.07K views, 13:21

Speech Technology

Our friend @vancheeck recently pushed a new generation of an outstanding speaker identification architecture

https://github.com/PalabraAI/redimnet2

It is great that this project continues at Palabra https://www.palabra.ai

1.24K views, 00:30

Speech Technology

Fishaudio financials (and mention of S2)

https://x.com/rissa_cao/status/2029236698018914456

X (formerly Twitter)

Rissa Cao (@rissa_cao) on X

$0 → $10M ARR in 12 months.
Open source.
No sales team.
No paid ads.

Today:
• 5M users
• 1.5M+ MAU
• 2M public UGC voices (largest voice library in the world)
• ~50% revenue from enterprise

Here's the playbook behind @FishAudio🧵

1.3K views, 00:48

Speech Technology

IWSLT 2026 has some interesting competitions (like subtitling) with data available for download

https://iwslt.org/2026/subtitling

Evaluation period starts April 1st

Subtitling track

Home of the IWSLT conference and SIGSLT.

1.34K views, 00:51

Speech Technology

Two talks uploaded, interesting information in both:

State of the art in AudioLLMs (no hope compared to text ones)
https://www.youtube.com/watch?v=BJ3L0Kmz7Jw

Meeting transcription. LLMs are still bad at diarization, specialized systems (Diarizen + SE-Dicow) are much better
https://www.youtube.com/watch?v=2iIXUEnVkAA

Auden: Where is the “GPT moment” for audio? - Yiwen Shao

Talk 41 of the Conversational AI Reading Group "Auden: Where is the “GPT moment” for audio?" by Yiwen Shao - Tencent AI Lab.

For further information about the Reading Group, please check out https://poonehmousavi.github.io/rg

1.4K views, edited 22:47

Speech Technology

Google DeepMind released African ASR/TTS data, somewhat interesting

The WAXAL dataset is a large-scale multilingual speech corpus for African languages, introduced in the paper WAXAL: A Large-Scale Multilingual African Language Speech Corpus.

https://huggingface.co/datasets/google/WaxalNLP

981 views, 01:48

Speech Technology

Reasoning in audio LLMs is a problem

https://github.com/Blinorot/ALARM

https://arxiv.org/abs/2603.09556

This is the official implementation of ALARM: Audio–Language Alignment for Reasoning Models, an audio reasoning language model trained in a self-generation setup that achieves state-of-the-art performance on Speech Understanding benchmarks with a 4B backbone.

Abstract: Large audio language models (ALMs) extend LLMs with auditory understanding. A common approach freezes the LLM and trains only an adapter on self-generated targets. However, this fails for reasoning LLMs (RLMs) whose built-in chain-of-thought traces expose the textual surrogate input, yielding unnatural responses. We propose self-rephrasing, converting self-generated responses into audio-understanding variants compatible with RLMs while preserving distributional alignment. We further fuse and compress multiple audio encoders for stronger representations. For training, we construct a 6M-instance multi-task corpus (2.5M unique prompts) spanning 19K hours of speech, music, and sound. Our 4B-parameter ALM outperforms similarly sized models and surpasses most larger ALMs on related audio-reasoning benchmarks, while preserving textual capabilities with a low training cost. Notably, we achieve the best open-source result on the MMAU-speech and MMSU benchmarks and rank third among all the models.

1.06K views, 23:53

Speech Technology

https://huggingface.co/datasets/ai-coustics/dawn_chorus_en

dawn_chorus_en
An open-source evaluation dataset for accurate foreground speaker transcription.

The dataset targets mixture conditions where foreground speech remains generally transcribable by speech-to-text systems, while background speech is distinctly perceived as background. It provides around 90 minutes of foreground–background speech mixtures composed of recorded and synthesized foreground speech, along with ground truth foreground speech and corresponding transcripts.

Inspired by DAPS, which frames speech enhancement as a direct transformation from real-world device recordings to professionally produced studio speech via aligned input–output pairs, we design this dataset around an equally application-driven mapping: from realistic foreground–background speech mixtures to isolated primary-speaker speech that remains robustly transcribable by downstream STT systems. Like DAPS, our approach emphasizes time-aligned references and real recording / transmission conditions rather than purely synthetic degradations, enabling evaluation of suppression strength versus foreground speech distortion.

1.3K views, 00:14

Speech Technology

Nice upsampler - trained for music, supports upsampling from 8 kHz (important)

https://github.com/woongzip1/UniverSR

GitHub - woongzip1/UniverSR: Official implementation of UniverSR (ICASSP 2026)

1.27K views, 15:33

Speech Technology

DiTs power modern TTS systems, but their issues are rarely mentioned: longer training time and higher data requirements. Convolutions still make sense given that speech data is locally uniform. Research like this is still valuable for us GPU-poor guys

https://arxiv.org/abs/2603.09408v1

Reviving ConvNeXt for Efficient Convolutional Diffusion Models

Recent diffusion models increasingly favor Transformer backbones, motivated by the remarkable scalability of fully attentional architectures. Yet the locality bias, parameter efficiency, and...

1.26K views, edited 17:48

Speech Technology

Just another reminder there is no point in ONNX

https://github.com/eschmidbauer/moonshine-c

The source is pure C, 825 lines of code, and the executable is 40 KB. It runs ASR just fine.

1.14K views, 22:19