Speech Technology

Matcha and follow-up repos combine duration network training with flow matching decoder training. DeepSeek's comment on it is above.

The RapFlow paper correctly suggests freezing the encoder while training the flow matching (FM) decoder:

https://arxiv.org/abs/2506.16741

Funny that DeepSeek cites a non-existent paper as a confirming source, though:

"Gradient Conflicts in Multi-Objective Generative Modeling" (Zhang et al., ICML 2023) [arXiv:2302.08954]

1.02K views, edited 16:10

Speech Technology

NonVerbalSpeech-38K:
A Scalable Pipeline for Enabling Non-Verbal Speech Generation and Understanding
Anonymous submission

https://huggingface.co/datasets/nonverbalspeech/nonverbalspeech38k

Abstract: Human spoken communication involves not only lexical content but also non-verbal vocalizations (NVs) such as laughter, sighs, and coughs, which convey emotions, intentions, and social signals. However, most existing speech systems focus solely on verbal content and lack the ability to understand and generate such non-verbal cues, reducing the emotional intelligence and communicative richness of spoken interfaces. In this work, we introduce NonVerbalSpeech-38K, a large and diverse dataset for non-verbal speech generation and understanding, collected from real-world media and annotated using an automatic pipeline. The dataset contains 38,718 samples (about 131 hours) with 10 categories of non-verbal cues, such as laughter, sniff, and throat clearing. We further validate the dataset by fine-tuning state-of-the-art models, including F5-TTS and Qwen2-Audio, demonstrating its effectiveness in non-verbal speech generation and understanding tasks. Our contributions are threefold: (1) We propose a practical pipeline for building natural and diverse non-verbal speech datasets; (2) We release a large-scale dataset to advance research on non-verbal speech generation and understanding; (3) We validate the dataset’s effectiveness by demonstrating improvements in both non-verbal speech synthesis and captioning, thereby facilitating richer human-computer interaction.

https://nonverbalspeech38k.github.io/nonverspeech38k/

1.05K views, edited 12:45

Speech Technology

Somewhat interesting tech

liquid-audio-nets implements Liquid Neural Networks (LNNs) optimized for ultra-low-power audio processing on edge devices. Based on 2025 field tests showing 10× power reduction compared to CNNs, this library enables always-on audio sensing for battery-powered IoT devices.

https://github.com/danieleschmidt/liquid-audio-nets

Key Innovations

Continuous-Time Dynamics: ODEs instead of discrete layers
Adaptive Computation: Timestep scales with signal complexity
Sparse Activation: Only necessary neurons fire
State Persistence: Temporal memory without explicit recurrence
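A minimal sketch of the first two points, assuming the liquid time-constant (LTC) formulation from the LNN literature and not this repo's actual code. A single neuron follows dx/dt = -(1/tau + f) * x + f * a and is integrated with explicit Euler steps. The rule for choosing the number of sub-steps and all parameter values are hypothetical, picked only for illustration.

```python
import math


def ltc_step(x, inp, dt, tau=1.0, a=1.0, w=0.5, u=1.0, b=0.0):
    """One explicit-Euler step of a single liquid time-constant neuron.

    dx/dt = -(1/tau + f) * x + f * a, with f = sigmoid(w*x + u*inp + b).
    Parameter values are illustrative, not taken from the repo.
    """
    f = 1.0 / (1.0 + math.exp(-(w * x + u * inp + b)))
    dxdt = -(1.0 / tau + f) * x + f * a
    return x + dt * dxdt


def substeps(prev_inp, inp, max_steps=8):
    """Hypothetical rule: spend more Euler sub-steps when the input jumps."""
    change = abs(inp - prev_inp)
    return min(max_steps, 1 + int(10 * change))


def run(signal, sample_dt=0.1, x0=0.0):
    """Integrate the neuron over input samples; return states and step count."""
    x, prev = x0, signal[0]
    states, total_steps = [], 0
    for inp in signal:
        n = substeps(prev, inp)
        for _ in range(n):
            x = ltc_step(x, inp, sample_dt / n)
        total_steps += n
        states.append(x)  # x carries over between samples: the "state persistence"
        prev = inp
    return states, total_steps
```

On a flat input the loop does one step per sample; a jump in the input triggers extra sub-steps for that sample only, which is the kind of saving the adaptive computation claim points at.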

1.37K views, 12:51

Speech Technology

So the situation in the LLM world is that they have basically indexed all of the available internet and now try to maximize the effect with test-time compute.

People say GPT-5, for example, traded fewer layers for more test-time tokens. A paper on the subject from DeepMind:

Scaling LLM Test-Time Compute Optimally can be More Effective than Scaling Model Parameters
https://arxiv.org/abs/2408.03314

Speech is a few years behind as usual, and there are not many test-time compute papers yet (although MAP adaptation was a thing a long time ago). But it is sure to become popular soon.

Chain-of-Thought Training for Open E2E Spoken Dialogue Systems
https://arxiv.org/abs/2506.00722
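One of the simplest ways to spend test-time compute is best-of-N sampling: draw several candidates and keep the one a scorer likes most. The sketch below only illustrates that idea. `toy_generate` and `toy_score` are hypothetical stand-ins for a model sampler and a verifier, not anything from the papers above.

```python
import random
from typing import Callable, List, Tuple


def best_of_n(
    generate: Callable[[random.Random], str],
    score: Callable[[str], float],
    n: int,
    seed: int = 0,
) -> Tuple[str, List[str]]:
    """Draw n candidates and return the highest-scoring one plus all candidates."""
    if n < 1:
        raise ValueError("n must be at least 1")
    rng = random.Random(seed)
    candidates = [generate(rng) for _ in range(n)]
    return max(candidates, key=score), candidates


def toy_generate(rng: random.Random) -> str:
    # Stand-in for sampling from a model: a noisy guess at the number 42.
    return str(42 + rng.randint(-5, 5))


def toy_score(candidate: str) -> float:
    # Stand-in for a verifier or reward model: closer to 42 is better.
    return -abs(int(candidate) - 42)


best, candidates = best_of_n(toy_generate, toy_score, n=16)
```

Raising `n` costs more compute at inference time and can only improve the best score found, which is the trade-off against a larger model.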

arXiv.org

Enabling LLMs to improve their outputs by using more test-time computation is a critical step towards building generally self-improving agents that can operate on open-ended natural language. In...

1.19K views, 21:14

Speech Technology

Well, another recent CoT repo:

https://github.com/FunAudioLLM/ThinkSound

GitHub

GitHub - FunAudioLLM/ThinkSound: [NeurIPS 2025] PyTorch implementation of ThinkSound, a unified framework for generating audio from any modality, guided by Chain-of-Thought (CoT) reasoning.

1.37K views, 21:19

Speech Technology

https://seed.bytedance.com/en/seed_liveinterpret

https://arxiv.org/abs/2507.17527

Seed LiveInterpret 2.0: End-to-end Simultaneous Speech-to-speech Translation with Your Voice

Shanbo Cheng et al.

Simultaneous Interpretation (SI) represents one of the most daunting frontiers in the translation industry, with product-level automatic systems long plagued by intractable challenges: subpar transcription and translation quality, lack of real-time speech generation, multi-speaker confusion, and translated speech inflation, especially in long-form discourses. In this study, we introduce Seed-LiveInterpret 2.0, an end-to-end SI model that delivers high-fidelity, ultra-low-latency speech-to-speech generation with voice cloning capabilities. As a fully operational product-level solution, Seed-LiveInterpret 2.0 tackles these challenges head-on through our novel duplex speech-to-speech understanding-generating framework...

1.72K views, edited 20:13

Speech Technology

Interspeech 2025 starts tomorrow; I have yet to read the papers.

Interesting that some people are leaving speech; the Mesolitica developer, for example, said he has released his last model:

https://x.com/huseinzol05/status/1956638778367578265

Just learned Alan Black retired to Alaska some time ago:

https://www.cs.cmu.edu/~awb/

Not many familiar names in the IS papers either; so many people are gone.

X (formerly Twitter)

husein (@huseinzol05) on X

Thank u everyone, bbye!

1.31K views, 14:14

Speech Technology

Comprehensive Google survey of lightweight keyword spotting:

https://github.com/google-research/google-research/tree/master/kws_streaming#streamable-and-non-streamable-models

This model is recommended on our Reddit. Just 10k params:

https://github.com/Qualcomm-AI-research/bcresnet

From our Reddit:

https://www.reddit.com/r/speechtech/comments/1mmrc3b/comment/n93hm1h/

GitHub

google-research/kws_streaming at master · google-research/google-research

1.49K views, 08:33

Speech Technology

Diffusion in ASR too. No code yet; hopefully it will be there soon. Nice benchmarks: Gemini comes out on top on speech (confirmed by our tests too).

https://arxiv.org/abs/2507.18452

DIFFA: Large Language Diffusion Models Can Listen and Understand
Jiaming Zhou, Hongjie Chen, Shiwan Zhao, Jian Kang, Jie Li, Enzhi Wang, Yujie Guo, Haoqin Sun, Hui Wang, Aobo Kong, Yong Qin, Xuelong Li
Recent advances in large language models (LLMs) have shown remarkable capabilities across textual and multimodal domains. In parallel, diffusion-based language models have emerged as a promising alternative to the autoregressive paradigm, offering improved controllability, bidirectional context modeling, and robust generation. However, their application to the audio modality remains underexplored. In this work, we introduce DIFFA, the first diffusion-based large audio-language model designed to perform spoken language understanding. DIFFA integrates a frozen diffusion language model with a lightweight dual-adapter architecture that bridges speech understanding and natural language reasoning. We employ a two-stage training pipeline: first, aligning semantic representations via an ASR objective; then, learning instruction-following abilities through synthetic audio-caption pairs automatically generated by prompting LLMs. Despite being trained on only 960 hours of ASR and 127 hours of synthetic instruction data, DIFFA demonstrates competitive performance on major benchmarks, including MMSU, MMAU, and VoiceBench, outperforming several autoregressive open-source baselines. Our results reveal the potential of diffusion-based language models for efficient and scalable audio understanding, opening a new direction for speech-driven AI. Our code will be available at this https URL.

1.23K views, 05:32

Speech Technology

While some things are questionable, the return to phonemes is nice:

https://github.com/tabahi/contexless-phonemes-CUPE

https://github.com/tabahi/bournemouth-forced-aligner

GitHub

GitHub - tabahi/contexless-phonemes-CUPE: pytorch model for contexless-phoneme prediction from speech audio

1.14K views, 14:19

Speech Technology

For us flow matching guys

https://github.com/primepake/F5-TTS-meanflow

https://arxiv.org/abs/2505.13447

Mean Flows for One-step Generative Modeling

Zhengyang Geng, Mingyang Deng, Xingjian Bai, J. Zico Kolter, Kaiming He

We propose a principled and effective framework for one-step generative modeling. We introduce the notion of average velocity to characterize flow fields, in contrast to instantaneous velocity modeled by Flow Matching methods. A well-defined identity between average and instantaneous velocities is derived and used to guide neural network training. Our method, termed the MeanFlow model, is self-contained and requires no pre-training, distillation, or curriculum learning. MeanFlow demonstrates strong empirical performance: it achieves an FID of 3.43 with a single function evaluation (1-NFE) on ImageNet 256x256 trained from scratch, significantly outperforming previous state-of-the-art one-step diffusion/flow models. Our study substantially narrows the gap between one-step diffusion/flow models and their multi-step predecessors, and we hope it will motivate future research to revisit the foundations of these powerful models.
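A sketch of the identity the abstract refers to. The notation is assumed here for illustration: z_t is the state at time t, v the instantaneous velocity that Flow Matching models, and u the average velocity over the interval [r, t].

```latex
% Definition of average velocity, and the identity obtained by
% differentiating (t - r) u(z_t, r, t) with respect to t:
u(z_t, r, t) = \frac{1}{t - r} \int_r^t v(z_\tau, \tau)\, d\tau
\quad\Longrightarrow\quad
u(z_t, r, t) = v(z_t, t) - (t - r)\, \frac{d}{dt} u(z_t, r, t)

% Total derivative along the trajectory (dz_t/dt = v, r held fixed):
\frac{d}{dt} u(z_t, r, t) = v(z_t, t)\, \partial_z u(z_t, r, t) + \partial_t u(z_t, r, t)

% One-step sampling then jumps across the whole interval at once:
z_r = z_t - (t - r)\, u(z_t, r, t)
```

With r = 0 and t = 1 the last line is a single network evaluation, which is where the 1-NFE number comes from.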

GitHub

GitHub - primepake/F5-TTS-meanflow: Meanflow for F5-TTS model

1.47K views, 14:22

Speech Technology

Microsoft released a TTS model; it should be good:

https://github.com/microsoft/VibeVoice

GitHub

GitHub - microsoft/VibeVoice: Open-Source Frontier Voice AI

1.53K views, 18:31

Speech Technology

Another TTS thing; the claimed results are very good:

https://github.com/HeCheng0625/Diffusion-Speech-Tokenizer

GitHub

This repository contains a series of works on diffusion-based speech tokenizers, including the official implementation of the paper: "TaDiCodec: Text-aware Diffusion Speech Tokenizer for S...

1.47K views, 21:00

Speech Technology

CoLMbo is a Speaker Language Model (SLM) designed to go beyond traditional speaker recognition. While most systems stop at identifying “who” the speaker is, CoLMbo answers “what is this speaker like?” by generating context-rich, descriptive captions from speaker embeddings including gender, age, personality, and dialect.

https://github.com/massabaali7/CoLMbo

GitHub

GitHub - massabaali7/CoLMbo: Speaker Language Model

1.58K views, edited 01:00

Speech Technology

This is nice; CREPE is extremely slow:

https://github.com/lars76/swift-f0

GitHub

GitHub - lars76/swift-f0: Fast and accurate fundamental frequency (F0) detector using convolutional neural networks

1.67K views, 01:06

Speech Technology

Someone proposed a model for the HF Open ASR Leaderboard, with an average WER of 3.1% compared to the previous best of 6.1%:

https://github.com/huggingface/open_asr_leaderboard/pull/92#issuecomment-3239312224

WER on LibriSpeech test-clean is 0.71, quite a bold claim.

This suggests the importance of closed-source tests.
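For scale, WER is the word-level edit distance between reference and hypothesis divided by the number of reference words. A minimal sketch follows, assuming plain whitespace tokenization. Real leaderboards also normalize text before scoring, which is not done here.

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: word-level edit distance divided by reference length."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] is the edit distance between the reference words
    # processed so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        cur = [i] + [0] * len(hyp)
        for j, h in enumerate(hyp, start=1):
            cur[j] = min(
                prev[j] + 1,              # deletion of a reference word
                cur[j - 1] + 1,           # insertion of a hypothesis word
                prev[j - 1] + (r != h),   # substitution, or match at no cost
            )
        prev = cur
    return prev[-1] / len(ref)


# One substitution (sat -> sit) and one deletion (the) over six reference words.
example = wer("the cat sat on the mat", "the cat sit on mat")
```

If the 0.71 above is a percentage like the leaderboard averages, it means roughly 7 word errors per 1000 reference words.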

GitHub

Add Whisper-based SOTA model (record-breaking WER) by vivek-shunyalabs · Pull Request #92 · huggingface/open_asr_leaderboard

Hello Open-ASR team,
This PR adds my Whisper-based ASR model to the leaderboard. The modification is minimal yet it represents a model that has achieved record-breaking WER in evaluation.
This mode...

1.56K views, edited 13:43

Speech Technology

Everyone talks about smart VAD these days. Backchannel actions are also important:

https://github.com/Linyx1125/MM-F2F

GitHub

GitHub - Linyx1125/MM-F2F: [ACL 2025] Predicting Turn-Taking and Backchannel in Human-Machine Conversations Using Linguistic, Acoustic, and Visual Signals

1.36K views, 16:32

Speech Technology

https://arxiv.org/abs/2506.21619

IndexTTS2: A Breakthrough in Emotionally Expressive and Duration-Controlled Auto-Regressive Zero-Shot Text-to-Speech
Siyi Zhou, Yiquan Zhou, Yi He, Xun Zhou, Jinchao Wang, Wei Deng, Jingchen Shu
Existing autoregressive large-scale text-to-speech (TTS) models have advantages in speech naturalness, but their token-by-token generation mechanism makes it difficult to precisely control the duration of synthesized speech. This becomes a significant limitation in applications requiring strict audio-visual synchronization, such as video dubbing. This paper introduces IndexTTS2, which proposes a novel, general, and autoregressive model-friendly method for speech duration control. The method supports two generation modes: one explicitly specifies the number of generated tokens to precisely control speech duration; the other freely generates speech in an autoregressive manner without specifying the number of tokens, while faithfully reproducing the prosodic features of the input prompt. Furthermore, IndexTTS2 achieves disentanglement between emotional expression and speaker identity, enabling independent control over timbre and emotion. In the zero-shot setting, the model can accurately reconstruct the target timbre (from the timbre prompt) while perfectly reproducing the specified emotional tone (from the style prompt). To enhance speech clarity in highly emotional expressions, we incorporate GPT latent representations and design a novel three-stage training paradigm to improve the stability of the generated speech. Additionally, to lower the barrier for emotional control, we designed a soft instruction mechanism based on text descriptions by fine-tuning Qwen3, effectively guiding the generation of speech with the desired emotional orientation. Finally, experimental results on multiple datasets show that IndexTTS2 outperforms state-of-the-art zero-shot TTS models in terms of word error rate, speaker similarity, and emotional fidelity. Audio samples are available at: this https URL

1.55K views, 16:35

Speech Technology

Meet Chatterbox Multilingual! 🔥

Production grade. Open source. Voice Cloning in 23 languages. Emotion and intensity control. PerTh watermarking on by default. MIT license. Free forever.
You asked for this, we delivered.

Chatterbox Multilingual adds zero-shot voice cloning in 23 languages from Arabic and Hindi to Chinese and Swahili.

https://github.com/resemble-ai/chatterbox

Arabic (ar) • Danish (da) • German (de) • Greek (el) • English (en) • Spanish (es) • Finnish (fi) • French (fr) • Hebrew (he) • Hindi (hi) • Italian (it) • Japanese (ja) • Korean (ko) • Malay (ms) • Dutch (nl) • Norwegian (no) • Polish (pl) • Portuguese (pt) • Russian (ru) • Swedish (sv) • Swahili (sw) • Turkish (tr) • Chinese (zh)

GitHub

GitHub - resemble-ai/chatterbox: SoTA open-source TTS

2.35K views, 20:10

Speech Technology

https://github.com/Tobertz-max/DiFlow-TTS

DiFlow-TTS delivers low-latency, zero-shot text-to-speech through discrete flow matching and factorized speech tokens. It combines a compact token representation with a flow-based sampler to produce natural speech quickly, even for unseen speakers and languages.

GitHub

DiFlow-TTS delivers low-latency zero-shot TTS via discrete flow matching and factorized speech tokens. A compact, open framework for fast voice synthesis.🐙 - Tobertz-max/DiFlow-TTS

1.55K views, 05:07

Speech Technology

From DeepMind

https://www.arxiv.org/abs/2509.05256

Recomposer: Event-roll-guided generative audio editing

Daniel P. W. Ellis, Eduardo Fonseca, Ron J. Weiss, Kevin Wilson, Scott Wisdom, Hakan Erdogan, John R. Hershey, Aren Jansen, R. Channing Moore, Manoj Plakal

Editing complex real-world sound scenes is difficult because individual sound sources overlap in time. Generative models can fill-in missing or corrupted details based on their strong prior understanding of the data domain. We present a system for editing individual sound events within complex scenes able to delete, insert, and enhance individual sound events based on textual edit descriptions (e.g., "enhance Door") and a graphical representation of the event timing derived from an "event roll" transcription. We present an encoder-decoder transformer working on SoundStream representations, trained on synthetic (input, desired output) audio example pairs formed by adding isolated sound events to dense, real-world backgrounds. Evaluation reveals the importance of each part of the edit descriptions: action, class, timing. Our work demonstrates "recomposition" is an important and practical application.

1.49K views, 20:17
