Speech Technology

Google released open source medical dictation model

https://huggingface.co/google/medasr

The counter intuitive thing that this relatively small model beats Gemini 2.5 Pro by a large margin. Probably test is just biased, it is hard to imagine advanced Gemini model can't sort out important things.

huggingface.co

google/medasr · Hugging Face

We’re on a journey to advance and democratize artificial intelligence through open source and open science.

1.48K views22:45

Speech Technology

This model seems interesting. Model trained on 1.3M hours of data.

> Through this meticulous two-stage process, we obtain approximately 1,000 hours of high-quality speech with detailed paralinguistic annotations, providing a robust foundation for expressive and context-aware speech synthesis.

https://huggingface.co/Soul-AILab/SoulX-Podcast-1.7B

https://arxiv.org/abs/2510.23541

SoulX-Podcast: Towards Realistic Long-form Podcasts with Dialectal and Paralinguistic Diversity

Hanke Xie, Haopeng Lin, Wenxiao Cao, Dake Guo, Wenjie Tian, Jun Wu, Hanlin Wen, Ruixuan Shang, Hongmei Liu, Zhiqi Jiang, Yuepeng Jiang, Wenxi Chen, Ruiqi Yan, Jiale Qian, Yichao Yan, Shunshun Yin, Ming Tao, Xie Chen, Lei Xie, Xinsheng Wang

Recent advances in text-to-speech (TTS) synthesis have significantly improved speech expressiveness and naturalness. However, most existing systems are tailored for single-speaker synthesis and fall short in generating coherent multi-speaker conversational speech. This technical report presents SoulX-Podcast, a system designed for podcast-style multi-turn, multi-speaker dialogic speech generation, while also achieving state-of-the-art performance in conventional TTS tasks.
To meet the higher naturalness demands of multi-turn spoken dialogue, SoulX-Podcast integrates a range of paralinguistic controls and supports both Mandarin and English, as well as several Chinese dialects, including Sichuanese, Henanese, and Cantonese, enabling more personalized podcast-style speech generation. Experimental results demonstrate that SoulX-Podcast can continuously produce over 90 minutes of conversation with stable speaker timbre and smooth speaker transitions. Moreover, speakers exhibit contextually adaptive prosody, reflecting natural rhythm and intonation changes as dialogues progress. Across multiple evaluation metrics, SoulX-Podcast achieves state-of-the-art performance in both monologue TTS and multi-turn conversational speech synthesis.

huggingface.co

Soul-AILab/SoulX-Podcast-1.7B · Hugging Face

We’re on a journey to advance and democratize artificial intelligence through open source and open science.

1.09K viewsedited 20:08

Speech Technology

Long-term modeling in speech also close

https://www.linkedin.com/posts/longshen-ou_phrasevae-and-phraseldm-latent-diffusion-activity-7408804594631270400-a_Ur/

We solved the long-sequence problem in symbolic music generation 🎶 Here is how we reduce sequence length from 10k+ to just 512, and directly model an entire song, not musical excerpts.

🎧 Demo: https://lnkd.in/g6Y7V9_8

Full-song symbolic music generation has long been constrained by extremely long token sequences, limited context length, and weak support for global structure. Most existing models still operate at the note-attribute level and generate music autoregressively, note by note, and segment by segment.

In our new technical report, we introduce PhraseVAE and PhraseLDM—a phrase-level latent diffusion framework for full-song multitrack symbolic music generation.

Key ideas:
🔹 Shift the modeling unit from note-attribute tokens to musically meaningful phrases, reducing full-song context from 10k+ tokens to 512 latents.
🔹 PhraseVAE compresses variable-length polyphonic note sequences (with instrument identity) into compact 64-D phrase-level latents, achieving near-perfect reconstruction (99.0% F1_op).
🔹 PhraseVAE introduces multi-query compression and a progressive bottleneck training strategy for high-fidelity yet compact representations.
🔹 Built on this latent space, PhraseLDM generates an entire multitrack song in a single pass—without any autoregressive components.
🔹 The framework supports up to 128 bars (~8 minutes at 64 BPM) and produces complete songs with coherent local texture, idiomatic instrument usage, and clear global structure.

Both PhraseVAE and PhraseLDM inherit the REMI-z symbolic grammar introduced in our previous NeurIPS work, which makes phrase-level compression and full-song modeling intuitive and musically grounded.

With only 45M parameters, the system can generate a full multitrack song within seconds, offering a practical and scalable alternative to note-attribute autoregressive models.

📄 Paper (arXiv): https://arxiv.org/abs/2512.11348
🎧 Demo & samples: https://www.oulongshen.xyz/midi_ldm

“Beginners learn pitch and rhythm.
Intermediates learn how notes are arranged.
Masters express meaning through phrases — and models should do the same.”

PhraseVAE and PhraseLDM: Latent Diffusion for Full-Song Multitrack Symbolic Music Generation | Longshen Ou

We solved the long-sequence problem in symbolic music generation 🎶 Here is how we reduce sequence length from 10k+ to just 512, and directly model an entire song, not musical excerpts.

🎧 Demo: https://lnkd.in/g6Y7V9_8

Full-song symbolic music generation…

1.16K views15:33

Speech Technology

Small autoregressive TTS is doing pretty well recently. Examples are

https://github.com/taylorchu/2cent-tts (100M params)

https://github.com/ekwek1/soprano (80M params)

https://github.com/neuphonic/neutts-air (500M params)

GitHub

GitHub - taylorchu/2cent-tts

Contribute to taylorchu/2cent-tts development by creating an account on GitHub.

1.45K views18:18

Speech Technology

Another Alibaba release

https://github.com/FunAudioLLM/Fun-Audio-Chat

GitHub

GitHub - FunAudioLLM/Fun-Audio-Chat: Fun-Audio-Chat is a Large Audio Language Model built for natural, low-latency voice interactions.

Fun-Audio-Chat is a Large Audio Language Model built for natural, low-latency voice interactions. - FunAudioLLM/Fun-Audio-Chat

1.66K views18:21

Speech Technology

Cascaded systems remain most reliable

Hearing to Translate: The Effectiveness of Speech Modality Integration into LLMs

https://arxiv.org/abs/2512.16378

Sara Papi, Javier Garcia Gilabert, Zachary Hopton, Vilém Zouhar, Carlos Escolano, Gerard I. Gállego, Jorge Iranzo-Sánchez, Ahrii Kim, Dominik Macháček, Patricia Schmidtova, Maike Züfle

As Large Language Models (LLMs) expand beyond text, integrating speech as a native modality has given rise to SpeechLLMs, which aim to translate spoken language directly, thereby bypassing traditional transcription-based pipelines. Whether this integration improves speech-to-text translation quality over established cascaded architectures, however, remains an open question. We present Hearing to Translate, the first comprehensive test suite rigorously benchmarking 5 state-of-the-art SpeechLLMs against 16 strong direct and cascade systems that couple leading speech foundation models (SFM), with multilingual LLMs. Our analysis spans 16 benchmarks, 13 language pairs, and 9 challenging conditions, including disfluent, noisy, and long-form speech. Across this extensive evaluation, we find that cascaded systems remain the most reliable overall, while current SpeechLLMs only match cascades in selected settings and SFMs lag behind both, highlighting that integrating an LLM, either within the model or in a pipeline, is essential for high-quality speech translation.

arXiv.org

Hearing to Translate: The Effectiveness of Speech Modality...

As Large Language Models (LLMs) expand beyond text, integrating speech as a native modality has given rise to SpeechLLMs, which directly process spoken language and enable speech-to-text...

1.4K views23:39

Speech Technology

ASR-optimized noise cancellation becomes a hot topic, demonstrating the gap in ASR system training

https://ai-coustics.com/2025/11/20/quail-stt-asr-transcription/

another one

https://www.sanas.ai/blog/inside-sanas-asr-optimized-noise-cancellation-for-agentic-ai

Ai-Coustics

Meet Quail: Improving transcription in every condition | ai-coustics | Audio intelligence

Real-world audio breaks many STT systems. Quail is a speech enhancement model built to improve transcription accuracy across noisy environments.

1.2K views18:41

Speech Technology

NVIDIA just released Nemotron Speech ASR:
🤖0.6B streaming cache-aware transducer
📉low latency (down to 80ms)
📈high throughput (up to 900 concurrent streams on H100)
🎮adjustable latency-throughput-accuracy trade-off without re-training
🌎English ASR
🔗https://huggingface.co/nvidia/nemotron-speech-streaming-en-0.6b

huggingface.co

nvidia/nemotron-speech-streaming-en-0.6b · Hugging Face

We’re on a journey to advance and democratize artificial intelligence through open source and open science.

1.16K views23:28

Speech Technology

LFM released

https://huggingface.co/LiquidAI/LFM2.5-Audio-1.5B

LFM2.5-Audio-1.5B is Liquid AI's updated end-to-end audio foundation model. Key improvements include a custom, LFM based audio detokenizer, llama.cpp compatible GGUFs for CPU inference, and better ASR and TTS performance.

LFM2.5-Audio is an end-to-end multimodal speech and text language model, and as such does not require separate ASR and TTS components. Designed with low latency and real time conversation in mind, at only 1.5 billion parameters LFM2.5-Audio enables seamless conversational interaction, achieving capabilities on par with much larger models. Our model consists of a pretrained LFM2.5 model as its multimodal backbone, along with a FastConformer based audio encoder to handle continuous audio inputs, and an RQ-transformer generating discrete tokens coupled with a lightweight audio detokenizer for audio output.

huggingface.co

LiquidAI/LFM2.5-Audio-1.5B · Hugging Face

We’re on a journey to advance and democratize artificial intelligence through open source and open science.

1.3K views00:08

Speech Technology

Nice effort and dataset, Russian is not great somehow despite 27k hours training data

https://lemas-project.github.io/LEMAS-Project/

lemas-project.github.io

LEMAS Project Page

LEMAS: A 150K-Hour Large-scale Extensible Multilingual Audio Suite with Generative Speech Models

1.13K views14:08

Speech Technology

Good in-depth research. Codecs could be better with a simple changes

https://arxiv.org/abs/2512.20211

Code here

https://github.com/sizigi/AliasingFreeNeuralAudioSynthesis

Aliasing-Free Neural Audio Synthesis

Yicheng Gu, Junan Zhang, Chaoren Wang, Jerry Li, Zhizheng Wu, Lauri Juvela

Neural vocoders and codecs reconstruct waveforms from acoustic representations, which directly impact the audio quality. Among existing methods, upsampling-based time-domain models are superior in both inference speed and synthesis quality, achieving state-of-the-art performance. Still, despite their success in producing perceptually natural sound, their synthesis fidelity remains limited due to the aliasing artifacts brought by the inadequately designed model architectures. In particular, the unconstrained nonlinear activation generates an infinite number of harmonics that exceed the Nyquist frequency, resulting in ``folded-back'' aliasing artifacts. The widely used upsampling layer, ConvTranspose, copies the mirrored low-frequency parts to fill the empty high-frequency region, resulting in ``mirrored'' aliasing artifacts. Meanwhile, the combination of its inherent periodicity and the mirrored DC bias also brings ``tonal artifact,'' resulting in constant-frequency ringing. This paper aims to solve these issues from a signal processing perspective. Specifically, we apply oversampling and anti-derivative anti-aliasing to the activation function to obtain its anti-aliased form, and replace the problematic ConvTranspose layer with resampling to avoid the ``tonal artifact'' and eliminate aliased components. Based on our proposed anti-aliased modules, we introduce Pupu-Vocoder and Pupu-Codec, and release high-quality pre-trained checkpoints to facilitate audio generation research. We build a test signal benchmark to illustrate the effectiveness of the anti-aliased modules, and conduct experiments on speech, singing voice, music, and audio to validate our proposed models. Experimental results confirm that our lightweight Pupu-Vocoder and Pupu-Codec models can easily outperform existing systems on singing voice, music, and audio, while achieving comparable performance on speech.

arXiv.org

Aliasing-Free Neural Audio Synthesis

Neural vocoders and codecs reconstruct waveforms from acoustic representations, which directly impact the audio quality. Among existing methods, upsampling-based time-domain models are superior in...

1.29K views01:00

Speech Technology

Interesting classification of events continuous like <singing>...</singing> and standalone like [cough]

https://huggingface.co/datasets/yfish/WESR-Bench

https://arxiv.org/abs/2601.04508

WESR: Scaling and Evaluating Word-level Event-Speech Recognition

Chenchen Yang, Kexin Huang, Liwei Fan, Qian Tu, Botian Jiang, Dong Zhang, Linqi Yin, Shimin Li, Zhaoye Fei, Qinyuan Cheng, Xipeng Qiu

Speech conveys not only linguistic information but also rich non-verbal vocal events such as laughing and crying. While semantic transcription is well-studied, the precise localization of non-verbal events remains a critical yet under-explored challenge. Current methods suffer from insufficient task definitions with limited category coverage and ambiguous temporal granularity. They also lack standardized evaluation frameworks, hindering the development of downstream applications. To bridge this gap, we first develop a refined taxonomy of 21 vocal events, with a new categorization into discrete (standalone) versus continuous (mixed with speech) types. Based on the refined taxonomy, we introduce WESR-Bench, an expert-annotated evaluation set (900+ utterances) with a novel position-aware protocol that disentangles ASR errors from event detection, enabling precise localization measurement for both discrete and continuous events. We also build a strong baseline by constructing a 1,700+ hour corpus, and train specialized models, surpassing both open-source audio-language models and commercial APIs while preserving ASR quality. We anticipate that WESR will serve as a foundational resource for future research in modeling rich, real-world auditory scenes.

huggingface.co

yfish/WESR-Bench · Datasets at Hugging Face

We’re on a journey to advance and democratize artificial intelligence through open source and open science.

1.25K views00:53

Speech Technology

More Anime. At modern state of research it is harmful to segment audio on chunks

https://huggingface.co/datasets/OmniAICreator/ASMR-Archive-Processed

1.53K viewsedited 01:30

Speech Technology

Nice and somewhat reasonable to separate experts by modality

https://github.com/NUS-HPC-AI-Lab/MoST

GitHub

GitHub - NUS-HPC-AI-Lab/MoST: MoST: Mixing Speech and Text with Modality-Aware Mixture of Experts

MoST: Mixing Speech and Text with Modality-Aware Mixture of Experts - NUS-HPC-AI-Lab/MoST

1.3K views23:13

Speech Technology

Another nice model for songs ASR and TTS

https://github.com/HeartMuLa/heartlib

GitHub

GitHub - HeartMuLa/heartlib: HeartMuLa Official Repo: The Most Powerful Open-Source Music Generation Model of 2026

HeartMuLa Official Repo: The Most Powerful Open-Source Music Generation Model of 2026 - HeartMuLa/heartlib

1.36K views23:16

Speech Technology

The quality of NovaSR frequently mentioned recently

https://github.com/ysharma3501/NovaSR/issues/6

GitHub

This project just interpolates? · Issue #6 · ysharma3501/NovaSR

There is an old trick where you can take 16khz audio and do linear interpolation on it which will fill out the spectrum. This creates very visible "mirroring" at 8khz and 16khz if you int...

1.4K views17:17

Speech Technology

Japanese dataset for emotions with descriptions in natural language. Good development from very generic emotion labels and even more confusing emotion vectors

https://github.com/UEC-InabaLab/ETCDataset

GitHub

GitHub - UEC-InabaLab/ETCDataset

Contribute to UEC-InabaLab/ETCDataset development by creating an account on GitHub.

1.66K views20:41

Speech Technology

Many releases recently, including qwen-tts and others. Another LLM

https://github.com/FlashLabs-AI-Corp/FlashLabs-Chroma

GitHub

GitHub - FlashLabs-AI-Corp/FlashLabs-Chroma: Worlds first open-source real-time end-to-end spoken dialogue model with personalized…

Worlds first open-source real-time end-to-end spoken dialogue model with personalized voice cloning. - FlashLabs-AI-Corp/FlashLabs-Chroma

1.86K views13:42

Speech Technology

Vibevoice finetune for European languages, good results compared to baseline

https://huggingface.co/kugelaudio/kugelaudio-0-open

1.53K views02:21

Speech Technology

https://www.assemblyai.com/universal-3-pro new model by assembly ai, LLM based. Supposed to be free for February, so a good chance to test.

Assemblyai

Universal-3 Pro by AssemblyAI

Introducing Universal-3 Pro, a first of its kind promptable speech language model. Control transcription using natural language prompting and domain context.

1.2K views00:05

Speech Technology

Interesting effort from Shinji on phoneme recognition

https://huggingface.co/espnet/powsm

https://arxiv.org/abs/2510.24992

POWSM: A Phonetic Open Whisper-Style Speech Foundation Model

Chin-Jou Li, Kalvin Chang, Shikhar Bharadwaj, Eunjung Yeo, Kwanghee Choi, Jian Zhu, David Mortensen, Shinji Watanabe

Recent advances in spoken language processing have led to substantial progress in phonetic tasks such as automatic speech recognition (ASR), phone recognition (PR), grapheme-to-phoneme conversion (G2P), and phoneme-to-grapheme conversion (P2G). Despite their conceptual similarity, these tasks have largely been studied in isolation, each relying on task-specific architectures and datasets. In this paper, we introduce POWSM (Phonetic Open Whisper-style Speech Model), the first unified framework capable of jointly performing multiple phone-related tasks. POWSM enables seamless conversion between audio, text (graphemes), and phones, opening up new possibilities for universal and low-resource speech processing. Our model outperforms or matches specialized PR models of similar size (Wav2Vec2Phoneme and ZIPA) while jointly supporting G2P, P2G, and ASR. Our training data, code and models are released to foster open science.

huggingface.co

espnet/powsm · Hugging Face

We’re on a journey to advance and democratize artificial intelligence through open source and open science.

1.27K views19:14

About

Blog

Apps

Platform