Speech Technology – Telegram

Speech Technology

1.6K subscribers

Speech Technology

Good in-depth research. Codecs could be better with a few simple changes.

https://arxiv.org/abs/2512.20211

Code here

https://github.com/sizigi/AliasingFreeNeuralAudioSynthesis

Aliasing-Free Neural Audio Synthesis

Yicheng Gu, Junan Zhang, Chaoren Wang, Jerry Li, Zhizheng Wu, Lauri Juvela

Neural vocoders and codecs reconstruct waveforms from acoustic representations, which directly impact the audio quality. Among existing methods, upsampling-based time-domain models are superior in both inference speed and synthesis quality, achieving state-of-the-art performance. Still, despite their success in producing perceptually natural sound, their synthesis fidelity remains limited due to the aliasing artifacts brought by the inadequately designed model architectures. In particular, the unconstrained nonlinear activation generates an infinite number of harmonics that exceed the Nyquist frequency, resulting in "folded-back" aliasing artifacts. The widely used upsampling layer, ConvTranspose, copies the mirrored low-frequency parts to fill the empty high-frequency region, resulting in "mirrored" aliasing artifacts. Meanwhile, the combination of its inherent periodicity and the mirrored DC bias also brings "tonal artifact," resulting in constant-frequency ringing. This paper aims to solve these issues from a signal processing perspective. Specifically, we apply oversampling and anti-derivative anti-aliasing to the activation function to obtain its anti-aliased form, and replace the problematic ConvTranspose layer with resampling to avoid the "tonal artifact" and eliminate aliased components. Based on our proposed anti-aliased modules, we introduce Pupu-Vocoder and Pupu-Codec, and release high-quality pre-trained checkpoints to facilitate audio generation research. We build a test signal benchmark to illustrate the effectiveness of the anti-aliased modules, and conduct experiments on speech, singing voice, music, and audio to validate our proposed models. Experimental results confirm that our lightweight Pupu-Vocoder and Pupu-Codec models can easily outperform existing systems on singing voice, music, and audio, while achieving comparable performance on speech.
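
To make the "folded-back" aliasing concrete, here is a minimal pure-Python sketch. It is not the paper's implementation. It assumes a toy setup: a 16 kHz sample rate, a 3 kHz sine, and a cubic nonlinearity standing in for a network activation. It illustrates only the oversampling half of the fix (the anti-derivative anti-aliasing part is not covered), and all function names and filter settings are illustrative choices. Cubing a 3 kHz sine creates a 9 kHz harmonic, which exceeds the 8 kHz Nyquist frequency and folds back to 7 kHz.

```python
import math

def tone(freq, fs, n):
    """n samples of a unit-amplitude sine at `freq` Hz."""
    return [math.sin(2 * math.pi * freq * i / fs) for i in range(n)]

def magnitude_at(x, freq, fs):
    """Amplitude of the component at `freq` Hz (single-bin DFT)."""
    re = sum(v * math.cos(2 * math.pi * freq * i / fs) for i, v in enumerate(x))
    im = sum(v * math.sin(2 * math.pi * freq * i / fs) for i, v in enumerate(x))
    return 2 * math.hypot(re, im) / len(x)

def lowpass_taps(cutoff, fs, num_taps=101):
    """Hamming-windowed sinc low-pass FIR with unit DC gain."""
    mid = (num_taps - 1) // 2
    taps = []
    for k in range(num_taps):
        t = k - mid
        if t == 0:
            ideal = 2 * cutoff / fs
        else:
            ideal = math.sin(2 * math.pi * cutoff * t / fs) / (math.pi * t)
        window = 0.54 - 0.46 * math.cos(2 * math.pi * k / (num_taps - 1))
        taps.append(ideal * window)
    total = sum(taps)
    return [h / total for h in taps]

def fir_same(x, taps):
    """Zero-phase FIR filtering; output has the same length as the input."""
    half = len(taps) // 2
    out = []
    for n in range(len(x)):
        lo, hi = max(0, n - half), min(len(x), n + half + 1)
        out.append(sum(x[j] * taps[n + half - j] for j in range(lo, hi)))
    return out

def cube_naive(x):
    """Apply the nonlinearity directly at the original sample rate."""
    return [v ** 3 for v in x]

def cube_oversampled(x, fs):
    """2x upsample, apply the nonlinearity, low-pass, then decimate by 2."""
    taps = lowpass_taps(cutoff=0.45 * fs, fs=2 * fs)
    up = [0.0] * (2 * len(x))
    up[::2] = [2.0 * v for v in x]             # zero-stuff; gain 2 restores amplitude
    up = fir_same(up, taps)                    # remove the spectral image
    y = fir_same([v ** 3 for v in up], taps)   # band-limit before decimation
    return y[::2]

FS, F0, N = 16000, 3000, 1600
x = tone(F0, FS, N)
core = slice(160, -160)  # skip filter edge effects; 1280 samples = whole cycles
naive_alias = magnitude_at(cube_naive(x)[core], 7000, FS)
clean_alias = magnitude_at(cube_oversampled(x, FS)[core], 7000, FS)
clean_fundamental = magnitude_at(cube_oversampled(x, FS)[core], F0, FS)
print(f"7 kHz alias, naive:       {naive_alias:.4f}")
print(f"7 kHz alias, oversampled: {clean_alias:.4f}")
```

Since sin³θ = (3 sin θ − sin 3θ)/4, the naive path puts a component of amplitude 0.25 at 7 kHz. In the oversampled path the 9 kHz harmonic is representable at 32 kHz, so the low-pass filter can remove it before decimation.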

1.29K views 01:00

Speech Technology

Interesting classification of events: continuous ones like <singing>...</singing> and standalone ones like [cough]

https://huggingface.co/datasets/yfish/WESR-Bench

https://arxiv.org/abs/2601.04508

WESR: Scaling and Evaluating Word-level Event-Speech Recognition

Chenchen Yang, Kexin Huang, Liwei Fan, Qian Tu, Botian Jiang, Dong Zhang, Linqi Yin, Shimin Li, Zhaoye Fei, Qinyuan Cheng, Xipeng Qiu

Speech conveys not only linguistic information but also rich non-verbal vocal events such as laughing and crying. While semantic transcription is well-studied, the precise localization of non-verbal events remains a critical yet under-explored challenge. Current methods suffer from insufficient task definitions with limited category coverage and ambiguous temporal granularity. They also lack standardized evaluation frameworks, hindering the development of downstream applications. To bridge this gap, we first develop a refined taxonomy of 21 vocal events, with a new categorization into discrete (standalone) versus continuous (mixed with speech) types. Based on the refined taxonomy, we introduce WESR-Bench, an expert-annotated evaluation set (900+ utterances) with a novel position-aware protocol that disentangles ASR errors from event detection, enabling precise localization measurement for both discrete and continuous events. We also build a strong baseline by constructing a 1,700+ hour corpus, and train specialized models, surpassing both open-source audio-language models and commercial APIs while preserving ASR quality. We anticipate that WESR will serve as a foundational resource for future research in modeling rich, real-world auditory scenes.
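
A rough sketch of how the two event types could be pulled out of a tagged transcript. The tag syntax is assumed from the two examples in the post (`<event>...</event>` for continuous events, `[event]` for standalone ones); the actual WESR-Bench annotation format and its position-aware scoring may differ, and `parse_events` is a hypothetical helper, not part of the dataset tooling.

```python
import re

# Opening/closing tag, standalone tag, or a plain word (anything up to the next tag).
TOKEN = re.compile(r"<(/?)(\w+)>|\[(\w+)\]|([^\s<\[]+)")

def parse_events(text):
    """Split a tagged transcript into words and word-indexed events.

    Returns (words, discrete, continuous) where
      discrete   = [(event, position)]    position = number of words before the event
      continuous = [(event, start, end)]  half-open word-index span
    """
    words, discrete, continuous = [], [], []
    open_spans = {}
    for match in TOKEN.finditer(text):
        closing, tag, standalone, word = match.groups()
        if word is not None:
            words.append(word)
        elif standalone is not None:
            discrete.append((standalone, len(words)))
        elif closing:
            # Assumes well-formed input: every closing tag has a matching opener.
            continuous.append((tag, open_spans.pop(tag), len(words)))
        else:
            open_spans[tag] = len(words)
    return words, discrete, continuous

words, discrete, continuous = parse_events(
    "I <laughing>can't believe</laughing> it [cough] really"
)
print(words)
print(discrete)
print(continuous)
```

Keeping events indexed by word position rather than by character offset makes it possible to compare event placement separately from the recognized words themselves.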

1.25K views 00:53

Speech Technology

More anime. At the current state of research, it is harmful to segment audio into chunks

https://huggingface.co/datasets/OmniAICreator/ASMR-Archive-Processed

1.53K views, edited 01:30

Speech Technology

Nice and somewhat reasonable to separate experts by modality

https://github.com/NUS-HPC-AI-Lab/MoST

GitHub - NUS-HPC-AI-Lab/MoST: MoST: Mixing Speech and Text with Modality-Aware Mixture of Experts

1.3K views 23:13

Speech Technology

Another nice model for song ASR and TTS

https://github.com/HeartMuLa/heartlib

GitHub - HeartMuLa/heartlib: HeartMuLa Official Repo: The Most Powerful Open-Source Music Generation Model of 2026

1.36K views 23:16

Speech Technology

On the quality of NovaSR, which has been mentioned frequently of late

https://github.com/ysharma3501/NovaSR/issues/6

This project just interpolates? · Issue #6 · ysharma3501/NovaSR

There is an old trick where you can take 16khz audio and do linear interpolation on it which will fill out the spectrum. This creates very visible "mirroring" at 8khz and 16khz if you int...
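
The mirroring described in that issue is easy to reproduce. Below is a minimal pure-Python sketch under assumed toy settings (a 7 kHz sine at 16 kHz, upsampled 2x to 32 kHz by linear interpolation); it says nothing about what NovaSR itself does, and the helper names are illustrative. Linear interpolation only partly suppresses the spectral image, so a copy of the tone appears mirrored around 8 kHz, at 9 kHz.

```python
import math

def tone(freq, fs, n):
    """n samples of a unit-amplitude sine at `freq` Hz."""
    return [math.sin(2 * math.pi * freq * i / fs) for i in range(n)]

def magnitude_at(x, freq, fs):
    """Amplitude of the component at `freq` Hz (single-bin DFT)."""
    re = sum(v * math.cos(2 * math.pi * freq * i / fs) for i, v in enumerate(x))
    im = sum(v * math.sin(2 * math.pi * freq * i / fs) for i, v in enumerate(x))
    return 2 * math.hypot(re, im) / len(x)

def upsample_linear_2x(x):
    """Insert the midpoint between each pair of neighbouring samples."""
    y = []
    for a, b in zip(x, x[1:]):
        y.extend((a, (a + b) / 2))
    y.append(x[-1])
    return y

FS, F0 = 16000, 7000
x = tone(F0, FS, 1601)
y = upsample_linear_2x(x)[:3200]          # 3200 samples = whole cycles at 32 kHz
original = magnitude_at(y, 7000, 2 * FS)  # the tone we wanted to keep
mirror = magnitude_at(y, 9000, 2 * FS)    # its image, mirrored around 8 kHz
print(f"7 kHz tone:   {original:.4f}")
print(f"9 kHz mirror: {mirror:.4f}")
```

For this input the amplitudes work out to (1 + cos θ)/2 ≈ 0.60 for the tone and (1 − cos θ)/2 ≈ 0.40 for the mirror, with θ = 2π · 7000 / 32000, so the "new" high band is largely a reflection of the old one.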

1.4K views 17:17

Speech Technology

Japanese dataset for emotions with descriptions in natural language. A good step forward from very generic emotion labels and even more confusing emotion vectors

https://github.com/UEC-InabaLab/ETCDataset

1.66K views 20:41

Speech Technology

Many releases recently, including qwen-tts and others. Here is another LLM-based one

https://github.com/FlashLabs-AI-Corp/FlashLabs-Chroma

Worlds first open-source real-time end-to-end spoken dialogue model with personalized voice cloning. - FlashLabs-AI-Corp/FlashLabs-Chroma

1.86K views 13:42

Speech Technology

VibeVoice fine-tune for European languages, with good results compared to the baseline

https://huggingface.co/kugelaudio/kugelaudio-0-open

1.53K views 02:21

Speech Technology

https://www.assemblyai.com/universal-3-pro A new LLM-based model by AssemblyAI. It is supposed to be free for February, so this is a good chance to test it.

Universal-3 Pro by AssemblyAI

Introducing Universal-3 Pro, a first of its kind promptable speech language model. Control transcription using natural language prompting and domain context.

1.2K views 00:05

Speech Technology

Interesting effort from Shinji on phoneme recognition

https://huggingface.co/espnet/powsm

https://arxiv.org/abs/2510.24992

POWSM: A Phonetic Open Whisper-Style Speech Foundation Model

Chin-Jou Li, Kalvin Chang, Shikhar Bharadwaj, Eunjung Yeo, Kwanghee Choi, Jian Zhu, David Mortensen, Shinji Watanabe

Recent advances in spoken language processing have led to substantial progress in phonetic tasks such as automatic speech recognition (ASR), phone recognition (PR), grapheme-to-phoneme conversion (G2P), and phoneme-to-grapheme conversion (P2G). Despite their conceptual similarity, these tasks have largely been studied in isolation, each relying on task-specific architectures and datasets. In this paper, we introduce POWSM (Phonetic Open Whisper-style Speech Model), the first unified framework capable of jointly performing multiple phone-related tasks. POWSM enables seamless conversion between audio, text (graphemes), and phones, opening up new possibilities for universal and low-resource speech processing. Our model outperforms or matches specialized PR models of similar size (Wav2Vec2Phoneme and ZIPA) while jointly supporting G2P, P2G, and ASR. Our training data, code and models are released to foster open science.

1.27K views 19:14

Speech Technology

Low-resource ASR Leaderboard by Microsoft

https://huggingface.co/spaces/microsoft/paza-bench

PazaBench - a Hugging Face Space by microsoft

ASR Leaderboard for low resource languages

1.26K views 13:35

Speech Technology

Some great results in phone recognition. No code yet, but it will probably appear soon

https://www.arxiv.org/abs/2602.01634

HuPER: A Human-Inspired Framework for Phonetic Perception

Chenxu Guo, Jiachen Lian, Yisi Liu, Baihe Huang, Shriyaa Narayanan, Cheol Jun Cho, Gopala Anumanchipalli

We propose HuPER, a human-inspired framework that models phonetic perception as adaptive inference over acoustic-phonetics evidence and linguistic knowledge. With only 100 hours of training data, HuPER achieves state-of-the-art phonetic error rates on five English benchmarks and strong zero-shot transfer to 95 unseen languages. HuPER is also the first framework to enable adaptive, multi-path phonetic perception under diverse acoustic conditions. All training data, models, and code are open-sourced. Code and demo available at this https URL.

1.47K views 16:00

Speech Technology

https://github.com/FireRedTeam/FireRedASR2S

Interesting things:

FireRedVAD 100+ languages, 20+ Chinese dialects/accents
FireRedLID 100+ languages, 20+ Chinese dialects/accents

FLEURS-VAD-102: We randomly selected ~100 audio files per language from FLEURS test set, resulting in 9,443 audio files with manually annotated binary VAD labels (speech=1, silence=0). This VAD testset will be open sourced (coming soon).
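
With binary labels like these, a VAD system can be scored frame by frame. The sketch below is one common way to do that (precision, recall and F1 on the speech class); the post does not say which metric or frame size FireRedVAD's evaluation uses, so treat `vad_scores` and its conventions as assumptions.

```python
def vad_scores(reference, hypothesis):
    """Frame-level precision, recall and F1 for the speech class (label 1)."""
    if len(reference) != len(hypothesis):
        raise ValueError("reference and hypothesis must have the same number of frames")
    pairs = list(zip(reference, hypothesis))
    tp = sum(1 for r, h in pairs if r == 1 and h == 1)  # speech detected as speech
    fp = sum(1 for r, h in pairs if r == 0 and h == 1)  # silence detected as speech
    fn = sum(1 for r, h in pairs if r == 1 and h == 0)  # speech that was missed
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

# Toy frames: speech=1, silence=0
reference = [1, 1, 0, 0, 1]
hypothesis = [1, 0, 0, 1, 1]
precision, recall, f1 = vad_scores(reference, hypothesis)
print(precision, recall, f1)
```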

1.27K views, edited 17:23

Speech Technology

An 8B TTS model that claims to support many languages

https://github.com/OpenMOSS/MOSS-TTS

GitHub - OpenMOSS/MOSS-TTS: MOSS‑TTS Family is an open‑source speech and sound generation model family from MOSI.AI and the OpenMOSS…

MOSS‑TTS Family is an open‑source speech and sound generation model family from MOSI.AI and the OpenMOSS team. It is designed for high‑fidelity, high‑expressiveness, and complex real‑world scenario...

1.56K views 23:15

Speech Technology

Very true

https://x.com/KaitlynZhou/status/2023800965535789511

https://arxiv.org/abs/2602.12249

"Sorry, I Didn't Catch That": How Speech Models Miss What Matters Most

Kaitlyn Zhou, Martijn Bartelds, Federico Bianchi, James Zou

Despite speech recognition systems achieving low word error rates on standard benchmarks, they often fail on short, high-stakes utterances in real-world deployments. Here, we study this failure mode in a high-stakes task: the transcription of U.S. street names as spoken by U.S. participants. We evaluate 15 models from OpenAI, Deepgram, Google, and Microsoft on recordings from linguistically diverse U.S. speakers and find an average transcription error rate of 44%. We quantify the downstream impact of failed transcriptions by geographic locations and show that mis-transcriptions systematically cause errors for all speakers, but that routing distance errors are twice as large for non-English primary speakers compared to English primary speakers. To mitigate this harm, we introduce a synthetic data generation approach that produces diverse pronunciations of named entities using open-source text-to-speech models. Fine-tuning with less than 1,000 synthetic samples improves street name transcription accuracy by nearly 60% (relative to base models) for non-English primary speakers. Our results highlight a critical gap between benchmark performance and real-world reliability in speech systems and demonstrate a simple, scalable path to reducing high-stakes transcription errors.

Text-to-speech models can’t get your address right? Turns out you’re not the only one.

📢New preprint! State-of-the-art speech models get 44% of street names wrong — and non-English primary speakers suffer twice the error impact!

1.24K views 22:35

Speech Technology

Audio Reasoning Challenge results

https://audio-reasoning-challenge.github.io/leaderboard/

Some info about the winning TalTech entry

https://www.linkedin.com/posts/aivo-olev-73944965_its-official-i-built-an-ai-agent-that-outperformed-ugcPost-7429801097202069504-G3U8

The task was to build an agent that can reason about audio using any open-source tools and my unique solution basically taught a deaf LLM (Kimi K2) to answer questions about 1000 audio files (music, speech, other sounds). That would be hard for a human as well. It had input from other LLMs and 35 tools that were able to pick up some unreliable info (often incorrect or even hallucinated) from the audio and that is what made this challenge the most exciting and why I basically worked non-stop for the 4 weeks. A normal AI agent can be pretty sure that when it reads a file or gets some other tool input that the information is correct. It might be irrelevant for the task, but mostly LLMs trust input (which is a problem in the real world with input from web search, malicious input, another agent's opinion etc). They also reason quite linearly which is a problem when you have unreliable info.

Audio Reasoning Challenge

Audio Reasoning Challenge - Interspeech 2026

1.34K views 13:08

Speech Technology

Somehow one can create multimodal embeddings from speech and text and make them useful. Some projects I've come across recently:

https://github.com/facebookresearch/SONAR

Used for ASR WER approximation

On the Robust Approximation of ASR Metrics
Abdul Waheed, Hanin Atwany, Rita Singh, Bhiksha Raj
https://arxiv.org/abs/2502.12408

Another one to detect dataset quality issues

https://huggingface.co/yuriyvnv/WAVe-1B-Multimodal-PT
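
The basic idea behind both uses can be sketched in a few lines: if audio and its transcript are embedded into a shared space, a low cosine similarity between the two suggests a bad transcript or a bad recognition result. This is only an illustration with toy 3-dimensional vectors standing in for real encoder outputs; it does not call SONAR or WAVe, and the threshold of 0.5 is an arbitrary assumption.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def flag_mismatches(pairs, threshold=0.5):
    """Return ids of (id, audio_embedding, text_embedding) pairs that look misaligned."""
    return [pid for pid, audio_vec, text_vec in pairs
            if cosine(audio_vec, text_vec) < threshold]

# Toy vectors, not real embeddings: utt1 is a plausible match, utt2 is not.
pairs = [
    ("utt1", [0.9, 0.1, 0.0], [0.8, 0.2, 0.1]),
    ("utt2", [0.1, 0.9, 0.2], [0.9, 0.0, 0.1]),
]
suspects = flag_mismatches(pairs)
print(suspects)
```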

SONAR, a new multilingual and multimodal fixed-size sentence embedding space, with a full suite of speech and text encoders and decoders. - facebookresearch/SONAR

1.47K views 19:32

Speech Technology

https://alphacephei.com/nsh/2026/02/23/am-lm-factor.html

Speech Recognition With Vosk

Factorizing E2E on acoustic and language models

While end-to-end speech recognition systems are dominating leaderboards, it’s still valuable to consider the separate acoustic and language models. This separation is present in the network, as the lower layers of the network handle acoustic information, filtering…

1.09K views 23:27

Speech Technology

No model weights, but somewhat interesting ideas.

Transfusion: Transfusion (Zhou et al., 2025) was originally proposed in computer vision to develop a model that can jointly perform generation and understanding tasks.

https://arxiv.org/abs/2602.17097

AudioChat: Unified Audio Storytelling, Editing, and Understanding with Transfusion Forcing

William Chen, Prem Seetharaman, Rithesh Kumar, Oriol Nieto, Shinji Watanabe, Justin Salamon, Zeyu Jin

Despite recent breakthroughs, audio foundation models struggle in processing complex multi-source acoustic scenes. We refer to this challenging domain as audio stories, which can have multiple speakers and background/foreground sound effects. Compared to traditional audio processing tasks, audio stories introduce new layers of semantic, temporal, and physical complexity. To address this challenge, we propose AudioChat, a framework for developing audio foundation models that can generate, edit, and understand audio stories. AudioChat introduces a new paradigm in which LLM-based toolcalling agents simulate interactions between users and the system, and these simulated dialogues are used as training data. We also introduce a novel Audio Transfusion Forcing objective to train the AudioChat model, allowing it to simultaneously decompose high-level instructions via structured chain-of-thought reasoning and perform interactive multi-turn audio understanding/generation. To evaluate generation and editing performance, we develop three new metrics that directly measure task performance instead of relying upon distribution-based scoring. We highly encourage readers to visit our demo to better understand the capabilities of AudioChat: this https URL.

1.08K views 22:08