Speech Technology
We track Inworld's status since the company was founded by the Dialogflow team (Dialogflow was very popular in those days). It is interesting that AI for games didn't work out.

https://www.linkedin.com/posts/kylangibbs_inworld-is-evolving-1-we-just-published-activity-7341215644828188672-aYWf

Inworld is evolving.

1. We just published our vision of the future. These are distilled learnings based on our first 4 years engaged with partners like NVIDIA, Microsoft, Status, Little Umbrella, Streamlabs, Nanobit, NBCUniversal, Mistral AI, Google and thousands of other developers.

Due to explosive growth in demand we are widening our focus to broader consumer applications (extending from games into new areas like fitness, learning and social connection). We are seeing new and existing companies across consumer categories shift the focus of AI adoption from cost savings to net new revenue opportunities through novel AI-native applications, and we are leaning in to support that shift.
Tested the https://huggingface.co/kyutai/stt-1b-en_fr model on some diverse data. Accuracy is on the lower side.

CMU Kids WER is 11.3, for example, compared to 4.8 for parakeet-tdt-0.6b-v2. LibriSpeech test-clean WER is above 4 as well.

The output is sometimes Chinese, sometimes Arabic.
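For reference, the WER numbers quoted above are just word-level Levenshtein distance divided by reference length. A minimal self-contained sketch (in practice tools like jiwer also normalize punctuation and casing first):

```python
def wer(ref: str, hyp: str) -> float:
    """Word error rate: word-level edit distance / number of reference words."""
    r, h = ref.split(), hyp.split()
    # d[i][j] = edit distance between first i ref words and first j hyp words
    d = [[0] * (len(h) + 1) for _ in range(len(r) + 1)]
    for i in range(len(r) + 1):
        d[i][0] = i
    for j in range(len(h) + 1):
        d[0][j] = j
    for i in range(1, len(r) + 1):
        for j in range(1, len(h) + 1):
            sub = d[i - 1][j - 1] + (r[i - 1] != h[j - 1])
            d[i][j] = min(sub, d[i - 1][j] + 1, d[i][j - 1] + 1)
    return d[len(r)][len(h)] / max(len(r), 1)
```

So a single substituted word in a two-word reference already costs 50% WER, which is why short-utterance test sets like CMU Kids punish small mistakes hard.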
Not something exceptional, just a current trend

https://arxiv.org/abs/2507.05911

Differentiable Reward Optimization for LLM based TTS system

Changfeng Gao, Zhihao Du, Shiliang Zhang

This paper proposes a novel Differentiable Reward Optimization (DiffRO) method aimed at enhancing the performance of neural codec language model based text-to-speech (TTS) systems. In contrast to conventional reinforcement learning from human feedback (RLHF) approaches applied to TTS, DiffRO directly computes the rewards based on neural codec tokens, rather than relying on synthesized audio. Furthermore, we employ the Gumbel-Softmax technique to render the reward function differentiable, thereby streamlining the RLHF training process. Additionally, we introduce a multi-task reward (MTR) model which can provide feedback from different perspectives, and we find that it can augment the system's capability to follow instructions. Experimental results indicate that DiffRO significantly improves the pronunciation accuracy of the TTS system, achieving state-of-the-art (SOTA) WER results on the seed-tts-eval benchmark. Moreover, with the integration of the MTR model, we demonstrate the ability to control emotional and quality attributes in a zero-shot manner.
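The key trick here is the Gumbel-Softmax relaxation: sampling a discrete codec token is not differentiable, but a temperature-controlled softmax over logits plus Gumbel noise is, so reward gradients can flow back into the TTS logits. A toy sketch of the relaxation itself (function name and shapes are my own, not from the paper):

```python
import numpy as np

def gumbel_softmax(logits, tau=1.0, rng=None):
    """Differentiable relaxation of categorical sampling.

    Instead of a hard one-hot token, returns a 'soft' probability vector
    over the codec vocabulary; a reward model applied to these soft tokens
    stays differentiable with respect to the logits.
    """
    rng = rng or np.random.default_rng()
    # Gumbel(0, 1) noise: -log(-log(U)) with U ~ Uniform(0, 1)
    g = -np.log(-np.log(rng.uniform(1e-10, 1.0, size=logits.shape)))
    y = (logits + g) / tau
    y = y - y.max(axis=-1, keepdims=True)  # numerically stable softmax
    e = np.exp(y)
    return e / e.sum(axis=-1, keepdims=True)
```

As tau goes to 0 the samples approach one-hot (matching true discrete sampling) but gradients get noisier; larger tau gives smoother gradients at the cost of bias.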
A similar thing: https://dmdspeech.github.io and a few others.
Objective metrics still do not correlate well with true MOS results, unfortunately; this needs more work.

https://arxiv.org/abs/2507.11306

P.808 Multilingual Speech Enhancement Testing: Approach and Results of URGENT 2025 Challenge

Marvin Sach, Yihui Fu, Kohei Saijo, Wangyou Zhang, Samuele Cornell, Robin Scheibler, Chenda Li, Anurag Kumar, Wei Wang, Yanmin Qian, Shinji Watanabe, Tim Fingscheidt

In speech quality estimation for speech enhancement (SE) systems, subjective listening tests so far are considered as the gold standard. This should be even more true considering the large influx of new generative or hybrid methods into the field, revealing issues of some objective metrics. Efforts such as the Interspeech 2025 URGENT Speech Enhancement Challenge also involving non-English datasets add the aspect of multilinguality to the testing procedure. In this paper, we provide a brief recap of the ITU-T P.808 crowdsourced subjective listening test method. A first novel contribution is our proposed process of localizing both text and audio components of Naderi and Cutler's implementation of crowdsourced subjective absolute category rating (ACR) listening tests involving text-to-speech (TTS). Further, we provide surprising analyses of and insights into URGENT Challenge results, tackling the reliability of (P.808) ACR subjective testing as gold standard in the age of generative AI. Particularly, it seems that for generative SE methods, subjective (ACR MOS) and objective (DNSMOS, NISQA) reference-free metrics should be accompanied by objective phone fidelity metrics to reliably detect hallucinations. Finally, in the accepted version, we will release our localization scripts and methods for easy deployment for new multilingual speech enhancement subjective evaluations according to ITU-T P.808.
Canary-Qwen-2.5B is the latest, first-of-its-kind ASR model from the NVIDIA NeMo team.

🏆 1st place on Open ASR Leaderboard with WER 5.63%
🔥 RTFx=418 on A100 GPU - remarkably fast for its size
💰 CC-BY-4.0 license, commercial-friendly
🌎 English-only

https://x.com/PiotrZelasko/status/1945858933605757008
Based on Granary (643k hours across 25 languages)

https://arxiv.org/abs/2505.13404

Granary: Speech Recognition and Translation Dataset in 25 European Languages

Nithin Rao Koluguri, Monica Sekoyan, George Zelenfroynd, Sasha Meister, Shuoyang Ding, Sofia Kostandian, He Huang, Nikolay Karpov, Jagadeesh Balam, Vitaly Lavrukhin, Yifan Peng, Sara Papi, Marco Gaido, Alessio Brutti, Boris Ginsburg

Multi-task and multilingual approaches benefit large models, yet speech processing for low-resource languages remains underexplored due to data scarcity. To address this, we present Granary, a large-scale collection of speech datasets for recognition and translation across 25 European languages. This is the first open-source effort at this scale for both transcription and translation. We enhance data quality using a pseudo-labeling pipeline with segmentation, two-pass inference, hallucination filtering, and punctuation restoration. We further generate translation pairs from pseudo-labeled transcriptions using EuroLLM, followed by a data filtration pipeline. Designed for efficiency, our pipeline processes vast amounts of data within hours. We assess models trained on processed data by comparing their performance on previously curated datasets for both high- and low-resource languages. Our findings show that these models achieve similar performance using approx. 50% less data. Dataset will be made available at this https URL
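One way to picture the hallucination-filtering step in such two-pass pseudo-labeling pipelines (names and the threshold below are illustrative, not Granary's actual implementation): keep a segment only when two independent decoding passes largely agree on the transcript, since hallucinated text rarely reproduces itself across passes.

```python
import difflib

def keep_segment(pass1: str, pass2: str, min_agreement: float = 0.8) -> bool:
    """Toy two-pass agreement filter for pseudo-labels.

    difflib's ratio() returns a similarity in [0, 1]; segments where the two
    decoding passes diverge below the threshold are dropped as likely
    hallucinations.
    """
    ratio = difflib.SequenceMatcher(None, pass1.lower(), pass2.lower()).ratio()
    return ratio >= min_agreement
```

A real pipeline would additionally compare against audio duration (hallucinations often produce text far longer than the audio supports) before accepting a pseudo-label.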
An improved Matcha-TTS. Adversarial learning really helps flow matching (FM) improve MOS.

https://github.com/naver-ai/RapFlow-TTS
https://www.arxiv.org/abs/2506.16741

RapFlow-TTS: Rapid and High-Fidelity Text-to-Speech with Improved Consistency Flow Matching

Hyun Joon Park, Jeongmin Liu, Jin Sob Kim, Jeong Yeol Yang, Sung Won Han, Eunwoo Song

We introduce RapFlow-TTS, a rapid and high-fidelity TTS acoustic model that leverages velocity consistency constraints in flow matching (FM) training. Although ordinary differential equation (ODE)-based TTS generation achieves natural-quality speech, it typically requires a large number of generation steps, resulting in a trade-off between quality and inference speed. To address this challenge, RapFlow-TTS enforces consistency in the velocity field along the FM-straightened ODE trajectory, enabling consistent synthetic quality with fewer generation steps. Additionally, we introduce techniques such as time interval scheduling and adversarial learning to further enhance the quality of the few-step synthesis. Experimental results show that RapFlow-TTS achieves high-fidelity speech synthesis with a 5-
Given that adversarial learning really helps with MOS, a natural alternative would probably be to use a different latent space than simple mels. EnCodec latents, as in StableTTS? Sadly the authors didn't explore that path.
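The velocity-consistency idea can be made concrete with a toy example (all shapes and the linear "model" below are illustrative, not RapFlow-TTS's architecture): on a straightened trajectory x_t = (1-t)·x0 + t·x1 the ground-truth velocity x1 - x0 is constant, so besides the usual flow-matching loss one can penalize the model for predicting different velocities at two times on the same trajectory.

```python
import numpy as np

rng = np.random.default_rng(0)
x0 = rng.normal(size=4)  # noise sample
x1 = rng.normal(size=4)  # data point (stand-in for a mel frame)

def traj(t):
    """Straight-line (rectified) trajectory between noise and data."""
    return (1 - t) * x0 + t * x1

def v_model(x, t, w=0.5):
    """Toy linear velocity predictor, stand-in for the neural network."""
    return w * x + (1 - w) * t

t1, t2 = 0.2, 0.7
# Standard flow-matching term: match the constant target velocity x1 - x0.
fm_loss = np.mean((v_model(traj(t1), t1) - (x1 - x0)) ** 2)
# Consistency term: predictions at two times on the same straightened
# trajectory should agree, which is what enables few-step sampling.
consistency_loss = np.mean((v_model(traj(t1), t1) - v_model(traj(t2), t2)) ** 2)
loss = fm_loss + consistency_loss
```

With the velocity field consistent along the trajectory, a few large Euler steps give nearly the same result as many small ones, hence the speed/quality trade-off improvement.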
ZipVoice-Dialog is released; it is much better than Dia, though with less publicity.

https://arxiv.org/abs/2507.09318

ZipVoice-Dialog: Non-Autoregressive Spoken Dialogue Generation with Flow Matching

Han Zhu, Wei Kang, Liyong Guo, Zengwei Yao, Fangjun Kuang, Weiji Zhuang, Zhaoqing Li, Zhifeng Han, Dong Zhang, Xin Zhang, Xingchen Song, Long Lin, Daniel Povey

Generating spoken dialogue is more challenging than monologue text-to-speech (TTS) due to the need for realistic turn-taking and distinct speaker timbres. Existing spoken dialogue generation models, being auto-regressive, suffer from slow and unstable inference. To overcome these limitations, we introduce ZipVoice-Dialog, a non-autoregressive zero-shot spoken dialogue generation model built upon flow matching. Key designs include: 1) speaker-turn embeddings for precise speaker turn-taking; 2) a curriculum learning strategy for stable speech-text alignment; 3) specialized strategies to enable stereo dialogue generation. Additionally, recognizing the lack of open-source large-scale spoken dialogue datasets, we curated OpenDialog, a 6.8k-hour spoken dialogue dataset from in-the-wild speech data. Furthermore, we established a benchmark to comprehensively evaluate various models. Experimental results demonstrate that ZipVoice-Dialog achieves superior performance in intelligibility, speaker turn-taking accuracy, speaker similarity, and inference speed. Our codes, model checkpoints, demo samples, and the OpenDialog dataset are all publicly available at this https URL.
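The speaker-turn embedding design is simple to picture: each token carries not only its text embedding but also an additive embedding for whichever speaker is talking, so turn boundaries are explicit in the input. A minimal sketch (vocabulary, dimensions, and the additive combination are illustrative assumptions, not the paper's exact scheme):

```python
import numpy as np

rng = np.random.default_rng(0)
d = 8                                          # embedding dim (illustrative)
vocab = {tok: i for i, tok in enumerate("hello there hi friend".split())}
tok_emb = rng.normal(size=(len(vocab), d))     # token embedding table
spk_emb = rng.normal(size=(2, d))              # one embedding per speaker (A/B)

# A dialogue as (speaker_id, token) pairs; the speaker-turn embedding is
# added to each token embedding so the model always knows who is talking.
dialogue = [(0, "hello"), (0, "there"), (1, "hi"), (1, "friend")]
x = np.stack([tok_emb[vocab[t]] + spk_emb[s] for s, t in dialogue])
```

Because the turn information is injected at the input rather than inferred, the non-autoregressive model can generate both speakers' turns in parallel without drifting on who speaks when.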
Introducing STITCH: our new method to make Spoken Language Models (SLMs) think and talk at the same time.

http://arxiv.org/abs/2507.15375

https://x.com/dcml0714/status/1947493948358070783
MegaTTS 3 voice cloning is here!

For context: a while back, ByteDance released MegaTTS 3 (with exceptional voice cloning capabilities), but for various reasons, they decided not to release the WavVAE encoder necessary for voice cloning to work.

Recently, a WavVAE encoder compatible with MegaTTS 3 was released by ACoderPassBy on ModelScope: https://modelscope.cn/models/ACoderPassBy/MegaTTS-SFT with quite promising results.

I reuploaded the weights to Hugging Face: https://huggingface.co/mrfakename/MegaTTS3-VoiceCloning

And put up a quick Gradio demo to try it out: https://huggingface.co/spaces/mrfakename/MegaTTS3-Voice-Cloning

Overall looks quite impressive - excited to see that we can finally do voice cloning with MegaTTS 3!

h/t to MysteryShack on the StyleTTS 2 Discord for info about the WavVAE encoder

https://www.reddit.com/r/LocalLLaMA/comments/1m641zg/megatts_3_voice_cloning_is_here/
Operating on long speech is a valid task statement, and this work highlights issues with default transformers, although SSMs might not be the optimal solution.

https://arxiv.org/abs/2412.18603v2

Long-Form Speech Generation with Spoken Language Models

Se Jin Park, Julian Salazar, Aren Jansen, Keisuke Kinoshita, Yong Man Ro, RJ Skerry-Ryan

We consider the generative modeling of speech over multiple minutes, a requirement for long-form multimedia generation and audio-native voice assistants. However, textless spoken language models struggle to generate plausible speech past tens of seconds, due to high temporal resolution of speech tokens causing loss of coherence, architectural issues with long-sequence training or extrapolation, and memory costs at inference time. From these considerations we derive SpeechSSM, the first speech language model family to learn from and sample long-form spoken audio (e.g., 16 minutes of read or extemporaneous speech) in a single decoding session without text intermediates. SpeechSSMs leverage recent advances in linear-time sequence modeling to greatly surpass current Transformer spoken LMs in coherence and efficiency on multi-minute generations while still matching them at the utterance level. As we found current spoken language evaluations uninformative, especially in this new long-form setting, we also introduce: LibriSpeech-Long, a benchmark for long-form speech evaluation; new embedding-based and LLM-judged metrics; and quality measurements over length and time. Speech samples, the LibriSpeech-Long dataset, and any future code or model releases can be found at this https URL.
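The "linear-time sequence modeling" the abstract leans on is, at its core, a state-space recurrence: cost grows as O(T) in sequence length versus attention's O(T²), which is what makes multi-minute audio tractable. A minimal sketch (shapes and matrices are illustrative, not SpeechSSM's actual parameterization):

```python
import numpy as np

rng = np.random.default_rng(0)
d_state, d_in, T = 4, 2, 16
A = 0.9 * np.eye(d_state)            # stable state-transition matrix
B = rng.normal(size=(d_state, d_in)) # input projection
C = rng.normal(size=(1, d_state))    # output projection

x = rng.normal(size=(T, d_in))       # input sequence (e.g. speech features)
h = np.zeros(d_state)
ys = []
for t in range(T):
    # Linear SSM recurrence: h_t = A h_{t-1} + B x_t, y_t = C h_t.
    # Constant per-step state size means memory does not grow with T.
    h = A @ h + B @ x[t]
    ys.append(C @ h)
y = np.stack(ys)
```

The fixed-size hidden state is both the strength (constant memory at inference, no context-window cliff) and the suspected weakness alluded to above: everything about minutes of prior audio must be squeezed through it.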