We have released Georgian models for the old Vosk. Not outstanding, but a good start
https://alphacephei.com/vosk/models/vosk-model-small-ka-0.42.zip
https://alphacephei.com/vosk/models/vosk-model-ka-0.42.zip
https://github.com/alphacep/awesome-speech/blob/main/georgian.md
For people interested in Georgian, we also recommend checking out
https://huggingface.co/NMikka (Nika Mikaberidze)
Good work on fine-tuning major TTS engines (kokoro, qwen, f5, omni) is published there
HF has introduced a private leaderboard
https://huggingface.co/blog/open-asr-leaderboard-private-data
Qwen is really good at English
AppTek recently released a call-center dataset (129 hours, role-played). Qwen3-ASR-1.7B leads again
https://huggingface.co/datasets/apptek-com/apptek_callcenter_dialogues
This is quite an insightful paper. On mobile CPUs a transformer decoder turns out to be faster than a convolutional one
https://arxiv.org/abs/2601.20094
T-Mimi: A Transformer-based Mimi Decoder for Real-Time On-Phone TTS
Haibin Wu, Bach Viet Do, Naveen Suda, Julian Chan, Madhavan C R, Gene-Ping Yang, Yi-Chiao Wu, Naoyuki Kanda, Yossef Adi, Xin Lei, Yue Liu, Florian Metze, Yuzong Liu
Neural audio codecs provide promising acoustic features for speech synthesis, with representative streaming codecs like Mimi providing high-quality acoustic features for real-time Text-to-Speech (TTS) applications. However, Mimi's decoder, which employs a hybrid transformer and convolution architecture, introduces significant latency bottlenecks on edge devices due to the compute-intensive nature of deconvolution layers which are not friendly for mobile CPUs, such as the most representative framework XNNPACK. This paper introduces T-Mimi, a novel modification of the Mimi codec decoder that replaces its convolutional components with a purely transformer-based decoder, inspired by the TS3-Codec architecture. This change dramatically reduces on-device TTS latency from 42.1ms to just 4.4ms. Furthermore, we conduct quantization aware training and derive a crucial finding: the final two transformer layers and the concluding linear layers of the decoder, which are close to the waveform, are highly sensitive to quantization and must be preserved at full precision to maintain audio quality.
Previous paper from the same author
https://arxiv.org/abs/2411.18803
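The quantization finding in the T-Mimi abstract can be read as a per-layer precision plan. Below is a minimal sketch of that idea only. The layer names, the `precision_plan` helper and the "int8"/"fp32" labels are assumptions made for illustration and are not taken from the paper or its code.

```python
def precision_plan(transformer_layers, linear_head_layers, keep_last=2):
    """Map each decoder layer name to "int8" or "fp32".

    Assumption: layers are listed in order from the codec tokens towards
    the waveform, so the last entries are the ones closest to the output.
    """
    plan = {}
    cutoff = max(len(transformer_layers) - keep_last, 0)
    for i, name in enumerate(transformer_layers):
        # Only the layers before the cutoff are quantized.
        plan[name] = "int8" if i < cutoff else "fp32"
    for name in linear_head_layers:
        # The concluding linear layers stay at full precision.
        plan[name] = "fp32"
    return plan


layers = [f"transformer.{i}" for i in range(6)]  # hypothetical layer names
head = ["out_proj.0", "out_proj.1"]              # hypothetical layer names
plan = precision_plan(layers, head)
```

With six transformer layers this keeps the last two and both linear layers in full precision and marks the first four for quantization.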
Following on from this, NVIDIA is adding a Transformer encoder for ASR as an alternative to the Conformer
https://github.com/NVIDIA-NeMo/NeMo/pull/15661
Interesting comments by Desh Raj on TML's new "interaction model"
https://x.com/rdesh26/status/2054246456635150744
for example Game-Time
Game-Time: Evaluating Temporal Dynamics in Spoken Language Models
https://arxiv.org/abs/2509.26388
If you decompose prosody you can do many nice things
https://arxiv.org/abs/2605.05927
Minimizing Modality Gap from the Input Side: Your Speech LLM Can Be a Prosody-Aware Text LLM
Wenqian Cui, Xiao-Hui Li, Daxin Tan, Qiyong Zheng, Irwin King
Speech large language models (SLMs) are typically built from text large language model (TLM) checkpoints, yet they still suffer from a substantial modality gap. Prior work has mainly attempted to reduce this gap from the output side by making speech generation more text-like, but the gap remains. We argue that the key remaining bottleneck lies on the input side. We propose TextPro-SLM, an SLM that makes spoken input more closely resemble that of a prosody-aware text LLM. TextPro-SLM combines WhisperPro, a unified speech encoder that produces synchronized text tokens and prosody embeddings, with an LLM backbone trained to preserve the semantic capabilities of the original TLM while learning paralinguistic understanding. Experiments show that TextPro-SLM achieves the lowest modality gap among leading SLMs at both 3B and 7B scales, while also delivering strong overall performance on paralinguistic understanding tasks. These gains are achieved with only roughly 1,000 hours of LLM training audio, suggesting that reducing the modality gap from the input side is both effective and data-efficient.
Finally, a competition with a sensible timeline, not two weeks to implement everything
https://saigonaihub.com/OneVoiceAIChallenge
Dare to build the next generation of realtime translation devices powered by Edge AI
Presented by Saigon AI Hub and Qualcomm
🟠 24 May - 24 June 2026: Registration period
🟠 July 2026: Technical specification submission
🟠 August - September 2026: Prototype submission
🟠 October 2026: Field testing
🟠 November 2026: Grand finale at VNG Campus
While everyone focuses on latency, there are many measurable aspects of ASR that are easy to evaluate and have a significant impact on user experience. Here are some of them:
* Hallucination rates on noisy inputs.
* Recognition of short inputs.
* Ability to identify non-speech sounds such as music and noise.
* Recognition of rare words.
Some recent training results from our system
https://alphacephei.com/nsh/2026/05/24/asr-details.html
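Two of these aspects are simple to turn into numbers. The sketch below is an illustration, not the evaluation code behind the linked results. The function names and the tiny test sets are hypothetical. It assumes a set of clips known to contain no speech, where the correct transcript is empty, and a lowercase list of rare words to track.

```python
def hallucination_rate(hypotheses):
    """Fraction of non-speech clips for which the recognizer emitted any word.

    `hypotheses` holds the ASR output for clips that contain no speech
    (noise, music, silence), so the correct transcript is always empty.
    """
    if not hypotheses:
        return 0.0
    hallucinated = sum(1 for text in hypotheses if text.strip())
    return hallucinated / len(hypotheses)


def rare_word_recall(references, hypotheses, rare_words):
    """Share of rare-word occurrences in the references that were recognized.

    `rare_words` is a set of lowercase words; matching ignores word order.
    """
    hits = total = 0
    for ref, hyp in zip(references, hypotheses):
        hyp_words = set(hyp.lower().split())
        for word in ref.lower().split():
            if word in rare_words:
                total += 1
                hits += word in hyp_words
    return hits / total if total else 0.0


# Hypothetical recognizer outputs for five noise-only clips.
noise_outputs = ["", "thank you", " ", "", "bye"]
noise_rate = hallucination_rate(noise_outputs)

# Hypothetical references and hypotheses for the rare-word check.
refs = ["call doctor mikaberidze", "open vosk"]
hyps = ["call doctor mika", "open vosk"]
recall = rare_word_recall(refs, hyps, {"mikaberidze", "vosk"})
```

Both numbers can be tracked per model release alongside WER, which on its own does not expose either failure mode.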
https://github.com/harrrshall/natscore
Modern neural TTS (CosyVoice2, F5-TTS, MaskGCT, Llasa, XTTS-v2, etc.) generates speech that crosses the threshold where the dominant failure mode is no longer artifacts; it's subtle unnaturalness: prosody glitches, expressive overshoot, speaker-clone drift, breath placement, code-switching mismatches. The existing automatic naturalness scorers were not trained on this kind of failure surface.
* UTMOSv2 (the VoiceMOS 2024 winner) was trained on read-speech MOS labels. It saturates at high quality and is documented to produce negative correlations with human judgment on conversational and expressive speech (arXiv 2603.01467).
* WhiSQA was designed for telecom and speech-enhancement quality (NISQA training data). It is intentionally not a synthetic-TTS scorer.
* DNSMOS, NISQA-TTS, and the rest of the legacy stack predate modern neural TTS and lack the distribution coverage.
* SpeechJudge-GRM (released Nov 2025) is excellent, but it is a 7B-parameter LALM that costs ~$0.001 per score on a Modal A100, which makes it unusable inside a TTS training loop or for large-scale offline evaluation.
The data that fixes the distribution gap, SpeechJudge-Data, was released in November 2025: 99K human-labeled TTS preference pairs across CosyVoice2, F5-TTS, MaskGCT, Llasa, and others, in en/zh + code-switching, with both regular and expressive splits. As of writing, no clean public artifact combines this data with a small, deployable, CPU-runnable scorer.
NatScore fills that gap.
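Training a scorer from preference pairs is commonly done with a Bradley-Terry style pairwise loss. The sketch below shows that loss and a pairwise accuracy metric. It is an assumption about how a preference-supervised scorer could be trained, not a description of NatScore's actual training code, and the example scores are made up.

```python
import math


def bradley_terry_loss(score_preferred, score_rejected):
    """Negative log-likelihood that the preferred sample wins.

    P(preferred beats rejected) = sigmoid(score_preferred - score_rejected).
    """
    margin = score_preferred - score_rejected
    # log(1 + exp(-margin)), written to stay stable for large |margin|.
    return max(-margin, 0.0) + math.log1p(math.exp(-abs(margin)))


def pairwise_accuracy(pairs):
    """Fraction of (preferred, rejected) score pairs ranked correctly."""
    return sum(p > r for p, r in pairs) / len(pairs)


# Hypothetical scorer outputs for four human-labeled preference pairs.
scored_pairs = [(1.0, 0.0), (0.0, 1.0), (2.0, 1.0), (3.0, -1.0)]
mean_loss = sum(bradley_terry_loss(p, r) for p, r in scored_pairs) / len(scored_pairs)
accuracy = pairwise_accuracy(scored_pairs)
```

The scorer only has to output one scalar per utterance, which is what makes a small CPU-runnable model plausible for this task.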
https://github.com/ASLP-lab/Smart-Glass-Challenge
https://aslp-lab.github.io/SmartGlasses/
Driven by the rapid advancement of Large Language Models (LLMs) and Multimodal LLMs, AI-powered smart glasses are emerging as a next-generation platform for human-computer interaction. Equipped with microphone arrays and cameras, smart glasses naturally capture the wearer’s egocentric (first-person) perspective, enabling hands-free multimodal communication throughout daily life.
However, deploying robust speech-centric interaction systems on smart glasses introduces distinct challenges compared with traditional stationary devices such as smart speakers or handheld devices such as smartphones. Smart glasses operate in highly dynamic acoustic environments, including environmental noise, user-generated motion noise, and speech from surrounding people.
To address these challenges, the SmartGlasses Challenge introduces a new benchmark for evaluating Time-Stamped Speaker-Attributed ASR (TSA-ASR) and Spoken Language Understanding (SLU) in real-world egocentric interaction scenarios, including dyadic conversations and multi-party meetings.
Things go fundamental
https://github.com/xzf-thu/Audio-Interaction/
Large-scale streaming-audio dataset for audio-LLM / audio-agent training. Each row is a stream: a sequence of audio turns sharing one unified schema. ~2.28M unique audio clips are organised into six task subsets.
https://huggingface.co/datasets/zhifeixie/StreamAudio-2M
Everyone is looking at translation these days. A good paper covering the complexity of the task (naturalness, prosody, content). Omni systems are still unrealistic
https://arxiv.org/abs/2606.03241
Benchmarking Speech-to-Speech Translation Models
Alkis Koudounas, Hayato Futami, Quentin Jodelet, Osamu Take, Shinji Watanabe, Emiru Tsunoo
Speech-to-speech translation (S2ST) has advanced rapidly, but offline evaluation lacks a unified protocol: studies report non-overlapping metric subsets, preventing direct comparisons. We introduce COMPASS, a unified and reproducible benchmarking framework integrating 46 metrics across eight dimensions, and deploy it on 1,248 model-language configurations from FLEURS and CVSS, spanning cascaded and end-to-end architectures over ten language pairs. Architectures exhibit complementary strengths: best-vs-worst gaps exceed 30% on naturalness and speaker preservation but remain within a few points on translation quality, so single-metric rankings systematically misrepresent system quality. Correlation filtering reduces 46 metrics to 10 per direction, with three axes requiring different metrics across X→EN and EN→X (e.g., TER/UTMOS vs. ChrF++/NISQA-MOS); these subsets preserve rankings (measured by Spearman's correlation) while cutting evaluation time. Human validation across dubbing, podcasts, and medical domains shows standalone MOS predictors fail to predict listener preference, while top domain-specific metrics correlate with human judgment. We release COMPASS as a foundation for domain-aware S2ST evaluation.
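Whether a reduced metric subset "preserves rankings" is a rank-correlation question. Below is a minimal pure-Python sketch of Spearman correlation between two system rankings. The scores are invented for illustration and have nothing to do with the COMPASS results.

```python
def average_ranks(values):
    """Rank values from 1..n, giving tied values their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        shared = (start + end) / 2 + 1  # average of the 1-based positions
        for k in range(start, end + 1):
            ranks[order[k]] = shared
        start = end + 1
    return ranks


def spearman(x, y):
    """Spearman correlation: Pearson correlation of the two rank vectors."""
    rx, ry = average_ranks(x), average_ranks(y)
    n = len(rx)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    return cov / (var_x * var_y) ** 0.5


# Hypothetical scores for four systems under the full and reduced metric sets.
full_set_scores = [0.71, 0.64, 0.80, 0.55]
subset_scores = [3.9, 3.1, 4.2, 3.4]
rho = spearman(full_set_scores, subset_scores)
```

A value close to 1 means the reduced set orders the systems almost the same way as the full set. The function is undefined when one of the inputs is constant.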
DeepMind paper
SURF: Separation via Unsupervised Remixing Flow
https://google.github.io/df-conformer/surf/
https://arxiv.org/abs/2606.04921
The goal of single-channel source separation is to reconstruct K sources given their mixture. In supervised settings where vast amounts of clean source data are available, this challenging, ill-posed problem has been addressed successfully by generative diffusion and flow-based prior models. However, access to such clean source samples is often limited. To bridge this gap, we present Separation via Unsupervised Remixing Flow (SURF), an unsupervised flow matching approach for source separation that learns directly from observed mixtures. This method relies on a novel combination of state-of-the-art supervised flow matching and regression-based self-supervised techniques. At a high level, starting from a teacher model, we utilize a “remixing” step to bootstrap the learning of a student flow model from the teacher’s estimates. We provide insights into the objectives optimized by this approach and draw a novel connection to the Wake-Sleep algorithm. Empirical evaluations on image and audio benchmarks demonstrate that SURF establishes a new state-of-the-art, significantly outperforming existing unsupervised methods.
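The "remixing" step can be pictured roughly as follows. This is a guess at the general idea based only on the abstract, not the SURF algorithm itself: take the teacher's source estimates for a batch of mixtures, shuffle each source slot across the batch, and sum the shuffled sources into new mixtures whose components are known. The `remix` helper and the toy data are hypothetical.

```python
import random


def remix(estimated_sources, rng):
    """Build new mixtures by shuffling each source slot across the batch.

    estimated_sources[b][k] is the teacher's estimate of source k for
    mixture b, stored as a list of samples. Returns (new_mixtures,
    new_targets), where new_targets[b][k] are the sources summed into
    new_mixtures[b].
    """
    batch = len(estimated_sources)
    num_sources = len(estimated_sources[0])
    new_targets = [[None] * num_sources for _ in range(batch)]
    for k in range(num_sources):
        perm = list(range(batch))
        rng.shuffle(perm)  # independent permutation per source slot
        for b in range(batch):
            new_targets[b][k] = estimated_sources[perm[b]][k]
    new_mixtures = [
        [sum(samples) for samples in zip(*sources)] for sources in new_targets
    ]
    return new_mixtures, new_targets


# Toy teacher output: 3 mixtures, K = 2 sources, 2 samples each.
teacher_out = [
    [[1.0, 2.0], [0.5, 0.5]],
    [[3.0, 4.0], [-1.0, 1.0]],
    [[0.0, 0.0], [2.0, 2.0]],
]
mixtures, targets = remix(teacher_out, random.Random(0))
```

Under this reading, a student model would be trained to recover `targets` from `mixtures`. How SURF combines this with flow matching is specified in the paper.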
Recent SynSIG seminars have finally been uploaded
https://www.youtube.com/@isca-synsig
for example
https://www.youtube.com/watch?v=M8n9I9eGyTM
SynSIG seminars - S1E02 - Nikita Torgashov
Streaming TTS with Dynamic Rate Control
Nikita Torgashov -- PhD student at KTH Royal Institute of Technology
This work explores full-stream text-to-speech systems for real-time interaction, focusing on incremental speech generation from streaming input with…