Interesting job, those are rare nowadays
Bland.ai builds AI voice agents that handle real phone calls for some of the largest companies in the world. Our software runs inside critical workflows at companies like Samsara, Gallup, TripAdvisor, Snapchat, Signant Health, Better.com, and others. We have raised $65 million from top Silicon Valley investors including Emergence Capital, Scale Venture Partners, Y Combinator, and the founders of Twilio, Affirm, and ElevenLabs.
We are expanding our research team as we train and deploy our own TTS and STT models in production. We are also investing heavily in next-generation speech-to-speech and speech inference systems.
We are currently hiring for two roles:
Research
If you have designed and trained your own models, published papers or in-depth technical writing, and are working at the leading edge of audio research, we would love to hear from you:
https://jobs.ashbyhq.com/bland/d2e08077-61f0-4810-bc72-3efd7944647b
You might be a strong fit if you have experience with:
- Large-scale TTS, STT, or neural audio codec systems
- Self-supervised learning, generative modeling, or multimodal modeling
- Neural audio codecs, discrete or continuous latent representations, and compression tradeoffs
- Running tight ablations and controlled experiments that move ideas from hypothesis to validation quickly
- Optimizing inference for real-time, low-latency production systems
Machine Learning Engineer
If you are a strong programmer who enjoys building terabyte-scale datasets, designing training pipelines, and working on model inference and deployment, while staying closely connected to research, apply here:
https://jobs.ashbyhq.com/bland/05906608-0628-412c-8b01-a050d87986c5
If you have any questions please feel free to shoot me a DM!
Our friend @vancheeck recently pushed a new generation of an outstanding speaker identification architecture:
https://github.com/PalabraAI/redimnet2
It is great this project continues in Palabra https://www.palabra.ai
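For readers new to speaker models: downstream, verification with embeddings like ReDimNet's usually reduces to a cosine score against a tuned threshold. A minimal sketch with toy random vectors standing in for real embeddings (the threshold value is illustrative, not from the repo):

```python
import numpy as np

def cosine_score(emb_a: np.ndarray, emb_b: np.ndarray) -> float:
    """Cosine similarity between two speaker embeddings."""
    a = emb_a / np.linalg.norm(emb_a)
    b = emb_b / np.linalg.norm(emb_b)
    return float(np.dot(a, b))

def same_speaker(emb_a: np.ndarray, emb_b: np.ndarray, threshold: float = 0.5) -> bool:
    """Accept the trial when the score clears a (dataset-tuned) threshold."""
    return cosine_score(emb_a, emb_b) >= threshold

# Toy embeddings: two noisy views of one voice vs. an unrelated voice.
rng = np.random.default_rng(0)
voice = rng.normal(size=192)
enroll = voice + 0.1 * rng.normal(size=192)
test_same = voice + 0.1 * rng.normal(size=192)
test_diff = rng.normal(size=192)
print(same_speaker(enroll, test_same), same_speaker(enroll, test_diff))
```

In practice the threshold is calibrated on a held-out trial list (e.g. to a target equal error rate) rather than fixed at 0.5.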
IWSLT 2026 has some interesting competitions (like subtitling) with data available for download
https://iwslt.org/2026/subtitling
Evaluation period starts April 1st
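For anyone poking at the subtitling track: the target format is typically SRT-style timed text. A quick sketch of turning timed segments into SRT (my own helper, not an official IWSLT tool):

```python
def to_srt_timestamp(seconds: float) -> str:
    """Format seconds as an SRT timestamp, HH:MM:SS,mmm."""
    ms = round(seconds * 1000)
    h, ms = divmod(ms, 3_600_000)
    m, ms = divmod(ms, 60_000)
    s, ms = divmod(ms, 1_000)
    return f"{h:02d}:{m:02d}:{s:02d},{ms:03d}"

def segments_to_srt(segments) -> str:
    """segments: iterable of (start_sec, end_sec, text) cues."""
    blocks = []
    for i, (start, end, text) in enumerate(segments, 1):
        blocks.append(
            f"{i}\n{to_srt_timestamp(start)} --> {to_srt_timestamp(end)}\n{text}"
        )
    return "\n\n".join(blocks) + "\n"

print(segments_to_srt([(0.0, 2.5, "Hello there."), (2.5, 5.0, "Welcome to IWSLT.")]))
```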
Two talks uploaded, interesting information in both:
State of the art in AudioLLMs (no hope compared to text ones)
https://www.youtube.com/watch?v=BJ3L0Kmz7Jw
Meeting transcription: LLMs are still bad at diarization; specialized systems (Diarizen + SE-Dicow) are much better
https://www.youtube.com/watch?v=2iIXUEnVkAA
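To make the diarization comparison concrete, here is a toy frame-level error metric. Real DER scoring uses time collars and handles overlapped speech; this simplified version only illustrates that a global label permutation is not an error:

```python
from itertools import permutations

import numpy as np

def frame_der(ref: np.ndarray, hyp: np.ndarray) -> float:
    """Toy frame-level diarization error: fraction of frames whose speaker
    label differs under the best global speaker mapping (-1 = silence)."""
    speakers = sorted(set(hyp[hyp >= 0].tolist()))
    best_err = len(ref)
    for perm in permutations(speakers):
        mapping = dict(zip(speakers, perm))
        remapped = np.array([mapping.get(x, -1) for x in hyp.tolist()])
        best_err = min(best_err, int(np.sum(remapped != ref)))
    return best_err / len(ref)

ref = np.array([0, 0, 0, 1, 1, 1, -1, -1])
hyp = np.array([1, 1, 1, 0, 0, 0, -1, -1])  # same turns, labels swapped
print(frame_der(ref, hyp))  # 0.0 -- a label permutation is not an error
```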
Talk 41 of the Conversational AI Reading Group: “Auden: Where is the ‘GPT moment’ for audio?” by Yiwen Shao (Tencent AI Lab).
For further information about the Reading Group, please check out https://poonehmousavi.github.io/rg
Google DeepMind released African ASR/TTS data, somewhat interesting
The WAXAL dataset is a large-scale multilingual speech corpus for African languages, introduced in the paper WAXAL: A Large-Scale Multilingual African Language Speech Corpus.
https://huggingface.co/datasets/google/WaxalNLP
Reasoning in audio LLMs is a problem
https://github.com/Blinorot/ALARM
https://arxiv.org/abs/2603.09556
This is the official implementation of ALARM: Audio–Language Alignment for Reasoning Models, an audio reasoning language model trained in a self-generation setup that achieves state-of-the-art performance on Speech Understanding benchmarks with a 4B backbone.
Abstract: Large audio language models (ALMs) extend LLMs with auditory understanding. A common approach freezes the LLM and trains only an adapter on self-generated targets. However, this fails for reasoning LLMs (RLMs) whose built-in chain-of-thought traces expose the textual surrogate input, yielding unnatural responses. We propose self-rephrasing, converting self-generated responses into audio-understanding variants compatible with RLMs while preserving distributional alignment. We further fuse and compress multiple audio encoders for stronger representations. For training, we construct a 6M-instance multi-task corpus (2.5M unique prompts) spanning 19K hours of speech, music, and sound. Our 4B-parameter ALM outperforms similarly sized models and surpasses most larger ALMs on related audio-reasoning benchmarks, while preserving textual capabilities with a low training cost. Notably, we achieve the best open-source result on the MMAU-speech and MMSU benchmarks and rank third among all the models.
https://huggingface.co/datasets/ai-coustics/dawn_chorus_en
dawn_chorus_en
An open-source evaluation dataset for accurate foreground speaker transcription.
The dataset targets mixture conditions where foreground speech remains generally transcribable by speech-to-text systems, while background speech is distinctly perceived as background. It provides around 90 minutes of foreground–background speech mixtures composed of recorded and synthesized foreground speech, along with ground truth foreground speech and corresponding transcripts.
Inspired by DAPS, which frames speech enhancement as a direct transformation from real-world device recordings to professionally produced studio speech via aligned input–output pairs, we design this dataset around an equally application-driven mapping: from realistic foreground–background speech mixtures to isolated primary-speaker speech that remains robustly transcribable by downstream STT systems. Like DAPS, our approach emphasizes time-aligned references and real recording / transmission conditions rather than purely synthetic degradations, enabling evaluation of suppression strength versus foreground speech distortion.
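The mixture construction they describe boils down to scaling background speech to a target SNR before adding it to the foreground. A minimal sketch (random noise standing in for speech; this is my illustration, not the dataset's actual pipeline):

```python
import numpy as np

def mix_at_snr(foreground: np.ndarray, background: np.ndarray, snr_db: float) -> np.ndarray:
    """Scale the background so the foreground-to-background power ratio
    equals snr_db, then add. Assumes equal-length mono signals."""
    p_fg = np.mean(foreground ** 2)
    p_bg = np.mean(background ** 2)
    gain = np.sqrt(p_fg / (p_bg * 10 ** (snr_db / 10)))
    return foreground + gain * background

rng = np.random.default_rng(0)
fg = rng.normal(size=16000)  # stand-in for 1 s of 16 kHz foreground speech
bg = rng.normal(size=16000)  # stand-in for background speech
mix = mix_at_snr(fg, bg, snr_db=5.0)
```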
Nice upsampler, trained for music; it supports upsampling from 8 kHz (important)
https://github.com/woongzip1/UniverSR
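For context on why a learned model is needed at all: plain resampling raises the sample rate but cannot restore content above the original 4 kHz Nyquist limit of 8 kHz audio. A pure-numpy illustration using linear interpolation (UniverSR itself does learned bandwidth extension, which is a different thing):

```python
import numpy as np

sr_in, sr_out = 8000, 48000
t_in = np.arange(sr_in) / sr_in              # 1 s of audio at 8 kHz
narrowband = np.sin(2 * np.pi * 300 * t_in)  # toy 300 Hz tone

# Interpolation gives 6x more samples, but the spectrum above the
# original 4 kHz Nyquist stays (almost) empty: no new content appears.
# Recovering those bands is what a learned model is for.
t_out = np.arange(sr_out) / sr_out
upsampled = np.interp(t_out, t_in, narrowband)
print(len(upsampled))  # 48000
```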
DiTs power modern TTS systems, but their issues are rarely mentioned: longer training times and higher data requirements. Convolutions still make sense, given that speech data is locally uniform. Research like this still matters for us GPU-poor folks.
https://arxiv.org/abs/2603.09408v1
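The cost argument is easy to see in rough per-layer operation counts: global self-attention scales quadratically with sequence length, a depthwise convolution linearly, so the ratio grows as roughly 2L/k. A back-of-the-envelope sketch (projections and normalizations ignored; the numbers are illustrative):

```python
def attention_flops(L: int, d: int) -> int:
    """Rough cost of global self-attention: QK^T scores plus the
    weighted sum over V, each ~L*L*d multiply-adds."""
    return 2 * L * L * d

def depthwise_conv_flops(L: int, d: int, k: int = 7) -> int:
    """Rough cost of a depthwise conv: one k-tap filter per channel."""
    return L * d * k

for L in (256, 1024, 4096):  # frame counts for progressively longer utterances
    print(L, attention_flops(L, 512) // depthwise_conv_flops(L, 512))
```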
Reviving ConvNeXt for Efficient Convolutional Diffusion Models
Just another reminder that there is no point in ONNX:
https://github.com/eschmidbauer/moonshine-c
The source is pure C, 825 lines of code; the executable is 40 kB. It runs ASR just fine.
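As an illustration of why tiny dependency-free runtimes are feasible: the decode side of ASR needs little more than argmax and a loop. Here is a CTC-style greedy decoder as a toy example (Moonshine itself is an encoder-decoder model, so this is not its actual decode path):

```python
import numpy as np

def ctc_greedy_decode(logits: np.ndarray, vocab: str, blank: int = 0) -> str:
    """Greedy CTC decoding: argmax per frame, collapse repeats, drop blanks."""
    ids = logits.argmax(axis=-1)
    out, prev = [], blank
    for i in ids:
        if i != prev and i != blank:
            out.append(vocab[i])
        prev = i
    return "".join(out)

vocab = "_abc"  # index 0 is the CTC blank
logits = np.array([
    [0.1, 0.9, 0.0, 0.0],  # a
    [0.1, 0.9, 0.0, 0.0],  # a (repeat, collapsed)
    [0.9, 0.0, 0.1, 0.0],  # blank
    [0.0, 0.0, 0.9, 0.1],  # b
])
print(ctc_greedy_decode(logits, vocab))  # "ab"
```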
Interesting community on Reddit
https://www.reddit.com/r/VoiceAutomationAI/
The community will host an AMA session with Tony Robinson, one of the most knowledgeable people I know.
Upcoming AMA with Dr Tony Robinson (Founder Speechmatics)
Excited to announce that Dr Tony Robinson will be joining Unio - The Voice AI Community powered by SLNG for a live AMA with builders & founders.
If you’re building voice AI, you already know this:
it works in demos… and breaks in production.
Dr Tony has spent 36+ years in Voice AI, starting in 1989 at Cambridge where he built one of the earliest neural network based speech recognition systems, long before deep learning became mainstream.
Today, Speechmatics powers voice AI across 50+ languages, with customers seeing 9x growth in voice agent adoption in 2025.
📅 Date: 27 March
⏰ Time: 10:30 AM PST / 11:00 PM IST
📍 Location: Reddit (r/VoiceAutomationAI)
For the next 24 hours, he’ll be answering questions about:
• What actually breaks in production voice AI (and how to fix it)
• Accents, noise, latency & real-world edge cases
• Designing reliable STT-LLM-TTS pipelines
• Lessons from 35+ years building speech systems
• Where voice AI is really heading (beyond the hype)
• What he’d do differently if starting today
If you're building in Voice AI, AI agents, or conversational automation, this is a rare opportunity to learn from someone who has been solving these problems for decades.
Join the reddit community to drop questions👇
Link in the first comment.
Good talk on SpeechLMs
https://www.youtube.com/watch?v=m65SiSnsZ3g
It explains the paper below. Basically, at different points in time one has to pick different layers from the text LM for the adapters: word boundaries require more linguistic knowledge, mid-word positions more acoustic knowledge. Adjusting the adapters accordingly yields big improvements.
https://arxiv.org/abs/2503.06211
Late Fusion and Multi-Level Fission Amplify Cross-Modal Transfer in Text-Speech LMs
Santiago Cuervo, Adel Moumen, Yanis Labrak, Sameer Khurana, Antoine Laurent, Mickael Rouvier, Phil Woodland, Ricard Marxer
Text-Speech Language Models (TSLMs) -- language models trained to jointly process and generate text and speech -- are commonly trained through an early modality fusion/fission approach, in which both modalities are fed and predicted from a shared backbone via linear layers. We hypothesize that this approach limits cross-modal transfer by neglecting feature compositionality -- specifically, the finer-grained nature of speech representations compared to text -- preventing the emergence of a shared feature hierarchy within model layers. In this paper, we argue that this limitation can be addressed through late fusion and fission, with a fission process that accesses both high- and low-level features for speech generation. Our models implementing these principles, SmolTolk, rival or surpass state-of-the-art TSLMs trained with orders of magnitude more compute, and achieve significantly improved cross-modal performance relative to early fusion/fission baselines. Representation analyses further suggest that our method enhances the model's ability to abstract higher-level, more semantic features from speech, and leads to increasingly shared representation spaces across layers.
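The multi-level idea above can be sketched as per-position softmax weights over LM layers, so boundary positions can lean on deep (linguistic) layers and mid-word positions on shallow (acoustic) ones. A toy numpy version (my own simplification, not SmolTolk's implementation):

```python
import numpy as np

def fuse_layers(hidden_states: np.ndarray, layer_logits: np.ndarray) -> np.ndarray:
    """hidden_states: (num_layers, seq_len, dim); layer_logits: (seq_len, num_layers).
    Softmax over layers per position, then mix: each position draws its own
    blend of shallow and deep layers."""
    w = np.exp(layer_logits - layer_logits.max(axis=-1, keepdims=True))
    w /= w.sum(axis=-1, keepdims=True)
    # out[t, d] = sum_l w[t, l] * hidden_states[l, t, d]
    return np.einsum("tl,ltd->td", w, hidden_states)

num_layers, T, D = 4, 6, 8
rng = np.random.default_rng(0)
h = rng.normal(size=(num_layers, T, D))
logits = np.zeros((T, num_layers))
logits[0, -1] = 10.0  # "word boundary" position -> top (linguistic) layer
logits[3, 0] = 10.0   # "mid-word" position -> bottom (acoustic) layer
fused = fuse_layers(h, logits)
```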
Talk 42 of the Conversational AI Reading Group: “Advancing the Linguistic Capabilities of Speech Language Models” by Ricard Marxer (Université de Toulon, ILLS, CNRS).
For further information about the Reading Group, please check out https://poonehmousavi.github.io/rg
Ultra-Sortformer: Extending NVIDIA Sortformer to N Speakers
https://github.com/LilDevsy0117/Ultra-Sortformer
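The Sortformer idea in one sketch: speaker outputs are ordered by arrival time, which sidesteps permutation-invariant training. A toy version of that ordering (simplified illustration, not the repo's code):

```python
import numpy as np

def sort_by_arrival(activity: np.ndarray) -> np.ndarray:
    """activity: (num_speakers, num_frames) binary speaker-activity matrix.
    Reorder the rows by each speaker's first active frame -- the arrival-time
    ordering that lets Sortformer skip permutation-invariant training."""
    num_frames = activity.shape[1]
    first_frame = np.array([
        int(row.argmax()) if row.any() else num_frames  # silent rows go last
        for row in activity
    ])
    return activity[np.argsort(first_frame, kind="stable")]

act = np.array([
    [0, 0, 0, 1, 1, 0],  # starts at frame 3
    [1, 1, 0, 0, 0, 0],  # starts at frame 0
    [0, 1, 1, 0, 0, 0],  # starts at frame 1
])
print(sort_by_arrival(act))
```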
VoxCPM2 is the latest major release: a 2B-parameter model trained on over 2 million hours of multilingual speech data, now supporting 30 languages, Voice Design, Controllable Voice Cloning, and 48 kHz studio-quality audio output. It is built on a MiniCPM-4 backbone.
https://github.com/OpenBMB/VoxCPM