Speech Technology
https://twitter.com/DrJimFan/status/1622276293776793600

Looks like many of you are ready to embrace the Year of Sound Waves!

Here’s a big and OPEN dataset for you to get your hands dirty on AI audio modeling: EPIC-SOUNDS, 78k segments of annotated, audible events and actions.

Downloadable here: https://epic-kitchens.github.io/epic-sounds/
CMU pubs are nice. High-quality TTS trained on YouTube data

https://github.com/b04901014/MQTTS

https://arxiv.org/abs/2302.04215

A Vector Quantized Approach for Text to Speech Synthesis on Real-World Spontaneous Speech

Li-Wei Chen, Shinji Watanabe, Alexander Rudnicky

Recent Text-to-Speech (TTS) systems trained on reading or acted corpora have achieved near human-level naturalness. The diversity of human speech, however, often goes beyond the coverage of these corpora. We believe the ability to handle such diversity is crucial for AI systems to achieve human-level communication. Our work explores the use of more abundant real-world data for building speech synthesizers. We train TTS systems using real-world speech from YouTube and podcasts. We observe the mismatch between training and inference alignments in mel-spectrogram based autoregressive models, leading to unintelligible synthesis, and demonstrate that learned discrete codes within multiple code groups effectively resolve this issue. We introduce our MQTTS system whose architecture is designed for multiple code generation and monotonic alignment, along with the use of a clean silence prompt to improve synthesis quality. We conduct ablation analyses to identify the efficacy of our methods. We show that MQTTS outperforms existing TTS systems in several objective and subjective measures.
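The key trick here, predicting discrete codes from multiple code groups instead of mel-spectrograms, can be sketched with a toy residual-style quantizer. The codebooks, dimensions, and group count below are made up for illustration; MQTTS's actual quantizer is learned during training.

```python
import numpy as np

def quantize_multi_codebook(frame, codebooks):
    """Quantize a feature frame with several codebooks (one code per group).

    Residual-style: each codebook quantizes what the previous ones missed.
    Returns the chosen code indices (one per group) and the reconstruction.
    """
    residual = frame.astype(float)
    indices, recon = [], np.zeros_like(residual)
    for cb in codebooks:                      # cb has shape (num_codes, dim)
        dists = np.linalg.norm(cb - residual, axis=1)
        idx = int(np.argmin(dists))           # nearest code in this group
        indices.append(idx)
        recon += cb[idx]
        residual -= cb[idx]
    return indices, recon

rng = np.random.default_rng(0)
codebooks = [rng.normal(size=(8, 4)) for _ in range(3)]  # 3 groups, 8 codes each
frame = rng.normal(size=4)
idx, recon = quantize_multi_codebook(frame, codebooks)
```

The TTS model then only has to predict the small integer indices per group, which sidesteps the fragile continuous mel-spectrogram targets.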
It is interesting how quickly people implement ideas, like podcast transcripts with Whisper. Here is a selection:

https://podscript.ai/
https://podtext.ai/
https://podscription.app/
https://podsearch.page/

Discussion https://news.ycombinator.com/item?id=34727695
https://github.com/openai/whisper/discussions/937

Whisper model in CTranslate2, which is a fast inference engine for Transformer models. The project supports many useful inference features such as CPU and GPU execution, asynchronous execution, multi-GPU execution, 8-bit quantization, etc.

You can find a usage example here.

Note that it does not currently implement the full transcription loop, only the model.decode part. So you would still need to implement the transcription logic from transcribe.py on top of it (iterate on each 30-second window, accumulate the context in the prompt, etc.).
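That windowing-plus-prompt loop from transcribe.py can be sketched roughly as follows. Here decode_fn is a stand-in for the model's decode step, and the prompt budget is illustrative; this is not CTranslate2's API.

```python
SAMPLE_RATE = 16000
WINDOW_SECONDS = 30

def transcribe(audio, decode_fn, max_prompt_tokens=224):
    """Schematic Whisper-style transcription loop.

    audio: 1-D sequence of samples at 16 kHz.
    decode_fn: stand-in for the model's decode step; takes
               (window, prompt_tokens) and returns tokens for that window.
    """
    window_size = SAMPLE_RATE * WINDOW_SECONDS
    prompt, all_tokens = [], []
    for start in range(0, len(audio), window_size):
        window = audio[start:start + window_size]
        tokens = decode_fn(window, prompt)
        all_tokens.extend(tokens)
        # Accumulate context: previous output becomes the next prompt,
        # truncated to the model's prompt budget.
        prompt = (prompt + tokens)[-max_prompt_tokens:]
    return all_tokens

# Toy decoder: emits one "token" per window, numbered by prompt length.
fake_decode = lambda window, prompt: [f"tok{len(prompt)}"]
audio = [0.0] * (SAMPLE_RATE * 75)   # 75 s of audio -> 3 windows
print(transcribe(audio, fake_decode))  # → ['tok0', 'tok1', 'tok2']
```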

For example, here's the transcription time of 13 minutes of audio on a V100 for the same accuracy:

Implementation   Time with "small" model   Time with "medium" model
Baseline         1m37s                     3m16s
CTranslate2      0m25s                     0m42s
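Parsing the reported times gives the implied speedups, a quick sanity check assuming the mMsSs duration format above:

```python
def to_seconds(t):
    """Parse a '1m37s'-style duration into seconds."""
    m, s = t.rstrip("s").split("m")
    return int(m) * 60 + int(s)

baseline = {"small": "1m37s", "medium": "3m16s"}
ct2 = {"small": "0m25s", "medium": "0m42s"}
for size in baseline:
    speedup = to_seconds(baseline[size]) / to_seconds(ct2[size])
    print(f"{size}: {speedup:.1f}x faster")
# small: 3.9x faster
# medium: 4.7x faster
```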
Abdelrahman Mohamed, Director of AI at Meta, joined Rembrand

https://www.rembrand.com/blog/rembrand-announces-8-million-seed-round/
Some things about state of software from this weekend:

Transformers library pipeline API doesn't split long texts for NER yet https://github.com/huggingface/transformers/pull/19735

Fixed a bug in Kaldi's matrix code so it can load 10 GB matrices; not sure how it went unnoticed for so long https://github.com/kaldi-asr/kaldi/pull/4823
From FunASR

Updated onnxruntime today and optimized inference speed. Actual measurement with paraformer-large on CPU, averaged over 100 runs: 2.8x faster inference than the modelscope pipeline (RTF 0.110 -> 0.0386), deployed with ONNX.

Users can update the new pipeline: https://github.com/alibaba-damo-academy/FunASR/tree/main/funasr/runtime/python/onnxruntime/paraformer/rapid_paraformer
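As a quick sanity check on those numbers: real-time factor (RTF) is compute time divided by audio duration, so the speedup is just the ratio of the two RTFs.

```python
def process_time(rtf, audio_seconds):
    """Time to process audio at a given real-time factor (RTF = compute/audio)."""
    return rtf * audio_seconds

old_rtf, new_rtf = 0.110, 0.0386
hour = 3600  # one hour of audio, in seconds
print(f"{process_time(old_rtf, hour):.2f} s")  # 396.00 s per hour of audio
print(f"{process_time(new_rtf, hour):.2f} s")  # 138.96 s per hour of audio
print(f"{old_rtf / new_rtf:.1f}x faster")      # 2.8x, matching the report
```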
https://opendata.iisys.de/

Good German TTS dataset (326 hours, 5 speakers)

and a 610-hour ASR dataset

Learned from https://arxiv.org/abs/2302.06008
ASR Bundestag: A Large-Scale political debate dataset in German

We present ASR Bundestag, a dataset for automatic speech recognition in German, consisting of 610 hours of aligned audio-transcript pairs for supervised training as well as 1,038 hours of unlabeled audio snippets for self-supervised learning, based on raw audio data and transcriptions from plenary sessions and committee meetings of the German parliament. In addition, we discuss utilized approaches for the automated creation of speech datasets and assess the quality of the resulting dataset based on evaluations and finetuning of a pre-trained state of the art model. We make the dataset publicly available, including all subsets.
We are glad people use Vosk for real-life applications. If you have built something using Vosk, please share. Here is a great example:

Pal Robotics uses Vosk to recognize speech in ARI V2 service robot

https://pal-robotics.com/wp-content/uploads/2022/12/ARI-Datasheet.pdf
Attention ASR developers and researchers! 🚀 Great news: with the latest update of 🤗 PEFT, you can now fine-tune your Whisper-large model faster than ever before! The new update lets you fit 5X larger batches in less than 10GB of GPU VRAM, thanks to LoRA and Tim Dettmers's bnb, packaged nicely in 🤗 PEFT. And the best part? You get a comparable WER, just faster! ⚡️

But that's not all: you no longer have to compromise on training speed to maintain WER. In fact, in our experiments with the Marathi language, the WER was comparable with full fine-tuning runs of Whisper-large: 13.64 WER without PEFT (full training run) vs. 14.01 WER with PEFT (trained on a Google Colab). With 🤗 PEFT, you can now train a Whisper-large v2 model in less than 8GB GPU VRAM! 📉

Without 🤗 PEFT, you could experience OOM on a Colab T4, but not anymore! You can easily save on storage and port tiny checkpoints: ~63 MB compared to a 6.7 GB fully fine-tuned model. 🐜
And that's not all! For low latency, you can convert the PEFT model to ONNX and run it with ONNX Runtime (ORT) via 🤗 Optimum.

Start experimenting today and fine-tune your Whisper using PEFT+INT8 in Colab on a language of your choice! Join our Discord community to get involved in the conversation and discuss your results and questions. 🔬


Check out the Colab notebook examples and start your ASR development journey with 🤗 PEFT today!

https://github.com/huggingface/peft
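Why the checkpoints shrink so much: LoRA freezes the base weight and trains only two low-rank factors. A from-scratch toy (not the PEFT API; the dimensions and rank are illustrative) makes the parameter count concrete:

```python
import numpy as np

class LoRALinear:
    """Minimal from-scratch LoRA layer: y = x @ (W + A @ B).

    W is frozen; only the low-rank factors A (d_in x r) and B (r x d_out)
    are trained and saved, which is why adapter checkpoints are tiny.
    """
    def __init__(self, d_in, d_out, r, seed=0):
        rng = np.random.default_rng(seed)
        self.W = rng.normal(size=(d_in, d_out))     # frozen base weight
        self.A = rng.normal(size=(d_in, r)) * 0.01  # trainable
        self.B = np.zeros((r, d_out))               # trainable, zero-init so
                                                    # training starts at y = x @ W
    def __call__(self, x):
        return x @ self.W + x @ self.A @ self.B

    def trainable_params(self):
        return self.A.size + self.B.size

layer = LoRALinear(d_in=1280, d_out=1280, r=32)
full, lora = layer.W.size, layer.trainable_params()
print(f"trainable fraction: {lora / full:.3%}")  # trainable fraction: 5.000%
```

Only A and B go into the checkpoint, so the saved file scales with the rank r, not with the full model size.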
From Google

https://arxiv.org/abs/2302.11186

UML: A Universal Monolingual Output Layer for Multilingual ASR

Chao Zhang, Bo Li, Tara N. Sainath, Trevor Strohman, Shuo-yiin Chang

Word-piece models (WPMs) are commonly used subword units in state-of-the-art end-to-end automatic speech recognition (ASR) systems. For multilingual ASR, due to the differences in written scripts across languages, multilingual WPMs bring the challenges of having overly large output layers and scaling to more languages. In this work, we propose a universal monolingual output layer (UML) to address such problems. Instead of one output node for only one WPM, UML re-associates each output node with multiple WPMs, one for each language, and results in a smaller monolingual output layer shared across languages. Consequently, the UML enables switching the interpretation of each output node depending on the language of the input speech. Experimental results on an 11-language voice search task demonstrated the feasibility of using UML for high-quality and high-efficiency multilingual streaming ASR.
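The re-association idea can be illustrated with a toy shared output layer; the word-piece tables below are invented for the example, not taken from the paper:

```python
import numpy as np

# One shared output layer of N nodes; each node is re-associated with a
# different word-piece per language (made-up tables for illustration).
N = 4
wpm_tables = {
    "en": ["the", "_a", "ing", "</s>"],
    "es": ["el", "_la", "ción", "</s>"],
}

def interpret(logits, language):
    """Pick the argmax node, then read it through that language's WPM table."""
    node = int(np.argmax(logits))
    return wpm_tables[language][node]

logits = np.array([0.1, 2.5, 0.3, 0.0])  # same network output for any language
print(interpret(logits, "en"))  # _a
print(interpret(logits, "es"))  # _la
```

The output layer stays monolingual-sized (N nodes) no matter how many languages share it; only the lookup tables grow.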
https://arxiv.org/abs/2302.10248

VoxSRC 2022: The Fourth VoxCeleb Speaker Recognition Challenge

Jaesung Huh, Andrew Brown, Jee-weon Jung, Joon Son Chung, Arsha Nagrani, Daniel Garcia-Romero, Andrew Zisserman

This paper summarises the findings from the VoxCeleb Speaker Recognition Challenge 2022 (VoxSRC-22), which was held in conjunction with INTERSPEECH 2022. The goal of this challenge was to evaluate how well state-of-the-art speaker recognition systems can diarise and recognise speakers from speech obtained "in the wild". The challenge consisted of: (i) the provision of publicly available speaker recognition and diarisation data from YouTube videos together with ground truth annotation and standardised evaluation software; and (ii) a public challenge and hybrid workshop held at INTERSPEECH 2022. We describe the four tracks of our challenge along with the baselines, methods, and results. We conclude with a discussion on the new domain-transfer focus of VoxSRC-22, and on the progression of the challenge from the previous three editions.
BigVGAN is accepted at ICLR 2023.
Listen to audio samples:
https://bigvgan-demo.github.io

A universal audio synthesis model: trained on speech only, it works for out-of-distribution scenarios, e.g., unseen singing voices and music audio!

Code and models are released!
https://github.com/NVIDIA/BigVGAN

https://twitter.com/_weiping/status/1628210425480515584
From Phil Woodland

https://arxiv.org/abs/2302.08579

Adaptable End-to-End ASR Models using Replaceable Internal LMs and Residual Softmax

Keqi Deng, Philip C. Woodland

End-to-end (E2E) automatic speech recognition (ASR) implicitly learns the token sequence distribution of paired audio-transcript training data. However, it still suffers from domain shifts from training to testing, and domain adaptation is still challenging. To alleviate this problem, this paper designs a replaceable internal language model (RILM) method, which makes it feasible to directly replace the internal language model (LM) of E2E ASR models with a target-domain LM in the decoding stage when a domain shift is encountered. Furthermore, this paper proposes a residual softmax (R-softmax) that is designed for CTC-based E2E ASR models to adapt to the target domain without re-training during inference. For E2E ASR models trained on the LibriSpeech corpus, experiments showed that the proposed methods gave a 2.6% absolute WER reduction on the Switchboard data and a 1.0% WER reduction on the AESRC2020 corpus while maintaining intra-domain ASR results.