Speech Technology – Telegram

Speech Technology

1.59K subscribers

122 photos

4 videos

1 file

2.12K links

Download Telegram

About

Blog

Apps

Platform

Speech Technology

1.59K subscribers

Speech Technology

12m hours of speech data

https://arxiv.org/abs/2303.01037

Google USM: Scaling Automatic Speech Recognition Beyond 100 Languages

Yu Zhang, Wei Han, James Qin, Yongqiang Wang, Ankur Bapna, Zhehuai Chen, Nanxin Chen, Bo Li, Vera Axelrod, Gary Wang, Zhong Meng, Ke Hu, Andrew Rosenberg, Rohit Prabhavalkar, Daniel S. Park, Parisa Haghani, Jason Riesa, Ginger Perng, Hagen Soltau, Trevor Strohman, Bhuvana Ramabhadran, Tara Sainath, Pedro Moreno, Chung-Cheng Chiu, Johan Schalkwyk, Françoise Beaufays, Yonghui Wu

We introduce the Universal Speech Model (USM), a single large model that performs automatic speech recognition (ASR) across 100+ languages. This is achieved by pre-training the encoder of the model on a large unlabeled multilingual dataset of 12 million (M) hours spanning over 300 languages, and fine-tuning on a smaller labeled dataset. We use multilingual pre-training with random-projection quantization and speech-text modality matching to achieve state-of-the-art performance on downstream multilingual ASR and speech-to-text translation tasks. We also demonstrate that despite using a labeled training set 1/7-th the size of that used for the Whisper model, our model exhibits comparable or better performance on both in-domain and out-of-domain speech recognition tasks across many languages.

848 views23:51

Speech Technology

https://github.com/haoheliu/AudioLDM

GitHub - haoheliu/AudioLDM: AudioLDM: Generate speech, sound effects, music and beyond, with text.

AudioLDM: Generate speech, sound effects, music and beyond, with text. - haoheliu/AudioLDM

654 viewsedited 22:24

Speech Technology

https://github.com/tuanct1997/Federated-Learning-ASR-based-on-wav2vec-2.0

GitHub - tuanct1997/Federated-Learning-ASR-based-on-wav2vec-2.0

Contribute to tuanct1997/Federated-Learning-ASR-based-on-wav2vec-2.0 development by creating an account on GitHub.

605 views20:57

Speech Technology

https://www.speechmatics.com/company/articles-and-news/introducing-ursa-the-worlds-most-accurate-speech-to-text

Introducing Ursa from Speechmatics

We are excited to introduce Ursa, the world’s most accurate speech-to-text system. Ursa delivers unprecedented performance across a wide range of voices.

606 views17:34

Speech Technology

faster whisper much faster than whisper.cpp

https://github.com/ggerganov/whisper.cpp/discussions/589

Further improve the speed of Whisper.cpp (might need a dependency) · ggerganov/whisper.cpp · Discussion #589

I came across Faster Whisper which is 5x faster than whisper.cpp with comparable memory footprint. It's built in python and uses this C++ library (Ctranslate2) Just bringing this to your attent...

687 views17:32

Speech Technology

https://github.com/alibabasglab/mossformer

640 views22:14

Speech Technology

https://t.me/speechtech/1449

Repeated this test with new Speechmatics. Async WER improved to 6.88. Indeed new Ursa model improved significantly!

An interesting thing is that it is phoneme-based

Speech Technology

Recently I've came around one ASR SaaS service Soniox. Overall, pretty nice, fast and clean UI, good transcription accuracy and features. 5 hours a month for free for user.

I've read their whitepaper too with very cool results.

https://soniox.com/medi…

621 viewsedited 22:05

Speech Technology

Tried a popular https://github.com/Kyubyong/g2p. As usual, networks are very bad for unseen cases. Missing letters, extra letters, etc. Watch outputs carefully. Example:

bio-sand B AY1 OW0 S T AE2 N D

GitHub - Kyubyong/g2p: g2p: English Grapheme To Phoneme Conversion

g2p: English Grapheme To Phoneme Conversion. Contribute to Kyubyong/g2p development by creating an account on GitHub.

646 views22:41

Speech Technology

https://twitter.com/TanelAlumae/status/1635221485060227072

https://arxiv.org/abs/2302.14624

https://haldus.taltech.ee/sites/default/files/2023-03/LRE22__Vocapia_TalTech_System_Description.pdf

585 viewsedited 16:18

Speech Technology

BUT 3rd

System description

https://www.fit.vutbr.cz/research/groups/speech/publi/2022/NIST_LRE_2022_System_Description.pdf

596 views16:25

Speech Technology

Paraformer released models for other languages too:

We release several new UniASR model: Southern Fujian Dialect model, French model, German model, Vietnamese model, Persian model.

https://github.com/alibaba-damo-academy/FunASR

GitHub - modelscope/FunASR: A Fundamental End-to-End Speech Recognition Toolkit and Open Source SOTA Pretrained Models, Supporting…

A Fundamental End-to-End Speech Recognition Toolkit and Open Source SOTA Pretrained Models, Supporting Speech Recognition, Voice Activity Detection, Text Post-processing etc. - modelscope/FunASR

623 views20:00

Speech Technology

How can we make inference faster when using big #speech #selfsupervised models?

Check out @salah_zaiem 's paper that compares various approaches, revealing some pretty interesting insights.

https://arxiv.org/abs/2303.06740

These techniques will be soon available in #SpeechBrain

https://twitter.com/mirco_ravanelli/status/1635678132731518976

713 viewsedited 21:35

Speech Technology

New model from Assembly AI. Definitely improved from before, but not as great as Speechmatics.

On a toy test WER 10.89, previous assemblyAI (version 9) was at 11.04, version before 11.89. Speechmatics 6.88. Whisper large 8.94

https://twitter.com/AssemblyAI/status/1636050346240884744

Introducing Conformer-1: our latest state-of-the-art speech recognition model.

Built on top of the Conformer architecture and trained on 650K hours of audio data, it achieves near-human-level performance, making up to 43% fewer errors on noisy data than other ASR models.

We use a modified version of the conformer neural net published by Google Brain.

It's built on top of an Efficient Conformer (Orange Labs, 2021), that introduces the following technical modifications:

- Progressive Downsampling to reduce the length of the encoded sequence
- Grouped Attention: A modified version of the attention mechanism that makes it agnostic to sequence-length

These changes yield speedups of 29% at inference time and 36% at training time.

To further improve our model’s accuracy on noisy audio, we implemented a modified version of Sparse Attention, a pruning method for achieving sparsity of the model’s weights in order to achieve regularization.

We took inspiration from the data scaling laws described in DeepMind's Chinchilla paper and adapted them to the ASR domain.

Our team curated a dataset of 650K hours of English audio - making our model the largest-trained supervised model for English available today.

Based on our results, Conformer-1 is more robust on real-world data than popular commercial and open-source ASR models, making up to 43% fewer errors on average on noisy data:

The biggest improvement with this new release is in our robustness to a wide variety of data domains and noisy audio.

Introducing Conformer-1: our latest state-of-the-art speech recognition model.

Built on top of the Conformer architecture and trained on 650K hours of audio data, it achieves near-human-level performance, making up to 43% fewer errors on noisy data than…

754 views22:55

Speech Technology

Kincaid46 WER from Ursa announcement:

AssemblyAI: 8.6
Speechmatics: 7.88
Microsoft: 9.70
Whisper Large-v2: 8.7
Vosk 0.42 Gigaspeech 15.8
Google 12.52
Amazon 10.94

751 viewsedited 23:49

Speech Technology

Streaming punctuation model is interesting

https://github.com/alibaba-damo-academy/FunASR/releases/tag/v0.3.0

825 views08:23

Speech Technology

The amount of models this guy trained is quite outstanding

https://malaya-speech.readthedocs.io/en/latest/index.html

890 views21:23

Speech Technology

🚨 🔔: We've just released our GitHub repository for #ASR and #NLP tools for air traffic control communications, based on ATCO2 dataset
@Atco2P
!

We made public 5000+ hours of audio --> research on ASR for ATC.

GitHub: https://github.com/idiap/atco2-corpus

https://twitter.com/Pablogomez3/status/1640331512389279744

GitHub - idiap/atco2-corpus: A Corpus for Research on Robust Automatic Speech Recognition and Natural Language Understanding of…

A Corpus for Research on Robust Automatic Speech Recognition and Natural Language Understanding of Air Traffic Control Communications - idiap/atco2-corpus

784 views18:00

Speech Technology

12th ISCA Speech Synthesis Workshop (SSW) is now open for submissions!
Final submission deadline: May, 3 2023
Late breaking reports submission deadline : June, 28 2023

The Speech Synthesis Workshop will be held in Grenoble, France and is organized as a satellite event of the Interspeech conference in Dublin, Ireland
Come and join the SSW community and the people who creates machines that talk!

Visit the official site for more information
https://ssw2023.org/

661 views09:57

Speech Technology

Forwarded from Machinelearning

WavCaps: A ChatGPT-Assisted Weakly-Labelled Audio Captioning Dataset for Audio-Language Multimodal Research

Propose a three-stage processing pipeline for filtering noisy data and generating high-quality captions, where ChatGPT.

Конвейер обработки для фильтрации зашумленных данных и создания высококачественных титров.

🖥

Github: https://github.com/xinhaomei/wavcaps

⏩

Paper: https://arxiv.org/abs/2303.17395v1

💨

Dataset: https://paperswithcode.com/dataset/sounddescs

ai_machinelearning_big_data

Please open Telegram to view this post

VIEW IN TELEGRAM

294 views00:24

Speech Technology

This is interesting, all open source conformer implementations have bugs:

📢 We have just released open source a bug-free 🚫🪲implementation of the Conformer model.
📌Check it at: https://github.com/hlt-mt/FBK-fairseq/blob/master/fbk_works/BUGFREE_CONFORMER.md
Want to discover what "bug-free" means?
➡ Take a look at our paper: https://arxiv.org/pdf/2303.16166.pdf

#opensource #conformer #speech #bug #bugfree #NLProc

https://twitter.com/sarapapi/status/1641750885524029440

FBK-fairseq/fbk_works/BUGFREE_CONFORMER.md at master · hlt-mt/FBK-fairseq

Repository containing the open source code of works published at the FBK MT unit. - hlt-mt/FBK-fairseq

661 views00:37

Speech Technology

📢The largest 2,000 hours multi-layer annotated corpus QASR is available @ https://arabicspeech.org/qasr/ QASR is suitable for ASR, dialect ID, punctuation, speaker ID-linking, and potentially other NLP modules for spoken data.
#nlproc #speechproc #Arabic #AI
@QatarComputing

@qcrialt

https://twitter.com/ArabicSpeech/status/1641402805951815681

X (formerly Twitter)

Arabic Speech (@ArabicSpeech) on X

📢The largest 2,000 hours multi-layer annotated corpus QASR is available @ https://t.co/sPwMy4DSLj QASR is suitable for ASR, dialect ID, punctuation, speaker ID-linking, and potentially other NLP modules for spoken data.
#nlproc #speechproc #Arabic #AI @QatarComputing…

630 views13:36