Speech Technology
NaturalSpeech 2, a new powerful zero-shot TTS model in the NaturalSpeech series🔥
1. Latent diffusion model + continuous codec, avoiding the dilemma in language model + discrete codec;
2. Strong zero-shot speech synthesis with a 3s prompt, singing synthesis with only a speech prompt!

abs: https://arxiv.org/abs/2304.09116
project page: https://speechresearch.github.io/naturalspeech2/
Whisper can actually do speaker diarization with a prompt. The magic is in this note from the Whisper discussion:

or do a crude form of speaker turn tracking (e.g. " - Hey how are you doing? - I'm doing good. How are you?", note that the token for " -" is suppressed by default and will need to be enabled manually.)

https://github.com/openai/whisper/discussions/117#discussioncomment-3727051
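Once Whisper emits those dash-prefixed turns (after un-suppressing the " -" token, e.g. via the `suppress_tokens` decode option in openai-whisper), splitting the transcript back into per-speaker turns is simple post-processing. A minimal sketch; the regex and function name are illustrative, not from the discussion:

```python
import re

def split_speaker_turns(transcript: str) -> list[str]:
    """Split a Whisper transcript that marks speaker turns with " - "
    (the crude turn-tracking format quoted above) into per-turn strings."""
    # A dash followed by whitespace starts a new turn; drop empty pieces.
    # Hyphens inside words (e.g. "e-mail") are not followed by whitespace,
    # so they are left intact.
    return [t.strip() for t in re.split(r"\s*-\s+", transcript) if t.strip()]

turns = split_speaker_turns("- Hey how are you doing? - I'm doing good. How are you?")
# turns == ["Hey how are you doing?", "I'm doing good. How are you?"]
```

This only recovers turn boundaries, not speaker identities; mapping turns to speakers still needs a separate diarization step.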
http://www.asru2023.org/

Taipei, Taiwan

December 16-20, 2023

Regular & Challenge paper submission due: July 3, 2023
LODR decoding in K2

https://mp.weixin.qq.com/s/HJDaZ5BN1TzEa8oWQ9CBhw

Adding LODR to the rescoring process increases decoding time by only 20% compared to beam search, while reducing the word error rate by 13.8%: fast and accurate.
https://arxiv.org/abs/2203.16776

An Empirical Study of Language Model Integration for Transducer based Speech Recognition

Huahuan Zheng, Keyu An, Zhijian Ou, Chen Huang, Ke Ding, Guanglu Wan

Utilizing text-only data with an external language model (ELM) in end-to-end RNN-Transducer (RNN-T) for speech recognition is challenging. Recently, a class of methods such as density ratio (DR) and internal language model estimation (ILME) have been developed, outperforming the classic shallow fusion (SF) method. The basic idea behind these methods is that RNN-T posterior should first subtract the implicitly learned internal language model (ILM) prior, in order to integrate the ELM. While recent studies suggest that RNN-T only learns some low-order language model information, the DR method uses a well-trained neural language model with full context, which may be inappropriate for the estimation of ILM and deteriorate the integration performance. Based on the DR method, we propose a low-order density ratio method (LODR) by replacing the estimation with a low-order weak language model. Extensive empirical experiments are conducted on both in-domain and cross-domain scenarios on English LibriSpeech & Tedlium-2 and Chinese WenetSpeech & AISHELL-1 datasets. It is shown that LODR consistently outperforms SF in all tasks, while performing generally close to ILME and better than DR in most tests.
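The fusion idea in the abstract, adding the external LM score while subtracting a low-order LM as a proxy for the internal LM, amounts to a weighted per-token log-score combination. A minimal sketch; the weight values and names below are illustrative, not taken from the paper:

```python
def lodr_score(log_p_rnnt: float,
               log_p_elm: float,
               log_p_low: float,
               lam_elm: float = 0.4,
               lam_low: float = 0.2) -> float:
    """LODR-style score combination for one hypothesis token:
    boost with the external LM (ELM) and subtract a low-order
    (e.g. bigram) LM standing in for the internal LM prior."""
    return log_p_rnnt + lam_elm * log_p_elm - lam_low * log_p_low

# Shallow fusion (SF) is the special case with lam_low = 0.
sf_score   = lodr_score(-1.0, -2.0, -3.0, lam_elm=0.4, lam_low=0.0)
lodr_val   = lodr_score(-1.0, -2.0, -3.0, lam_elm=0.4, lam_low=0.2)
```

Compared with the original density ratio (DR) method, the only change is that the subtracted source-domain LM is a deliberately weak, low-order one, matching the observation that the RNN-T only learns low-order LM information.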
New Mandarin TTS dataset

https://www.openslr.org/138/

SHALCAS22A
Identifier: SLR138

Summary: A Chinese Mandarin corpus by Shanghai Acoustics Laboratory, CAS and Wuxi Sandu Intelligent Technology Co., Ltd.

Category: Speech

License: Attribution-NonCommercial-NoDerivatives 4.0 International (CC BY-NC-ND 4.0)

Downloads (use a mirror closer to you):
SHALCAS22A.tgz [3.9G] (Corpus) Mirrors: [US] [EU] [CN]


About this resource:

SHALCAS22A is a 1-channel Chinese Mandarin speech corpus by Shanghai Acoustics Laboratory, CAS and Wuxi Sandu Intelligent Technology Co., Ltd. It was collected over a Hi-Fi microphone in a quiet environment. The corpus contains 14,580 utterances from 60 speakers. Each speaker has 243 utterances.
The contents include number passwords, short Chinese words, and long Chinese sentences. The mapping between the content and utterance is given in content.txt.

This corpus can be used in text-dependent speaker verification on number passwords, text-independent speaker verification on short utterances, and other speech-related fields. Please cite the corpus as "SHALCAS22A, a free Chinese Mandarin corpus by Shanghai Acoustics Laboratory, CAS and Wuxi Sandu Intelligent Technology Co., Ltd., 2022".
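After extracting the archive, the stated corpus size (60 speakers x 243 utterances = 14,580) is easy to sanity-check. A hedged sketch: the one-directory-per-speaker layout is an assumption about the extracted archive, not documented above; verify against the actual structure and content.txt.

```python
from pathlib import Path
from collections import Counter

def count_utterances(corpus_root: str) -> Counter:
    """Count .wav files per speaker, assuming one subdirectory per
    speaker (layout is an assumption; check the extracted archive)."""
    counts = Counter()
    for wav in Path(corpus_root).rglob("*.wav"):
        counts[wav.parent.name] += 1
    return counts

# Stated totals: 60 speakers, 243 utterances each.
assert 60 * 243 == 14_580
```

With a correct layout assumption, `sum(count_utterances(root).values())` should equal 14,580 and every speaker's count should be 243.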

Contact: Feng Hong, hongfeng@mail.ioa.ac.cn
Open Preview for #ICASSP2023 is now available on @IEEEXplore! Available through June 10, you can now browse all the papers that were accepted to ICASSP 2023, free of charge. Browse research here: https://hubs.la/Q01N_PdX0