Speech Technology
NaturalSpeech 2, a new powerful zero-shot TTS model in the NaturalSpeech series🔥
1. Latent diffusion model + continuous codec, avoiding the dilemma in language model + discrete codec;
2. Strong zero-shot speech synthesis with a 3s prompt, singing synthesis with only a speech prompt!

abs: https://arxiv.org/abs/2304.09116
project page: https://speechresearch.github.io/naturalspeech2/
Whisper can actually do speaker diarization with a prompt. The magic is in this note from the Whisper discussion:

or do a crude form of speaker turn tracking (e.g. " - Hey how are you doing? - I'm doing good. How are you?", note that the token for " -" is suppressed by default and will need to be enabled manually.)

https://github.com/openai/whisper/discussions/117#discussioncomment-3727051
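Once Whisper emits those dash-prefixed turns (after un-suppressing the " -" token, e.g. via the `suppress_tokens` decode option in openai-whisper), splitting the transcript back into per-speaker turns is simple post-processing. A minimal sketch; the regex and function name are illustrative, not from the discussion:

```python
import re

def split_speaker_turns(transcript: str) -> list[str]:
    """Split a Whisper transcript that marks speaker turns with " - "
    (the crude turn-tracking format quoted above) into per-turn strings."""
    # A dash followed by whitespace starts a new turn; drop empty pieces.
    # Hyphens inside words (e.g. "e-mail") are not followed by whitespace,
    # so they are left intact.
    return [t.strip() for t in re.split(r"\s*-\s+", transcript) if t.strip()]

turns = split_speaker_turns("- Hey how are you doing? - I'm doing good. How are you?")
# turns == ["Hey how are you doing?", "I'm doing good. How are you?"]
```

This only recovers turn boundaries, not speaker identities; mapping turns to speakers still needs a separate diarization step.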
http://www.asru2023.org/

Taipei, Taiwan

December 16-20, 2023

Regular & Challenge paper submission due: July 3, 2023
LODR decoding in K2

https://mp.weixin.qq.com/s/HJDaZ5BN1TzEa8oWQ9CBhw

Adding LODR to the rescoring process increases decoding time by only 20% compared to beam search, while reducing the word error rate by 13.8%: fast and accurate.
https://arxiv.org/abs/2203.16776

An Empirical Study of Language Model Integration for Transducer based Speech Recognition

Huahuan Zheng, Keyu An, Zhijian Ou, Chen Huang, Ke Ding, Guanglu Wan

Utilizing text-only data with an external language model (ELM) in end-to-end RNN-Transducer (RNN-T) for speech recognition is challenging. Recently, a class of methods such as density ratio (DR) and internal language model estimation (ILME) have been developed, outperforming the classic shallow fusion (SF) method. The basic idea behind these methods is that RNN-T posterior should first subtract the implicitly learned internal language model (ILM) prior, in order to integrate the ELM. While recent studies suggest that RNN-T only learns some low-order language model information, the DR method uses a well-trained neural language model with full context, which may be inappropriate for the estimation of ILM and deteriorate the integration performance. Based on the DR method, we propose a low-order density ratio method (LODR) by replacing the estimation with a low-order weak language model. Extensive empirical experiments are conducted on both in-domain and cross-domain scenarios on English LibriSpeech & Tedlium-2 and Chinese WenetSpeech & AISHELL-1 datasets. It is shown that LODR consistently outperforms SF in all tasks, while performing generally close to ILME and better than DR in most tests.
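The fusion idea in the abstract, adding the external LM score while subtracting a low-order LM as a proxy for the internal LM, amounts to a weighted per-token log-score combination. A minimal sketch; the weight values and names below are illustrative, not taken from the paper:

```python
def lodr_score(log_p_rnnt: float,
               log_p_elm: float,
               log_p_low: float,
               lam_elm: float = 0.4,
               lam_low: float = 0.2) -> float:
    """LODR-style score combination for one hypothesis token:
    boost with the external LM (ELM) and subtract a low-order
    (e.g. bigram) LM standing in for the internal LM prior."""
    return log_p_rnnt + lam_elm * log_p_elm - lam_low * log_p_low

# Shallow fusion (SF) is the special case with lam_low = 0.
sf_score   = lodr_score(-1.0, -2.0, -3.0, lam_elm=0.4, lam_low=0.0)
lodr_val   = lodr_score(-1.0, -2.0, -3.0, lam_elm=0.4, lam_low=0.2)
```

Compared with the original density ratio (DR) method, the only change is that the subtracted source-domain LM is a deliberately weak, low-order one, matching the observation that the RNN-T only learns low-order LM information.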
New Mandarin TTS dataset

https://www.openslr.org/138/

SHALCAS22A
Identifier: SLR138

Summary: A Chinese Mandarin corpus by Shanghai Acoustics Laboratory, CAS and Wuxi Sandu Intelligent Technology Co., Ltd.

Category: Speech

License: Attribution-NonCommercial-NoDerivatives 4.0 International (CC BY-NC-ND 4.0)

Downloads (use a mirror closer to you):
SHALCAS22A.tgz [3.9G] (Corpus) Mirrors: [US] [EU] [CN]


About this resource:

SHALCAS22A is a 1-channel Chinese Mandarin speech corpus by Shanghai Acoustics Laboratory, CAS and Wuxi Sandu Intelligent Technology Co., Ltd. It was collected over a Hi-Fi microphone in a quiet environment. The corpus contains 14,580 utterances from 60 speakers. Each speaker has 243 utterances.
The contents include number passwords, short Chinese words, and long Chinese sentences. The mapping between the content and utterance is given in content.txt.

This corpus can be used in text-dependent speaker verification on number passwords, text-independent speaker verification on short utterances, and other speech-related fields. Please cite the corpus as "SHALCAS22A, a free Chinese Mandarin corpus by Shanghai Acoustics Laboratory, CAS and Wuxi Sandu Intelligent Technology Co., Ltd., 2022".
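After extracting the archive, the stated corpus size (60 speakers x 243 utterances = 14,580) is easy to sanity-check. A hedged sketch: the one-directory-per-speaker layout is an assumption about the extracted archive, not documented above; verify against the actual structure and content.txt.

```python
from pathlib import Path
from collections import Counter

def count_utterances(corpus_root: str) -> Counter:
    """Count .wav files per speaker, assuming one subdirectory per
    speaker (layout is an assumption; check the extracted archive)."""
    counts = Counter()
    for wav in Path(corpus_root).rglob("*.wav"):
        counts[wav.parent.name] += 1
    return counts

# Stated totals: 60 speakers, 243 utterances each.
assert 60 * 243 == 14_580
```

With a correct layout assumption, `sum(count_utterances(root).values())` should equal 14,580 and every speaker's count should be 243.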

Contact: Feng Hong, hongfeng@mail.ioa.ac.cn
Open Preview for #ICASSP2023 is now available on @IEEEXplore! Available through June 10, you can now browse all the papers that were accepted to ICASSP 2023, free of charge. Browse research here: https://hubs.la/Q01N_PdX0