http://www.asru2023.org/
Taipei, Taiwan
December 16-20, 2023
Regular & Challenge paper submission due: July 3, 2023
LODR decoding in k2
https://mp.weixin.qq.com/s/HJDaZ5BN1TzEa8oWQ9CBhw
Adding LODR to the rescoring process increases decoding time by only 20% compared to plain beam search, while reducing the word error rate by 13.8%, so it is both fast and accurate.
https://arxiv.org/abs/2203.16776
An Empirical Study of Language Model Integration for Transducer based Speech Recognition
Huahuan Zheng, Keyu An, Zhijian Ou, Chen Huang, Ke Ding, Guanglu Wan
Utilizing text-only data with an external language model (ELM) in end-to-end RNN-Transducer (RNN-T) for speech recognition is challenging. Recently, a class of methods such as density ratio (DR) and internal language model estimation (ILME) have been developed, outperforming the classic shallow fusion (SF) method. The basic idea behind these methods is that RNN-T posterior should first subtract the implicitly learned internal language model (ILM) prior, in order to integrate the ELM. While recent studies suggest that RNN-T only learns some low-order language model information, the DR method uses a well-trained neural language model with full context, which may be inappropriate for the estimation of ILM and deteriorate the integration performance. Based on the DR method, we propose a low-order density ratio method (LODR) by replacing the estimation with a low-order weak language model. Extensive empirical experiments are conducted on both in-domain and cross-domain scenarios on English LibriSpeech & Tedlium-2 and Chinese WenetSpeech & AISHELL-1 datasets. It is shown that LODR consistently outperforms SF in all tasks, while performing generally close to ILME and better than DR in most tests.
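As a rough sketch of the idea: like density ratio, LODR rescores each hypothesis by adding the external LM score and subtracting an internal-LM estimate, except that the subtracted LM is a deliberately weak low-order one. A minimal illustration in Python (the lambda weights are placeholders, not values from the paper):

```python
def lodr_score(logp_rnnt: float, logp_elm: float, logp_lodr: float,
               lam_elm: float = 0.4, lam_lodr: float = 0.2) -> float:
    """Score one hypothesis during beam-search rescoring.

    logp_rnnt: transducer log-posterior of the hypothesis
    logp_elm:  external LM log-probability (the shallow-fusion term)
    logp_lodr: log-probability under a low-order (e.g. bigram) LM that
               stands in for the transducer's internal LM
    The lambda weights are illustrative placeholders, tuned per task.
    """
    return logp_rnnt + lam_elm * logp_elm - lam_lodr * logp_lodr
```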
NVIDIA published a set of new FastConformer models
https://github.com/NVIDIA/NeMo/commit/091ce965da99f1ca63f64417b0ea612d744c7c81
For example, the English one:
https://catalog.ngc.nvidia.com/orgs/nvidia/teams/nemo/models/stt_en_fastconformer_hybrid_large_pc
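A minimal sketch of trying one of these checkpoints with NeMo; the model name comes from the NGC card above, while the rest is standard NeMo usage and may differ slightly between NeMo versions:

```python
# Load the English hybrid FastConformer from NGC and transcribe a file.
# "audio.wav" is a placeholder path to a 16 kHz mono recording.
import nemo.collections.asr as nemo_asr

model = nemo_asr.models.ASRModel.from_pretrained(
    model_name="stt_en_fastconformer_hybrid_large_pc"
)
print(model.transcribe(["audio.wav"]))
```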
Some Indonesian speech data was released recently
https://indonlp.github.io/nusa-catalogue/
For example:
https://github.com/s-sakti/data_indsp_teldialog_svcsr
New Mandarin TTS dataset
https://www.openslr.org/138/
SHALCAS22A
Identifier: SLR138
Summary: A Chinese Mandarin corpus by Shanghai Acoustics Laboratory, CAS and Wuxi Sandu Intelligent Technology Co., Ltd.
Category: Speech
License: Attribution-NonCommercial-NoDerivatives 4.0 International (CC BY-NC-ND 4.0)
Downloads (use a mirror closer to you):
SHALCAS22A.tgz [3.9G] (corpus; US, EU, and CN mirrors available)
About this resource:
SHALCAS22A is a single-channel Chinese Mandarin speech corpus by Shanghai Acoustics Laboratory, CAS and Wuxi Sandu Intelligent Technology Co., Ltd. It was recorded with a Hi-Fi microphone in a quiet environment. The corpus contains 14,580 utterances from 60 speakers; each speaker has 243 utterances.
The contents include number passwords, short Chinese words, and long Chinese sentences. The mapping between the content and utterance is given in content.txt.
This corpus can be used in text-dependent speaker verification on number passwords, text-independent speaker verification on short utterances, and other speech-related fields. Please cite the corpus as "SHALCAS22A, a free Chinese Mandarin corpus by Shanghai Acoustics Laboratory, CAS and Wuxi Sandu Intelligent Technology Co., Ltd., 2022".
Contact: Feng Hong, hongfeng@mail.ioa.ac.cn
Encodec has just changed to an MIT license. Great news for anyone working on LM approaches to audio or just looking for a high-quality audio codec.
There is no training code, but it is still a really significant change.
https://github.com/facebookresearch/encodec/commit/349b72939f57cb3bc7b60906c0ee8228c849485d
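For anyone who hasn't used it, a minimal encoding sketch following the repo's README ("audio.wav" and the bandwidth are placeholders):

```python
# Encode audio to discrete tokens with the 24 kHz Encodec model,
# following the repo's README; "audio.wav" is a placeholder path.
import torch
import torchaudio
from encodec import EncodecModel
from encodec.utils import convert_audio

model = EncodecModel.encodec_model_24khz()
model.set_target_bandwidth(6.0)  # kbps target; sets how many codebooks are used

wav, sr = torchaudio.load("audio.wav")
wav = convert_audio(wav, sr, model.sample_rate, model.channels)

with torch.no_grad():
    frames = model.encode(wav.unsqueeze(0))
# Discrete codes, shape (batch, n_codebooks, time) -- handy as LM tokens.
codes = torch.cat([c for c, _ in frames], dim=-1)
```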
Open Preview for #ICASSP2023 is now available on @IEEEXplore! Through June 10, you can browse all the papers accepted to ICASSP 2023 free of charge: https://hubs.la/Q01N_PdX0
Good VC (voice conversion) quality
https://quickvc.github.io/quickvc-demo/
https://github.com/quickvc/QuickVC-VoiceConversion
People report that fine-tuning Whisper with PEFT + LoRA gives quite good results:
https://github.com/huggingface/peft/blob/main/examples/int8_training/peft_bnb_whisper_large_v2_training.ipynb
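The core of that notebook, condensed (the hyperparameters mirror the notebook; note that newer PEFT releases rename prepare_model_for_int8_training to prepare_model_for_kbit_training):

```python
# Condensed setup from the notebook: 8-bit Whisper wrapped with LoRA.
from transformers import WhisperForConditionalGeneration
from peft import LoraConfig, get_peft_model, prepare_model_for_int8_training

model = WhisperForConditionalGeneration.from_pretrained(
    "openai/whisper-large-v2", load_in_8bit=True, device_map="auto"
)
model = prepare_model_for_int8_training(model)

config = LoraConfig(
    r=32, lora_alpha=64,
    target_modules=["q_proj", "v_proj"],  # attention projections only
    lora_dropout=0.05, bias="none",
)
model = get_peft_model(model, config)
model.print_trainable_parameters()  # only ~1% of weights are trainable
```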
EfficientSpeech, or ES for short, is an efficient neural text-to-speech (TTS) model. It generates mel spectrograms at 104 mRTF, i.e. 104 seconds of speech per second of compute, on a Raspberry Pi 4. Its tiny version has a footprint of just 266k parameters, and generating 6 seconds of speech consumes only 90 MFLOPs.
https://github.com/roatienza/efficientspeech
https://roatienza.github.io/efficientspeech-demo/
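To put those figures in context, a quick back-of-the-envelope check (the derived numbers are my arithmetic, not from the repo):

```python
# Back-of-the-envelope reading of the EfficientSpeech claims above.
mrtf = 104                   # seconds of speech per second of compute
rtf = 1 / mrtf               # real-time factor: ~0.0096 (lower = faster)
mflops_total = 90            # MFLOPs to generate 6 s of speech
mflops_per_second = mflops_total / 6  # ~15 MFLOPs per second of speech
print(f"RTF ~= {rtf:.4f}; ~{mflops_per_second:.0f} MFLOPs per second of speech")
```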
🌍🗣️SUPERB benchmark is back with ML-SUPERB, its multilingual version! The challenge, as one of the #ASRU2023 challenges, includes 3 tracks:
1️⃣ML-SUPERB: for multilingual SSL
2️⃣New language: for extending to new languages
3️⃣Research: for research papers
More to see 👉 https://multilingual.superbbenchmark.org
Key dates:
May 12, 2023: Challenge announcement
May 19, 2023: Leaderboard is online and accepting submissions
June 26, 2023: New Language Track Submission Deadline
July 07, 2023: Paper / Model Submission Deadline
July 10, 2023: Paper Revision Deadline
Universal Source Separation with Weakly Labelled Data
abs: https://arxiv.org/abs/2305.07447
paper page: https://huggingface.co/papers/2305.07447
github: https://github.com/bytedance/uss
Some people implement streaming speaker diarization manually
https://github.com/pyannote/pyannote-audio/commit/4a6ea9c825b9447a7d03cb9bd94f5f81d661ca16
others just ask ChatGPT to write it
https://github.com/huseinzol05/malaya-speech/commit/564f50c0d91528126fe3b410f387d1b4ff33d364
The ChatGPT version is not that bad.
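For reference, the simplest manual approach is a growing buffer over the offline pyannote pipeline, re-run on each new chunk. A naive sketch, not the API from either commit above ("meeting.wav" is a placeholder, and the pretrained pipeline may require a Hugging Face auth token):

```python
# Naive "streaming" diarization: re-run the offline pyannote pipeline
# on all audio seen so far. O(n^2) compute overall, so only a sketch.
import torch
import torchaudio
from pyannote.audio import Pipeline

pipeline = Pipeline.from_pretrained("pyannote/speaker-diarization")

# Simulate a live stream by feeding a file in 5-second chunks.
wav, sr = torchaudio.load("meeting.wav")  # placeholder path, mono 16 kHz
chunk = 5 * sr
buffer = wav[:, :0]
for start in range(0, wav.shape[1], chunk):
    buffer = torch.cat([buffer, wav[:, start:start + chunk]], dim=1)
    diarization = pipeline({"waveform": buffer, "sample_rate": sr})
    for turn, _, speaker in diarization.itertracks(yield_label=True):
        print(f"[t={start / sr:.0f}s] {turn.start:.1f}-{turn.end:.1f}: {speaker}")
```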