Audio
Sounds pretty good and is very fast to synthesize. Finally not an LJSpeech voice.
Looking at Hugging Face models. The impression is that the intention is to bury good models under thousands of bad ones.
Looking more at Hugging Face models. Most of them are fine-tuned without SpecAugment...
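For reference, SpecAugment is just random frequency and time masking applied to the input spectrogram during training. A minimal sketch of what these fine-tunes skip (mask widths and counts here are illustrative defaults, not anyone's exact recipe):

import torch

def spec_augment(spec: torch.Tensor,
                 freq_mask: int = 27,    # max width of one frequency mask
                 time_mask: int = 100,   # max width of one time mask
                 n_freq_masks: int = 2,
                 n_time_masks: int = 2) -> torch.Tensor:
    """spec: (freq_bins, time_steps) log-mel spectrogram."""
    spec = spec.clone()
    n_freq, n_time = spec.shape
    for _ in range(n_freq_masks):
        f = torch.randint(0, freq_mask + 1, (1,)).item()
        f0 = torch.randint(0, max(1, n_freq - f), (1,)).item()
        spec[f0:f0 + f, :] = 0.0        # zero out a band of mel bins
    for _ in range(n_time_masks):
        t = torch.randint(0, time_mask + 1, (1,)).item()
        t0 = torch.randint(0, max(1, n_time - t), (1,)).item()
        spec[:, t0:t0 + t] = 0.0        # zero out a span of frames
    return spec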
Whisper is very popular these days, so here are some more observations on it. You see, one can tune a simple model better than a much more advanced Whisper model. Whisper is actually not great for non-English languages, and moreover it does not fine-tune very well.
https://alphacephei.com/nsh/2023/01/15/whisper-finetuning.html
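If you still want to try fine-tuning it yourself, here is a minimal sketch of a single training step with Hugging Face transformers (model size, learning rate, and the data feeding are placeholders for your own setup):

import torch
from transformers import WhisperProcessor, WhisperForConditionalGeneration

processor = WhisperProcessor.from_pretrained("openai/whisper-small")
model = WhisperForConditionalGeneration.from_pretrained("openai/whisper-small")
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-5)

def train_step(audio_array, sampling_rate, transcript):
    # Convert raw audio to log-mel input features.
    inputs = processor(audio_array, sampling_rate=sampling_rate,
                       return_tensors="pt")
    # Tokenized transcript serves as decoder labels.
    labels = processor.tokenizer(transcript, return_tensors="pt").input_ids
    loss = model(input_features=inputs.input_features, labels=labels).loss
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()
    return loss.item()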
Alibaba's FunASR recently had a very good Paraformer model release:
https://modelscope.cn/models/damo/speech_paraformer-large-vad-punc_asr_nat-zh-cn-16k-common-vocab8404-pytorch/summary
On semi-supervised models:
https://twitter.com/huckiyang/status/1615659564606656512
> My take is that joint supervised training and SSL loss is blooming now (e.g., JUST and Maestro) - it is essential to have a supervised encoder also.
https://arxiv.org/abs/2111.08137
https://arxiv.org/abs/2204.03409
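The idea, as I read it (this is a toy rendering, not the exact JUST/Maestro recipe): one shared encoder gets a supervised loss on labeled batches and an SSL loss on unlabeled ones in the same step.

import torch
import torch.nn.functional as F

def train_step(encoder, ctc_head, ssl_head, labeled, unlabeled, alpha=0.3):
    # Supervised branch: CTC on labeled audio (shapes assumed batch-first).
    feats, targets, feat_lens, target_lens = labeled
    hidden = encoder(feats)                           # (B, T, H)
    log_probs = ctc_head(hidden).log_softmax(-1)      # (B, T, vocab)
    sup_loss = F.ctc_loss(log_probs.transpose(0, 1),  # ctc_loss wants (T, B, V)
                          targets, feat_lens, target_lens)

    # SSL branch: reconstruct masked frames of unlabeled audio
    # with the very same encoder.
    u_feats, mask = unlabeled                         # mask: (B, T) bool
    masked = u_feats.masked_fill(mask.unsqueeze(-1), 0.0)
    pred = ssl_head(encoder(masked))                  # (B, T, feat_dim)
    ssl_loss = F.mse_loss(pred[mask], u_feats[mask])

    return sup_loss + alpha * ssl_loss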
https://twitter.com/lesterphv/status/1615990752403918850
> We are proud to announce the first Singing Voice Conversion Challenge (SVCC2023)! Building on the success of the previous VCC events, we plan to further push the limits of voice conversion by now focusing on singing voices, which is more difficult to model than speech.
http://www.vc-challenge.org/
ReazonSpeech
ReazonSpeech is a labeled Japanese speech corpus consisting of approximately 19,000 hours of broadcast speech. It was built to promote research on Japanese speech recognition technology.
In addition to the speech corpus, we have released a toolkit for building the corpus and a pre-trained model under a free license.
https://research.reazon.jp/projects/ReazonSpeech/index.html
Trained ESPnet model
Apache-2.0
https://huggingface.co/reazon-research/reazonspeech-espnet-v1
Corpus building toolkit
Apache-2.0
https://github.com/reazon-research/ReazonSpeech
Japanese speech corpus
CDLA-Sharing-1.0 (however, use is limited to the information analysis specified in Article 30-4 of the Japanese Copyright Act)
https://huggingface.co/datasets/reazon-research/reazonspeech
Research paper
https://research.reazon.jp/_static/reazonspeech_nlp2023.pdf
https://twitter.com/huckiyang/status/1616343651344343046
https://arxiv.org/abs/2301.07851
From English to More Languages: Parameter-Efficient Model Reprogramming for Cross-Lingual Speech Recognition
Chao-Han Huck Yang, Bo Li, Yu Zhang, Nanxin Chen, Rohit Prabhavalkar, Tara N. Sainath, Trevor Strohman
In this work, we propose a new parameter-efficient learning framework based on neural model reprogramming for cross-lingual speech recognition, which can re-purpose well-trained English automatic speech recognition (ASR) models to recognize the other languages. We design different auxiliary neural architectures focusing on learnable pre-trained feature enhancement that, for the first time, empowers model reprogramming on ASR. Specifically, we investigate how to select trainable components (i.e., encoder) of a conformer-based RNN-Transducer, as a frozen pre-trained backbone. Experiments on a seven-language multilingual LibriSpeech speech (MLS) task show that model reprogramming only requires 4.2% (11M out of 270M) to 6.8% (45M out of 660M) of its original trainable parameters from a full ASR model to perform competitive results in a range of 11.9% to 8.1% WER averaged across different languages. In addition, we discover different setups to make large-scale pre-trained ASR succeed in both monolingual and multilingual speech recognition. Our methods outperform existing ASR tuning architectures and their extension with self-supervised losses (e.g., w2v-bert) in terms of lower WER and better training efficiency.
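A rough sketch of the reprogramming idea as I read the abstract (the module names are hypothetical, not the paper's): freeze a pre-trained English encoder and train only a small input "enhancement" network for the new language.

import torch.nn as nn

class ReprogrammedASR(nn.Module):
    def __init__(self, pretrained_encoder: nn.Module, feat_dim: int = 80):
        super().__init__()
        # Small trainable front-end; the only part that gets gradients.
        self.enhance = nn.Sequential(
            nn.Conv1d(feat_dim, feat_dim, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.Conv1d(feat_dim, feat_dim, kernel_size=3, padding=1),
        )
        self.encoder = pretrained_encoder
        for p in self.encoder.parameters():   # frozen backbone
            p.requires_grad = False

    def forward(self, feats):                 # feats: (B, feat_dim, T)
        # Residual tweak of the input features, then the frozen encoder.
        return self.encoder(self.enhance(feats) + feats)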
https://twitter.com/alex_conneau/status/1614014965496811520
> Our FLEURS paper won the best paper award at SLT 2022! @ieee_slt
SLT: https://slt2022.org/best-papers.php
arXiv: https://arxiv.org/abs/2205.12446
Recently I came across an ASR SaaS service, Soniox. Overall pretty nice: fast and clean UI, good transcription accuracy and features. 5 hours a month free per user.
I also read their whitepaper, which has very cool results.
https://soniox.com/media/SonioxSpeechToTextBenchmarksNov2022.pdf
Well, judging from the whitepaper every service is more or less the same: some better, some worse. I quickly ran a test with an audio broadcast file. Here are the results (WER, %); a minimal WER sketch follows below.
AssemblyAI stream 14.79
AWS stream 17.20
Azure stream 11.47
Deepgram stream 18.23
Google stream 15.48
Rev stream 17.09
Speechmatics stream 9.75
Soniox stream 12.73
Assembly async 11.01
Rev async 15.25
Soniox async 11.81
Whisper Large-v2 async 8.94
Whisper Medium.en async 9.29
Nemo RNNT async 19.61
Whisper really shines for English. As for the others, they are all more or less the same. Whitepapers are not very meaningful.
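For anyone who wants to reproduce numbers like the ones above on their own transcripts, WER is just word-level edit distance over reference length; a minimal implementation:

def wer(reference: str, hypothesis: str) -> float:
    ref, hyp = reference.split(), hypothesis.split()
    # dp[i][j] = edit distance between ref[:i] and hyp[:j]
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i
    for j in range(len(hyp) + 1):
        dp[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            sub = dp[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1])
            dp[i][j] = min(sub, dp[i - 1][j] + 1, dp[i][j - 1] + 1)
    return 100.0 * dp[len(ref)][len(hyp)] / len(ref)

print(wer("the cat sat", "the cat sat down"))  # 33.3 (one insertion)

Real evaluations also normalize text (casing, punctuation, numbers) before scoring, which can move these numbers by several points.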
1st Dutch Speech Tech Day
Monday, 20 February 2023
Location: Netherlands Institute for Sound & Vision, Hilversum
https://sites.google.com/view/dutchspeechtechday/home
We made a big test of the available Russian models (in Russian):
https://alphacephei.com/nsh/2023/01/22/russian-models.html
In short: Nemo RNNT is good; Whisper is not very good for Russian, even adapted; Vosk is still not bad, and we are working to improve it.
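For completeness, this is the kind of minimal Vosk decoding loop used for such comparisons (the model directory is a placeholder for whichever Russian model you test; input is expected to be 16 kHz mono PCM WAV):

import wave, json
from vosk import Model, KaldiRecognizer

wf = wave.open("test.wav", "rb")
model = Model("vosk-model-ru-0.42")          # placeholder model directory
rec = KaldiRecognizer(model, wf.getframerate())

while True:
    data = wf.readframes(4000)               # feed audio in chunks
    if len(data) == 0:
        break
    rec.AcceptWaveform(data)

print(json.loads(rec.FinalResult())["text"])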
Feels like Zipformer and the other -formers.
Paper:
https://arxiv.org/abs/2210.00077
E-Branchformer: Branchformer with Enhanced merging for speech recognition
Kwangyoun Kim, Felix Wu, Yifan Peng, Jing Pan, Prashant Sridhar, Kyu J. Han, Shinji Watanabe
Conformer, combining convolution and self-attention sequentially to capture both local and global information, has shown remarkable performance and is currently regarded as the state-of-the-art for automatic speech recognition (ASR). Several other studies have explored integrating convolution and self-attention but they have not managed to match Conformer's performance. The recently introduced Branchformer achieves comparable performance to Conformer by using dedicated branches of convolution and self-attention and merging local and global context from each branch. In this paper, we propose E-Branchformer, which enhances Branchformer by applying an effective merging method and stacking additional point-wise modules. E-Branchformer sets new state-of-the-art word error rates (WERs) 1.81% and 3.65% on LibriSpeech test-clean and test-other sets without using any external training data.
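A very reduced sketch of the parallel-branch idea (not the paper's exact cgMLP and merge modules): a self-attention branch for global context and a depthwise-convolution branch for local context, merged by concatenation and a projection.

import torch
import torch.nn as nn

class BranchBlock(nn.Module):
    def __init__(self, d_model: int = 256, n_heads: int = 4):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.conv = nn.Sequential(
            nn.Conv1d(d_model, d_model, kernel_size=31, padding=15,
                      groups=d_model),        # depthwise conv, local context
            nn.GELU(),
        )
        self.merge = nn.Linear(2 * d_model, d_model)
        self.norm = nn.LayerNorm(d_model)

    def forward(self, x):                     # x: (B, T, d_model)
        g, _ = self.attn(x, x, x)             # global branch
        l = self.conv(x.transpose(1, 2)).transpose(1, 2)  # local branch
        return self.norm(x + self.merge(torch.cat([g, l], dim=-1)))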
That's something new:
https://arxiv.org/abs/2301.08730
Novel-View Acoustic Synthesis
Changan Chen, Alexander Richard, Roman Shapovalov, Vamsi Krishna Ithapu, Natalia Neverova, Kristen Grauman, Andrea Vedaldi
We introduce the novel-view acoustic synthesis (NVAS) task: given the sight and sound observed at a source viewpoint, can we synthesize the sound of that scene from an unseen target viewpoint? We propose a neural rendering approach: Visually-Guided Acoustic Synthesis (ViGAS) network that learns to synthesize the sound of an arbitrary point in space by analyzing the input audio-visual cues. To benchmark this task, we collect two first-of-their-kind large-scale multi-view audio-visual datasets, one synthetic and one real. We show that our model successfully reasons about the spatial cues and synthesizes faithful audio on both datasets. To our knowledge, this work represents the very first formulation, dataset, and approach to solve the novel-view acoustic synthesis task, which has exciting potential applications ranging from AR/VR to art and design. Unlocked by this work, we believe that the future of novel-view synthesis is in multi-modal learning from videos.
https://sites.google.com/view/merlion-ccs-challenge/
The inaugural MERLIon CCS Challenge focuses on developing robust language identification and language diarization systems that are reliable for non-standard, accented, spontaneous, code-switched, child-directed speech collected via Zoom.