Attention ASR developers and researchers! 🚀 Great news: with the latest update of 🤗 PEFT, you can fine-tune your Whisper-large model faster than ever before! The update lets you fit 5x larger batches in under 10 GB of GPU VRAM, thanks to LoRA and Tim Dettmers's bitsandbytes, packaged nicely in 🤗 PEFT. And the best part? You get comparable WER, just faster! ⚡️
But that's not all: you no longer have to trade training speed against WER. In our experiments on Marathi, the WER was comparable to full fine-tuning of Whisper-large: 13.64 WER without PEFT (full training run) vs. 14.01 WER with PEFT (trained on a Google Colab). With 🤗 PEFT, you can now train a Whisper-large-v2 model in less than 8 GB of GPU VRAM! 📉
Without 🤗 PEFT, you could hit OOM on a Colab T4, but not anymore! You also save on storage and can easily port the tiny checkpoints: ~63 MB vs. a 6.7 GB fully fine-tuned model. 🐜
And that's not all! For low-latency inference, you can convert the PEFT model to ONNX and run it with ONNX Runtime (ORT) via 🤗 Optimum.
Start experimenting today and fine-tune your Whisper using PEFT+INT8 in Colab on a language of your choice! Join our Discord community to get involved in the conversation and discuss your results and questions. 🔬
Check out the Colab notebook examples and start your ASR development journey with 🤗 PEFT today!
https://github.com/huggingface/peft
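The ~63 MB checkpoint figure is easy to sanity-check with back-of-the-envelope arithmetic. The sketch below assumes a typical PEFT LoRA setup for Whisper-large (rank r = 32, adapters on the q_proj and v_proj attention projections); the exact rank and target modules are assumptions, not confirmed from the post.

```python
# Rough LoRA checkpoint-size estimate for Whisper-large-v2.
# Assumed (hypothetical) config: rank r = 32, LoRA on q_proj and v_proj only.
d_model = 1280   # Whisper-large hidden size
layers = 32      # encoder layers == decoder layers for Whisper-large
r = 32           # LoRA rank (assumed)

# Each adapted d x d projection gains two low-rank factors:
# A (r x d_model) and B (d_model x r).
params_per_proj = 2 * d_model * r

# Encoder: self-attention only -> q_proj + v_proj = 2 projections per layer.
# Decoder: self-attention + cross-attention -> 4 projections per layer.
n_proj = layers * 2 + layers * 4

lora_params = n_proj * params_per_proj
ckpt_mb = lora_params * 4 / 1e6  # fp32 weights, decimal megabytes

print(f"{lora_params / 1e6:.1f}M LoRA params -> ~{ckpt_mb:.0f} MB checkpoint")
```

Under these assumptions the adapter weighs in at roughly 15.7M parameters, i.e. about 63 MB in fp32, which lines up with the post's number.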
From Google
https://arxiv.org/abs/2302.11186
UML: A Universal Monolingual Output Layer for Multilingual ASR
Chao Zhang, Bo Li, Tara N. Sainath, Trevor Strohman, Shuo-yiin Chang
Word-piece models (WPMs) are commonly used subword units in state-of-the-art end-to-end automatic speech recognition (ASR) systems. For multilingual ASR, due to the differences in written scripts across languages, multilingual WPMs bring the challenges of having overly large output layers and scaling to more languages. In this work, we propose a universal monolingual output layer (UML) to address such problems. Instead of one output node for only one WPM, UML re-associates each output node with multiple WPMs, one for each language, and results in a smaller monolingual output layer shared across languages. Consequently, the UML enables to switch in the interpretation of each output node depending on the language of the input speech. Experimental results on an 11-language voice search task demonstrated the feasibility of using UML for high-quality and high-efficiency multilingual streaming ASR.
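The core UML idea, one shared output node re-interpreted per language, can be illustrated with a toy lookup (this is only a conceptual sketch with made-up tokens, not the paper's implementation):

```python
# Toy illustration of the UML idea: a single small output layer whose node
# indices are re-interpreted through a per-language word-piece table,
# instead of one giant multilingual softmax. Tokens below are made up.
per_lang_vocab = {
    "en": ["the", "cat", "s"],
    "hi": ["नम", "स्ते", "जी"],
}

def decode(node_ids, lang):
    """Map shared output-node indices to language-specific word-pieces."""
    table = per_lang_vocab[lang]
    return [table[i] for i in node_ids]

# The same node sequence [0, 1] means different tokens per language:
print(decode([0, 1], "en"))  # ['the', 'cat']
print(decode([0, 1], "hi"))
```

The output layer stays as small as a monolingual one while the interpretation switches with the input language.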
https://arxiv.org/abs/2302.10248
VoxSRC 2022: The Fourth VoxCeleb Speaker Recognition Challenge
Jaesung Huh, Andrew Brown, Jee-weon Jung, Joon Son Chung, Arsha Nagrani, Daniel Garcia-Romero, Andrew Zisserman
This paper summarises the findings from the VoxCeleb Speaker Recognition Challenge 2022 (VoxSRC-22), which was held in conjunction with INTERSPEECH 2022. The goal of this challenge was to evaluate how well state-of-the-art speaker recognition systems can diarise and recognise speakers from speech obtained "in the wild". The challenge consisted of: (i) the provision of publicly available speaker recognition and diarisation data from YouTube videos together with ground truth annotation and standardised evaluation software; and (ii) a public challenge and hybrid workshop held at INTERSPEECH 2022. We describe the four tracks of our challenge along with the baselines, methods, and results. We conclude with a discussion on the new domain-transfer focus of VoxSRC-22, and on the progression of the challenge from the previous three editions.
BigVGAN is accepted at ICLR 2023.
Listen to audio samples:
https://bigvgan-demo.github.io
A universal audio synthesis model, trained on speech only, works for out-of-distribution scenarios, e.g., unseen singing voices and music audio!
Code and models are released!
https://github.com/NVIDIA/BigVGAN
https://twitter.com/_weiping/status/1628210425480515584
From Phil Woodland
https://arxiv.org/abs/2302.08579
Adaptable End-to-End ASR Models using Replaceable Internal LMs and Residual Softmax
Keqi Deng, Philip C. Woodland
End-to-end (E2E) automatic speech recognition (ASR) implicitly learns the token sequence distribution of paired audio-transcript training data. However, it still suffers from domain shifts from training to testing, and domain adaptation is still challenging. To alleviate this problem, this paper designs a replaceable internal language model (RILM) method, which makes it feasible to directly replace the internal language model (LM) of E2E ASR models with a target-domain LM in the decoding stage when a domain shift is encountered. Furthermore, this paper proposes a residual softmax (R-softmax) that is designed for CTC-based E2E ASR models to adapt to the target domain without re-training during inference. For E2E ASR models trained on the LibriSpeech corpus, experiments showed that the proposed methods gave a 2.6% absolute WER reduction on the Switchboard data and a 1.0% WER reduction on the AESRC2020 corpus while maintaining intra-domain ASR results.
How much smaller can you make your LM with overtraining?
This figure from Chinchilla gives you a clue on what to expect. Say you have C = 6e20.
If N = 350M, it performs on par with the compute-optimal loss L_opt at C = 1e20 (N_opt = 900M).
=> 6x the training FLOPs for ~2.5x fewer inference FLOPs
https://twitter.com/arankomatsuzaki/status/1630257908238696449
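The arithmetic checks out under the usual approximations, training compute C ≈ 6ND and inference cost ≈ 2N FLOPs per token (the token count below is implied by those approximations, not stated in the thread):

```python
# Check the thread's numbers with the standard approximations:
# training compute C ~= 6 * N * D, inference ~= 2 * N FLOPs per token.
C_overtrained = 6e20  # compute spent on the small (overtrained) model
C_optimal     = 1e20  # compute-optimal run it matches in loss
N_small = 350e6       # overtrained model size
N_opt   = 900e6       # compute-optimal model size at C = 1e20

train_ratio = C_overtrained / C_optimal     # extra training compute
infer_ratio = (2 * N_opt) / (2 * N_small)   # inference FLOPs saved per token

tokens_seen = C_overtrained / (6 * N_small)  # implied training tokens
print(f"{train_ratio:.0f}x training FLOPs for {infer_ratio:.2f}x "
      f"cheaper inference ({tokens_seen:.2e} tokens)")
```

900M / 350M ≈ 2.57, which the thread rounds to 2.5x.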
https://arxiv.org/abs/2302.12369
Factual Consistency Oriented Speech Recognition
Naoyuki Kanda, Takuya Yoshioka, Yang Liu
This paper presents a novel optimization framework for automatic speech recognition (ASR) with the aim of reducing hallucinations produced by an ASR model. The proposed framework optimizes the ASR model to maximize an expected factual consistency score between ASR hypotheses and ground-truth transcriptions, where the factual consistency score is computed by a separately trained estimator. Experimental results using the AMI meeting corpus and the VoxPopuli corpus show that the ASR model trained with the proposed framework generates ASR hypotheses that have significantly higher consistency scores with ground-truth transcriptions while maintaining the word error rates close to those of cross entropy-trained ASR models. Furthermore, it is shown that training the ASR models with the proposed framework improves the speech summarization quality as measured by the factual consistency of meeting conversation summaries generated by a large language model.
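The training signal, the expected consistency score of an n-best list under the model's hypothesis posterior, can be sketched as follows (a minimum-risk-style toy with made-up scores; the actual consistency estimator in the paper is a separately trained model):

```python
import math

# Hedged sketch of the objective: expected consistency score of an n-best
# list under the model's hypothesis posterior. Scores and log-probs are toy.
def expected_score(model_logprobs, consistency_scores):
    # softmax over hypothesis log-probabilities -> posterior weights
    m = max(model_logprobs)
    w = [math.exp(lp - m) for lp in model_logprobs]
    z = sum(w)
    return sum((wi / z) * s for wi, s in zip(w, consistency_scores))

# Toy 3-best list: consistency scores in [0, 1] from the (separate) estimator.
scores = [0.9, 0.4, 0.2]
before = expected_score([-1.0, -1.1, -1.2], scores)
after  = expected_score([-0.5, -1.5, -2.5], scores)  # mass shifted to hyp 1
print(before, after)
```

The objective rises as the model moves probability mass toward the high-consistency hypothesis, which is exactly the gradient signal such training exploits.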
https://twitter.com/Maureendss/status/1630209732223852544
📢 Exciting news! We just released ProsAudit, a prosodic benchmark for SSL models of speech 🥳
💬 It is now part of the Zero Resource Speech Challenge (track 4). The paper also includes results on a human comparison. 👨💻🤖
📰Check out the preprint: https://arxiv.org/pdf/2302.12057.pdf
Nice Chinese chip with an analog NPU (very power-efficient) and a RISC core
https://en.witmem.com/wtm2101.html
Witmem's WTM2101 computing-in-memory chip targets low-power AIoT applications: it can run large-scale deep-learning workloads at microwatt-to-milliwatt power, making it especially suitable for intelligent voice and health services in wearable devices.
12M hours of speech data
https://arxiv.org/abs/2303.01037
Google USM: Scaling Automatic Speech Recognition Beyond 100 Languages
Yu Zhang, Wei Han, James Qin, Yongqiang Wang, Ankur Bapna, Zhehuai Chen, Nanxin Chen, Bo Li, Vera Axelrod, Gary Wang, Zhong Meng, Ke Hu, Andrew Rosenberg, Rohit Prabhavalkar, Daniel S. Park, Parisa Haghani, Jason Riesa, Ginger Perng, Hagen Soltau, Trevor Strohman, Bhuvana Ramabhadran, Tara Sainath, Pedro Moreno, Chung-Cheng Chiu, Johan Schalkwyk, Françoise Beaufays, Yonghui Wu
We introduce the Universal Speech Model (USM), a single large model that performs automatic speech recognition (ASR) across 100+ languages. This is achieved by pre-training the encoder of the model on a large unlabeled multilingual dataset of 12 million (M) hours spanning over 300 languages, and fine-tuning on a smaller labeled dataset. We use multilingual pre-training with random-projection quantization and speech-text modality matching to achieve state-of-the-art performance on downstream multilingual ASR and speech-to-text translation tasks. We also demonstrate that despite using a labeled training set 1/7-th the size of that used for the Whisper model, our model exhibits comparable or better performance on both in-domain and out-of-domain speech recognition tasks across many languages.
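The random-projection quantization mentioned in the abstract can be sketched in a few lines (a simplified BEST-RQ-style reading, not USM's actual code: a frozen random projection plus a frozen random codebook produce discrete pre-training targets):

```python
import numpy as np

# Minimal sketch of random-projection quantization for pre-training targets:
# project each speech frame with a FROZEN random matrix, then take the index
# of the nearest entry in a FROZEN random codebook as the discrete label.
rng = np.random.default_rng(0)
dim_in, dim_proj, codebook_size = 80, 16, 512

proj = rng.normal(size=(dim_in, dim_proj))            # frozen, never trained
codebook = rng.normal(size=(codebook_size, dim_proj))
codebook /= np.linalg.norm(codebook, axis=1, keepdims=True)

def quantize(frames):
    """frames: (T, dim_in) float array -> (T,) integer target labels."""
    z = frames @ proj
    z /= np.linalg.norm(z, axis=1, keepdims=True)     # unit-normalize frames
    # nearest codebook entry in L2 (equivalent to max cosine on unit vectors)
    return np.argmin(np.linalg.norm(z[:, None] - codebook[None], axis=-1),
                     axis=1)

targets = quantize(rng.normal(size=(100, dim_in)))
print(targets.shape, targets.min(), targets.max())
```

Because neither the projection nor the codebook is trained, the targets are cheap to compute and stable throughout pre-training.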
https://t.me/speechtech/1449
Repeated this test with the new Speechmatics. Async WER improved to 6.88. Indeed, the new Ursa model improved significantly! An interesting thing is that it is phoneme-based.
Recently I came across an ASR SaaS service, Soniox. Overall pretty nice: fast and clean UI, good transcription accuracy and features, and 5 hours a month free per user.
I've read their whitepaper too, with very cool results.
https://soniox.com/medi…
Tried the popular https://github.com/Kyubyong/g2p. As usual, neural networks are very bad on unseen cases: missing letters, extra letters, etc. Watch the outputs carefully. Example:
bio-sand B AY1 OW0 S T AE2 N D
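One way to catch such errors is to cross-check hyphenated compounds against a lexicon and flag disagreements; the sketch below uses a hypothetical two-entry CMUdict-style mini-lexicon (the entries and the helper are illustrative, not part of the g2p package):

```python
# Hedged sketch: for hyphenated compounds, compose a pronunciation from a
# lexicon and flag the neural g2p output when it disagrees.
# The mini-lexicon below is hypothetical (CMUdict-style ARPAbet entries).
lexicon = {"bio": "B AY1 OW0", "sand": "S AE1 N D"}

def check_compound(word, model_phonemes):
    parts = word.lower().split("-")
    if not all(p in lexicon for p in parts):
        return None  # no lexicon coverage, cannot check
    expected = " ".join(lexicon[p] for p in parts)
    return expected == model_phonemes, expected

ok, expected = check_compound("bio-sand", "B AY1 OW0 S T AE2 N D")
print(ok, "| expected:", expected)  # flags the spurious T
```

Here the check flags the spurious extra T the network inserted into "sand".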
BUT placed 3rd (NIST LRE 2022).
System description
https://www.fit.vutbr.cz/research/groups/speech/publi/2022/NIST_LRE_2022_System_Description.pdf