How much smaller can you make your LM with overtraining?
This figure from Chinchilla gives you a clue on what to expect. Say, you have C = 6e20.
If N = 350M, it performs on par with L_opt of C = 1e20 (N_opt = 900M).
=> 6x training FLOPS for 2.5x less inference FLOPS
https://twitter.com/arankomatsuzaki/status/1630257908238696449
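A quick check of those ratios, assuming the usual approximations C ≈ 6·N·D for training compute and ≈ 2·N FLOPs per generated token (the constants cancel in the ratios; the N and C values are read off the figure, not exact):

```python
# Back-of-the-envelope check of the overtraining trade-off above.
N_small, C_small = 350e6, 6e20   # overtrained small model
N_opt,   C_opt   = 900e6, 1e20   # compute-optimal model at the same loss

print(f"extra training compute: {C_small / C_opt:.0f}x")         # 6x
print(f"inference FLOPs saved:  {N_opt / N_small:.1f}x fewer")    # ~2.6x
# Tokens each model sees under C = 6*N*D:
print(f"overtrained tokens: {C_small / (6 * N_small):.1e}")       # ~2.9e11
print(f"optimal tokens:     {C_opt / (6 * N_opt):.1e}")           # ~1.9e10
```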
https://arxiv.org/abs/2302.12369
Factual Consistency Oriented Speech Recognition
Naoyuki Kanda, Takuya Yoshioka, Yang Liu
This paper presents a novel optimization framework for automatic speech recognition (ASR) with the aim of reducing hallucinations produced by an ASR model. The proposed framework optimizes the ASR model to maximize an expected factual consistency score between ASR hypotheses and ground-truth transcriptions, where the factual consistency score is computed by a separately trained estimator. Experimental results using the AMI meeting corpus and the VoxPopuli corpus show that the ASR model trained with the proposed framework generates ASR hypotheses that have significantly higher consistency scores with ground-truth transcriptions while maintaining the word error rates close to those of cross entropy-trained ASR models. Furthermore, it is shown that training the ASR models with the proposed framework improves the speech summarization quality as measured by the factual consistency of meeting conversation summaries generated by a large language model.
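Maximizing an expected metric over sampled hypotheses is a minimum-risk / REINFORCE-style objective. A minimal sketch of that idea, assuming hypothetical `model.sample` and `estimator` interfaces (the paper's actual setup is not specified here):

```python
import torch

def expected_consistency_loss(model, estimator, audio, reference, n_samples=4):
    # Hypothetical interfaces: `model.sample` returns n hypotheses with their
    # sequence log-probs; `estimator(hyp, ref)` returns a factual consistency
    # score in [0, 1]. Both names are illustrative, not the paper's API.
    hyps, log_probs = model.sample(audio, n=n_samples)
    with torch.no_grad():
        scores = torch.tensor([estimator(h, reference) for h in hyps])
        scores = scores - scores.mean()   # baseline reduces gradient variance
    # Maximizing the expected score == minimizing the negative
    # score-weighted log-likelihood of the sampled hypotheses.
    return -(scores * log_probs).mean()
```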
https://twitter.com/Maureendss/status/1630209732223852544
📢 Exciting news! We just released ProsAudit, a prosodic benchmark for SSL models of speech 🥳
💬 It is now part of the Zero Resource Speech Challenge (track 4). The paper also includes results on a human comparison. 👨💻🤖
📰Check out the preprint: https://arxiv.org/pdf/2302.12057.pdf
Nice Chinese chip with analog NPU (very power-efficient) and RISC core
https://en.witmem.com/wtm2101.html
Witmem
WTM2101 computing-in-memory chips
Witmem's WTM2101 chip is suitable for low-power AIoT applications, and can complete large-scale deep learning operations with microwatt to milliwatt power consumption, especially suitable for intelligent voice and intelligent health services in wearable devices…
12M hours of speech data
https://arxiv.org/abs/2303.01037
Google USM: Scaling Automatic Speech Recognition Beyond 100 Languages
Yu Zhang, Wei Han, James Qin, Yongqiang Wang, Ankur Bapna, Zhehuai Chen, Nanxin Chen, Bo Li, Vera Axelrod, Gary Wang, Zhong Meng, Ke Hu, Andrew Rosenberg, Rohit Prabhavalkar, Daniel S. Park, Parisa Haghani, Jason Riesa, Ginger Perng, Hagen Soltau, Trevor Strohman, Bhuvana Ramabhadran, Tara Sainath, Pedro Moreno, Chung-Cheng Chiu, Johan Schalkwyk, Françoise Beaufays, Yonghui Wu
We introduce the Universal Speech Model (USM), a single large model that performs automatic speech recognition (ASR) across 100+ languages. This is achieved by pre-training the encoder of the model on a large unlabeled multilingual dataset of 12 million (M) hours spanning over 300 languages, and fine-tuning on a smaller labeled dataset. We use multilingual pre-training with random-projection quantization and speech-text modality matching to achieve state-of-the-art performance on downstream multilingual ASR and speech-to-text translation tasks. We also demonstrate that despite using a labeled training set 1/7-th the size of that used for the Whisper model, our model exhibits comparable or better performance on both in-domain and out-of-domain speech recognition tasks across many languages.
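The "random-projection quantization" mentioned here refers to BEST-RQ-style pre-training, where a frozen random projection and a frozen random codebook turn speech frames into discrete targets for masked prediction. A minimal sketch of that quantizer (dimensions are illustrative, not USM's actual configuration):

```python
import torch

class RandomProjectionQuantizer(torch.nn.Module):
    def __init__(self, feat_dim=80, code_dim=16, codebook_size=8192):
        super().__init__()
        proj = torch.empty(feat_dim, code_dim)
        torch.nn.init.xavier_uniform_(proj)
        codebook = torch.nn.functional.normalize(
            torch.randn(codebook_size, code_dim), dim=-1)
        # Both are frozen buffers: only the speech encoder learns.
        self.register_buffer("proj", proj)
        self.register_buffer("codebook", codebook)

    def forward(self, feats):                    # feats: (batch, time, feat_dim)
        z = torch.nn.functional.normalize(feats @ self.proj, dim=-1)
        # For unit vectors, nearest neighbour == highest cosine similarity.
        return (z @ self.codebook.T).argmax(dim=-1)   # (batch, time) target ids

ids = RandomProjectionQuantizer()(torch.randn(2, 100, 80))  # (2, 100)
```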
https://t.me/speechtech/1449
Repeated this test with new Speechmatics. Async WER improved to 6.88. Indeed, the new Ursa model improved significantly! An interesting thing is that it is phoneme-based.
Telegram
Speech Technology
Recently I came across an ASR SaaS service, Soniox. Overall pretty nice: fast and clean UI, good transcription accuracy and features. 5 hours a month free per user.
I've read their whitepaper too with very cool results.
https://soniox.com/medi…
Tried the popular https://github.com/Kyubyong/g2p. As usual, neural networks are very bad on unseen cases: missing letters, extra letters, etc. Watch outputs carefully. Example:
bio-sand B AY1 OW0 S T AE2 N D
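The check is easy to reproduce with that package (installed as g2p_en; the import below follows its README):

```python
# pip install g2p_en
from g2p_en import G2p

g2p = G2p()
print(g2p("bio-sand"))
# The observed output above, B AY1 OW0 S T AE2 N D, inserts a spurious T
# ("sand" should be S AE2 N D), which is why OOV words need manual review.
```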
BUT placed 3rd in NIST LRE 2022.
System description
https://www.fit.vutbr.cz/research/groups/speech/publi/2022/NIST_LRE_2022_System_Description.pdf
Paraformer released models for other languages too:
We release several new UniASR model: Southern Fujian Dialect model, French model, German model, Vietnamese model, Persian model.
https://github.com/alibaba-damo-academy/FunASR
How can we make inference faster when using big #speech #selfsupervised models?
Check out @salah_zaiem's paper that compares various approaches, revealing some pretty interesting insights.
https://arxiv.org/abs/2303.06740
These techniques will soon be available in #SpeechBrain
https://twitter.com/mirco_ravanelli/status/1635678132731518976
New model from Assembly AI. Definitely improved from before, but not as great as Speechmatics.
On a toy test, WER 10.89; the previous AssemblyAI model (version 9) was at 11.04, and the version before that at 11.89. Speechmatics: 6.88. Whisper large: 8.94.
https://twitter.com/AssemblyAI/status/1636050346240884744
Introducing Conformer-1: our latest state-of-the-art speech recognition model.
Built on top of the Conformer architecture and trained on 650K hours of audio data, it achieves near-human-level performance, making up to 43% fewer errors on noisy data than other ASR models.
We use a modified version of the conformer neural net published by Google Brain.
It's built on top of the Efficient Conformer (Orange Labs, 2021), which introduces the following technical modifications:
- Progressive Downsampling to reduce the length of the encoded sequence
- Grouped Attention: A modified version of the attention mechanism that makes it agnostic to sequence length (sketched below)
These changes yield speedups of 29% at inference time and 36% at training time.
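As a rough illustration of the grouped-attention idea from the Efficient Conformer paper (not AssemblyAI's actual implementation): neighboring frames are concatenated into groups before self-attention, so the quadratic attention cost drops by roughly the square of the group size. Dimensions below are made up for the example:

```python
import torch

def grouped_self_attention(x, mha, group_size=3):
    """x: (batch, time, dim); mha: nn.MultiheadAttention with
    embed_dim = dim * group_size and batch_first=True."""
    b, t, d = x.shape
    pad = (-t) % group_size
    x = torch.nn.functional.pad(x, (0, 0, 0, pad))    # pad the time axis
    g = x.reshape(b, (t + pad) // group_size, d * group_size)
    out, _ = mha(g, g, g)                             # attends over t/g frames
    return out.reshape(b, t + pad, d)[:, :t]          # ungroup back to t frames

x = torch.randn(2, 100, 144)
mha = torch.nn.MultiheadAttention(144 * 3, num_heads=4, batch_first=True)
print(grouped_self_attention(x, mha).shape)           # torch.Size([2, 100, 144])
```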
To further improve our model’s accuracy on noisy audio, we implemented a modified version of Sparse Attention, a pruning method that sparsifies the model’s weights to provide regularization.
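Details of their Sparse Attention variant aren't given in the thread; as a stand-in, this is what generic magnitude pruning of a weight matrix looks like:

```python
import torch

def magnitude_prune_(weight: torch.Tensor, sparsity: float = 0.5) -> None:
    # Zero out the smallest-magnitude fraction of entries in-place.
    # Generic magnitude pruning, not AssemblyAI's actual method.
    k = int(weight.numel() * sparsity)
    if k > 0:
        threshold = weight.abs().flatten().kthvalue(k).values
        weight.mul_((weight.abs() > threshold).to(weight.dtype))

w = torch.randn(4, 4)
magnitude_prune_(w, sparsity=0.5)
print((w == 0).float().mean())  # ~0.5 of the entries are now zero
```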
We took inspiration from the data scaling laws described in DeepMind's Chinchilla paper and adapted them to the ASR domain.
Our team curated a dataset of 650K hours of English audio, making our model the largest supervised model for English trained to date.
Based on our results, Conformer-1 is more robust on real-world data than popular commercial and open-source ASR models, making up to 43% fewer errors on average on noisy data:
The biggest improvement with this new release is in our robustness to a wide variety of data domains and noisy audio.
Kincaid46 WER from Ursa announcement:
AssemblyAI: 8.6
Speechmatics: 7.88
Microsoft: 9.70
Whisper Large-v2: 8.7
Vosk 0.42 (Gigaspeech): 15.8
Google: 12.52
Amazon: 10.94
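For reference, the WER numbers throughout this channel are word-level edit distance divided by reference length; a minimal self-contained sketch:

```python
def wer(reference: str, hypothesis: str) -> float:
    ref, hyp = reference.split(), hypothesis.split()
    # dp[i][j] = edits needed to turn ref[:i] into hyp[:j]
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        for j in range(len(hyp) + 1):
            if i == 0 or j == 0:
                dp[i][j] = i or j
            else:
                dp[i][j] = min(dp[i-1][j] + 1,            # deletion
                               dp[i][j-1] + 1,            # insertion
                               dp[i-1][j-1] + (ref[i-1] != hyp[j-1]))
    return dp[-1][-1] / len(ref)

print(wer("the cat sat", "the cat sat down"))  # 1 insertion / 3 words ~ 0.33
```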
Streaming punctuation model is interesting
https://github.com/alibaba-damo-academy/FunASR/releases/tag/v0.3.0
The number of models this guy has trained is quite outstanding
https://malaya-speech.readthedocs.io/en/latest/index.html