Russian Speech Recognition
You may have heard about our dataset, Open STT.
And yeah, if you have not guessed, we have built a Speech-To-Text system that is better than or on par with the alleged "market leaders", the only difference being that we publish something for the common good and we do not need 100 Tesla GPUs (wink-wink, Oleg).
Also if it is not super obvious, this thing is already deployed into production and it really works.
Now we have decided to come out of stealth mode a bit and publish a series of pieces in Russian / English online media:
- A piece on Habr.com - just published https://habr.com/ru/post/494006/ - it is very short and abridged, you know Habr;
- 2 more detailed pieces on https://thegradient.pub - coming soon!
If you want more gory details, you can see a couple of posts on our project's website:
- STT system quality benchmarks - https://www.silero.ai/russian-stt-benchmarks/
- STT system speed https://www.silero.ai/stt-system-speed/
- How to measure quality in STT https://www.silero.ai/stt-quality-metrics/
If you would like to test our system, you may first want to:
- Try a demo http://demo.silero.ai/ (more beautiful mobile demo coming up!)
- See the API https://api.silero.ai/docs
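Since one of the links above is about measuring STT quality: the standard metric is WER (word error rate), i.e. word-level edit distance divided by the number of reference words. A minimal sketch for illustration only (this is not the code behind the benchmarks):

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: word-level edit distance / number of reference words."""
    ref, hyp = reference.split(), hypothesis.split()
    # Classic dynamic-programming edit distance over words
    # (substitutions, insertions, deletions all cost 1)
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            sub = d[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1])
            d[i][j] = min(sub, d[i - 1][j] + 1, d[i][j - 1] + 1)
    return d[len(ref)][len(hyp)] / max(len(ref), 1)

print(wer("the cat sat on the mat", "the cat sat mat"))  # 2 errors / 6 words
```

Note that real benchmarks also depend heavily on text normalization (casing, punctuation, numerals) before the distance is computed.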
#deep_learning
#speech
#nlp
Towards an ImageNet Moment for Speech-to-Text
First CV, and then (arguably) NLP, have had their ImageNet moment — a technical shift that makes tackling many problems much easier. Could Speech-To-Text be next?
Following the release of our production models / metrics, this is our piece on the topic on thegradient.pub! It is by far the largest piece of work we have ever done, and I hope that it will not go under the radar.
It is in our hands now to make sure that speech recognition brings value to people worldwide, and not only some fat cats.
So, without further ado:
- The piece itself https://thegradient.pub/towards-an-imagenet-moment-for-speech-to-text/
- Some more links here https://spark-in.me/post/towards-an-imagenet-moment-for-speech-to-text
- If you are on Twitter, please repost this message - https://twitter.com/gradientpub/status/1243967773635571712
A lot of thanks to Thegradient team, especially Andrey and Jacob, for the sheer amount of work they put in to make this piece readable and understandable!
Please like, share, repost!
Also, there will be a second piece with criticism, so stay tuned!
#speech
#deep_learning
Towards End-to-end ASR
Towards End-to-end ASR - an internal (?) presentation by Google
https://drive.google.com/file/d/1Rpob1-C223L9UWTiLJ6_Dy12mTA3YyTn/view
This is such a huge corpus of work.
Interesting conclusions:
- Google records your voice (at least via Google Assistant; it is unclear whether they also use their "Phone" app) and uses this data for their models. Surprise surprise!
- Obviously Google is pushing towards end-to-end ASR within one NN on a mobile device for a number of reasons:
(i) easier packaging
(ii) quantization
(iii) no requirement to run a large LM alongside the model
(iv) Google has a lot of data (end-to-end models suffer from lack of data mostly)
- 120MB total system size on a mobile device. This means AM + LM, which in this case is one quantized RNN-T model (4x compression, float32 => int8)
- They also write that hybrid systems with LM fusion / rescoring perform better overall
- The "best" cited solutions are not end-to-end, though
- I finally understood why they were pushing their RNN-T models instead of 10x more frugal alternatives: old and optimized layers, hacks to speed up inference, unlimited resources, better performance (at the same step). Also, LSTMs are known to be able to replace LMs
- Google also knows about the "Time Reduction Layer", but it looks like using it within an RNN is a bit painful - a lot of fiddling in the model logic
- It looks like, given unlimited resources, data and compute, the easiest solution is to train large LSTMs in an end-to-end fashion (I also noticed that LSTMs have higher quality at the same step, but MUCH worse speed and convergence overall in terms of time-to-accuracy), optimize them heavily, quantize and deploy
- Sharing AMs / LMs for different dialects kind of works (maybe in terms of time-to-accuracy?), but direct tuning is better
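The float32 => int8 trick behind the 4x size reduction above can be sketched in a few lines: symmetric per-tensor quantization in plain NumPy. This is an illustration only; Google's actual quantization scheme is not described here:

```python
import numpy as np

# Toy weight matrix standing in for a model layer
w = np.random.randn(512, 512).astype(np.float32)

# Symmetric per-tensor int8 quantization: map the largest-magnitude
# weight to +/-127 and round everything else onto that integer grid
scale = np.abs(w).max() / 127.0
w_int8 = np.round(w / scale).astype(np.int8)

# Dequantize at inference time (rounding error is bounded by scale / 2)
w_restored = w_int8.astype(np.float32) * scale

print(w.nbytes // w_int8.nbytes)                     # 4x smaller
print(float(np.abs(w - w_restored).max()) < scale)   # True
```

Production schemes are usually per-channel and calibrated on activations as well, but the storage arithmetic (4 bytes => 1 byte per weight) is the same.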
But is full 100% end-to-end feasible on any scale below Google?
Probably not. Unless you are Facebook.
A fully end-to-end pipeline will have OOV problems (even with BPE / word-piece tokens) and other issues - like bias towards the domains where you have audio. It will certainly NOT generalize to unseen new words and pronunciations.
Meh.
But can you have extremely small mobile models?
Yes and no. Our latest small AM targets 200MB before quantization and probably 50MB after. The current production model is around 90MB (after quantization). But can it serve instead of an LM?
Technically yes, but quality will suffer. Unlike Google, we do not have unlimited data, compute and low-level engineers. On the other hand, fully neural post-processing / decoding without huge transformer-like models is more than feasible. So we will see =)
#speech
Towards an ImageNet Moment for Speech-to-Text Part 2
Following our post on habr (https://habr.com/ru/post/494006/) and our first post dedicated to training practical STT models (https://thegradient.pub/towards-an-imagenet-moment-for-speech-to-text/) we are publishing a second post, more technical and dedicated to the STT industry and academia itself.
Please feel free to revisit these posts for more details on our models:
- https://t.me/snakers4/2443
- https://t.me/snakers4/2431
Now, enjoy:
- A Speech-To-Text Practitioner’s Criticisms of Industry and Academia https://thegradient.pub/a-speech-to-text-practitioners-criticisms-of-industry-and-academia/
Please share and repost!
#speech
#deep_learning
Forwarded from Data Science by ODS.ai 🦜
Nvidia AI Noise Reduction
#Nvidia launches a #KrispAI competitor: AI-based noise reduction on RTX video cards.
It seems to work significantly better than other tools of this kind, but officially it requires an Nvidia RTX card.
However, it is possible to run it on older cards. The instructions are below, or you can just download an already-patched executable (also below).
Setup Guide: https://www.nvidia.com/en-us/geforce/guides/nvidia-rtx-voice-setup-guide/
The instruction: https://forums.guru3d.com/threads/nvidia-rtx-voice-works-without-rtx-gpu-heres-how.431781/
Executable (use at your own risk): https://mega.nz/file/CJ0xDYTB#LPorY_aPVqVKfHqWVV7zxK8fNfRmxt6iw6KdkHodz1M
#noisereduction #soundlearning #dl #noise #sound #speech #nvidia
New Minor Open STT Release
https://github.com/snakers4/open_stt/releases/tag/v1.01
Dataset conversion to OPUS
OPUS torrent - https://academictorrents.com/details/95b4cab0f99850e119114c8b6df00193ab5fa34f
OPUS helpers and build instructions - https://github.com/snakers4/open_stt/#how-to-open-opus
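For a single file, the WAV => OPUS conversion boils down to one ffmpeg call. The helper below is a hypothetical sketch (the repo's own helpers and build instructions are linked above); the 32k bitrate default is an assumption for illustration:

```python
import subprocess  # used when the run() call below is uncommented
from pathlib import Path

# Hypothetical helper, not part of the open_stt repo: build the ffmpeg
# command that transcodes one WAV file to OPUS at a given bitrate.
def wav_to_opus_cmd(src: Path, bitrate: str = "32k") -> list:
    dst = src.with_suffix(".opus")
    return ["ffmpeg", "-y", "-i", str(src),
            "-acodec", "libopus", "-b:a", bitrate, str(dst)]

cmd = wav_to_opus_cmd(Path("sample.wav"))
# subprocess.run(cmd, check=True)  # uncomment when ffmpeg is on PATH
print(" ".join(cmd))
```

OPUS is lossy, so it is a trade-off: a much smaller dataset at the cost of a (usually negligible for STT) drop in audio fidelity.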
Coming soon - new unlimited direct links =)
Further reading links
#deep_learning
#speech
Free Unlimited Download Links for Open STT!
Kudos to Azure Open Datasets, we now have brand new direct download links!
https://github.com/snakers4/open_stt/releases/tag/v1.02
#deep_learning
#speech
Forwarded from Silero News (Alexander)
💎 Silero English STT Models V6 💎
📎 We have published the new en_v6 speech-to-text models
📎 Please see the metrics here
📎 A large number of new validation datasets added for dialects and VOIP
📎 The model family now includes variations of small and xlarge models
📎 Single-digit quality gains both for CE and EE models; the gains are less pronounced with EE models
🗜 Best gains are reserved for xsmall models, which will not be public for the time being; they have almost reached small models in terms of quality, but are 2x smaller (14M params)
⚠️ The models seem to fit the data quite well, but the returns are diminishing compared to V3 => V4 => V5. We are already investigating radically new ways to make the models better, stay tuned
📦 Also, we have started working on packaging the utils for the public Silero models in a pip package (it will work similarly to torch.hub.load)