Spark in me
Lost like tears in rain. DS, ML, a bit of philosophy and math. No bs or ads.
Russian Speech Recognition

You may have heard about our dataset, Open STT.

And yeah, in case you have not guessed: we have built a Speech-To-Text system that is on par with or better than the alleged "market leaders", the only difference being that we publish something for the common good and we do not need 100 Tesla GPUs (wink-wink, Oleg).

Also, in case it is not obvious: this thing is already deployed in production, and it really works.

Now we have decided to come out of stealth mode a bit and publish a series of articles in Russian and English online outlets:

- A piece on Habr.com, just published - https://habr.com/ru/post/494006/ - it is very short and abridged, you know Habr;
- 2 more detailed pieces on https://thegradient.pub - coming soon!

If you want more gory details, you can see a couple of posts on our project's website:

- STT system quality benchmarks - https://www.silero.ai/russian-stt-benchmarks/
- STT system speed - https://www.silero.ai/stt-system-speed/
- How to measure quality in STT - https://www.silero.ai/stt-quality-metrics/ (see the WER sketch below)
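For context, the de facto standard quality metric in STT is the Word Error Rate (WER): the word-level edit distance between the reference and the hypothesis, normalized by the reference length. A minimal self-contained sketch (an illustration, not our benchmarking code):

```python
def wer(ref: str, hyp: str) -> float:
    """Word Error Rate: word-level edit distance / reference length."""
    r, h = ref.split(), hyp.split()
    # Levenshtein distance over words via dynamic programming
    d = [[0] * (len(h) + 1) for _ in range(len(r) + 1)]
    for i in range(len(r) + 1):
        d[i][0] = i
    for j in range(len(h) + 1):
        d[0][j] = j
    for i in range(1, len(r) + 1):
        for j in range(1, len(h) + 1):
            sub = d[i - 1][j - 1] + (r[i - 1] != h[j - 1])
            d[i][j] = min(sub, d[i - 1][j] + 1, d[i][j - 1] + 1)
    return d[-1][-1] / max(len(r), 1)

print(wer("the cat sat", "the cat sat down"))  # 1 insertion / 3 words ~ 0.33
```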

If you would like to test our system, you may first want to:

- Try the demo - http://demo.silero.ai/ (a more polished mobile demo is coming up!)
- See the API docs - https://api.silero.ai/docs (a hypothetical request sketch follows below)
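To give a feel for what calling such an HTTP STT API looks like, here is a hedged sketch using Python's requests library. The endpoint path, field names and response shape below are pure assumptions for illustration - the actual contract lives in the API docs linked above:

```python
import requests

API_URL = "https://api.silero.ai/transcribe"  # hypothetical endpoint - check the docs

# Send an audio file and read back the transcript (illustrative only)
with open("sample.wav", "rb") as f:
    resp = requests.post(
        API_URL,
        files={"audio": f},       # assumed field name
        data={"language": "ru"},  # assumed parameter
        timeout=60,
    )
resp.raise_for_status()
print(resp.json())  # assumed JSON payload containing the transcript
```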

#deep_learning
#speech
#nlp
Towards an ImageNet Moment for Speech-to-Text

First CV, and then (arguably) NLP, have had their ImageNet moment: a technical shift that makes tackling many problems much easier. Could Speech-To-Text be next?

Following the release of our production models / metrics, here is our piece on the topic on thegradient.pub! This is the largest piece of work we have ever done so far, and I hope it will not fly under the radar.

It is now in our hands to make sure that speech recognition brings value to people worldwide, and not only to a few fat cats.

So, without further ado:

- The piece itself https://thegradient.pub/towards-an-imagenet-moment-for-speech-to-text/
- Some more links here https://spark-in.me/post/towards-an-imagenet-moment-for-speech-to-text
- If you are on Twitter, please repost this message - https://twitter.com/gradientpub/status/1243967773635571712

Many thanks to The Gradient team, especially Andrey and Jacob, for the sheer amount of work they put in to make this piece readable and understandable!

Please like, share, repost!

Also, there will be a second piece with criticism, so stay tuned!

#speech
#deep_learning
Towards End-to-end ASR

Towards End-to-end ASR - an internal (?) presentation by Google
https://drive.google.com/file/d/1Rpob1-C223L9UWTiLJ6_Dy12mTA3YyTn/view

This is such a huge body of work.

Interesting conclusions:

- Google records your voice (at least from Google Assistant; it is unclear whether they also abuse their "Phone" app) and uses this data for their models. Surprise, surprise!

- Obviously Google is pushing towards end-to-end ASR within one NN on a mobile device for a number of reasons:

(i) easier packaging
(ii) quantization
(iii) no requirement to run a large LM alongside the model
(iv) Google has a lot of data (end-to-end models mostly suffer from a lack of data)

- 120MB total system size on the mobile device. This covers both the AM and the LM, which in this case is a single RNN-T model quantized 4x (float32 => int8) - see the quantization sketch after this list

- They also write that hybrid systems with LM fusion / rescoring perform better overall

- The "best" cited solutions are not end-to-end, though

- I finally understood why they were pushing their RNN-T models instead of 10x more frugal alternatives: old and well-optimized layers, hacks to speed up inference, unlimited resources, better performance (at the same training step). Also, LSTMs are known to be able to stand in for LMs

- Google also knows about the "time reduction layer", but it looks like using it within an RNN is a bit painful - a lot of fiddling with the model logic (a minimal frame-stacking sketch follows after this list)

- It looks like, given unlimited resources, data and compute, the easiest solution is to train large LSTMs in an end-to-end fashion (I also noticed that LSTMs reach higher quality at the same step, but have MUCH worse speed and convergence overall in terms of time-to-accuracy), then optimize them heavily, quantize and deploy

- Sharing AMs / LMs across different dialects kind of works (maybe in terms of time-to-accuracy?), but direct tuning is better
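To make the 4x quantization figure concrete, here is a minimal sketch of post-training dynamic int8 quantization in PyTorch - not Google's pipeline, just the stock torch.quantization API applied to a toy LSTM:

```python
import os
import torch

# A toy LSTM standing in for an acoustic model
model = torch.nn.LSTM(input_size=80, hidden_size=512, num_layers=4)

# Dynamic quantization: weights stored as int8, roughly 4x smaller than float32
quantized = torch.quantization.quantize_dynamic(
    model, {torch.nn.LSTM}, dtype=torch.qint8
)

def size_mb(m: torch.nn.Module) -> float:
    """On-disk size of a model's state_dict in MB."""
    torch.save(m.state_dict(), "tmp.pt")
    mb = os.path.getsize("tmp.pt") / 1e6
    os.remove("tmp.pt")
    return mb

print(f"float32: {size_mb(model):.1f} MB, int8: {size_mb(quantized):.1f} MB")
```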
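And since the time reduction layer came up: the trick is simply to stack consecutive frames, so the recurrent layers see a sequence that is `factor` times shorter. A minimal sketch of the idea (my own illustration, not Google's implementation):

```python
import torch

class TimeReductionLayer(torch.nn.Module):
    """Concatenate every `factor` consecutive frames: the time axis
    shrinks by `factor`, the feature axis grows by the same amount."""

    def __init__(self, factor: int = 2):
        super().__init__()
        self.factor = factor

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, t, f = x.shape               # (batch, time, features)
        pad = (-t) % self.factor        # pad so time is divisible by factor
        if pad:
            x = torch.nn.functional.pad(x, (0, 0, 0, pad))
        return x.reshape(b, (t + pad) // self.factor, f * self.factor)

x = torch.randn(8, 100, 80)
print(TimeReductionLayer(2)(x).shape)   # torch.Size([8, 50, 160])
```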

But is full 100% end-to-end ASR feasible at any scale below Google's?

Probably not. Unless you are Facebook.
A fully end-to-end pipeline will have OOV problems (even with BPE / word-piece tokens) and other issues, like a bias towards the domains where you have audio. It will certainly NOT generalize to unseen new words and pronunciations - the tokenizer sketch below illustrates why.
Meh.
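To illustrate the OOV point: a sub-word tokenizer never fails outright on an unseen word, but it shreds it into fragments that the model has rarely (or never) seen during training. A quick sketch with the HuggingFace tokenizers library; the toy corpus and vocab size are assumptions for illustration:

```python
from tokenizers import Tokenizer
from tokenizers.models import BPE
from tokenizers.pre_tokenizers import Whitespace
from tokenizers.trainers import BpeTrainer

# Train a tiny BPE vocabulary on a toy corpus
corpus = ["the cat sat on the mat", "the dog sat on the log"] * 50
tokenizer = Tokenizer(BPE(unk_token="[UNK]"))
tokenizer.pre_tokenizer = Whitespace()
trainer = BpeTrainer(vocab_size=60, special_tokens=["[UNK]"])
tokenizer.train_from_iterator(corpus, trainer)

# A frequent word maps to a single token...
print(tokenizer.encode("cat").tokens)      # ['cat']
# ...while an unseen word shatters into sub-word fragments,
# e.g. something like ['cat', 'a', 'log'] depending on the learned merges
print(tokenizer.encode("catalog").tokens)
```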

But can you have extremely small mobile models?

Yes and no. Our latest small AM targets 200MB before quantization and probably 50MB after (the same 4x float32 => int8 ratio). The current production model is around 90MB (after quantization). But can it serve instead of an LM?

Technically yes, but quality will suffer. Unlike Google, we do not have unlimited data, compute and low-level engineers. On the other hand, fully neural post-processing / decoding without huge transformer-like models is more than feasible. So we will see =)

#speech
Towards an ImageNet Moment for Speech-to-Text Part 2

Following our post on Habr (https://habr.com/ru/post/494006/) and our first post dedicated to training practical STT models (https://thegradient.pub/towards-an-imagenet-moment-for-speech-to-text/), we are publishing a second post - more technical, and dedicated to the STT industry and academia itself.

Please feel free to revisit these posts for more info on our models:

- https://t.me/snakers4/2443
- https://t.me/snakers4/2431

Now, enjoy:

- A Speech-To-Text Practitioner’s Criticisms of Industry and Academia https://thegradient.pub/a-speech-to-text-practitioners-criticisms-of-industry-and-academia/

Please share and repost!

#speech
#deep_learning
Nvidia AI Noise Reduction

#Nvidia has launched a #KrispAI competitor: AI-based noise reduction on RTX video cards.

It seems to work significantly better than other tools of this kind, but officially it requires an Nvidia RTX card.

It is possible to run it on older cards, though - the workaround is linked below, or you can just download an already patched executable (also below).

Setup Guide: https://www.nvidia.com/en-us/geforce/guides/nvidia-rtx-voice-setup-guide/
The workaround: https://forums.guru3d.com/threads/nvidia-rtx-voice-works-without-rtx-gpu-heres-how.431781/
Executable (use at your own risk): https://mega.nz/file/CJ0xDYTB#LPorY_aPVqVKfHqWVV7zxK8fNfRmxt6iw6KdkHodz1M

#noisereduction #soundlearning #dl #noise #sound #speech #nvidia
Free Unlimited Download Links for Open STT!

Kudos to Azure Open Datasets, we now have brand new direct download links!

https://github.com/snakers4/open_stt/releases/tag/v1.02

#deep_learning
#speech
Forwarded from Silero News (Alexander)
💎 Silero English STT Models V6 💎

📎 We have published the new en_v6 speech-to-text models

📎 Please see the metrics here

📎 A large number of new validation datasets added for dialects and VOIP

📎 The model family now includes variations of small and xlarge models

📎 Single-digit quality gains for both CE and EE models; the gains are less pronounced with EE models

🗜 The best gains are reserved for the xsmall models, which will not be public for the time being; they have almost reached the small models in terms of quality, but are 2x smaller (14M params)

⚠️ The models seem to fit the data quite well, but the returns are diminishing compared to V3 => V4 => V5. We are already investigating radically new ways to make the models better, stay tuned

📦 Also, we have started working on packaging the utils for the public Silero models into a pip package (it will work similarly to torch.hub.load - see the sketch below)
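For reference, this is roughly what the torch.hub loading flow looks like; the repo path, model name and the shape of the returned utils follow the snakers4/silero-models README at the time of writing, so treat the exact signatures as assumptions rather than a stable contract:

```python
import torch

device = torch.device("cpu")

# Load a Silero STT model via torch.hub (per the silero-models README)
model, decoder, utils = torch.hub.load(
    repo_or_dir="snakers4/silero-models",
    model="silero_stt",
    language="en",
    device=device,
)
read_batch, split_into_batches, read_audio, prepare_model_input = utils

# Transcribe a local file (the path is a placeholder)
batches = split_into_batches(["sample.wav"], batch_size=1)
inputs = prepare_model_input(read_batch(batches[0]), device=device)
for probs in model(inputs):
    print(decoder(probs.cpu()))
```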