Spark in me
Lost like tears in rain. DS, ML, a bit of philosophy and math. No bs or ads.
New in our Open STT dataset

https://github.com/snakers4/open_stt#updates

- An mp3 version of the dataset;
- A torrent for mp3 dataset;
- A torrent for the original wav dataset;
- Benchmarks on the public dataset / files with "poor" annotation marked;

#deep_learning
#data_science
#dataset
New version of our open STT dataset - 0.5, now in beta

Please share and repost!

https://github.com/snakers4/open_stt/releases/tag/v0.5-beta

What is new?
- A new domain - radio (1000+ new hours);
- A larger YouTube dataset with 1000+ additional hours;
- A small (300 hours) YouTube dataset downloaded in maximum quality;
- Manually annotated ground truth validation sets for YouTube / books / public calls;
- Now we will start to focus on actually cleaning and distilling the dataset. We have published a second list of "bad" data;
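
For anyone wanting to apply such a list of "bad" data, here is a minimal sketch of filtering a manifest against an exclusion list. The list format (one relative path per line, `#` for comments), the manifest layout (path, transcript) and the helper names `load_exclusions` / `drop_excluded` are all assumptions made for illustration; check the repository for the actual file formats before relying on this.

```python
from pathlib import PurePosixPath


def load_exclusions(lines):
    """Collect relative paths from an exclusion list, skipping blanks and comments.

    Assumed (hypothetical) format: one relative path per line, '#' starts a comment.
    """
    return {
        PurePosixPath(line.strip()).as_posix()
        for line in lines
        if line.strip() and not line.lstrip().startswith("#")
    }


def drop_excluded(manifest_rows, excluded):
    """Keep (wav_path, transcript) rows whose wav_path is not in the exclusion set."""
    return [
        row for row in manifest_rows
        if PurePosixPath(row[0]).as_posix() not in excluded
    ]


# Toy data only; these paths and transcripts are invented for the example.
bad_list = ["# poorly annotated files", "radio/0/aa/0001.wav", ""]
manifest = [
    ("radio/0/aa/0001.wav", "transcript one"),
    ("radio/0/aa/0002.wav", "transcript two"),
]

clean = drop_excluded(manifest, load_exclusions(bad_list))
print(clean)
```

In practice the two inputs would come from open files rather than in-memory lists; the filtering logic stays the same.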

I'm back from vacation)

#deep_learning
#data_science
#dataset
Support Open STT

Now you can support Open STT on our GitHub page via Open Collective!
https://github.com/snakers4/open_stt

Open Collective seems to be the best of the funding platforms supported by GitHub for now.

#dataset
Ukrainian Open STT 1000 Hours

Following the path of Open STT in Russian, now you can enjoy a similar dataset in Ukrainian:

- Torrent Link
- GitHub Link

Congratulations to our Ukrainian friends on finally publishing a diverse, easily downloadable dataset!

Their pages / dataset UX are still a bit rough around the edges, but compared with the pace at which, for example, Common Voice accumulates data (130 hours for Russian and 43 hours for Ukrainian), UA Open STT and Open STT remain the best resources for their respective languages to date.

Also, unlike the majority of STT datasets, which (i) are behind a paywall or sponsored by corporations, (ii) have limited scope / domains, or (iii) fit some sort of agenda (e.g. use more GPUs than necessary, use our bloated tools, etc.), this dataset is legit made by real people.

Also, corporations have recently picked up the trend of rehashing publicly available data, which is cool, but unique data is still nowhere to be seen for obvious reasons (except for Common Voice, which is decent only for English).

#dataset
Forwarded from Silero News (Alexander)
Faster Mirror Links for Open STT

Added Zenodo mirror links for Open STT - https://github.com/snakers4/open_stt#links

Zenodo provides sort-of free hosting for non-commercial / academic projects.

Their support has a very ivory-tower attitude, but beggars cannot be choosers.

#dataset