Spark in me

Habr.com / TowardsDataScience post for our dataset

In addition to a GitHub release and a Medium post, we also published a post on habr.com:
- https://habr.com/ru/post/450760/

Our post was also accepted as an editor's pick on TowardsDataScience (TDS):
- http://bit.ly/ru_open_stt

Share / give us a star / clap if you have not already!

Original release
https://github.com/snakers4/open_stt/

#deep_learning
#data_science
#dataset

Habr

A huge open dataset of Russian speech

Speech recognition specialists have long lacked a large open corpus of spoken Russian, so only large companies could afford to work on this task, but they did not...

1.8K views, Alexander, 11:28

New in our Open STT dataset

https://github.com/snakers4/open_stt#updates

- An mp3 version of the dataset;
- A torrent for the mp3 dataset;
- A torrent for the original wav dataset;
- Benchmarks on the public dataset / files with "poor" annotation marked.

#deep_learning
#data_science
#dataset

1.5K views, Alexander, 06:21

New version of our open STT dataset - 0.5, now in beta

Please share and repost!

https://github.com/snakers4/open_stt/releases/tag/v0.5-beta

What is new?
- A new domain - radio (1000+ new hours);
- A larger YouTube dataset with 1000+ additional hours;
- A small (300 hours) YouTube dataset downloaded in maximum quality;
- Manually annotated ground-truth validation sets for YouTube / books / public calls;
- From now on we will focus on actually cleaning and distilling the dataset. We have published a second list of "bad" data.

I'm back from vacation)

#deep_learning
#data_science
#dataset

GitHub

Release New major release - radio / youtube / data quality distillation · snakers4/open_stt

TLDR:

855 GB (in .wav format, int16), unarchived;
(new!) A new domain - radio;
(new!) A larger YouTube dataset with 1000+ additional hours;
(new!) A small (300 hours) YouTube dataset downloaded...

1.3K views, Alexander, 07:34

Support Open STT

Now you can support Open STT via Open Collective on our GitHub page!
https://github.com/snakers4/open_stt

Open Collective seems to be the best platform supported by GitHub for now.

#dataset

2.0K views, Alexander, 16:47

Ukrainian Open STT 1000 Hours

Following the path of Open STT in Russian, now you can enjoy a similar dataset in Ukrainian:

- Torrent Link
- GitHub Link

Congratulations to our Ukrainian friends on finally publishing a diverse, easily downloadable dataset!

Their pages / dataset UX are still a bit rough around the edges, but given how slowly, for example, Common Voice accumulates data (130 hours for Russian and 43 hours for Ukrainian), UA Open STT and Open STT remain the best resources for their respective languages to date.

Also, unlike the majority of STT datasets, which (i) are behind a paywall or sponsored by corporations, (ii) have limited scope / domains, or (iii) fit some sort of agenda (e.g. use more GPUs than necessary, use our bloated tools, etc.), this dataset is legit made by real people.

Also, corporations have recently picked up the trend of rehashing publicly available data, which is cool, but unique data is still nowhere to be seen for obvious reasons (except for Common Voice, which is decent only for English).

#dataset

20.0K views, Alexander, edited 01:47

Forwarded from Silero News (Alexander)

Faster Mirror Links for Open STT

Added Zenodo mirror links for Open STT - https://github.com/snakers4/open_stt#links

Zenodo provides more or less free hosting for non-commercial / academic projects.

Their support has a very ivory tower attitude, but beggars cannot be choosers.

#dataset

958 views, Alexander, 16:50
