Habr.com / TowardsDataScience post for our dataset
In addition to the GitHub release and the Medium post, we also published a post on habr.com:
- https://habr.com/ru/post/450760/
Our post was also selected as an editor's pick on TDS:
- http://bit.ly/ru_open_stt
Share / give us a star / clap if you have not already!
Original release
https://github.com/snakers4/open_stt/
#deep_learning
#data_science
#dataset
Habr
A huge open dataset of Russian speech
Speech recognition specialists have long lacked a large open corpus of spoken Russian, so only large companies could afford to work on this task, but they ...
New in our Open STT dataset
https://github.com/snakers4/open_stt#updates
- An mp3 version of the dataset;
- A torrent for the mp3 dataset;
- A torrent for the original wav dataset;
- Benchmarks on the public dataset / files with "poor" annotation marked;
#deep_learning
#data_science
#dataset
GitHub
GitHub - snakers4/open_stt: Open STT
Open STT. Contribute to snakers4/open_stt development by creating an account on GitHub.
New version of our open STT dataset - 0.5, now in beta
Please share and repost!
https://github.com/snakers4/open_stt/releases/tag/v0.5-beta
What is new?
- A new domain - radio (1000+ new hours);
- A larger YouTube dataset with 1000+ additional hours;
- A small (300 hours) YouTube dataset downloaded in maximum quality;
- Ground truth validation sets for YouTube / books / public calls manually annotated;
- Now we will start to focus on actually cleaning and distilling the dataset. We have published a second list of "bad" data;
I'm back from vacation)
#deep_learning
#data_science
#dataset
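For anyone wondering how to use a list of "bad" data in practice, here is a minimal sketch. The format of the published lists is not described in this post, so this assumes a plain text list with one file path per line. The function names and the sample paths are hypothetical and only illustrate the idea.

```python
from typing import Iterable, List, Set, Tuple


def load_exclusions(lines: Iterable[str]) -> Set[str]:
    """Collect file paths from an exclusion list (assumed: one path per line)."""
    return {line.strip() for line in lines if line.strip()}


def drop_bad_files(
    manifest: List[Tuple[str, str]], bad: Set[str]
) -> List[Tuple[str, str]]:
    """Keep only the (audio_path, transcript) pairs not named in the list."""
    return [(path, text) for path, text in manifest if path not in bad]


# Hypothetical sample data, not real dataset paths.
bad = load_exclusions(["radio/0001.wav\n", "\n", "youtube/0042.wav\n"])
manifest = [
    ("radio/0001.wav", "transcript a"),
    ("radio/0002.wav", "transcript b"),
    ("youtube/0042.wav", "transcript c"),
]
clean = drop_bad_files(manifest, bad)
```

Keeping the exclusions in a set makes each lookup constant time, which matters once the manifest has millions of rows.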
GitHub
Release New major release - radio / youtube / data quality distillation · snakers4/open_stt
TLDR:
855 GB (in .wav format in int16) non archived;
(new!) A new domain - radio;
(new!) A larger YouTube dataset with 1000+ additional hours;
(new!) A small (300 hours) YouTube dataset downloaded...
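To get a feel for what 855 GB of int16 .wav means in hours, a rough back-of-the-envelope sketch follows. The sample rate and channel count are not stated in this post, so 16 kHz mono is an assumption here, GB is taken as 10^9 bytes, and .wav headers are ignored. Treat the result as an order-of-magnitude estimate, not an official figure.

```python
BYTES_PER_SAMPLE = 2  # int16 means 2 bytes per sample


def hours_of_audio(total_bytes: int, sample_rate_hz: int, channels: int = 1) -> float:
    """Estimate the duration of raw PCM audio from its size on disk."""
    bytes_per_hour = BYTES_PER_SAMPLE * sample_rate_hz * channels * 3600
    return total_bytes / bytes_per_hour


# Assumed: 16 kHz mono, decimal gigabytes, headers ignored.
estimate = hours_of_audio(855 * 10**9, sample_rate_hz=16_000)
print(f"{estimate:.0f} hours")
```

Under these assumptions one hour of audio takes about 115 MB, so a different sample rate scales the estimate proportionally.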
Support Open STT
Now you can support Open STT on our GitHub page via Open Collective!
https://github.com/snakers4/open_stt
Open Collective seems to be the best platform that GitHub supports for now.
#dataset
Ukrainian Open STT 1000 Hours
Following the path of Open STT in Russian, now you can enjoy a similar dataset in Ukrainian:
- Torrent Link
- GitHub Link
Congratulations to our Ukrainian friends for finally publishing a diverse, easily downloadable dataset!
Their pages / dataset UX are still a bit rough around the edges, but compared with how slowly Common Voice, for example, accumulates data (130 hours for Russian and 43 hours for Ukrainian), UA Open STT and Open STT remain the best resources for their respective languages to date.
Also, unlike the majority of STT datasets, which (i) are behind a paywall or sponsored by corporations, (ii) have limited scope / domains, or (iii) fit some sort of agenda (e.g. use more GPUs than necessary, use our bloated tools, etc.), this dataset is legit made by real people.
Also, corporations have recently taken up the trend of rehashing publicly available data, which is cool, but unique data is still nowhere to be seen for obvious reasons (except for Common Voice, which is decent only for English).
#dataset
Forwarded from Silero News (Alexander)
Faster Mirror Links for Open STT
Added Zenodo mirror links for Open STT - https://github.com/snakers4/open_stt#links
Zenodo provides sort-of free hosting for non-commercial / academic projects.
Their support has a very ivory tower attitude, but beggars cannot be choosers.
#dataset