Dependency parsing and POS tagging in Russian
A less popular set of NLP tasks.
Popular tools are reviewed here:
https://habr.com/ru/company/sberbank/blog/418701/
Only morphology:
(0) The well-known pymorphy2 package;
Only POS tags and morphology:
(0) https://github.com/IlyaGusev/rnnmorph (easy to use);
(1) https://github.com/nlpub/pymystem3 (easy to use; see the sketch after this list);
Full dependency parsing:
(0) Russian spacy plugin:
- https://github.com/buriy/spacy-ru - installation
- https://github.com/buriy/spacy-ru/blob/master/examples/POS_and_syntax.ipynb - usage with examples
(1) Malt-parser-based solution (drawback: no examples)
- https://github.com/oxaoo/mp4ru
(2) Google's syntaxnet
- https://github.com/tensorflow/models/tree/master/research/syntaxnet
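For reference, a minimal sketch of pymystem3 usage (assuming `pip install pymystem3`; analyze() returns per-token lemma and grammar-tag hypotheses):
```python
from pymystem3 import Mystem

m = Mystem()
# analyze() returns one dict per token; 'analysis' holds lemma ('lex')
# and grammar tag ('gr') hypotheses for that token
for token in m.analyze("Мама мыла раму"):
    for parse in token.get("analysis", []):
        print(token["text"], parse["lex"], parse["gr"])
```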
#nlp
Habr
Exploring syntactic parsers for the Russian language
Hi! My name is Denis Kiryanov, I work at Sberbank on natural language processing (NLP) problems. At one point we needed to choose a syntactic parser for working with Russian...
LSTM vs TCN vs Trellis network
- Did not try the Trellis network - decided it was too complex;
- All the TCN properties from the digest https://spark-in.me/post/2018_ds_ml_digest_31 hold - did not test on very long sequences;
- Looks like a really simple and reasonable alternative to RNNs for modeling and ensembling;
- On a sensible benchmark it performs mostly the same as an LSTM from a practical standpoint; a minimal block sketch follows after the link below;
https://github.com/locuslab/TCN/blob/master/TCN/tcn.py
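A minimal sketch of the causal dilated-convolution residual block that the repo implements (simplified: weight norm and dropout omitted, so this is not the exact reference implementation):
```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class CausalConv1d(nn.Module):
    """Dilated 1D conv, left-padded so output[t] sees only input[:t + 1]."""
    def __init__(self, channels, kernel_size, dilation):
        super().__init__()
        self.pad = (kernel_size - 1) * dilation
        self.conv = nn.Conv1d(channels, channels, kernel_size, dilation=dilation)

    def forward(self, x):  # x: (batch, channels, time)
        return self.conv(F.pad(x, (self.pad, 0)))  # pad only on the left

class TCNBlock(nn.Module):
    """Two causal convs with a residual connection, as in the reference TCN."""
    def __init__(self, channels, kernel_size=3, dilation=1):
        super().__init__()
        self.net = nn.Sequential(
            CausalConv1d(channels, kernel_size, dilation), nn.ReLU(),
            CausalConv1d(channels, kernel_size, dilation),
        )

    def forward(self, x):
        return torch.relu(self.net(x) + x)
```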
#deep_learning
Spark in me
2018 DS/ML digest 31
Author's articles - http://spark-in.me/author/snakers41
Blog - http://spark-in.me
Tracking your hardware ... for data science
For a long time I thought that if you really want to track all your servers' metrics, you need Zabbix (which is very complicated).
A friend recommended an amazing tool to me:
- https://prometheus.io/docs/guides/node-exporter/
It installs and runs literally in minutes.
If you want it to auto-start properly, there are even slightly older Ubuntu packages and systemd examples:
- https://github.com/prometheus/node_exporter/tree/master/examples/systemd
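To quickly check that the exporter is up, a tiny sketch (assuming node_exporter runs on its default port 9100 on localhost):
```python
import urllib.request

# node_exporter serves plain-text metrics at :9100/metrics by default
with urllib.request.urlopen("http://localhost:9100/metrics") as resp:
    for line in resp.read().decode().splitlines():
        # e.g. node_load1 is the 1-minute load average
        if line.startswith("node_load1 "):
            print(line)
```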
Dockerized metric exporters for GPUs by Nvidia
- https://github.com/NVIDIA/gpu-monitoring-tools/tree/master/exporters/prometheus-dcgm
It also has extensive alerting features, but they are hard to get started with, since there is no minimal example:
- https://prometheus.io/docs/alerting/overview/
- https://github.com/prometheus/docs/issues/581
#linux
prometheus.io
Monitoring Linux host metrics with the Node Exporter | Prometheus
An open-source monitoring system with a dimensional data model, flexible query language, efficient time series database and modern alerting approach.
Anyone knows anyone from TopCoder?
As usual with competition platforms, the organization sometimes has its issues.
Forwarded from Anna
Hi!
In case you didn't know: besides the prize money for the top places, the satellite competition also had one more cool feature - a student's prize, a prize for the _student_ with the highest score. It all turned out to be rather murky; there was no separate leaderboard for students. For a long time I tried to reach the admins - wrote to their email and on the forum to find out more details. A month later an admin finally replied that I was the only candidate for the prize and that, supposedly, there were no problems at all, we are sorting it out, send over your student ID. And then he disappeared again. I periodically reminded them of my existence and asked how things were going and whether there was any progress - and got ignored. *There is still no reply.* This is my first time participating in a serious competition, and I don't quite understand what can be done in this situation. Wait for news? Write posts on Twitter? Is there any way to reach the admins?
Also, I wrote a small article about my solution here. https://spark-in.me/post/spacenet4
5th 2019 DS / ML digest
Highlights of the week
- New Adam version;
- POS tagging and semantic parsing in Russian;
- ML industrialization again;
https://spark-in.me/post/2019_ds_ml_digest_05
#digest
#data_science
#deep_learning
Spark in me
2019 DS/ML digest 05
Author's articles - http://spark-in.me/author/snakers41
Blog - http://spark-in.me
Russian STT datasets
Does anyone know more proper datasets?
I found this (60 hours), but I could not find the link to the dataset:
http://www.lrec-conf.org/proceedings/lrec2010/pdf/274_Paper.pdf
Anyway, here is the list I found:
- 20 hours of the Bible: https://github.com/festvox/datasets-CMU_Wilderness;
- https://www.kaggle.com/bryanpark/russian-single-speaker-speech-dataset - does not say how many hours;
- Of course, audio-book datasets - https://www.caito.de/data/Training/stt_tts/ - plus some scraping scripts: https://github.com/ainy/shershe/tree/master/scripts;
- And some disappointment here: https://voice.mozilla.org/ru/languages
#deep_learning
Inception v1 layers visualized on a map
A joint work by Google and OpenAI:
https://distill.pub/2019/activation-atlas/
https://distill.pub/2019/activation-atlas/app.html
https://blog.openai.com/introducing-activation-atlases/
https://ai.googleblog.com/2019/03/exploring-neural-networks.html
TLDR (a rough sketch of the pipeline follows after this list):
- Take 1M random images;
- Feed them to a CNN, collect some spatial activations;
- Produce a corresponding idealized image that would result in such an activation;
- Plot in 2D (via UMAP), add a grid, averaging, etc.;
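A rough sketch of steps 1, 2 and 4 (assumptions: torchvision's GoogLeNet stands in for Inception v1, random tensors stand in for the 1M real images, and umap-learn is installed; the feature-inversion step 3 is omitted entirely):
```python
import torch
import torchvision
import umap  # umap-learn, for the 2D projection

model = torchvision.models.googlenet(pretrained=True).eval()
acts = []

def hook(module, inp, out):
    # out: (batch, C, H, W); sample one random spatial position per image
    b, c, h, w = out.shape
    ys, xs = torch.randint(h, (b,)), torch.randint(w, (b,))
    acts.append(out[torch.arange(b), :, ys, xs].detach())

model.inception4d.register_forward_hook(hook)  # a mid-level mixed layer
with torch.no_grad():
    model(torch.randn(32, 3, 224, 224))  # stand-in for real image batches

# step 4: project the collected (N, C) activations down to 2D
embedding = umap.UMAP(n_components=2).fit_transform(torch.cat(acts).numpy())
```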
#deep_learning
Distill
Activation Atlas
By using feature inversion to visualize millions of activations from an image classification network, we create an explorable activation atlas of features the network has learned and what concepts it typically represents.
Our experiments with Transformers, BERT and generative language pre-training
TLDR
For morphologically rich languages, pre-trained Transformers are not a silver bullet; from a layman's perspective, they are not feasible unless someone invests huge computational resources into sub-word tokenization methods that work well, plus actually training these large networks.
On the other hand we have definitively shown that:
- Starting a transformer with an embedding bag initialized via FastText works and is relatively feasible (a minimal sketch follows below);
- On complicated tasks, such a transformer significantly outperforms training from scratch (as well as naive models) and shows decent results compared to state-of-the-art specialized models;
- Pre-training worked, but it overfitted more than FastText initialization, and given the complexity required for such pre-training, it is not useful;
https://spark-in.me/post/bert-pretrain-ru
All in all, this was a relatively large gamble which did not pay off - on a more down-to-earth task that we hoped the Transformer would excel at, it did not.
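A minimal sketch of the FastText-initialized embedding bag (the file name and the sub-word id layout are hypothetical; the vectors could come from e.g. gensim's model.wv.vectors):
```python
import numpy as np
import torch
import torch.nn as nn

# Hypothetical export: a (vocab_size, dim) matrix of sub-word vectors
# taken from a trained FastText model
vectors = np.load("fasttext_subword_vectors.npy")
vocab_size, dim = vectors.shape

bag = nn.EmbeddingBag(vocab_size, dim, mode="mean")
with torch.no_grad():
    bag.weight.copy_(torch.from_numpy(vectors))

# Each word is a bag of sub-word ids; offsets mark where each word starts
ids = torch.tensor([3, 17, 42, 5, 9])  # sub-word ids for two words
offsets = torch.tensor([0, 3])         # word 1 = ids[0:3], word 2 = ids[3:]
word_vectors = bag(ids, offsets)       # (2, dim), fed into the transformer
```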
#deep_learning
An approach to ranking search results with no annotation
Just a small article with a novel idea:
- Instead of training a network with cross-entropy (CE), just train it with binary cross-entropy (BCE) - a minimal sketch of the swap follows below;
- Source additional structure from the inner structure of your domain (tags, matrix decomposition methods, heuristics, etc.);
https://spark-in.me/post/classifier-result-sorting
Works best if your ontology is relatively simple.
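A minimal sketch of the CE-to-BCE swap (toy shapes; the ranking-by-sigmoid step is my reading of the idea, see the article for the actual setup):
```python
import torch
import torch.nn as nn

logits = torch.randn(4, 10)   # 4 queries, 10 classes / tags
targets = torch.zeros(4, 10)
targets[0, [1, 3]] = 1.0      # BCE allows multi-label targets

# CE would force one winning class per row; BCE scores every class
# independently, so the per-class sigmoids can rank results directly
loss = nn.BCEWithLogitsLoss()(logits, targets)
scores = torch.sigmoid(logits)  # per-class ranking scores
```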
#deep_learning
Spark in me
Learning to rank search results without annotation
Solving the search ranking problem
Author's articles - http://spark-in.me/author/adamnsandle
Blog - http://spark-in.me
Forwarded from Just links
www.callingbullshit.org
Calling Bullshit: Data Reasoning in a Digital World
The world is awash in bullshit. Politicians are unconstrained by facts. Science is conducted by press release. Higher education rewards bullshit over analytic thought. Startup culture elevates bullshit to high art. Advertisers wink conspiratorially and invite…
Forwarded from Just links
arXiv.org
Bag of Tricks for Image Classification with Convolutional Neural Networks
Much of the recent progress made in image classification research can be credited to training procedure refinements, such as changes in data augmentations and optimization methods. In the...
Forwarded from Just links
DropBlock: A regularization method for convolutional networks https://arxiv.org/abs/1810.12890
Our Transformer post was featured by Towards Data Science
https://medium.com/p/complexity-generalization-computational-cost-in-nlp-modeling-of-morphologically-rich-languages-7fa2c0b45909?source=email-f29885e9bef3--writer.postDistributed&sk=a56711f1436d60283d4b672466ba258b
#nlp
Towards Data Science
Comparing complex NLP models for complex languages on a set of real tasks
Transformer is not yet really usable in practice for languages with rich morphology, but we take the first step in this direction