Dependency parsing and POS tagging in Russian
A less popular set of NLP tasks.
Popular tools are reviewed here:
https://habr.com/ru/company/sberbank/blog/418701/
Only morphology:
(0) The well-known pymorphy2 package;
Only POS tags and morphology:
(0) https://github.com/IlyaGusev/rnnmorph (easy to use);
(1) https://github.com/nlpub/pymystem3 (easy to use; see the sketch after this list);
Full dependency parsing:
(0) Russian spacy plugin:
- https://github.com/buriy/spacy-ru - installation
- https://github.com/buriy/spacy-ru/blob/master/examples/POS_and_syntax.ipynb - usage with examples
(1) Malt-parser-based solution (drawback: no examples)
- https://github.com/oxaoo/mp4ru
(2) Google's syntaxnet
- https://github.com/tensorflow/models/tree/master/research/syntaxnet
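For reference, a minimal sketch of pymystem3 usage (assuming `pip install pymystem3`; analyze() returns per-token lemma and grammar-tag hypotheses):
```python
from pymystem3 import Mystem

m = Mystem()
# analyze() returns one dict per token; 'analysis' holds lemma ('lex')
# and grammar tag ('gr') hypotheses for that token
for token in m.analyze("Мама мыла раму"):
    for parse in token.get("analysis", []):
        print(token["text"], parse["lex"], parse["gr"])
```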
#nlp
Habr
Exploring syntactic parsers for the Russian language
Hi! My name is Denis Kiryanov, I work at Sberbank on natural language processing (NLP) problems. At one point we needed to choose a syntactic parser for working with Russian...
LSTM vs TCN vs Trellis network
- Did not try the Trellis network - decided it was too complex;
- All the TCN properties from the digest https://spark-in.me/post/2018_ds_ml_digest_31 hold - did not test on very long sequences;
- Looks like a really simple and reasonable alternative to RNNs for modeling and ensembling;
- On a sensible benchmark it performs mostly the same as an LSTM from a practical standpoint; a minimal block sketch follows after the link below;
https://github.com/locuslab/TCN/blob/master/TCN/tcn.py
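A minimal sketch of the causal dilated-convolution residual block that the repo implements (simplified: weight norm and dropout omitted, so this is not the exact reference implementation):
```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class CausalConv1d(nn.Module):
    """Dilated 1D conv, left-padded so output[t] sees only input[:t + 1]."""
    def __init__(self, channels, kernel_size, dilation):
        super().__init__()
        self.pad = (kernel_size - 1) * dilation
        self.conv = nn.Conv1d(channels, channels, kernel_size, dilation=dilation)

    def forward(self, x):  # x: (batch, channels, time)
        return self.conv(F.pad(x, (self.pad, 0)))  # pad only on the left

class TCNBlock(nn.Module):
    """Two causal convs with a residual connection, as in the reference TCN."""
    def __init__(self, channels, kernel_size=3, dilation=1):
        super().__init__()
        self.net = nn.Sequential(
            CausalConv1d(channels, kernel_size, dilation), nn.ReLU(),
            CausalConv1d(channels, kernel_size, dilation),
        )

    def forward(self, x):
        return torch.relu(self.net(x) + x)
```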
#deep_learning
Spark in me
2018 DS/ML digest 31
Author's articles - http://spark-in.me/author/snakers41
Blog - http://spark-in.me
Tracking your hardware ... for data science
For a long time I thought that if you really want to track all your servers' metrics, you need Zabbix (which is very complicated).
A friend recommended an amazing tool to me:
- https://prometheus.io/docs/guides/node-exporter/
It installs and runs literally in minutes.
If you want it to auto-start properly, there are even slightly older Ubuntu packages and systemd examples:
- https://github.com/prometheus/node_exporter/tree/master/examples/systemd
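To quickly check that the exporter is up, a tiny sketch (assuming node_exporter runs on its default port 9100 on localhost):
```python
import urllib.request

# node_exporter serves plain-text metrics at :9100/metrics by default
with urllib.request.urlopen("http://localhost:9100/metrics") as resp:
    for line in resp.read().decode().splitlines():
        # e.g. node_load1 is the 1-minute load average
        if line.startswith("node_load1 "):
            print(line)
```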
Dockerized metric exporters for GPUs by Nvidia
- https://github.com/NVIDIA/gpu-monitoring-tools/tree/master/exporters/prometheus-dcgm
It also has extensive alerting features, but they are hard to get started with, since there is no minimal example:
- https://prometheus.io/docs/alerting/overview/
- https://github.com/prometheus/docs/issues/581
#linux
prometheus.io
Monitoring Linux host metrics with the Node Exporter | Prometheus
An open-source monitoring system with a dimensional data model, flexible query language, efficient time series database and modern alerting approach.
Anyone knows anyone from TopCoder?
As usual with competition platforms, the organization sometimes has its issues.
Forwarded from Anna
Hi!
In case you didn't know: besides the prize money for the top places, the satellite competition also had one more cool feature - a student's prize, a prize for the _student_ with the highest score. It all turned out to be rather murky; there was no separate leaderboard for students. For a long time I tried to reach the admins - wrote to their email and on the forum to find out more details. A month later an admin finally replied that I was the only candidate for the prize and that, supposedly, there were no problems at all, we are sorting it out, send over your student ID. And then he disappeared again. I periodically reminded them of my existence and asked how things were going and whether there was any progress - and got ignored. *There is still no reply.* This is my first time participating in a serious competition, and I don't quite understand what can be done in this situation. Wait for news? Write posts on Twitter? Is there any way to reach the admins?
Also, I wrote a small article about my solution here. https://spark-in.me/post/spacenet4
5th 2019 DS / ML digest
Highlights of the week
- New Adam version;
- POS tagging and semantic parsing in Russian;
- ML industrialization again;
https://spark-in.me/post/2019_ds_ml_digest_05
#digest
#data_science
#deep_learning
Spark in me
2019 DS/ML digest 05
Author's articles - http://spark-in.me/author/snakers41
Blog - http://spark-in.me
Russian STT datasets
Does anyone know more proper datasets?
I found this (60 hours), but I could not find the link to the dataset:
http://www.lrec-conf.org/proceedings/lrec2010/pdf/274_Paper.pdf
Anyway, here is the list I found:
- 20 hours of the Bible: https://github.com/festvox/datasets-CMU_Wilderness;
- https://www.kaggle.com/bryanpark/russian-single-speaker-speech-dataset - does not say how many hours;
- Of course, audio-book datasets - https://www.caito.de/data/Training/stt_tts/ - plus some scraping scripts: https://github.com/ainy/shershe/tree/master/scripts;
- And some disappointment here: https://voice.mozilla.org/ru/languages
#deep_learning
Inception v1 layers visualized on a map
A joint work by Google and OpenAI:
https://distill.pub/2019/activation-atlas/
https://distill.pub/2019/activation-atlas/app.html
https://blog.openai.com/introducing-activation-atlases/
https://ai.googleblog.com/2019/03/exploring-neural-networks.html
TLDR (a rough sketch of the pipeline follows after this list):
- Take 1M random images;
- Feed them to a CNN, collect some spatial activations;
- Produce a corresponding idealized image that would result in such an activation;
- Plot in 2D (via UMAP), add a grid, averaging, etc.;
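A rough sketch of steps 1, 2 and 4 (assumptions: torchvision's GoogLeNet stands in for Inception v1, random tensors stand in for the 1M real images, and umap-learn is installed; the feature-inversion step 3 is omitted entirely):
```python
import torch
import torchvision
import umap  # umap-learn, for the 2D projection

model = torchvision.models.googlenet(pretrained=True).eval()
acts = []

def hook(module, inp, out):
    # out: (batch, C, H, W); sample one random spatial position per image
    b, c, h, w = out.shape
    ys, xs = torch.randint(h, (b,)), torch.randint(w, (b,))
    acts.append(out[torch.arange(b), :, ys, xs].detach())

model.inception4d.register_forward_hook(hook)  # a mid-level mixed layer
with torch.no_grad():
    model(torch.randn(32, 3, 224, 224))  # stand-in for real image batches

# step 4: project the collected (N, C) activations down to 2D
embedding = umap.UMAP(n_components=2).fit_transform(torch.cat(acts).numpy())
```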
#deep_learning
Distill
Activation Atlas
By using feature inversion to visualize millions of activations from an image classification network, we create an explorable activation atlas of features the network has learned and what concepts it typically represents.
Our experiments with Transformers, BERT and generative language pre-training
TLDR
For morphologically rich languages, pre-trained Transformers are not a silver bullet; from a layman's perspective, they are not feasible unless someone invests huge computational resources into sub-word tokenization methods that work well, plus actually training these large networks.
On the other hand we have definitively shown that:
- Starting a transformer with an embedding bag initialized via FastText works and is relatively feasible (a minimal sketch follows below);
- On complicated tasks, such a transformer significantly outperforms training from scratch (as well as naive models) and shows decent results compared to state-of-the-art specialized models;
- Pre-training worked, but it overfitted more than FastText initialization, and given the complexity required for such pre-training, it is not useful;
https://spark-in.me/post/bert-pretrain-ru
All in all, this was a relatively large gamble which did not pay off - on a more down-to-earth task that we hoped the Transformer would excel at, it did not.
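A minimal sketch of the FastText-initialized embedding bag (the file name and the sub-word id layout are hypothetical; the vectors could come from e.g. gensim's model.wv.vectors):
```python
import numpy as np
import torch
import torch.nn as nn

# Hypothetical export: a (vocab_size, dim) matrix of sub-word vectors
# taken from a trained FastText model
vectors = np.load("fasttext_subword_vectors.npy")
vocab_size, dim = vectors.shape

bag = nn.EmbeddingBag(vocab_size, dim, mode="mean")
with torch.no_grad():
    bag.weight.copy_(torch.from_numpy(vectors))

# Each word is a bag of sub-word ids; offsets mark where each word starts
ids = torch.tensor([3, 17, 42, 5, 9])  # sub-word ids for two words
offsets = torch.tensor([0, 3])         # word 1 = ids[0:3], word 2 = ids[3:]
word_vectors = bag(ids, offsets)       # (2, dim), fed into the transformer
```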
#deep_learning
An approach to ranking search results with no annotation
Just a small article with a novel idea:
- Instead of training a network with cross-entropy (CE), just train it with binary cross-entropy (BCE) - a minimal sketch of the swap follows below;
- Source additional structure from the inner structure of your domain (tags, matrix decomposition methods, heuristics, etc.);
https://spark-in.me/post/classifier-result-sorting
Works best if your ontology is relatively simple.
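A minimal sketch of the CE-to-BCE swap (toy shapes; the ranking-by-sigmoid step is my reading of the idea, see the article for the actual setup):
```python
import torch
import torch.nn as nn

logits = torch.randn(4, 10)   # 4 queries, 10 classes / tags
targets = torch.zeros(4, 10)
targets[0, [1, 3]] = 1.0      # BCE allows multi-label targets

# CE would force one winning class per row; BCE scores every class
# independently, so the per-class sigmoids can rank results directly
loss = nn.BCEWithLogitsLoss()(logits, targets)
scores = torch.sigmoid(logits)  # per-class ranking scores
```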
#deep_learning
Spark in me
Learning to rank search results without annotation
Solving the search ranking problem
Author's articles - http://spark-in.me/author/adamnsandle
Blog - http://spark-in.me
Forwarded from Just links
www.callingbullshit.org
Calling Bullshit: Data Reasoning in a Digital World
The world is awash in bullshit. Politicians are unconstrained by facts. Science is conducted by press release. Higher education rewards bullshit over analytic thought. Startup culture elevates bullshit to high art. Advertisers wink conspiratorially and invite…
Forwarded from Just links
arXiv.org
Bag of Tricks for Image Classification with Convolutional Neural Networks
Much of the recent progress made in image classification research can be credited to training procedure refinements, such as changes in data augmentations and optimization methods. In the...
Forwarded from Just links
DropBlock: A regularization method for convolutional networks https://arxiv.org/abs/1810.12890
Our Transformer post was featured by Towards Data Science
https://medium.com/p/complexity-generalization-computational-cost-in-nlp-modeling-of-morphologically-rich-languages-7fa2c0b45909?source=email-f29885e9bef3--writer.postDistributed&sk=a56711f1436d60283d4b672466ba258b
#nlp
Towards Data Science
Comparing complex NLP models for complex languages on a set of real tasks
Transformer is not yet really usable in practice for languages with rich morphology, but we take the first step in this direction