NLP - Highlight of the week - LASER

- Hm, a new sentence embedding tool?
- Plain PyTorch 1.0 / numpy / FAISS based;
- [Release](https://code.fb.com/ai-research/laser-multilingual-sentence-embeddings/), [library](https://github.com/facebookresearch/LASER);
- Looks like an offshoot of their "unsupervised" NMT project;

From the release:

LASER’s vector representations of sentences are generic with respect to both the input language and the NLP task. The tool maps a sentence in any language to a point in a high-dimensional space with the goal that the same statement in any language will end up in the same neighborhood. This representation could be seen as a universal language in a semantic vector space. We have observed that the distance in that space correlates very well with the semantic closeness of the sentences.
Alleged pros:
- Delivers extremely fast performance, processing up to 2,000 sentences per second on GPU;
- The sentence encoder is implemented in PyTorch with minimal external dependencies;
- Languages with limited resources can benefit from joint training over many languages;
- Supports the use of multiple languages in one sentence;
- Performance improves as new languages are added, as the system learns to recognize characteristics of language families;
They essentially trained an NMT model with a shared encoder for many languages.

I tried training something similar, but it quickly overfitted into just memorizing word indices.
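
For intuition, here is a toy sketch of that shared-encoder idea - a single BiLSTM encoder max-pooled over time into a fixed-size sentence vector, roughly as in the LASER paper (sizes and names here are illustrative, not taken from their code):

import torch
import torch.nn as nn

class SharedSentenceEncoder(nn.Module):
    # One encoder for all languages: the input is just token ids from a
    # shared (e.g. BPE) vocabulary; the language itself is never an input
    def __init__(self, vocab_size=50000, emb_dim=320, hidden_dim=512):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, emb_dim)
        self.lstm = nn.LSTM(emb_dim, hidden_dim,
                            bidirectional=True, batch_first=True)

    def forward(self, tokens):
        # tokens: (batch, seq_len)
        states, _ = self.lstm(self.embed(tokens))  # (batch, seq_len, 2 * hidden_dim)
        return states.max(dim=1).values            # max-pool over time -> sentence vector

The NMT decoder trained on top of such an encoder is what forces translations of the same sentence to land close together in the embedding space.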

#nlp
#deep_learning

Downsides of using Common Crawl

Took a look at the Common Crawl data I pre-processed myself last year and could not find paragraphs - only sentences.

Also took a look at these archives - http://data.statmt.org/ngrams/deduped/ - again only sentences, though they sometimes seem to be in logical order.

You can use any form of CC - but only to learn word representations. Not sentences.
Sad.

#nlp
Old news ... but Attention works

Funnily enough, in the past my models:
- Either did not need attention at all;
- Or the attention was implemented by @thinline72;
- Or the domain (NMT) was so complicated that I had to resort to boilerplate with key-value attention;

This was the first time I / we tried manually building a model with plain self-attention from scratch.

And you know - it really adds 5-10% to all of the tracked metrics.

Best plain attention layer in PyTorch - simple, well documented... and it works in real-life applications:
https://gist.github.com/cbaziotis/94e53bdd6e4852756e0395560ff38aa4
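
For reference, a minimal sketch of such a plain (additive) self-attention pooling layer - names and defaults here are mine, not the gist's:

import torch
import torch.nn as nn
import torch.nn.functional as F

class SelfAttention(nn.Module):
    # Scores each timestep, softmaxes the scores, and returns the
    # weighted sum of the hidden states plus the attention weights
    def __init__(self, hidden_dim):
        super().__init__()
        self.scorer = nn.Linear(hidden_dim, 1, bias=False)

    def forward(self, x, mask=None):
        # x: (batch, seq_len, hidden_dim); mask: (batch, seq_len), 1 = real token
        scores = self.scorer(x).squeeze(-1)
        if mask is not None:
            scores = scores.masked_fill(mask == 0, float('-inf'))
        weights = F.softmax(scores, dim=-1)
        return (x * weights.unsqueeze(-1)).sum(dim=1), weights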

#nlp
#deep_learning
PyTorch NLP best practices

Very simple ideas, actually.

(1) Multi-GPU parallelization and FP16 training

Do not bother reinventing the wheel.
Just use NVIDIA's apex, DistributedDataParallel, or DataParallel.
Best examples [here](https://github.com/huggingface/pytorch-pretrained-BERT).
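
A minimal sketch of how these pieces fit together (apex API as of early 2019; the tiny model and data here are stand-ins for your real ones):

import torch
import torch.nn as nn
from apex import amp  # https://github.com/nvidia/apex

model = nn.Linear(512, 2).cuda()                 # stand-in for your real model
optimizer = torch.optim.Adam(model.parameters())

# FP16 / mixed-precision training handled by apex
model, optimizer = amp.initialize(model, optimizer, opt_level='O1')

# single-node multi-GPU data parallelism in one line
if torch.cuda.device_count() > 1:
    model = nn.DataParallel(model)

inputs = torch.randn(32, 512).cuda()
targets = torch.randint(0, 2, (32,)).cuda()
loss = nn.functional.cross_entropy(model(inputs), targets)

# scale the loss so that FP16 gradients do not underflow
with amp.scale_loss(loss, optimizer) as scaled_loss:
    scaled_loss.backward()
optimizer.step()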

(2) Put as much as possible INSIDE of the model

Implement as much of your logic as possible inside of nn.Module.
Why?
So that you can seamlessly use all the abstractions from (1).
Also, models become more abstract and reusable in general.
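
One way to read this advice (my own toy example): even the loss can live inside forward(), so the parallel wrappers from (1) replicate the whole computation across GPUs for free:

import torch.nn as nn

class ModelWithLoss(nn.Module):
    # bundling the backbone and the criterion means DataParallel / DDP
    # split everything, not just the feature extraction
    def __init__(self, backbone):
        super().__init__()
        self.backbone = backbone
        self.criterion = nn.CrossEntropyLoss()

    def forward(self, inputs, targets):
        logits = self.backbone(inputs)
        return self.criterion(logits, targets)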

(3) Why have a separate train/val loop?

PyTorch 0.4 introduced context managers.

You can simplify your train / val / test loops, and merge them into one simple function.

import torch
from tqdm import tqdm

def run_loop(model, loader, loop_type='Train'):
    # gradients are needed only when training
    context = torch.no_grad() if loop_type == 'Val' else torch.enable_grad()

    if loop_type == 'Train':
        model.train()
    elif loop_type == 'Val':
        model.eval()

    with context:
        for i, batch in enumerate(tqdm(loader)):
            # do your stuff here
            pass

(4) EmbeddingBag

Use EmbeddingBag layer for morphologically rich languages. Seriously!
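
A minimal usage sketch of nn.EmbeddingBag (sizes are made up): each word is represented as a bag of its subword / n-gram ids, and the layer averages their embeddings in one shot:

import torch
import torch.nn as nn

bag = nn.EmbeddingBag(num_embeddings=10000, embedding_dim=50, mode='mean')

# flat tensor of subword ids + offsets marking where each "bag" (word) starts
ids = torch.tensor([1, 2, 4, 5, 4, 3, 2, 9])
offsets = torch.tensor([0, 4])        # bag 1 = ids[0:4], bag 2 = ids[4:8]
vectors = bag(ids, offsets)           # (2, 50): one averaged vector per bag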

(5) Writing trainers / training abstractions

This is a waste of time imho if you follow (1), (2) and (3).

(6) Nice bonus

If you follow most of these, you can train on as many GPUs and machines as you want, for any language.

(7) Using tensorboard for logging

This goes without saying.
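
E.g. via tensorboardX (a common choice for plain PyTorch at the time; path and tags here are arbitrary):

from tensorboardX import SummaryWriter

writer = SummaryWriter('runs/my_experiment')
for step in range(100):
    loss = 1.0 / (step + 1)                 # stand-in for your real loss
    writer.add_scalar('loss/train', loss, step)
writer.close()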

#nlp
#deep_learning
Normalization techniques other than batch norm:
(https://pics.spark-in.me/upload/aecc2c5fb356b6d803b4218fcb0bc3ec.png)

Weight normalization (used in TCN http://arxiv.org/abs/1602.07868):
- Decouples length of weight vectors from their direction;
- Does not introduce any dependencies between the examples in a minibatch;
- Can be applied successfully to recurrent models such as LSTMs;
- Tested only on small datasets (CIFAR + VAEs + DQN);
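
In PyTorch this is a one-line wrapper around any parameterized layer, reparameterizing the weight as w = g * v / ||v||:

import torch.nn as nn
from torch.nn.utils import weight_norm

# magnitude g and direction v / ||v|| become separate trainable parameters
conv = weight_norm(nn.Conv1d(in_channels=50, out_channels=50, kernel_size=3))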


Instance norm (used in [style transfer](https://arxiv.org/abs/1607.08022))
- Proposed for style transfer;
- Is essentially batch norm for a single image;
- The mean and standard-deviation are calculated per-dimension separately for each object in a mini-batch;
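
In PyTorch (channel count is illustrative):

import torch
import torch.nn as nn

# normalizes each channel of each image by that image's own mean / std
norm = nn.InstanceNorm2d(num_features=64)
images = torch.randn(8, 64, 32, 32)
out = norm(images)                 # same shape, per-image statistics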

Layer norm (used in Transformers, [paper](https://arxiv.org/abs/1607.06450))
- Designed especially for sequential networks;
- Computes the mean and variance used for normalization from all of the summed inputs to the neurons in a layer on a single training case;
- The mean and standard deviation are calculated separately over the last certain number of dimensions;
- Unlike Batch Normalization and Instance Normalization, which apply scalar scale and bias for each entire channel/plane with the affine option, Layer Normalization applies per-element scale and bias;
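
In PyTorch (shapes are illustrative):

import torch
import torch.nn as nn

# normalize over the last dimension - per token, not per batch, so it
# behaves the same at train and inference time for any batch size
norm = nn.LayerNorm(normalized_shape=512)
tokens = torch.randn(8, 100, 512)  # (batch, seq_len, features)
out = norm(tokens)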

#deep_learning
#nlp
In reply to an earlier post: "Sentiment datasets in Russian. Just randomly found several links. - http://study.mokoron.com/ - annotated tweets - http://text-machine.cs.uml.edu/projects/rusentiment/ - some more posts from VK - https://github.com/dkulagin/kartaslov/tree/master/dataset/emo_dict…"
Russian sentiment dataset

In a typical Russian fashion, one of these datasets was deleted at the request of bad people, whom I shall not name.
Luckily, some anonymous user backed the dataset up.
Anyway - use it.

Yeah, it is small. But it is free, so whatever.

#nlp
#data_science
Miniaturize / optimize your ... NLP models?

For CV applications there are literally dozens of ways to make your models smaller.
And yeah, I do not mean "moonshots" or special limited libraries (matrix decompositions, custom pruning, etc.).
I mean cheap and dirty hacks that work in 95% of cases regardless of your stack / device / framework:
- Smaller images (x3-x4 easy);
- FP16 inference (30-40% maybe);
- Knowledge distillation into smaller networks (x3-x10);
- Naïve cascade optimizations (feed only every Nth frame, based on some heuristic);

But what can you do with NLP networks?
Turns out not much.

But here are my ideas:
- Use a simpler model - embedding bag + plain self-attention + LSTM can solve 90% of tasks;

- Decrease embedding size from 300 to 50 (or maybe even more). Tried and tested, works like a charm. For harder tasks you lose just 1-3pp of your target metric, for smaller tasks - it is just the same;

- FP16 inference is supported in PyTorch for nn.Embedding, but not for nn.EmbeddingBag (see the sketch after this list). You get the idea;

_embedding_bag is not implemented for type torch.HalfTensor

- You can try distilling your vocabulary / embedding-bag model into a char level model. If it works, you can trade model size vs. inference time;

- If you have very long sentences or large batches - try distilling / swapping your recurrent network with a CNN / TCN. This way you can also trade model size vs. inference time but probably in a different direction;
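
A sketch of the FP16 point above, for the PyTorch versions current as of this post (the nn.EmbeddingBag limitation is exactly the error quoted in the list):

import torch
import torch.nn as nn

emb = nn.Embedding(10000, 50).cuda().half()        # FP16 works for nn.Embedding
tokens = torch.randint(0, 10000, (4, 16)).cuda()
with torch.no_grad():
    vectors = emb(tokens)                          # float16 output

# the same .half() trick on nn.EmbeddingBag raised the quoted error at the time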

#nlp
#deep_learning