NLP - Highlight of the week - LASER
- Hm, a new sentence embedding tool?
- Plain PyTorch 1.0 / numpy / FAISS based;
- [Release](https://code.fb.com/ai-research/laser-multilingual-sentence-embeddings/), [library](https://github.com/facebookresearch/LASER);
- Looks like an off-shoot of their "unsupervised" NMT project;
- Alleged pros:
"LASER's vector representations of sentences are generic with respect to both the input language and the NLP task. The tool maps a sentence in any language to a point in a high-dimensional space, with the goal that the same statement in any language will end up in the same neighborhood. This representation could be seen as a universal language in a semantic vector space. We have observed that the distance in that space correlates very well with the semantic closeness of the sentences."
- It delivers extremely fast performance, processing up to 2,000 sentences per second on GPU;
- The sentence encoder is implemented in PyTorch with minimal external dependencies;
- Languages with limited resources can benefit from joint training over many languages;
- The model supports the use of multiple languages in one sentence;
- Performance improves as new languages are added, as the system learns to recognize characteristics of language families;
They essentially trained an NMT model with a shared encoder for many languages.
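As far as I understand, a LASER-style shared encoder turns a variable-length sentence into a fixed-size vector by max-pooling its hidden states over time. A pure-Python sketch of just that pooling step (function and variable names are mine, the encoder itself is omitted):

```python
def max_pool_states(states):
    # Element-wise max over timesteps: condenses a variable-length sequence
    # of encoder hidden states into one fixed-size sentence vector.
    return [max(col) for col in zip(*states)]

# three timesteps, hidden size 4 -> one 4-dim sentence vector
sentence_vec = max_pool_states([[0.1, -0.2, 0.3, 0.0],
                                [0.5, 0.1, -0.1, 0.2],
                                [-0.3, 0.4, 0.2, 0.1]])
```

Sentences from different languages run through the same encoder, so their pooled vectors land in the same space.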
I tried training something similar, but it quickly overfitted into just memorizing the indexes of words.
#nlp
#deep_learning
Downsides of using Common Crawl
Took a look at the Common Crawl data I pre-processed myself last year - could not find abstracts, only sentences.
Took a look at these archives - http://data.statmt.org/ngrams/deduped/ - also only sentences, though they sometimes seem to be in logical order.
You can use any form of CC - but only to learn word representations. Not sentences.
Sad.
#nlp
Old news ... but Attention works
Funny enough, in the past my models:
- Either did not need attention;
- Or attention was implemented by @thinline72;
- Or the domain was so complicated (NMT) that I had to resort to boilerplate with key-value attention;
It was the first time I / we tried manually building a model with plain self-attention from scratch.
And you know - it really adds 5-10% to all of the tracked metrics.
Best plain attention layer in PyTorch - simple, well documented ... and it works in real-life applications:
https://gist.github.com/cbaziotis/94e53bdd6e4852756e0395560ff38aa4
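The gist implements additive self-attention over RNN states: score each timestep, softmax the scores, return the attention-weighted sum. A dependency-free sketch of that computation (w and b stand in for learned parameters; this is an illustration, not the gist's exact code):

```python
import math

def self_attention(states, w, b=0.0):
    # one scalar score per timestep: tanh(w . h_t + b)
    scores = [math.tanh(sum(wi * hi for wi, hi in zip(w, h)) + b)
              for h in states]
    # softmax over timesteps (stabilized by subtracting the max score)
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    alphas = [e / sum(exps) for e in exps]
    # attention-weighted sum of the states -> one fixed-size representation
    pooled = [sum(a * h[d] for a, h in zip(alphas, states))
              for d in range(len(states[0]))]
    return pooled, alphas
```

In a real model the scoring is a small learned linear layer and everything is batched tensors, but the math is exactly this.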
#nlp
#deep_learning
Russian thesaurus that really works
https://nlpub.ru/Russian_Distributional_Thesaurus#.D0.93.D1.80.D0.B0.D1.84_.D0.BF.D0.BE.D0.B4.D0.BE.D0.B1.D0.B8.D1.8F_.D1.81.D0.BB.D0.BE.D0.B2
It knows so many peculiar / old-fashioned and cheeky synonyms for obscene words!
#nlp
PyTorch NLP best practices
Very simple ideas, actually.
(1) Multi GPU parallelization and FP16 training
Do not bother reinventing the wheel.
Just use NVIDIA's apex, DistributedDataParallel, or DataParallel.
Best examples [here](https://github.com/huggingface/pytorch-pretrained-BERT).
(2) Put as much as possible INSIDE of the model
Implement as much of your logic as possible inside of nn.Module.
Why?
So that you can seamlessly use all the abstractions from (1) with ease.
Also, models are more abstract and reusable in general.
(3) Why have a separate train/val loop?
PyTorch 0.4 introduced context managers.
You can simplify your train / val / test loops and merge them into one simple function:

context = torch.no_grad() if loop_type == 'Val' else torch.enable_grad()

if loop_type == 'Train':
    model.train()
elif loop_type == 'Val':
    model.eval()

with context:
    for i, (some_tensor) in enumerate(tqdm(train_loader)):
        # do your stuff here
        pass
(4) EmbeddingBag
Use EmbeddingBag layer for morphologically rich languages. Seriously!
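Conceptually, nn.EmbeddingBag looks up all ids in a "bag" (e.g. the sub-word or n-gram ids of one word) and pools them into a single vector, so morphological variants share parameters. A pure-Python sketch of the mean-pooling mode (the toy table and ids are made up):

```python
def embedding_bag_mean(table, bag):
    # look up each id's embedding vector and mean-pool them into one
    # fixed-size vector, as nn.EmbeddingBag(mode='mean') does per bag
    vecs = [table[i] for i in bag]
    return [sum(col) / len(vecs) for col in zip(*vecs)]

# toy table: 3 ids, embedding dim 2
table = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
word_vec = embedding_bag_mean(table, bag=[0, 2])  # e.g. stem id + suffix id
```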
(5) Writing trainers / training abstractions
This is a waste of time imho if you follow (1), (2) and (3).
(6) Nice bonus
If you follow most of these, you can train on as many GPUs and machines as you want, for any language)
(7) Using tensorboard for logging
This goes without saying.
#nlp
#deep_learning
Dependency parsing and POS tagging in Russian
Less popular set of NLP tasks.
Popular tools reviewed
https://habr.com/ru/company/sberbank/blog/418701/
Only morphology:
(0) The well-known pymorphy2 package;
Only POS tags and morphology:
(0) https://github.com/IlyaGusev/rnnmorph (easy to use);
(1) https://github.com/nlpub/pymystem3 (easy to use);
Full dependency parsing
(0) Russian spacy plugin:
- https://github.com/buriy/spacy-ru - installation
- https://github.com/buriy/spacy-ru/blob/master/examples/POS_and_syntax.ipynb - usage with examples
(1) Malt parser based solution (drawback - no examples)
- https://github.com/oxaoo/mp4ru
(2) Google's syntaxnet
- https://github.com/tensorflow/models/tree/master/research/syntaxnet
#nlp
Our Transformer post was featured by Towards Data Science
https://medium.com/p/complexity-generalization-computational-cost-in-nlp-modeling-of-morphologically-rich-languages-7fa2c0b45909?source=email-f29885e9bef3--writer.postDistributed&sk=a56711f1436d60283d4b672466ba258b
#nlp
Normalization techniques other than batch norm:
(https://pics.spark-in.me/upload/aecc2c5fb356b6d803b4218fcb0bc3ec.png)
Weight normalization (used in TCN http://arxiv.org/abs/1602.07868):
- Decouples length of weight vectors from their direction;
- Does not introduce any dependencies between the examples in a minibatch;
- Can be applied successfully to recurrent models such as LSTMs;
- Tested only on small datasets (CIFAR + VAEs + DQN);
Instance norm (used in [style transfer](https://arxiv.org/abs/1607.08022)):
- Proposed for style transfer;
- Essentially batch norm applied to a single image;
- The mean and standard deviation are calculated per-dimension separately for each object in a mini-batch;
Layer norm (used in Transformers, [paper](https://arxiv.org/abs/1607.06450)):
- Designed especially for sequential networks;
- The mean and variance used for normalization are computed from all of the summed inputs to the neurons in a layer on a single training case;
- The mean and standard deviation are calculated separately over the last certain number of dimensions;
- Unlike Batch Normalization and Instance Normalization, which apply a scalar scale and bias for each entire channel/plane with the affine option, Layer Normalization applies per-element scale and bias;
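To make the layer-norm bullets concrete: for a single example it normalizes across the feature dimension, then applies a per-element scale and bias. A minimal pure-Python sketch (the eps value is illustrative):

```python
import math

def layer_norm(x, gamma, beta, eps=1e-5):
    # normalize one example across its own features - no dependence
    # on other examples in the batch, unlike batch norm
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    # per-element scale (gamma) and bias (beta),
    # not per-channel scalars as in batch/instance norm
    return [g * (v - mean) / math.sqrt(var + eps) + b
            for v, g, b in zip(x, gamma, beta)]

out = layer_norm([1.0, 2.0, 3.0, 4.0], gamma=[1.0] * 4, beta=[0.0] * 4)
```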
#deep_learning
#nlp
Sentiment datasets in Russian. Just randomly found several links:
- http://study.mokoron.com/ - annotated tweets
- http://text-machine.cs.uml.edu/projects/rusentiment/ - some more posts from VK
- https://github.com/dkulagin/kartaslov/tree/master/dataset/emo_dict…
Russian sentiment dataset
In a typical Russian fashion - one of these datasets was deleted at the request of bad people, whom I shall not name.
Luckily, some anonymous person backed the dataset up.
Anyway - use it.
Yeah, it is small. But it is free, so whatever.
#nlp
#data_science
Miniaturize / optimize your ... NLP models?
For CV applications there are literally dozens of ways to make your models smaller.
And yeah, I do not mean some "moonshots" or special limited libraries (matrix decompositions, some custom pruning, etc.).
I mean cheap and dirty hacks that work in 95% of cases regardless of your stack / device / framework:
- Smaller images (x3-x4 easy);
- FP16 inference (30-40% maybe);
- Knowledge distillation into smaller networks (x3-x10);
- Naïve cascade optimizations (feed only Nth frame using some heuristic);
But what can you do with NLP networks?
Turns out not much.
But here are my ideas:
- Use a simpler model - embedding bag + plain self-attention + LSTM can solve 90% of tasks;
- Decrease embedding size from 300 to 50 (or maybe even more). Tried and tested, works like a charm. For harder tasks you lose just 1-3pp of your target metric; for smaller tasks it is just the same;
- FP16 inference is supported in PyTorch for nn.Embedding, but not for nn.EmbeddingBag (you get _embedding_bag is not implemented for type torch.HalfTensor). But you get the idea;
- You can try distilling your vocabulary / embedding-bag model into a char-level model. If it works, you can trade model size vs. inference time;
- If you have very long sentences or large batches - try distilling / swapping your recurrent network with a CNN / TCN. This way you can also trade model size vs. inference time, but probably in a different direction;
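To unpack the distillation bullets: the usual recipe (Hinton-style soft targets) trains the small model to match the big model's temperature-softened output distribution. A dependency-free sketch of that loss (the temperature value is illustrative; in practice this term is combined with the hard-label loss):

```python
import math

def softmax(logits, t=1.0):
    # temperature-softened softmax; subtracting the max is for stability
    m = max(logits)
    exps = [math.exp((l - m) / t) for l in logits]
    return [e / sum(exps) for e in exps]

def distillation_loss(student_logits, teacher_logits, t=2.0):
    # cross-entropy between the softened teacher and student distributions
    teacher = softmax(teacher_logits, t)
    student = softmax(student_logits, t)
    return -sum(p * math.log(q) for p, q in zip(teacher, student))
```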
#nlp
#deep_learning