Serialization of large objects in Python
So far I have found no sane way to do this with 1M chunks / 10GB+ object size.
Of course, chunking / plain txt works.
Feather / parquet fail with 2+GB size.
Pickle works, but it is kind of slow.
=(
#data_science
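A minimal sketch of the "chunking + pickle" route mentioned above, assuming the large object can be split into independent chunks. The helper names `dump_chunks` / `load_chunks` and the toy data are made up for illustration; whether this is fast enough at 10GB+ is not something this snippet shows.

```python
import pickle
import tempfile
from pathlib import Path


def dump_chunks(chunks, path):
    """Write each chunk as its own pickle record into a single file."""
    with open(path, "wb") as f:
        for chunk in chunks:
            pickle.dump(chunk, f, protocol=pickle.HIGHEST_PROTOCOL)


def load_chunks(path):
    """Lazily yield the chunks back, one pickle record at a time."""
    with open(path, "rb") as f:
        while True:
            try:
                yield pickle.load(f)
            except EOFError:
                # end of file reached, no more records
                break


# toy data standing in for a large object split into chunks
chunks = [list(range(i, i + 3)) for i in range(0, 9, 3)]
path = Path(tempfile.mkdtemp()) / "chunks.pkl"
dump_chunks(chunks, path)
restored = list(load_chunks(path))
```

Because `load_chunks` is a generator, only one chunk has to live in memory at a time when reading.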
Jupyter widgets + pandas
https://towardsdatascience.com/interactive-controls-for-jupyter-notebooks-f5c94829aee6
With the @interact decorator, the IPywidgets library automatically gives us a text box and a slider for choosing a column and number! It looks at the inputs
Amazing.
#data_science
Medium
Interactive Controls in Jupyter Notebooks
How to use IPywidgets to enhance your data exploration and analysis
Second 2019 DS / ML digest
Highlight of the week - Facebook's LASER.
https://spark-in.me/post/2019_ds_ml_digest_02
#digest
#data_science
#deep_learning
Spark in me
2019 DS/ML digest 02
Author's articles - http://spark-in.me/author/snakers41
Blog - http://spark-in.me
Forwarded from Анна
Checked out sentence embeddings in LASER:
- the installation guide is a bit messy
- it works on the FAISS lib, and performance is pretty fast (<1 minute to encode 250k sentences on a 1080Ti)
- better generalization compared to the ft baseline. The difference is clear even for short sentences: the embeddings of 'добрый день!' ("good afternoon!") and 'здравствуйте!' ("hello!") are much closer in LASER's space than in ft
- it looks like LASER embeddings are more about similarity, not only substitutability, and are better at synonym recognition
- seems to work better on short sentences
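To make "closer in embedding space" concrete, here is a small sketch assuming cosine similarity is the measure of closeness (the message does not say which measure was used). The 3-d vectors are hypothetical toy values, not real LASER or ft outputs.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# hypothetical 3-d vectors, NOT real LASER or ft outputs
greeting_a = [0.9, 0.1, 0.2]
greeting_b = [0.8, 0.2, 0.1]
unrelated = [-0.1, 0.9, -0.3]

sim_close = cosine_similarity(greeting_a, greeting_b)
sim_far = cosine_similarity(greeting_a, unrelated)
```

With real encoders you would compare the same pair of sentences in both spaces and look at which model gives the higher similarity.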
A new paradigm in ML?
https://jontysinai.github.io/jekyll/update/2019/01/18/understanding-neural-odes.html
#deep_learning
#odes
jontysinai.github.io
Understanding Neural ODE's - Jonty Sinai
In this blogpost I explore how ODE’s can be used to solve data modelling problems. I take a deep dive into the data modelling problem at hand and present ODE...
Third 2019 DS / ML digest
Highlights of the week
- quaternions;
- ODEs;
https://spark-in.me/post/2019_ds_ml_digest_03
#digest
#data_science
#deep_learning
Old news ... but Attention works
Funnily enough, in the past my models:
- Either did not need attention;
- Or attention was implemented by @thinline72;
- Or the domain was so complicated (NMT) that I had to resort to boilerplate with key-value attention;
This was the first time I / we tried manually building a model with plain self-attention from scratch.
And you know - it really adds 5-10% to all of the tracked metrics.
Best plain attention layer in PyTorch - simple, well documented ... and it works in real-life applications:
https://gist.github.com/cbaziotis/94e53bdd6e4852756e0395560ff38aa4
#nlp
#deep_learning
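For readers who have not seen such a layer: a plain-Python sketch of the general idea behind attention pooling (score each timestep, softmax the scores, take the weighted sum). This is not the code from the linked gist; the function names, the tanh scoring and the toy numbers are assumptions for illustration only.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention_pool(hidden_states, weight):
    """Score each timestep with a dot product, then return the
    attention weights and the weighted sum of the hidden states."""
    scores = [math.tanh(sum(h_i * w_i for h_i, w_i in zip(h, weight)))
              for h in hidden_states]
    alphas = softmax(scores)
    dim = len(hidden_states[0])
    pooled = [sum(a * h[d] for a, h in zip(alphas, hidden_states))
              for d in range(dim)]
    return alphas, pooled


# hypothetical toy sequence: 3 timesteps, hidden size 2
hidden_states = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
weight = [0.5, -0.5]  # in a real layer this vector would be learned
alphas, pooled = attention_pool(hidden_states, weight)
```

The pooled vector replaces "take the last hidden state" or "average everything" as the sequence representation.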
Gist
SelfAttention implementation in PyTorch
Russian thesaurus that really works
https://nlpub.ru/Russian_Distributional_Thesaurus#.D0.93.D1.80.D0.B0.D1.84_.D0.BF.D0.BE.D0.B4.D0.BE.D0.B1.D0.B8.D1.8F_.D1.81.D0.BB.D0.BE.D0.B2
It knows so many peculiar / old-fashioned and cheeky synonyms for obscene words!
#nlp
nlpub.ru
Russian Distributional Thesaurus — NLPub
Russian Distributional Thesaurus (RDT for short) is a project to build an open distributional thesaurus of the Russian language. At the moment the resource contains several components: word vectors (word embeddings), a word similarity graph (distributional thesaurus), a set of…
PyTorch NLP best practices
Very simple ideas, actually.
(1) Multi GPU parallelization and FP16 training
Do not bother reinventing the wheel.
Just use nvidia's apex, DistributedDataParallel, DataParallel.
Best examples [here](https://github.com/huggingface/pytorch-pretrained-BERT).
(2) Put as much as possible INSIDE of the model
Implement as much as possible of your logic inside of nn.Module.
Why?
So that you can seamlessly use all the abstractions from (1).
Also models are more abstract and reusable in general.
(3) Why have a separate train/val loop?
PyTorch 0.4 introduced context managers (torch.no_grad() / torch.enable_grad()).
You can simplify your train / val / test loops, and merge them into one simple function.
```python
import torch
from tqdm import tqdm

def run_loop(model, loader, loop_type='Train'):
    # gradients are only needed when training
    context = torch.no_grad() if loop_type == 'Val' else torch.enable_grad()
    if loop_type == 'Train':
        model.train()
    elif loop_type == 'Val':
        model.eval()
    with context:
        for i, some_tensor in enumerate(tqdm(loader)):
            # do your stuff here
            pass
```
(4) EmbeddingBag
Use EmbeddingBag layer for morphologically rich languages. Seriously!
(5) Writing trainers / training abstractions
This is a waste of time imho if you follow (1), (2) and (3).
(6) Nice bonus
If you follow most of these, you can train on as many GPUs and machines as you want, for any language.
(7) Using tensorboard for logging
This goes without saying.
#nlp
#deep_learning
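On point (4): PyTorch's EmbeddingBag reduces a "bag" of embedding rows into one vector (sum / mean / max). Below is a rough stdlib-only imitation of the mean mode, so it runs without PyTorch. Feeding it hashed character n-grams is my assumption about why it helps with morphologically rich languages (the post does not spell this out); `DIM`, `NUM_BUCKETS` and the helper names are hypothetical.

```python
import random
import zlib

DIM = 4             # hypothetical embedding size
NUM_BUCKETS = 1000  # hypothetical size of the hashed n-gram table

# random table standing in for learned embedding weights
rng = random.Random(0)
table = [[rng.uniform(-1, 1) for _ in range(DIM)] for _ in range(NUM_BUCKETS)]


def char_ngrams(word, n=3):
    """Character n-grams of the word padded with boundary markers."""
    padded = f"<{word}>"
    return [padded[i:i + n] for i in range(len(padded) - n + 1)]


def bag_indices(word):
    """Hash each n-gram into a row of the shared table."""
    return [zlib.crc32(g.encode("utf-8")) % NUM_BUCKETS
            for g in char_ngrams(word)]


def embed_word(word):
    """Mean of the n-gram vectors: one 'bag' reduced to one vector."""
    rows = [table[i] for i in bag_indices(word)]
    return [sum(col) / len(rows) for col in zip(*rows)]


vec_a = embed_word("столами")
vec_b = embed_word("столами")
```

Word forms that share most of their n-grams share most of their rows, so rare inflections still get a sensible vector.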
GitHub
GitHub - huggingface/transformers: 🤗 Transformers: State-of-the-art Machine Learning for Pytorch, TensorFlow, and JAX.
PyTorch DataLoader, GIL thrashing and CNNs
Well all of this seems a bit like magic to me, but hear me out.
I abused my GPU box for weeks running CNNs on 2-4 GPUs.
Nothing broke.
And then my GPU box started shutting down for no apparent reason.
No, this was not:
- CPU overheating (I have a massive cooler, I checked - it works);
- PSU;
- Overclocking;
- It also adds to confusion that AMD has weird temperature readings;
To cut a long story short - if you have a very fast Dataset class and you use PyTorch's DataLoader with workers > 0, it can lead to system instability instead of a speed-up.
It is obvious in retrospect, but it is not when you face this issue.
#deep_learning
#pytorch
* (2) is valid for models with a complex forward pass and models with large embedding layers
Which type of content do you / would you like most on the channel?
Anonymous Poll
27% - Weekly / bi-weekly digests;
10% - Full articles;
9% - Podcasts with actual ML practitioners;
23% - Practical bits on real applied NLP;
9% - Pre-trained BERT with Embedding Bags for Russian;
16% - Paper reviews;
7% - Jokes / memes / cats;
Pinned post
What is this channel about?
(0) This channel is a practitioner's channel on the following topics: Internet, Data Science, Deep Learning, Python, NLP
(1) Don't get your opinion in a twist if your opinion differs. You are welcome to contact me via telegram @snakers41 and email - aveysov@gmail.com
(2) No BS and ads - I already rejected 3-4 crappy ad deals
(4) DS ML digests - in the RSS or via URLs like this https://spark-in.me/post/2019_ds_ml_digest_01
Donations
(0) Buy me a coffee 🤟 https://buymeacoff.ee/8oneCIN
Give us a rating:
(0) https://telegram.me/tchannelsbot?start=snakers4
Our chat
(0) https://t.me/joinchat/Bv9tjkH9JHYvOr92hi5LxQ
More links
(0) Our website http://spark-in.me
(1) Our chat https://t.me/joinchat/Bv9tjkH9JHYvOr92hi5LxQ
(2) DS courses review (RU) - very old http://goo.gl/5VGU5A https://spark-in.me/post/learn-data-science
(3) 2017 - 2018 SpaceNet Challenge https://spark-in.me/post/spacenet-three-challenge
(4) DS Bowl 2018 https://spark-in.me/post/playing-with-dwt-and-ds-bowl-2018
(7) Data Science tag on the website https://spark-in.me/tag/data-science
(7) Profi.ru project http://towardsdatascience.com/building-client-routing-semantic-search-in-the-wild-14db04687c7e
(8) CFT 2018 competition https://spark-in.me/post/cft-spelling-2018
(9) 2018 retrospective https://spark-in.me/post/2018
More amazing NLP-related articles incoming!
Maybe finally we will make podcasts?
A bit of lazy Sunday admin stuff
Monitoring your CPU temperature with email notifications
- Change CPU temp to any metric you like
- Rolling log
- Sends an email only once when the metric becomes critical (you can add an email for when the metric becomes non-critical again)
https://gist.github.com/snakers4/cf0ffd57c3ef7f4e2e25f6b3347dcdec
Setting up a GPU box on Ubuntu 18.04 from scratch
https://github.com/snakers4/gpu-box-setup/
#deep_learning
#linux
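The "send an email only once" logic from the monitoring bullets can be sketched as a tiny state machine. This is not the code from the gist, just an illustration: the class name `OneShotAlert`, the 80 C threshold and the readings are hypothetical, and the actual email sending is left to the caller.

```python
class OneShotAlert:
    """Fires once when a metric crosses the threshold, then stays
    silent until the metric has dropped back below it."""

    def __init__(self, threshold):
        self.threshold = threshold
        self.critical = False

    def update(self, value):
        """Return 'critical', 'recovered' or None for a new reading."""
        if value >= self.threshold and not self.critical:
            self.critical = True
            return "critical"
        if value < self.threshold and self.critical:
            self.critical = False
            return "recovered"
        return None


# hypothetical readings in degrees Celsius and a hypothetical 80 C limit
alert = OneShotAlert(threshold=80)
events = [alert.update(t) for t in [60, 85, 90, 88, 70, 65, 95]]
```

The caller sends an email whenever `update` returns something other than `None`, so a metric that stays critical for hours produces one message, not one per poll.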
Gist
Plain temperature monitoring in Ubuntu 18.04
4th 2019 DS / ML digest
Highlights of the week
- OpenAI controversy;
- BERT pre-training;
- Using transformer for conversational challenges;
https://spark-in.me/post/2019_ds_ml_digest_04
#digest
#data_science
#deep_learning
New variation of Adam?
- [Website](https://www.luolc.com/publications/adabound/);
- [Code](https://github.com/Luolc/AdaBound);
- Eliminate the generalization gap between adaptive methods and SGD;
- TL;DR: A Faster And Better Optimizer with Highly Robust Performance;
- Dynamic bound on learning rates. Inspired by gradient clipping;
- Not very sensitive to the hyperparameters, especially compared with SGD(M);
- Tested on MNIST, CIFAR, Penn Treebank - no serious datasets;
#deep_learning
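A rough sketch of what "dynamic bound on learning rates" means in practice: the Adam-style per-parameter step size is clamped into a band that narrows towards a final SGD-like rate. The bound schedule and the defaults (`final_lr=0.1`, `gamma=1e-3`) are written from memory of the linked reference code, so treat them as assumptions and check the repo before relying on them.

```python
def adabound_bounds(step, final_lr=0.1, gamma=1e-3):
    """Lower and upper learning-rate bounds at a given step (step >= 1).
    Both converge to final_lr as the step count grows."""
    lower = final_lr * (1 - 1 / (gamma * step + 1))
    upper = final_lr * (1 + 1 / (gamma * step))
    return lower, upper


def clip_lr(adaptive_lr, step, final_lr=0.1, gamma=1e-3):
    """Clamp an Adam-style per-parameter step size into the bounds."""
    lower, upper = adabound_bounds(step, final_lr, gamma)
    return min(max(adaptive_lr, lower), upper)


early = adabound_bounds(1)          # very wide band: behaves like Adam
late = adabound_bounds(1_000_000)   # narrow band: behaves like SGD
```

Early in training almost nothing is clipped; late in training every parameter ends up with roughly the same rate.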
Luolc
Adaptive Gradient Methods with Dynamic Bound of Learning Rate
Abstract Adaptive optimization methods such as AdaGrad, RMSProp and Adam have been proposed to achieve a rapid training process with an element-wise scaling term on learning rates. Though prevailing, they are observed to generalize poorly compared with Sgd…
We tried it
... yeah, we tried it on a real task
plain Adam is a bit better
Dependency parsing and POS tagging in Russian
Less popular set of NLP tasks.
Popular tools reviewed
https://habr.com/ru/company/sberbank/blog/418701/
Only morphology:
(0) Well known pymorphy2 package;
Only POS tags and morphology:
(0) https://github.com/IlyaGusev/rnnmorph (easy to use);
(1) https://github.com/nlpub/pymystem3 (easy to use);
Full dependency parsing
(0) Russian spacy plugin:
- https://github.com/buriy/spacy-ru - installation
- https://github.com/buriy/spacy-ru/blob/master/examples/POS_and_syntax.ipynb - usage with examples
(1) Malt parser based solution (drawback - no examples)
- https://github.com/oxaoo/mp4ru
(2) Google's syntaxnet
- https://github.com/tensorflow/models/tree/master/research/syntaxnet
#nlp
Habr
Exploring syntactic parsers for the Russian language
Hi! My name is Denis Kiryanov, I work at Sberbank and deal with natural language processing (NLP) problems. Once we needed to choose a syntactic parser for working with Russian....