Spark in me - Internet, data science, math, deep learning, philosophy
All this - lost like tears in rain.

Data science, ML, a bit of philosophy and math. No bs.

Our website
- http://spark-in.me
Our chat
- https://t.me/joinchat/Bv9tjkH9JHbxiV5hr91a0w
DS courses review
- http://goo.gl/5VGU5A
- https://goo.gl/YzVUKf
Chainer - a predecessor of PyTorch

Looks like
- PyTorch was based not only on Torch - its autograd was also forked from Chainer;
- Chainer looks like PyTorch, but it is developed not by Facebook but by an independent Japanese group;
- A quick glance through the docs confirms that the PyTorch and Chainer APIs look ~90% identical (both are numpy-inspired, but use different back-ends) - see the short sketch below;
- 2nd place in Open Images was taken by a team using Chainer with 512 GPUs;
- I have yet to confirm myself that PyTorch can work with a cluster (but other people have done it) https://github.com/eladhoffer/convNet.pytorch;
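To illustrate the API point - a minimal sketch (not from the post; layer sizes are arbitrary) of the same small MLP in both frameworks:

# PyTorch
import torch.nn as nn
import torch.nn.functional as TF

class TorchMLP(nn.Module):
    def __init__(self):
        super().__init__()
        self.l1 = nn.Linear(784, 256)
        self.l2 = nn.Linear(256, 10)

    def forward(self, x):
        return self.l2(TF.relu(self.l1(x)))

# Chainer
import chainer
import chainer.links as L
import chainer.functions as CF

class ChainerMLP(chainer.Chain):
    def __init__(self):
        super().__init__()
        with self.init_scope():
            self.l1 = L.Linear(784, 256)
            self.l2 = L.Linear(256, 10)

    def __call__(self, x):
        return self.l2(CF.relu(self.l1(x)))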

https://www.reddit.com/r/MachineLearning/comments/7lb5n1/d_chainer_vs_pytorch/
https://docs.chainer.org/en/stable/comparison.html

#deep_learning
Also - thanks to all the DO referral link supporters - hosting of my website is now finally free (at least for the next ~6 months)!

Also, today I published the 200th post on spark-in.me. Ofc not all of these are proper long articles, but it's cool nevertheless.
SENet
- http://arxiv.org/abs/1709.01507;
- The ImageNet 2017 classification winner;
- Mostly a ResNet-152-inspired network;
- Transfers well (like ResNet);
- The Squeeze-and-Excitation (SE) block adaptively recalibrates channel-wise feature responses by explicitly modelling interdependencies between channels;
- Intuitively it looks like convolutions meet the attention mechanism;
- SE block (a minimal PyTorch sketch is below the list):
- https://pics.spark-in.me/upload/aa50a2559f56faf705ad6639ac973a38.jpg
- The reduction ratio r is set to 16 in all experiments;
- Results:
- https://pics.spark-in.me/upload/db2c98330744a6fd4dab17259d5f9d14.jpg
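Not the authors' code - just a minimal PyTorch sketch of the SE block as described above (the channel count is arbitrary, r=16 as in the paper):

import torch
import torch.nn as nn
import torch.nn.functional as F

class SEBlock(nn.Module):
    def __init__(self, channels, reduction=16):
        super().__init__()
        self.fc1 = nn.Linear(channels, channels // reduction)
        self.fc2 = nn.Linear(channels // reduction, channels)

    def forward(self, x):
        b, c, _, _ = x.size()
        s = F.adaptive_avg_pool2d(x, 1).view(b, c)   # squeeze: global average pooling
        s = F.relu(self.fc1(s))                      # excitation: bottleneck with ratio r
        s = torch.sigmoid(self.fc2(s))               # per-channel weights in (0, 1)
        return x * s.view(b, c, 1, 1)                # recalibrate the feature maps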

#deep_learning
Useful Python / PyTorch bits

dot.notation access to dictionary attributes

class dotdict(dict):
    __getattr__ = dict.get
    __setattr__ = dict.__setitem__
    __delattr__ = dict.__delitem__
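A tiny usage example (the keys are just placeholders); note that missing keys return None instead of raising, because of dict.get:

cfg = dotdict({'lr': 1e-3, 'batch_size': 32})

print(cfg.lr)          # 0.001 - attribute access instead of cfg['lr']
cfg.epochs = 10        # assignment works too
print(cfg.missing)     # None - no KeyError / AttributeError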

PyTorch embedding layer - ignore padding

nn.Embedding has a padding_idx argument, so that the padding token's embedding stays at zero and is not updated during training.
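A short example (vocab size and dimensions are arbitrary; index 0 is assumed to be the padding token):

import torch
import torch.nn as nn

emb = nn.Embedding(num_embeddings=10000, embedding_dim=300, padding_idx=0)

tokens = torch.tensor([[5, 42, 0, 0]])   # a padded sequence
out = emb(tokens)
# out[0, 2:] is all zeros, and the gradient w.r.t. row 0 is always zero,
# so the padding embedding is never updated during training.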

#python
#pytorch
Gensim's fast-text subwords

Some monkey patching to get subwords from Gensim's fast-text


import gensim.models.keyedvectors
from gensim.models.utils_any2vec import _compute_ngrams, _ft_hash

def subword(self, word):
    ngram_lst = []
    ngrams = _compute_ngrams(word, self.min_n, self.max_n)

    for ngram in ngrams:
        ngram_hash = _ft_hash(ngram) % self.bucket
        if ngram_hash in self.hash2index:
            ngram_lst.append(ngram)
    return ngram_lst

gensim.models.keyedvectors.FastTextKeyedVectors.subword = subword
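Hypothetical usage, assuming a fastText model available via gensim (the path is a placeholder):

from gensim.models import FastText

model = FastText.load('my_fasttext.model')
print(model.wv.subword('варкалось'))   # only the n-grams that actually have a learned vector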
Understanding the current SOTA NMT / NLP model - transformer

A list of articles that really help to do so:
- Understanding attention https://jalammar.github.io/visualizing-neural-machine-translation-mechanics-of-seq2seq-models-with-attention/
- Annotated transformer http://nlp.seas.harvard.edu/2018/04/03/attention.html
- Illustrated transformer https://jalammar.github.io/illustrated-transformer/
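To make the attention part concrete - a minimal sketch (not from the articles above) of the scaled dot-product attention at the transformer's core; tensor shapes are just an illustration:

import torch
import torch.nn.functional as F

def scaled_dot_product_attention(q, k, v, mask=None):
    # q, k, v: (batch, heads, seq_len, d_k); mask broadcasts over the score shape
    d_k = q.size(-1)
    scores = q @ k.transpose(-2, -1) / d_k ** 0.5     # (batch, heads, seq_q, seq_k)
    if mask is not None:
        scores = scores.masked_fill(mask == 0, float('-inf'))
    weights = F.softmax(scores, dim=-1)               # attention distribution over keys
    return weights @ v, weights                       # weighted sum of values + the weights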

Playing with transformer in practice

This repo turned out to be really helpful
https://github.com/huggingface/pytorch-openai-transformer-lm

It features:
- A decent, well-encapsulated model and loss;
- Several heads for different tasks;
- It works;
- Ofc their data-loading scheme is crappy and over-engineered;

My impressions on actually training the transformer model for classification:
- It works;
- It is high capacity;
- Inference time is ~`5x` higher than for char-level or plain RNNs;
- It serves as a classifier as well as an LM;
- Capacity is enough to tackle most challenging tasks;
- It can be deployed on CPU for small texts (!);
- On smaller tasks there is no clear difference between plain RNNs and Transformer;

#nlp
Using sklearn pairwise cosine similarity

http://scikit-learn.org/stable/modules/generated/sklearn.metrics.pairwise.cosine_similarity.html#sklearn.metrics.pairwise.cosine_similarity

On a 7k x 7k example with 300-dimensional vectors it turned out to be MUCH faster than computing the same similarities:
- In 10 processes;
- Using numba;
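A minimal usage sketch (the sizes are from the example above; random data is just a placeholder):

import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

a = np.random.rand(7000, 300)
b = np.random.rand(7000, 300)

sim = cosine_similarity(a, b)   # dense (7000, 7000) matrix of pairwise cosine similarities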

The more you know.
If you have used it - please PM me.

#nlp
DS/ML digest 24

Key topics of this one:
- New method to calculate phrase/n-gram/sentence embeddings for rare and OOV words;
- So many releases from Google;

https://spark-in.me/post/2018_ds_ml_digest_24

If you like our digests, you can support the channel via:
- Sharing / reposting;
- Giving an article a decent comment / a thumbs-up;
- Buying me a coffee (links in the digest);

#digest
#deep_learning
#data_science
Araneum russicum maximum

TLDR - the largest corpus of the Russian Internet. Fast-text embeddings pre-trained on this corpus work best for broad Internet-related domains.

A pre-processed version can be downloaded from RusVectores.
Afaik, this link is not yet on their website (?)

wget http://rusvectores.org/static/rus_araneum_maxicum.txt.gz
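A hedged sketch of loading these vectors with gensim, assuming the file is in the plain word2vec text format that RusVectores usually ships:

import gensim

# the .gz can usually be passed directly (gensim opens it via smart_open)
w2v = gensim.models.KeyedVectors.load_word2vec_format('rus_araneum_maxicum.txt.gz', binary=False)
print(w2v.most_similar('интернет', topn=5))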

#nlp
New fast.ai course

Mainly decision tree practice.

A lot about decision tree visualization
- http://www.fast.ai/2018/09/26/ml-launch/

I personally would check out the visualization bits.
At least it looks like they are not pushing their crappy library =)
The problem with any such visualizations is that they work only on toy datasets.
The drop / shuffle (permutation) feature-importance method seems more robust - a sketch is below.
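A minimal sketch of the shuffle (permutation) importance idea, assuming a fitted sklearn-style model and a pandas validation set; the names are placeholders, not fast.ai API:

import numpy as np
from sklearn.metrics import r2_score

def shuffle_importance(model, X_val, y_val, metric=r2_score):
    # Shuffle one column at a time and record how much the validation metric drops.
    baseline = metric(y_val, model.predict(X_val))
    importances = {}
    for col in X_val.columns:
        X_perm = X_val.copy()
        X_perm[col] = np.random.permutation(X_perm[col].values)
        importances[col] = baseline - metric(y_val, model.predict(X_perm))
    return importances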

#data_science
If you are mining for a large web-corpus

... for any language other than English, and you do not want to scrape anything, buy proxies, or learn this somewhat shady "stack".

In the case of Russian, the araneum link posted above already contains processed data - which may be useless for certain domains.

What to do?

(1)
In the case of Russian you can write here:
https://tatianashavrina.github.io/taiga_site/
The author will share her 90+ GB RAW corpus with you.

(2)
In the case of any other language there is a second way:
- Go to the Common Crawl website;
- Download the index (~200 GB);
- Choose domains in your country / language (now they also have language detection);
- Download only the plain-text files you need (a fetch-by-byte-range sketch is below);
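A hedged sketch of the last step - fetching a single record by byte range once you have (filename, offset, length) from the index; the endpoint URL is an assumption, check the links below:

import gzip
import io
import requests

BASE_URL = 'https://data.commoncrawl.org/'  # assumption - verify against the Common Crawl links below

def fetch_record(filename, offset, length):
    # Each Common Crawl record is an individually gzipped member, so a byte-range
    # request returns a self-contained chunk that can be decompressed on its own.
    headers = {'Range': 'bytes={}-{}'.format(offset, offset + length - 1)}
    resp = requests.get(BASE_URL + filename, headers=headers, timeout=60)
    resp.raise_for_status()
    return gzip.GzipFile(fileobj=io.BytesIO(resp.content)).read().decode('utf-8', 'replace')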

Links to start with
- http://commoncrawl.org/connect/blog/
- http://commoncrawl.org/2018/03/index-to-warc-files-and-urls-in-columnar-format/
- https://www.slideshare.net/RobertMeusel/mining-a-large-web-corpus

#nlp
Andrew Ng book

Looks like its draft is finished.

It describes in plain terms how to build ML pipelines:
- https://drive.google.com/open?id=1aHVZ9pcsGtIcgarZxV-Qfkb0JEvtSLDK

#data_science
Forwarded from Админим с Буквой (bykva)
GitHub and SSH keys

I learned about a neat GitHub feature: you can fetch an account's public SSH keys via a link like the one below. Handy for sharing a key as just a URL.

https://github.com/bykvaadm.keys

#github
Amazingly simple code to mimic fast-texts n-gram subword routine

Nuff said. Try it for yourself.

from fastText import load_model

model = load_model('official_fasttext_wiki_200_model')

def find_ngrams(string, n):
    ngrams = zip(*[string[i:] for i in range(n)])
    ngrams = [''.join(_) for _ in ngrams]
    return ngrams

string = 'грёзоблаженствующий'

ngrams = []
for i in range(3, 7):
    ngrams.extend(find_ngrams('<' + string + '>', i))

ft_ngrams, ft_indexes = model.get_subwords(string)

ngrams = set(ngrams)
ft_ngrams = set(ft_ngrams)

print(sorted(ngrams), sorted(ft_ngrams))
print(set(ft_ngrams).difference(set(ngrams)), set(ngrams).difference(set(ft_ngrams)))
#nlp