Spark in me - Internet, data science, math, deep learning, philosophy
All this - lost like tears in rain.

Data science, ML, a bit of philosophy and math. No bs.

Our website
- http://spark-in.me
Our chat
- https://t.me/joinchat/Bv9tjkH9JHbxiV5hr91a0w
DS courses review
- http://goo.gl/5VGU5A
- https://goo.gl/YzVUKf
Training MnasNet from scratch ... and failing

As a small side hobby we tried training Google's new mobile network (MnasNet) from scratch and failed:
- https://spark-in.me/post/mnasnet-fail-alas
- https://github.com/snakers4/mnasnet-pytorch

Maybe you know how to train it properly?

Also, you can now upvote articles on spark-in.me! =)

#deep_learning
MySQL - replacing window functions

Older versions of MySQL (and maybe newer ones too) do not have all the goodness you can find in PostgreSQL. Ofc you can do plain session matching in Python (see the pandas sketch below), but sometimes you just need to do it in plain SQL.

In Postgres you usually use window functions for this purpose if you need PLAIN SQL (ofc there are stored procedures / views / mat views etc).

In MySQL it can be elegantly solved like this:

-- user variables emulate a window function: @session_number increments
-- when the gap between consecutive rows of the same uid exceeds 30 minutes,
-- and resets to 0 for a new uid
SET @session_number = 0, @last_uid = '0', @current_uid = '0', @dif = 0;

SELECT
    t1.some_field,
    t2.some_field,
    ...
    @last_uid:=@current_uid,
    @current_uid:=t1.uid,
    @dif:=TIMESTAMPDIFF(MINUTE, t2.session_ts, t1.session_ts),
    IF(@last_uid=@current_uid,
       IF(@dif > 30, @session_number:=@session_number+1, @session_number),
       @session_number:=0) AS session
FROM
    table1 t1
    JOIN table2 t2 ON t1.id = t2.id + 1

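For comparison, here is a minimal pandas sketch of the "plain session matching in Python" mentioned above. The column names (uid, session_ts) and the 30-minute gap are assumptions mirroring the SQL:

import pandas as pd

def sessionize(df, gap_minutes=30):
    df = df.sort_values(['uid', 'session_ts']).copy()
    # time since the previous event of the same uid (NaT for the first event)
    gap = df.groupby('uid')['session_ts'].diff() > pd.Timedelta(minutes=gap_minutes)
    # per-uid session counter starting at 0, incremented on every long gap
    df['session'] = gap.astype(int).groupby(df['uid']).cumsum()
    return df
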
#data_science
DS/ML digest 23

The key topic of this one - this is just insanity:
- vid2vid
- unsupervised NMT

https://spark-in.me/post/2018_ds_ml_digest_23

If you like our digests, you can support the channel via:
- Sharing / reposting;
- Giving an article a decent comment / a thumbs-up;
- Buying me a coffee (links on the digest);

Let's spread the right DS/ML ideas together.

#digest
#deep_learning
#data_science
Chainer - a predecessor of PyTorch

Looks like
- PyTorch was based not only on Torch - its autograd was originally forked from Chainer;
- Chainer looks like PyTorch ... but it is made by an independent Japanese group (Preferred Networks), not by Facebook;
- A quick glance through the docs confirms that the PyTorch and Chainer APIs look 90% identical (both numpy-inspired, but using different back-ends);
- The Open Images 2nd place was taken by a team using Chainer with 512 GPUs;
- I have yet to confirm for myself that PyTorch works on a cluster (but other people have done it) https://github.com/eladhoffer/convNet.pytorch;

https://www.reddit.com/r/MachineLearning/comments/7lb5n1/d_chainer_vs_pytorch/
https://docs.chainer.org/en/stable/comparison.html

#deep_learning
Also - thanks to all the DO referral link supporters - hosting of my website is now finally free (at least for the next ~6 months)!

Also, today I published the 200th post on spark-in.me. Ofc not all of these are proper long articles, but it's cool nevertheless.
SENet
- http://arxiv.org/abs/1709.01507;
- Winner of the 2017 ImageNet (ILSVRC) classification challenge;
- Mostly a ResNet-152-inspired network;
- Transfers well (like ResNet);
- The Squeeze-and-Excitation (SE) block adaptively recalibrates channel-wise feature responses by explicitly modelling interdependencies between channels;
- Intuitively it looks like convolutions meet the attention mechanism (see the sketch below);
- SE block:
- https://pics.spark-in.me/upload/aa50a2559f56faf705ad6639ac973a38.jpg
- Reduction ratio r is set to 16 in all experiments;
- Results:
- https://pics.spark-in.me/upload/db2c98330744a6fd4dab17259d5f9d14.jpg

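A minimal PyTorch sketch of the SE block (my own reading of the paper, not the authors' code; the channel count is whatever the host network uses):

import torch.nn as nn

class SEBlock(nn.Module):
    def __init__(self, channels, reduction=16):
        super().__init__()
        self.pool = nn.AdaptiveAvgPool2d(1)              # squeeze: global average pooling
        self.fc = nn.Sequential(                         # excitation: bottleneck with ratio r
            nn.Linear(channels, channels // reduction),
            nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, channels),
            nn.Sigmoid(),
        )

    def forward(self, x):
        b, c, _, _ = x.size()
        w = self.fc(self.pool(x).view(b, c)).view(b, c, 1, 1)
        return x * w                                     # channel-wise recalibration
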
#deep_learning
Useful Python / PyTorch bits

dot.notation access to dictionary attributes

class dotdict(dict):
    __getattr__ = dict.get
    __setattr__ = dict.__setitem__
    __delattr__ = dict.__delitem__
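
Usage is straightforward (keys become attribute-style accessors):

config = dotdict({'lr': 1e-3, 'batch_size': 32})
print(config.lr)      # 0.001
config.epochs = 10    # equivalent to config['epochs'] = 10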

PyTorch embedding layer - ignore padding

nn.Embedding has a padding_idx argument, so that the embedding of the padding token is not updated during training.

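A minimal sketch (the vocabulary size and dimensions are made up):

import torch
import torch.nn as nn

PAD_IDX = 0
emb = nn.Embedding(num_embeddings=10000, embedding_dim=300, padding_idx=PAD_IDX)
# the row for PAD_IDX is initialized to zeros and receives no gradient updates
tokens = torch.tensor([[5, 42, 7, PAD_IDX, PAD_IDX]])
vectors = emb(tokens)  # (1, 5, 300); padded positions map to the zero vector
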
#python
#pytorch
Gensim's fast-text subwords

Some monkey patching to get subwords from Gensim's fast-text


from gensim.models.utils_any2vec import _compute_ngrams, _ft_hash

def subword(self, word):
    # collect the character n-grams of the word whose hash buckets
    # are actually present in the trained model
    ngram_lst = []
    ngrams = _compute_ngrams(word, self.min_n, self.max_n)

    for ngram in ngrams:
        ngram_hash = _ft_hash(ngram) % self.bucket
        if ngram_hash in self.hash2index:
            ngram_lst.append(ngram)
    return ngram_lst

gensim.models.keyedvectors.FastTextKeyedVectors.subword = subword
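
Usage then looks roughly like this (the model path is a placeholder; this assumes gensim 3.x, where these private helpers live in utils_any2vec):

from gensim.models import FastText

model = FastText.load('my_fasttext.model')   # placeholder; load your trained model however you normally do
print(model.wv.subword('говорить'))          # n-grams of the word actually present in the model
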
Understanding the current SOTA NMT / NLP model - transformer

A list of articles that really help to do so:
- Understanding attention https://jalammar.github.io/visualizing-neural-machine-translation-mechanics-of-seq2seq-models-with-attention/
- Annotated transformer http://nlp.seas.harvard.edu/2018/04/03/attention.html
- Illustrated transformer https://jalammar.github.io/illustrated-transformer/

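For reference, the core operation all of these posts build up to - scaled dot-product attention - is just a few lines (a sketch of the standard formulation, not code from the links above):

import math
import torch
import torch.nn.functional as F

def scaled_dot_product_attention(q, k, v, mask=None):
    # q, k, v: (batch, heads, seq_len, d_k)
    scores = q @ k.transpose(-2, -1) / math.sqrt(q.size(-1))
    if mask is not None:
        scores = scores.masked_fill(mask == 0, float('-inf'))
    return F.softmax(scores, dim=-1) @ v
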
Playing with transformer in practice

This repo turned out to be really helpful
https://github.com/huggingface/pytorch-openai-transformer-lm

It features:
- A decent, well-encapsulated model and loss;
- Several heads for different tasks;
- It works;
- Ofc their data-loading scheme is crappy and over-engineered;

My impressions on actually training the transformer model for classification:
- It works;
- It is high capacity;
- Inference time is ~5x higher than for char-level or plain RNNs;
- It serves as a classifier as well as an LM;
- Capacity is enough to tackle the most challenging tasks;
- It can be deployed on CPU for small texts (!);
- On smaller tasks there is no clear difference between plain RNNs and the Transformer;

#nlp
Using sklearn pairwise cosine similarity

http://scikit-learn.org/stable/modules/generated/sklearn.metrics.pairwise.cosine_similarity.html#sklearn.metrics.pairwise.cosine_similarity

On a 7k x 7k example with 300-dimensional vectors it turned out to be MUCH faster than doing the same:
- In 10 processes;
- Using numba;

The more you know.
If you have used it - please PM me.
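
A minimal sketch of the call itself (shapes match the 7k x 7k / 300-dim example above):

import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

a = np.random.rand(7000, 300).astype(np.float32)
b = np.random.rand(7000, 300).astype(np.float32)
sims = cosine_similarity(a, b)   # (7000, 7000) similarity matrix in one vectorized call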

#nlp
DS/ML digest 24

Key topics of this one:
- New method to calculate phrase/n-gram/sentence embeddings for rare and OOV words;
- So many releases from Google;

https://spark-in.me/post/2018_ds_ml_digest_24

If you like our digests, you can support the channel via:
- Sharing / reposting;
- Giving an article a decent comment / a thumbs-up;
- Buying me a coffee (links on the digest);

#digest
#deep_learning
#data_science
Araneum russicum maximum

TLDR - the largest corpus of the Russian Internet. FastText embeddings pre-trained on this corpus work best for broad Internet-related domains.

A pre-processed version can be downloaded from RusVectores.
Afaik this link is not on their website yet (?)

wget http://rusvectores.org/static/rus_araneum_maxicum.txt.gz

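Loading it afterwards, assuming the usual RusVectores word2vec text format (an assumption - check the file header):

from gensim.models import KeyedVectors

# gensim reads .gz archives directly via smart_open
wv = KeyedVectors.load_word2vec_format('rus_araneum_maxicum.txt.gz', binary=False)
print(wv.most_similar('интернет'))
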
#nlp
New fast.ai course

Mainly decision tree practice.

A lot about decision tree visualization
- http://www.fast.ai/2018/09/26/ml-launch/

I personally would check out the visualization bits.
At least it looks like they are not pushing their crappy library =)
The problem with any such visualizations is that they work only for toy datasets.
The drop / shuffle (permutation) importance method seems to be more robust - see the sketch below.

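The shuffle variant, as a minimal sketch (a plain permutation-importance loop; the fitted binary classifier, the validation DataFrame and the metric are placeholders):

import numpy as np
from sklearn.metrics import roc_auc_score

def shuffle_importance(model, X_valid, y_valid, metric=roc_auc_score):
    # importance of a feature = drop in the validation score after shuffling its column
    base = metric(y_valid, model.predict_proba(X_valid)[:, 1])
    importances = {}
    for col in X_valid.columns:
        X_shuffled = X_valid.copy()
        X_shuffled[col] = np.random.permutation(X_shuffled[col].values)
        importances[col] = base - metric(y_valid, model.predict_proba(X_shuffled)[:, 1])
    return importances
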
#data_science
If you are mining for a large web-corpus

... for any language other than English, and you do not want to scrape anything yourself, buy proxies, or learn this somewhat shady "stack".

For Russian, the Araneum link posted above already contains processed data - which may be useless for certain domains.

What to do?

(1)
For Russian, you can write here
https://tatianashavrina.github.io/taiga_site/
The author will share her 90+ GB raw corpus with you

(2)
For any other language there is a second way (a rough sketch follows the links below):
- Go to the Common Crawl website;
- Download the index (200 GB);
- Choose domains in your country / language (they now also have language detection);
- Download only the plain-text files you need;

Links to start with
- http://commoncrawl.org/connect/blog/
- http://commoncrawl.org/2018/03/index-to-warc-files-and-urls-in-columnar-format/
- https://www.slideshare.net/RobertMeusel/mining-a-large-web-corpus

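A rough sketch of steps 2-4 via the CDX index API. The crawl id, the URL pattern and the field names below are assumptions - check index.commoncrawl.org for the current ones; for a whole TLD you would page through the index or use the columnar files instead:

import gzip
import io
import json
import requests

CDX_API = 'http://index.commoncrawl.org/CC-MAIN-2018-39-index'   # assumed crawl id

def search_index(url_pattern, page=0):
    # one page of index records (JSON lines) matching the pattern, e.g. '*.example.ru'
    resp = requests.get(CDX_API, params={'url': url_pattern, 'output': 'json', 'page': page})
    resp.raise_for_status()
    return [json.loads(line) for line in resp.text.splitlines()]

def fetch_record(record):
    # each record points into a WARC archive; a ranged GET pulls just that capture
    start = int(record['offset'])
    end = start + int(record['length']) - 1
    resp = requests.get('https://commoncrawl.s3.amazonaws.com/' + record['filename'],
                        headers={'Range': 'bytes={}-{}'.format(start, end)})
    resp.raise_for_status()
    return gzip.GzipFile(fileobj=io.BytesIO(resp.content)).read()

records = search_index('*.example.ru')
print(len(records), 'captures found')
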
#nlp
Andrew Ng book

Looks like its draft is finished.

It describes in plain terms how to build ML pipelines:
- https://drive.google.com/open?id=1aHVZ9pcsGtIcgarZxV-Qfkb0JEvtSLDK

#data_science