Spark in me
Lost like tears in rain. DS, ML, a bit of philosophy and math. No bs or ads.
Normalization techniques other than batch norm:
(https://pics.spark-in.me/upload/aecc2c5fb356b6d803b4218fcb0bc3ec.png)

Weight normalization (used in TCN http://arxiv.org/abs/1602.07868):
- Decouples length of weight vectors from their direction;
- Does not introduce any dependencies between the examples in a minibatch;
- Can be applied successfully to recurrent models such as LSTMs;
- Tested only on small datasets (CIFAR + VAEs + DQN);
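PyTorch ships this reparameterization as torch.nn.utils.weight_norm; a minimal sketch (layer sizes are purely illustrative):

import torch
import torch.nn as nn
from torch.nn.utils import weight_norm

# Reparameterizes the weight as w = g * v / ||v||, decoupling
# its length (g) from its direction (v).
conv = weight_norm(nn.Conv1d(64, 64, kernel_size=3, padding=1))

x = torch.randn(8, 64, 100)   # (batch, channels, length)
y = conv(x)                   # behaves like a normal Conv1d
print(conv.weight_g.shape, conv.weight_v.shape)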


Instance norm (used in [style transfer](https://arxiv.org/abs/1607.08022))
- Proposed for style transfer;
- Essentially batch norm applied to a single image;
- The mean and standard-deviation are calculated per-dimension separately for each object in a mini-batch;

Layer norm (used in Transformers, [paper](https://arxiv.org/abs/1607.06450))
- Designed especially for sequential networks;
- Computes the mean and variance used for normalization from all of the summed inputs to the neurons in a layer on a single training case;
- The mean and standard deviation are calculated separately over the last certain number of dimensions;
- Unlike Batch Normalization and Instance Normalization, which apply a scalar scale and bias to each entire channel/plane with the affine option, Layer Normalization applies per-element scale and bias;
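A minimal sketch of how instance norm and layer norm differ in PyTorch (tensor shapes are illustrative):

import torch
import torch.nn as nn

x_img = torch.randn(4, 3, 32, 32)   # (batch, channels, H, W)
x_seq = torch.randn(4, 10, 512)     # (batch, seq_len, features)

# Instance norm: statistics per sample and per channel ("batch norm for one image").
y_img = nn.InstanceNorm2d(3)(x_img)

# Layer norm: statistics over the last dimension(s) of each individual sample,
# with a per-element affine scale and bias.
layer = nn.LayerNorm(512)
y_seq = layer(x_seq)
print(layer.weight.shape)           # torch.Size([512]), i.e. per-element scale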

#deep_learning
#nlp
Wow, we are not alone with our love for Embedding bag!
Forwarded from Neural Networks Engineering (Andrey)
FastText embeddings done right


An important feature of FastText embeddings is the usage of subword information.
In addition to the vocabulary, FastText also stores word n-grams.
This additional information is useful for handling out-of-vocabulary words, extracting meaning from a word's etymology, and dealing with misspellings.

But unfortunately all these advantages go unused in most open source projects.
We can easily see this on GitHub (pic.). The point is that a regular Embedding layer maps each whole word to a single fixed vector stored in memory. In this case all word vectors have to be generated in advance, so none of the cool features work.

The good thing is that using FastText correctly is not so difficult! FacebookResearch provides an example of the proper way to use FastText with the PyTorch framework.
Instead of Embedding you should choose the EmbeddingBag layer. It combines n-grams into a single word vector which can be used as usual.
This way we get all the advantages in our neural network.
... or you can just extend collate_fn that is passed to DataLoader in pytorch =)
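For reference, a minimal sketch of the EmbeddingBag idea (the bucket count and n-gram ids below are made up): each word is passed as the list of its subword n-gram ids, and the layer averages them into a single word vector.

import torch
import torch.nn as nn

num_ngram_buckets = 2_000_000   # assumed size of the n-gram hash space
dim = 300

bag = nn.EmbeddingBag(num_ngram_buckets, dim, mode='mean')

# Two words flattened into one id tensor plus per-word offsets:
# word 0 -> n-gram ids [3, 17, 42], word 1 -> n-gram ids [7, 99]
ids = torch.tensor([3, 17, 42, 7, 99])
offsets = torch.tensor([0, 3])

word_vectors = bag(ids, offsets)   # shape (2, 300), one vector per word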
Forwarded from Neural Networks Engineering (Andrey)
Parallel preprocessing with multiprocessing

Using multiple processes to construct training batches may significantly reduce the total training time of your network.
Basically, if you are using a GPU for training, you can reduce the additional batch construction time almost to zero. This is achieved through pipelining of computations: while the GPU crunches numbers, the CPU does the preprocessing. The Python multiprocessing module allows us to implement such pipelining about as elegantly as is possible in a language with a GIL.

The PyTorch DataLoader class, for example, also uses multiprocessing in its internals.
Unfortunately, DataLoader lacks flexibility. It's impossible to create a batch with an arbitrarily complex structure within the standard DataLoader class, so it is useful to be able to apply raw multiprocessing.

multiprocessing gives us a set of useful APIs to distribute computations among several processes. Processes do not share memory with each other, so data is transmitted via inter-process communication protocols. For example, on Linux-like operating systems multiprocessing uses pipes. Such an organization leads to some pitfalls, which I am going to describe.

* map vs imap

The methods map and imap may be used to apply preprocessing to batches. Both take a processing function and an iterable as arguments. The difference is that imap is lazy: it returns processed elements as soon as they are ready, so not all processed batches have to be stored in RAM simultaneously. For training neural networks you should always prefer imap:

from multiprocessing import Pool

def process(batch_reader):
    # threads and foo are defined elsewhere
    with Pool(threads) as pool:
        for batch in pool.imap(foo, batch_reader):
            ...
            yield batch
            ...


* Serialization

Another pitfall is associated with the need to transfer objects via pipes. In addition to the processing results, multiprocessing will also serialize the transformation object if it is used like this: pool.imap(transformer.foo, batch_reader). transformer will be serialized and sent to the subprocess, which may lead to problems if the transformer object has large attributes. In this case it may be better to store the large attributes as class variables:


class Transformer:
    large_dictionary = None

    def __init__(self, large_dictionary, **kwargs):
        # store the heavy object on the class rather than on the instance,
        # so it is not pickled together with `self` on every imap call
        self.__class__.large_dictionary = large_dictionary

    def foo(self, x):
        ...
        y = self.large_dictionary[x]
        ...


Another difficulty you may encounter is when the preprocessor is faster than GPU training. In this case unprocessed batches accumulate in memory, and if your memory is not large enough you will get an out-of-memory error. One way to solve this problem is to throttle batch preprocessing until the GPU has consumed the previous batches.
A Semaphore is a perfect solution for this task:

from multiprocessing import Pool, Semaphore

def batch_reader(semaphore):
    for batch in source:
        semaphore.acquire()   # blocks once `limit` batches are already in flight
        yield batch


def process(x):
    return x + 1


def pooling():
    with Pool(threads) as pool:
        semaphore = Semaphore(limit)
        for x in pool.imap(process, batch_reader(semaphore)):
            yield x
            semaphore.release()   # a batch has been consumed, let the reader emit another


for x in pooling():
    learn_gpu(x)


The Semaphore has an internal counter synchronized across all worker processes. Its logic blocks execution inside semaphore.acquire() once `limit` batches are already in flight.
Good old OLS regression

I needed some quick boilerplate to create an OLS regression with confidence intervals for a very plain task.

Found some nice statsmodels examples here:
http://www.statsmodels.org/devel/examples/notebooks/generated/ols.html
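A minimal sketch with synthetic data (variable names and values are illustrative, not from the original task):

import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(42)
x = rng.uniform(0, 10, size=100)
y = 2.0 * x + 1.0 + rng.normal(scale=2.0, size=100)

X = sm.add_constant(x)                 # adds the intercept column
model = sm.OLS(y, X).fit()

print(model.summary())                 # coefficients, p-values, R^2
print(model.conf_int(alpha=0.05))      # 95% confidence intervals for the parameters

# Confidence / prediction intervals for the fitted values:
pred = model.get_prediction(X)
print(pred.summary_frame(alpha=0.05).head())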

#data_science
2019 DS / ML digest number 7

Highlights of the week
- NN normalization techniques (not batch norm);
- Jetson Nano for US$99 released;
- A bitter lesson in AI;

https://spark-in.me/post/2019_ds_ml_digest_07

#digest
#deep_learning
Spark in me
Sentiment datasets in Russian. Just randomly found several links:
- http://study.mokoron.com/ - annotated tweets
- http://text-machine.cs.uml.edu/projects/rusentiment/ - some more posts from VK
- https://github.com/dkulagin/kartaslov/tree/master/dataset/emo_dict…
Russian sentiment dataset

In a typical Russian fashion, one of these datasets was deleted at the request of bad people, whom I shall not name.
Luckily, someone anonymous backed the dataset up.
Anyway - use it.

Yeah, it is small. But it is free, so whatever.

#nlp
#data_science
Miniaturize / optimize your ... NLP models?

For CV applications there are literally dozens of ways to make your models smaller.
And yeah, I do not mean "moonshots" or special limited libraries (matrix decompositions, some custom pruning, etc.).
I mean cheap and dirty hacks that work in 95% of cases regardless of your stack / device / framework:
- Smaller images (x3-x4 easy);
- FP16 inference (30-40% maybe);
- Knowledge distillation into smaller networks (x3-x10);
- Naïve cascade optimizations (feed only every Nth frame, using some heuristic);

But what can you do with NLP networks?
Turns out not much.

But here are my ideas:
- Use a simpler model - embedding bag + plain self-attention + LSTM can solve 90% of tasks;

- Decrease embedding size from 300 to 50 (or maybe even more). Tried and tested, works like a charm. For harder tasks you lose just 1-3pp of your target metric, for smaller tasks - it is just the same;

- FP16 inference is supported in PyTorch for nn.Embedding, but not for nn.EmbeddingBag (see the sketch after this list). But you get the idea;

_embedding_bag is not implemented for type torch.HalfTensor

- You can try distilling your vocabulary / embedding-bag model into a char level model. If it works, you can trade model size vs. inference time;

- If you have very long sentences or large batches - try distilling / swapping your recurrent network with a CNN / TCN. This way you can also trade model size vs. inference time but probably in a different direction;
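A minimal sketch of the FP16 point above (sizes are made up; assumes a CUDA GPU):

import torch
import torch.nn as nn

emb = nn.Embedding(50_000, 50).cuda().half()       # FP16 weights on the GPU
ids = torch.randint(0, 50_000, (32, 128)).cuda()   # a batch of token ids
vectors = emb(ids)                                 # dtype: torch.float16

# At the time of writing, swapping nn.Embedding for nn.EmbeddingBag here
# failed with the error quoted above.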

#nlp
#deep_learning
Finally! Cool features like SyncBN or CyclicLR are migrating to PyTorch!


2019 DS / ML digest number 8

Highlights of the week
- Transformer from Facebook with sub-word information;
- How to generate endless sentiment annotation;
- 1M breast cancer images;

https://spark-in.me/post/2019_ds_ml_digest_08

#digest
#deep_learning
Using snakeviz for profiling Python code

Why
To profile complicated and convoluted code.
Snakeviz is a cool GUI tool to analyze cProfile profile files.
https://jiffyclub.github.io/snakeviz/

Just launch your code like this:
python3 -m cProfile -o profile_file.cprofile your_script.py

And then just analyze with snakeviz.

GUI

They have a server GUI and a jupyter notebook plugin.
Also you can launch their tool from within a docker container:
snakeviz -s -H 0.0.0.0 profile_file.cprofile
Do not forget to EXPOSE necessary ports. SSH tunnel to a host is also an option.
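A minimal sketch of the container setup (the port is passed explicitly and purely illustrative):

# inside the container
snakeviz -s -H 0.0.0.0 -p 8080 profile_file.cprofile

# on the host: publish the same port when starting the container
docker run -p 8080:8080 <your_image>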

#data_science
2019 DS / ML digest 9

Highlights of the week
- Stack Overflow survey;
- Unsupervised STT (ofc not!);
- A mix between detection and semseg?;

https://spark-in.me/post/2019_ds_ml_digest_09

#digest
#deep_learning
Tricky rsync flags

Rsync is the best program ever.

I find these flags the most useful
--ignore-existing (ignores existing files)
--update (only transfers files that are newer than those on the destination, based on timestamps)
--size-only (uses file-size to compare files)
-e 'ssh -p 22 -i /path/to/private/key' (use custom ssh identity)
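For example, a one-off sync that only copies files missing on the destination (paths and host are illustrative):

rsync -av --ignore-existing -e 'ssh -p 22 -i /path/to/private/key' ./data/ user@example.com:/backup/data/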

Sometimes the first three flags get confusing.

#linux