Spark in me
Lost like tears in rain. DS, ML, a bit of philosophy and math. No bs or ads.
Miniaturize / optimize your ... NLP models?

For CV applications there are literally dozens of ways to make your models smaller.
And yeah, I do not mean some "moonshots" or special limited libraries (matrix decompositions, some custom pruning, etc etc).
I mean cheap and dirty hacks, that work in 95% of cases regardless of your stack / device / framework:
- Smaller images (x3-x4 easy);
- FP16 inference (30-40% maybe);
- Knowledge distillation into smaller networks (x3-x10);
- Naïve cascade optimizations (feed only Nth frame using some heuristic);

But what can you do with NLP networks?
Turns out not much.

But here are my ideas:
- Use a simpler model - embedding bag + plain self-attention + LSTM can solve 90% of tasks;

- Decrease embedding size from 300 to 50 (or maybe even more). Tried and tested, works like a charm. For harder tasks you lose just 1-3pp of your target metric, for smaller tasks - it is just the same;

- FP16 inference is supported in PyTorch for nn.Embedding, but not for nn.EmbeddingBag. But you get the idea;

_embedding_bag is not implemented for type torch.HalfTensor

- You can try distilling your vocabulary / embedding-bag model into a char level model. If it works, you can trade model size vs. inference time;

- If you have very long sentences or large batches - try distilling / swapping your recurrent network with a CNN / TCN. This way you can also trade model size vs. inference time but probably in a different direction;
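
To get a feel for the embedding size and FP16 points above, here is a back-of-the-envelope sketch of how much space a dense embedding matrix takes. The vocabulary size of 100k tokens is purely an assumption for illustration, as is the helper name `embedding_size_mb`; only the 300 -> 50 dimensions and FP16 come from the list above.

```python
def embedding_size_mb(vocab_size, dim, bytes_per_weight=4):
    """Size of a dense embedding matrix in megabytes (1 MB = 1e6 bytes)."""
    return vocab_size * dim * bytes_per_weight / 1e6


VOCAB_SIZE = 100_000  # assumption: a hypothetical 100k-token vocabulary

baseline_mb = embedding_size_mb(VOCAB_SIZE, 300)        # 300-dim, FP32
small_mb = embedding_size_mb(VOCAB_SIZE, 50)            # 50-dim, FP32
small_fp16_mb = embedding_size_mb(VOCAB_SIZE, 50, 2)    # 50-dim, FP16

print(f"300-dim FP32: {baseline_mb:.1f} MB")
print(f" 50-dim FP32: {small_mb:.1f} MB ({baseline_mb / small_mb:.0f}x smaller)")
print(f" 50-dim FP16: {small_fp16_mb:.1f} MB ({baseline_mb / small_fp16_mb:.0f}x smaller)")
```

This only counts the embedding matrix, which in small NLP models tends to dominate the total size anyway.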

#nlp
#deep_learning
Finally! Cool features like SyncBN or CyclicLR migrate to PyTorch!


2019 DS / ML digest number 8

Highlights of the week
- Transformer from Facebook with sub-word information;
- How to generate endless sentiment annotation;
- 1M breast cancer images;

https://spark-in.me/post/2019_ds_ml_digest_08

#digest
#deep_learning
Using snakeviz for profiling Python code

Why
To profile complicated and convoluted code.
Snakeviz is a cool GUI tool to analyze cProfile profile files.
https://jiffyclub.github.io/snakeviz/

Just launch your code like this (your_script.py is a placeholder for your own entry point)
python3 -m cProfile -o profile_file.cprofile your_script.py

And then just analyze with snakeviz.
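
If you cannot launch the whole script under cProfile, the same kind of profile file can be produced from inside Python. A minimal sketch using only the standard library; `slow_sum` is a made-up toy workload and the temp-dir path is an assumption, not something snakeviz requires.

```python
import cProfile
import os
import pstats
import tempfile


def slow_sum(n):
    """Toy workload (hypothetical), just something for the profiler to measure."""
    total = 0
    for i in range(n):
        total += i * i
    return total


# Assumption: write the profile into the system temp dir
profile_path = os.path.join(tempfile.gettempdir(), "profile_file.cprofile")

profiler = cProfile.Profile()
profiler.enable()
result = slow_sum(100_000)
profiler.disable()
profiler.dump_stats(profile_path)  # same binary format that `-o` produces

# Quick text summary without any GUI: top 5 entries by cumulative time
stats = pstats.Stats(profile_path)
stats.sort_stats("cumulative").print_stats(5)
```

The resulting file can then be opened with snakeviz exactly like one produced from the command line.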

GUI

They have a server GUI and a Jupyter notebook plugin.
You can also launch their tool from within a Docker container:
snakeviz -s -H 0.0.0.0 profile_file.cprofile
Do not forget to EXPOSE necessary ports. SSH tunnel to a host is also an option.

#data_science
2019 DS / ML digest 9

Highlights of the week
- Stack Overflow survey;
- Unsupervised STT (ofc not!);
- A mix between detection and semseg?;

https://spark-in.me/post/2019_ds_ml_digest_09

#digest
#deep_learning
Tricky rsync flags

Rsync is the best program ever.

I find these flags the most useful
--ignore-existing (skips files that already exist on the receiving side)
--update (skips files that are newer on the receiving side, based on timestamps)
--size-only (compares files by size only, ignoring timestamps)
-e 'ssh -p 22 -i /path/to/private/key' (use a custom ssh port / identity)

Sometimes the first three flags get confusing.
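
To untangle the first three flags, here is a simplified model of the "should this file be copied?" decision. This is my reading of the rsync man page, not rsync's actual code: it ignores --checksum, --modify-window and everything else, and the names `FileInfo` and `should_transfer` are made up for the sketch.

```python
from typing import NamedTuple, Optional


class FileInfo(NamedTuple):
    size: int   # bytes
    mtime: int  # modification time, seconds


def should_transfer(src: FileInfo, dst: Optional[FileInfo], flag: Optional[str] = None) -> bool:
    """Simplified model: would rsync copy src over dst with the given flag?"""
    if dst is None:
        return True  # nothing on the receiving side yet, always copy
    if flag == "--ignore-existing":
        return False  # anything that already exists is skipped
    if flag == "--size-only":
        return src.size != dst.size  # timestamps are ignored
    if flag == "--update" and dst.mtime > src.mtime:
        return False  # receiver's copy is newer, skip
    # Default "quick check": copy if size or mtime differs
    return src.size != dst.size or src.mtime != dst.mtime


src = FileInfo(size=100, mtime=2000)
older_same_size = FileInfo(size=100, mtime=1000)
newer_other_size = FileInfo(size=80, mtime=3000)

for flag in (None, "--ignore-existing", "--update", "--size-only"):
    print(
        flag,
        should_transfer(src, older_same_size, flag),
        should_transfer(src, newer_other_size, flag),
    )
```

The two test files are picked so that each flag gives a different pair of answers, which is the whole source of the confusion.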

#linux
More about STT, also from us ... soon)
Forwarded from Yuri Baburov
The second experimental guest lecture of the course.
One of the course's seminar instructors, Yuri Baburov, will talk about speech recognition and working with audio.

May 1 at 8:40 Moscow time (12:40 Novosibirsk time, 10:40 PM on April 30 PST).

Deep Learning in Plain Terms 11 - Audio and Speech Recognition (Yuri Baburov)
https://www.youtube.com/watch?v=wm4H2Ym33Io
Poor man's computing cluster

So, when I last checked, Amazon's p3.8xlarge instances cost around US$12 per hour (unless you reserve them for a year). A tower supercomputer from Nvidia costs probably US$40-50k or more (it was announced at around US$69k).

It is not difficult to crunch the numbers and see that 1 month of renting such a machine would cost at least US$8-10k. There will also be the additional cost / problem of actually storing your large datasets. When I last used Amazon - their cheap storage was sloooooow, and fast storage was prohibitively expensive.


So, why am I saying this?


Let's assume (according to my miner friends' experience) - that consumer Nvidia GPUs can work 2-3 years non-stop given proper cooling and care (test before buying!). Also let's assume that 4xTesla V100 is roughly the same as 7-8 * 1080Ti.

Yeah, I know that you will point out at least one reason why this does not hold, but for practical purposes this is fine (yes, I know that Teslas have some cool features like Nvlink).

Now let me drop the bomb - modern professional motherboards often boast 2-3 Ethernet ports. And sometimes you can even get 2x10Gbit/s ports (!!!).

This means that you can actually connect at least 2 (or maybe you can daisy chain them?) machines into a computing cluster.

Now let's crunch the numbers

According to quotes I collected through the years, you can build a cluster roughly equivalent to Amazon's p3.8xlarge for US$10k (but with storage!) with used GPUs (miners sell them like crazy now). If you also buy second-hand drives, motherboards and CPUs, you can lower the cost to US$5k or less.

So, a US$10k cluster that would serve you at least one year (if you test everything properly and take care of it) costs roughly the same as:
- 20-25% of DGX desktop;
- 1 month of renting on Amazon;

Assuming that all the hardware will just break in a year:
- It is 4-5x cheaper than buying from Nvidia;
- It is 10x cheaper than renting;

If you buy everything used, then it is 10x and 20x cheaper!

I would buy that for a dollar!
Ofc you have to invest your free time.

See my calculations here:
http://bit.ly/spark00001
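
The ratios above are easy to re-check. A rough sketch that only uses the round numbers quoted in this post (US$12 per hour, US$40-50k for the Nvidia box, US$10k / US$5k for the DIY cluster); the constant names are mine, and a 30-day month and 365-day year are assumptions.

```python
HOURLY_RENT_USD = 12             # cloud instance price quoted above
NVIDIA_BOX_USD = (40_000, 50_000)  # rough price range quoted above
CLUSTER_USD = 10_000             # DIY cluster with used GPUs
CLUSTER_ALL_USED_USD = 5_000     # DIY cluster with everything bought used

rent_month = HOURLY_RENT_USD * 24 * 30   # running non-stop for 30 days
rent_year = HOURLY_RENT_USD * 24 * 365   # running non-stop for a year

print(f"1 month of renting: US${rent_month:,}")
print(f"1 year of renting:  US${rent_year:,}")

for name, cost in (("used GPUs", CLUSTER_USD), ("all used", CLUSTER_ALL_USED_USD)):
    vs_nvidia = [price / cost for price in NVIDIA_BOX_USD]
    vs_rent = rent_year / cost
    print(
        f"{name}: {vs_nvidia[0]:.0f}-{vs_nvidia[1]:.0f}x cheaper than Nvidia, "
        f"{vs_rent:.1f}x cheaper than a year of renting"
    )
```

This leaves out electricity, your own time and storage on the cloud side, so treat it as an order-of-magnitude check only.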

#deep_learning
#hardware
Russian Open Speech To Text (STT/ASR) Dataset
4000 hours of STT data in Russian

Made by us. Yes, really. I am not joking.
It was a lot of work.

The dataset:
https://github.com/snakers4/open_stt/

Accompanying post:
https://spark-in.me/post/russian-open-stt-part1

TLDR:
- As of the third release, we have ~4000 hours;
- Contributors and help wanted;
- Let's bring the Imagenet moment in STT closer together!;

Please repost this as much as you can.

#stt
#asr
#data_science
#deep_learning
PyTorch DP / DDP / model parallel

Finally they made proper tutorials:
- https://pytorch.org/tutorials/beginner/blitz/data_parallel_tutorial.html
- https://pytorch.org/tutorials/intermediate/model_parallel_tutorial.html
- https://pytorch.org/tutorials/intermediate/ddp_tutorial.html

Model parallel = have parts of the same model on different devices
Data Parallel (DP) = wrapper to use multi-GPU within a single parent process
Distributed Data Parallel = multiple processes are spawned across cluster / on the same machine

#deep_learning