Dockerfile
Updated my DL/ML dockerfile with:
- CUDA 10
- PyTorch 1.0
https://github.com/snakers4/gpu-box-setup/
TF now also works with CUDA 10
#deep_learning
Miniaturize / optimize your ... NLP models?
For CV applications there are literally dozens of ways to make your models smaller.
And yeah, I do not mean some "moonshots" or special limited libraries (matrix decompositions, some custom pruning, etc.).
I mean cheap and dirty hacks that work in 95% of cases regardless of your stack / device / framework:
- Smaller images (x3-x4 easy);
- FP16 inference (30-40% maybe);
- Knowledge distillation into smaller networks (x3-x10);
- Naïve cascade optimizations (feed only every Nth frame using some heuristic);
But what can you do with NLP networks?
Turns out, not much.
But here are my ideas:
- Use a simpler model - an embedding bag + plain self-attention + LSTM can solve 90% of tasks (a minimal sketch is below);
- Decrease the embedding size from 300 to 50 (or maybe even lower). Tried and tested, works like a charm. For harder tasks you lose just 1-3 pp of your target metric, for simpler tasks it stays the same;
- FP16 inference is supported in PyTorch for nn.Embedding, but not for nn.EmbeddingBag (you get "_embedding_bag is not implemented for type torch.HalfTensor"). But you get the idea;
- You can try distilling your vocabulary / embedding-bag model into a char-level model. If it works, you can trade model size vs. inference time;
- If you have very long sentences or large batches - try distilling / swapping your recurrent network for a CNN / TCN. This way you also trade model size vs. inference time, but probably in a different direction;
#nlp
#deep_learning
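A minimal sketch of the kind of small model meant above (all sizes and names are illustrative, not a reference implementation; plain nn.Embedding with dim 50 is used instead of an embedding bag to keep the example short):

import torch
import torch.nn as nn

class TinyTextClassifier(nn.Module):
    # illustrative sizes: 50-dim embeddings instead of 300, one small
    # bidirectional LSTM, plain additive self-attention pooling on top
    def __init__(self, vocab_size=30000, emb_dim=50, hidden=128, n_classes=2):
        super().__init__()
        self.emb = nn.Embedding(vocab_size, emb_dim, padding_idx=0)
        self.lstm = nn.LSTM(emb_dim, hidden, batch_first=True, bidirectional=True)
        self.att = nn.Linear(2 * hidden, 1)    # one attention score per time step
        self.head = nn.Linear(2 * hidden, n_classes)

    def forward(self, tokens):                 # tokens: (batch, seq_len) int64
        x = self.emb(tokens)                   # (batch, seq_len, emb_dim)
        x, _ = self.lstm(x)                    # (batch, seq_len, 2 * hidden)
        w = torch.softmax(self.att(x), dim=1)  # (batch, seq_len, 1) attention weights
        pooled = (w * x).sum(dim=1)            # weighted average over time
        return self.head(pooled)

model = TinyTextClassifier()
logits = model(torch.randint(1, 30000, (4, 70)))  # 4 sequences of 70 token ids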
Finally! Cool features like SyncBN or CyclicLR are migrating to PyTorch!
Forwarded from Just links
2019 DS / ML digest number 8
Highlights of the week
- Transformer from Facebook with sub-word information;
- How to generate endless sentiment annotation;
- 1M breast cancer images;
https://spark-in.me/post/2019_ds_ml_digest_08
#digest
#deep_learning
PyTorch DataParallel scalability
TLDR: it works fine for 2-3 GPUs.
For more GPUs use DistributedDataParallel (DDP) - see the minimal sketch below the links.
https://github.com/NVIDIA/sentiment-discovery/blob/master/analysis/scale.md
https://github.com/SeanNaren/deepspeech.pytorch/issues/211
#deep_learning
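A minimal illustration of the difference (model and device ids are placeholders; the DDP line is only sketched, since it needs an initialized process group):

import torch.nn as nn

model = nn.Linear(512, 10).cuda()

# DataParallel: a single process, gradients are gathered on the master GPU -
# simple, but scales well only up to ~2-3 GPUs
dp_model = nn.DataParallel(model, device_ids=[0, 1])

# DistributedDataParallel: one process per GPU (launched via init_process_group
# / mp.spawn / torch.distributed.launch), scales much better:
# ddp_model = nn.parallel.DistributedDataParallel(model, device_ids=[local_rank])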
Using snakeviz for profiling Python code
Why?
To profile complicated and convoluted code.
Snakeviz is a cool GUI tool to analyze cProfile profile files.
https://jiffyclub.github.io/snakeviz/
Just launch your code like this:
python3 -m cProfile -o profile_file.cprofile your_script.py
And then just analyze it with snakeviz.
GUI
They have a server GUI and a Jupyter notebook plugin.
You can also launch their tool from within a docker container:
snakeviz -s -H 0.0.0.0 profile_file.cprofile
Do not forget to EXPOSE the necessary ports. An SSH tunnel to the host is also an option (see the example below).
#data_science
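For example, something like this should work (the image name, paths and port are placeholders, and snakeviz must be installed inside the image):

docker run -it -p 8080:8080 -v $(pwd):/work my_image snakeviz -s -H 0.0.0.0 -p 8080 /work/profile_file.cprofile

Then open http://localhost:8080 in your browser (or tunnel that port over SSH).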
Archive team ... makes monthly Twitter archives
With all the BS around politics / "Russian hackers" / the Arab spring, Twitter has now closed its developer API.
No problem.
Just pay a visit to the Archive Team page:
https://archive.org/details/twitterstream?and[]=year%3A%222018%22
Donate to them here:
https://archive.org/donate/
#data_science
#nlp
Cool docker feature
View aggregate load stats by container:
https://docs.docker.com/engine/reference/commandline/stats/
#linux
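For example, to get a one-off snapshot instead of a live stream (the format template is optional):

docker stats --no-stream
docker stats --no-stream --format "table {{.Name}}\t{{.CPUPerc}}\t{{.MemUsage}}"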
2019 DS / ML digest 9
Highlights of the week
- Stack Overflow survey;
- Unsupervised STT (ofc not!);
- A mix between detection and semseg?;
https://spark-in.me/post/2019_ds_ml_digest_09
#digest
#deep_learning
Tricky rsync flags
Rsync is the best program ever.
I find these flags the most useful:
--ignore-existing (skip files that already exist on the receiver)
--update (skip files that are newer on the receiver, i.e. only transfer newer versions, based on timestamps)
--size-only (compare files by size only, ignoring timestamps)
-e 'ssh -p 22 -i /path/to/private/key' (use a custom ssh port / identity)
The first three flags are easy to confuse; a combined example is below.
#linux
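A combined example (host, key and paths are placeholders):

rsync -avP --update -e 'ssh -p 22 -i ~/.ssh/my_key' /data/datasets/ user@gpu-box:/data/datasets/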
Forwarded from Yuri Baburov
The second experimental guest lecture of the course.
One of the course's seminar instructors, Yuri Baburov, will talk about speech recognition and working with audio.
May 1st at 8:40 Moscow time (12:40 Novosibirsk, 10:40 PM on April 30th PST).
Deep Learning на пальцах 11 - Audio and Speech Recognition (Yuri Baburov)
https://www.youtube.com/watch?v=wm4H2Ym33Io
Poor man's computing cluster
So, when I last checked, Amazon's p3.8xlarge instances (4x Tesla V100) cost around US$12 per hour (unless you reserve them for a year). A tower supercomputer from Nvidia probably costs US$40-50k or more (it was announced at around US$69k).
It is not difficult to crunch the numbers and see that 1 month of renting such a machine would cost at least US$8-10k. There is also the additional cost / problem of actually storing your large datasets. When I last used Amazon, their cheap storage was sloooooow, and fast storage was prohibitively expensive.
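Rough arithmetic behind that number: US$12 / hour * 24 hours * 30 days ≈ US$8,600 per month, before storage and traffic.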
So, why am I saying this?
Let's assume (according to my miner friends' experience) that consumer Nvidia GPUs can work 2-3 years non-stop given proper cooling and care (test before buying!). Also let's assume that 4x Tesla V100 are roughly the same as 7-8x 1080Ti.
Yeah, I know that you will point out at least one reason why this does not hold, but for practical purposes this is fine (yes, I know that Teslas have some cool features like NVLink).
Now here is the kicker: modern professional motherboards often boast 2-3 Ethernet ports. And sometimes you can even get 2x 10Gbit/s ports (!!!).
This means that you can actually connect at least 2 machines (or maybe even daisy-chain them?) into a computing cluster.
Now let's crunch the numbers.
According to quotes I have collected over the years, you can build a cluster roughly equivalent to Amazon's p3.8xlarge for US$10k (but with storage!) using used GPUs (miners are selling them like crazy now). If you buy second-hand drives, motherboards and CPUs, you can lower the cost to US$5k or less.
So a cluster that would serve you at least one year (if you test everything properly and take care of it), costing US$10k, is roughly equivalent to:
- 20-25% of a DGX desktop;
- 1 month of renting on Amazon;
Assuming that all the hardware just breaks after a year:
- It is 4-5x cheaper than buying from Nvidia;
- It is 10x cheaper than renting;
If you buy everything used, then it is 10x and 20x cheaper!
I would buy that for a dollar!
Ofc you have to invest your free time.
See my calculations here:
http://bit.ly/spark00001
#deep_learning
#hardware
Russian Open Speech To Text (STT/ASR) Dataset
4000 hours of STT data in Russian
Made by us. Yes, really. I am not joking.
It was a lot of work.
The dataset:
https://github.com/snakers4/open_stt/
Accompanying post:
https://spark-in.me/post/russian-open-stt-part1
TLDR:
- As of the third release, we have ~4000 hours;
- Contributors and help wanted;
- Let's bring the ImageNet moment in STT closer, together!
Please repost this as much as you can.
#stt
#asr
#data_science
#deep_learning
PyTorch
PyTorch 1.1
https://github.com/pytorch/pytorch/releases/tag/v1.1.0
- TensorBoard (beta);
- DistributedDataParallel new functionality and tutorials;
- Multi-headed attention;
- EmbeddingBag enhancements;
- Other cool, but more niche features (a short usage sketch is below):
  - nn.SyncBatchNorm;
  - optim.lr_scheduler.CyclicLR;
#deep_learning
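A short usage sketch of the two niche features (model, optimizer and hyperparameters are placeholders; SyncBatchNorm additionally needs an initialized process group and DDP, so it is only shown as a comment):

import torch
import torch.nn as nn

model = nn.Sequential(nn.Conv2d(3, 16, 3), nn.BatchNorm2d(16), nn.ReLU())

# nn.SyncBatchNorm - convert ordinary BatchNorm layers so that statistics are
# synchronized across DDP processes (requires torch.distributed to be initialized):
# model = nn.SyncBatchNorm.convert_sync_batchnorm(model)

# optim.lr_scheduler.CyclicLR - cycle the learning rate between base_lr and max_lr
opt = torch.optim.SGD(model.parameters(), lr=1e-3, momentum=0.9)
sched = torch.optim.lr_scheduler.CyclicLR(opt, base_lr=1e-4, max_lr=1e-2, step_size_up=2000)

for _ in range(10):   # toy loop: the scheduler is stepped after each optimizer step
    opt.step()
    sched.step()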
PyTorch DP / DDP / model parallel
Finally they made proper tutorials:
- https://pytorch.org/tutorials/beginner/blitz/data_parallel_tutorial.html
- https://pytorch.org/tutorials/intermediate/model_parallel_tutorial.html
- https://pytorch.org/tutorials/intermediate/ddp_tutorial.html
Model parallel = parts of the same model live on different devices;
Data Parallel (DP) = a wrapper to use multiple GPUs within a single parent process;
Distributed Data Parallel (DDP) = multiple processes are spawned across a cluster / on the same machine (see the minimal sketch below).
#deep_learning
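A minimal DDP sketch along the lines of the tutorials above (gloo backend and CPU tensors so it runs anywhere; the address, port and model are placeholders, with GPUs you would use nccl and pass device_ids):

import os
import torch
import torch.distributed as dist
import torch.multiprocessing as mp
import torch.nn as nn

def worker(rank, world_size):
    # one process per worker
    os.environ["MASTER_ADDR"] = "127.0.0.1"
    os.environ["MASTER_PORT"] = "29500"
    dist.init_process_group("gloo", rank=rank, world_size=world_size)
    model = nn.Linear(10, 2)
    ddp_model = nn.parallel.DistributedDataParallel(model)
    out = ddp_model(torch.randn(8, 10))
    out.sum().backward()               # gradients are all-reduced across processes
    dist.destroy_process_group()

if __name__ == "__main__":
    mp.spawn(worker, args=(2,), nprocs=2)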
The State of ML, end of 2018, in Russian
Quite a down-to-earth and clever lecture (Sergey Markov, "Artificial intelligence and machine learning: the results of 2018").
https://www.youtube.com/watch?v=l6djLCYnOKw
Some nice examples for TTS and some interesting forecasts (some of them happened already).
#deep_learning