Some Additional Thoughts on DDP

The DDP docs say that you cannot run multiple DDP processes on one GPU (otherwise you would have to use their RPC framework, which is a bit too much hassle and complication, at least for me personally for now).

Turns out you can (see the sketch after the list below). But the speed-up was negligible in my case:

- GPU utilization went from 70-80% with 1 process per GPU to 90-100% with 2;
- Total epoch time decreased by 3-5%;
- Interestingly, I tried 2 DDP workers on 2 GPUs vs 4 DDP workers on 2 GPUs vs 3 DDP workers on 2 GPUs (1 on the master GPU, 2 on the other), and 3 workers were much slower, so it is probably a compute bottleneck rather than a communication bottleneck (we will see with Ampere GPUs!);
- Following advice from Nvidia, I also tried MPS (which is supposed to help several processes run smoothly on one GPU), but I just could not make it work with DDP; it failed with cryptic errors, at first after torch.cuda.empty_cache() and then just randomly. Sad times;
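
For context, here is a minimal sketch (not from the original post) of what "several DDP workers per GPU" means here: 4 workers on a single node mapped onto 2 GPUs via rank % num_gpus. The toy model and sizes are made up, and newer PyTorch / NCCL builds may refuse this setup outright (see the NCCL error in a later post below):

import os

import torch
import torch.distributed as dist
import torch.multiprocessing as mp
from torch.nn.parallel import DistributedDataParallel as DDP


def worker(rank, world_size, num_gpus):
    os.environ["MASTER_ADDR"] = "127.0.0.1"
    os.environ["MASTER_PORT"] = "29500"
    dist.init_process_group("nccl", rank=rank, world_size=world_size)

    device = rank % num_gpus              # two ranks share each physical GPU
    torch.cuda.set_device(device)

    model = torch.nn.Linear(512, 512).cuda(device)
    ddp_model = DDP(model, device_ids=[device])

    x = torch.randn(32, 512, device=device)
    ddp_model(x).sum().backward()         # gradients are all-reduced across all 4 workers

    dist.destroy_process_group()


if __name__ == "__main__":
    world_size, num_gpus = 4, 2           # 4 DDP workers on 2 GPUs
    mp.spawn(worker, args=(world_size, num_gpus), nprocs=world_size)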

#deep_learning
First Experience With 3090 GPUs

(0)
Under 100% load they are indeed 15-20 degrees cooler.

(1)
Lol, gpu-burn shows strange results with default settings: roughly half the GFLOPS of a 1080 Ti.

./gpu_burn 600:

- 1080 Ti 8,000 - 8,500
- Titan X (Maxwell) ~4,300
- 3090 (Ampere) ~3,000

./gpu-burn -tc 600:

- 3090 (Ampere) ~3,000

Idk, maybe it's me, maybe it's gpu-burn itself; need to test on real tasks!

PS
I had an old image; maybe bumping CUDA / cuDNN will help.

#deep_learning
Update

After migrating to CUDA 11 and cuDNN 8, the numbers now look like this:

./gpu_burn 120

- 1080 Ti 8,000 - 8,500
- Titan X (Maxwell) ~4,300
- 3090 (Ampere) ~16,500

./gpu-burn -tc 120

- 3090 (Ampere) ~38,500

Magic
Has anyone used this - https://t.me/snakers4/2351 - and is it worth an update? (Also please comment.)
Anonymous Poll
- Yes: 13%
- Just dockerfiles: 6%
- No: 21%
- What is this?: 60%
Some More Observations About 3090

- torch.cuda.empty_cache() does not seem to do anything for networks with variable depth / sequence length / girth

- DDP + AMP ... seems 3x slower instead of 2x faster (lol) for some networks; we are looking into the cause

- For some networks, a 2x speed bump from AMP out of the box

- DDP now prevents me from using 2 processes on 1 GPU, failing with:

RuntimeError: NCCL error in: /opt/conda/conda-bld/pytorch_1603729096996/work/torch/lib/c10d/ProcessGroupNCCL.cpp:784, invalid usage, NCCL version 2.7.8

- Looks like they are much more efficient at parallelizing and keeping utilization high (80-100%); the same networks train ~2x-3x faster compared to Titan X (Maxwell) and 1080 Ti without any tweaks to the code

- The same networks use more GPU memory on the 3090 compared to the 1080 Ti (?)

- I was kind of afraid that these cards would be under-utilized (~50%), but they are just faster. Magic


#deep_learning
2020 DS / ML Digest 13

Highlights:

- Silero models now include an experimental Ukrainian model
- CV inference 101
- High-Resolution 3D Human Digitization
- Background Features in Google Meet
- How to Build an Open-Domain Question Answering System?
- A case for … Keeping encryption elitist
- Objectron dataset
- See the above posts about 3090 ... and hopefully new posts comparing Titan X / 1080 Ti / 3090 / A100 =)

Please like / share / repost!

https://spark-in.me/post/2020_ds_ml_digest_13

#digest
First Experience With A100 GPUs

(0)
Under 100% load they are indeed 15-20 degrees cooler, i.e. 60-70°C (similar to the 3090).

(1)
./gpu_burn 120

- 1080 Ti 8,000 - 8,500
- Titan X (Maxwell) ~4,300
- 3090 (Ampere) ~16,500
- A100 (w/o MIG) ~16,700 Gflop/s

./gpu-burn -tc 120

- 3090 (Ampere) ~38,500
- A100 (w/o MIG) ~81,500 Gflop/s

(2)
Using MIG is kind of straightforward, but obviously it does not work properly with gpu-burn out of the box.

The most interesting thing, of course, is to test 2 / 3 / 7-instance MIG setups against 2x 3090 / 1080 Ti / Titan X (see the sketch below for targeting a MIG instance).
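
For reference, a minimal sketch (not from the post) of pointing a PyTorch process at a single MIG instance. It assumes MIG mode is already enabled and instances have been created; the device string is a placeholder to be copied from nvidia-smi -L, and its exact format depends on the driver version:

import os

# Placeholder MIG device string taken from `nvidia-smi -L`; on CUDA 11.0-era drivers
# it looks like MIG-GPU-<GPU-UUID>/<GPU instance id>/<compute instance id>
os.environ["CUDA_VISIBLE_DEVICES"] = "MIG-GPU-xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx/1/0"

import torch  # imported after setting the env var so CUDA sees only the MIG slice

assert torch.cuda.device_count() == 1  # each process gets exactly one MIG instance
x = torch.randn(4096, 4096, device="cuda")
print((x @ x).norm())                  # some work running on the MIG slice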

#deep_learning
Translate into English?
Anonymous Poll
- Yes: 31%
- No: 45%
- Google Translate works fine: 24%
Getting The Most Out of AMP

We have been digging deep into how to utilize AMP properly. Surprise, surprise:

- It works better with large networks and wide networks
- It works poorly with separable convolutions
- You need slightly more involved design considerations than just "have your channels divisible by 8":

For matrix multiplication:
On FP16 inputs, all three dimensions (M, N, K) must be multiples of 8.

For convolution:
On FP16 inputs, input and output channels must be multiples of 8.

Also:

Prefer dense math operations.
For example, vanilla convolutions have much higher arithmetic intensity than depth-wise separable convolutions.
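
To make the arithmetic-intensity point concrete, a small illustrative comparison (the layer sizes are arbitrary):

import torch.nn as nn

# Dense 3x3 convolution: 64 * 64 * 3 * 3 ≈ 37k MACs per output position
dense = nn.Conv2d(64, 64, kernel_size=3, padding=1)

# Depthwise-separable equivalent: 64 * 3 * 3 + 64 * 64 ≈ 4.7k MACs per output
# position, i.e. far fewer FLOPs per byte of weights / activations moved,
# hence lower arithmetic intensity
separable = nn.Sequential(
    nn.Conv2d(64, 64, kernel_size=3, padding=1, groups=64),  # depthwise
    nn.Conv2d(64, 64, kernel_size=1),                        # pointwise
)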

Also:

- Choose mini-batch to be a multiple of 8
- Choose linear layer dimensions to be a multiple of 8
- Choose convolution layer channel counts to be a multiple of 8
- For classification problems, pad vocabulary to be a multiple of 8
- For sequence problems, pad the sequence length to be a multiple of 8

Please see:
- https://docs.nvidia.com/deeplearning/performance/mixed-precision-training/index.html
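
Putting the guidelines above together, a minimal sketch of a single AMP training step with torch.cuda.amp, where the batch size, channel counts and linear dimensions are all multiples of 8 (the toy model, shapes and learning rate are purely illustrative):

import torch
import torch.nn as nn

model = nn.Sequential(
    nn.Conv1d(64, 128, kernel_size=3, padding=1),  # in/out channels % 8 == 0
    nn.ReLU(),
    nn.AdaptiveAvgPool1d(1),
    nn.Flatten(),
    nn.Linear(128, 256),                           # linear dims % 8 == 0
).cuda()

optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
scaler = torch.cuda.amp.GradScaler()

x = torch.randn(32, 64, 1024, device="cuda")       # batch size 32, % 8 == 0
y = torch.randint(0, 256, (32,), device="cuda")

optimizer.zero_grad()
with torch.cuda.amp.autocast():                    # FP16 where safe, FP32 elsewhere
    loss = nn.functional.cross_entropy(model(x), y)
scaler.scale(loss).backward()                      # scale loss to avoid FP16 underflow
scaler.step(optimizer)
scaler.update()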


#deep_learning
GitHub Discussions

Microsoft, among many other enterprise-y things, launched GitHub Discussions, which to my amazement does not suck.

95% of popular ML projects are on GitHub, and most of them currently use the issues tab for 3 things:

(0) Semi-automated or structured team communication
(1) Questions / ideas / collaboration by the community
(2) If the authors have not invested enough time in proper usability and docs, sometimes you have to read / search all of the issues manually

Given enough development and effort from the community, this thing can probably solve (2) and tap into the crowd knowledge that currently gets lost over time.

It also looks like a stab at things like Discourse, but free and integrated with your repo and your code. Discourse, however, is probably used by 1% of projects; too often you see a nice project with a mess or tumbleweeds in the issues tab.

One final question, of course: when will Microsoft start using it as a means of censorship, as all large American companies inevitably do?
Speeding Up Your PyTorch Networks for CPU Inference

Key ingredients:

- PyTorch native network
- CPU inference / deploy
- JIT, ONNX, int8 quantization

Some notes on how much you can speed up your networks mostly out of the box with very few tweaks. These conclusions hold for very small networks (1M params, 10-30 layers) and medium-sized networks (20M params, 20-40 layers):

- Just using JIT can give you up to a 30% boost. With smaller batch sizes (and feature-map sizes) the boost is smaller, 5-10%, and it saturates beyond a certain batch size / feature-map size;

- Just using int8 quantization can give you up to a 30% boost. Same caveats as with JIT;

- Same with JIT + int8: total speed-ups up to 50%, with more even speed-ups for small batches and feature maps;

- Using ONNX, however, is generally faster than PyTorch out of the box, but the effect is most pronounced for small feature maps, e.g. you can get a 40% speed-up for a small batch and zero speed-up for a large batch;

- ONNX + int8 does not seem to work in PyTorch now. We have not tried porting networks manually from ONNX to quantized ONNX;

We are not comparing apples to apples here, but ONNX inference with quantization seems the most promising option, given its wide back-end support.
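
A minimal sketch of the options above on a toy model (the model, shapes and file name are illustrative; int8 here means dynamic quantization, the simplest variant, which covers Linear / LSTM layers):

import torch
import torch.nn as nn
import onnxruntime as ort

# Toy CPU model in eval mode, since we only care about inference
model = nn.Sequential(nn.Linear(256, 256), nn.ReLU(), nn.Linear(256, 64)).eval()
example = torch.randn(8, 256)

# 1) TorchScript via tracing
jit_model = torch.jit.trace(model, example)

# 2) int8 dynamic quantization (quantizes Linear / LSTM weights to int8)
int8_model = torch.quantization.quantize_dynamic(model, {nn.Linear}, dtype=torch.qint8)

# 3) ONNX export + onnxruntime inference
torch.onnx.export(model, example, "model.onnx", opset_version=11,
                  input_names=["input"], output_names=["output"])
sess = ort.InferenceSession("model.onnx")
onnx_out = sess.run(None, {"input": example.numpy()})[0]

with torch.no_grad():
    ref = model(example)
print(torch.allclose(torch.from_numpy(onnx_out), ref, atol=1e-4))  # sanity check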

#deep_learning
mT5: A massively multilingual pre-trained text-to-text transformer
Linting Xue, Noah Constant, Adam Roberts, Mihir Kale, Rami Al-Rfou, Aditya Siddhant, Aditya Barua, Colin Raffel
Paper: https://arxiv.org/abs/2010.11934
Code: https://github.com/google-research/multilingual-t5

Another fresh large-scale work. Without much noise or hype, Google has released a multilingual variant of its T5, called mT5.

T5 (https://arxiv.org/abs/1910.10683) is, for some reason, a rarely mentioned work, and we are guilty of that ourselves: we have hardly written about it. Time to fix that.

A year ago Google did a huge amount of work. They collected a huge 750 GB dataset, the "Colossal Clean Crawled Corpus" (C4). They built a large full transformer (encoder + decoder) and pre-trained it in an unsupervised, or rather self-supervised, regime with a denoising objective similar to BERT's MLM, except that small spans are masked out instead of individual tokens. The resulting model was then fine-tuned on tasks from GLUE/SuperGLUE, SQuAD, WMT, etc.

Interestingly, the model is a full-fledged seq2seq, i.e. even classification tasks were reduced to generating a textual class label, for example "entailment", "contradiction" or "neutral" on MNLI, and regression tasks to producing a string like "2.6". They named the model "Text-to-Text Transfer Transformer", or T5 for short (still a long way to go to T-800 or T-1000).

The enormity of the work was that the researchers tested a whole pile of hypotheses: which objective is better, which model structure, how many training tokens are needed, how the model scales along different dimensions, etc., and picked the best options. They got SoTA in various places, as usual.

They trained and released (https://github.com/google-research/text-to-text-transfer-transformer#released-model-checkpoints) models of several sizes: 60M, 220M, 770M, 3B and 11B parameters (as a reminder, the largest GPT-2 at that moment was 774M, and the 1.5B model was only released after this paper, https://openai.com/blog/gpt-2-1-5b-release/, in November 2019).

Note that Google released the trained 11B-parameter model right away! And without shouting about it the way OpenAI did. In fact, they have been talking about all of this rather modestly to this day.

So, the other day it was the multilingual model's turn: mT5. A dataset was collected for it as well (now called mC4), covering 107 languages (although 6 of them are Latin-script variants of a main language; amusingly, ru-Latn is among them). Russian, by the way, is in second place in this dataset, with 713B tokens collected! And we really do have something to be proud of with the Russian language: on the web it is second only to English: https://en.wikipedia.org/wiki/Languages_used_on_the_Internet#Content_languages_for_websites.

The dataset is published here: https://www.tensorflow.org/datasets/catalog/c4#c4multilingual_nights_stay

Then mT5 was trained on this dataset following the best recipes from T5 (more precisely, from T5.1.1). The language sampling parameters were chosen carefully so that rare languages get trained but not overfit. They did not overthink the model itself and tried to follow the original recipes. Along the way they experimented with various ablations, described the effects and picked the best parameters.

They also published (https://github.com/google-research/multilingual-t5#released-model-checkpoints) a set of models: 300M, 600M, 1B, 4B and 13B (mT5 has a larger vocabulary than the English-only T5, hence the larger model sizes).

They evaluated on tasks from the XTREME benchmark, again got SoTA and beat conceptually similar models such as mBERT and XLM/XLM-R.

All in all, this is mega-cool. We now have a set of models of various sizes (up to really huge ones that will not be that easy to run), supporting 101 languages and ready for fine-tuning on more or less any seq2seq task. Respect to Google!
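
Not from the post, but for completeness: a minimal sketch of loading an mT5 checkpoint for seq2seq fine-tuning via the HuggingFace transformers library (assuming a transformers version with mT5 support and the public google/mt5-small checkpoint):

from transformers import AutoTokenizer, MT5ForConditionalGeneration

tokenizer = AutoTokenizer.from_pretrained("google/mt5-small")
model = MT5ForConditionalGeneration.from_pretrained("google/mt5-small")

# Text-to-text: both inputs and targets are plain strings, whatever the task
inputs = tokenizer("A source sentence in any of the 101 languages.",
                   return_tensors="pt")
labels = tokenizer("The target text for this example.",
                   return_tensors="pt").input_ids

# One fine-tuning step worth of loss; plug this into your training loop of choice
loss = model(input_ids=inputs.input_ids,
             attention_mask=inputs.attention_mask,
             labels=labels).loss
loss.backward()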