Some Additional Thoughts on DDP

The DDP docs say that you cannot run multiple DDP processes on one GPU (otherwise you would have to use their RPC framework, which is a bit too much hassle and complication, at least for me personally for now).

Turns out you can (see the sketch after the list below). But the speed-up was negligible in my case:

- GPU utilization went from 70-80% with 1 process per GPU to 90-100% with 2;
- Total epoch time decreased by 3-5%;
- Interestingly, I tried 2 DDP workers on 2 GPUs vs 4 DDP workers on 2 GPUs vs 3 DDP workers on 2 GPUs (1 on the master GPU, 2 on the other), and 3 workers were much slower, so it is probably a compute bottleneck rather than a communication bottleneck (we will see with Ampere GPUs!);
- Following advice from Nvidia, I also tried MPS (which is supposed to help several processes run smoothly on one GPU), but I just could not make it work with DDP; it failed with cryptic errors, at first after torch.cuda.empty_cache() and then just randomly. Sad times;
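
For context, here is a minimal sketch (not from the original post) of what "several DDP workers per GPU" means here: 4 workers on a single node mapped onto 2 GPUs via rank % num_gpus. The toy model and sizes are made up, and newer PyTorch / NCCL builds may refuse this setup outright (see the NCCL error in a later post below):

import os

import torch
import torch.distributed as dist
import torch.multiprocessing as mp
from torch.nn.parallel import DistributedDataParallel as DDP


def worker(rank, world_size, num_gpus):
    os.environ["MASTER_ADDR"] = "127.0.0.1"
    os.environ["MASTER_PORT"] = "29500"
    dist.init_process_group("nccl", rank=rank, world_size=world_size)

    device = rank % num_gpus              # two ranks share each physical GPU
    torch.cuda.set_device(device)

    model = torch.nn.Linear(512, 512).cuda(device)
    ddp_model = DDP(model, device_ids=[device])

    x = torch.randn(32, 512, device=device)
    ddp_model(x).sum().backward()         # gradients are all-reduced across all 4 workers

    dist.destroy_process_group()


if __name__ == "__main__":
    world_size, num_gpus = 4, 2           # 4 DDP workers on 2 GPUs
    mp.spawn(worker, args=(world_size, num_gpus), nprocs=world_size)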

#deep_learning
First Experience With 3090 GPUs

(0)
Under 100% load they are indeed 15-20 degrees cooler.

(1)
Lol, gpu-burn shows strange results with default settings: roughly half the GFLOPS of a 1080 Ti.

./gpu_burn 600:

- 1080 Ti 8,000 - 8,500
- Titan X (Maxwell) ~4,300
- 3090 (Ampere) ~3,000

./gpu-burn -tc 600:

- 3090 (Ampere) ~3,000

Idk, maybe it's me, maybe it's gpu-burn itself; need to test on real tasks!

PS
I had an old image; maybe bumping CUDA / cuDNN will help.

#deep_learning
Update

After migrating to CUDA 11 and cuDNN 8, the numbers now look like this:

./gpu_burn 120

- 1080 Ti 8,000 - 8,500
- Titan X (Maxwell) ~4,300
- 3090 (Ampere) ~16,500

./gpu-burn -tc 120

- 3090 (Ampere) ~38,500

Magic
Has anyone used this - https://t.me/snakers4/2351 - and is it worth an update? (Also please comment.)
Anonymous Poll
- Yes: 13%
- Just dockerfiles: 6%
- No: 21%
- What is this?: 60%
Some More Observations About 3090

- torch.cuda.empty_cache() does not seem to do anything for networks with variable depth / sequence length / girth

- DDP + AMP ... seems 3x slower instead of 2x faster (lol) for some networks; we are looking into the cause

- For some networks, a 2x speed bump from AMP out of the box

- DDP now prevents me from using 2 processes on 1 GPU, failing with:

RuntimeError: NCCL error in: /opt/conda/conda-bld/pytorch_1603729096996/work/torch/lib/c10d/ProcessGroupNCCL.cpp:784, invalid usage, NCCL version 2.7.8

- Looks like they are much more efficient at parallelizing and keeping utilization high (80-100%); the same networks train ~2x-3x faster compared to Titan X (Maxwell) and 1080 Ti without any tweaks to the code

- The same networks use more GPU memory on the 3090 compared to the 1080 Ti (?)

- I was kind of afraid that these cards would be under-utilized (~50%), but they are just faster. Magic


#deep_learning
2020 DS / ML Digest 13

Highlights:

- Silero models now include an experimental Ukrainian model
- CV inference 101
- High-Resolution 3D Human Digitization
- Background Features in Google Meet
- How to Build an Open-Domain Question Answering System?
- A case for … Keeping encryption elitist
- Objectron dataset
- See the above posts about 3090 ... and hopefully new posts comparing Titan X / 1080 Ti / 3090 / A100 =)

Please like / share / repost!

https://spark-in.me/post/2020_ds_ml_digest_13

#digest
First Experience With A100 GPUs

(0)
Under 100% load they are indeed 15-20 degrees cooler, i.e. 60-70°C (similar to the 3090).

(1)
./gpu_burn 120

- 1080 Ti 8,000 - 8,500
- Titan X (Maxwell) ~4,300
- 3090 (Ampere) ~16,500
- A100 (w/o MIG) ~16,700 Gflop/s

./gpu-burn -tc 120

- 3090 (Ampere) ~38,500
- A100 (w/o MIG) ~81,500 Gflop/s

(2)
Using MIG is kind of straightforward, but obviously it does not work properly with gpu-burn out of the box.

The most interesting thing, of course, is to test 2 / 3 / 7-instance MIG setups against 2x 3090 / 1080 Ti / Titan X (see the sketch below for targeting a MIG instance).
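
For reference, a minimal sketch (not from the post) of pointing a PyTorch process at a single MIG instance. It assumes MIG mode is already enabled and instances have been created; the device string is a placeholder to be copied from nvidia-smi -L, and its exact format depends on the driver version:

import os

# Placeholder MIG device string taken from `nvidia-smi -L`; on CUDA 11.0-era drivers
# it looks like MIG-GPU-<GPU-UUID>/<GPU instance id>/<compute instance id>
os.environ["CUDA_VISIBLE_DEVICES"] = "MIG-GPU-xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx/1/0"

import torch  # imported after setting the env var so CUDA sees only the MIG slice

assert torch.cuda.device_count() == 1  # each process gets exactly one MIG instance
x = torch.randn(4096, 4096, device="cuda")
print((x @ x).norm())                  # some work running on the MIG slice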

#deep_learning
Translate into English?
Anonymous Poll
- Yes: 31%
- No: 45%
- Google Translate works fine: 24%
Getting The Most Out of AMP

We have been digging deep into how to utilize AMP properly. Surprise, surprise:

- It works better with large networks and wide networks
- It works poorly with separable convolutions
- You need slightly more involved design considerations than just "have your channels divisible by 8":

For matrix multiplication:
On FP16 inputs, all three dimensions (M, N, K) must be multiples of 8.

For convolution:
On FP16 inputs, input and output channels must be multiples of 8.

Also:

Prefer dense math operations.
For example, vanilla convolutions have much higher arithmetic intensity than depth-wise separable convolutions.
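
To make the arithmetic-intensity point concrete, a small illustrative comparison (the layer sizes are arbitrary):

import torch.nn as nn

# Dense 3x3 convolution: 64 * 64 * 3 * 3 ≈ 37k MACs per output position
dense = nn.Conv2d(64, 64, kernel_size=3, padding=1)

# Depthwise-separable equivalent: 64 * 3 * 3 + 64 * 64 ≈ 4.7k MACs per output
# position, i.e. far fewer FLOPs per byte of weights / activations moved,
# hence lower arithmetic intensity
separable = nn.Sequential(
    nn.Conv2d(64, 64, kernel_size=3, padding=1, groups=64),  # depthwise
    nn.Conv2d(64, 64, kernel_size=1),                        # pointwise
)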

Also:

- Choose mini-batch to be a multiple of 8
- Choose linear layer dimensions to be a multiple of 8
- Choose convolution layer channel counts to be a multiple of 8
- For classification problems, pad vocabulary to be a multiple of 8
- For sequence problems, pad the sequence length to be a multiple of 8

Please see:
- https://docs.nvidia.com/deeplearning/performance/mixed-precision-training/index.html
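
Putting the guidelines above together, a minimal sketch of a single AMP training step with torch.cuda.amp, where the batch size, channel counts and linear dimensions are all multiples of 8 (the toy model, shapes and learning rate are purely illustrative):

import torch
import torch.nn as nn

model = nn.Sequential(
    nn.Conv1d(64, 128, kernel_size=3, padding=1),  # in/out channels % 8 == 0
    nn.ReLU(),
    nn.AdaptiveAvgPool1d(1),
    nn.Flatten(),
    nn.Linear(128, 256),                           # linear dims % 8 == 0
).cuda()

optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
scaler = torch.cuda.amp.GradScaler()

x = torch.randn(32, 64, 1024, device="cuda")       # batch size 32, % 8 == 0
y = torch.randint(0, 256, (32,), device="cuda")

optimizer.zero_grad()
with torch.cuda.amp.autocast():                    # FP16 where safe, FP32 elsewhere
    loss = nn.functional.cross_entropy(model(x), y)
scaler.scale(loss).backward()                      # scale loss to avoid FP16 underflow
scaler.step(optimizer)
scaler.update()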


#deep_learning
GitHub Discussions

Microsoft, among many other enterprise-y things, launched GitHub Discussions, which to my amazement does not suck.

95% of popular ML projects are on GitHub, and most of them currently use the issues tab for 3 things:

(0) Semi-automated or structured team communication
(1) Questions / ideas / collaboration by the community
(2) If the authors have not invested enough time in proper usability and docs, sometimes you have to read / search all of the issues manually

Given enough development and effort from the community, this thing can probably solve (2) and tap into the crowd knowledge that currently gets lost over time.

It also looks like a stab at things like Discourse, but free and integrated with your repo and your code. Discourse, however, is probably used by 1% of projects; too often you see a nice project with a mess or tumbleweeds in the issues tab.

One final question, of course: when will Microsoft start using it as a means of censorship, as all large American companies inevitably do?
Speeding Up Your PyTorch Networks for CPU Inference

Key ingredients:

- PyTorch native network
- CPU inference / deploy
- JIT, ONNX, int8 quantization

Some notes on how much you can speed up your networks mostly out of the box with very few tweaks. These conclusions hold for very small networks (1M params, 10-30 layers) and medium-sized networks (20M params, 20-40 layers):

- Just using JIT can give you up to a 30% boost. With smaller batch sizes (and feature-map sizes) the boost is smaller, 5-10%, and it saturates beyond a certain batch size / feature-map size;

- Just using int8 quantization can give you up to a 30% boost. Same caveats as with JIT;

- Same with JIT + int8: total speed-ups up to 50%, with more even speed-ups for small batches and feature maps;

- Using ONNX, however, is generally faster than PyTorch out of the box, but the effect is most pronounced for small feature maps, e.g. you can get a 40% speed-up for a small batch and zero speed-up for a large batch;

- ONNX + int8 does not seem to work in PyTorch now. We have not tried porting networks manually from ONNX to quantized ONNX;

We are not comparing apples to apples here, but ONNX inference with quantization seems the most promising option, given its wide back-end support.
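
A minimal sketch of the options above on a toy model (the model, shapes and file name are illustrative; int8 here means dynamic quantization, the simplest variant, which covers Linear / LSTM layers):

import torch
import torch.nn as nn
import onnxruntime as ort

# Toy CPU model in eval mode, since we only care about inference
model = nn.Sequential(nn.Linear(256, 256), nn.ReLU(), nn.Linear(256, 64)).eval()
example = torch.randn(8, 256)

# 1) TorchScript via tracing
jit_model = torch.jit.trace(model, example)

# 2) int8 dynamic quantization (quantizes Linear / LSTM weights to int8)
int8_model = torch.quantization.quantize_dynamic(model, {nn.Linear}, dtype=torch.qint8)

# 3) ONNX export + onnxruntime inference
torch.onnx.export(model, example, "model.onnx", opset_version=11,
                  input_names=["input"], output_names=["output"])
sess = ort.InferenceSession("model.onnx")
onnx_out = sess.run(None, {"input": example.numpy()})[0]

with torch.no_grad():
    ref = model(example)
print(torch.allclose(torch.from_numpy(onnx_out), ref, atol=1e-4))  # sanity check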

#deep_learning
mT5: A massively multilingual pre-trained text-to-text transformer
Linting Xue, Noah Constant, Adam Roberts, Mihir Kale, Rami Al-Rfou, Aditya Siddhant, Aditya Barua, Colin Raffel
Paper: https://arxiv.org/abs/2010.11934
Code: https://github.com/google-research/multilingual-t5

Another fresh large-scale work. Without much noise or hype, Google has released a multilingual variant of its T5, called mT5.

T5 (https://arxiv.org/abs/1910.10683) is, for some reason, a rarely mentioned work, and we are guilty of that ourselves: we have hardly written about it. Time to fix that.

A year ago Google did a huge amount of work. They collected a huge 750 GB dataset, the "Colossal Clean Crawled Corpus" (C4). They built a large full transformer (encoder + decoder) and pre-trained it in an unsupervised, or rather self-supervised, regime with a denoising objective similar to BERT's MLM, except that small spans are masked out instead of individual tokens. The resulting model was then fine-tuned on tasks from GLUE/SuperGLUE, SQuAD, WMT, etc.

Interestingly, the model is a full-fledged seq2seq, i.e. even classification tasks were reduced to generating a textual class label, for example "entailment", "contradiction" or "neutral" on MNLI, and regression tasks to producing a string like "2.6". They named the model "Text-to-Text Transfer Transformer", or T5 for short (still a long way to go to T-800 or T-1000).

The enormity of the work was that the researchers tested a whole pile of hypotheses: which objective is better, which model structure, how many training tokens are needed, how the model scales along different dimensions, etc., and picked the best options. They got SoTA in various places, as usual.

They trained and released (https://github.com/google-research/text-to-text-transfer-transformer#released-model-checkpoints) models of several sizes: 60M, 220M, 770M, 3B and 11B parameters (as a reminder, the largest GPT-2 at that moment was 774M, and the 1.5B model was only released after this paper, https://openai.com/blog/gpt-2-1-5b-release/, in November 2019).

Note that Google released the trained 11B-parameter model right away! And without shouting about it the way OpenAI did. In fact, they have been talking about all of this rather modestly to this day.

So, the other day it was the multilingual model's turn: mT5. A dataset was collected for it as well (now called mC4), covering 107 languages (although 6 of them are Latin-script variants of a main language; amusingly, ru-Latn is among them). Russian, by the way, is in second place in this dataset, with 713B tokens collected! And we really do have something to be proud of with the Russian language: on the web it is second only to English: https://en.wikipedia.org/wiki/Languages_used_on_the_Internet#Content_languages_for_websites.

The dataset is published here: https://www.tensorflow.org/datasets/catalog/c4#c4multilingual_nights_stay

Then mT5 was trained on this dataset following the best recipes from T5 (more precisely, from T5.1.1). The language sampling parameters were chosen carefully so that rare languages get trained but not overfit. They did not overthink the model itself and tried to follow the original recipes. Along the way they experimented with various ablations, described the effects and picked the best parameters.

They also published (https://github.com/google-research/multilingual-t5#released-model-checkpoints) a set of models: 300M, 600M, 1B, 4B and 13B (mT5 has a larger vocabulary than the English-only T5, hence the larger model sizes).

They evaluated on tasks from the XTREME benchmark, again got SoTA and beat conceptually similar models such as mBERT and XLM/XLM-R.

All in all, this is mega-cool. We now have a set of models of various sizes (up to really huge ones that will not be that easy to run), supporting 101 languages and ready for fine-tuning on more or less any seq2seq task. Respect to Google!
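
Not from the post, but for completeness: a minimal sketch of loading an mT5 checkpoint for seq2seq fine-tuning via the HuggingFace transformers library (assuming a transformers version with mT5 support and the public google/mt5-small checkpoint):

from transformers import AutoTokenizer, MT5ForConditionalGeneration

tokenizer = AutoTokenizer.from_pretrained("google/mt5-small")
model = MT5ForConditionalGeneration.from_pretrained("google/mt5-small")

# Text-to-text: both inputs and targets are plain strings, whatever the task
inputs = tokenizer("A source sentence in any of the 101 languages.",
                   return_tensors="pt")
labels = tokenizer("The target text for this example.",
                   return_tensors="pt").input_ids

# One fine-tuning step worth of loss; plug this into your training loop of choice
loss = model(input_ids=inputs.input_ids,
             attention_mask=inputs.attention_mask,
             labels=labels).loss
loss.backward()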