Speeding Up Your PyTorch Networks for CPU Inference
Key ingredients:
- PyTorch native network
- CPU inference / deploy
- JIT, ONNX, int8 quantization
Some notes on how much you can speed up your networks, mostly out of the box, with very few tweaks. These conclusions hold for very small networks (1M params, 10-30 layers) and medium-sized networks (20M params, 20-40 layers):
- Just using JIT can give you up to a 30% boost. With smaller batch sizes (and feature-map sizes) the boost is smaller, 5-10%, and it saturates beyond a certain batch / feature-map size;
- Just using int8 quantization can give you up to a 30% boost, with the same caveats as JIT;
- JIT + int8 together give total speed-ups of up to 50%, and the speed-ups are more even across small batches and feature maps;
- ONNX, however, is generally faster than PyTorch out of the box, but the gain is most pronounced for small feature maps, e.g. a 40% speed-up for a small batch and zero speed-up for a large batch;
- ONNX + int8 does not seem to work in PyTorch now. We have not tried porting networks manually from ONNX to quantized ONNX.
We are not comparing apples to apples here, but ONNX inference with quantization seems the most promising given its wide back-end support.
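For reference, the JIT + int8 combination can be sketched with dynamic quantization; the toy two-layer model below is an assumption, just a stand-in for a real network:

```python
import torch
import torch.nn as nn

# Hypothetical toy model standing in for a real network
model = nn.Sequential(
    nn.Linear(128, 256),
    nn.ReLU(),
    nn.Linear(256, 10),
).eval()

# int8 dynamic quantization of the Linear layers
quantized = torch.quantization.quantize_dynamic(
    model, {nn.Linear}, dtype=torch.qint8
)

# JIT-compile the quantized model for CPU inference
scripted = torch.jit.script(quantized)

with torch.no_grad():
    out = scripted(torch.randn(1, 128))
print(out.shape)  # torch.Size([1, 10])
```

Static quantization needs calibration and module fusion; dynamic quantization like the above is the zero-effort variant and already covers the Linear-heavy case.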
#deep_learning
A New Promising Pooling Module?
Wow, need to test this myself!
https://www.microsoft.com/en-us/research/blog/seeing-on-tiny-battery-powered-microcontrollers-with-rnnpool/
#deep_learning
Microsoft Research
Bringing vision intelligence to the edge
Limited compute and memory make it challenging to deploy state-of-the-art computer vision architectures on edge devices. Researchers introduce RNNPool, a pooling operator that can enable sophisticated vision intelligence in 200KB of RAM.
Writing TB Logs to S3 in PyTorch
Looks like TensorboardX supports writing to S3 and TB itself natively supports writing to S3 (anyone tried GCS?).
I am not sure whether the PyTorch writer supports S3, but judging by these threads (1 and 2) it should work if you handle your ENV variables properly:
The torch.utils.tensorboard implementation uses a writer in core TensorBoard that supports GCS and S3 if TF is installed, and only S3 if it is not.
I did not know this - this is very cool!
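A minimal sketch of the ENV-variable approach: the bucket path and the TB_LOG_DIR variable are hypothetical, and credentials are assumed to come from the standard AWS environment variables (AWS_ACCESS_KEY_ID, AWS_SECRET_ACCESS_KEY):

```python
import os
from torch.utils.tensorboard import SummaryWriter

# Hypothetical setup: point TB_LOG_DIR at an s3:// path, e.g.
#   TB_LOG_DIR=s3://my-bucket/tb-logs/run-1
# and the writer passes it straight to the underlying record writer.
log_dir = os.environ.get("TB_LOG_DIR", "runs/local-debug")

writer = SummaryWriter(log_dir=log_dir)
writer.add_scalar("loss/train", 0.123, global_step=0)
writer.close()
```

With the default fallback the logs land in a local `runs/` directory, so the same code works with or without S3 configured.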
#deep_learning
GitHub
tensorboardX/tensorboardX/record_writer.py at master · lanpa/tensorboardX
tensorboard for pytorch (and chainer, mxnet, numpy, ...) - lanpa/tensorboardX
Interesting Loss Weighting Idea - Gradient Adaptive Factor
When you have 2+ losses in your NN, loss weighting is sometimes not really straightforward. Usually the total loss is:
loss = loss_0 + lambda_1 * loss_1 + ...
Of course you can tune these "lambdas" manually or with some naïve NAS (or some ad hoc heuristic, e.g. "this loss is more important"), but all these approaches have 2 drawbacks:
- Slow / compute-intensive / ad hoc;
- There is no guarantee that these values stay optimal;
Usually when something is not stable (and multiple losses often explode on init), some sort of adaptive clipping is employed. I just stumbled upon a technique called Gradient Adaptive Factor, see an example here.
The idea is simple - balance your losses so that their gradient magnitudes are roughly similar.
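A minimal sketch of the idea (my own toy formulation, not the code from the linked gist): compute the gradient norm of each loss w.r.t. the shared parameters and use their ratio as the lambda:

```python
import torch

def gradient_adaptive_factor(loss_a, loss_b, params):
    """Scale factor for loss_b so its gradient norm matches loss_a's."""
    grads_a = torch.autograd.grad(loss_a, params, retain_graph=True, allow_unused=True)
    grads_b = torch.autograd.grad(loss_b, params, retain_graph=True, allow_unused=True)
    def total_norm(grads):
        return torch.norm(torch.stack([g.norm() for g in grads if g is not None]))
    eps = 1e-8
    return (total_norm(grads_a) / (total_norm(grads_b) + eps)).detach()

# Toy usage: loss_1 is deliberately scaled 100x, the factor compensates
model = torch.nn.Linear(4, 1)
x = torch.randn(8, 4)
loss_0 = model(x).pow(2).mean()
loss_1 = model(x).abs().mean() * 100.0
lam = gradient_adaptive_factor(loss_0, loss_1, list(model.parameters()))
total = loss_0 + lam * loss_1
total.backward()
```

Note the `detach()` - the factor is treated as a constant, so the balancing itself does not get back-propagated through.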
#deep_learning
Gist
Gradient Adaptive Factor
Gradient Adaptive Factor. GitHub Gist: instantly share code, notes, and snippets.
Last Week in AI
Most "AI" newsletters have fizzled out, becoming mostly unreadable noise and / or ads.
The remaining ones are mostly populist. Here are a couple with SNR > 0:
- https://lastweekin.ai/p/103 (by The Gradient, lol I did not know this existed)
- https://newsletter.airstreet.com/issues/your-guide-to-ai-january-2021-308710
Enjoy.
#deep_learning
lastweekin.ai
AI-powered coronavirus screening, banning deepfake porn, and more!
Last Week in AI #103
PyTorch 1.8 Released
- https://pytorch.org/blog/pytorch-1.8-released/
- https://github.com/pytorch/pytorch/releases
Apart from mostly fixes, and some nice quantization (still no transformer!) and ONNX improvements, I really like these additions:
(0) PyTorch Lite Interpreter is a streamlined version of the PyTorch runtime that can execute PyTorch programs on resource-constrained devices with a reduced binary size footprint. This prototype feature reduces binary sizes by up to 70% compared to the current on-device runtime (link);
(1) Starting in PyTorch 1.8, we have added support for ROCm wheels, providing easy onboarding for AMD GPUs (link);
(2) New beta benchmark utils (link);
(3) New PyTorch Mobile demos;
(4) New quantization API (link);
(5) New releases of related libraries (i.e. torchaudio, torchvision) - looks like they are tied to PyTorch releases now.
#deep_learning
PyTorch
PyTorch 1.8 Release, including Compiler and Distributed Training updates, and New Mobile Tutorials
We are excited to announce the availability of PyTorch 1.8. This release is composed of more than 3,000 commits since 1.7. It includes major updates and new features for compilation, code optimization, frontend APIs for scientific computing, and AMD ROCm…
Torch FX
- https://pytorch.org/docs/master/fx.html
FX is a toolkit for developers to use to transform nn.Module instances. FX consists of three main components: a symbolic tracer, an intermediate representation, and Python code generation.
I understand that the people building PyTorch usually favour flexible toolkits (and they expose a lot to the end user), and most likely they realized that static quantization was too complex for the average user to handle, so they wrote this as an engine for automated quantization transformations, which is cool. Designing a proper API is always a balancing act.
Over the years, I became quite good at monkey-patching PyTorch code using just Python's and PyTorch's own tools (e.g. module.named_modules()). So I wonder what the killer use case of this feature would be?
One thing comes to mind immediately - when you have the same model with static control flows and you need to create a quantized / TorchScript version of it. Right now that is a pain in the ass, because it requires manually switching them back and forth (switch on, create a quantized TorchScript version, switch back, create another one, etc).
Will I use it? I guess I need to sleep on it. We ended up not using static quantization very much. It looks very cool and flexible and serves a real purpose, but usually stupid one-line hacks can do the same without learning a new tool.
So idk, what do you think? Do you like any of the examples? I like the invert one.
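As a toy illustration of the trace-transform-recompile loop (a hypothetical pass, not one of the official examples): trace a module and swap every torch.relu call for torch.sigmoid:

```python
import torch
import torch.fx as fx

class M(torch.nn.Module):
    def forward(self, x):
        return torch.relu(x) + 1.0

# Symbolically trace the module into an FX graph
traced = fx.symbolic_trace(M())

# Walk the graph and rewrite call_function nodes in place
for node in traced.graph.nodes:
    if node.op == "call_function" and node.target == torch.relu:
        node.target = torch.sigmoid

# Regenerate the Python code from the modified graph
traced.recompile()

x = torch.randn(3)
print(traced(x))  # sigmoid(x) + 1.0
```

The same three steps (trace, mutate graph, recompile) are what an automated quantization pass does, just with module swaps instead of a function swap.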
#deep_learning
New Benchmarking Tool in PyTorch
https://pytorch.org/tutorials/recipes/recipes/benchmark.html#pytorch-benchmark
Looks a bit over-complicated at first glance (why provide classes for random tensor generation, I have no idea), but it has a few very nice features:
- Automated num_threads handling;
- Automated CUDA synchronization;
- Report generation, storing the results, comparing the results.
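A minimal usage sketch of torch.utils.benchmark.Timer (the matmul workload is an arbitrary assumption):

```python
import torch
from torch.utils import benchmark

x = torch.randn(256, 256)

# Timer handles warm-up, num_threads, and (on GPU) CUDA sync for you
t = benchmark.Timer(
    stmt="x @ x",
    globals={"x": x},
    num_threads=1,
)

m = t.timeit(100)  # run the statement 100 times, return a Measurement
print(m.mean)      # mean time per run, in seconds
```

Measurements with matching labels can later be fed to `benchmark.Compare` to get the tabulated reports mentioned above.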
But I suppose there is nothing wrong with just using %%timeit and manually setting num_threads.
#deep_learning
Building Your Own Supercomputer Cheap (RU)
My guest post on ODS @ habr:
- https://habr.com/ru/company/ods/blog/546808/
EDIT - some awesome comments!
#deep_learning
Habr
Building Your Own Supercomputer on the Cheap
These days no one is surprised by the achievements of artificial intelligence / machine learning (ML) in the most varied fields. Yet trusting citizens rarely ask two questions: (i) what is the actual price...