Spark in me
Lost like tears in rain. DS, ML, a bit of philosophy and math. No bs or ads.

BIFURCATED RISER X16 TO 2X8 (SET)

Remember that there is a very limited number of motherboards with 5+ PCIE slots?

Now there are risers like this - https://riser.maxcloudon.com/ru/bifurcated-risers/25-bifurcated-riser-x16-to-2x8-set.html

Has anyone tried something similar for DL?

#deep_learning
PyTorch + AMD Inference

We were benchmarking our networks on Intel vs AMD processors using the out-of-the-box official build.

Intel is mostly better (with the same number of threads, roughly the same core speed and no overclocking). I was wondering why this is, and then I found this thread.

To be honest I have little motivation to invest time in redoing our environment builds from scratch with OpenBLAS + CUDA (and most likely it is not worth the time since in production most likely there will be Intel CPUs).

But I wonder, does anyone in the community have dockerized dev environment builds based around CUDA + OpenBLAS? Because it looks like out of the box PyTorch ships with Intel's MKL.
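A quick way to see what a given PyTorch build was compiled against is to ask PyTorch itself. This is a minimal sketch: the helper name blas_backend_report is made up for illustration, and it only reports build flags, it does not benchmark anything.

```python
import torch


def blas_backend_report() -> dict:
    """Summarize which CPU math backends this PyTorch build was compiled with."""
    return {
        "mkl": torch.backends.mkl.is_available(),
        "mkldnn": torch.backends.mkldnn.is_available(),
        "openmp": torch.backends.openmp.is_available(),
        "num_threads": torch.get_num_threads(),
    }


if __name__ == "__main__":
    print(blas_backend_report())
    # Full build string; the BLAS / LAPACK info is listed among the other flags.
    print(torch.__config__.show())
```

Running this on both the Intel and the AMD box at least confirms whether the two benchmarks used the same backend and thread count.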

#deep_learning
Factorized Networks

I really like the idea from this article - https://www.microsoft.com/en-us/research/blog/factorized-layers-revisited-compressing-deep-networks-without-playing-the-lottery/

Basically you do not prune networks (which does not readily transfer into inference) or distill your teacher network into a student. Instead you train a low-rank factorized version of the network from scratch, with some optimizations.
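To make the idea concrete, here is a minimal sketch of a low-rank factorized linear layer. It is not the authors' code: the class name LowRankLinear, the layer sizes and the rank are illustrative assumptions, and the optimizations from the article are deliberately left out.

```python
import torch
from torch import nn


class LowRankLinear(nn.Module):
    """Rank-r stand-in for a dense linear layer: y = up(down(x))."""

    def __init__(self, in_features: int, out_features: int, rank: int):
        super().__init__()
        # in_features -> rank, no bias needed on the inner projection
        self.down = nn.Linear(in_features, rank, bias=False)
        # rank -> out_features, carries the bias of the original layer
        self.up = nn.Linear(rank, out_features, bias=True)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.up(self.down(x))


def n_params(module: nn.Module) -> int:
    """Total number of parameters in a module."""
    return sum(p.numel() for p in module.parameters())


dense = nn.Linear(512, 512)
factored = LowRankLinear(512, 512, rank=64)

if __name__ == "__main__":
    print(n_params(dense), n_params(factored))
    print(factored(torch.randn(4, 512)).shape)
```

With these sizes the dense layer has 512 * 512 + 512 = 262656 parameters, while the factorized one has 2 * 512 * 64 + 512 = 66048, and both are trained the same way from scratch.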

This article even has code, but basically ... this is an older fork of fairseq imported via 1 commit. So good luck doing what the authors did not bother to do (providing a stand-alone implementation).

So the question is - has anyone seen a minimalist stand-alone implementation for something similar?

#deep_learning
Finally Proper GPU Support in Compose!

It finally happened (some time ago, I only just checked)!

Now the obsolete "runtime: nvidia" syntax can be replaced with this more versatile syntax:

    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              device_ids: ['0']
              capabilities: [gpu]


This together with CUDA_VISIBLE_DEVICES gives you full control of your GPU environment within compose.
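On the Python side, a process inside the container can check what CUDA_VISIBLE_DEVICES leaves it with. A minimal sketch, using only the standard library; the helper name visible_gpu_ids is made up for illustration.

```python
import os


def visible_gpu_ids(env=None):
    """Parse CUDA_VISIBLE_DEVICES into a list of device identifiers.

    Returns None when the variable is unset (no extra restriction on top of
    what the container was given) and [] when it is set but empty (no GPUs).
    """
    env = os.environ if env is None else env
    raw = env.get("CUDA_VISIBLE_DEVICES")
    if raw is None:
        return None
    return [part.strip() for part in raw.split(",") if part.strip()]


if __name__ == "__main__":
    print(visible_gpu_ids())
```

The env argument exists only so the function can be tried with a plain dict instead of the real environment.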

https://docs.docker.com/compose/gpu-support/

#deep_learning
MLPerf Inference v1.0

- Inference Edge v1.0 https://mlcommons.org/en/inference-edge-10/
- Inference Datacenter v1.0 https://mlcommons.org/en/inference-datacenter-10/

The immediate conclusion (as expected) - enterprise kinky party. The second conclusion - they mostly compare vastly different systems (mostly HPC), which is good.

Honestly I do not really care for A100 vs. A?0 vs. Quadro vs. T4, but edge benchmarks are always rare and nice.

The most interesting spreadsheet IMO is this one.

And here I see some quite interesting stuff:

- Firefly-RK3399 looks similar to RPI 4 in performance (has anyone used it?)
- NVIDIA Jetson AGX Xavier looks ~2x faster than both of them (and probably is much more expensive and unobtainable)
- TFLite / ArmNN - but no ONNX or PyTorch on ARM, I wonder why
- int8 is very much a must-have on these devices, I see performance boosts up to 2x

PS

Firefly-RK3399 has a PCIE M2 slot, so theoretically you can plug in PCIE accelerator sticks there? =)
It also runs on Ubuntu?

#hardware
#deep_learning
Einops and Einsum in PyTorch

Previously there was an attempt at making DL code more readable via named tensors (still a prototype, they "imported" a third party library). A cool idea, but I have never really seen anyone using it (myself included).

Now there is a similar thing with the (not so) new Einstein notation for deep learning:

- https://pytorch.org/docs/stable/generated/torch.einsum.html
- https://stackoverflow.com/questions/55894693/understanding-pytorch-einsum
- https://github.com/arogozhnikov/einops

Will it stick? No idea. Einsum may be a blessing for some complex code. It is not necessarily more readable in general if you are used to the basic APIs (like bmm for example).
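For comparison, here is a minimal sketch of the same contractions written with the basic APIs and with torch.einsum. The tensor shapes and index letters are arbitrary choices for illustration.

```python
import torch

torch.manual_seed(0)

# Batched matrix multiply: (batch, i, k) x (batch, k, j) -> (batch, i, j)
a = torch.randn(4, 3, 5)
b = torch.randn(4, 5, 2)
via_bmm = torch.bmm(a, b)
via_einsum = torch.einsum("bik,bkj->bij", a, b)

# Attention-style scores: contract the feature dim d between queries and keys
q = torch.randn(2, 8, 10, 16)  # (batch, heads, q_len, d)
k = torch.randn(2, 8, 12, 16)  # (batch, heads, k_len, d)
scores_matmul = q @ k.transpose(-1, -2)
scores_einsum = torch.einsum("bhqd,bhkd->bhqk", q, k)

if __name__ == "__main__":
    print(torch.allclose(via_bmm, via_einsum, atol=1e-5))
    print(torch.allclose(scores_matmul, scores_einsum, atol=1e-5))
```

For the plain bmm case einsum buys nothing, but in the second case the index string documents the shapes that the transpose call hides.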

Also I believe it may be adapted into PyTorch as syntactic sugar.

#deep_learning
Some Mobile DNN Inference Benches

https://ai-benchmark.com/ranking.html
https://ai-benchmark.com/ranking_detailed.html
https://mlcommons.org/en/inference-mobile-10/

TLDR

- Using GPUs is 3-10x faster
- All of them test with TF / native APIs
- Expensive phones are faster (orly)
- No PyTorch (lite) accelerated results yet

#deep_learning
Huge Image + Text Contrastive Representation Learning

Typically, huge data + compute does not equal research news. People also are always over-excited about GANs and trillion parameter sized networks for some reason that eludes me. Yet the most useful "huge" compute networks are not shared (for obvious reasons - because they are useful in real life).

This is a bit different - looks like Google scraped 1.8B images with their text annotations and applied contrastive dual-encoder learning. The networks are large, but not needlessly so (they are small enough to fit on one conventional GPU?).

If you interpret posts by MS / FAIR / Google as "they throw all the shit and ton of compute at the wall and see what sticks", probably this is the minimum scale on which this task kind of works (there were some similar attempts by FAIR).

What impresses me though is the real zero-shot performance on ImageNet, i.e. not the fake academic "zero-shot" performance where networks are pre-trained on 10-100x larger datasets and then tuned on 5-10% of ImageNet, but true "zero-shot" performance.

This kind of points at the compute required (typically you can divide Google's compute requirements by 10, but still):

The learning rate is warmed up linearly to 1e-3 from zero in 10k steps, and then linearly decay to zero in 1.2M steps (∼12 epochs). We train the model on 1024 Cloud TPUv3 cores with 16 positive pairs on each core. Therefore the total effective batch size is 16384.
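The numbers in the quote can be reproduced with a few lines of arithmetic. This is a minimal sketch, not the paper's code: the function names are made up, and it assumes that the 1.2M steps are the total training length with the 10k warmup steps included, which the quote does not state explicitly.

```python
def effective_batch_size(cores: int, pairs_per_core: int) -> int:
    """Total number of positive pairs processed per step across all cores."""
    return cores * pairs_per_core


def learning_rate(step: int,
                  peak: float = 1e-3,
                  warmup_steps: int = 10_000,
                  total_steps: int = 1_200_000) -> float:
    """Linear warmup from zero to peak, then linear decay back to zero.

    Assumption: total_steps counts the warmup steps as well.
    """
    if step < warmup_steps:
        return peak * step / warmup_steps
    remaining = max(total_steps - step, 0)
    return peak * remaining / (total_steps - warmup_steps)


if __name__ == "__main__":
    print(effective_batch_size(cores=1024, pairs_per_core=16))
    for step in (0, 5_000, 10_000, 600_000, 1_200_000):
        print(step, learning_rate(step))
```

1024 cores with 16 pairs each gives the 16384 from the quote, so every step contrasts each pair against more than 16 thousand others.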


Also I could not find any mention of public weights. Google usually does not publish huge compute networks ... that are actually useful in production.

https://ai.googleblog.com/2021/05/align-scaling-up-visual-and-vision.html

I also remember the bs hype wave about the OpenAI large transformer ... but Google published a similar network - and no one cared. Looks like this kind of answers whether those networks were useful.

#deep_learning
PyTorch 1.9 Released

https://pytorch.org/blog/pytorch-1.9-released/
https://github.com/pytorch/audio/releases/tag/v0.9.0

Major improvements to support scientific computing, including torch.linalg, torch.special, and Complex Autograd

Major improvements in on-device binary size with Mobile Interpreter (!)

Native support for elastic-fault tolerance training through the upstreaming of TorchElastic into PyTorch Core

Major updates to the PyTorch RPC framework to support large scale distributed training with GPU support

New APIs to optimize performance and packaging for model inference deployment (!)

Support for Distributed training, GPU utilization and SM efficiency in the PyTorch Profiler

#deep_learning
Embedding Quantization in PyTorch

Currently the following modules are supported by the following quantization modes:

- CNN (i.e. nn.Conv) => static quantization (you need to store statistics);
- RNN or Transformer (i.e. nn.Linear) => dynamic (you just convert weights);
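The dynamic mode from the list is a one-liner in practice. This is a minimal sketch on a toy model (the layer sizes are arbitrary), it covers nn.Linear only, and it is not the embedding recipe mentioned further down. It uses torch.quantization.quantize_dynamic, which newer releases also expose under torch.ao.quantization.

```python
import torch
from torch import nn

# Toy float model standing in for the Linear-heavy part of an RNN / Transformer
model = nn.Sequential(nn.Linear(16, 8), nn.ReLU(), nn.Linear(8, 4)).eval()

# Dynamic quantization: weights are converted to int8 up front,
# activations are quantized on the fly, no calibration pass is needed
quantized = torch.quantization.quantize_dynamic(
    model, {nn.Linear}, dtype=torch.qint8
)

x = torch.randn(2, 16)
out = quantized(x)

if __name__ == "__main__":
    print(quantized)
    print(out.shape)
```

Static quantization needs an extra calibration step to collect the activation statistics, which is the "you need to store statistics" part.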

Looks like less love was given to the Embedding layers, because they require static quantization: https://github.com/pytorch/pytorch/issues/65185

In any case, someone tested it and provided a simple recipe, which is absent from docs.

#deep_learning