Spark in me
Lost like tears in rain. DS, ML, a bit of philosophy and math. No bs or ads.
A very interesting, high-quality dataset of satellite images, plus a leaderboard (LB)

- https://project.inria.fr/aerialimagelabeling/leaderboard/

Just scientific interest.

#data_science
#deep_learning
ONNX Model Dynamic Quantization

Previously I tried exporting natively quantized PyTorch models to ONNX, and it failed with cryptic errors.

ONNX export is a bleeding-edge feature, and so is quantization (and neither is fully stable). So it is safe to assume that their combination would not work.

But ONNX Runtime can now quantize ONNX models directly. I was shocked that this just works.

Link - https://tomwildenhain-microsoft.github.io/onnxruntime/docs/how-to/quantization.html

from onnxruntime.quantization import quantize_dynamic, QuantType

model_fp32 = 'en_v5.onnx'     # original float32 model
model_quant = 'en_v5_q.onnx'  # destination for the quantized model

# Dynamically quantize the weights to unsigned int8
quantized_model = quantize_dynamic(model_fp32, model_quant, weight_type=QuantType.QUInt8)


#deep_learning
Building Scalable, Explainable, and Adaptive NLP Models with Retrieval

I am so tired of "trillion param model go brrr" bs. It is nice to see that at least some people are trying to create something useful for a change.

Is retrieval “all you need”?

The black-box nature of large language models like T5 and GPT-3 makes them inefficient to train and deploy, opaque in their knowledge representations and in backing their claims with provenance, and static in facing a constantly evolving world and diverse downstream contexts. This post explores retrieval-based NLP, where models retrieve information pertinent to solving their tasks from a plugged-in text corpus. This paradigm allows NLP models to leverage the representational strengths of language models, while needing much smaller architectures, offering transparent provenance for claims, and enabling efficient updates and adaptation.

We surveyed much of the existing and emerging work in this space and highlighted some of our work at Stanford, including ColBERT for scaling up expressive retrieval to massive corpora via late interaction, ColBERT-QA for accurately answering open-domain questions by adapting high-recall retrieval to the task, and Baleen for solving tasks that demand information from several independent sources using a condensed retrieval architecture. We continue to actively maintain our code as open source.


http://ai.stanford.edu/blog/retrieval-based-NLP/

#deep_learning
#nlp
The State of "AI" Report 2021

Blog post
https://www.stateof.ai/2021-report-launch.html

Email
http://newsletter.airstreet.com/issues/state-of-ai-report-2021-694504

Slides themselves
https://docs.google.com/presentation/d/1bwJDRC777rAf00Drthi9yT2c9b0MabWO5ZlksfvFzx8/edit#slide=id.gf171287819_0_165

TLDR

Research

The Transformer architecture has expanded far beyond NLP and is emerging as a general purpose architecture for machine learning.

Large language models (LLMs) are in the scale-out phase and have become "nationalised", with each country wanting its own LLM.

AI-first approaches have taken structural biology by storm: proteins and RNA (cellular machinery) are being simulated with high fidelity.

JAX emerges as a popular ML framework as the pace of research productivity accelerates and researchers become first-class citizens.

Talent

Chinese universities have rocketed from publishing no AI research in 1980 to the largest volume of quality AI research today.

The de-democratisation of AI research continues as big tech companies collaborate with elite, but not lower tier, universities.

Academic groups struggle to compete on compute resources, while 88% of top AI faculty have received funding from big tech.

Industry

The AI and data company ecosystem has matured significantly, with major IPOs signalling the entry into the deployment phase of AI.

Two major AI-first drug discovery and development companies complete IPOs with drugs in the clinic, further validating their potential.

AI-first products are deployed for high-stakes use cases: the UK’s National Grid (energy), employee health and safety, and warehouses.

The community brings a renewed focus on data issues that affect model performance in production (bias, drift, specification, labels, etc).

Semiconductor-related companies accelerate massively as nations seek supply chain sovereignty and NVIDIA’s Arm takeover is investigated.

Politics

AI is now literally an arms race: autonomous weapons have been deployed on the battlefield with more testing happening regularly.

AI safety is now top of mind, but fewer than 50 researchers are working in this domain full-time at the major AI labs.

New experiments on AI governance emerge: totally distributed + open source, private + open source, and public benefit corporation.

AI regulation begins in Europe.


#deep_learning
8-bit Optimizers via Block-wise Quantization

https://www.youtube.com/watch?v=IxrlHAJtqKE&ab_channel=TimDettmers

While this is very cool in theory, and on paper you allegedly reduce your optimizer memory by 50%+, it is also quite loaded, both politically and practically:

- Most likely this would work well only on Ampere GPUs (though the listed requirements are: anaconda, cudatoolkit, pytorch; hardware: NVIDIA Maxwell GPU or newer (>= GTX 9XX); supported CUDA versions: 9.2 - 11.3);

- This is heavily loaded with perverse incentives. Huge LLMs have yet to justify the investment, but the push is to build even bigger models, i.e. trillion-parameter ones;

- This is maintained by FAIR, but packaged separately. So it can go either way: it may be abandoned after being used to set a new trillion-parameter model record, or it may eventually get merged into PyTorch or maybe even Nvidia products;

I can see one real practical use case for this. If, e.g. on older hardware, you are training a large model (100 - 200M params) and are struggling with a small batch size, this can be a life-saver (the same goes for video, 3D and similar workloads). Otherwise I am not sure.

Also note that they do not mention throughput or convergence speed on different GPUs.

#deep_learning
INT8 Optimizer + PyTorch Native AMP

So, we tried the recent INT8 optimizer together with PyTorch native AMP.

We were also migrating this job from a 1080 Ti to a 3090-based environment.

Maybe the network is too small (~100M params including the embedding layer), but it looks like, apples to apples, the 3090 provides a 3x throughput boost, while AMP + INT8 adds 10-15% on top at most.
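
For reference, a minimal sketch of this kind of setup: the 8-bit Adam from the bitsandbytes library as a drop-in replacement for torch.optim.Adam, wrapped with native AMP. The model, shapes and hyperparameters below are placeholders, not our actual network.

import torch
import bitsandbytes as bnb  # Tim Dettmers' library with the 8-bit optimizers

# Toy stand-ins for the real model and data
model = torch.nn.Linear(1024, 1024).cuda()
optimizer = bnb.optim.Adam8bit(model.parameters(), lr=1e-3)  # instead of torch.optim.Adam
scaler = torch.cuda.amp.GradScaler()

x = torch.randn(32, 1024, device="cuda")
target = torch.randn(32, 1024, device="cuda")

for _ in range(10):
    optimizer.zero_grad(set_to_none=True)
    with torch.cuda.amp.autocast():
        loss = torch.nn.functional.mse_loss(model(x), target)
    scaler.scale(loss).backward()   # AMP: scale the loss before backward
    scaler.step(optimizer)          # unscale and step the 8-bit optimizer
    scaler.update()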

Well, I understand the selling point of being able to tune large networks, but ultimately, on real down-to-earth tasks, the speed improvement is marginal.

It is very cool that this even works, and the optimizations described look very sane, but the political underpinnings look very much like the marginalization of independent research, i.e. if you cannot afford 1000 GPUs, then please do not even try anything, or just enjoy the scraps from our table.

#deep_learning
Accelerating PyTorch with CUDA Graphs

Today, we are pleased to announce a new advanced CUDA feature, CUDA Graphs, has been brought to PyTorch. Modern DL frameworks have complicated software stacks that incur significant overheads associated with the submission of each operation to the GPU. When DL workloads are strong-scaled to many GPUs for performance, the time taken by each GPU operation diminishes to just a few microseconds and, in these cases, the high work submission latencies of frameworks often lead to low utilization of the GPU. As GPUs get faster and workloads are scaled to more devices, the likelihood of workloads suffering from these launch-induced stalls increases. To overcome these performance overheads, NVIDIA engineers worked with PyTorch developers to enable CUDA graph execution natively in PyTorch. This design was instrumental in scaling NVIDIA’s MLPerf workloads (implemented in PyTorch) to over 4000 GPUs in order to achieve record-breaking performance.


- Link - https://pytorch.org/blog/accelerating-pytorch-with-cuda-graphs/

I wonder whether it should be (or has been) tried at all, given such advertised figures?
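
For reference, a minimal sketch of the whole-step capture pattern (this assumes the torch.cuda.graph API introduced around PyTorch 1.10; the toy model, shapes and optimizer below are placeholders):

import torch

model = torch.nn.Linear(512, 512).cuda()
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
static_input = torch.randn(64, 512, device="cuda")
static_target = torch.randn(64, 512, device="cuda")

# Warm up on a side stream before capture
s = torch.cuda.Stream()
s.wait_stream(torch.cuda.current_stream())
with torch.cuda.stream(s):
    for _ in range(3):
        optimizer.zero_grad(set_to_none=True)
        loss = torch.nn.functional.mse_loss(model(static_input), static_target)
        loss.backward()
        optimizer.step()
torch.cuda.current_stream().wait_stream(s)

# Capture one full training step; replays skip the per-op launch overhead
g = torch.cuda.CUDAGraph()
optimizer.zero_grad(set_to_none=True)
with torch.cuda.graph(g):
    static_loss = torch.nn.functional.mse_loss(model(static_input), static_target)
    static_loss.backward()
    optimizer.step()

# For each real batch: copy data into the static tensors, then replay
static_input.copy_(torch.randn(64, 512, device="cuda"))
static_target.copy_(torch.randn(64, 512, device="cuda"))
g.replay()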

#deep_learning
PyTorch Release 1.11

As usual, the odd version is a bit more future-oriented / experimental:

- https://github.com/pytorch/pytorch/releases/tag/v1.11.0
- https://pytorch.org/blog/pytorch-1.11-released/

Summary:

- TorchData is a new library for common modular data loading primitives for easily constructing flexible and performant data pipelines.

The rationale seems a bit weird:

We have found that the existing DataLoader bundled too many features together and can be difficult to extend.

Looks like they just want to collect all of their primitives for their docs and examples in one place. But I am not sure what this is solving for the end user, since in real life you end up rewriting such things anyway.
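
For what it is worth, a minimal sketch of what the datapipes look like (this assumes the torchdata package released alongside 1.11; the data here is a toy placeholder):

from torchdata.datapipes.iter import IterableWrapper

# Wrap any iterable and chain modular, composable transforms
pipe = IterableWrapper(range(10))
pipe = pipe.map(lambda x: x * 2)
pipe = pipe.filter(lambda x: x % 3 == 0)
pipe = pipe.shuffle().batch(2)

for batch in pipe:
    print(batch)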

- functorch, a library that adds composable function transforms to PyTorch, is now available in beta.

Composable function transforms can help with a number of use cases that are tricky to do in PyTorch today:

computing per-sample-gradients (or other per-sample quantities)
running ensembles of models on a single machine
efficiently batching together tasks in the inner-loop of MAML
efficiently computing Jacobians and Hessians as well as batched ones

Also not sure how much of this matters for everyday workloads.
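
Still, per-sample gradients are the poster-child use case; a minimal sketch of how it is supposed to look (assuming the functorch beta API shipped with 1.11, which may change in later releases):

import torch
from functorch import make_functional, vmap, grad

model = torch.nn.Linear(10, 2)
fmodel, params = make_functional(model)  # functional version: fmodel(params, x)

def loss_fn(params, x, y):
    logits = fmodel(params, x.unsqueeze(0))
    return torch.nn.functional.cross_entropy(logits, y.unsqueeze(0))

x = torch.randn(32, 10)          # batch of 32 samples
y = torch.randint(0, 2, (32,))

# grad w.r.t. params, vmapped over the batch dimension of (x, y)
per_sample_grads = vmap(grad(loss_fn), in_dims=(None, 0, 0))(params, x, y)
print([g.shape for g in per_sample_grads])  # each gradient has a leading dim of 32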

- Distributed Data Parallel (DDP) static graph optimizations available in stable.

Looks like it is required for huge models?

- Some interesting quantization fixes and additions

#deep_learning
Stupid Hack for Single PyTorch Layer Quantization

Kind of.

Quantization and model packing with PyTorch and ONNX are in a weird state right now.

On the one hand, for PyTorch everything just works in most cases (there are competing and unstable new APIs, but that was to be expected).

For ONNX, it also just works, but adding a single "if" to the model proved to be a challenge, let alone more complex logic. Whether to expose some logic in external wrapper utilities (and how to obfuscate it) is a design decision (also out of scope for this short post).

The problem is that the pre-packaged versions of PyTorch do not work properly with quantized models on older CPUs (1, 2, plus literally dozens of similar questions in Telegram chats). Typically people report having a "10-year-old laptop" with some old Intel CPU or something similar.

Of course, no one is going to tweak or rebuild anything. So unless, for example, a TTS model is fully quantized (or somehow cleverly packaged into ONNX), it does not make sense to quantize only some parts of the model or expose some logic outside of the jit / pt packages, even if it reduces the package size significantly.

But there is a third option. If there is a single large layer / module (e.g. nn.Embedding, the best candidate), there is a dirty hack:

- Do not quantize the model;
- Quantize the weight matrix manually;
- Save the checkpoint with int8 weights;
- Store scale and zero_point separately;
- On loading, just convert int8 into float32 manually;

(Basically the same approach as dynamic quantization).

Your mileage may vary, but the basic conversion is as follows:

qmax = 127
qmin = -128
scale = (weight.max() - weight.min()) / (qmax - qmin)
zero_point = qmin - weight.min() / scale
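
For completeness, a sketch of the full round trip implied by these formulas (the embedding size and variable names are illustrative, not our actual model):

import torch

weight = torch.randn(10_000, 256)  # stand-in for the nn.Embedding weight

qmin, qmax = -128, 127
scale = (weight.max() - weight.min()) / (qmax - qmin)
zero_point = qmin - weight.min() / scale

# Quantize: this int8 tensor plus scale and zero_point go into the checkpoint
q_weight = torch.clamp(torch.round(weight / scale + zero_point), qmin, qmax).to(torch.int8)

# Dequantize on load: back to float32 before plugging it into nn.Embedding
deq_weight = (q_weight.float() - zero_point) * scale

print((weight - deq_weight).abs().max())  # error is bounded by roughly scale / 2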

Obviously we tried going below int8, but the dynamic range for nn.Embedding was somewhere around 2**6, so we decided not to.

If this faces some further real world hurdles, I will provide an update.

#deep_learning
CoCa: Contrastive Captioners are Image-Text Foundation Models

Looks like Google is dead set on developing a production-grade dual image-text encoder / captioning model:

we unify single-encoder, dual-encoder and encoder-decoder paradigms, and train one image-text foundation model that subsumes the capabilities of all three approaches


The idea of using all of the available noisy data and approaches and creatively sharing the compute is a good pattern, until you read this line:

Pretraining CoCa takes about 5 days on 2,048 CloudTPUv4 chips


Research and compute siloing, of course, but the pattern itself is nice.

#deep_learning
A Rating of Russian-Language Sentence Encoders

Sentence encoders for Russian that are useful in real life are a rare bird.

So, without further ado, I will just go ahead and repost this article:

- https://habr.com/ru/post/669674/

My detailed comment - https://habr.com/ru/post/669674/#comment_24412620

Repost far and wide.

#deep_learning
The Devastation Is Not in the Lavatories, or: To Grow Wings You Need a Longing for Flight

I recently reposted an article here about a useful Russian BERT. And... despite being head and shoulders above the author's previous article of the same kind, it got +20 on Habr. Hm.

Recently Habr announced the results of its latest article contest... and in the ML category they gave the prize to a survey-style article. It is a perfectly fine and useful survey, but if they had given their "prize" to a translated article about the latest hype, it would have been even more telling.

Not that we, or I, had truly ideal candidate articles before (in ML they had picked decent but non-constructive articles before as well), but in 2021 we had a unicorn of an article that got +205 with 45k views.

And naturally there are a couple more constructive articles in that category (where people actually built something with their own hands)... but as of now Habr has, of course, already taken down that page (https://habr.com/ru/technotext/ml/).

And here we come to the main point of this post. Showing people what they are capable of is dangerous. All constructive work must be smothered, and empty shouting must be cheered on. One must push the cargo cult and the loudest, most meaningless headlines possible.

Does this remind you of anything, or anyone?