Vol Building AGI
Past topics: speech synthesis, transformers, LSTM, recurrence
Mathematics is the science of transmitting simple ideas about the regularity of the world between people. It is a programming language in which you concisely describe your thought in order to load it into your colleagues' minds with absolute precision.

Yehor has started a channel where we learn to improve the skill of precisely communicating a library of mathematical ideas among AI developers.

Join: https://t.me/applied_math_uk
First release of Hippogriff: my implementation of the Griffin architecture, a hybrid of a local transformer with sliding multi-query attention (like Mistral) and linear recurrence (like Mamba/RWKV).

Inside the package you will also find my handcrafted training loop with diagnostics for activations and weight state.

https://github.com/proger/hippogriff
I love MATLAB/Octave. Its plotting experience is so smooth compared to matplotlib! NumPy and torch have their array APIs copied from MATLAB, so the number of things you need to remember when moving from Python is very small.
To train transformers, you need a lot of diverse data. Let's use online RL to generate data!

Check out my new repo, control: Soft Actor Critic to produce experience trajectories

https://github.com/proger/control
Bayesian Flow Networks (BFNs) link iterative denoising diffusion and recursive estimation of distribution parameters.

In my new post, I contrast autoregressive generative modeling (prevalent in language) with recursive Bayesian estimation of all parameters jointly.
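
As a rough illustration of what "recursive estimation of distribution parameters" means, here is a minimal sketch of the textbook conjugate update for the mean of a Gaussian with known observation precision. It is not taken from the post; the function name `bayes_update` and the example numbers are assumptions made for illustration.

```python
def bayes_update(mean, precision, observation, obs_precision):
    """One recursive Bayesian step for a Gaussian belief over an unknown mean.

    The belief is N(mean, 1/precision); the observation is a noisy reading
    of the unknown value with known precision obs_precision.
    """
    new_precision = precision + obs_precision
    new_mean = (precision * mean + obs_precision * observation) / new_precision
    return new_mean, new_precision


# Start from a broad prior and refine it with two noisy observations
# (hypothetical numbers).
mean, precision = 0.0, 1.0
for y, alpha in [(2.0, 1.0), (2.0, 2.0)]:
    mean, precision = bayes_update(mean, precision, y, alpha)

print(mean, precision)
```

Each step only needs the previous belief and the new observation, so the parameters of every variable can be refined jointly instead of one token at a time.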

https://proger.github.io/posts/bfn/normal.html
Discrete Bayesian Flow Networks teaser
I stumbled on this paper on Efficient Backprop from LeCun et al. while discussing the differences between internal covariate shift and input whitening.

This work provides a comprehensive overview of the tricks that are necessary to successfully train deep models: why and how to initialize weights, how to choose nonlinearities (to some extent), how to choose and preprocess training data, how to choose learning rates, what the basic optimization dynamics look like, and how to use the Hessian to diagnose them: https://cseweb.ucsd.edu/classes/wi08/cse253/Handouts/lecun-98b.pdf
The same principle (the sequence length distribution needs to be uniform) actually applies to RNNs too. I trained a SHA-RNN on byte-level ukpron (a grapheme-to-phoneme task), and making sequence lengths uniform was key to getting the model to work: https://huggingface.co/darkproger/ukpron
When we were already training uk4b, Karpathy posted a note that padding the number of rows in the tied input-output embedding table to a multiple of 64 (from 50257 to 50304) gave a significant training speedup. This thread from Horace He characterizes the space of related phenomena: https://twitter.com/cHHillee/status/1630274804795445248

In short, GEMM kernels access memory in tile blocks, and when the blocks are aligned to the cache line, the SM needs the fewest memory access operations to feed the kernel.
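
The padding arithmetic itself is tiny. A minimal sketch, assuming a hypothetical helper named `pad_to_multiple`; note that 50304 is the next multiple of 64 after 50257:

```python
def pad_to_multiple(n: int, multiple: int) -> int:
    """Round n up to the nearest multiple of `multiple`."""
    return ((n + multiple - 1) // multiple) * multiple


vocab_size = 50257  # GPT-2 BPE vocabulary size
padded_vocab_size = pad_to_multiple(vocab_size, 64)
print(padded_vocab_size)  # 50304
```

The extra rows of the embedding table are never indexed by real tokens; they only exist so that the matrix dimension divides evenly into tiles.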

Accelerated Scan is my high-performance training kernel for linear RNNs: it computes the recurrence for all tokens when training the network or reading the prompt at inference. Its backward kernel reads memory in reverse. Alex Nichol found that loading the memory in reverse in chunks of 4 makes the access aligned and speeds up the loads!

That change in turn allowed fusing the reverse scan with a few other kernels, resulting in 30-50% training speedups. Coming in Accelerated Scan 0.2.
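
To see why the backward pass of a linear recurrence reads memory in reverse, here is a plain-Python reference of a first-order scan and its gradient. This is a sketch of the math only, not the Accelerated Scan kernel; the function names and the scalar (single-channel) setup are assumptions for illustration.

```python
def scan(gates, tokens, h0=0.0):
    """Forward recurrence: h[t] = gates[t] * h[t-1] + tokens[t]."""
    h, out = h0, []
    for a, x in zip(gates, tokens):
        h = a * h + x
        out.append(h)
    return out


def scan_backward(gates, out, grad_out, h0=0.0):
    """Gradients w.r.t. tokens and gates, given grad_out[t] = dL/dh[t].

    The accumulated gradient obeys d[t] = grad_out[t] + gates[t+1] * d[t+1],
    which is the same kind of recurrence run from the last token to the first.
    """
    T = len(gates)
    d_tokens = [0.0] * T
    d_gates = [0.0] * T
    carry = 0.0  # holds gates[t+1] * d[t+1]
    for t in reversed(range(T)):
        d = grad_out[t] + carry
        d_tokens[t] = d
        h_prev = out[t - 1] if t > 0 else h0
        d_gates[t] = d * h_prev
        carry = gates[t] * d
    return d_tokens, d_gates


gates = [0.5, 0.9, 0.1]
tokens = [1.0, 2.0, 3.0]
out = scan(gates, tokens)
# Loss = sum of all hidden states, so dL/dh[t] = 1 for every t.
d_tokens, d_gates = scan_backward(gates, out, [1.0, 1.0, 1.0])
```

The reverse loop over `t` is the access pattern that has to be made aligned in a GPU kernel.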
Channel name was changed to «Vol Scaling RNNs»
Balancing sequence lengths in your dataset is the best augmentation you can do to successfully train a Transformer https://aclanthology.org/2021.emnlp-main.650/
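
One simple way to balance lengths is to bucket examples by length and draw the same number from every bucket. This is a minimal sketch under that assumption, not the procedure from the paper; `balance_by_length`, `per_bucket` and `bucket_width` are hypothetical names.

```python
import random
from collections import defaultdict


def balance_by_length(sequences, per_bucket, bucket_width=1, seed=0):
    """Resample so that every length bucket contributes per_bucket examples.

    Rare lengths are oversampled (with replacement) and common lengths
    are subsampled.
    """
    rng = random.Random(seed)
    buckets = defaultdict(list)
    for seq in sequences:
        buckets[len(seq) // bucket_width].append(seq)
    balanced = []
    for key in sorted(buckets):
        balanced.extend(rng.choices(buckets[key], k=per_bucket))
    rng.shuffle(balanced)
    return balanced


# A skewed toy dataset: ten short strings, two medium, one long.
data = ["a"] * 10 + ["bb"] * 2 + ["ccc"]
balanced = balance_by_length(data, per_bucket=4)
```

Oversampling with replacement repeats rare examples verbatim, so in practice you may prefer to construct new examples of the missing lengths where the task allows it.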
Continuing the saga of sequence length being a data domain: even modern transformers with RoPE, ALiBi, or Transformer-XL-like relative positional encodings do not generalize robustly over sequence length and depend on the format of the training data.

Recent work has been using index hints when formatting inputs — in NLP a common instance of an index hint could be punctuation (e.g. finishing sentences with a dot). In practice, we've seen that models often refuse to perform the task if you forget punctuation.

https://arxiv.org/abs/2402.09371
https://app.suno.ai/song/f44b528c-0bce-45fe-b5b6-bca579f97ed2

Hi Pedro, you tweeted:

This is outrageously wrong, Attention was invented at U. Montreal by Bahdanau, Cho and Bengio. Transformers were just an extension. This is the paper
that really invented modern AI.

THIS IS OUTRAGEOUSLY WRONG.

The first Transformer variant was published over 30 years ago.

It is now called "unnormalized linear Transformer".

THIS IS OUTRAGEOUSLY WRONG.

See my eye see em ell paper. Attention terminology was introduced in ninety three.


Back in the day, compute costs were high,
A million times more, oh my, oh my!
Transformers unnormalized, they'd fly,
Linear, efficient, no cry.

No quadratic woes, just linear stride,
Scaling with input, ain't no need to RAG.
Compute constraints, they pushed the tide,
In Transformer's journey, they'd confide.

So listen up, in the computational spree,
Efficiency's the key, it's plain to see.
Sequence length scalability, that's the decree,
In the Transformer world, we're wild and free!

THIS IS OUTRAGEOUSLY WRONG.

Vaswani did not cite this.

THIS IS OUTRAGEOUSLY WRONG.

Here is a well-known tweet on this

THIS IS OUTRAGEOUSLY WRONG.

Quadratic ChatGPT

THIS IS OUTRAGEOUSLY WRONG.

THIS IS OUTRAGEOUSLY WRONG.

MACHINE LEARNING IS THE SCIENCE OF CREDIT ASSIGNMENT
ETH has an Engineering-focused course on Category Theory, checking out its notes: https://applied-compositional-thinking.engineering/resources/
Wow — time to get your kids into writing NeurIPS papers
Parameter-shared state expansion is the next frontier in efficient RNNs. It's a somewhat cumbersome concept that has been explored in linear transformers but really became popular with Mamba.

Now, HGRN2 has introduced an elegant implementation of the simplest expansion rule: the outer product. An outer product multiplies two vectors of shapes N×1 and 1×N to make an N×N matrix, which gives a quadratic number of pairwise multiplicative interactions.
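
A minimal pure-Python sketch of the idea: an outer product, plus a generic gated recurrence that accumulates outer products into a matrix-valued state. The update rule and the names `expand_state_step` and `read_state` are assumptions for illustration and are not claimed to be the exact HGRN2 formulation.

```python
def outer(u, v):
    """Outer product of two vectors: result[i][j] = u[i] * v[j]."""
    return [[ui * vj for vj in v] for ui in u]


def expand_state_step(state, decay, key, value):
    """One recurrent step on a matrix state.

    state[i][j] <- decay[i] * state[i][j] + key[i] * value[j]
    """
    update = outer(key, value)
    return [
        [decay[i] * state[i][j] + update[i][j] for j in range(len(value))]
        for i in range(len(key))
    ]


def read_state(state, query):
    """Read the state out with a query vector: out[j] = sum_i query[i] * state[i][j]."""
    return [
        sum(query[i] * state[i][j] for i in range(len(query)))
        for j in range(len(state[0]))
    ]


state = [[0.0, 0.0], [0.0, 0.0]]
state = expand_state_step(state, decay=[0.5, 0.5], key=[1.0, 0.0], value=[3.0, 4.0])
print(read_state(state, query=[1.0, 0.0]))  # [3.0, 4.0]
```

The state grows from N to N×N entries without adding parameters, and each row still follows an independent linear recurrence, which is why a scan kernel can process it.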

Outer products work well with Accelerated Scan and are already supported in the mqar branch of Hippogriff

https://twitter.com/SonglinYang4/status/1778897457159758010
Outer product state expansion learns *really fast*