Vol Building AGI

Excited to see the first book on differentiable programming. It explicitly talks about how to encode regular programs into structures that have gradient flow. https://arxiv.org/abs/2403.14606

arXiv.org

The Elements of Differentiable Programming

Artificial intelligence has recently experienced remarkable advances, fueled by large models, vast datasets, accelerated hardware, and, last but not least, the transformative power of...

Vol Building AGI

Discrete Bayesian Flow Networks teaser

Vol Building AGI

I stumbled on the Efficient BackProp paper by LeCun et al. while discussing the differences between internal covariate shift and input whitening.

It gives a comprehensive overview of the tricks needed to train deep models successfully: why and how to initialize weights, how to choose nonlinearities (to some extent), how to choose and preprocess training data, how to choose learning rates, how basic optimization dynamics behave, and how to use the Hessian to diagnose them: https://cseweb.ucsd.edu/classes/wi08/cse253/Handouts/lecun-98b.pdf

Vol Building AGI

Balancing sequence lengths in your dataset is the best augmentation you can do to successfully train a Transformer

https://aclanthology.org/2021.emnlp-main.650/

ACL Anthology

Sequence Length is a Domain: Length-based Overfitting in Transformer Models

Dusan Varis, Ondřej Bojar. Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing. 2021.

Vol Building AGI

The same principle (the sequence length distribution needs to be uniform) actually applies to RNNs too. I trained a SHA-RNN on byte-level ukpron (a grapheme-to-phoneme task), and making sequence lengths uniform was key to getting the model to work: https://huggingface.co/darkproger/ukpron
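A minimal sketch of one way to flatten a length distribution, assuming simple oversampling of the rarer length buckets. The post does not say how the balancing was actually done for ukpron, so `balance_by_length`, `bucket_width` and the toy data are all hypothetical:

```python
import random
from collections import defaultdict


def balance_by_length(examples, bucket_width=4, seed=0):
    """Oversample so every length bucket holds the same number of examples.

    Assumption: bucketing by len(example) // bucket_width is good enough;
    a real pipeline might bucket by token count instead.
    """
    rng = random.Random(seed)
    buckets = defaultdict(list)
    for ex in examples:
        buckets[len(ex) // bucket_width].append(ex)

    target = max(len(items) for items in buckets.values())
    balanced = []
    for key in sorted(buckets):
        items = buckets[key]
        # Keep every original example, then top up with random repeats.
        balanced.extend(items)
        balanced.extend(rng.choice(items) for _ in range(target - len(items)))
    rng.shuffle(balanced)
    return balanced


# Toy dataset: many short strings, few long ones.
words = ["aa"] * 8 + ["bbbbbb"] * 3 + ["ccccccccccc"] * 1
balanced = balance_by_length(words)
```

Downsampling the frequent buckets is the other obvious option; it avoids repeats at the cost of throwing data away.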

Vol Building AGI

When we were already training uk4b, Karpathy posted a note that padding the number of rows in the tied input-output embedding table to a multiple of 64 (from 50257 to 50304) gave a significant training speedup. This thread from Horace He characterizes the space of related phenomena: https://twitter.com/cHHillee/status/1630274804795445248

In short, GEMM kernels access memory in tiles, and when the tiles are aligned to cache lines the SM needs the fewest memory access operations to feed the kernel.
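The padding arithmetic is easy to check. A small sketch (the helper name `pad_to_multiple` is made up for illustration):

```python
def pad_to_multiple(n, multiple):
    """Round n up to the nearest multiple (ceiling division, then scale)."""
    return -(-n // multiple) * multiple


padded_vocab = pad_to_multiple(50257, 64)  # 50304
extra_rows = padded_vocab - 50257          # 47 unused embedding rows
```

Note that 50304 is the next multiple of 64 above 50257; rounding up to a multiple of 8 alone would stop at 50264.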

In Accelerated Scan (my high-performance training kernel for linear RNNs, responsible for computing the recurrence over all tokens when training the network or reading the prompt at inference), the backward kernel reads memory in reverse. Alex Nichol found that loading the memory in reverse in chunks of 4 makes the access aligned and speeds up the loads!

That change in turn allowed fusing the reverse scan with a few other kernels, resulting in 30-50% training speedups. Coming in Accelerated Scan 0.2.
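To see why the backward kernel walks memory in reverse, here is a sequential pure-Python sketch of a first-order linear recurrence and its gradient. It is only a reference for the math, assuming the recurrence h[t] = gates[t] * h[t-1] + inputs[t]; it is not the Accelerated Scan kernel, and the function names are made up:

```python
def scan_forward(gates, inputs, h0=0.0):
    """Compute h[t] = gates[t] * h[t-1] + inputs[t] for every t."""
    h, out = h0, []
    for a, x in zip(gates, inputs):
        h = a * h + x
        out.append(h)
    return out


def scan_backward(gates, inputs, grad_out, h0=0.0):
    """Gradients of sum(grad_out[t] * h[t]) w.r.t. inputs and gates.

    The gradient obeys the same kind of recurrence, but running from the
    last token to the first: a reverse scan.
    """
    hs = scan_forward(gates, inputs, h0)
    n = len(gates)
    d_inputs = [0.0] * n
    d_gates = [0.0] * n
    carry = 0.0  # gradient arriving from later time steps
    for t in reversed(range(n)):
        g = grad_out[t] + carry          # total dL/dh[t]
        d_inputs[t] = g
        h_prev = hs[t - 1] if t > 0 else h0
        d_gates[t] = g * h_prev
        carry = gates[t] * g             # flows back into h[t-1]
    return d_inputs, d_gates
```

A real kernel computes both directions with a parallel scan instead of a Python loop, but the data dependency is the same: gradients at token t need everything from tokens after t.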

X (formerly Twitter)

Horace He (@cHHillee) on X

Recently, Karpathy tweeted that *increasing* the size of his matmul made it run faster.

But... why? Many people seem content to leave this as black magic. But luckily, this *can* be understood!

Here's a plot of FLOPs achieved for square matmuls. Let's explain…

Vol Building AGI

Channel name was changed to «Vol Scaling RNNs»

21:02

Vol Building AGI

Continuing the saga of sequence length being a data domain: even modern transformers with RoPE, ALiBi, or Transformer-XL-like relative positional encodings do not generalize robustly over sequence length and depend on the format of the training data.

Recent work has been using index hints when formatting inputs. In NLP, a common instance of an index hint could be punctuation (e.g. finishing sentences with a period). In practice, we've seen that models often refuse to perform the task if you forget punctuation.

https://arxiv.org/abs/2402.09371

Vol Building AGI

https://app.suno.ai/song/f44b528c-0bce-45fe-b5b6-bca579f97ed2

Hi Pedro, you tweeted:

This is outrageously wrong, Attention was invented at U. Montreal by Bahdanau, Cho and Bengio. Transformers were just an extension. This is the paper
that really invented modern AI.

THIS IS OUTRAGEOUSLY WRONG.

The first Transformer variant was published over 30 years ago.

It is now called "unnormalized linear Transformer".

THIS IS OUTRAGEOUSLY WRONG.

See my eye see em ell paper. Attention terminology was introduced in ninety three.

Back in the day, compute costs were high,
A million times more, oh my, oh my!
Transformers unnormalized, they'd fly,
Linear, efficient, no cry.

No quadratic woes, just linear stride,
Scaling with input, ain't no need to RAG.
Compute constraints, they pushed the tide,
In Transformer's journey, they'd confide.

So listen up, in the computational spree,
Efficiency's the key, it's plain to see.
Sequence length scalability, that's the decree,
In the Transformer world, we're wild and free!

THIS IS OUTRAGEOUSLY WRONG.

Vaswani did not cite this.

THIS IS OUTRAGEOUSLY WRONG.

Here is a well-known tweet on this

THIS IS OUTRAGEOUSLY WRONG.

Quadratic ChatGPT

THIS IS OUTRAGEOUSLY WRONG.

THIS IS OUTRAGEOUSLY WRONG.

MACHINE LEARNING IS THE SCIENCE OF CREDIT ASSIGNMENT

app.suno.ai

Credit Assignment | Suno

angry comedy hip hop song. Listen and make your own with Suno.

Vol Building AGI

ETH has an Engineering-focused course on Category Theory, checking out its notes: https://applied-compositional-thinking.engineering/resources/

Vol Building AGI

Wow — time to get your kids into writing NeurIPS papers

Vol Building AGI

Parameter-shared state expansion is the next frontier in efficient RNNs. It's a somewhat cumbersome concept that has been explored in linear transformers but really became popular with Mamba.

Now, HGRN2 has introduced an elegant implementation of the simplest expansion rule: the outer product. An outer product multiplies a vector of shape Nx1 by a vector of shape 1xN to produce an NxN matrix, giving a quadratic number of pairwise multiplicative interactions.
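A minimal sketch of the idea in plain Python. The update rule below (each state row decays by its forget gate and accumulates an outer product whose key is one minus that gate) is my reading of the HGRN2 recurrence and should be treated as an assumption, not the paper's exact formulation; the function names are made up:

```python
def outer(u, v):
    """Outer product: an N-vector times an M-vector gives an N x M matrix."""
    return [[ui * vj for vj in v] for ui in u]


def expanded_state_step(state, forget, value):
    """One recurrent step with outer-product state expansion.

    state:  N x N matrix (the expanded hidden state)
    forget: N forget gates in (0, 1)
    value:  N input values
    Assumed rule: state = diag(forget) @ state + outer(1 - forget, value)
    """
    key = [1.0 - f for f in forget]
    update = outer(key, value)
    return [
        [f * s + u for s, u in zip(row, update_row)]
        for f, row, update_row in zip(forget, state, update)
    ]


state = [[0.0, 0.0], [0.0, 0.0]]
state = expanded_state_step(state, forget=[0.5, 0.0], value=[2.0, 4.0])
```

The gates and inputs stay N-dimensional, so no extra parameters are needed to drive the N x N state; that is the "parameter-shared" part.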

Outer products work well with Accelerated Scan and are already supported in the mqar branch of Hippogriff

https://twitter.com/SonglinYang4/status/1778897457159758010

X (formerly Twitter)

Songlin Yang (@SonglinYang4) on X

I really like HGRN2's concept of integrating forget gates with keys in (gated) linear attention. It's a neat and effective approach! We've incorporated HGRN2 into the Flash-Linear-Attention library. Check it out here: https://t.co/WPozxdRnTD.

Vol Building AGI

Outer product state expansion learns *really fast*

Vol Building AGI

Wrote a small summary of Transformers, RNNs and Fast Weight Programmers using the language of transformers.
https://twitter.com/darkproger/status/1779123166561890357

X (formerly Twitter)

Volodymyr Kyrylov (@darkproger) on X

In a transformer there are as many keys as tokens, so softmax is used to choose one. In an RNN the number of keys is chosen up front.
In Hawk/RecurrentGemma the key dimension is 1, and the sigmoid gates can choose all keys, some or none.
A FWP has multidimensional…

Vol Building AGI

Emergent property

Vol Building AGI

The_Fisher_Darmois_Koopman_Pitman_theorem_for_random_processes.pdf

179.4 KB

Discovered this probability result, the Fisher-Darmois-Koopman-Pitman (F-D-K-P) theorem:

an i.i.d. stream of data of arbitrary size can be summarized into a fixed, finite number of sufficient statistics (= the hidden state can be of fixed size and no information will be lost) iff the data comes from an exponential family of distributions. The result applies to families whose support does not depend on the parameter.

Normal, Bernoulli and MRFs are exponential families, but the uniform distribution with unknown bounds is not. Weirdly enough, a lot of LM diagnostic tasks sample from uniform distributions.

The attached paper generalizes the result to random processes summarizable by Kalman filters, so it translates to single-layer linear RNNs.
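A concrete instance of the fixed-size summary, sketched for the normal distribution: three running numbers are enough to recover the maximum-likelihood mean and variance, no matter how long the stream is. The class name `GaussianSummary` is made up for illustration:

```python
class GaussianSummary:
    """Fixed-size sufficient statistics for an i.i.d. Gaussian stream."""

    def __init__(self):
        self.n = 0          # number of observations
        self.sum_x = 0.0    # sum of x
        self.sum_x2 = 0.0   # sum of x squared

    def update(self, x):
        """Fold one observation into the state; memory use stays constant."""
        self.n += 1
        self.sum_x += x
        self.sum_x2 += x * x

    def mle(self):
        """Maximum-likelihood mean and variance from the statistics alone."""
        mean = self.sum_x / self.n
        var = self.sum_x2 / self.n - mean * mean
        return mean, var


summary = GaussianSummary()
for x in [1.0, 2.0, 3.0, 4.0]:
    summary.update(x)
mean, var = summary.mle()
```

The update is a linear recurrence on a three-dimensional state, which is why the analogy with a fixed-size RNN hidden state holds here.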

Vol Building AGI

Meta is in the RNN game https://arxiv.org/abs/2404.08801

arXiv.org

Megalodon: Efficient LLM Pretraining and Inference with Unlimited...

The quadratic complexity and weak length extrapolation of Transformers limits their ability to scale to long sequences, and while sub-quadratic solutions like linear attention and state space...

Vol Building AGI

https://arxiv.org/abs/2404.08819

A paper precisely demonstrating the difference between input-dependent forget gate vectors and input-dependent recurrence matrices. My intuition from Dyck-1 synthetic tasks suggested there is no difference between recurrence matrices and gated vectors, but there is one (I am super pissed I arrived at the wrong conclusion)!

A linear RNN with a non-diagonal recurrence matrix can simulate DFAs (solve word problems that involve state permutation) and stay linear — strictly better than a transformer!
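A toy illustration of the claim, assuming the simplest possible setup: the hidden state is a one-hot vector over DFA states, and each input symbol selects a permutation matrix as the recurrence matrix. The automaton (track one element under "swap" and "rotate" moves on three positions) and all names are made up for illustration:

```python
# One non-diagonal transition matrix per input symbol; entry [new][old] is 1
# when the symbol moves the tracked element from position old to position new.
TRANSITIONS = {
    "s": [[0, 1, 0], [1, 0, 0], [0, 0, 1]],  # swap positions 0 and 1
    "r": [[0, 0, 1], [1, 0, 0], [0, 1, 0]],  # rotate: p -> (p + 1) % 3
}


def matvec(matrix, vector):
    return [sum(a * b for a, b in zip(row, vector)) for row in matrix]


def run_linear_rnn(word, start=0):
    """h[t] = A(x[t]) @ h[t-1]: linear in h, with an input-dependent matrix."""
    h = [1 if i == start else 0 for i in range(3)]
    for symbol in word:
        h = matvec(TRANSITIONS[symbol], h)
    return h.index(1)


def run_dfa(word, start=0):
    """Reference implementation: track the position directly."""
    position = start
    for symbol in word:
        if symbol == "s":
            position = {0: 1, 1: 0, 2: 2}[position]
        else:
            position = (position + 1) % 3
    return position
```

A diagonal recurrence can only rescale each state coordinate independently, so it cannot express these permutations; the off-diagonal entries are what move mass between states.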

arXiv.org

The Illusion of State in State-Space Models

State-space models (SSMs) have emerged as a potential alternative architecture for building large language models (LLMs) compared to the previously ubiquitous transformer architecture. One...

Vol Building AGI

Modern pretraining uses attention masks that avoid crossing document boundaries. Llama 3 authors claim they’ve been using intra-document masking. Yi Tay (of Reka and UL2 fame; probably it’s in the Reka tech report?) says it’s "basic fundamental stuff". Amazon offers packing algorithms for batching that opt for a little bit of padding instead of a lot of truncation.

nanoGPT-style random sampling of file offsets with back-to-back documents that we used for uk4b is way too yolo for modern standards.
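For reference, an intra-document mask is just a causal mask with one extra condition. A minimal sketch, assuming each token in a packed sequence carries a document id; the function name and the toy ids are made up, and real codebases build this as a tensor rather than nested lists:

```python
def document_causal_mask(doc_ids):
    """mask[i][j] is True when token i may attend to token j.

    Two conditions: j is not in the future (causal), and both tokens
    belong to the same document (no attention across boundaries).
    """
    n = len(doc_ids)
    return [
        [j <= i and doc_ids[i] == doc_ids[j] for j in range(n)]
        for i in range(n)
    ]


# Two documents packed back to back into one training sequence.
mask = document_causal_mask([0, 0, 1, 1, 1])
```

With nanoGPT-style sampling the second condition is simply missing, so the first tokens of a document attend to the tail of whatever document happened to precede it.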

https://x.com/yitayml/status/1781090183703572500

X (formerly Twitter)

Yi Tay (@YiTayML) on X

@PMinervini @yuzhaouoe Segmentation masks are like basic fundamental stuff that a code base should have though...
