Vol Building AGI

No reason to use transformer decoders any more for LLMs :)

109 views12:45

Vol Building AGI

RNNs are faster to train, faster at inference, and more data-efficient.

👍3

109 views12:46

Vol Building AGI

ARPA count tables? RNN weight matrices? Decision trees? Suffix arrays!

https://arxiv.org/abs/2401.17377
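
A minimal sketch of why a suffix array can stand in for a count table: all suffixes that start with a given n-gram sit next to each other in sorted order, so two binary searches give its count. This is a toy illustration, not the linked paper's implementation; the corpus and the names `build_suffix_array` and `count_ngram` are made up for the example.

```python
def build_suffix_array(tokens):
    # Naive construction: sort suffix start positions by the suffix itself.
    # Fine for a toy corpus, far too slow for real data.
    return sorted(range(len(tokens)), key=lambda i: tokens[i:])


def _search(tokens, sa, ngram, upper):
    # Binary search over sorted suffixes, comparing only the first len(ngram) tokens.
    # upper=False: first suffix whose prefix is >= ngram.
    # upper=True:  first suffix whose prefix is >  ngram.
    lo, hi = 0, len(sa)
    n = len(ngram)
    while lo < hi:
        mid = (lo + hi) // 2
        prefix = tokens[sa[mid]:sa[mid] + n]
        if prefix < ngram or (upper and prefix == ngram):
            lo = mid + 1
        else:
            hi = mid
    return lo


def count_ngram(tokens, sa, ngram):
    # Number of occurrences of `ngram` in `tokens`.
    return _search(tokens, sa, ngram, True) - _search(tokens, sa, ngram, False)


tokens = "the cat sat on the mat the cat ran".split()
sa = build_suffix_array(tokens)

# Count-based next-token estimate: P(sat | the cat) = count(the cat sat) / count(the cat)
p_sat = count_ngram(tokens, sa, ["the", "cat", "sat"]) / count_ngram(tokens, sa, ["the", "cat"])
```

The same two searches work for any n-gram length, so no maximum order has to be fixed ahead of time.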

🔥1

115 views09:37

Vol Building AGI

wandb is in a good mood today:

❤2

115 viewsedited 22:22

Vol Building AGI

https://twitter.com/DlCountdown/status/1764278990011813975

The NeurIPS conference submission deadline is in late May; workshop deadlines will probably be in August.

X (formerly Twitter)

AI Conference DL Countdown (@DlCountdown) on X

The NeurIPS deadline has been announced:
May 22nd, 8PM UTC

99 viewsedited 14:07

Vol Building AGI

https://x.com/mlstreettalk/status/1765701266221522986

This is what you learn as a side note in our Machine Learning course at USI. Glad Yann communicates this message to a large audience. Recurrent neural nets can do anything, but gradient descent won’t find everything.

X (formerly Twitter)

Machine Learning Street Talk (@MLStreetTalk) on X

In 2021 on MLST the legendary @ylecun argued that RNNs were Turing Complete. In 2024, he came to the dark side! What do you think? 👇

102 viewsedited 11:53

Vol Building AGI

Mathematics is the science of transmitting simple ideas about the regularity of the world between people. It is a programming language in which you concisely describe your thought so that you can load it into your colleagues' minds with absolute precision.

Yehor started a channel where we learn to improve the skill of precisely communicating a library of mathematical ideas among AI developers.

Join: https://t.me/applied_math_uk

Прикладна математика (Applied Mathematics)

About applied mathematics, in Ukrainian

Groups:

— https://t.me/speech_recognition_uk
— https://t.me/speech_synthesis_uk
— https://t.me/computer_vision_uk
— https://t.me/ai_work_uk
— https://t.me/nlp_uk

Discord: https://t.me/discord_uds

❤2

461 views12:27

Vol Building AGI

First release of Hippogriff: my implementation of the Griffin architecture, a hybrid of a local transformer with sliding-window multi-query attention (like Mistral) and linear recurrence (like Mamba/RWKV).

Inside the package you will also find my handcrafted training loop with diagnostics for activations and weight state.

https://github.com/proger/hippogriff

GitHub

GitHub - proger/hippogriff: Griffin MQA + Hawk Linear RNN Hybrid

Griffin MQA + Hawk Linear RNN Hybrid. Contribute to proger/hippogriff development by creating an account on GitHub.

👍3

494 viewsedited 14:03

Vol Building AGI

https://twitter.com/OfirPress/status/1767282605794136148

X (formerly Twitter)

Ofir Press (@OfirPress) on X

When a student sadly tells me that the idea we've been working on for weeks was just arXived, I say:

"Great! We've just gotten *strong* confirmation that our thinking was in the right direction. We've had the initial work done for us. Lets figure out how…

99 views07:49

I love MATLAB/Octave. Its plotting experience is so smooth compared to matplotlib! NumPy/torch have their array APIs copied from MATLAB, so the amount of things you need to remember to move from Python is very small.

🤯1

111 views09:09

Vol Building AGI

To train transformers, you need a lot of diverse data. Let's use online RL to generate data!

Check out my new repo, control: Soft Actor-Critic for producing experience trajectories

https://github.com/proger/control

🔥2

98 views12:19

Vol Building AGI

Bayesian Flow Networks (BFNs) link iterative denoising diffusion and recursive estimation of distribution parameters.

In my new post, I contrast autoregressive generative modeling (prevalent in language) and recursive Bayesian estimation of all parameters jointly.

https://proger.github.io/posts/bfn/normal.html
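
To make "recursive estimation of distribution parameters" concrete, here is a small sketch of the standard conjugate update for the mean of a normal distribution with known observation precision. Folding observations in one at a time gives the same posterior as processing them all at once. The function name, the prior, and the numbers are assumptions for illustration; this is not code from the post or the paper.

```python
def bayes_update(mu, rho, y, alpha):
    # Prior belief about the mean: Normal(mu, 1/rho).
    # Observation y arrives with known precision alpha.
    # Posterior is again normal: precisions add, means are precision-weighted.
    rho_new = rho + alpha
    mu_new = (rho * mu + alpha * y) / rho_new
    return mu_new, rho_new


observations = [1.2, 0.8, 1.1, 0.9]
alpha = 4.0

# Recursive estimation: update the belief after every observation.
mu, rho = 0.0, 1.0
for y in observations:
    mu, rho = bayes_update(mu, rho, y, alpha)

# Batch estimation from the same prior, for comparison.
batch_rho = 1.0 + alpha * len(observations)
batch_mu = (1.0 * 0.0 + alpha * sum(observations)) / batch_rho
```

The pair `(mu, rho)` is the whole state carried between steps, which is what makes the estimation recursive.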

arXiv.org

Bayesian Flow Networks

This paper introduces Bayesian Flow Networks (BFNs), a new class of generative model in which the parameters of a set of independent distributions are modified with Bayesian inference in the light...

103 views13:29

Vol Building AGI

Excited to see the first book on differentiable programming. It explicitly talks about how to encode regular programs into structures that have gradient flow. https://arxiv.org/abs/2403.14606

arXiv.org

The Elements of Differentiable Programming

Artificial intelligence has recently experienced remarkable advances, fueled by large models, vast datasets, accelerated hardware, and, last but not least, the transformative power of...

😱2

91 views07:58

Vol Building AGI

Discrete Bayesian Flow Networks teaser

111 views16:14

Vol Building AGI

I stumbled on this paper on Efficient BackProp by LeCun et al. while discussing the differences between internal covariate shift and input whitening.

This work provides a comprehensive overview of the tricks necessary to successfully train deep models: why and how to initialize weights, how to choose nonlinearities (to some extent), how to choose and preprocess training data, how to choose learning rates, how basic optimization dynamics behave, and how to use the Hessian to diagnose them: https://cseweb.ucsd.edu/classes/wi08/cse253/Handouts/lecun-98b.pdf
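
Two of these tricks fit in a few lines: shifting and scaling each input feature to zero mean and unit variance, and drawing initial weights with standard deviation `fan_in ** -0.5` so that pre-activations start out at roughly unit scale. A rough sketch in plain Python; the function names and layer sizes are mine, not the paper's.

```python
import math
import random


def standardize(column):
    # Shift a feature to zero mean and scale it to unit variance.
    mean = sum(column) / len(column)
    var = sum((x - mean) ** 2 for x in column) / len(column)
    std = math.sqrt(var) or 1.0  # leave constant features unscaled
    return [(x - mean) / std for x in column]


def init_weights(fan_in, fan_out, rng):
    # Zero-mean Gaussian weights with std = fan_in ** -0.5.
    std = fan_in ** -0.5
    return [[rng.gauss(0.0, std) for _ in range(fan_in)] for _ in range(fan_out)]


rng = random.Random(0)
feature = standardize([3.0, 5.0, 7.0, 9.0, 11.0])
weights = init_weights(fan_in=256, fan_out=64, rng=rng)

# Empirical check: the sample std of the weights should be close to 1/sqrt(256).
flat = [w for row in weights for w in row]
sample_std = math.sqrt(sum(w * w for w in flat) / len(flat))
```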

348 views12:02

Vol Building AGI

Balancing sequence lengths in your dataset is the best augmentation you can do to successfully train a Transformer

https://aclanthology.org/2021.emnlp-main.650/

ACL Anthology

Sequence Length is a Domain: Length-based Overfitting in Transformer Models

Dusan Varis, Ondřej Bojar. Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing. 2021.

90 views12:18

Vol Building AGI

The same principle (the sequence length distribution needs to be uniform) actually applies to RNNs too. I trained a SHA-RNN on byte-level ukpron (a grapheme-to-phoneme task), and making sequence lengths uniform was key to getting the model to work: https://huggingface.co/darkproger/ukpron
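
One possible way to flatten the length distribution is to bucket examples by length and draw the same number of examples from every bucket, oversampling the rare lengths. This is a hedged sketch of that idea, not the procedure used for ukpron or in the paper; `balance_by_length`, the bucket width, and the toy data are assumptions.

```python
import random
from collections import Counter, defaultdict


def balance_by_length(examples, bucket_width, per_bucket, rng):
    # Group examples into buckets of similar length.
    buckets = defaultdict(list)
    for ex in examples:
        buckets[len(ex) // bucket_width].append(ex)
    # Draw the same number of examples from each bucket (with replacement,
    # so rare lengths get repeated and common lengths get subsampled).
    balanced = []
    for key in sorted(buckets):
        balanced.extend(rng.choices(buckets[key], k=per_bucket))
    rng.shuffle(balanced)
    return balanced


# Skewed toy dataset: mostly short sequences, very few long ones.
examples = ["ab"] * 90 + ["abcdefgh"] * 9 + ["abcdefghijklmnop"] * 1
balanced = balance_by_length(examples, bucket_width=4, per_bucket=30, rng=random.Random(0))
length_counts = Counter(len(ex) for ex in balanced)
```

Sampling with replacement repeats rare long examples verbatim, so it only balances lengths and does not add new information about long sequences.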

huggingface.co

darkproger/ukpron · Hugging Face

We’re on a journey to advance and democratize artificial intelligence through open source and open science.

254 viewsedited 12:19

Vol Building AGI

While we were already training uk4b, Karpathy posted a note that padding the number of rows in the tied input-output embedding table to a multiple of 64 (from 50257 to 50304) gave a significant training speedup. This thread from Horace He characterizes the range of phenomena related to this: https://twitter.com/cHHillee/status/1630274804795445248

In short, GEMM kernels access memory in tiles, and when the tiles are aligned to cache lines, the SM needs the fewest memory access operations to feed the kernel.
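
The padding arithmetic itself is a one-liner. 50304 is the next multiple of 64 after 50257, while the next multiple of 8 would be 50264. The helper name `round_up` is just for illustration.

```python
def round_up(n, multiple):
    # Smallest multiple of `multiple` that is >= n.
    return ((n + multiple - 1) // multiple) * multiple


vocab_size = 50257
padded_vocab = round_up(vocab_size, 64)       # 50304
extra_rows = padded_vocab - vocab_size        # 47 unused embedding rows
```

The extra rows never correspond to real tokens; they only exist to make the matrix shape friendlier to the kernel.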

Accelerated Scan is my high-performance training kernel for linear RNNs: it computes the recurrence for all tokens when training the network or reading the prompt at inference. Its backward kernel reads memory in reverse. Alex Nichol found that loading the memory in reverse in chunks of 4 makes the access aligned and speeds up the loads!

That change in turn allowed fusing the reverse scan with a few other kernels, resulting in 30-50% training speedups. Coming in Accelerated Scan 0.2.
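
For readers wondering why the backward pass reads memory in reverse: for a linear recurrence h[t] = a[t] * h[t-1] + x[t], the gradient with respect to h[t] depends on the gradient at step t+1, so the backward pass is the same kind of scan run from the last token to the first. Below is a sequential reference sketch in plain Python. It is not the Accelerated Scan kernel (which runs in parallel on the GPU), and all names here are made up for the example.

```python
def linear_scan(gates, inputs, h0=0.0):
    # Forward recurrence: h[t] = gates[t] * h[t-1] + inputs[t]
    h, hs = h0, []
    for a, x in zip(gates, inputs):
        h = a * h + x
        hs.append(h)
    return hs


def linear_scan_backward(gates, hs, grad_hs, h0=0.0):
    # Reverse scan: g[t] = grad_hs[t] + gates[t+1] * g[t+1]
    T = len(gates)
    grad_gates = [0.0] * T
    grad_inputs = [0.0] * T
    carry = 0.0  # gradient arriving at h[t] from step t+1
    for t in reversed(range(T)):
        g = grad_hs[t] + carry              # total dL/dh[t]
        h_prev = hs[t - 1] if t > 0 else h0
        grad_inputs[t] = g                  # dh[t]/dx[t] = 1
        grad_gates[t] = g * h_prev          # dh[t]/da[t] = h[t-1]
        carry = gates[t] * g                # dh[t]/dh[t-1] = a[t]
    return grad_gates, grad_inputs


gates = [0.9, 0.5, 0.8, 0.7]
inputs = [1.0, 2.0, -1.0, 0.5]
hs = linear_scan(gates, inputs)

# Loss = sum of all hidden states, so dL/dh[t] = 1 for every t.
grad_gates, grad_inputs = linear_scan_backward(gates, hs, [1.0] * len(hs))
```

Comparing `grad_gates` and `grad_inputs` against finite differences of `sum(linear_scan(...))` is a quick way to check the reverse scan.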

X (formerly Twitter)

Horace He (@cHHillee) on X

Recently, Karpathy tweeted that *increasing* the size of his matmul made it run faster.

But... why? Many people seem content to leave this as black magic. But luckily, this *can* be understood!

Here's a plot of FLOPs achieved for square matmuls. Let's explain…

97 viewsedited 20:59

Vol Building AGI

Channel name was changed to «Vol Scaling RNNs»

21:02
