Vol Building AGI
Past topics: speech synthesis, transformers, LSTM, recurrence
Optimizing parallel deep learning systems is a bit like navigating Tokyo by public transit
RWKV scaled to 1T tokens seems to beat Mistral trained on 8T on some multilingual benchmarks.

Zero-shot translation to Ukrainian in Eagle is about on par with Mistral in the 2-shot setting and with Llama 2 fine-tuned on 10k examples.


https://twitter.com/RWKV_AI/status/1751797147492888651
🔥2
No reason to use transformer decoders any more for LLMs :)
RNNs are faster to train, faster in inference and are more data efficient.
👍3
Arpa count tables? RNN weight matrices? Decision trees? Suffix arrays!

https://arxiv.org/abs/2401.17377
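
A minimal sketch of why a suffix array can stand in for an n-gram count table: once all suffixes of the corpus are sorted, every occurrence of an n-gram sits in one contiguous run, so a count is two binary searches. The toy corpus and all function names here are my own illustration, not code from the linked paper, and the naive construction below would not scale to a real corpus.

```python
from bisect import bisect_left, bisect_right


def build_suffix_array(tokens):
    # Naive construction: sort suffix start positions by the suffix itself.
    # Fine for a toy corpus; real systems use much faster algorithms.
    return sorted(range(len(tokens)), key=lambda i: tokens[i:])


def count_ngram(tokens, suffix_array, ngram):
    # Suffixes starting with `ngram` form one contiguous run in sorted order.
    # The prefix list is materialized only to keep the sketch short.
    n = len(ngram)
    prefixes = [tuple(tokens[i:i + n]) for i in suffix_array]
    key = tuple(ngram)
    return bisect_right(prefixes, key) - bisect_left(prefixes, key)


def next_token_prob(tokens, suffix_array, context, token):
    # Count-based estimate: count(context + token) / count(context).
    # Note: a context at the very end of the corpus has no continuation.
    context_count = count_ngram(tokens, suffix_array, context)
    if context_count == 0:
        return 0.0
    joint = count_ngram(tokens, suffix_array, list(context) + [token])
    return joint / context_count


corpus = "the cat sat on the mat the cat ran".split()
sa = build_suffix_array(corpus)
p_sat = next_token_prob(corpus, sa, ["the", "cat"], "sat")
print(p_sat)
```

The point is that the context length is not fixed in advance: the same index answers queries for any n.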
🔥1
wandb is in a good mood today:
https://x.com/mlstreettalk/status/1765701266221522986

This is what you learn as a side note in our Machine Learning course at USI. Glad Yann communicates this message to a large audience. Recurrent neural nets can do anything, but gradient descent won’t find everything.
Mathematics is the science of transmitting simple ideas about the regularity of the world between people. It is a programming language in which you concisely describe your thought in order to load it into your colleagues' minds with absolute precision.

Yehor has started a channel where we learn to improve the skill of precisely communicating a library of mathematical ideas among AI developers.

Join us: https://t.me/applied_math_uk
First release of Hippogriff: my implementation of the Griffin architecture, a hybrid of a local transformer with sliding multi-query attention (like Mistral) and linear recurrence (like Mamba/RWKV).

Inside the package you will also find my handcrafted training loop with diagnostics of activations and weight state.

https://github.com/proger/hippogriff
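
For readers new to the "linear recurrence" part, here is a rough sketch of the general idea, not code from the Hippogriff repo: the state update h_t = a_t * h_{t-1} + b_t is an affine map, and composing affine maps is associative, which is what lets such layers be computed with a parallel scan instead of a strictly sequential loop. The function names and toy numbers are assumptions for illustration, and the prefix loop below is still sequential.

```python
def linear_recurrence(a, b, h0=0.0):
    # Plain sequential form: h_t = a_t * h_{t-1} + b_t.
    h, out = h0, []
    for a_t, b_t in zip(a, b):
        h = a_t * h + b_t
        out.append(h)
    return out


def combine(left, right):
    # Compose two affine maps h -> a*h + b, applying `left` first.
    a1, b1 = left
    a2, b2 = right
    return (a1 * a2, a2 * b1 + b2)


def linear_recurrence_by_composition(a, b, h0=0.0):
    # Same outputs from running prefixes of composed maps. Because `combine`
    # is associative, these prefixes could be computed by a parallel scan.
    acc, out = (1.0, 0.0), []
    for step in zip(a, b):
        acc = combine(acc, step)
        out.append(acc[0] * h0 + acc[1])
    return out


decay = [0.9, 0.5, 0.8]
inputs = [1.0, 2.0, -1.0]
sequential = linear_recurrence(decay, inputs)
composed = linear_recurrence_by_composition(decay, inputs)
print(sequential, composed)
```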
👍3
I love MATLAB/Octave. Its plotting experience is so smooth compared to matplotlib! NumPy and torch copied their array APIs from MATLAB, so there is very little you need to relearn when moving from Python.
🤯1
To train transformers, you need a lot of diverse data. Let's use online RL to generate data!

Check out my new repo, control: Soft Actor Critic to produce experience trajectories

https://github.com/proger/control
🔥2
Bayesian Flow Networks (BFNs) link iterative denoising diffusion and recursive estimation of distribution parameters.

In my new post, I contrast autoregressive generative modeling (prevalent in language) with recursive Bayesian estimation of all parameters jointly.

https://proger.github.io/posts/bfn/normal.html
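
As a small illustration of what "recursive estimation of distribution parameters" means in the simplest case, here is the textbook conjugate update for the mean of a normal with known observation precision. This is a sketch under my own assumptions (the prior, the precision `alpha`, the target value and all names are made up for the example), not the code from the post.

```python
import random


def bayes_update(mu, rho, y, alpha):
    # Prior N(mu, 1/rho), observation y with known noise precision alpha.
    # Precisions add; the mean is a precision-weighted average.
    new_rho = rho + alpha
    new_mu = (rho * mu + alpha * y) / new_rho
    return new_mu, new_rho


rng = random.Random(0)
x_true, alpha = 0.7, 4.0          # hypothetical target and noise precision
mu0, rho0 = 0.0, 1.0              # hypothetical prior
observations = [rng.gauss(x_true, alpha ** -0.5) for _ in range(200)]

# Recursive form: fold in one noisy observation at a time.
mu, rho = mu0, rho0
for y in observations:
    mu, rho = bayes_update(mu, rho, y, alpha)

# Batch form: the same posterior computed in one shot.
batch_rho = rho0 + alpha * len(observations)
batch_mu = (rho0 * mu0 + alpha * sum(observations)) / batch_rho
print(mu, batch_mu)
```

The recursive and batch answers agree, which is the property that makes the update order-independent and easy to run step by step.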
Discrete Bayesian Flow Networks teaser
I stumbled on this paper, Efficient BackProp by LeCun et al., while discussing the differences between internal covariate shift and input whitening.

This work provides a comprehensive overview of the tricks needed to successfully train deep models: why and how to initialize weights, how to choose nonlinearities (to some extent), how to choose and preprocess training data, how to choose learning rates, how the basic optimization dynamics behave, and how to use the Hessian to diagnose them. https://cseweb.ucsd.edu/classes/wi08/cse253/Handouts/lecun-98b.pdf
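
Two of the paper's recommendations fit in a few lines: normalize each input feature to zero mean and unit variance, and draw initial weights with standard deviation fan_in^(-1/2). The sketch below is pure Python with made-up data and my own helper names; it only standardizes features and skips the decorrelation step that full whitening would add.

```python
import math
import random


def standardize(rows):
    # Shift every feature column to zero mean and scale it to unit variance.
    n, d = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(d)]
    stds = [
        math.sqrt(sum((r[j] - means[j]) ** 2 for r in rows) / n) or 1.0
        for j in range(d)
    ]
    return [[(r[j] - means[j]) / stds[j] for j in range(d)] for r in rows]


def lecun_init(fan_in, fan_out, rng):
    # Zero-mean weights with std = fan_in ** -0.5, so that a unit with
    # unit-variance inputs starts with roughly unit-variance pre-activations.
    std = fan_in ** -0.5
    return [[rng.gauss(0.0, std) for _ in range(fan_in)] for _ in range(fan_out)]


rng = random.Random(0)
raw = [[rng.gauss(5.0, 3.0), rng.gauss(-2.0, 0.1)] for _ in range(500)]
inputs = standardize(raw)

fan_in, fan_out = 256, 64
weights = lecun_init(fan_in, fan_out, rng)
flat = [w for row in weights for w in row]
weight_std = math.sqrt(sum(w * w for w in flat) / len(flat))
print(weight_std, fan_in ** -0.5)
```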
The same principle (the sequence length distribution needs to be uniform) also applies to RNNs. I trained a SHA-RNN on byte-level ukpron (a grapheme-to-phoneme task), and making sequence lengths uniform was key to getting the model to work: https://huggingface.co/darkproger/ukpron
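
One possible way to get a uniform length distribution, sketched under my own assumptions and not the recipe actually used for ukpron: group examples by length, pick a length uniformly at random, then pick an example of that length. The toy dataset and the sampler name are hypothetical.

```python
import random
from collections import defaultdict
from itertools import islice


def length_uniform_sampler(examples, rng):
    # Group examples by length, then sample a length first and an example
    # second, so rare lengths are drawn as often as common ones.
    by_length = defaultdict(list)
    for example in examples:
        by_length[len(example)].append(example)
    lengths = sorted(by_length)
    while True:
        yield rng.choice(by_length[rng.choice(lengths)])


# Toy dataset: 90 short sequences and only 10 long ones.
dataset = [b"ab"] * 90 + [b"abcde"] * 10
rng = random.Random(0)
draws = list(islice(length_uniform_sampler(dataset, rng), 2000))
long_fraction = sum(len(x) == 5 for x in draws) / len(draws)
print(long_fraction)
```

With plain uniform sampling over examples the long sequences would show up about 10% of the time; here they appear in roughly half of the draws.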