RWKV scaled to 1T tokens seems to beat Mistral trained on 8T on some multilingual benchmarks
Zero-shot translation to Ukrainian with Eagle is about on par with Mistral in a 2-shot setting and with Llama 2 fine-tuned on 10k examples.
https://twitter.com/RWKV_AI/status/1751797147492888651
X (formerly Twitter)
RWKV (@RWKV_AI) on X
Introducing Eagle-7B
Based on the RWKV-v5 architecture, bringing into opensource space, the strongest
- multi-lingual model
(beating even mistral)
- attention-free transformer today
(10-100x+ lower inference)
With comparable English performance with…
🔥2
RNNs are faster to train, faster in inference and are more data efficient.
👍3
Arpa count tables? RNN weight matrices? Decision trees? Suffix arrays!
https://arxiv.org/abs/2401.17377
🔥1
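A minimal sketch of the suffix array idea, on a toy token list: sort all suffix start positions once, then count any n-gram with two binary searches. The function names are mine and the naive construction is only for illustration; this is not the implementation from the linked paper.

```python
def build_suffix_array(tokens):
    # Naive construction: sort suffix start positions by the suffix they point to.
    # Fine for a toy corpus; real systems use much faster algorithms.
    return sorted(range(len(tokens)), key=lambda i: tokens[i:])


def count_ngram(tokens, suffix_array, ngram):
    # All suffixes that start with `ngram` form one contiguous run in the
    # suffix array, so the count is the distance between two binary searches.
    ngram = list(ngram)
    n = len(ngram)

    def prefix(k):
        start = suffix_array[k]
        return tokens[start:start + n]

    lo, hi = 0, len(suffix_array)
    while lo < hi:  # first suffix whose n-token prefix is >= ngram
        mid = (lo + hi) // 2
        if prefix(mid) < ngram:
            lo = mid + 1
        else:
            hi = mid
    first = lo

    hi = len(suffix_array)
    while lo < hi:  # first suffix whose n-token prefix is > ngram
        mid = (lo + hi) // 2
        if prefix(mid) <= ngram:
            lo = mid + 1
        else:
            hi = mid
    return lo - first


tokens = "the cat sat on the mat the cat".split()
sa = build_suffix_array(tokens)

# Count-based estimate of P(cat | the) = count("the cat") / count("the")
p_cat_given_the = count_ngram(tokens, sa, ["the", "cat"]) / count_ngram(tokens, sa, ["the"])
```

No count table is stored: any n-gram length can be queried from the same sorted index.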
https://twitter.com/DlCountdown/status/1764278990011813975
The NeurIPS conference submission deadline is in late May; workshop deadlines will probably be in August
X (formerly Twitter)
AI Conference DL Countdown (@DlCountdown) on X
The NeurIPS deadline has been announced:
May 22nd, 8PM UTC
https://x.com/mlstreettalk/status/1765701266221522986
This is what you learn as a side note in our Machine Learning course at USI. Glad Yann communicates this message to a large audience. Recurrent neural nets can do anything, but gradient descent won’t find everything.
X (formerly Twitter)
Machine Learning Street Talk (@MLStreetTalk) on X
In 2021 on MLST the legendary @ylecun argued that RNNs were Turing Complete. In 2024, he came to the dark side! What do you think? 👇
Mathematics is the science of transmitting simple ideas about the regularity of the world between people. It is a programming language in which you concisely describe your thought in order to load it into your colleagues' minds with absolute precision.
Yehor made a channel where we learn to improve the skill of precisely communicating a library of mathematical ideas among AI developers.
Join: https://t.me/applied_math_uk
Telegram
Applied Mathematics
About applied mathematics in Ukrainian
Groups:
— https://t.me/speech_recognition_uk
— https://t.me/speech_synthesis_uk
— https://t.me/computer_vision_uk
— https://t.me/ai_work_uk
— https://t.me/nlp_uk
Discord: https://t.me/discord_uds
❤2
First release of Hippogriff: my implementation of the Griffin architecture, a hybrid of a local transformer with sliding-window multi-query attention (like Mistral) and linear recurrence (like Mamba/RWKV).
Inside the package you will also find my handcrafted training loop with diagnostics for activations and weight state.
https://github.com/proger/hippogriff
GitHub
GitHub - proger/hippogriff: Griffin MQA + Hawk Linear RNN Hybrid
Griffin MQA + Hawk Linear RNN Hybrid. Contribute to proger/hippogriff development by creating an account on GitHub.
👍3
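A rough sketch of the two ingredients named in the Hippogriff post: a causal sliding-window attention mask and a diagonal linear recurrence. This is a generic illustration with made-up function names and plain Python lists, not code from the hippogriff repo, and it leaves out the gating that Griffin puts around the recurrence.

```python
def sliding_window_mask(seq_len, window):
    # mask[i][j] is True when query position i may attend to key position j:
    # j must not be in the future and must lie within the last `window` positions.
    return [[0 <= i - j < window for j in range(seq_len)] for i in range(seq_len)]


def linear_recurrence(decays, inputs, h0=0.0):
    # One channel of a diagonal linear RNN: h_t = a_t * h_{t-1} + x_t.
    # There is no nonlinearity between steps, which is what allows the
    # recurrence to be computed with a parallel scan during training.
    h = h0
    states = []
    for a, x in zip(decays, inputs):
        h = a * h + x
        states.append(h)
    return states


mask = sliding_window_mask(seq_len=4, window=2)
states = linear_recurrence(decays=[0.5, 0.5, 0.5], inputs=[1.0, 1.0, 1.0])
```

The mask keeps attention cost bounded by the window size, while the recurrence carries information from beyond the window in a fixed-size state.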
I love MATLAB/Octave. Its plotting experience is so smooth compared to matplotlib! NumPy/torch copied their array APIs from MATLAB, so there is very little you need to remember when moving over from Python.
🤯1
To train transformers, you need a lot of diverse data. Let's use online RL to generate data!
Check out my new repo, control: Soft Actor Critic to produce experience trajectories
https://github.com/proger/control
🔥2
Bayesian Flow Networks (BFNs) link iterative denoising diffusion and recursive estimation of distribution parameters.
In my new post, I contrast autoregressive generative modeling (prevalent in language) with recursive Bayesian estimation of all parameters jointly.
https://proger.github.io/posts/bfn/normal.html
arXiv.org
Bayesian Flow Networks
This paper introduces Bayesian Flow Networks (BFNs), a new class of generative model in which the parameters of a set of independent distributions are modified with Bayesian inference in the light...
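To make "recursive estimation of distribution parameters" concrete, here is a small sketch of the conjugate update for the mean of a normal distribution with known noise precision. The names `mu`, `rho`, and `alpha` (belief mean, belief precision, observation precision) are my choice for this example, and the code only shows the Bayesian update step, not a full BFN.

```python
def bayes_update(mu, rho, y, alpha):
    # Prior belief about the unknown value: N(mu, 1/rho).
    # Noisy observation y of that value, with known precision alpha.
    # Posterior: precisions add, the mean is a precision-weighted average.
    rho_new = rho + alpha
    mu_new = (rho * mu + alpha * y) / rho_new
    return mu_new, rho_new


# Recursive estimation: fold observations in one at a time.
mu, rho = 0.0, 1.0
for y, alpha in [(2.0, 1.0), (4.0, 2.0)]:
    mu, rho = bayes_update(mu, rho, y, alpha)
```

Unlike an autoregressive model, which commits to one variable at a time, this update refines the belief about the whole parameter vector at every step when applied per dimension.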
Excited to see the first book on differentiable programming. It explicitly talks about how to encode regular programs into structures that have gradient flow. https://arxiv.org/abs/2403.14606
arXiv.org
The Elements of Differentiable Programming
Artificial intelligence has recently experienced remarkable advances, fueled by large models, vast datasets, accelerated hardware, and, last but not least, the transformative power of...
😱2
I stumbled on this paper on Efficient BackProp by LeCun et al. when discussing the differences between internal covariate shift and input whitening.
This work provides a comprehensive overview of the tricks needed to successfully train deep models: why and how to initialize weights, how to choose nonlinearities (to some extent), how to choose and preprocess training data, how to choose learning rates, how the basic optimization dynamics behave, and how to use the Hessian to diagnose them. https://cseweb.ucsd.edu/classes/wi08/cse253/Handouts/lecun-98b.pdf
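Two of those tricks fit in a few lines. The sketch below standardizes one input feature to zero mean and unit variance and draws weights with standard deviation 1/sqrt(fan_in). The function names are mine, it uses only the standard library, and it skips the decorrelation part of full input whitening.

```python
import math
import random
import statistics


def standardize(column):
    # Shift one input feature to zero mean and scale it to unit variance.
    mean = statistics.fmean(column)
    std = statistics.pstdev(column)
    return [(v - mean) / std for v in column]


def lecun_init(fan_in, fan_out, rng):
    # Zero-mean weights with std = 1/sqrt(fan_in), so that a unit-variance
    # input produces pre-activations of roughly unit variance.
    std = 1.0 / math.sqrt(fan_in)
    return [[rng.gauss(0.0, std) for _ in range(fan_in)] for _ in range(fan_out)]


feature = standardize([1.0, 2.0, 3.0, 4.0, 5.0])
weights = lecun_init(fan_in=256, fan_out=64, rng=random.Random(0))
```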
Balancing sequence lengths in your dataset is the best augmentation you can do to successfully train a Transformer
https://aclanthology.org/2021.emnlp-main.650/
ACL Anthology
Sequence Length is a Domain: Length-based Overfitting in Transformer Models
Dusan Varis, Ondřej Bojar. Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing. 2021.
The same principle (the sequence length distribution needs to be uniform) actually applies to RNNs too. I trained a SHA-RNN on byte-level ukpron (a grapheme-to-phoneme task), and making sequence lengths uniform was key to getting the model to work: https://huggingface.co/darkproger/ukpron
huggingface.co
darkproger/ukpron · Hugging Face
We’re on a journey to advance and democratize artificial intelligence through open source and open science.
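One simple way to make the length distribution roughly uniform is to bucket examples by length and sample a bucket first, then an example inside it. This is a sketch under my own assumptions (the function name, the `bucket_width` parameter, and sampling with replacement are all illustrative); it is not necessarily the procedure used in the paper or for ukpron.

```python
import random
from collections import defaultdict


def balance_by_length(examples, num_samples, rng, bucket_width=1):
    # Group examples into length buckets.
    buckets = defaultdict(list)
    for example in examples:
        buckets[len(example) // bucket_width].append(example)
    keys = sorted(buckets)
    # Pick a bucket uniformly, then an example uniformly within it.
    # Rare lengths get oversampled and common lengths get undersampled.
    return [rng.choice(buckets[rng.choice(keys)]) for _ in range(num_samples)]


# Toy corpus: 90 short sequences and 10 long ones.
examples = ["ab"] * 90 + ["abcdefgh"] * 10
balanced = balance_by_length(examples, num_samples=2000, rng=random.Random(0))
long_fraction = sum(len(e) == 8 for e in balanced) / len(balanced)
```

In the toy corpus the long sequences make up 10% of the data, while after resampling they make up about half of the batch stream.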