Reinforcement learning followed by mechanistic interpretability on mice modulated by ketamine
Why and how to initialize neural networks? We would like them to undergo "nontrivial" updates when going through stochastic gradient descent — a regime that is also called "feature learning".
This regime requires the norm of the hidden features *and of their updates after a gradient step* to scale like the square root of the feature width.
You can achieve this desideratum by scaling your initialization and learning rate so that the spectral norm of each weight matrix, and of each weight update, scales like sqrt(fan_out / fan_in).
By staying in the feature learning regime, you also get predictable hyperparameter transfer from small-scale development proxy models to large-scale expensive training runs — something you don't get from PyTorch defaults.
https://arxiv.org/abs/2310.17813
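A minimal NumPy sketch of the spectral scaling idea, not the paper's own code. The helper names (`spectral_init`, `spectral_update`, `feature_norm_ratio`) are hypothetical. Rescaling each matrix by its measured spectral norm is an assumption made for clarity; the paper instead derives per-layer initialization and learning rate scalings.

```python
import numpy as np

def spectral_init(fan_out, fan_in, rng):
    """Gaussian matrix rescaled so its spectral norm equals sqrt(fan_out / fan_in)."""
    W = rng.standard_normal((fan_out, fan_in))
    target = np.sqrt(fan_out / fan_in)
    return W * (target / np.linalg.norm(W, ord=2))  # ord=2 is the largest singular value

def spectral_update(grad, fan_out, fan_in, lr):
    """Descent step rescaled so the update has spectral norm lr * sqrt(fan_out / fan_in)."""
    target = lr * np.sqrt(fan_out / fan_in)
    return -grad * (target / np.linalg.norm(grad, ord=2))

def feature_norm_ratio(width, seed=0):
    """||W x|| / sqrt(width) for a square layer and an input with ||x|| close to sqrt(width)."""
    rng = np.random.default_rng(seed)
    W = spectral_init(width, width, rng)
    x = rng.standard_normal(width)
    return np.linalg.norm(W @ x) / np.sqrt(width)

if __name__ == "__main__":
    # The ratio should stay of order one as the width grows.
    for width in (64, 256, 1024):
        print(width, round(float(feature_norm_ratio(width)), 3))
```

If the ratio printed for each width stays roughly constant, the hidden feature norm is growing like the square root of the width, which is the property the post describes.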
Jan Leike wrote his PhD thesis, Nonparametric General Reinforcement Learning, showing that Hutter's incomputable but optimal AGI, AIXI, collapses under degenerate choices of priors and can't be optimally approximated using finite computation. He then shows that AIXI can be approximated epsilon-optimally and proposes an alternative, based on Thompson sampling, that learns asymptotically optimally in stochastic environments.
https://jan.leike.name
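For readers who haven't met Thompson sampling: a minimal sketch in its simplest setting, a Bernoulli bandit. This is an illustration only, not the general reinforcement learning version from the thesis. The function name, the arm probabilities, and the uniform Beta(1, 1) prior are all assumptions.

```python
import numpy as np

def thompson_bernoulli(true_probs, steps, seed=0):
    """Run Thompson sampling on a Bernoulli bandit; return pull counts per arm."""
    rng = np.random.default_rng(seed)
    n_arms = len(true_probs)
    wins = np.ones(n_arms)     # Beta(1, 1) prior: alpha parameters
    losses = np.ones(n_arms)   # Beta(1, 1) prior: beta parameters
    pulls = np.zeros(n_arms, dtype=int)
    for _ in range(steps):
        # Sample one plausible success rate per arm from its posterior,
        # then act greedily with respect to that sample.
        samples = rng.beta(wins, losses)
        arm = int(np.argmax(samples))
        reward = int(rng.random() < true_probs[arm])
        wins[arm] += reward
        losses[arm] += 1 - reward
        pulls[arm] += 1
    return pulls

if __name__ == "__main__":
    print(thompson_bernoulli([0.2, 0.5, 0.8], steps=2000))
```

Exploration comes for free here: arms with uncertain posteriors occasionally produce high samples and get tried, while pulls concentrate on the best arm as evidence accumulates.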
https://sites.google.com/view/ngsmworkshop
Call For Papers: ICML 2024 Workshop on Next Generation of Sequence Modeling Architectures
Submission Deadline: May 31, 2024 (Anywhere on Earth)
Acceptance Notification: June 17, 2024 (Anywhere on Earth)
I am happy to be serving as a reviewer for this workshop. Looking forward to learning new insights into sequence models from you.
New challenge for signal recognition: the bandwidth has increased
https://content.neuralink.com/compression-challenge/README.html
Automatic learning rate transfer across sizes is now easier to use: https://github.com/jxbz/modula/tree/main
Manifest AI is working on linear transformers and context scaling. At the end of this article the authors discuss what becomes possible when you push past the current context size limits: at billions of tokens you won't need finetuning any more, because you'll be able to put your entire dataset into the context window.
Currently open-source LMs are at thousands of tokens and industrial-grade LMs are at millions of tokens, so there's a lot of work left to push this frontier. In transformers we simply concatenate token embeddings to the memory, and we will need some automatic compression to get past this.
https://manifestai.com/articles/compute-optimal-context-size/
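To make the memory argument concrete, here is a rough sketch contrasting a softmax-attention KV cache, which grows with every token, against a linear-attention state, which stays a fixed d x d matrix. This is not Manifest AI's implementation. The function names are hypothetical, and the unnormalized linear attention with an identity feature map is a simplifying assumption.

```python
import numpy as np

def softmax_attention_step(keys, values, k, v, q):
    """Append (k, v) to the cache, then attend over everything seen so far."""
    keys.append(k)
    values.append(v)
    K, V = np.stack(keys), np.stack(values)
    scores = K @ q / np.sqrt(q.shape[0])
    weights = np.exp(scores - scores.max())  # numerically stable softmax
    weights /= weights.sum()
    return weights @ V

def linear_attention_step(state, k, v, q):
    """Fold (k, v) into a fixed-size state; output is sum_i (q . k_i) v_i."""
    state = state + np.outer(k, v)
    return state, q @ state

def memory_footprint(steps, d=8, seed=0):
    """Floats held after `steps` tokens: (KV cache size, linear state size)."""
    rng = np.random.default_rng(seed)
    keys, values = [], []
    state = np.zeros((d, d))
    for _ in range(steps):
        k, v, q = rng.standard_normal((3, d))
        softmax_attention_step(keys, values, k, v, q)
        state, _ = linear_attention_step(state, k, v, q)
    return 2 * len(keys) * d, state.size

if __name__ == "__main__":
    for steps in (100, 1000):
        print(steps, memory_footprint(steps))
```

The cache count grows linearly with the number of tokens while the state size does not change, which is the sense in which the fixed-size state acts as a compression of the context.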
Doing symbolic differentiation with loops is a piece of cake, I don't have to explain what a "backwards pass" is: https://chatgpt.com/share/858b2882-9d29-442e-a0cb-7e3afb24abab
reproducing gpt-2 small is now about 10x cheaper than last year:
https://x.com/karpathy/status/1795484547267834137/photo/2
OMNI reboot as OMNI-EPIC: ask ChatGPT to generate environments for your agent that are progressively harder to solve!
https://x.com/jeffclune/status/1795787632435212732
Tymofiy Mylovanov gives examples of Bayesian inference:
1. give the model a prior through the prompt (ask what "organizational culture principles by Ed Schade" are)
2. give the model a likelihood (drop in a dataset and ask it to "identify cultural misalignments")
3. get the posterior: discuss the result
https://youtu.be/LTpWpadoT_U
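For comparison, the literal version of those three steps on a toy discrete problem. The hypotheses and all numbers are made up for illustration and do not come from the video.

```python
def posterior(prior, likelihood):
    """Bayes' rule on a discrete hypothesis set: posterior is proportional to prior * likelihood."""
    unnormalized = {h: prior[h] * likelihood[h] for h in prior}
    evidence = sum(unnormalized.values())
    return {h: p / evidence for h, p in unnormalized.items()}

if __name__ == "__main__":
    # 1. Prior belief about the team (hypothetical numbers).
    prior = {"aligned": 0.7, "misaligned": 0.3}
    # 2. Probability of the observed dataset under each hypothesis (hypothetical numbers).
    likelihood = {"aligned": 0.2, "misaligned": 0.6}
    # 3. Posterior belief after seeing the data.
    print(posterior(prior, likelihood))
```

With these made-up numbers the data shifts belief toward "misaligned" even though the prior favored "aligned", which is the kind of update the prompt-then-dataset workflow imitates informally.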
A few recipes for cleaning pretraining data are popping up:
- FineWeb-Edu, maximizes scores on downstream educational benchmarks
https://x.com/gui_penedo/status/1797173053123916036
The ablations are 1.82B-parameter training runs on ~30B tokens (almost the same amount of data as uk4b-large, at >2x the model size): ~1 GPU-month per ablation!
> Our ablation models were trained using nanotron. Our "ablation models" have 1.82B parameters (including embeddings), used the Llama architecture with a 2048 sequence length, a global batch size of ~2 million tokens, and the GPT2 tokenizer. For most ablations we trained on ~28B tokens (roughly the Chinchilla optimal training size for this model size). To confirm relative performance improvements after each step of filtering we conducted longer training runs on 350 billion tokens as mentioned further below
And similar work from the FLAN collection author:
https://arxiv.org/abs/2305.13169
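A back-of-the-envelope check of the "~1 GPU-month per ablation" figure above. Everything except the parameter and token counts is an assumption: the common 6 x parameters x tokens estimate for training FLOPs, an A100-class GPU with 312 TFLOP/s peak bf16 throughput, and 40% hardware utilization. The post does not say which hardware was used.

```python
PARAMS = 1.82e9                 # from the post
TOKENS = 30e9                   # from the post
FLOPS_PER_PARAM_TOKEN = 6       # assumption: standard forward + backward estimate
PEAK_FLOPS = 312e12             # assumption: A100 dense bf16 peak, FLOP/s
UTILIZATION = 0.4               # assumption: fraction of peak actually achieved

def gpu_days(params, tokens, peak_flops=PEAK_FLOPS, utilization=UTILIZATION):
    """Estimated single-GPU training time in days."""
    total_flops = FLOPS_PER_PARAM_TOKEN * params * tokens
    seconds = total_flops / (peak_flops * utilization)
    return seconds / 86400

if __name__ == "__main__":
    print(f"{gpu_days(PARAMS, TOKENS):.1f} GPU-days per ablation")
```

Under these assumptions the estimate comes out at about 30 GPU-days, consistent with the figure in the post. Faster hardware or higher utilization would shrink it proportionally.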
Vol Building AGI
https://github.com/shashankvkt/DoRA_ICLR24 Pretraining on object tracking on 10 long form videos beats ImageNet pretraining
https://www.youtube.com/watch?v=i2Mp_Bc14WI
Since FPV pretraining is actually mainstream now (this paper got an ICLR honorable mention), I'm obsessed with looking up animal camera videos on YouTube. Do you have a favorite of your own?