Vol Building AGI
https://github.com/shashankvkt/DoRA_ICLR24 Pretraining on object tracking on 10 long form videos beats ImageNet pretraining
https://www.youtube.com/watch?v=i2Mp_Bc14WI
Since FPV pretraining is actually mainstream (= best ICLR honorable mention paper) I'm now obsessed with looking up animal camera videos on YouTube. Do you have a favorite of your own?
YouTube
Cat POV / Cat with Camera 🔴 / Ros' Unedited Stream #13
Hope you enjoy this collection of new Ros pov videos!
You can leave a tip directly via: https://streamelements.com/rosadventurecat/tip
Thanks so much for 400K subscribers!
Support the channel by becoming a Squad Member!
You'll get: Subscriber badge, exclusive…
baby reproduction of chinchilla scaling laws, i want to reuse this https://x.com/Locchiu/status/1797751414548246624
X (formerly Twitter)
Lechao Xiao (@Locchiu) on X
nanoChinchilla.
Reproducing Chinchilla-Optimal Scaling Phenomenon: Colab, 1 Hour, 100 Lines, + Beautiful Theory https://t.co/Bsd6hWZZVQ
🔥1
OpenAI has released the simplest sparse autoencoder: it uses top-k activations instead of L1 regularization to convert the dense feature superpositions produced by neural networks into sparse, interpretable features.
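A minimal sketch of the top-k idea, in plain Python with toy weights. This is not OpenAI's released code: the names `matvec`, `topk`, `sae_forward`, `W_enc` and `W_dec` are made up for illustration, and biases, normalization and training are left out. The point is only that sparsity comes from keeping the k largest latents rather than from an L1 penalty.

```python
def matvec(W, x):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]

def topk(values, k):
    """Keep the k largest entries, zero out the rest (ties go to the earlier index)."""
    order = sorted(range(len(values)), key=lambda i: values[i], reverse=True)
    keep = set(order[:k])
    return [v if i in keep else 0.0 for i, v in enumerate(values)]

def sae_forward(x, W_enc, W_dec, k):
    """Encode, keep only the top-k latents, then decode."""
    latents = topk(matvec(W_enc, x), k)
    recon = matvec(W_dec, latents)
    return latents, recon

# Toy, hand-picked weights (assumption): 2 inputs, 4 latents.
W_enc = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [1.0, -1.0]]
W_dec = [[1.0, 0.0, 0.0, 0.0], [0.0, 1.0, 0.0, 0.0]]

latents, recon = sae_forward([3.0, 1.0], W_enc, W_dec, k=2)
print(latents)  # exactly k entries are non-zero
print(recon)
```

The number of active latents is fixed by k directly, so there is no sparsity coefficient to tune.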
Fun fact: L1 regularization encourages sparsity for the same reason that the sample median is the minimizer of mean absolute error
https://x.com/norabelrose/status/1798766340427403472?
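One way to read the fun fact, as a small sketch: absolute error is minimized exactly at a data point (the median), while squared error is minimized at the mean. Likewise, an L1 penalty on a single latent has a closed-form minimizer (soft thresholding) that lands exactly on zero for small inputs. The data and helper names below are made up for illustration.

```python
import statistics

def mean_abs_error(c, xs):
    return sum(abs(x - c) for x in xs) / len(xs)

def mean_sq_error(c, xs):
    return sum((x - c) ** 2 for x in xs) / len(xs)

def soft_threshold(a, lam):
    """Minimizer over z of 0.5 * (z - a)**2 + lam * abs(z)."""
    if a > lam:
        return a - lam
    if a < -lam:
        return a + lam
    return 0.0  # small inputs are snapped to exactly zero

xs = [0.0, 1.0, 2.0, 3.0, 100.0]               # toy data with one outlier
candidates = [i / 10 for i in range(1001)]      # grid search over 0.0 .. 100.0

best_l1 = min(candidates, key=lambda c: mean_abs_error(c, xs))
best_l2 = min(candidates, key=lambda c: mean_sq_error(c, xs))

print(best_l1, statistics.median(xs))  # L1 minimizer sits on the median
print(best_l2, statistics.mean(xs))    # L2 minimizer sits on the mean
print([soft_threshold(a, 0.5) for a in [-2.0, -0.3, 0.0, 0.4, 1.5]])
```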
X (formerly Twitter)
Nora Belrose (@norabelrose) on X
This is so hilariously simple, I'm switching my SAE code to this approach immediately
https://t.co/j0T0D1Arj8
🤯1
While I am trying to use more matmuls, people are getting rid of them
https://arxiv.org/abs/2406.02528
arXiv.org
Scalable MatMul-free Language Modeling
Large Language Models (LLMs) have fundamentally altered how we approach scaling in machine learning. However, these models pose substantial computational and memory challenges, primarily due to...
😢2
Good news for pretraining: using a cosine schedule is unnecessary.
Relying on cosine schedules has been popularized by Chinchilla and nanoGPT, and I'm happy to see multiple confirmations that it's not necessary: you can just train with a constant LR for most of the run and use the last 20% to cool down linearly. You can do even better if you use the Noam cooldown (1-sqrt(t)). As a bonus, constant + linear is more forgiving if you haven't searched for a good learning rate.
https://arxiv.org/abs/2405.18392
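A rough sketch of such a schedule as a plain function of the step. The name `lr_at`, its parameters and the warmup handling are assumptions for illustration, not code from the paper; `t` is the fraction of the cooldown completed, so the two shapes are `1 - t` and `1 - sqrt(t)`.

```python
import math

def lr_at(step, total_steps, peak_lr, warmup_steps=0,
          cooldown_frac=0.2, shape="linear"):
    """Constant LR with optional linear warmup and a final cooldown.

    shape="linear" decays as 1 - t, shape="sqrt" decays as 1 - sqrt(t),
    where t goes from 0 to 1 over the cooldown phase.
    """
    cooldown_steps = int(total_steps * cooldown_frac)
    cooldown_start = total_steps - cooldown_steps
    if step < warmup_steps:
        return peak_lr * (step + 1) / warmup_steps
    if step < cooldown_start or cooldown_steps == 0:
        return peak_lr
    t = min((step - cooldown_start) / cooldown_steps, 1.0)
    if shape == "linear":
        return peak_lr * (1.0 - t)
    if shape == "sqrt":
        return peak_lr * (1.0 - math.sqrt(t))
    raise ValueError(f"unknown cooldown shape: {shape}")

# Print a few points of both cooldown shapes for a 1000-step run.
for s in (0, 500, 800, 900, 1000):
    print(s, lr_at(s, 1000, 1e-3), lr_at(s, 1000, 1e-3, shape="sqrt"))
```

Because the LR is flat until the cooldown starts, the same run can be branched into cooldowns at several lengths without restarting from scratch.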
Fun fact: uk4b large was trained with a constant learning rate without full decay
💅2
Remember Monarch Mixer? Dense linear layers soon to be replaced with more structured counterparts
https://x.com/andrewgwils/status/1800532157406011523?
X (formerly Twitter)
Andrew Gordon Wilson (@andrewgwils) on X
A lot of the computation in pre-training transformers is now spent in the dense linear (MLP) layers. In our new ICML paper, we propose matrix structures with better scaling laws!
https://t.co/6n1dbXiWap
w/@ShikaiQiu, Andres P, @m_finzi, @micahgoldblum
1/8
🔥1
I construct a linear RNN that receives ones as inputs and produces increasingly better approximations to pi as outputs.
The width of the network grows logarithmically wrt the maximum sequence length.
https://gist.github.com/proger/ba147e3953a155d833aae084c1f0cd12
Gist
pi.py
GitHub Gist: instantly share code, notes, and snippets.
God gave us natural numbers. Humans invented real numbers for oppression (at ETH by the way)
😇1
Ready for flow matching? https://bm371613.github.io/conditional-flow-matching/
👀1
Very cool speech recognition work: FastConformer + SC-CTC ablations wrt sequence length. 6 layer models can properly use 21.8 minutes of context.
https://x.com/RobFlynnHere/status/1803004030576152944
Noam Shazeer has shared techniques used to scale Transformer inference to 1/5 of Google search traffic
https://research.character.ai/optimizing-inference/
Character.AI Blog
Character.AI empowers people to connect, learn, and tell stories through interactive entertainment.
Will Merrill's work helped me gain intuition about which classes of tasks are solvable by which kinds of sequence modeling architectures, notably that constant-depth Transformers are not universal (i.e. not even all of P). The ability to execute any algorithm in P can be achieved by applying chains of thought, scratchpads, pause tokens or infinite depth (see "Universal Transformer"). Chain of thought can be seen as a way to gain infinite depth from a finite network. Check out his recent video:
https://www.youtube.com/watch?v=30MhUdapqc8
YouTube
Will Merrill: The Expressive Power of Transformers with Chain of Thought
Talk given by Will Merrill to the Formal Languages and Neural Networks discord on June 10, 2024. Thank you, Will!
Please find the link to their paper here:
https://arxiv.org/abs/2310.07923
👍1
How to do vector calculus with ChatGPT: ask it to put indices on vectors explicitly, then there is no confusion about what dimensions need to be used when outer products arise.
Example: https://chatgpt.com/share/588e5ca1-b286-4f74-b2ef-bdbb402e3c83
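To illustrate why explicit indices help (this is a made-up example, not the one from the linked chat): take y_j = sum_k W_jk x_k and L = sum_j c_j y_j. Written with indices, dL/dW_jk = c_j x_k, so it is immediately clear that the gradient is an outer product with the same shape as W. The sketch below checks this against finite differences in plain Python; all names are hypothetical.

```python
def matvec(W, x):
    # y_j = sum_k W_jk * x_k
    return [sum(W[j][k] * x[k] for k in range(len(x))) for j in range(len(W))]

def loss(W, x, c):
    # L = sum_j c_j * y_j
    y = matvec(W, x)
    return sum(c[j] * y[j] for j in range(len(y)))

def grad_W(x, c):
    # dL/dW_jk = c_j * x_k: rows indexed by j, columns by k, same shape as W
    return [[c[j] * x[k] for k in range(len(x))] for j in range(len(c))]

def numeric_grad_W(W, x, c, eps=1e-6):
    # Central finite differences, one entry W_jk at a time
    grad = [[0.0] * len(W[0]) for _ in W]
    for j in range(len(W)):
        for k in range(len(W[0])):
            W[j][k] += eps
            up = loss(W, x, c)
            W[j][k] -= 2 * eps
            down = loss(W, x, c)
            W[j][k] += eps  # restore the original value
            grad[j][k] = (up - down) / (2 * eps)
    return grad

W = [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]]  # shape (2, 3)
x = [0.5, -1.0, 2.0]                    # shape (3,)
c = [1.0, -2.0]                         # shape (2,)

print(grad_W(x, c))
print(numeric_grad_W(W, x, c))
```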
Chatgpt
A conversational AI system that listens, learns, and challenges