The definition of stabilize:
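import torch.nn.functional as F

def stabilize(f_, i_):
    "stabilize and activate forget and input gates"
    m = max_scan(f_, i_)
    # shift m by one step along dim 1, zero-filling the first position
    m_prev = F.pad(m[:, :-1, :], (0, 0, 1, 0))
    i = (i_ - m).exp()
    f = (f_ + m_prev - m).sigmoid()
    return f, i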
max_scan is based on a Blelloch scan with a (max, +) semiring
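The post does not show max_scan itself, so here is only a rough sketch of why a parallel scan applies. It assumes max_scan computes the running stabilizer m_t = max(f_t + m_{t-1}, i_t) over plain per-step scalars. Each step is then the map x -> max(a + x, b), and composing two such maps gives another map of the same form, so the combine operator is associative. The names combine, max_scan_reference, max_scan_via_combine and the initial state m0 are illustrative, and the fold below runs sequentially through itertools.accumulate rather than as an actual Blelloch scan.

```python
from itertools import accumulate

NEG_INF = float("-inf")

def combine(left, right):
    """Compose two maps x -> max(a + x, b); `left` is applied first."""
    a1, b1 = left
    a2, b2 = right
    return (a1 + a2, max(a2 + b1, b2))

def max_scan_reference(f, i, m0=NEG_INF):
    """Plain sequential loop for m_t = max(f_t + m_{t-1}, i_t)."""
    out, m = [], m0
    for f_t, i_t in zip(f, i):
        m = max(f_t + m, i_t)
        out.append(m)
    return out

def max_scan_via_combine(f, i, m0=NEG_INF):
    """Same values from prefix compositions of the (f_t, i_t) pairs."""
    prefixes = accumulate(zip(f, i), combine)
    # Apply each composed map to the assumed initial state m0.
    return [max(a + m0, b) for a, b in prefixes]

# Made-up gate pre-activations for four time steps.
f = [-0.5, -1.0, -0.1, -2.0]
i = [0.3, 2.0, -1.0, 0.5]
assert max_scan_reference(f, i) == max_scan_via_combine(f, i)
```

Because combine is associative, the prefixes can be evaluated in any bracketing, which is what lets a tree-structured scan replace the sequential loop.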
https://x.com/haoailab/status/1788269848788869299?
Diffusion consistency has been applied together with Jacobi decoding to get a 3x speedup over autoregressive loops. Consistency finetuning can be applied to existing autoregressive LMs to get efficient inference.
This might be more important than maintaining constant memory during autoregressive looping (working on RNNs).
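For readers who have not seen Jacobi decoding, here is a minimal sketch of the fixed-point iteration alone, without the consistency finetuning. The function next_token is a made-up deterministic stand-in for greedy decoding from an LM, and all names here are illustrative. The idea is to guess all n tokens, recompute every position from the previous guess in parallel, and stop when nothing changes. The fixed point is exactly the greedy autoregressive output.

```python
def next_token(prefix):
    """Toy deterministic stand-in for greedy argmax over an LM's logits."""
    return (sum(prefix) * 31 + len(prefix)) % 7

def autoregressive_decode(prompt, n):
    """Baseline: one token per step, n sequential steps."""
    seq = list(prompt)
    for _ in range(n):
        seq.append(next_token(seq))
    return seq[len(prompt):]

def jacobi_decode(prompt, n, init=0):
    """Refine all n positions at once until the guess stops changing."""
    guess = [init] * n
    for step in range(1, n + 1):
        # Each position depends only on the previous iterate, so a real
        # model could compute all n positions in one batched forward pass.
        new = [next_token(list(prompt) + guess[:k]) for k in range(n)]
        if new == guess:
            return guess, step
        guess = new
    return guess, n

tokens, steps = jacobi_decode([1, 2, 3], 8)
assert tokens == autoregressive_decode([1, 2, 3], 8)
```

After k iterations the first k tokens are guaranteed correct, so plain Jacobi needs at most n iterations. The speedup comes from cases where several tokens become correct in one iteration, which is what the consistency finetuning is meant to encourage.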
X (formerly Twitter)
Hao AI Lab (@haoailab) on X
People often see LLMs as sequential decoders, but we show they can be easily adapted as fast parallel decoders!
Announcing consistency LLMs: teaching LLMs to predict the fixed point from any point on its Jacobi decoding trajectory
- LLM can fast forward…
https://github.com/shashankvkt/DoRA_ICLR24
Pretraining on object tracking on 10 long form videos beats ImageNet pretraining
GitHub
GitHub - shashankvkt/DoRA_ICLR24: This repo contains the official implementation of the ICLR 2024 paper "Is ImageNet worth 1 video? Learning strong image encoders from 1 long unlabelled video"
An insight from Shida Wang: nonlinearity of the recurrence impacts universal approximation ability but not memory capacity. Exponential parametrization of the recurrence operator (as in xLSTM) improves the optimization landscape.
https://github.com/radarFudan/Curse-of-memory
GitHub
GitHub - radarFudan/Curse-of-memory: Curse-of-memory phenomenon of RNNs in sequence modelling
Kyunghyun Cho is building an AI drug design system. He is also known for GRU and contributions to machine translation.
He is currently working on protein sequence design. The loop is: generative modeling over a database of sequences, property classification of the generated samples, and black-box optimization to find sample sets on the Pareto frontiers of multiple objectives induced by the property classifiers (translation people: think of Minimum Bayes Risk), which are then sent to the lab for validation. This loop describes the second and third pipeline steps on the photo. Eventually he wants to *backprop* through the whole loop.
Currently the forward pass takes more than 100 years, from discovering the role of the pancreas in diabetes to the approval of semaglutide.
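The Pareto-frontier step in the loop above can be illustrated with a tiny sketch. It assumes every candidate already has one score per objective and that higher is better. The two objectives (labelled binding and stability) and all the numbers are hypothetical, and this brute-force filter is only a stand-in for whatever black-box optimizer is actually used.

```python
def dominates(a, b):
    """True if a is at least as good as b everywhere and strictly better once."""
    return all(x >= y for x, y in zip(a, b)) and any(x > y for x, y in zip(a, b))

def pareto_front(scores):
    """Indices of candidates that no other candidate dominates (maximization)."""
    return [
        i for i, s in enumerate(scores)
        if not any(dominates(t, s) for j, t in enumerate(scores) if j != i)
    ]

# Hypothetical classifier scores (binding, stability) for five candidates.
scores = [(0.9, 0.2), (0.6, 0.6), (0.5, 0.5), (0.2, 0.9), (0.6, 0.4)]
print(pareto_front(scores))  # [0, 1, 3]
```

Candidates 2 and 4 are dropped because candidate 1 is at least as good on both objectives. The remaining three are the trade-offs that would go to the lab.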
Research hint from Yann LeCun: figure out where transformer loss spikes come from. They don't usually happen in convnets. My thought: unlike Transformers, convnets do not have input-dependent weights.
Also work on Q* with hierarchical time
My favorite AI paper from ICLR is OMNI by Jenny Zhang et al., which was also my favorite NeurIPS workshop paper.
Jenny builds on an idea that Juergen calls PowerPlay: you give your AI agent tasks that are
1. Learnable, by measuring learning progress as a fraction of successes. This can also be used to track forgetting.
2. Interesting, by a human notion of interestingness encoded in GPT-4
How to do (1) was already known.
(2) has previously been done with measures like novelty or artificial curiosity. Novel stuff is not always interesting! This is where you need GPT to get true general agents: GPT has learned interestingness from Reddit experts and all scientific papers.
https://www.jennyzhangzt.com/omni/
Reinforcement learning followed by mechanistic interpretability on mice modulated by ketamine
Why and how to initialize neural networks? We would like them to undergo "nontrivial" updates when going through stochastic gradient descent, a regime that is also called "feature learning".
This regime requires the norm of the hidden features *and of their updates under a gradient step* to scale like the square root of the layer width.
You can achieve these desiderata by scaling your initialization and learning rate so that the spectral norm of each weight matrix, and of its gradient update, scales like sqrt(fan_out / fan_in).
By staying in the feature learning regime, you also get predictable hyperparameter transfer from small-scale development proxy models to large-scale expensive training runs, something you don't get from PyTorch defaults.
https://arxiv.org/abs/2310.17813
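A rough sketch of what this scaling looks like in practice. It assumes the per-layer rule for plain SGD with Gaussian init that I understand the linked paper to give, up to constants: init std proportional to (1/sqrt(fan_in)) * min(1, sqrt(fan_out/fan_in)) and learning rate proportional to fan_out/fan_in. The function names, the base constants and the layer widths are illustrative, so check them against the paper before relying on them.

```python
import math

def spectral_init_std(fan_in, fan_out, base=1.0):
    """Init std ~ (1/sqrt(fan_in)) * min(1, sqrt(fan_out/fan_in)), up to a constant."""
    return base / math.sqrt(fan_in) * min(1.0, math.sqrt(fan_out / fan_in))

def spectral_lr(fan_in, fan_out, base_lr=0.1):
    """Per-layer SGD learning rate ~ fan_out / fan_in, up to a constant."""
    return base_lr * fan_out / fan_in

# Hypothetical MLP widths: input 784, two hidden layers, 10 outputs.
widths = [784, 256, 1024, 10]
for fan_in, fan_out in zip(widths[:-1], widths[1:]):
    std = spectral_init_std(fan_in, fan_out)
    lr = spectral_lr(fan_in, fan_out)
    print(f"{fan_in:>5} -> {fan_out:<5} std={std:.4f} lr={lr:.4f}")
```

The intuition for the init rule: a Gaussian fan_out x fan_in matrix with entry std s has spectral norm of roughly s * (sqrt(fan_in) + sqrt(fan_out)), and setting that to sqrt(fan_out / fan_in) gives the expression above up to a constant factor.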
Jan Leike authored his PhD thesis on Nonparametric General Reinforcement Learning showing that Hutter's incomputable but optimal AGI, AIXI, collapses under degenerate choices of priors and can't be optimally approximated using finite computation. He then shows that AIXI can be approximated epsilon-optimally and provides alternatives to asymptotically optimal learning in stochastic environments based on Thompson sampling.
https://jan.leike.name
https://sites.google.com/view/ngsmworkshop
Call For Papers: ICML 2024 Workshop on Next Generation of Sequence Modeling Architectures
Submission Deadline: May 31, 2024 (Anywhere on Earth)
Acceptance Notification: June 17, 2024 (Anywhere on Earth)
I am happy to be serving as a reviewer for this workshop. Looking forward to learning new insights into sequence models from you.
New challenge for signal recognition: the bandwidth has increased
https://content.neuralink.com/compression-challenge/README.html
Automatic learning rate transfer across sizes is now easier to use: https://github.com/jxbz/modula/tree/main
GitHub
GitHub - jxbz/modula: Scalable neural net training via automatic normalization in the modular norm.
Manifest AI is working on linear transformers and context scaling. At the end of this article, the authors discuss what becomes possible when you push the current context size limits: at billions of tokens you won't need finetuning any more, because you'll be able to push your entire dataset into the context window.
Currently, open-source LMs handle thousands of tokens and industrial-grade LMs handle millions, so there's a lot of work left to push this frontier. In transformers we simply concatenate token embeddings to the memory, and we will need some automatic compression to get past this.
https://manifestai.com/articles/compute-optimal-context-size/
Manifestai
Manifest AI - Compute-Optimal Context Size
Doing symbolic differentiation with loops is a piece of cake, I don't have to explain what a "backwards pass" is: https://chatgpt.com/share/858b2882-9d29-442e-a0cb-7e3afb24abab