Vol Building AGI
Past topics: speech synthesis, transformers, LSTM, recurrence
The definition of stabilize:


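import torch.nn.functional as F

def stabilize(f_, i_):
    "stabilize and activate forget and input gates"
    # max_scan is defined elsewhere; it returns the running stabilizer m (3-D, time on axis 1)
    m = max_scan(f_, i_)
    # shift m one step along the time axis: m_prev[:, t] = m[:, t-1], zero at t = 0
    m_prev = F.pad(m[:, :-1, :], (0, 0, 1, 0))

    i = (i_ - m).exp()
    f = (f_ + m_prev - m).sigmoid()
    return f, i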


max_scan is based on a Blelloch scan with a (max, +) semiring
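
A minimal sketch of what such a scan could look like, assuming the stabilizer follows the recurrence m_t = max(f_t + m_{t-1}, i_t) with m_{-1} = -inf (an assumption, the post does not spell this out). Each step is the (max, +) affine map m -> max(m + a, b), and composing two such maps gives another map of the same form, which is what makes a parallel scan possible. For brevity the parallel version uses a Hillis-Steele doubling scan on plain Python lists rather than Blelloch's up-sweep/down-sweep; `combine`, `max_scan_sequential` and `max_scan_parallel` are illustrative names, not the author's code.

```python
import math

def combine(left, right):
    """Compose two (max, +) affine maps m -> max(m + a, b); `left` is applied first."""
    a1, b1 = left
    a2, b2 = right
    return (a1 + a2, max(b1 + a2, b2))

def max_scan_sequential(f, i):
    """Reference loop: m_t = max(f_t + m_{t-1}, i_t), starting from m_{-1} = -inf."""
    m, out = -math.inf, []
    for f_t, i_t in zip(f, i):
        m = max(f_t + m, i_t)
        out.append(m)
    return out

def max_scan_parallel(f, i):
    """Inclusive scan with `combine`, done in log2(T) doubling rounds."""
    elems = list(zip(f, i))
    step = 1
    while step < len(elems):
        # each position t >= step absorbs the partial result `step` places to its left
        elems = [
            combine(elems[t - step], elems[t]) if t >= step else elems[t]
            for t in range(len(elems))
        ]
        step *= 2
    # with m_{-1} = -inf, the stabilizer is the b component of each prefix map
    return [b for _, b in elems]
```

Because `combine` is associative, the rounds inside `max_scan_parallel` can run in parallel over time steps; a tensor version would apply the same operator elementwise over batch and feature dimensions.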
https://x.com/haoailab/status/1788269848788869299?

Consistency training, borrowed from diffusion models, has been combined with Jacobi decoding to get a 3x speedup over autoregressive loops. Consistency fine-tuning can be applied to existing autoregressive LMs to make inference more efficient.

This might be more important than maintaining constant memory during autoregressive looping (working on RNNs).
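
For readers who have not seen Jacobi decoding, here is a toy sketch of the fixed-point iteration it relies on. `next_token` is a hypothetical stand-in for the greedy argmax of a language model (any deterministic function of the prefix works for the illustration); it is not the method from the linked thread. All positions are refreshed from the previous guess in one sweep, and the iteration stops when nothing changes.

```python
def next_token(prefix, vocab_size=7):
    """Hypothetical stand-in for a model's greedy next-token choice."""
    return (31 * sum(prefix) + 5 * len(prefix) + 3) % vocab_size

def greedy_decode(prompt, n):
    """Ordinary autoregressive loop: n sequential model calls."""
    out = []
    for _ in range(n):
        out.append(next_token(prompt + out))
    return out

def jacobi_decode(prompt, n):
    """Refine a guess for all n tokens at once until it stops changing."""
    guess = [0] * n
    sweeps = 0
    while True:
        sweeps += 1
        # every position is recomputed from the previous guess;
        # with a real model this would be a single batched forward pass
        new = [next_token(prompt + guess[:j]) for j in range(n)]
        if new == guess:
            return guess, sweeps
        guess = new
```

The fixed point is exactly the greedy output, and it is reached after at most n sweeps plus one to detect convergence. This toy function gives no speedup by itself; the gain reported above comes from fine-tuning the model so that the guess converges in far fewer sweeps.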
An insight from Shida Wang: nonlinearity of the recurrence impacts universal approximation ability but not memory capacity. Exponential parametrization of the recurrence operator (as in xLSTM) improves the optimization landscape.


https://github.com/radarFudan/Curse-of-memory
New generative model in town: learning to do nothing

https://assafshocher.github.io/IGN/
Kyunghyun Cho is building an AI drug design system. He is also known for the GRU and contributions to machine translation.

He is currently working on protein sequence design in three parts: generative modeling over a database of sequences, property classification of the generated samples, and black-box optimization to find sample sets on the Pareto frontiers of the multiple objectives induced by the property classifiers (translation people: think of Minimum Bayes Risk). Those sets are sent to the lab for validation. This loop covers the second and third pipeline steps in the photo. Eventually he wants to *backprop* through the whole loop.

Currently the forward pass takes more than 100 years, from discovering the role of the pancreas in diabetes to the approval of semaglutide.
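
The Pareto-frontier selection step mentioned above can be sketched in a few lines. This is only an illustration: the candidate names and scores are made up, every objective is assumed to be "higher is better", and the real system presumably uses a proper black-box optimizer rather than brute-force filtering.

```python
def dominates(a, b):
    """a dominates b if it is at least as good everywhere and strictly better somewhere."""
    return all(x >= y for x, y in zip(a, b)) and any(x > y for x, y in zip(a, b))

def pareto_front(candidates):
    """candidates: dict of name -> tuple of objective scores. Returns non-dominated names."""
    return sorted(
        name
        for name, scores in candidates.items()
        if not any(dominates(other, scores) for other in candidates.values())
    )

# hypothetical classifier scores: (binding, stability)
scored = {
    "seq_a": (0.9, 0.2),
    "seq_b": (0.5, 0.5),
    "seq_c": (0.4, 0.4),  # dominated by seq_b
    "seq_d": (0.2, 0.9),
    "seq_e": (0.9, 0.1),  # dominated by seq_a
}
front = pareto_front(scored)
```

Here `front` keeps only the candidates for which no other candidate is better on every objective, which is the set you would consider sending to the lab.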
Research hint from Yann LeCun: figure out where transformer loss spikes come from. They don't usually happen in convnets. My thought: unlike transformers, convnets do not have input-dependent weights.


Also work on Q* with hierarchical time
My favorite AI paper from ICLR is OMNI by Jenny Zhang et al., which was also my favorite NeurIPS workshop paper.

Jenny builds on an idea that Juergen calls PowerPlay: you give your AI agent tasks that are
1. Learnable, by measuring learning progress as a fraction of successes. This can also be used to track forgetting.
2. Interesting, by a human notion of interestingness encoded in GPT-4.

We already knew how to do (1).
(2) has been done before with measures like novelty or artificial curiosity. But novel stuff is not always interesting! That is where you need GPT to get truly general agents: GPT has learned interestingness from Reddit experts and all scientific papers.
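
A rough sketch of point (1), assuming learning progress is taken as the gap between a fast and a slow moving average of the success rate. The class name, the smoothing constants and the exact formula are illustrative assumptions, not taken from the OMNI paper.

```python
class TaskTracker:
    """Tracks success rates for one task with two exponential moving averages."""

    def __init__(self, fast=0.5, slow=0.1):
        self.fast, self.slow = fast, slow
        self.fast_rate = 0.0  # reacts quickly to recent outcomes
        self.slow_rate = 0.0  # remembers the longer-term success rate

    def record(self, success):
        s = 1.0 if success else 0.0
        self.fast_rate += self.fast * (s - self.fast_rate)
        self.slow_rate += self.slow * (s - self.slow_rate)

    def learning_progress(self):
        """Positive: the agent is improving. Negative: it is forgetting."""
        return self.fast_rate - self.slow_rate

def run(outcomes):
    tracker = TaskTracker()
    for ok in outcomes:
        tracker.record(ok)
    return tracker.learning_progress()

improving = run([False] * 10 + [True] * 10)
forgetting = run([True] * 10 + [False] * 10)
stuck = run([False] * 20)
```

A curriculum could then prefer tasks with a large positive value, revisit tasks with a negative value, and skip tasks that stay at zero.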

https://www.jennyzhangzt.com/omni/
πŸ‘1🀯1
Reinforcement learning followed by mechanistic interpretability on mice modulated by ketamine
Why and how to initialize neural networks? We would like them to undergo "nontrivial" updates when going through stochastic gradient descent, a regime that is also called "feature learning".

This regime asks for the norm of the hidden features *and of their updates* to scale like the square root of the width of the feature.

You can achieve these desiderata by scaling your initialization and learning rate to preserve certain properties of the spectral norm of your weights and their updates.

By staying in the feature learning regime, you also get predictable hyperparameter transfer from small-scale development proxy models to large-scale expensive training runs, something you don't get from PyTorch defaults.
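
A minimal sketch of the per-layer scaling, assuming the rule from the linked paper is an init standard deviation proportional to min(1, sqrt(fan_out / fan_in)) / sqrt(fan_in) and a learning rate proportional to fan_out / fan_in, so that the spectral norm of the weights and of their updates stays near sqrt(fan_out / fan_in). Constant factors are omitted and `spectral_scaling` is an illustrative helper, not a library function.

```python
import math

def spectral_scaling(fan_in, fan_out):
    """Return (init_std, lr_multiplier) for one dense layer, up to constant factors."""
    init_std = min(1.0, math.sqrt(fan_out / fan_in)) / math.sqrt(fan_in)
    lr_multiplier = fan_out / fan_in
    return init_std, lr_multiplier

# the same rule applied to a small proxy model and a wide target model
for width in (256, 4096):
    hidden = spectral_scaling(width, width)   # square hidden layer
    readout = spectral_scaling(width, 10)     # wide-to-narrow output layer
    print(width, hidden, readout)
```

For a square hidden layer this reduces to the familiar 1/sqrt(width) init with a width-independent learning rate, while the readout layer gets both a smaller init and a learning rate that shrinks as the width grows.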


https://arxiv.org/abs/2310.17813
Jan Leike authored his PhD thesis on Nonparametric General Reinforcement Learning showing that Hutter's incomputable but optimal AGI, AIXI, collapses under degenerate choices of priors and can't be optimally approximated using finite computation. He then shows that AIXI can be approximated epsilon-optimally and provides alternatives to asymptotically optimal learning in stochastic environments based on Thompson sampling.

https://jan.leike.name
I'm finally citing Schlesinger!
πŸ‘2🀯1πŸŽ‰1
https://sites.google.com/view/ngsmworkshop

Call For Papers: ICML 2024 Workshop on Next Generation of Sequence Modeling Architectures

Submission Deadline: May 31, 2024 (Anywhere on Earth)
Acceptance Notification: June 17, 2024 (Anywhere on Earth)

I am happy to be serving as a reviewer for this workshop. Looking forward to learning new insights into sequence models from you.
My best science joke so far
tired: github streaks
wired: wandb streaks
New challenge for signal recognition: the bandwidth has increased

https://content.neuralink.com/compression-challenge/README.html
Manifest AI is working on linear transformers and context scaling. At the end of this article the authors discuss what becomes possible when you push the current context size limits: at billions of tokens you won't need finetuning any more, because you'll be able to put your entire dataset into the context window.

Currently open source LMs are at thousands of tokens and industrial grade LMs are at millions of tokens, so there's a lot of work left to push this frontier. In transformers we are simply concatenating token embeddings to the memory, and we will need some automatic compression to get past this.
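
To make the memory argument concrete, here is a toy comparison between a growing key/value cache and a fixed-size linear-attention state. It is a sketch under simplifying assumptions (unnormalized linear attention, no feature map, plain Python lists) and is not Manifest AI's implementation; `kv_cache_size` and `LinearAttentionState` are illustrative names.

```python
def kv_cache_size(num_tokens, dim):
    """Softmax attention keeps one key and one value vector per token."""
    return 2 * num_tokens * dim

class LinearAttentionState:
    """Compresses all (key, value) pairs into one dim x dim matrix S = sum_t k_t v_t^T."""

    def __init__(self, dim):
        self.dim = dim
        self.S = [[0.0] * dim for _ in range(dim)]

    def write(self, k, v):
        for a in range(self.dim):
            for b in range(self.dim):
                self.S[a][b] += k[a] * v[b]

    def read(self, q):
        """Returns q^T S, a weighted mix of the stored values."""
        return [sum(q[a] * self.S[a][b] for a in range(self.dim)) for b in range(self.dim)]

    def size(self):
        return self.dim * self.dim

state = LinearAttentionState(dim=2)
state.write([1.0, 0.0], [3.0, 4.0])
state.write([0.0, 1.0], [5.0, 6.0])
recalled = state.read([1.0, 0.0])
```

The cache grows linearly with the number of tokens, while the state stays at dim x dim no matter how many tokens are written. The price is lossy recall once keys stop being orthogonal, which is the kind of compression trade-off the paragraph above points at.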

https://manifestai.com/articles/compute-optimal-context-size/
4o has become so good at math that it can analyze recurrences for me (and tolerate my typos)
Doing symbolic differentiation with loops is a piece of cake, I don't have to explain what a "backwards pass" is: https://chatgpt.com/share/858b2882-9d29-442e-a0cb-7e3afb24abab
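
As an example of the kind of "differentiation with loops" meant here, this sketch writes the backward pass by hand for a scalar linear recurrence h_t = a * h_{t-1} + x_t with loss L = h_T. The recurrence and the function names are chosen for illustration and are not taken from the linked chat.

```python
def forward(a, xs):
    """Run the recurrence from h_0 = 0 and keep every hidden state."""
    hs = [0.0]
    for x in xs:
        hs.append(a * hs[-1] + x)
    return hs

def grad_a(a, xs):
    """dL/da for L = h_T, accumulated by looping backwards over time."""
    hs = forward(a, xs)
    g = 1.0       # dL/dh_T
    grad = 0.0
    for t in range(len(xs), 0, -1):
        grad += g * hs[t - 1]  # local term: d h_t / d a = h_{t-1}
        g *= a                 # propagate: dL/dh_{t-1} = a * dL/dh_t
    return grad

a, xs = 0.9, [1.0, 2.0, 3.0]
analytic = grad_a(a, xs)
eps = 1e-6
numeric = (forward(a + eps, xs)[-1] - forward(a - eps, xs)[-1]) / (2 * eps)
```

Here h_3 = a^2 * x_1 + a * x_2 + x_3, so the derivative is 2 * a * x_1 + x_2, and the backward loop and the finite-difference check should agree on it.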