Vol Building AGI
Past topics: speech synthesis, transformers, LSTM, recurrence
GH200 consumes 900 W per GPU
😱1
RNNs = orders of magnitude compute and memory savings. Latest results from my thesis.
💅1
me: happy that the code works
the code:
OpenAI has released a very simple sparse autoencoder that uses top-k activations instead of L1 regularization to convert the dense feature superpositions produced by neural networks into sparse, interpretable features.
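
A minimal sketch of the top-k idea in plain Python, not OpenAI's code: the encoder keeps only the k largest pre-activations and zeroes the rest, so sparsity is enforced directly and training can use reconstruction error alone, without an L1 penalty. The names `topk`, `sae_forward`, `W_enc`, `W_dec`, the random weights and the sizes are all made up for illustration.

```python
import random

def matvec(W, x):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * v for w, v in zip(row, x)) for row in W]

def topk(pre_acts, k):
    """Keep the k largest pre-activations and set all others to zero."""
    order = sorted(range(len(pre_acts)), key=lambda i: pre_acts[i], reverse=True)
    keep = set(order[:k])
    return [v if i in keep else 0.0 for i, v in enumerate(pre_acts)]

def sae_forward(x, W_enc, b_enc, W_dec, b_dec, k):
    """Encode x into a sparse code z (at most k nonzeros), then reconstruct x."""
    pre = [p + b for p, b in zip(matvec(W_enc, x), b_enc)]
    z = topk(pre, k)
    x_hat = [r + b for r, b in zip(matvec(W_dec, z), b_dec)]
    return z, x_hat

# Hypothetical sizes and random weights, only to show the shapes involved.
random.seed(0)
d_model, n_latents, k = 4, 16, 3
W_enc = [[random.gauss(0, 1) for _ in range(d_model)] for _ in range(n_latents)]
W_dec = [[random.gauss(0, 1) for _ in range(n_latents)] for _ in range(d_model)]
b_enc = [0.0] * n_latents
b_dec = [0.0] * d_model

x = [random.gauss(0, 1) for _ in range(d_model)]
z, x_hat = sae_forward(x, W_enc, b_enc, W_dec, b_dec, k)
print("nonzero latents:", sum(1 for v in z if v != 0.0), "of", n_latents)
```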

Fun fact: L1 regularization promotes sparsity for the same reason that the value minimizing mean absolute error over a sample is the median.
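
One way to read the fun fact, sketched in plain Python: the absolute value pulls with constant strength no matter how far away a point is. For a loss, that means the best constant only balances the number of points on each side (the median). For a penalty, it means small weights get pushed all the way to exactly zero. The sample `xs`, the grid search and the helper `soft_threshold` are illustrative choices, not from the linked thread.

```python
import math

def mean_abs_error(c, xs):
    return sum(abs(x - c) for x in xs) / len(xs)

def mean_sq_error(c, xs):
    return sum((x - c) ** 2 for x in xs) / len(xs)

# A sample with one large outlier: median is 3, mean is 22.
xs = [1.0, 2.0, 3.0, 4.0, 100.0]
grid = [i / 10 for i in range(0, 1001)]  # candidate constants 0.0 .. 100.0

best_abs = min(grid, key=lambda c: mean_abs_error(c, xs))
best_sq = min(grid, key=lambda c: mean_sq_error(c, xs))
print("argmin MAE:", best_abs)  # the median
print("argmin MSE:", best_sq)   # the mean

def soft_threshold(a, lam):
    """Minimizer over w of 0.5 * (w - a)**2 + lam * abs(w)."""
    return math.copysign(max(abs(a) - lam, 0.0), a)

# With an L1 penalty, any |a| <= lam is mapped to exactly zero.
print([soft_threshold(a, 0.5) for a in (-2.0, -0.3, 0.3, 2.0)])
```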

https://x.com/norabelrose/status/1798766340427403472?
🤯1
Good news for pretraining: using a cosine schedule is unnecessary.

Cosine schedules were popularized by Chinchilla and nanoGPT, and I'm happy to see multiple confirmations that they are not necessary: you can just train with a constant LR for most of the run and use the last 20% to cool down linearly. You can do even better if you use Noam cooldown (1-sqrt(t)). As a bonus, constant + linear is more forgiving if you have not searched for a good learning rate.

https://arxiv.org/abs/2405.18392
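
A minimal sketch of the constant + cooldown schedule described above, assuming t is the fraction of the cooldown phase already completed (0 at its start, 1 at the end of training). The function name `lr_at` and its parameters are made up for illustration, and warmup is left out.

```python
import math

def lr_at(step, total_steps, peak_lr, cooldown_frac=0.2, shape="linear"):
    """Constant learning rate, then a cooldown over the last cooldown_frac of training.

    shape="linear": peak_lr * (1 - t)
    shape="sqrt":   peak_lr * (1 - sqrt(t))
    """
    cooldown_steps = max(1, round(total_steps * cooldown_frac))
    cooldown_start = total_steps - cooldown_steps
    if step < cooldown_start:
        return peak_lr  # constant phase
    t = min((step - cooldown_start) / cooldown_steps, 1.0)
    if shape == "linear":
        return peak_lr * (1.0 - t)
    if shape == "sqrt":
        return peak_lr * (1.0 - math.sqrt(t))
    raise ValueError(f"unknown cooldown shape: {shape}")

# Example: 1000 steps, constant for the first 800, cooldown over the last 200.
for step in (0, 799, 800, 900, 1000):
    print(step, lr_at(step, 1000, 1e-3), lr_at(step, 1000, 1e-3, shape="sqrt"))
```

The sqrt shape drops quickly right after the cooldown starts and then flattens, so it stays at or below the linear shape throughout the cooldown.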
Fun fact: uk4b large was trained with a constant learning rate without full decay
💅2
I construct a linear RNN that receives ones as inputs and produces increasingly better approximations to pi as outputs.

The width of the network grows logarithmically wrt the maximum sequence length.

https://gist.github.com/proger/ba147e3953a155d833aae084c1f0cd12
God gave us natural numbers. Humans invented real numbers for oppression (at ETH by the way)
😇1
Very cool speech recognition work: FastConformer + SC-CTC ablations wrt sequence length. 6-layer models can properly use 21.8 minutes of context.

https://x.com/RobFlynnHere/status/1803004030576152944
Will Merrill's work helped me gain intuition about which classes of tasks are solvable by which kinds of sequence modeling architectures, notably that constant-depth Transformers are not universal (i.e. they cannot even solve everything in P). The ability to execute any algorithm in P can be achieved by applying chains of thought, scratchpads, pause tokens or infinite depth (see "Universal Transformer"). Chain of thought can be seen as a way to gain infinite depth from a finite network. Check out his recent video:

https://www.youtube.com/watch?v=30MhUdapqc8
👍1
emotional computing
How to do vector calculus with ChatGPT: ask it to put indices on vectors explicitly, then there is no confusion about what dimensions need to be used when outer products arise.

Example: https://chatgpt.com/share/588e5ca1-b286-4f74-b2ef-bdbb402e3c83
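
A small worked example of why explicit indices help (my own illustration, not taken from the shared chat): take a linear layer y = Wx and a scalar loss L, and write g_i for dL/dy_i. With indices written out, the gradient with respect to W falls out as an outer product, with no guessing about which side gets transposed.

```latex
y_i = \sum_j W_{ij} x_j
\qquad\Longrightarrow\qquad
\frac{\partial y_i}{\partial W_{kl}} = \delta_{ik}\, x_l

\frac{\partial L}{\partial W_{kl}}
  = \sum_i \frac{\partial L}{\partial y_i}\,\frac{\partial y_i}{\partial W_{kl}}
  = \sum_i g_i\, \delta_{ik}\, x_l
  = g_k\, x_l
\qquad\text{i.e.}\qquad
\frac{\partial L}{\partial W} = g\, x^{\top}
```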