Vol Building AGI
Past topics: speech synthesis, transformers, LSTM, recurrence
tired: github streaks
wired: wandb streaks
New challenge for signal recognition: the bandwidth has increased

https://content.neuralink.com/compression-challenge/README.html
Manifest AI is working on linear transformers and context scaling. At the end of this article, the authors discuss what becomes possible when you push past current context size limits: at billions of tokens you won't need finetuning any more, you'll just be able to put your entire dataset into the context window.

Currently, open-source LMs handle thousands of tokens and industrial-grade LMs handle millions, so there is a lot of work left to push this frontier. Transformers simply append each token's embeddings to memory, so we will need some form of automatic compression to get past this.

https://manifestai.com/articles/compute-optimal-context-size/
4o has become so good at math that it can analyze recurrences for me (and tolerate my typos)
Doing symbolic differentiation with loops is a piece of cake, and I don't have to explain what a "backward pass" is: https://chatgpt.com/share/858b2882-9d29-442e-a0cb-7e3afb24abab
reproducing gpt-2 small is now about 10x cheaper than last year:

https://x.com/karpathy/status/1795484547267834137/photo/2
OMNI reboot as OMNI-EPIC: ask ChatGPT to generate environments for your agent that are progressively harder to solve!

https://x.com/jeffclune/status/1795787632435212732
The recent CoPE paper made it all click. Gated linear RNNs are transformers with token-dependent attention masks and no softmax. If you would like to know how to calculate the attention mask, read my upcoming thesis :D
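One common way to see this equivalence, sketched here under simplifying assumptions (a single scalar gate per token, no normalization; the function names are made up for illustration and this is not necessarily the formulation from the thesis): unrolling the recurrence gives a causal mask whose entries are products of the gates between two positions.

```python
import math

def gated_rnn(q, k, v, g):
    """Recurrent form: S_t = g_t * S_{t-1} + outer(k_t, v_t), o_t = q_t @ S_t."""
    d_k, d_v = len(k[0]), len(v[0])
    S = [[0.0] * d_v for _ in range(d_k)]
    outputs = []
    for q_t, k_t, v_t, g_t in zip(q, k, v, g):
        # Decay the old state by the token's gate, then add the new key-value outer product.
        S = [[g_t * S[i][j] + k_t[i] * v_t[j] for j in range(d_v)] for i in range(d_k)]
        outputs.append([sum(q_t[i] * S[i][j] for i in range(d_k)) for j in range(d_v)])
    return outputs

def decay_mask(g):
    """M[t][s] = g_{s+1} * ... * g_t for s <= t, and 0 for s > t (causal)."""
    T = len(g)
    M = [[0.0] * T for _ in range(T)]
    for t in range(T):
        M[t][t] = 1.0
        for s in range(t - 1, -1, -1):
            M[t][s] = M[t][s + 1] * g[s + 1]
    return M

def masked_linear_attention(q, k, v, g):
    """Parallel form: o_t = sum_s M[t][s] * (q_t . k_s) * v_s, with no softmax."""
    M = decay_mask(g)
    d_v = len(v[0])
    outputs = []
    for t in range(len(q)):
        o = [0.0] * d_v
        for s in range(t + 1):
            score = M[t][s] * sum(a * b for a, b in zip(q[t], k[s]))
            for j in range(d_v):
                o[j] += score * v[s][j]
        outputs.append(o)
    return outputs

# Toy inputs (arbitrary values, 3 tokens, key dim 2, value dim 2).
q = [[1.0, 0.5], [0.2, -1.0], [0.3, 0.7]]
k = [[0.4, 0.1], [-0.3, 0.8], [0.6, 0.2]]
v = [[1.0, 2.0], [0.5, -1.0], [-2.0, 0.3]]
g = [0.9, 0.5, 0.8]

rnn_out = gated_rnn(q, k, v, g)
attn_out = masked_linear_attention(q, k, v, g)
same = all(math.isclose(a, b, abs_tol=1e-12)
           for ra, rb in zip(rnn_out, attn_out) for a, b in zip(ra, rb))
```

Both functions compute the same outputs; the recurrent one uses constant memory per step, while the masked one exposes the full T x T attention pattern.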
Dynamic programming is also used to find optimal tensor contraction paths, exploiting the associativity of tensor products

The screenshot uses opt_einsum
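As a minimal sketch of the idea, here is the textbook matrix-chain special case in plain Python. It is not opt_einsum's implementation, which handles general einsum expressions; the function name and cost model (scalar multiplications only) are assumptions made for illustration.

```python
def optimal_chain_order(dims):
    """Matrix i has shape dims[i] x dims[i + 1].

    Returns (minimal number of scalar multiplications, parenthesization string).
    """
    n = len(dims) - 1
    cost = [[0] * n for _ in range(n)]
    split = [[0] * n for _ in range(n)]
    # Solve sub-chains of increasing length, reusing the costs of shorter ones.
    for length in range(2, n + 1):
        for i in range(n - length + 1):
            j = i + length - 1
            cost[i][j] = float("inf")
            for m in range(i, j):
                c = cost[i][m] + cost[m + 1][j] + dims[i] * dims[m + 1] * dims[j + 1]
                if c < cost[i][j]:
                    cost[i][j], split[i][j] = c, m

    def show(i, j):
        if i == j:
            return f"A{i}"
        m = split[i][j]
        return f"({show(i, m)} {show(m + 1, j)})"

    return cost[0][n - 1], show(0, n - 1)

# A0 is 10x100, A1 is 100x5, A2 is 5x50.
best_cost, best_order = optimal_chain_order([10, 100, 5, 50])
print(best_cost, best_order)  # 7500 ((A0 A1) A2)
```

Multiplying in the other order, A0 (A1 A2), would cost 75000 scalar multiplications, ten times more for the same result.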
Tymofiy Mylovanov gives examples of Bayesian inference:
1. give the model a prior through the prompt (ask what organizational culture principles by Ed Schade are)
2. give the model a likelihood (drop in a dataset and ask it to identify cultural misalignments)
3. get the posterior: discuss the result

https://youtu.be/LTpWpadoT_U
A few recipes for cleaning pretraining data are popping up:

- FineWeb-Edu, which maximizes scores on downstream educational benchmarks

https://x.com/gui_penedo/status/1797173053123916036

The ablations are 1.82B-parameter training runs on ~30B tokens (almost the same amount of data as uk4b-large, at more than 2x the model size): ~1 GPU-month per ablation!

> Our ablation models were trained using nanotron. Our "ablation models" have 1.82B parameters (including embeddings), used the Llama architecture with a 2048 sequence length, a global batch size of ~2 million tokens, and the GPT2 tokenizer. For most ablations we trained on ~28B tokens (roughly the Chinchilla optimal training size for this model size). To confirm relative performance improvements after each step of filtering we conducted longer training runs on 350 billion tokens as mentioned further below.
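A back-of-the-envelope check of these numbers, as a rough sketch: it assumes the common rule of thumb of about 6 FLOPs per parameter per token (not stated in the quote) and a 30-day month, and asks what sustained throughput "~1 GPU-month" would imply. The variable names are made up for illustration.

```python
N_PARAMS = 1.82e9       # from the quote
N_TOKENS = 28e9         # from the quote
BATCH_TOKENS = 2e6      # from the quote ("~2 million tokens")
SECONDS_PER_MONTH = 30 * 24 * 3600  # assumption: 30-day month

def training_flops(n_params, n_tokens):
    """Rule-of-thumb estimate (assumption): ~6 FLOPs per parameter per token."""
    return 6 * n_params * n_tokens

flops = training_flops(N_PARAMS, N_TOKENS)       # about 3.06e20 FLOPs
steps = N_TOKENS / BATCH_TOKENS                  # 14,000 optimizer steps
tokens_per_param = N_TOKENS / N_PARAMS           # about 15.4 tokens per parameter
implied_throughput = flops / SECONDS_PER_MONTH   # about 1.18e14 FLOP/s sustained

print(f"{flops:.3g} FLOPs, {steps:.0f} steps, "
      f"{tokens_per_param:.1f} tokens/param, {implied_throughput:.3g} FLOP/s")
```

Under these assumptions, one GPU-month corresponds to roughly 118 TFLOP/s sustained on a single GPU, and the run sees about 15 tokens per parameter.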


And similar work from the FLAN collection author:

https://arxiv.org/abs/2305.13169
GH200 consumes 900 W per GPU
RNNs = orders of magnitude compute and memory savings. Latest results from my thesis.
me: happy that the code works
code:
OpenAI has released the simplest sparse autoencoder, which uses top-k activations instead of L1 regularization to convert the dense feature superpositions produced by neural networks into sparse, interpretable features
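A minimal forward-pass sketch of the top-k idea in plain Python. This is not OpenAI's released code: biases and training are omitted, and the function names and toy weights are assumptions for illustration only.

```python
def matvec(W, x):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]

def top_k(values, k):
    """Keep the k largest entries of `values` and zero out the rest."""
    order = sorted(range(len(values)), key=lambda i: values[i], reverse=True)
    keep = set(order[:k])
    return [x if i in keep else 0.0 for i, x in enumerate(values)]

def sae_forward(x, W_enc, W_dec, k):
    """latents = top_k(W_enc @ x); reconstruction = W_dec @ latents."""
    latents = top_k(matvec(W_enc, x), k)
    return latents, matvec(W_dec, latents)

# Toy example: 2-dimensional input, 3 latent features, keep only 1 active.
x = [1.0, 2.0]
W_enc = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
W_dec = [[1.0, 0.0, 0.5], [0.0, 1.0, 0.5]]
latents, reconstruction = sae_forward(x, W_enc, W_dec, k=1)
print(latents, reconstruction)  # [0.0, 0.0, 3.0] [1.5, 1.5]
```

Sparsity is enforced directly by the top-k step: exactly k latents survive, so no L1 penalty on the latents is needed to control how many features fire.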

Fun fact: L1 regularization encourages sparsity for the same reason that the minimizer of mean absolute error is the sample median
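A quick numerical illustration of the median half of that statement, as a sketch with made-up data and a brute-force grid search: the absolute loss is minimized at the median, while the squared loss is minimized at the mean.

```python
from statistics import mean, median

def abs_loss(c, xs):
    """Sum of absolute deviations from the candidate center c."""
    return sum(abs(x - c) for x in xs)

def sq_loss(c, xs):
    """Sum of squared deviations from the candidate center c."""
    return sum((x - c) ** 2 for x in xs)

xs = [1.0, 2.0, 3.0, 4.0, 100.0]           # toy data with one outlier
grid = [i / 10 for i in range(0, 1001)]    # candidate centers 0.0, 0.1, ..., 100.0

best_abs = min(grid, key=lambda c: abs_loss(c, xs))
best_sq = min(grid, key=lambda c: sq_loss(c, xs))

print(best_abs, median(xs))  # 3.0 3.0
print(best_sq, mean(xs))     # 22.0 22.0
```

The absolute value pulls with the same strength no matter how small the residual is, so the optimum lands exactly on a data point; applied to weights or activations, the same constant pull drives small values exactly to zero, whereas a squared penalty's pull fades as values approach zero.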

https://x.com/norabelrose/status/1798766340427403472