Vol Building AGI
Past topics: speech synthesis, transformers, LSTM, recurrence
Dynamic programming is also used to find optimal tensor contraction paths, exploiting the associativity of tensor products

The screenshot uses opt_einsum
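
A minimal sketch of the idea for the simplest special case, a chain of matrices. This is the textbook matrix-chain DP, not opt_einsum's own algorithm; the function name `best_chain_order` and the example shapes are made up for illustration.

```python
from functools import lru_cache


def best_chain_order(dims):
    """Matrix i has shape dims[i] x dims[i + 1].

    Returns (scalar multiplications, parenthesization) for the cheapest
    way to multiply the whole chain. Associativity means every
    parenthesization gives the same result, so only the cost differs.
    """
    n = len(dims) - 1  # number of matrices in the chain

    @lru_cache(maxsize=None)
    def solve(i, j):
        # Cheapest way to contract matrices i..j (inclusive).
        if i == j:
            return 0, f"A{i}"
        best = None
        for split in range(i, j):
            left_cost, left = solve(i, split)
            right_cost, right = solve(split + 1, j)
            # Cost of multiplying the two partial products together.
            cost = left_cost + right_cost + dims[i] * dims[split + 1] * dims[j + 1]
            if best is None or cost < best[0]:
                best = (cost, f"({left} {right})")
        return best

    return solve(0, n - 1)


# Hypothetical shapes: A0 is 10x100, A1 is 100x5, A2 is 5x50.
cost, order = best_chain_order([10, 100, 5, 50])
print(cost, order)
```

With these shapes the two possible orders differ by 10x in cost (7500 vs 75000 scalar multiplications), which is why the path search is worth doing.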
Tymofiy Mylovanov gives examples of Bayesian inference:
1. give the model a prior through the prompt (ask what the organizational culture principles by Ed Schade are)
2. give the model a likelihood (upload a dataset and ask it to identify cultural misalignments)
3. get the posterior: discuss the result

https://youtu.be/LTpWpadoT_U
A few recipes for cleaning pretraining data are popping up:

- FineWeb-Edu, maximizes scores on downstream educational benchmarks

https://x.com/gui_penedo/status/1797173053123916036

The ablations are training runs of a 1.82B-parameter model on ~30B tokens (almost the same amount of data as uk4b-large, at more than 2x the model size): about 1 GPU-month per ablation!

> Our ablation models were trained using nanotron. Our "ablation models" have 1.82B parameters (including embeddings), used the Llama architecture with a 2048 sequence length, a global batch size of ~2 million tokens, and the GPT2 tokenizer. For most ablations we trained on ~28B tokens (roughly the Chinchilla optimal training size for this model size). To confirm relative performance improvements after each step of filtering we conducted longer training runs on 350 billion tokens as mentioned further below.
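
A back-of-the-envelope check of the GPU-month figure, using the common 6 * parameters * tokens approximation for training FLOPs. The sustained throughput is my assumption (a hypothetical 1.25e14 FLOP/s per GPU), not a number from the FineWeb authors.

```python
# Rough training-compute estimate for one ablation run.
params = 1.82e9   # model size quoted above
tokens = 30e9     # approximate tokens per ablation

# Standard approximation: ~6 FLOPs per parameter per token (forward + backward).
train_flops = 6 * params * tokens

# ASSUMPTION: sustained per-GPU throughput; hypothetical, depends on hardware and utilization.
assumed_flops_per_second = 1.25e14

gpu_days = train_flops / assumed_flops_per_second / 86_400
print(f"{train_flops:.3e} FLOPs, about {gpu_days:.0f} GPU-days")
```

Under that assumed throughput the estimate lands at roughly a month on a single GPU. A faster GPU or better utilization shrinks it proportionally.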


And similar work from the FLAN collection author:

https://arxiv.org/abs/2305.13169
GH200 consumes 900 W per GPU
RNNs = orders of magnitude compute and memory savings. Latest results from my thesis.
me: happy that the code works
code:
OpenAI has released the simplest sparse autoencoder: it uses top-k activations instead of L1 regularization to convert the dense, superposed features produced by neural networks into sparse, interpretable features.
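
A minimal sketch of a top-k autoencoder forward pass in plain Python. The shapes, random weights, tied decoder and the names `top_k` and `sae_forward` are all assumptions for illustration, not OpenAI's released code.

```python
import random


def top_k(values, k):
    """Keep the k largest entries of `values` and zero out the rest."""
    order = sorted(range(len(values)), key=lambda i: values[i], reverse=True)
    keep = set(order[:k])
    return [v if i in keep else 0.0 for i, v in enumerate(values)]


def matvec(matrix, vector):
    return [sum(w * v for w, v in zip(row, vector)) for row in matrix]


def sae_forward(x, w_enc, w_dec, b_pre, k):
    # Encode: subtract a learned bias, project up, keep only k latents.
    centered = [xi - bi for xi, bi in zip(x, b_pre)]
    latents = top_k(matvec(w_enc, centered), k)
    # Decode: project the sparse code back down and add the bias again.
    recon = [r + bi for r, bi in zip(matvec(w_dec, latents), b_pre)]
    return latents, recon


random.seed(0)
d_model, n_latents, k = 4, 16, 3  # hypothetical sizes
w_enc = [[random.gauss(0, 1) for _ in range(d_model)] for _ in range(n_latents)]
# ASSUMPTION: decoder starts as the transpose of the encoder.
w_dec = [[w_enc[j][i] for j in range(n_latents)] for i in range(d_model)]
b_pre = [0.0] * d_model
x = [random.gauss(0, 1) for _ in range(d_model)]

latents, recon = sae_forward(x, w_enc, w_dec, b_pre, k)
print(sum(z != 0.0 for z in latents), "active latents out of", n_latents)
```

The sparsity level is set directly by `k`, so there is no L1 penalty coefficient to tune.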

Fun fact: L1 regularization promotes sparsity for the same reason that the value minimizing the mean absolute error over a sample is the median
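
A quick numerical check of the median half of that claim, on made-up data. The intuition, as I read it: the absolute value has a constant-magnitude slope, so the minimizer sits exactly on a kink (a data point here, zero for an L1-penalized weight). A squared penalty's minimizer has no such pull towards a kink.

```python
import statistics

data = [1, 2, 3, 10, 100]  # hypothetical sample with an outlier


def mean_abs_error(c):
    return sum(abs(x - c) for x in data) / len(data)


def mean_sq_error(c):
    return sum((x - c) ** 2 for x in data) / len(data)


# Brute-force search over a grid of candidate constants 0.0, 0.5, ..., 100.0.
candidates = [i * 0.5 for i in range(201)]
best_abs = min(candidates, key=mean_abs_error)
best_sq = min(candidates, key=mean_sq_error)

print("MAE minimizer:", best_abs, "median:", statistics.median(data))
print("MSE minimizer (nearest grid point):", best_sq, "mean:", statistics.mean(data))
```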

https://x.com/norabelrose/status/1798766340427403472?
Good news for pretraining: using a cosine schedule is unnecessary.

Relying on cosine schedules was popularized by Chinchilla and nanoGPT, and I'm happy to see multiple confirmations that it's not necessary: you can just train with a constant LR for most of the run and use the last 20% to cool down linearly. You can even do better if you use a Noam cooldown (1-sqrt(t)). As a bonus, constant + linear is more forgiving if you have not searched for a good learning rate.

https://arxiv.org/abs/2405.18392
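
A minimal sketch of the schedule described above: constant LR, then a cooldown over the last 20% of steps. Warmup is left out, and I take `t` in 1-sqrt(t) to be the fraction of the cooldown that has elapsed, which is my reading rather than something stated in the post. The function and parameter names are hypothetical.

```python
import math


def lr_at(step, total_steps, peak_lr=1.0, cooldown_frac=0.2, shape="linear"):
    """Learning rate at `step` for a constant-then-cooldown schedule."""
    cooldown_steps = max(1, round(total_steps * cooldown_frac))
    start = total_steps - cooldown_steps
    if step <= start:
        return peak_lr  # constant phase
    # ASSUMPTION: t is the elapsed fraction of the cooldown, in (0, 1].
    t = min(1.0, (step - start) / cooldown_steps)
    if shape == "linear":
        return peak_lr * (1.0 - t)
    if shape == "sqrt":
        return peak_lr * (1.0 - math.sqrt(t))
    raise ValueError(f"unknown cooldown shape: {shape}")


for step in (0, 800, 900, 1000):
    print(step, lr_at(step, 1000), round(lr_at(step, 1000, shape="sqrt"), 4))
```

Because the constant phase does not depend on `total_steps`, a run can be extended from a pre-cooldown checkpoint without restarting the schedule.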
Fun fact: uk4b-large was trained with a constant learning rate without full decay
I construct a linear RNN that receives ones as inputs and produces increasingly better approximations to pi as outputs.

The width of the network grows logarithmically wrt the maximum sequence length.

https://gist.github.com/proger/ba147e3953a155d833aae084c1f0cd12
God gave us natural numbers. Humans invented real numbers for oppression (at ETH by the way)
Very cool speech recognition work: FastConformer + SC-CTC ablations wrt sequence length. 6-layer models can properly use 21.8 minutes of context.

https://x.com/RobFlynnHere/status/1803004030576152944
Will Merrill's work helped me gain intuition about what classes of tasks are solvable by what kinds of sequence modeling architectures, notably that constant-depth Transformers are not universal (i.e. they cannot even express every problem in P). The ability to execute any algorithm in P can be achieved by applying chains of thought, scratchpads, pause tokens or infinite depth (see "Universal Transformer"). Chain of thought can be seen as a way to gain infinite depth from a finite network. Check out his recent video:

https://www.youtube.com/watch?v=30MhUdapqc8
emotional computing