Vol Building AGI

4o has become so good at math that it can analyze recurrences for me (and tolerate my typos)

84 views, edited 13:22

Doing symbolic differentiation with loops is a piece of cake, and I don't have to explain what a "backward pass" is: https://chatgpt.com/share/858b2882-9d29-442e-a0cb-7e3afb24abab

OpenAI

ChatGPT

A conversational AI system that listens, learns, and challenges

89 views, edited 13:29

Vol Building AGI

reproducing gpt-2 small is now about 10x cheaper than last year:

https://x.com/karpathy/status/1795484547267834137/photo/2

🔥1

80 views, edited 16:01

Vol Building AGI

OMNI reboot as OMNI-EPIC: ask ChatGPT to generate environments for your agent that are progressively harder to solve!

https://x.com/jeffclune/status/1795787632435212732

🔥1

89 views, 15:22

Vol Building AGI

The recent CoPE paper made it all click. Gated Linear RNNs are transformers with token-dependent masks without softmax. If you would like to know how to calculate the attention mask, read my upcoming thesis :D
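
A minimal NumPy sketch of that equivalence, assuming scalar per-token gates a_t in (0, 1). This is an illustrative simplification with hypothetical function names, not the construction from the thesis. The recurrence h_t = a_t * h_{t-1} + k_t v_t^T should give the same outputs as unnormalized (softmax-free) attention whose causal mask holds products of gates:

```python
import numpy as np

def gated_linear_rnn(q, k, v, a):
    """Run the recurrence h_t = a_t * h_{t-1} + outer(k_t, v_t); y_t = q_t @ h_t."""
    T, d = q.shape
    h = np.zeros((d, v.shape[1]))
    ys = []
    for t in range(T):
        h = a[t] * h + np.outer(k[t], v[t])
        ys.append(q[t] @ h)
    return np.stack(ys)

def masked_linear_attention(q, k, v, a):
    """Same outputs, computed as (mask * q k^T) v with a token-dependent mask."""
    c = np.cumsum(np.log(a))                 # running sum of log-gates
    # mask[t, s] = a_{s+1} * ... * a_t for s <= t, and 0 above the diagonal
    mask = np.tril(np.exp(c[:, None] - c[None, :]))
    return (mask * (q @ k.T)) @ v
```

The mask depends on the tokens only through the gates, and no softmax appears anywhere: the scores q k^T are used as they are.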

363 views, 10:39

Vol Building AGI

Spoiler: it’s dynamic programming!

🔥1

87 views, 14:02

Vol Building AGI

Dynamic programming is also used to find optimal tensor contraction paths, exploiting the associativity of tensor products

The screenshot uses opt_einsum
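
The simplest instance of this idea is the textbook matrix-chain ordering problem. The sketch below is plain Python and does not use opt_einsum's API; the function name and the `dims` convention are assumptions made for illustration. It finds the cheapest parenthesization of a product of matrices by dynamic programming over sub-chains:

```python
from functools import lru_cache

def matrix_chain_cost(dims):
    """Minimum number of scalar multiplications to compute M_0 @ M_1 @ ... @ M_{n-1},
    where matrix i has shape (dims[i], dims[i + 1])."""
    n = len(dims) - 1

    @lru_cache(maxsize=None)
    def best(i, j):
        # Cheapest way to contract the sub-chain M_i ... M_j.
        if i == j:
            return 0
        return min(
            best(i, k) + best(k + 1, j) + dims[i] * dims[k + 1] * dims[j + 1]
            for k in range(i, j)
        )

    return best(0, n - 1)
```

For shapes 10x30, 30x5 and 5x60, multiplying the first two matrices first costs 4500 scalar multiplications, while the other order costs 27000; associativity guarantees both give the same result.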

🔥1

82 views, 14:45

Vol Building AGI

Tymofiy Mylovanov gives examples of Bayesian inference:
1. give the model a prior through the prompt (ask what "organizational culture principles by Ed Schade" are)
2. give the model a likelihood (upload a dataset and ask it to "identify cultural misalignments")
3. get the posterior: discuss the result
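
For reference, the three steps mirror the textbook update posterior ∝ prior × likelihood. A toy sketch with made-up hypotheses and made-up numbers (nothing here comes from the video):

```python
def posterior(prior, likelihood):
    """Discrete Bayes update: multiply prior by likelihood, then normalize."""
    unnormalized = {h: prior[h] * likelihood[h] for h in prior}
    evidence = sum(unnormalized.values())
    return {h: p / evidence for h, p in unnormalized.items()}

# Hypothetical numbers: belief before seeing the dataset, and how probable
# the observed dataset is under each hypothesis.
prior = {"aligned": 0.7, "misaligned": 0.3}
likelihood = {"aligned": 0.2, "misaligned": 0.9}
result = posterior(prior, likelihood)
```

With these numbers the data shift most of the belief toward "misaligned", even though the prior favored "aligned".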

https://youtu.be/LTpWpadoT_U

YouTube

Hello, ChatGPT-4o

98 views, 16:35

Vol Building AGI

A few recipes for cleaning pretraining data are popping up:

- FineWeb-Edu, maximizes scores on downstream educational benchmarks

https://x.com/gui_penedo/status/1797173053123916036

The ablations are 1.82B-parameter training runs on 30B tokens (almost the same as uk4b-large in data, >2x the size): ~1 GPU-month per ablation!

> Our ablation models were trained using nanotron. Our "ablation models" have 1.82B parameters (including embeddings), used the Llama architecture with a 2048 sequence length, a global batch size of ~2 million tokens, and the GPT2 tokenizer. For most ablations we trained on ~28B tokens (roughly the Chinchilla optimal training size for this model size). To confirm relative performance improvements after each step of filtering we conducted longer training runs on 350 billion tokens as mentioned further below
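
A back-of-the-envelope check of the GPU-month figure, assuming the common FLOPs ≈ 6 × parameters × tokens rule of thumb and a 30-day month. Both are assumptions, and the throughput a real GPU sustains depends on hardware and utilization:

```python
def training_flops(n_params, n_tokens):
    """Approximate training compute with the 6 * N * D rule of thumb."""
    return 6.0 * n_params * n_tokens

n_params = 1.82e9          # ablation model size from the quote
n_tokens = 28e9            # tokens per ablation from the quote
flops = training_flops(n_params, n_tokens)

seconds_per_month = 30 * 24 * 3600
# Throughput one GPU would need to sustain to finish in a month.
required_flops_per_second = flops / seconds_per_month
tokens_per_param = n_tokens / n_params
```

This comes out to roughly 3e20 FLOPs, or about 1.2e14 FLOP/s sustained for a month on a single GPU, which is in line with the ~1 GPU-month estimate.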

And similar work from the FLAN collection author:

https://arxiv.org/abs/2305.13169

X (formerly Twitter)

Guilherme Penedo (@gui_penedo) on X

We are (finally) releasing the 🍷 FineWeb technical report!

In it, we detail and explain every processing decision we took, and we also introduce our newest dataset: 📚 FineWeb-Edu, a (web only) subset of FW filtered for high educational content.

Link: h…

104 views, 11:47

Vol Building AGI

https://github.com/shashankvkt/DoRA_ICLR24 Pretraining on object tracking on 10 long form videos beats ImageNet pretraining

https://www.youtube.com/watch?v=i2Mp_Bc14WI

Since FPV pretraining is actually mainstream (= best ICLR honorable mention paper) I'm now obsessed with looking up animal camera videos on YouTube. Do you have a favorite of your own?

YouTube

Cat POV / Cat with Camera 🔴 / Ros' Unedited Stream #13

108 views, edited 15:36

Vol Building AGI

GH200 consumes 900 W per GPU

😱1

121 views, 10:29

Vol Building AGI

RNNs = orders of magnitude compute and memory savings. Latest results from my thesis.

💅1

115 views, edited 17:21

Vol Building AGI

baby reproduction of chinchilla scaling laws, i want to reuse this https://x.com/Locchiu/status/1797751414548246624
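
For context, a small sketch of the compute-optimal split under the parametric loss L(N, D) = E + A/N^alpha + B/D^beta with compute C ≈ 6*N*D. The default constants are the fitted values as I recall them from the Chinchilla paper, so treat them as assumptions; this is not the code from the linked Colab:

```python
def loss(n, d, E=1.69, A=406.4, B=410.7, alpha=0.34, beta=0.28):
    """Parametric loss as a function of parameters n and training tokens d."""
    return E + A / n**alpha + B / d**beta

def compute_optimal(flops, A=406.4, B=410.7, alpha=0.34, beta=0.28):
    """Closed-form minimizer of the loss subject to 6 * n * d = flops."""
    budget = flops / 6.0                      # the product n * d is fixed
    g = (alpha * A / (beta * B)) ** (1.0 / (alpha + beta))
    n_opt = g * budget ** (beta / (alpha + beta))
    d_opt = budget / n_opt
    return n_opt, d_opt

n_opt, d_opt = compute_optimal(1e21)
```

Moving away from `n_opt` in either direction at the same compute budget increases the predicted loss, which is the U-shaped curve seen in IsoFLOP plots.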

X (formerly Twitter)

Lechao Xiao (@Locchiu) on X

nanoChinchilla.

Reproducing Chinchilla-Optimal Scaling Phenomenon: Colab, 1 Hour, 100 Lines, + Beautiful Theory https://t.co/Bsd6hWZZVQ

🔥1

115 views, 21:27

Vol Building AGI

me: happy that the code works
the code:

88 views, 00:21

Vol Building AGI

OpenAI has released the simplest sparse autoencoder yet: it uses top-k activations instead of L1 regularization to convert the dense feature superpositions produced by neural networks into sparse, interpretable features.
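
A minimal NumPy sketch of a top-k sparse autoencoder forward pass, to show how little there is to it. Parameter names and shapes are assumptions for illustration, and this is not OpenAI's released code:

```python
import numpy as np

def topk_sae_forward(x, w_enc, w_dec, b_pre, k):
    """Encode, keep only the k largest pre-activations per example, decode."""
    pre = (x - b_pre) @ w_enc                          # (batch, n_latents)
    # Column indices of the k largest entries in each row (order not guaranteed).
    idx = np.argpartition(pre, -k, axis=-1)[:, -k:]
    z = np.zeros_like(pre)
    np.put_along_axis(z, idx, np.take_along_axis(pre, idx, axis=-1), axis=-1)
    x_hat = z @ w_dec + b_pre                          # reconstruction
    return z, x_hat
```

Sparsity is enforced by construction (exactly k active latents per example), so the training loss can be plain reconstruction error with no L1 penalty coefficient to tune.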

Fun fact: L1 regularization promotes sparsity for the same reason that the minimizer of mean absolute error over a sample is the median
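
A quick numerical illustration of the median half of that claim, using a made-up sample and a brute-force grid search (the helper name is hypothetical):

```python
import numpy as np

def mae_minimizer(x, candidates):
    """Return the candidate c that minimizes mean(|x - c|) by brute force."""
    mae = np.abs(x[None, :] - candidates[:, None]).mean(axis=1)
    return candidates[mae.argmin()]

x = np.array([0.1, 0.2, 0.3, 5.0, 100.0])     # made-up sample with outliers
candidates = np.linspace(0.0, 100.0, 100001)   # grid with step 0.001
best = mae_minimizer(x, candidates)
```

The absolute value pulls with a constant force regardless of distance, so the optimum lands exactly on a data point (the median) instead of being dragged toward the outliers like the mean. The same constant pull drives small weights exactly to zero under an L1 penalty.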

https://x.com/norabelrose/status/1798766340427403472?

X (formerly Twitter)

Nora Belrose (@norabelrose) on X

This is so hilariously simple, I'm switching my SAE code to this approach immediately
https://t.co/j0T0D1Arj8

🤯1

96 views, edited 09:29

Vol Building AGI

While I am trying to use more matmuls, people are getting rid of them

https://arxiv.org/abs/2406.02528

arXiv.org

Scalable MatMul-free Language Modeling

Large Language Models (LLMs) have fundamentally altered how we approach scaling in machine learning. However, these models pose substantial computational and memory challenges, primarily due to...

😢2

96 views, 13:26

Vol Building AGI

Good news for pretraining: using a cosine schedule is unnecessary.

Cosine schedules were popularized by Chinchilla and nanoGPT, and I'm happy to see multiple confirmations that they're not necessary: you can just train with a constant LR most of the time and use the last 20% of steps to cool down linearly. You can do even better if you use the Noam cooldown (1-sqrt(t)). As a bonus, constant + linear is more forgiving if you haven't searched for a good learning rate.
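
A minimal sketch of such a schedule, with warmup omitted for brevity. The function name and arguments are illustrative assumptions, not taken from the paper's code:

```python
import math

def lr_at(step, total_steps, peak_lr, cooldown_frac=0.2, shape="linear"):
    """Constant learning rate followed by a cooldown over the final fraction of steps."""
    cooldown_start = round(total_steps * (1 - cooldown_frac))
    if step < cooldown_start:
        return peak_lr
    # Progress through the cooldown phase, from 0 to 1.
    progress = (step - cooldown_start) / (total_steps - cooldown_start)
    if shape == "linear":
        return peak_lr * (1 - progress)
    if shape == "sqrt":
        return peak_lr * (1 - math.sqrt(progress))
    raise ValueError(f"unknown cooldown shape: {shape}")
```

A practical upside of this design is that the constant phase does not depend on `total_steps`, so a checkpoint from the constant phase can be cooled down at any point to get a model for a shorter token budget.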

https://arxiv.org/abs/2405.18392

100 views, edited 12:53

Vol Building AGI

Fun fact: uk4b large was trained with a constant learning rate without full decay

💅2

94 views, edited 12:56

Vol Building AGI

Remember Monarch Mixer? Dense linear layers soon to be replaced with more structured counterparts
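
To give a feel for what "structured" buys, here is a sketch with a Kronecker-factored layer, one of the simplest structured families. It is chosen for illustration only and is not the specific structure proposed in the linked paper; names are hypothetical:

```python
import numpy as np

def kron_linear(x, a, b):
    """Compute np.kron(a, b) @ x without ever forming the big matrix.

    a has shape (p, m), b has shape (q, n), x has shape (m * n,).
    """
    m, n = a.shape[1], b.shape[1]
    x_mat = x.reshape(m, n)               # row-major unflattening of the input
    return (a @ x_mat @ b.T).reshape(-1)  # equals kron(a, b) @ x
```

The dense equivalent stores p*q*m*n weights, while the factored layer stores only p*m + q*n and needs correspondingly fewer multiply-adds.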

https://x.com/andrewgwils/status/1800532157406011523?

X (formerly Twitter)

Andrew Gordon Wilson (@andrewgwils) on X

A lot of the computation in pre-training transformers is now spent in the dense linear (MLP) layers. In our new ICML paper, we propose matrix structures with better scaling laws!
https://t.co/6n1dbXiWap
w/@ShikaiQiu, Andres P, @m_finzi, @micahgoldblum
1/8

🔥1

92 views, edited 20:40

Vol Building AGI

I construct a linear RNN that receives ones as inputs and produces increasingly better approximations to pi as outputs.

The width of the network grows logarithmically wrt the maximum sequence length.

https://gist.github.com/proger/ba147e3953a155d833aae084c1f0cd12

Gist

pi.py

GitHub Gist: instantly share code, notes, and snippets.

113 views, 12:01
