Vol Building AGI – Telegram

Vol Building AGI

581 subscribers

116 photos

9 videos

12 files

199 links

Past topics: speech synthesis, transformers, LSTM, recurrence

Vol Building AGI

https://app.suno.ai/song/f44b528c-0bce-45fe-b5b6-bca579f97ed2

Hi Pedro, you tweeted:

This is outrageously wrong, Attention was invented at U. Montreal by Bahdanau, Cho and Bengio. Transformers were just an extension. This is the paper
that really invented modern AI.

THIS IS OUTRAGEOUSLY WRONG.

The first Transformer variant was published over 30 years ago.

It is now called "unnormalized linear Transformer".

THIS IS OUTRAGEOUSLY WRONG.

See my eye see em ell paper. Attention terminology was introduced in ninety three.

Back in the day, compute costs were high,
A million times more, oh my, oh my!
Transformers unnormalized, they'd fly,
Linear, efficient, no cry.

No quadratic woes, just linear stride,
Scaling with input, ain't no need to RAG.
Compute constraints, they pushed the tide,
In Transformer's journey, they'd confide.

So listen up, in the computational spree,
Efficiency's the key, it's plain to see.
Sequence length scalability, that's the decree,
In the Transformer world, we're wild and free!

THIS IS OUTRAGEOUSLY WRONG.

Vaswani did not cite this.

THIS IS OUTRAGEOUSLY WRONG.

Here is a well-known tweet on this

THIS IS OUTRAGEOUSLY WRONG.

Quadratic ChatGPT

THIS IS OUTRAGEOUSLY WRONG.

THIS IS OUTRAGEOUSLY WRONG.

MACHINE LEARNING IS THE SCIENCE OF CREDIT ASSIGNMENT

Credit Assignment | Suno

angry comedy hip hop song. Listen and make your own with Suno.

106 views, edited 15:55

Vol Building AGI

ETH has an Engineering-focused course on Category Theory, checking out its notes: https://applied-compositional-thinking.engineering/resources/

336 views, 08:58

Vol Building AGI

Wow — time to get your kids into writing NeurIPS papers

72 views, edited 08:54

Vol Building AGI

Parameter-shared state expansion is the next frontier in efficient RNNs. It's a somewhat cumbersome concept that has been explored in linear transformers but only really became popular with Mamba.

Now, HGRN2 has introduced an elegant implementation of the simplest expansion rule: the outer product. An outer product is an operation that multiplies two vectors of shape Nx1 and 1xN to make an NxN matrix, i.e. a quadratic number of pairwise multiplicative interactions.

Outer products work well with Accelerated Scan and are already supported in the mqar branch of Hippogriff
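
A minimal sketch of what outer product state expansion looks like, in plain Python. The update rule below (decay each row of the state by a forget gate, then write the outer product of `1 - f` and the input) is an assumption loosely modelled on HGRN2, not a copy of its implementation; `expanded_rnn` and its arguments are hypothetical names. Note that going from an N-sized state to an NxN state adds no parameters.

```python
import math


def outer(u, v):
    """Outer product of two length-N vectors: an N x N matrix of pairwise products."""
    return [[ui * vj for vj in v] for ui in u]


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def expanded_rnn(forget_logits, inputs, queries):
    """Gated linear RNN whose state is expanded from N to N x N.

    Assumed update: S_t = diag(f_t) S_{t-1} + outer(1 - f_t, x_t)
    Assumed readout: y_t = q_t^T S_t
    """
    n = len(inputs[0])
    state = [[0.0] * n for _ in range(n)]
    outputs = []
    for logits, x, q in zip(forget_logits, inputs, queries):
        f = [sigmoid(z) for z in logits]
        write = outer([1.0 - fi for fi in f], x)  # N x N pairwise interactions
        # Row i of the state decays with forget gate f[i], then receives the write.
        state = [[f[i] * state[i][j] + write[i][j] for j in range(n)] for i in range(n)]
        # Read the matrix state back down to N numbers with the query.
        outputs.append([sum(q[i] * state[i][j] for i in range(n)) for j in range(n)])
    return outputs, state
```

The recurrence stays linear in the state, so it can still be computed with a scan; only the state grows from N to NxN.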

https://twitter.com/SonglinYang4/status/1778897457159758010

X (formerly Twitter)

Songlin Yang (@SonglinYang4) on X

I really like HGRN2's concept of integrating forget gates with keys in (gated) linear attention. It's a neat and effective approach! We've incorporated HGRN2 into the Flash-Linear-Attention library. Check it out here: https://t.co/WPozxdRnTD.

78 views, edited 10:41

Vol Building AGI

Outer product state expansion learns *really fast*

82 views, 10:47

Vol Building AGI

Wrote a small summary of Transformers, RNNs and Fast Weight Programmers using the language of transformers.
https://twitter.com/darkproger/status/1779123166561890357

X (formerly Twitter)

Volodymyr Kyrylov (@darkproger) on X

In a transformer there are as many keys as tokens, so softmax is used to choose one. In an RNN the number of keys is chosen up front.
In Hawk/RecurrentGemma the key dimension is 1, and the sigmoid gates can choose all keys, some or none.
A FWP has multidimensional…

81 views, 12:25

Vol Building AGI

Emergent property

👀2

75 views, 12:12

Vol Building AGI

The_Fisher_Darmois_Koopman_Pitman_theorem_for_random_processes.pdf

Discovered this probability result by Fisher-Darmois-Koopman-Pitman (F-D-K-P theorem):

an i.i.d. stream of data of arbitrary size can be summarized into a finite fixed number of sufficient statistics (= hidden state can be of fixed size and no information will be lost) iff the data is from an exponential family of distributions. The result holds under regularity conditions, notably that the support of the distribution does not depend on its parameters.

Normal, Bernoulli and MRFs belong to the exponential family, but the uniform distribution does not. Weirdly enough, a lot of LM diagnostic tasks sample from uniform distributions.

The attached paper generalized the result to random processes summarizable by Kalman filters, so it translates to single-layer linear RNNs.
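
To make the "fixed-size hidden state" reading concrete, here is a small sketch for the normal distribution, where the count, the sum and the sum of squares are sufficient statistics. The class name `GaussianSummary` is made up for illustration; the state is three numbers no matter how long the stream is.

```python
from dataclasses import dataclass


@dataclass
class GaussianSummary:
    """Fixed-size state: the sufficient statistics of an i.i.d. normal stream."""

    count: int = 0
    total: float = 0.0     # sum of x
    total_sq: float = 0.0  # sum of x^2

    def update(self, x: float) -> None:
        self.count += 1
        self.total += x
        self.total_sq += x * x

    def mean(self) -> float:
        return self.total / self.count

    def variance(self) -> float:
        # Maximum-likelihood (population) variance: E[x^2] - E[x]^2.
        return self.total_sq / self.count - self.mean() ** 2


summary = GaussianSummary()
for x in [1.0, 2.0, 3.0, 4.0]:
    summary.update(x)
print(summary.mean(), summary.variance())
```

The update is order-independent and additive, which is exactly the shape of a linear recurrence with a fixed-size state.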

85 views, edited 18:55

Vol Building AGI

Meta is in the RNN game https://arxiv.org/abs/2404.08801

Megalodon: Efficient LLM Pretraining and Inference with Unlimited...

The quadratic complexity and weak length extrapolation of Transformers limits their ability to scale to long sequences, and while sub-quadratic solutions like linear attention and state space...

335 views, 10:53

Vol Building AGI

https://arxiv.org/abs/2404.08819

A paper that precisely demonstrates the difference between input-dependent forget gate vectors and input-dependent recurrence matrices. My intuition from Dyck-1 synthetic tasks suggested there is no difference between recurrence matrices and gated vectors, but there is one (I am super pissed I arrived at the wrong conclusion)!

A linear RNN with a non-diagonal recurrence matrix can simulate DFAs (solve word problems that involve state permutation) and stay linear — strictly better than a transformer!
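
A toy sketch of the DFA claim, assuming a made-up two-letter alphabet over three states: each letter selects a permutation matrix, and the hidden state is a one-hot vector updated as h_t = A(x_t) h_{t-1}. A diagonal (gate vector) recurrence can only rescale coordinates; it can't move the 1 from one coordinate to another.

```python
# Hypothetical two-letter alphabet; each letter permutes three DFA states.
PERMUTATIONS = {
    "s": [1, 0, 2],  # swap states 0 and 1
    "c": [1, 2, 0],  # cycle 0 -> 1 -> 2 -> 0
}


def transition_matrix(perm):
    """Permutation matrix A with A[new][old] = 1 when the DFA maps old -> new."""
    n = len(perm)
    return [[1.0 if perm[old] == new else 0.0 for old in range(n)] for new in range(n)]


def matvec(a, h):
    return [sum(a_ij * h_j for a_ij, h_j in zip(row, h)) for row in a]


def linear_rnn(word, start=0):
    """h_t = A(x_t) h_{t-1}: linear in h, with an input-dependent dense matrix."""
    h = [1.0 if i == start else 0.0 for i in range(3)]
    for letter in word:
        h = matvec(transition_matrix(PERMUTATIONS[letter]), h)
    return h.index(1.0)


def dfa(word, start=0):
    """Reference: step the automaton directly."""
    state = start
    for letter in word:
        state = PERMUTATIONS[letter][state]
    return state


print(linear_rnn("sc"), linear_rnn("cs"))  # order matters: the permutations don't commute
```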

The Illusion of State in State-Space Models

State-space models (SSMs) have emerged as a potential alternative architecture for building large language models (LLMs) compared to the previously ubiquitous transformer architecture. One...

79 views, edited 15:28

Vol Building AGI

Modern pretraining uses attention masks that avoid crossing document boundaries. The Llama 3 authors say they use intra-document masking; Yi Tay (of Reka and UL2 fame; it is probably in the Reka tech report?) calls it "basic fundamental stuff"; and Amazon offers packing algorithms for batching that opt for a little padding instead of a lot of truncation.

nanoGPT-style random sampling of file offsets with back-to-back documents that we used for uk4b is way too yolo for modern standards.
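
For reference, a minimal sketch of an intra-document causal mask for one packed sequence. The function name and the `doc_ids` input format are assumptions for illustration; real code bases usually build this inside the attention kernel instead of materializing the full matrix.

```python
def intra_document_causal_mask(doc_ids):
    """mask[i][j] is True when position i may attend to position j.

    doc_ids[t] is the index of the document that token t belongs to
    within one packed training sequence.
    """
    n = len(doc_ids)
    return [[j <= i and doc_ids[i] == doc_ids[j] for j in range(n)] for i in range(n)]


packed = [0, 0, 0, 1, 1]  # a 3-token document followed by a 2-token document
mask = intra_document_causal_mask(packed)
for row in mask:
    print("".join("x" if allowed else "." for allowed in row))
# x....
# xx...
# xxx..
# ...x.
# ...xx
```

The mask is block-diagonal and lower-triangular: the second document never sees tokens of the first, which is what back-to-back sampling without masking gets wrong.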

https://x.com/yitayml/status/1781090183703572500

X (formerly Twitter)

Yi Tay (@YiTayML) on X

@PMinervini @yuzhaouoe Segmentation masks are like basic fundamental stuff that a code base should have though...

77 views, 22:47

Vol Building AGI

I was sure that training data curators add data explaining token-to-character relationships (_apple -> _a _p _p _l _e) to the pretraining mix for their Transformers. Transformers can only count tokens, not the things inside a token: that is completely opaque to the model! Looks like there's quite a bit of work still cut out for the rest of us.
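
A tiny sketch of how such spelling data could be generated. `spelling_example` is a hypothetical helper; the `_` marker and the line format are copied from the example above, not from any real pretraining pipeline.

```python
def spelling_example(word, space_marker="_"):
    """Format one 'token -> characters' line, e.g. _apple -> _a _p _p _l _e."""
    letters = " ".join(space_marker + ch for ch in word)
    return f"{space_marker}{word} -> {letters}"


for word in ["apple", "token"]:
    print(spelling_example(word))
```

Once each character is its own token on the right-hand side, counting letters turns back into counting tokens.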

Know your model's biases!

https://x.com/AnsongNi/status/1781179566993592828

75 views, edited 08:46

Vol Building AGI

In our current reality, we can't copy quantum states or clone their underlying particles. If you could clone, you could implement nonlinear dynamics with linear operations.

Here's the simplest neural network that uses cloning and bilinearity to produce nonlinear behavior and, by extension, can learn the XOR function.


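import torch
import torch.nn as nn
import torch.nn.functional as F

class Clone(nn.Module):
  def __init__(self):
    super().__init__()
    self.a = nn.Parameter(torch.rand(2, 2))    # mixes the two inputs
    self.out = nn.Parameter(torch.rand(1, 2))  # linear readout

  def forward(self, x):
    xa = F.linear(x, self.a)
    # "Cloning": multiply xa by a copy of itself, which is bilinear in (xa, xa).
    y = F.linear(xa * xa, self.out)
    return y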
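
To see that XOR is within reach of this network, here is the same forward pass in plain Python with hand-picked weights. The weights are an assumption chosen for illustration, not learned values; they make the output equal to (x1 - x2)^2, which is XOR on binary inputs.

```python
def clone_forward(x, a, out):
    """Same computation as Clone.forward for one input vector, in plain Python."""
    xa = [sum(w * xi for w, xi in zip(row, x)) for row in a]  # linear map
    squared = [v * v for v in xa]                             # two copies multiplied
    return sum(w * s for w, s in zip(out[0], squared))        # linear readout


# Hand-picked weights (not learned): y = (x1 - x2)^2.
A = [[1.0, -1.0], [0.0, 0.0]]
OUT = [[1.0, 0.0]]

for x1 in (0.0, 1.0):
    for x2 in (0.0, 1.0):
        print(int(x1), int(x2), clone_forward([x1, x2], A, OUT))
```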

74 views, edited 13:53

Vol Building AGI

HuggingFace has released a glimpse into datasets of the future: observation-action sequences from various MDPs like Atari games interspersed with text to use with behavioral cloning, where BC is RL speak for autoregressive language modeling: https://huggingface.co/blog/jat

The general theme is that you use whatever tools are at your disposal to generate execution traces: online RL for Atari games, decision trees for tabular data, word sequences for speech recognition given audio tokens, tool use tokens for when you need to, you know, perform a web search given a query or run a calculator. Add some natural-language explanations and regular texts and then compress everything into a single transformer.

Since a transformer is a latent variable model, given a test-time trace prompt it performs latent model selection for you — it can decide if your task needs a simulation of xgboost or policy iteration or an interpolation of the two.

It's time to go beyond filtering Common Crawl.

Jack of All Trades, Master of Some, a Multi-Purpose Transformer Agent

We’re on a journey to advance and democratize artificial intelligence through open source and open science.

82 views, 09:26

Vol Building AGI

NLP kings of ETH have shown how transformers can exactly represent n-gram language models. This work is a great example of the interpretability you get by making representations sparse, which is easy to imagine (I have a hard time imagining word2vec-style distributed representations but can imagine one-hot encoding), and by using hard attention, so there are fewer superposition effects to think about.
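
A loose illustration of the idea, not the paper's construction: with one-hot token representations and hard attention, n-1 heads can each copy one of the previous tokens, and the copied one-hots decode back into an exact n-gram history. All function names and the table format are hypothetical.

```python
def one_hot(index, size):
    return [1.0 if i == index else 0.0 for i in range(size)]


def hard_attention_head(embeddings, position, offset):
    """Attend with weight 1 to the token `offset` steps back, weight 0 elsewhere."""
    source = position - offset
    size = len(embeddings[0])
    weights = [1.0 if t == source else 0.0 for t in range(len(embeddings))]
    return [sum(w * e[d] for w, e in zip(weights, embeddings)) for d in range(size)]


def ngram_next_token_probs(tokens, table, n, vocab_size):
    """Look up p(next | last n-1 tokens) using n-1 hard-attention heads."""
    embeddings = [one_hot(t, vocab_size) for t in tokens]
    last = len(tokens) - 1
    # Head k copies the one-hot of the token k positions before the last one.
    heads = [hard_attention_head(embeddings, last, k) for k in range(n - 1)]
    # Sparse one-hots decode back to symbols without any superposition.
    history = tuple(head.index(1.0) for head in reversed(heads))
    return table[history]


# Toy trigram table over a two-symbol vocabulary (made-up probabilities).
TABLE = {
    (0, 0): [0.9, 0.1],
    (0, 1): [0.2, 0.8],
    (1, 0): [0.5, 0.5],
    (1, 1): [0.3, 0.7],
}
print(ngram_next_token_probs([1, 0, 1], TABLE, n=3, vocab_size=2))
```

Every intermediate value here is either 0 or 1, which is what makes the computation easy to read off.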

https://arxiv.org/abs/2404.14994

Transformers Can Represent $n$-gram Language Models

Existing work has analyzed the representational capacity of the transformer architecture by means of formal models of computation. However, the focus so far has been on analyzing the architecture...

83 views, 08:13

Vol Building AGI

Training speed fluctuation when you're sharing your GPU with multiple other runs

95 views, edited 12:40

Vol Building AGI

HGRN2 implements their chunkwise recurrence inspired by my blog post on chunked scan: https://github.com/sustcsonglin/flash-linear-attention/blob/main/fla/ops/hgrn/chunk.py
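
For readers who haven't seen the trick, here is a scalar sketch of a chunked scan for the recurrence h_t = f_t * h_{t-1} + x_t. This is a plain-Python illustration of the general idea under my own naming, not the Triton kernel from flash-linear-attention.

```python
def sequential_scan(forgets, inputs, h0=0.0):
    """Reference: h_t = f_t * h_{t-1} + x_t, one step at a time."""
    h, out = h0, []
    for f, x in zip(forgets, inputs):
        h = f * h + x
        out.append(h)
    return out


def chunked_scan(forgets, inputs, chunk_size, h0=0.0):
    """Same recurrence, computed chunk by chunk.

    Inside a chunk we scan from a zero state and track the cumulative decay;
    the state carried in from earlier chunks is then added back, scaled by
    that decay. The zero-state scans of different chunks are independent.
    """
    out, carry = [], h0
    for start in range(0, len(inputs), chunk_size):
        f_chunk = forgets[start:start + chunk_size]
        x_chunk = inputs[start:start + chunk_size]
        local = sequential_scan(f_chunk, x_chunk)  # starts from zero
        decay, decays = 1.0, []
        for f in f_chunk:
            decay *= f
            decays.append(decay)  # product of forget gates so far in this chunk
        chunk_out = [l + d * carry for l, d in zip(local, decays)]
        out.extend(chunk_out)
        carry = chunk_out[-1]
    return out
```

Only one number per chunk (the carry) has to be passed along sequentially; everything else can run in parallel across chunks.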

flash-linear-attention/fla/ops/hgrn/chunk.py at main · sustcsonglin/flash-linear-attention

Efficient implementations of state-of-the-art linear attention models in Pytorch and Triton - sustcsonglin/flash-linear-attention

🔥1

247 views, 16:44

Vol Building AGI

ICLR features a lot of agent simulation work this year

https://webarena.dev/ — environment for agents solving tasks in the browser

https://weirdlabuw.github.io/asid/ — an exploration algorithm; explore task data in sim for *one shot* transfer to real

https://universal-simulator.github.io/unisim/ — Sora for training RL agents (done 6 months before OpenAI's Sora)

WebArena: A suite of benchmarks for building autonomous web agents.

69 views, edited 08:38

Vol Building AGI

Before simulating a real environment you need to capture it. Reality capture work recommended by Marius Memmel:

https://real-to-sim-to-real.github.io/RialTo/

https://sites.google.com/view/urdformer/home

Zoey Chen, Marius Memmel, Alex Fang, Aaron Walsman, Dieter Fox* and Abhishek Gupta*
University of Washington, Nvidia
*equal advising
(TGR workshop Oral, CoRL 2023)
We added more experiments and updated our website with code and more visualizations:
urdformer.github.io

59 views, 10:09

Vol Building AGI

xLSTM
https://arxiv.org/abs/2405.04517

xLSTM: Extended Long Short-Term Memory

In the 1990s, the constant error carousel and gating were introduced as the central ideas of the Long Short-Term Memory (LSTM). Since then, LSTMs have stood the test of time and contributed to...

56 views, 06:17

Vol Building AGI

xLSTM, part one: mLSTM, a parallel memory with a covariance update rule. My implementation uses Accelerated Scan, which is unnecessary because each head has a scalar forget gate. This architecture choice allows for massive speed improvements, which I’ll talk about later.
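
A simplified sketch of why the scalar gate makes a scan unnecessary. It keeps only the covariance update C_t = f_t * C_{t-1} + i_t * v_t k_t^T and the readout C_t q_t, and leaves out the normalizer state and the gate stabilization from the paper, so treat it as an illustration under those assumptions. With scalar gates the same outputs come from an attention-like sum with a decay weight per (t, s) pair.

```python
def mlstm_recurrent(queries, keys, values, forgets, input_gates):
    """C_t = f_t * C_{t-1} + i_t * v_t k_t^T, read out as C_t q_t (no normalizer)."""
    d = len(keys[0])
    memory = [[0.0] * d for _ in range(d)]
    outputs = []
    for q, k, v, f, i in zip(queries, keys, values, forgets, input_gates):
        memory = [[f * memory[r][c] + i * v[r] * k[c] for c in range(d)] for r in range(d)]
        outputs.append([sum(memory[r][c] * q[c] for c in range(d)) for r in range(d)])
    return outputs


def mlstm_parallel(queries, keys, values, forgets, input_gates):
    """Attention-like form: out_t = sum over s <= t of D[t][s] * (k_s . q_t) * v_s."""
    steps, d = len(keys), len(keys[0])
    outputs = []
    for t in range(steps):
        out = [0.0] * d
        for s in range(t + 1):
            decay = input_gates[s]
            for r in range(s + 1, t + 1):
                decay *= forgets[r]  # scalar gates just multiply up
            score = decay * sum(ks * qt for ks, qt in zip(keys[s], queries[t]))
            out = [o + score * vs for o, vs in zip(out, values[s])]
        outputs.append(out)
    return outputs
```

The parallel form is a masked matrix product in disguise, which is the kind of workload GPUs are fastest at.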

🔥1

57 views, 09:24