https://app.suno.ai/song/f44b528c-0bce-45fe-b5b6-bca579f97ed2
Hi Pedro, you tweeted:
This is outrageously wrong, Attention was invented at U. Montreal by Bahdanau, Cho and Bengio. Transformers were just an extension. This is the paper
that really invented modern AI.
THIS IS OUTRAGEOUSLY WRONG.
The first Transformer variant was published over 30 years ago.
It is now called "unnormalized linear Transformer".
THIS IS OUTRAGEOUSLY WRONG.
See my eye see em ell paper. Attention terminology was introduced in ninety three.
Back in the day, compute costs were high,
A million times more, oh my, oh my!
Transformers unnormalized, they'd fly,
Linear, efficient, no cry.
No quadratic woes, just linear stride,
Scaling with input, ain't no need to RAG.
Compute constraints, they pushed the tide,
In Transformer's journey, they'd confide.
So listen up, in the computational spree,
Efficiency's the key, it's plain to see.
Sequence length scalability, that's the decree,
In the Transformer world, we're wild and free!
THIS IS OUTRAGEOUSLY WRONG.
Vaswani did not cite this.
THIS IS OUTRAGEOUSLY WRONG.
Here is a well-known tweet on this
THIS IS OUTRAGEOUSLY WRONG.
Quadratic ChatGPT
THIS IS OUTRAGEOUSLY WRONG.
THIS IS OUTRAGEOUSLY WRONG.
MACHINE LEARNING IS THE SCIENCE OF CREDIT ASSIGNMENT
app.suno.ai
Credit Assignment | Suno
angry comedy hip hop song. Listen and make your own with Suno.
ETH has an Engineering-focused course on Category Theory, checking out its notes: https://applied-compositional-thinking.engineering/resources/
Parameter-shared state expansion is the next frontier in efficient RNNs. It's a somewhat cumbersome concept that has been explored in linear transformers but only became popular with Mamba.
Now, HGRN2 has introduced an elegant implementation of the simplest expansion rule: the outer product. An outer product multiplies two vectors of shape Nx1 and 1xN to make an NxN matrix, a quadratic number of pairwise multiplicative interactions.
Outer products work well with Accelerated Scan and are already supported in the mqar branch of Hippogriff.
https://twitter.com/SonglinYang4/status/1778897457159758010
X (formerly Twitter)
Songlin Yang (@SonglinYang4) on X
I really like HGRN2's concept of integrating forget gates with keys in (gated) linear attention. It's a neat and effective approach! We've incorporated HGRN2 into the Flash-Linear-Attention library. Check it out here: https://t.co/WPozxdRnTD.
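A rough sketch of the outer-product expansion in plain Python. This is not HGRN2's or Flash-Linear-Attention's code: the helper names `outer` and `step`, and the exact update rule (key tied to the forget gate as `1 - forget`), are my assumptions based on the description above.

```python
def outer(u, v):
    """Outer product of two length-N vectors: an N x N matrix of pairwise products."""
    return [[ui * vj for vj in v] for ui in u]


def step(state, forget, value):
    """One recurrent update with a matrix-valued state.

    Assumed rule (a sketch, not the library's code):
        state <- diag(forget) @ state + outer(1 - forget, value)
    """
    key = [1.0 - f for f in forget]  # key tied to the forget gate
    update = outer(key, value)
    return [
        [f * s + u for s, u in zip(s_row, u_row)]
        for f, s_row, u_row in zip(forget, state, update)
    ]


n = 3
state = [[0.0] * n for _ in range(n)]
state = step(state, forget=[0.5, 0.0, 1.0], value=[1.0, 2.0, 3.0])
```

The state grows from N to N x N entries while the gate and value vectors stay at size N, which is the "expansion" part.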
Wrote a small summary of Transformers, RNNs and Fast Weight Programmers using the language of transformers.
https://twitter.com/darkproger/status/1779123166561890357
X (formerly Twitter)
Volodymyr Kyrylov (@darkproger) on X
In a transformer there are as many keys as tokens, so softmax is used to choose one. In an RNN the number of keys is chosen up front.
In Hawk/RecurrentGemma the key dimension is 1, and the sigmoid gates can choose all keys, some or none.
A FWP has multidimensional…
The_Fisher_Darmois_Koopman_Pitman_theorem_for_random_processes.pdf
179.4 KB
Discovered this probability result by Fisher, Darmois, Koopman and Pitman (the F-D-K-P theorem):
an i.i.d. stream of data of arbitrary size can be summarized by a fixed, finite number of sufficient statistics (= the hidden state can be of fixed size and no information will be lost) iff the data comes from an exponential family of distributions (under regularity conditions, such as the support not depending on the parameter).
Normal, Bernoulli and MRFs are exponential families, but the uniform distribution is not. Weirdly enough, a lot of LM diagnostic tasks sample from uniform distributions.
The attached paper generalizes the result to random processes summarizable by Kalman filters, so it translates to single-layer linear RNNs.
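To make the "fixed-size hidden state" reading concrete, here is a minimal sketch for the Gaussian case: three numbers (count, sum, sum of squares) are enough to recover the same maximum-likelihood estimates as a pass over the whole stream. The class name `GaussianSummary` is made up for this illustration.

```python
import random


class GaussianSummary:
    """Fixed-size state (n, sum, sum of squares) for an i.i.d. Gaussian stream."""

    def __init__(self):
        self.n = 0
        self.s1 = 0.0
        self.s2 = 0.0

    def update(self, x):
        self.n += 1
        self.s1 += x
        self.s2 += x * x

    def mle(self):
        """Maximum-likelihood mean and variance computed from the summary alone."""
        mean = self.s1 / self.n
        var = self.s2 / self.n - mean * mean
        return mean, var


random.seed(0)
data = [random.gauss(2.0, 3.0) for _ in range(10_000)]

summary = GaussianSummary()
for x in data:
    summary.update(x)

# Batch estimates that look at the whole stream at once
batch_mean = sum(data) / len(data)
batch_var = sum((x - batch_mean) ** 2 for x in data) / len(data)
stream_mean, stream_var = summary.mle()
```

The two pairs of estimates agree up to floating-point error, no matter how long the stream gets.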
Meta is in the RNN game https://arxiv.org/abs/2404.08801
arXiv.org
Megalodon: Efficient LLM Pretraining and Inference with Unlimited...
The quadratic complexity and weak length extrapolation of Transformers limits their ability to scale to long sequences, and while sub-quadratic solutions like linear attention and state space...
https://arxiv.org/abs/2404.08819
A paper precisely demonstrating the difference between input-dependent forget gate vectors and input-dependent recurrence matrices — my intuition from Dyck-1 synthetic tasks suggested there is no difference between recurrence matrices and gated vectors, however there is one (I am super pissed I arrived at the wrong conclusion)!
A linear RNN with a non-diagonal recurrence matrix can simulate DFAs (solve word problems that involve state permutation) and stay linear — strictly better than a transformer!
arXiv.org
The Illusion of State in State-Space Models
State-space models (SSMs) have emerged as a potential alternative architecture for building large language models (LLMs) compared to the previously ubiquitous transformer architecture. One...
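A toy illustration of the point about non-diagonal recurrence matrices, in plain Python. It is my own sketch, not the paper's construction: the two-letter alphabet and the `TRANSITIONS` table are assumptions. Each token picks a permutation matrix, the hidden state is a one-hot vector, and the update `h = A[token] @ h` has no nonlinearity.

```python
def matvec(m, v):
    return [sum(mij * vj for mij, vj in zip(row, v)) for row in m]


def permutation_matrix(perm):
    """Matrix P with P @ onehot(i) == onehot(perm[i])."""
    n = len(perm)
    return [[1 if perm[j] == i else 0 for j in range(n)] for i in range(n)]


# A 3-state DFA whose transitions are permutations of the states.
TRANSITIONS = {
    "a": [1, 0, 2],  # swap states 0 and 1
    "b": [0, 2, 1],  # swap states 1 and 2
}
A = {tok: permutation_matrix(p) for tok, p in TRANSITIONS.items()}


def run_linear_rnn(tokens, start=0):
    h = [1 if i == start else 0 for i in range(3)]
    for tok in tokens:
        h = matvec(A[tok], h)  # input-dependent recurrence matrix, no nonlinearity
    return h.index(1)


def run_dfa(tokens, start=0):
    state = start
    for tok in tokens:
        state = TRANSITIONS[tok][state]
    return state


print(run_linear_rnn("ab"), run_linear_rnn("ba"))  # order matters
```

The two matrices do not commute ("ab" and "ba" end in different states), whereas diagonal matrices always commute, which is one way to see why elementwise gates alone are not enough here.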
Modern pretraining uses attention masks that avoid crossing document boundaries. Llama 3 authors claim they’ve been using intra-document masking, Yi Tay (of Reka and UL2 fame — probably it’s in the Reka tech report?) says it’s "basic fundamental stuff", Amazon offers packing algorithms for batching that opt for a little bit of padding instead of a lot of truncation.
nanoGPT-style random sampling of file offsets with back-to-back documents that we used for uk4b is way too yolo for modern standards.
https://x.com/yitayml/status/1781090183703572500
X (formerly Twitter)
Yi Tay (@YiTayML) on X
@PMinervini @yuzhaouoe Segmentation masks are like basic fundamental stuff that a code base should have though...
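A minimal sketch of what an intra-document (segmentation) mask looks like for a packed sequence. The function name `intra_document_causal_mask` and the `doc_ids` layout are my own choices for illustration, not any particular codebase's API.

```python
def intra_document_causal_mask(doc_ids):
    """mask[i][j] is True when position i may attend to position j."""
    n = len(doc_ids)
    return [
        [j <= i and doc_ids[i] == doc_ids[j] for j in range(n)]
        for i in range(n)
    ]


# Three documents of lengths 2, 3 and 1 packed into one sequence of 6 tokens
doc_ids = [0, 0, 1, 1, 1, 2]
mask = intra_document_causal_mask(doc_ids)
for row in mask:
    print("".join("x" if allowed else "." for allowed in row))
```

The result is block-diagonal and lower-triangular inside each block: every token sees only earlier tokens of its own document.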
I was sure training data curators add data that explains token to character relationships into pretraining for their Transformers. (_apple -> _a _p _p _l _e). Transformers can only count tokens, not count things inside the token — that is completely opaque to the model! Looks like there's quite a bit of work still cut out for the rest of us.
Know your model's biases!
https://x.com/AnsongNi/status/1781179566993592828
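For reference, the kind of token-to-character data I had in mind could be generated with something as small as this. It is a hypothetical helper: `spelling_example` and the `_` space marker are assumptions that mimic the `_apple` notation above, not any tokenizer's real API.

```python
def spelling_example(token, space_marker="_"):
    """Map a token such as '_apple' to its character-level spelling."""
    word = token[len(space_marker):] if token.startswith(space_marker) else token
    pieces = " ".join(space_marker + ch for ch in word)
    return f"{token} -> {pieces}"


print(spelling_example("_apple"))
```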
In our current reality, we can't copy quantum states or clone their underlying particles. If you could clone, you could implement nonlinear dynamics with linear operations.
Here's the simplest neural network that uses cloning and bilinearity to get nonlinear behavior and, by extension, can learn the XOR function.
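import torch
import torch.nn as nn
import torch.nn.functional as F


class Clone(nn.Module):
    def __init__(self):
        super().__init__()
        self.a = nn.Parameter(torch.rand(2, 2))
        self.out = nn.Parameter(torch.rand(1, 2))

    def forward(self, x):
        # Linear projection, then multiply the projection by its own clone
        xa = F.linear(x, self.a)
        y = F.linear(xa * xa, self.out)
        return y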
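To check that this architecture can represent XOR at all, here is a plain-Python mirror of the forward pass of `Clone` with hand-picked weights. The weights are my assumption for illustration, not learned values, and `clone_forward` is a made-up helper that avoids the torch dependency. The trick is that XOR(x1, x2) = (x1 - x2)^2 on binary inputs.

```python
def clone_forward(x, a, out):
    """Plain-Python mirror of Clone.forward for a single input vector x."""
    xa = [sum(w * xi for w, xi in zip(row, x)) for row in a]  # F.linear(x, a)
    squared = [v * v for v in xa]                             # xa * xa
    return sum(w * s for w, s in zip(out, squared))           # F.linear(..., out)


# Hand-picked weights: the first hidden unit computes x1 - x2, the second is unused.
a = [[1.0, -1.0], [0.0, 0.0]]
out = [1.0, 0.0]

for x1 in (0, 1):
    for x2 in (0, 1):
        print(x1, x2, clone_forward([x1, x2], a, out))
```

Whether gradient descent finds such weights from a `torch.rand` initialization is a separate question that this sketch doesn't answer.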
HuggingFace has released a glimpse into datasets of the future: observation-action sequences from various MDPs like Atari games interspersed with text to use with behavioral cloning, where BC is RL speak for autoregressive language modeling: https://huggingface.co/blog/jat
The general theme is that you use whatever tools are at your disposal to generate execution traces: online RL for Atari games, decision trees for tabular data, word sequences for speech recognition given audio tokens, and tool-use tokens for when you need to, say, perform a web search given a query or run a calculator. Add some natural-language explanations and regular text, then compress everything into a single transformer.
Since a transformer is a latent variable model, given a test-time trace prompt it performs latent model selection for you — it can decide if your task needs a simulation of xgboost or policy iteration or an interpolation of the two.
It's time to go beyond filtering Common Crawl.
huggingface.co
Jack of All Trades, Master of Some, a Multi-Purpose Transformer Agent
We’re on a journey to advance and democratize artificial intelligence through open source and open science.
NLP kings of ETH have shown how transformers can exactly represent n-gram language models. This work is a great example of transformer interpretability that you get by making representations sparse, which is easy to imagine (I have a hard time imagining word2vec-style distributed representations but can imagine one-hot encodings), and by using hard attention, so there are fewer superposition effects to think about.
https://arxiv.org/abs/2404.14994
arXiv.org
Transformers Can Represent $n$-gram Language Models
Existing work has analyzed the representational capacity of the transformer architecture by means of formal models of computation. However, the focus so far has been on analyzing the architecture...
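Here is how I picture the sparse version, as a toy trigram model in plain Python. It is loosely inspired by the idea and is not the paper's construction: the vocabulary, the `TRIGRAM` table and the helper names are all made up. Each "head" puts weight 1 on a single earlier position and copies its one-hot token, and the copied tokens index the n-gram table.

```python
VOCAB = ["<s>", "a", "b"]

# Hypothetical trigram table: (w[t-2], w[t-1]) -> distribution over the next token.
TRIGRAM = {
    ("<s>", "<s>"): {"a": 1.0},
    ("<s>", "a"): {"b": 1.0},
    ("a", "b"): {"a": 0.5, "b": 0.5},
}


def one_hot(token):
    return [1 if token == v else 0 for v in VOCAB]


def hard_attention_head(sequence, t, offset):
    """Attend with weight 1 to position t - offset and 0 elsewhere."""
    weights = [1 if j == t - offset else 0 for j in range(len(sequence))]
    values = [one_hot(tok) for tok in sequence]
    return [sum(w * v[d] for w, v in zip(weights, values)) for d in range(len(VOCAB))]


def next_token_distribution(sequence):
    padded = ["<s>", "<s>"] + list(sequence)
    t = len(padded)  # position being predicted
    context = tuple(
        VOCAB[hard_attention_head(padded, t, offset).index(1)]
        for offset in (2, 1)  # one head per context slot
    )
    return TRIGRAM[context]


print(next_token_distribution(["a", "b"]))
```

With one-hot values and 0/1 attention weights, every intermediate vector is readable by eye, which is the appeal.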
HGRN2 implements their chunkwise recurrence inspired by my blog post on chunked scan: https://github.com/sustcsonglin/flash-linear-attention/blob/main/fla/ops/hgrn/chunk.py
GitHub
flash-linear-attention/fla/ops/hgrn/chunk.py at main · sustcsonglin/flash-linear-attention
Efficient implementations of state-of-the-art linear attention models in Pytorch and Triton - sustcsonglin/flash-linear-attention
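The chunkwise idea, sketched in plain Python for the scalar recurrence h_t = a_t * h_{t-1} + x_t. This is a simplified illustration under my own naming (`sequential_scan`, `chunked_scan`), not the kernel in `fla/ops/hgrn/chunk.py`.

```python
def sequential_scan(a, x):
    """Reference recurrence: h_t = a_t * h_{t-1} + x_t with h_0 = 0."""
    h, out = 0.0, []
    for a_t, x_t in zip(a, x):
        h = a_t * h + x_t
        out.append(h)
    return out


def chunked_scan(a, x, chunk_size):
    out, carry = [], 0.0
    for start in range(0, len(x), chunk_size):
        a_chunk = a[start:start + chunk_size]
        x_chunk = x[start:start + chunk_size]
        # Intra-chunk part: does not depend on other chunks, so chunks could run in parallel.
        local = sequential_scan(a_chunk, x_chunk)
        decay, prod = [], 1.0
        for a_t in a_chunk:
            prod *= a_t
            decay.append(prod)
        # Inter-chunk part: inject the state carried over from the previous chunks.
        chunk_out = [l + d * carry for l, d in zip(local, decay)]
        out.extend(chunk_out)
        carry = chunk_out[-1]
    return out


a = [0.5, 0.5, 2.0, 1.0, 0.5]
x = [1.0, 2.0, 3.0, 4.0, 5.0]
print(sequential_scan(a, x))
print(chunked_scan(a, x, chunk_size=2))
```

Both calls print the same values. In a real kernel the intra-chunk work runs in parallel and only the carry is propagated sequentially across chunks.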
ICLR features a lot of agent simulation work this year:
https://webarena.dev/ — environment for agents solving tasks in the browser
https://weirdlabuw.github.io/asid/ — an exploration algorithm; explore task data in sim for *one shot* transfer to real
https://universal-simulator.github.io/unisim/ — sora for training RL agents (done 6 months before openai sora)
webarena.dev
WebArena-x
WebArena: A suite of benchmarks for building autonomous web agents.
Before simulating a real environment you need to perform capture. Reality capture work recommended by Marius Memmel
https://real-to-sim-to-real.github.io/RialTo/
https://sites.google.com/view/urdformer/home
Google
URDFormer
Zoey Chen, Marius Memmel, Alex Fang, Aaron Walsman, Dieter Fox* and Abhishek Gupta*
University of Washington, Nvidia
*equal advising
(TGR workshop Oral, CoRL 2023)
We added more experiments and updated our website with code and more visualizations:
urdformer.github.io