Modern pretraining uses attention masks that do not cross document boundaries. The Llama 3 authors say they use intra-document masking; Yi Tay (of Reka and UL2 fame; it is probably in the Reka tech report?) calls segmentation masks "basic fundamental stuff"; and Amazon offers packing algorithms for batching that accept a little padding in exchange for much less truncation.
The nanoGPT-style random sampling of file offsets over back-to-back documents that we used for uk4b is way too yolo by modern standards.
https://x.com/yitayml/status/1781090183703572500
Yi Tay (@YiTayML) on X
@PMinervini @yuzhaouoe Segmentation masks are like basic fundamental stuff that a code base should have though...
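A minimal sketch of what an intra-document (segmentation) mask looks like for one packed row. The `doc_ids` list, which assigns each token the index of the document it came from, is a hypothetical input for illustration; real code bases build this on the accelerator, usually as a boolean tensor.

```python
def intra_document_causal_mask(doc_ids):
    """mask[q][k] is True when query position q may attend to key position k.

    A key is visible only if it is not in the future (k <= q) and belongs
    to the same document as the query.
    """
    n = len(doc_ids)
    return [
        [k <= q and doc_ids[k] == doc_ids[q] for k in range(n)]
        for q in range(n)
    ]


# Two documents of lengths 3 and 2 packed back to back into one row.
doc_ids = [0, 0, 0, 1, 1]
mask = intra_document_causal_mask(doc_ids)
for row in mask:
    print("".join("x" if allowed else "." for allowed in row))
# x....
# xx...
# xxx..
# ...x.
# ...xx
```

The mask is block-diagonal and lower-triangular, so the second document never sees tokens from the first one.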
I was sure training data curators add data that explains token-to-character relationships to the pretraining mix for their Transformers (_apple -> _a _p _p _l _e). Transformers can only count tokens, not things inside a token: the contents of a token are completely opaque to the model! Looks like there's quite a bit of work still cut out for the rest of us.
Know your model's biases!
https://x.com/AnsongNi/status/1781179566993592828
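A small sketch of how such spelling data could be generated, following the `_apple -> _a _p _p _l _e` format above. The function name and the `marker` argument are made up for illustration, and treating `_` as the word-boundary marker is an assumption about the tokenizer.

```python
def spelling_example(token, marker="_"):
    """Render one token-to-character training line for a single token."""
    # Strip the word-boundary marker, if present, to get the raw characters.
    word = token[len(marker):] if token.startswith(marker) else token
    letters = " ".join(marker + ch for ch in word)
    return f"{token} -> {letters}"


print(spelling_example("_apple"))
# _apple -> _a _p _p _l _e
```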
In our current reality, we can't copy quantum states or clone their underlying particles. If you could clone, you could implement nonlinear dynamics with linear operations.
Here's the simplest neural network that uses cloning and bilinearity to execute nonlinear behavior and, by extension, can learn the XOR function.
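import torch
import torch.nn as nn
import torch.nn.functional as F

class Clone(nn.Module):
    def __init__(self):
        super().__init__()
        self.a = nn.Parameter(torch.rand(2, 2))    # input projection
        self.out = nn.Parameter(torch.rand(1, 2))  # linear readout

    def forward(self, x):
        xa = F.linear(x, self.a)
        # two copies ("clones") of xa multiplied elementwise:
        # bilinear in the copies, quadratic in x
        y = F.linear(xa * xa, self.out)
        return y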
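To see that this architecture can represent XOR at all, here is a dependency-free check with hand-picked weights. The weights are my own choice for illustration, not a trained solution: the first projection computes x1 - x2, squaring gives (x1 - x2)^2, and that equals XOR on binary inputs.

```python
def linear(x, weight):
    """Bias-free linear map with the same convention as F.linear: y = x @ weight.T."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in weight]


def clone_forward(x, a, out):
    xa = linear(x, a)
    squared = [v * v for v in xa]  # xa * xa, the product of two copies
    return linear(squared, out)[0]


# Hand-picked weights (an assumption for illustration).
a = [[1.0, -1.0], [0.0, 0.0]]
out = [[1.0, 0.0]]

for x1 in (0.0, 1.0):
    for x2 in (0.0, 1.0):
        print(int(x1), int(x2), "->", clone_forward([x1, x2], a, out))
```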
HuggingFace has released a glimpse into datasets of the future: observation-action sequences from various MDPs like Atari games interspersed with text to use with behavioral cloning, where BC is RL speak for autoregressive language modeling: https://huggingface.co/blog/jat
The general theme is that you use whatever tools are at your disposal to generate execution traces: online RL for Atari games, decision trees for tabular data, word sequences for speech recognition given audio tokens, tool-use tokens for when you need to perform a web search given a query or run a calculator. Add some natural-language explanations and regular text, then compress everything into a single transformer.
Since a transformer is a latent variable model, given a test-time trace prompt it performs latent model selection for you — it can decide if your task needs a simulation of xgboost or policy iteration or an interpolation of the two.
It's time to go beyond filtering Common Crawl.
Jack of All Trades, Master of Some, a Multi-Purpose Transformer Agent
NLP kings of ETH have shown how transformers can exactly represent n-gram language models. This work is a great example of the interpretability you get by making transformer representations sparse, which makes them easy to imagine (I have a hard time imagining word2vec-style distributed representations but can imagine a one-hot encoding), and by using hard attention, so there are fewer superposition effects to think about.
https://arxiv.org/abs/2404.14994
Transformers Can Represent $n$-gram Language Models
Existing work has analyzed the representational capacity of the transformer architecture by means of formal models of computation. However, the focus so far has been on analyzing the architecture...
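A loose illustration of the intuition, not the paper's construction: each "head" puts all of its attention weight on one fixed offset into the past, which with one-hot weights just selects a token, and the selected context then indexes a count table. All function names here are hypothetical.

```python
from collections import Counter, defaultdict


def fit_ngram(tokens, n):
    """Count next-token occurrences for every (n-1)-token context."""
    counts = defaultdict(Counter)
    for t in range(n - 1, len(tokens)):
        context = tuple(tokens[t - n + 1:t])
        counts[context][tokens[t]] += 1
    return counts


def hard_attention(tokens, t, offset):
    """One head: all attention weight goes to position t - offset."""
    weights = [1.0 if k == t - offset else 0.0 for k in range(len(tokens))]
    # With one-hot weights the weighted sum reduces to picking one token.
    return next(tok for tok, w in zip(tokens, weights) if w == 1.0)


def next_token_distribution(counts, tokens, n):
    t = len(tokens)  # the position being predicted
    context = tuple(hard_attention(tokens, t, off) for off in range(n - 1, 0, -1))
    total = sum(counts[context].values())
    return {tok: c / total for tok, c in counts[context].items()}


corpus = "a b a b a c".split()
counts = fit_ngram(corpus, n=2)
dist = next_token_distribution(counts, ["b", "a"], n=2)
print(dist)
```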
HGRN2 implements their chunkwise recurrence inspired by my blog post on chunked scan: https://github.com/sustcsonglin/flash-linear-attention/blob/main/fla/ops/hgrn/chunk.py
Efficient implementations of state-of-the-art linear attention models in Pytorch and Triton - sustcsonglin/flash-linear-attention
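For readers who have not seen the idea, here is a plain-Python sketch of a chunked scan for the linear recurrence h_t = f_t * h_{t-1} + x_t. It illustrates the general technique only and is not the Triton kernel linked above: each chunk is scanned from a zero state, and the state carried in from the previous chunk is added back, scaled by the running product of the gates.

```python
def sequential_scan(f, x):
    """Reference: h_t = f_t * h_{t-1} + x_t with h_{-1} = 0."""
    h, out = 0.0, []
    for ft, xt in zip(f, x):
        h = ft * h + xt
        out.append(h)
    return out


def chunked_scan(f, x, chunk):
    out, carry = [], 0.0
    for start in range(0, len(x), chunk):
        local_h, decay = 0.0, 1.0
        local_states, decays = [], []
        # Scan inside the chunk as if the incoming state were zero.
        for ft, xt in zip(f[start:start + chunk], x[start:start + chunk]):
            local_h = ft * local_h + xt
            decay *= ft  # cumulative product of gates since the chunk start
            local_states.append(local_h)
            decays.append(decay)
        # Fix up with the state carried over from the previous chunk.
        out.extend(h + d * carry for h, d in zip(local_states, decays))
        carry = out[-1]
    return out


f = [0.9, 0.5, 0.8, 1.0, 0.3, 0.7, 0.6]
x = [1.0, 2.0, -1.0, 0.5, 3.0, -2.0, 1.5]
reference = sequential_scan(f, x)
for chunk in (1, 2, 3, 7):
    result = chunked_scan(f, x, chunk)
    print(chunk, max(abs(r - s) for r, s in zip(result, reference)))
```

The per-chunk inner loops are independent of each other, which is what makes the within-chunk work parallelizable; only the short carry pass is sequential.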
ICLR features a lot of agent simulation work this year:
https://webarena.dev/ — environment for agents solving tasks in the browser
https://weirdlabuw.github.io/asid/ — an exploration algorithm; explore task data in sim for *one shot* transfer to real
https://universal-simulator.github.io/unisim/ — sora for training RL agents (done 6 months before openai sora)
Before simulating a real environment you need to capture it. Reality capture work recommended by Marius Memmel:
https://real-to-sim-to-real.github.io/RialTo/
https://sites.google.com/view/urdformer/home
URDFormer
Zoey Chen, Marius Memmel, Alex Fang, Aaron Walsman, Dieter Fox* and Abhishek Gupta*
University of Washington, Nvidia
*equal advising
(TGR workshop Oral, CoRL 2023)
We added more experiments and updated our website with code and more visualizations:
urdformer.github.io
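The definition of stabilize:
max_scan is based on a Blelloch scan with a (max, +) semiring
import torch.nn.functional as F

def stabilize(f_, i_):
    """Stabilize and activate the forget and input gates."""
    # max_scan is defined elsewhere (not shown here)
    m = max_scan(f_, i_)
    # shift m right by one step along the time axis, padding with zeros
    m_prev = F.pad(m[:, :-1, :], (0, 0, 1, 0))
    i = (i_ - m).exp()
    f = (f_ + m_prev - m).sigmoid()
    return f, i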
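Since `max_scan` itself is not shown, here is a hedged sketch of what a (max, +) scan can look like. I am assuming the recurrence m_t = max(f_t + m_{t-1}, i_t) with m_{-1} = 0; that is my reading, not something stated above. Each step is a map m -> max(a + m, b), and two such maps compose into another of the same form, which is what makes a parallel scan possible. For brevity the parallel version below uses Hillis-Steele doubling rather than a Blelloch up-sweep/down-sweep.

```python
def max_scan_sequential(f, i, m0=0.0):
    """Reference recurrence (assumed): m_t = max(f_t + m_{t-1}, i_t)."""
    m, out = m0, []
    for ft, it in zip(f, i):
        m = max(ft + m, it)
        out.append(m)
    return out


def combine(left, right):
    """Compose m -> max(a + m, b) maps; `left` is applied first."""
    a1, b1 = left
    a2, b2 = right
    return (a1 + a2, max(a2 + b1, b2))


def max_scan_parallel(f, i, m0=0.0):
    elems = list(zip(f, i))
    step = 1
    while step < len(elems):
        # Every position combines with the element `step` places to its left;
        # the comprehension reads only the previous round's values.
        elems = [
            elems[t] if t < step else combine(elems[t - step], elems[t])
            for t in range(len(elems))
        ]
        step *= 2
    return [max(a + m0, b) for a, b in elems]


f_log = [-0.1, -2.0, -0.5, -0.05, -3.0, -0.2]
i_log = [0.3, -1.0, 4.0, 0.0, 2.5, -6.0]
print(max_scan_sequential(f_log, i_log))
print(max_scan_parallel(f_log, i_log))
```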
https://x.com/haoailab/status/1788269848788869299?
Diffusion-style consistency training has been combined with Jacobi decoding to get a 3x speedup over autoregressive loops. Consistency finetuning can be applied to existing autoregressive LMs to get efficient inference.
This might be more important than maintaining constant memory during autoregressive looping (working on RNNs).
Hao AI Lab (@haoailab) on X
People often see LLMs as sequential decoders, but we show they can be easily adapted as fast parallel decoders!🔥🚀
Announcing consistency LLMs: teaching LLMs to predict the fixed point from any point on its Jacobi decoding trajectory
- LLM can fast forward…
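A toy sketch of plain Jacobi decoding, to make the fixed-point view concrete. `next_token` is a made-up deterministic stand-in for a greedy LM step, not a real model. All n positions are refreshed from the previous guess at once, and the iteration stops when nothing changes; the fixed point matches ordinary greedy decoding.

```python
def next_token(prefix):
    """Hypothetical stand-in for a greedy LM step over a small vocabulary."""
    return (7 * sum(prefix) + 3 * len(prefix)) % 11


def greedy_decode(prompt, n):
    seq = list(prompt)
    for _ in range(n):
        seq.append(next_token(seq))
    return seq[len(prompt):]


def jacobi_decode(prompt, n):
    guess = [0] * n  # arbitrary initial guess for all n future tokens
    iterations = 0
    while True:
        # Each position is recomputed from the previous guess only, so on
        # real hardware this is a single batched forward pass.
        new = [next_token(list(prompt) + guess[:k]) for k in range(n)]
        iterations += 1
        if new == guess:
            return guess, iterations
        guess = new


prompt = [4, 2]
tokens, iterations = jacobi_decode(prompt, n=8)
print(tokens == greedy_decode(prompt, n=8), iterations)
```

With this toy model there is no real speedup: in the worst case only one extra token becomes correct per iteration. As I read the announcement, the finetuning is meant to make many positions land on the fixed point early.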
https://github.com/shashankvkt/DoRA_ICLR24
Pretraining on object tracking on 10 long-form videos beats ImageNet pretraining.
This repo contains the official implementation of the ICLR 2024 paper "Is ImageNet worth 1 video? Learning strong image encoders from 1 long unlabelled video" - shashankvkt/DoRA_ICLR24
An insight from Shida Wang — nonlinearity of recurrence impacts universal approximation ability but not memory capacity. Exponential parametrization of the recurrence operator (like xLSTM) improves the optimization landscape.
https://github.com/radarFudan/Curse-of-memory
Curse-of-memory phenomenon of RNNs in sequence modelling - radarFudan/Curse-of-memory
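A tiny sketch of what an exponential parametrization can look like for a scalar linear recurrence h_t = lam * h_{t-1} + x_t. The specific form lam = exp(-exp(w)) is one common choice that I am assuming here; it is not taken from the linked repo. The point is that any real w gives a decay strictly between 0 and 1, so gradient steps on w cannot push the recurrence into the unstable region.

```python
import math


def decay_direct(w):
    """Direct parametrization: the trainable weight is the decay itself."""
    return w  # nothing keeps this inside (0, 1)


def decay_exponential(w):
    """Assumed exponential parametrization: always inside (0, 1)."""
    return math.exp(-math.exp(w))


for w in (-5.0, -1.0, 0.0, 1.0, 1.5):
    print(w, decay_direct(w), decay_exponential(w))
```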
Kyunghyun Cho is building an AI drug design system. He is also known for GRU and contributions to machine translation.
He is currently working on protein sequence design: generative modeling over a database of sequences, property classification of the generated samples, and black-box optimization to find sample sets on the Pareto frontiers of multiple objectives induced by the property classifiers (translation people: think of Minimum Bayes Risk), which are then sent to the lab for validation. This loop describes the second and third pipeline steps in the photo. Eventually he wants to *backprop* through the whole loop.
Currently the forward pass takes more than 100 years, from discovering the role of the pancreas in diabetes to the approval of semaglutide.
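To make the Pareto frontier step concrete, here is a minimal sketch of selecting the non-dominated candidates when every objective is "higher is better". The candidate names and scores are invented for illustration and have nothing to do with the actual system.

```python
def dominates(a, b):
    """a dominates b: at least as good on every objective, strictly better on one."""
    return all(x >= y for x, y in zip(a, b)) and any(x > y for x, y in zip(a, b))


def pareto_front(candidates):
    """Names of candidates that no other candidate dominates."""
    return [
        name
        for name, score in candidates.items()
        if not any(dominates(other, score) for other in candidates.values())
    ]


# Hypothetical (property_1, property_2) classifier scores per sequence.
candidates = {
    "seq_a": (0.9, 0.2),
    "seq_b": (0.5, 0.5),
    "seq_c": (0.4, 0.4),  # dominated by seq_b
    "seq_d": (0.1, 0.95),
}
print(pareto_front(candidates))
# ['seq_a', 'seq_b', 'seq_d']
```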
Research hint from Yann LeCun: figure out where transformer loss spikes come from. They don’t usually happen in convnets. My thought: unlike Transformers, convnets do not have input-dependent weights.
Also work on Q* with hierarchical time
My favorite AI paper from ICLR is OMNI, which is also my favorite NeurIPS workshop paper by Jenny Zhang et al.
Jenny builds on an idea that Juergen calls PowerPlay: you give your AI agent tasks that are
1. Learnable, by measuring learning progress as a fraction of successes. This can also be used to track forgetting.
2. Interesting, by the human notion of interestingness encoded in GPT-4.
We already knew how to do (1).
(2) has been attempted before with measures like novelty or artificial curiosity, but novel stuff is not always interesting! This is where you need GPT to get truly general agents: GPT has learned interestingness from Reddit experts and all scientific papers.
https://www.jennyzhangzt.com/omni/
Reinforcement learning followed by mechanistic interpretability on mice modulated by ketamine