Vol Building AGI – Telegram

Vol Building AGI

581 subscribers

116 photos

9 videos

12 files

199 links

Past topics: speech synthesis, transformers, LSTM, recurrence

Vol Building AGI

Modern pretraining uses attention masks that do not cross document boundaries. The Llama 3 authors say they use intra-document masking; Yi Tay (of Reka and UL2 fame; it is probably in the Reka tech report) calls it "basic fundamental stuff"; and Amazon offers packing algorithms for batching that opt for a little bit of padding instead of a lot of truncation.

nanoGPT-style random sampling of file offsets with back-to-back documents, which we used for uk4b, is way too yolo by modern standards.
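
A minimal sketch of intra-document masking for a packed sequence, in plain Python so it runs without dependencies. The function name and the `doc_ids` layout are assumptions for illustration, not taken from Llama 3 or any particular code base: each position carries the id of the document it came from, and a query may attend to a key only if the key is not in the future and belongs to the same document.

```python
def intra_document_causal_mask(doc_ids):
    """mask[q][k] is True when query position q may attend to key position k."""
    n = len(doc_ids)
    return [
        [k <= q and doc_ids[k] == doc_ids[q] for k in range(n)]
        for q in range(n)
    ]


# Three documents of lengths 2, 3 and 1 packed into one sequence of 6 tokens.
doc_ids = [0, 0, 1, 1, 1, 2]
mask = intra_document_causal_mask(doc_ids)
for row in mask:
    print("".join("x" if allowed else "." for allowed in row))
# x.....
# xx....
# ..x...
# ..xx..
# ..xxx.
# .....x
```

The result is a block-diagonal causal mask: the second document never sees the first, even though they sit back to back in the same training sequence.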

https://x.com/yitayml/status/1781090183703572500

Yi Tay (@YiTayML) on X

@PMinervini @yuzhaouoe Segmentation masks are like basic fundamental stuff that a code base should have though...

77 views, 22:47

Vol Building AGI

I was sure training data curators add data explaining token-to-character relationships (_apple -> _a _p _p _l _e) to the pretraining mix for their Transformers. Transformers can only count tokens, not things inside a token: that is completely opaque to the model! Looks like there's still quite a bit of work cut out for the rest of us.
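
A sketch of what such spelling data could look like. The helper name and the `_` word-boundary marker convention are assumptions for illustration; a real pipeline would follow its own tokenizer's conventions.

```python
def spelling_example(token, space_marker="_"):
    """Format one training line that spells a token out character by character."""
    word = token.lstrip(space_marker)
    letters = " ".join(space_marker + ch for ch in word)
    return f"{token} -> {letters}"


print(spelling_example("_apple"))  # _apple -> _a _p _p _l _e
```

Lines like this give the model explicit evidence about what is inside a token, which it otherwise never observes.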

Know your model's biases!

https://x.com/AnsongNi/status/1781179566993592828

75 views, edited 08:46

Vol Building AGI

In our current reality, we can't copy quantum states or clone their underlying particles. If you could clone, you could implement nonlinear dynamics with linear operations.

Here's the simplest neural network that uses cloning and bilinearity to produce nonlinear behavior and, by extension, can learn the XOR function.


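import torch
import torch.nn as nn
import torch.nn.functional as F

class Clone(nn.Module):
  def __init__(self):
    super().__init__()
    # Two weight matrices, no biases and no activation function.
    self.a = nn.Parameter(torch.rand(2, 2))
    self.out = nn.Parameter(torch.rand(1, 2))

  def forward(self, x):
    xa = F.linear(x, self.a)
    # "Cloning": the same linear output is used twice and multiplied by itself.
    y = F.linear(xa * xa, self.out)
    return y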
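
To see why squaring a cloned value is enough for XOR, here is the same forward pass in plain Python with hand-picked weights. The weights are an assumption chosen for illustration, not learned values: with `a = [[1, -1], [0, 0]]` and `out = [1, 0]` the network computes (x1 - x2)^2, which equals XOR on binary inputs.

```python
def clone_forward(x, a, out):
    # xa = a @ x, the first linear map (same convention as F.linear)
    xa = [sum(w * v for w, v in zip(row, x)) for row in a]
    # "cloning": the same vector is used twice and multiplied elementwise
    squared = [v * v for v in xa]
    # final linear readout to a single number
    return sum(w * v for w, v in zip(out, squared))


a = [[1.0, -1.0], [0.0, 0.0]]  # hand-picked, not trained
out = [1.0, 0.0]
for x in [(0, 0), (0, 1), (1, 0), (1, 1)]:
    print(x, clone_forward(x, a, out))
# (0, 0) 0.0
# (0, 1) 1.0
# (1, 0) 1.0
# (1, 1) 0.0
```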

74 views, edited 13:53

Vol Building AGI

HuggingFace has released a glimpse into datasets of the future: observation-action sequences from various MDPs like Atari games interspersed with text to use with behavioral cloning, where BC is RL speak for autoregressive language modeling: https://huggingface.co/blog/jat

The general theme is that you use whatever tools are at your disposal to generate execution traces: online RL for Atari games, decision trees for tabular data, word sequences for speech recognition given audio tokens, tool-use tokens for when you need to perform a web search given a query or run a calculator. Add some natural-language explanations and regular text, and then compress everything into a single transformer.

Since a transformer is a latent variable model, given a test-time trace prompt it performs latent model selection for you — it can decide if your task needs a simulation of xgboost or policy iteration or an interpolation of the two.
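
One way to write down the latent model selection claim, as a sketch under the usual Bayesian reading of in-context learning. The symbol m, indexing the procedures that generated the training traces (xgboost-like, policy-iteration-like, and so on), is notation introduced here, not taken from the JAT post:

```latex
p(y \mid x) \;=\; \sum_{m} p(m \mid x)\, p(y \mid x, m),
\qquad
p(m \mid x) \;\propto\; p(x \mid m)\, p(m)
```

Here x is the test-time trace prompt and y is the continuation. A prompt that looks like tabular decision traces puts most of the posterior mass on the tree-like m, and an ambiguous prompt yields a weighted interpolation of several procedures.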

It's time to go beyond filtering Common Crawl.

Jack of All Trades, Master of Some, a Multi-Purpose Transformer Agent

82 views, 09:26

Vol Building AGI

NLP kings of ETH have shown how transformers can exactly represent n-gram language models. This work is a great example of how to make transformers interpretable: use sparse representations, which are easy to imagine (I have a hard time imagining word2vec-style distributed representations, but I can imagine one-hot encodings), and hard attention, so there are fewer superposition effects to think about.
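
A toy sketch of the idea in plain Python, loosely following my reading of the construction: n - 1 hard-attention heads, each looking at exactly one earlier position and emitting a one-hot vector, followed by a lookup that stands in for the feed-forward layer. All names here are hypothetical and this is not the paper's code.

```python
from collections import Counter, defaultdict


def one_hot(index, size):
    return [1 if j == index else 0 for j in range(size)]


def hard_attention_head(tokens, t, offset, vocab_size, bos=0):
    """A head that puts all of its attention weight on position t - offset."""
    pos = t - offset
    token = tokens[pos] if pos >= 0 else bos
    return one_hot(token, vocab_size)


def fit_table(corpus, n, bos=0):
    """Count-based n-gram table: history tuple -> next-token distribution."""
    counts = defaultdict(Counter)
    padded = [bos] * (n - 1) + list(corpus)
    for j in range(n - 1, len(padded)):
        counts[tuple(padded[j - n + 1:j])][padded[j]] += 1
    return {
        hist: {tok: c / sum(ctr.values()) for tok, c in ctr.items()}
        for hist, ctr in counts.items()
    }


def ngram_next_distribution(tokens, n, table, vocab_size):
    t = len(tokens)  # position of the token being predicted
    # one head per history position; their sparse outputs are concatenated
    heads = [hard_attention_head(tokens, t, k, vocab_size) for k in range(n - 1, 0, -1)]
    # the one-hot blocks identify the history exactly, so a lookup replaces the MLP
    history = tuple(block.index(1) for block in heads)
    return table[history]


table = fit_table([1, 2, 3, 1, 2, 3, 1, 2, 4], n=3)
dist = ngram_next_distribution([1, 2], n=3, table=table, vocab_size=5)
print(dist)  # after "1 2": token 3 with probability 2/3, token 4 with probability 1/3
```

Because every intermediate value is a one-hot block, you can read off exactly which history the model is conditioning on at each step.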

https://arxiv.org/abs/2404.14994

Transformers Can Represent $n$-gram Language Models

Existing work has analyzed the representational capacity of the transformer architecture by means of formal models of computation. However, the focus so far has been on analyzing the architecture...

83 views, 08:13

Vol Building AGI

Training speed fluctuation when you're sharing your GPU with multiple other runs

95 views, edited 12:40

Vol Building AGI

HGRN2 implements their chunkwise recurrence inspired by my blog post on chunked scan: https://github.com/sustcsonglin/flash-linear-attention/blob/main/fla/ops/hgrn/chunk.py
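
For readers who have not seen a chunked scan: a minimal plain-Python sketch for the linear recurrence h_t = f_t * h_{t-1} + x_t. This is an illustration of the general idea under my own naming, not the Triton kernel in flash-linear-attention. Each chunk is first scanned from a zero state (independent work that can run in parallel), then the state carried over from earlier chunks is added, scaled by the cumulative decay inside the chunk.

```python
def sequential_scan(f, x, h0=0.0):
    h, out = h0, []
    for f_t, x_t in zip(f, x):
        h = f_t * h + x_t
        out.append(h)
    return out


def chunked_scan(f, x, chunk_size, h0=0.0):
    out, carry = [], h0
    for start in range(0, len(x), chunk_size):
        f_c = f[start:start + chunk_size]
        x_c = x[start:start + chunk_size]
        # Pass 1: scan the chunk from a zero state (independent of other chunks).
        local = sequential_scan(f_c, x_c, h0=0.0)
        # Cumulative decay from the chunk start to each position in the chunk.
        decay, acc = [], 1.0
        for f_t in f_c:
            acc *= f_t
            decay.append(acc)
        # Pass 2: add the state carried in from the previous chunks.
        out.extend(l + d * carry for l, d in zip(local, decay))
        carry = out[-1]
    return out


f = [0.5] * 6
x = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
print(chunked_scan(f, x, chunk_size=2) == sequential_scan(f, x))  # True
```

Only the carry between chunks is sequential, so the sequential depth drops from the sequence length to the number of chunks.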

Efficient implementations of state-of-the-art linear attention models in Pytorch and Triton - sustcsonglin/flash-linear-attention

🔥1

247 views, 16:44

Vol Building AGI

ICLR features a lot of agent simulation work this year:

https://webarena.dev/ — environment for agents solving tasks in the browser

https://weirdlabuw.github.io/asid/ — an exploration algorithm; explore task data in sim for *one shot* transfer to real

https://universal-simulator.github.io/unisim/ — sora for training RL agents (done 6 months before openai sora)

WebArena: A suite of benchmarks for building autonomous web agents.

69 views, edited 08:38

Vol Building AGI

Before simulating a real environment you need to capture it. Reality capture work recommended by Marius Memmel:

https://real-to-sim-to-real.github.io/RialTo/

https://sites.google.com/view/urdformer/home

Zoey Chen, Marius Memmel, Alex Fang, Aaron Walsman, Dieter Fox* and Abhishek Gupta*
University of Washington, Nvidia
*equal advising
(TGR workshop Oral, CoRL 2023)
We added more experiments and updated our website with code and more visualizations:
urdformer.github.io

59 views, 10:09

Vol Building AGI

xLSTM
https://arxiv.org/abs/2405.04517

xLSTM: Extended Long Short-Term Memory

In the 1990s, the constant error carousel and gating were introduced as the central ideas of the Long Short-Term Memory (LSTM). Since then, LSTMs have stood the test of time and contributed to...

56 views, 06:17

Vol Building AGI

xLSTM, part one: mLSTM, a parallel memory with a covariance update rule. My implementation uses Accelerated Scan, which is unnecessary because the forget gate is a scalar per head. This architecture choice allows for massive speed improvements, which I'll talk about later.
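
For reference, the covariance update as I recall it from the xLSTM paper, with the output gate and the gate stabilization left out and the initial state assumed to be zero:

```latex
\begin{aligned}
C_t &= f_t\, C_{t-1} + i_t\, v_t k_t^\top, \qquad
n_t = f_t\, n_{t-1} + i_t\, k_t, \qquad
\tilde h_t = \frac{C_t\, q_t}{\max\{|n_t^\top q_t|,\, 1\}} \\
C_t &= \sum_{s \le t} \Big(\prod_{r=s+1}^{t} f_r\Big)\, i_s\, v_s k_s^\top
     = \sum_{s \le t} \exp\!\Big(\sum_{r=s+1}^{t} \log f_r + \log i_s\Big)\, v_s k_s^\top
\end{aligned}
```

Since f_t is a scalar per head, the decay between any two positions is a difference of cumulative sums of log f. The whole sequence can then be computed as a decay-weighted, attention-like matrix product, with no scan over matrix-valued states.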

🔥1

57 views, 09:24

Vol Building AGI

The definition of stabilize:


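import torch.nn.functional as F

def stabilize(f_, i_):
    "stabilize and activate forget and input gates"
    # max_scan is not defined in this snippet.
    m = max_scan(f_, i_)
    # Shift m one step back along the time axis, filling the first step with 0.
    m_prev = F.pad(m[:, :-1, :], (0,0,1,0))

    i = (i_ - m).exp()
    f = (f_ + m_prev - m).sigmoid()
    return f, i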

max_scan is based on a Blelloch scan with a (max, +) semiring
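
A plain-Python reference for what such a scan computes, assuming max_scan implements the stabilizer recurrence m_t = max(f_t + m_{t-1}, i_t). That recurrence, the initial state of minus infinity, and the function names below are my assumptions; the actual max_scan is not shown in the post. The `combine` function is the associative operator that lets a Blelloch-style parallel scan evaluate the same recurrence.

```python
import math


def max_scan_reference(f, i, m_init=-math.inf):
    """Sequential evaluation of m_t = max(f_t + m_{t-1}, i_t)."""
    m, out = m_init, []
    for f_t, i_t in zip(f, i):
        m = max(f_t + m, i_t)
        out.append(m)
    return out


def combine(left, right):
    """Compose two steps m -> max(a + m, b) in the (max, +) semiring."""
    a1, b1 = left
    a2, b2 = right
    return (a1 + a2, max(b1 + a2, b2))


def max_scan_assoc(f, i):
    """Same result, built only from the associative combine (initial state -inf)."""
    acc, out = None, []
    for elem in zip(f, i):
        acc = elem if acc is None else combine(acc, elem)
        out.append(acc[1])
    return out


f = [-1.0, -2.0, 0.5]
i = [0.0, 3.0, 1.0]
print(max_scan_reference(f, i))  # [0.0, 3.0, 3.5]
print(max_scan_assoc(f, i))      # [0.0, 3.0, 3.5]
```

Because `combine` is associative, the prefix results can be computed in any grouping, which is what a tree-structured scan exploits.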

🤓1

60 views, edited 09:25

Vol Building AGI

https://x.com/haoailab/status/1788269848788869299

Consistency training from diffusion models has been combined with Jacobi decoding to get a 3x speedup over autoregressive loops. Consistency finetuning can be applied to existing autoregressive LMs to get efficient inference.

This might be more important than maintaining constant memory during autoregressive looping (that is, than working on RNNs).
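
A toy sketch of Jacobi decoding with greedy sampling, in plain Python. `greedy_next` is a made-up deterministic stand-in for a language model, not a real one. Instead of producing one token per step, you guess a whole block, recompute every position from the current guess in one parallel step, and repeat until nothing changes. The fixed point is exactly the greedy autoregressive output.

```python
def greedy_next(prefix):
    """Hypothetical toy 'model': a deterministic next token from the prefix."""
    return (sum(prefix) * 31 + len(prefix)) % 7


def autoregressive(prompt, n):
    seq = list(prompt)
    for _ in range(n):
        seq.append(greedy_next(seq))
    return seq[len(prompt):]


def jacobi(prompt, n, init=0):
    guess = [init] * n
    for step in range(1, n + 1):
        # One parallel step: every position is recomputed from the current guess.
        new = [greedy_next(list(prompt) + guess[:j]) for j in range(n)]
        if new == guess:  # fixed point reached
            return guess, step
        guess = new
    return guess, n  # after n updates the block is guaranteed correct


prompt = [3, 1]
tokens, steps = jacobi(prompt, n=5)
print(tokens == autoregressive(prompt, n=5), steps)
```

Plain Jacobi decoding needs at most n parallel steps because each step fixes at least one more leading token. As I understand the linked thread, consistency finetuning trains the model to jump to the fixed point in far fewer steps.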

Hao AI Lab (@haoailab) on X

People often see LLMs as sequential decoders, but we show they can be easily adapted as fast parallel decoders!🔥🚀

Announcing consistency LLMs: teaching LLMs to predict the fixed point from any point on its Jacobi decoding trajectory
- LLM can fast forward…

65 views, edited 06:37

Vol Building AGI

https://github.com/shashankvkt/DoRA_ICLR24

Pretraining with object tracking on 10 long-form videos beats ImageNet pretraining

This repo contains the official implementation of ICLR 2024 paper "Is ImageNet worth 1 video? Learning strong image encoders from 1 long unlabelled video" - shashankvkt/DoRA_ICLR24

🤯1

66 views, 08:34

Vol Building AGI

An insight from Shida Wang — nonlinearity of recurrence impacts universal approximation ability but not memory capacity. Exponential parametrization of the recurrence operator (like xLSTM) improves the optimization landscape.
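
One common form of exponential parametrization, shown for a scalar linear recurrence. This particular form is an assumption chosen for illustration; the linked repository and xLSTM may parametrize differently:

```latex
h_t = \lambda\, h_{t-1} + x_t, \qquad
\lambda = \exp(-\exp(w)) \in (0, 1) \ \text{for every } w \in \mathbb{R}, \qquad
\frac{\partial \lambda}{\partial w} = -\exp(w)\,\lambda = \lambda \log \lambda
```

The recurrence is stable for every value of the trainable parameter w, and the sensitivity of λ to w vanishes as λ approaches 1, which is the regime where long memory makes direct optimization of λ touchy.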

https://github.com/radarFudan/Curse-of-memory

Curse-of-memory phenomenon of RNNs in sequence modelling - radarFudan/Curse-of-memory

72 views, edited 09:45

Vol Building AGI

New generative model in town: learning to do nothing

https://assafshocher.github.io/IGN/

67 views, 10:29

Vol Building AGI

Kyunghyun Cho is building an AI drug design system. He is also known for GRU and contributions to machine translation.

He is currently working on protein sequence design: generative modeling over a database of sequences, property classification of the generated samples, and black-box optimization to find sample sets on the Pareto frontier of multiple objectives induced by the property classifiers (translation people: think of Minimum Bayes Risk), which are then sent to the lab for validation. This loop covers the second and third pipeline steps in the photo. Eventually he wants to *backprop* through the whole loop.

Currently the forward pass takes more than 100 years, from discovering the role of the pancreas in diabetes to the approval of semaglutide.
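
A minimal sketch of the Pareto frontier selection step in plain Python. The scores are made-up numbers standing in for property-classifier outputs, higher is assumed to be better on every objective, and none of this is taken from the actual system.

```python
def dominates(a, b):
    """a dominates b: at least as good on every objective, strictly better on one."""
    return all(x >= y for x, y in zip(a, b)) and any(x > y for x, y in zip(a, b))


def pareto_front(scores):
    """Indices of candidates that no other candidate dominates."""
    return [
        idx for idx, s in enumerate(scores)
        if not any(dominates(other, s) for other in scores)
    ]


# Hypothetical (objective_1, objective_2) scores for five generated sequences.
scores = [(0.9, 0.2), (0.6, 0.6), (0.5, 0.5), (0.1, 0.95), (0.9, 0.1)]
print(pareto_front(scores))  # [0, 1, 3]
```

Candidates 2 and 4 are dropped because another candidate is at least as good on both objectives; the remaining ones represent different trade-offs and would all be worth sending to the lab.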

❤1🤯1

69 views, 13:13

Vol Building AGI

Research hint from Yann LeCun: figure out where transformer loss spikes come from. They don’t usually happen in convnets. My thought: unlike Transformers, convnets do not have input-dependent weights.

Also work on Q* with hierarchical time

69 views, edited 15:42

Vol Building AGI

My favorite AI paper from ICLR is OMNI, which is also my favorite NeurIPS workshop paper by Jenny Zhang et al.

Jenny builds on an idea that Juergen calls PowerPlay: you give your AI agent tasks that are
1. Learnable, by measuring learning progress as a fraction of successes. This can also be used to track forgetting.
2. Interesting, by the human notion of interestingness encoded in GPT-4.

(1) was already known how to do.
(2) has been done with measures like novelty or artificial curiosity. Novel stuff is not always interesting! This is where you need GPT to get truly general agents: GPT has learned interestingness from Reddit experts and all scientific papers.
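
A rough sketch of point (1) in plain Python: learning progress as the change in success rate between two consecutive windows of attempts. The window-based definition is a simplification assumed here, and OMNI's exact measure may differ.

```python
def success_rate(outcomes):
    return sum(outcomes) / len(outcomes) if outcomes else 0.0


def learning_progress(outcomes, window=4):
    """Success rate over the latest window minus the rate over the window before it."""
    recent = outcomes[-window:]
    earlier = outcomes[-2 * window:-window]
    return success_rate(recent) - success_rate(earlier)


print(learning_progress([0, 0, 0, 1, 1, 0, 1, 1]))  # 0.5   improving: worth training on
print(learning_progress([1, 1, 1, 1, 1, 1, 1, 1]))  # 0.0   already mastered
print(learning_progress([0, 0, 0, 0, 0, 0, 0, 0]))  # 0.0   too hard for now
print(learning_progress([1, 1, 1, 1, 1, 0, 0, 0]))  # -0.75 forgetting
```

Tasks with progress near zero are either mastered or out of reach, so the curriculum can focus on the rest; a negative value flags forgetting.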

https://www.jennyzhangzt.com/omni/

👍1🤯1

80 views, edited 16:10

Vol Building AGI

https://www.ncbi.nlm.nih.gov/pmc/articles/PMC9890294/pdf/CN-20-2267.pdf

75 views, 16:56

Vol Building AGI

Reinforcement learning followed by mechanistic interpretability on mice modulated by ketamine

78 views, 16:56