I construct a linear RNN that receives ones as inputs and produces increasingly better approximations to pi as outputs.
The width of the network grows logarithmically wrt the maximum sequence length.
https://gist.github.com/proger/ba147e3953a155d833aae084c1f0cd12
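A minimal sketch of the idea, not the code from the gist: one way to make a linear RNN turn a stream of ones into approximations of pi is a diagonal recurrence with eigenvalues -x_j^2 at quadrature nodes x_j on [0, 1], so the output tends to 4 times the integral of 1/(1+x^2), which is pi. The midpoint nodes and the names `make_rnn` and `run` are assumptions made for illustration, and the sketch does not try to reproduce the logarithmic width scaling.

```python
import math

def make_rnn(width):
    # Midpoint quadrature nodes on [0, 1] (an assumed, illustrative choice).
    nodes = [(j + 0.5) / width for j in range(width)]
    decay = [-(x * x) for x in nodes]   # diagonal recurrence weights
    readout = [4.0 / width] * width     # output weights (quadrature weights times 4)
    return decay, readout

def run(decay, readout, inputs):
    h = [0.0] * len(decay)
    outputs = []
    for x in inputs:
        # Linear state update: h <- decay * h + x (input weights are all ones).
        h = [a * s + x for a, s in zip(decay, h)]
        # Linear readout of the hidden state.
        outputs.append(sum(c * s for c, s in zip(readout, h)))
    return outputs

decay, readout = make_rnn(64)
outputs = run(decay, readout, [1.0] * 3000)
print(outputs[0], outputs[9], outputs[-1], math.pi)
```

The first output is 4, the first term of the Leibniz series, and later outputs move towards pi. The final accuracy is limited by the quadrature, so a wider state gives a better limit.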
God gave us natural numbers. Humans invented real numbers for oppression (at ETH by the way)
Ready for flow matching? https://bm371613.github.io/conditional-flow-matching/
Very cool speech recognition work: FastConformer + SC-CTC ablations wrt sequence length. 6-layer models can properly use 21.8 minutes of context.
https://x.com/RobFlynnHere/status/1803004030576152944
Noam Shazeer has shared techniques used to scale Transformer inference to 1/5 of Google search traffic
https://research.character.ai/optimizing-inference/
Will Merrill's work helped me gain intuition about what classes of tasks are solvable by what kinds of sequence modeling architectures, notably that constant-depth Transformers are not universal (i.e. they do not even cover P). The ability to execute any algorithm in P can be achieved by applying chains of thought, scratchpads, pause tokens or infinite depth (see "Universal Transformer"). Chain of thought can be seen as a way to gain infinite depth from a finite network. Check out his recent video:
https://www.youtube.com/watch?v=30MhUdapqc8
Will Merrill: The Expressive Power of Transformers with Chain of Thought
Talk given by Will Merrill to the Formal Languages and Neural Networks discord on June 10, 2024. Thank you, Will!
Please find the link to their paper here:
https://arxiv.org/abs/2310.07923
How to do vector calculus with ChatGPT: ask it to put indices on vectors explicitly, then there is no confusion about what dimensions need to be used when outer products arise.
Example: https://chatgpt.com/share/588e5ca1-b286-4f74-b2ef-bdbb402e3c83
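To illustrate why explicit indices help, here is a small sketch in plain Python. It assumes a hypothetical loss L = 0.5 * sum_i ((W x)_i - y_i)^2; the names `W`, `x`, `y`, `loss` and `grad_W` are made up for this example. Written with indices, the gradient is dL/dW_ij = ((W x)_i - y_i) * x_j, so it is clear that the outer product has rows indexed by i and columns indexed by j.

```python
def matvec(W, x):
    # (W x)_i = sum_j W_ij * x_j
    return [sum(W[i][j] * x[j] for j in range(len(x))) for i in range(len(W))]

def loss(W, x, y):
    r = matvec(W, x)
    return 0.5 * sum((r[i] - y[i]) ** 2 for i in range(len(y)))

def grad_W(W, x, y):
    # dL/dW_ij = ((W x)_i - y_i) * x_j : an outer product with explicit indices
    r = matvec(W, x)
    return [[(r[i] - y[i]) * x[j] for j in range(len(x))] for i in range(len(y))]

W = [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]]
x = [1.0, -1.0, 2.0]
y = [0.0, 1.0]
g = grad_W(W, x, y)  # same shape as W: 2 rows (index i), 3 columns (index j)
```

A finite-difference check on `loss` is a quick way to confirm that the indices ended up in the right order.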
When making matplotlib figures, it's important to match the font to the rest of the paper: https://x.com/giffmana/status/1632506730897653761
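A rough sketch of how this can be done with rcParams, assuming the paper uses a Times-like serif font at 10pt and a 3.25 inch column. The font names, sizes and the `make_figure` helper are assumptions for illustration, so adjust them to the actual template.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend so the sketch runs without a display
import matplotlib.pyplot as plt

# Assumed paper style: serif body text at 10pt. Adjust to the actual template.
PAPER_STYLE = {
    "font.family": "serif",
    "font.serif": ["Times New Roman", "Times", "DejaVu Serif"],  # fallbacks in order
    "mathtext.fontset": "stix",  # Times-like glyphs for math text
    "font.size": 10,
    "axes.labelsize": 10,
    "xtick.labelsize": 8,
    "ytick.labelsize": 8,
    "legend.fontsize": 8,
}
plt.rcParams.update(PAPER_STYLE)

def make_figure(width_in=3.25, height_in=2.2):
    # Create the figure at its final printed size so fonts are not rescaled later.
    fig, ax = plt.subplots(figsize=(width_in, height_in))
    xs = [0.0, 0.5, 1.0, 1.5, 2.0]
    ax.plot(xs, [x * x for x in xs], label="$x^2$")
    ax.set_xlabel("x")
    ax.set_ylabel("y")
    ax.legend()
    return fig

fig = make_figure()
```

Saving with `fig.savefig("plot.pdf")` and including the file at its natural size keeps the text the same size as in the paper body.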
I had a chance to talk with robotics + computer vision researchers: their development focus is always on real-time computation. Maps and terrain models are built in real time. Maybe it makes sense to think about training multimodal dialogue systems the same way? Working with chatbots trained on random texts collected by who knows whom, who knows when, is not that interesting.
https://www.youtube.com/watch?v=1t7AWa4SMlo
I got excited and built a fast weight programmer loop into the script demo I shared before. It's using a read-forget-update loop to remember patterns that are coming from the microphone.
Check it out — it learns music and speech on the fly.
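For readers who have not seen a read-forget-update loop, here is a minimal sketch in plain Python. It is not the code of the demo: it assumes a delta-rule style fast weight matrix, and the names `read`, `step` and `beta` are hypothetical.

```python
def read(W, k):
    # Read: retrieve the value currently associated with key k, (W k)_i.
    return [sum(W[i][j] * k[j] for j in range(len(k))) for i in range(len(W))]

def step(W, k, v, beta=1.0):
    old = read(W, k)
    for i in range(len(W)):
        for j in range(len(k)):
            W[i][j] -= beta * old[i] * k[j]  # forget the old association
            W[i][j] += beta * v[i] * k[j]    # update with the new value
    return old

W = [[0.0] * 3 for _ in range(2)]     # fast weights: 2 value dims, 3 key dims
step(W, [1.0, 0.0, 0.0], [1.0, 2.0])  # store a pattern
step(W, [1.0, 0.0, 0.0], [3.0, 4.0])  # overwrite it under the same key
step(W, [0.0, 1.0, 0.0], [5.0, 6.0])  # store another pattern under a new key
```

With unit-norm keys and `beta=1.0`, writing to an existing key replaces the old value instead of adding to it, which is what lets the memory track changing input.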
The format above is required for using the mma.m16n8k16 instruction in PTX 8.5: https://docs.nvidia.com/cuda/parallel-thread-execution/#warp-level-matrix-fragment-mma-16816-float
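As a rough sketch of what such a fragment layout looks like, the snippet below computes which element of the 16x16 A tile each lane holds in its registers a0..a7 for 16-bit floating-point inputs. This is the mapping as I read it from the linked PTX section, so treat it as an assumption and check it against the documentation; `a_fragment_coords` is a hypothetical helper.

```python
def a_fragment_coords(lane, i):
    """(row, col) in the 16x16 A tile held in element a_i of the given lane."""
    group_id = lane >> 2          # 8 groups of 4 lanes each
    thread_in_group = lane % 4
    # a0, a1, a4, a5 come from the top 8 rows; a2, a3, a6, a7 from the bottom 8.
    row = group_id if (i < 2 or 4 <= i < 6) else group_id + 8
    # a0..a3 cover columns 0..7; a4..a7 cover columns 8..15.
    col = thread_in_group * 2 + (i & 1) + (8 if i >= 4 else 0)
    return row, col

# Every (row, col) of the tile should be owned by exactly one (lane, i) pair.
owners = {a_fragment_coords(lane, i): (lane, i) for lane in range(32) for i in range(8)}
```

Each lane ends up with pairs of adjacent columns, which is why the values are packed two per 32-bit register.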
Turing complete programs in transformers by construction give length generalization out of the box
https://arxiv.org/abs/2407.03310
Universal Length Generalization with Turing Programs
Cool example of deep learning research going in circles: the short conv1d was reintroduced with QRNN, brought to linear transformers in H3, and made mainstream in Mamba. This convolution makes the network learn faster in the beginning, improves recall, and allows a single-layer linear RNN to solve associative retrieval. I spent about a month figuring that property out.
This convolution block has made its way into Noam Shazeer's latest transformer variant, and people are now writing papers making statements about its expressive power.
The conclusion the authors make:
> An important parameterization to explore is replacing the short convolutions within CAT with SSMs
We've been there! Next thing you know you get Conv-SSM-Attention, and that is called Recurrent Gemma 😄
https://arxiv.org/abs/2407.05591
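For reference, the short convolution discussed above can be sketched as a causal depthwise conv1d in plain Python. This is an illustrative sketch, not code from any of the cited models, and `short_conv1d` and the kernel layout are assumptions. With a kernel that only looks one step back it acts as a shift, so each position sees the previous token; that is one intuition for why it helps recall.

```python
def short_conv1d(xs, kernel):
    """Causal depthwise convolution: y[t][c] = sum_k kernel[k][c] * xs[t - k][c]."""
    channels = len(xs[0])
    ys = []
    for t in range(len(xs)):
        y = [0.0] * channels
        for k, taps in enumerate(kernel):
            if t - k >= 0:  # causal: never look at future positions
                for c in range(channels):
                    y[c] += taps[c] * xs[t - k][c]
        ys.append(y)
    return ys

xs = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]  # 3 time steps, 2 channels
shift = [[0.0, 0.0], [1.0, 1.0]]           # kernel[0] is the current step, kernel[1] the previous one
shifted = short_conv1d(xs, shift)
```

In the models mentioned above the kernel is learned and only a few taps long, so the shift is just one of the patterns it can represent.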
On the Power of Convolution Augmented Transformer
Compiler Explorer supports CUDA C++. See how nvcc emits SASS for thread barriers: https://godbolt.org/z/613zseoGW
Compiler Explorer - CUDA C++ (NVCC 12.4.1)
#include <cooperative_groups.h>
#include <cuda/barrier>
using barrier = cuda::barrier<cuda::thread_scope_block>;
__global__ void square(int* array, int n) {
__shared__ barrier bar;
auto block = cooperative_groups::this_thread_block();
if (…
Hello from ICML. I am at the tutorial on Data Attribution at Scale. We are studying how to relate model outputs to training inputs. Here are the notes:
https://ml-data-tutorial.org/
Data Attribution at Scale | ICML 2024
The next tutorial by Zeyuan Allen-Zhu is on Physics of Language Models. We study how to apply the scientific method to the study of language models, with examples that include curation of synthetic data, mechanistic probing, and more.
Tutorial website: https://physics.allen-zhu.com/
Announcement: https://x.com/zeyuanallenzhu/status/1813150298363601102
Physics of Language Models
The concept of Physics of Language Models was jointly conceived and designed by ZA and Xiaoli Xu.