Mechinterp on chain of thought circuits https://arxiv.org/abs/2406.02128
arXiv.org
Iteration Head: A Mechanistic Study of Chain-of-Thought
Chain-of-Thought (CoT) reasoning is known to improve Large Language Models both empirically and in terms of theoretical approximation power. However, our understanding of the inner workings and...
Debug neural networks by casting them as neural fields https://github.com/neale/neural-canvas
GitHub
GitHub - neale/neural-canvas: creative deep learning with implicit neural representations
creative deep learning with implicit neural representations - neale/neural-canvas
ARC-AGI has been solved. Apply for safety testing of o3: https://openai.com/12-days/
OpenAI
12 Days of OpenAI
Plotting columns of the DFT basis with matplotlib looks very vibrant out of the box. A small enough ratio of the window size to the sampling rate allows for stripes to reveal underlying higher frequency filters at the back.
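import numpy as np

def dft(size=512, rate=16000, low=50):
    # `size` frequencies from `low` Hz up to Nyquist (exclusive), one per row
    k = np.linspace(low, rate / 2, size, endpoint=False)
    # sample times of a `size`-sample window at `rate` Hz
    t = np.arange(size) / rate
    return np.exp(-2j * np.pi * k[:, None] * t)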
Marcus Hutter has provided a recipe to build an agent that provably solves any problem. He wrote a new book about it: https://x.com/mhutter42/status/1871426793380688255?s=46&t=qNUYWfgTfF4u1RKfN0ir3A
X (formerly Twitter)
Marcus Hutter (@mhutter42) on X
Santa Arrived! The PDF (of a colorful Xmas version) of the "Introduction to Universal AI" book is now freely available online at https://t.co/r9COEBnf4S Wishing you all joyful reading, Merry Xmas & a :-) New Year.
I installed ghostty so my terminal can render images and have fragment shaders applied on the whole window.
Let's animate the process of extracting a spectrogram of 13 Mel-Frequency Cepstral Coefficients (MFCCs) from an MP3 file.
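A rough sketch of the usual MFCC pipeline (framing, windowing, power spectrum, mel filterbank, log, DCT-II) in plain NumPy. It assumes the MP3 has already been decoded to a mono float waveform; the window of 512 samples, hop of 160 and 40 mel bands are illustrative guesses, not the settings used in the animation. Only the 13 coefficients come from the post.

```python
import numpy as np

def hz_to_mel(f):
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mel_filterbank(n_mels, n_fft, rate):
    # Triangular filters with centers equally spaced on the mel scale.
    hz = mel_to_hz(np.linspace(hz_to_mel(0.0), hz_to_mel(rate / 2), n_mels + 2))
    freqs = np.fft.rfftfreq(n_fft, 1.0 / rate)
    lo, mid, hi = hz[:-2, None], hz[1:-1, None], hz[2:, None]
    up = (freqs - lo) / (mid - lo)
    down = (hi - freqs) / (hi - mid)
    return np.clip(np.minimum(up, down), 0.0, None)  # (n_mels, n_fft // 2 + 1)

def mfcc(signal, rate=16000, n_fft=512, hop=160, n_mels=40, n_mfcc=13):
    # Slice the waveform into overlapping frames and apply a Hann window.
    n_frames = 1 + (len(signal) - n_fft) // hop
    idx = np.arange(n_fft)[None, :] + hop * np.arange(n_frames)[:, None]
    frames = signal[idx] * np.hanning(n_fft)
    # Power spectrum -> mel energies -> log.
    power = np.abs(np.fft.rfft(frames, axis=1)) ** 2
    logmel = np.log(power @ mel_filterbank(n_mels, n_fft, rate).T + 1e-10)
    # Unnormalized DCT-II over the mel axis, keeping the first n_mfcc coefficients.
    n = np.arange(n_mels)
    basis = np.cos(np.pi * (n[None, :] + 0.5) * np.arange(n_mfcc)[:, None] / n_mels)
    return logmel @ basis.T  # (n_frames, n_mfcc)

# Stand-in for a decoded MP3: one second of a 440 Hz tone at 16 kHz.
tone = np.sin(2 * np.pi * 440.0 * np.arange(16000) / 16000)
feats = mfcc(tone)
```

Each row of `feats` is one 13-dimensional frame, which is the shape the clustering and GMM posts work with.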
To update Gaussian mixture models with 16384 components on 13-dimensional MFCC frames using expectation maximization, I need to initialize the mixtures. The simplest data-driven initializer for GMMs is taking cluster centroids.
I decided to compare three algorithms for clustering:
1. random sampling (usually a decent init for other algorithms, no hyperparameters)
2. minibatch Lloyd (performs EMA updates on mini batches, has one hyperparameter: the EMA weight)
3. Linde-Buzo-Gray (LBG) with minibatch Lloyd refinement, a classic algorithm for computing quantization codebooks. Its main trick is to start with a single cluster and progressively double the number of clusters by perturbing the current centroids and refining them with k-means.
I have 11 million frames from Common Voice uk 10.0, so the Lloyd algorithm (classical k-means) and SVD/QR are out: they require materializing matrices that are a bit too big for my MacBook.
On the plot, the x axis is the number of steps and the y axis is the quantization loss.
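For reference, a minimal sketch of what one minibatch Lloyd step with an EMA update could look like, starting from a random-sampling init. The names (`sq_dists`, `minibatch_lloyd_step`, `ema`) are made up for illustration, the data is synthetic, and treating 0.9 as the weight kept on the old centroid is an assumption about how the EMA is parameterized.

```python
import numpy as np

def sq_dists(x, c):
    # |x - c|^2 = |x|^2 - 2 x.c + |c|^2, computed without a (B, K, D) intermediate.
    d = (x * x).sum(1)[:, None] - 2.0 * x @ c.T + (c * c).sum(1)[None, :]
    return np.maximum(d, 0.0)

def quantization_loss(x, c):
    # Mean squared distance from each point to its nearest centroid.
    return sq_dists(x, c).min(1).mean()

def minibatch_lloyd_step(batch, c, ema=0.9):
    # Assign each batch point to its nearest centroid.
    assign = sq_dists(batch, c).argmin(1)
    counts = np.bincount(assign, minlength=len(c))
    sums = np.zeros_like(c)
    np.add.at(sums, assign, batch)
    # Move only the centroids that received points toward their batch mean.
    hit = counts > 0
    new_c = c.copy()
    new_c[hit] = ema * c[hit] + (1.0 - ema) * sums[hit] / counts[hit, None]
    return new_c

# Synthetic stand-in for MFCC frames: 4 blobs in 13 dimensions.
rng = np.random.default_rng(0)
centers = rng.normal(scale=10.0, size=(4, 13))
data = centers[rng.integers(0, 4, size=2000)] + rng.normal(size=(2000, 13))

codebook = data[rng.choice(len(data), size=8, replace=False)]  # random-sampling init
loss_before = quantization_loss(data, codebook)
for _ in range(100):
    batch = data[rng.integers(0, len(data), size=256)]
    codebook = minibatch_lloyd_step(batch, codebook)
loss_after = quantization_loss(data, codebook)
```

Only a (batch, K) distance matrix is ever materialized, which is what keeps this feasible when the full dataset does not fit in memory.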
I found that LBG is very compute efficient: it spends most of its time running k-means for small clusterings, so it also wins on wall clock (ballpark 10x faster?) in my CPU-only pure-NumPy implementation with efficient L2 distance computation. It also seems to be more data efficient: I couldn't get better results with Lloyd when I ran it for more steps.
I didn't bother tuning the EMA learning rate too much and settled at 0.9. Progressive scaling is all we need?
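A small sketch of the LBG doubling loop, under a few assumptions: it uses full-batch Lloyd refinement instead of the minibatch variant (fine for toy data, not for 11 million frames), an additive random perturbation scaled by the data's standard deviation, and a target size that is a power of two. The function names and the `eps` and `iters` values are illustrative.

```python
import numpy as np

def nearest(x, c):
    # Nearest-centroid assignment and mean squared distance (quantization loss).
    d = (x * x).sum(1)[:, None] - 2.0 * x @ c.T + (c * c).sum(1)[None, :]
    return d.argmin(1), np.maximum(d.min(1), 0.0).mean()

def lloyd(x, c, iters=10):
    # Plain k-means refinement; centroids with no points are left where they are.
    for _ in range(iters):
        idx, _ = nearest(x, c)
        counts = np.bincount(idx, minlength=len(c))
        sums = np.zeros_like(c)
        np.add.at(sums, idx, x)
        live = counts > 0
        c = c.copy()
        c[live] = sums[live] / counts[live, None]
    return c

def lbg(x, k, rng, eps=1e-2, iters=10):
    # Start from one centroid (the global mean), then split and refine until k.
    c = x.mean(0, keepdims=True)
    losses = []
    while len(c) < k:
        delta = eps * x.std(0) * rng.normal(size=c.shape)
        c = np.concatenate([c + delta, c - delta])  # doubles the codebook
        c = lloyd(x, c, iters)
        losses.append(nearest(x, c)[1])
    return c, losses

rng = np.random.default_rng(0)
data = rng.normal(size=(2000, 2))
codebook, losses = lbg(data, 8, rng)
```

Most of the Lloyd iterations here run on codebooks of size 2 and 4, which is the same effect that makes the early stages cheap at larger K.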
Alignment self-training can work even with a single utterance. In this video, expectation maximization for a GMM acoustic model with a linear chain HMM finds a plausible alignment in 30 steps.
My algorithm only updates the GMM mixture coefficients in the maximization step. The expectation step uses my implementation of the scaled forward-backward recursions.
The 1024 GMM means are pretrained using LBG, and the HMM prior is a linear chain, similar to what you see on the right-hand side of the plot.
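For readers who have not seen them, here is a generic textbook-style sketch of the scaled forward-backward recursions. It is not the implementation from the video: the observation likelihoods `B` are random placeholders rather than GMM outputs, and nothing forces the path to end in the last state.

```python
import numpy as np

def forward_backward(pi, A, B):
    # pi: (S,) initial state probabilities
    # A:  (S, S) transition matrix, rows sum to 1
    # B:  (T, S) likelihood of each frame under each state
    T, S = B.shape
    alpha = np.zeros((T, S))
    beta = np.ones((T, S))
    scale = np.zeros(T)
    # Forward pass: normalize alpha at every step to avoid underflow.
    alpha[0] = pi * B[0]
    scale[0] = alpha[0].sum()
    alpha[0] /= scale[0]
    for t in range(1, T):
        alpha[t] = (alpha[t - 1] @ A) * B[t]
        scale[t] = alpha[t].sum()
        alpha[t] /= scale[t]
    # Backward pass: reuse the forward scaling factors.
    for t in range(T - 2, -1, -1):
        beta[t] = (A @ (B[t + 1] * beta[t + 1])) / scale[t + 1]
    gamma = alpha * beta  # state posteriors, each row sums to 1
    loglik = np.log(scale).sum()
    return gamma, loglik

# Toy linear chain with 3 states: stay with 0.5, advance with 0.5, last state absorbs.
A = np.array([[0.5, 0.5, 0.0],
              [0.0, 0.5, 0.5],
              [0.0, 0.0, 1.0]])
pi = np.array([1.0, 0.0, 0.0])
rng = np.random.default_rng(0)
B = rng.uniform(0.1, 1.0, size=(12, 3))
gamma, loglik = forward_backward(pi, A, B)
```

The rows of `gamma` are the soft alignment: frame t belongs to state s with probability `gamma[t, s]`, and these posteriors are what a maximization step would use as weights.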
My holiday's ending so I'll share one last observation about HMMs and post an animation of LBG clustering.
A linear chain HMM has a transition matrix that allows every state to go to itself or to the next one in the list. It's a transition design pattern used in CTC, Transducer, Kaldi and some clustering models: you have a sequence of T characters and you want the decoder to go through all of them, so you make an HMM that corresponds to the regex with a + sign after each character.
The transition matrix also requires transition probabilities, but what do you put there? 0.5 0.5? 0.9 0.1? I used to default to the latter, but then my alignment procedure spent most of its time in the final state and failed to align the rest.
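To make the structure concrete, a minimal sketch of building such a matrix. The helper name `linear_chain` and the choice to make the final state absorbing are assumptions for illustration; for the characters "abc" the matching regex would be a+b+c+.

```python
import numpy as np

def linear_chain(n_states, p_stay=0.5):
    # Each state either stays (p_stay) or moves to the next state (1 - p_stay).
    A = np.zeros((n_states, n_states))
    idx = np.arange(n_states)
    A[idx, idx] = p_stay
    A[idx[:-1], idx[:-1] + 1] = 1.0 - p_stay
    # Assumption: the last state has nowhere to go, so it absorbs.
    A[-1, -1] = 1.0
    return A

A = linear_chain(4, p_stay=0.9)
```

The result is upper bidiagonal: only the main diagonal (stay) and the first superdiagonal (advance) are nonzero.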
Turns out the transition probabilities directly affect state duration: the expected duration of each state is E[d] = 1/(1-p), where p is the probability of staying in the same state (the self-transition probability).
This quantity is the mean of the geometric distribution. It arises when you flip a coin that says "stay" with probability p and count the flips up to and including the first "leave". On average that count is 1/(1-p).
On the screenshot I start with a trained acoustic model but set the transition probabilities to 3/4 1/4 (printed at the top).
The decoder assigns one frame to each of the first two states and then spends all of its time in the final state (second alignment plot at the bottom), even though the observation map looks reasonable (see the right plot). The error goes away when I initialize state durations by solving E[d] = 1/(1-p) for p, i.e. p = 1 - 1/d, letting d be the total duration of the sequence divided by the number of states.
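The duration-based initialization can be sketched in a few lines. The function names are hypothetical; `expected_duration` just sums the geometric series numerically to check the closed form 1/(1-p).

```python
import numpy as np

def self_loop_prob(n_frames, n_states):
    # Target duration d = frames per state, then solve 1 / (1 - p) = d for p.
    d = n_frames / n_states
    if d < 1.0:
        raise ValueError("need at least one frame per state")
    return 1.0 - 1.0 / d

def expected_duration(p, max_len=10000):
    # E[d] = sum over d of d * p^(d-1) * (1 - p), truncated at max_len terms.
    d = np.arange(1, max_len + 1)
    return float((d * p ** (d - 1) * (1.0 - p)).sum())

p = self_loop_prob(n_frames=120, n_states=3)  # 40 frames per state on average
```

For example, 120 frames over 3 states gives p = 0.975, while a fixed self-loop probability of 3/4 corresponds to only 4 frames per state regardless of the utterance length.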
Animating the Linde-Buzo-Gray clustering algorithm. It starts with a single cluster describing all the data (i.e. just taking the mean), doubles it using a small perturbation and refines the clustering using the Lloyd (aka "k means") algorithm. I am plotting the reconstruction loss, the entropy (entropy should be close to log K) and the codebook utilization. At the end there are 5 dead entries in the codebook — this problem gets bigger with larger K and needs to be worked on.
Each picture is the codebook after doubling the number of clusters.
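The three diagnostics can be computed from the nearest-centroid assignments alone. A minimal sketch, assuming "utilization" means the fraction of codebook entries that receive at least one point and that entropy is measured in nats over the assignment histogram:

```python
import numpy as np

def codebook_stats(assignments, k):
    # Histogram of how many points each codebook entry received.
    counts = np.bincount(assignments, minlength=k)
    probs = counts / counts.sum()
    nz = probs[probs > 0]
    entropy = float(-(nz * np.log(nz)).sum())   # at most log(k), reached when usage is uniform
    utilization = float((counts > 0).mean())    # fraction of entries in use
    dead = int((counts == 0).sum())             # entries that received no points
    return entropy, utilization, dead

# Perfectly balanced usage of 8 entries vs. a codebook of 4 with two dead entries.
balanced = codebook_stats(np.arange(800) % 8, k=8)
half_dead = codebook_stats(np.array([0, 0, 1, 1]), k=4)
```

A gap between the entropy and log K means the codebook is used unevenly even when no entry is completely dead.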
Kazuki Irie’s new paper: the brain uses keys to access memories but cannot access keys themselves https://arxiv.org/abs/2501.02950
arXiv.org
Key-value memory in the brain
Classical models of memory in psychology and neuroscience rely on similarity-based retrieval of stored patterns, where similarity is a function of retrieval cues and the stored patterns. While...