Vol Building AGI

To fit Gaussian mixture models with 16384 components to 13-dimensional MFCC frames using expectation maximization, I need to initialize the mixtures. The simplest data-driven initializer for GMMs is to take cluster centroids as the initial means.

I decided to compare three algorithms for clustering:

1. random sampling (usually a decent init for other algorithms, no hyperparameters)
2. minibatch Lloyd (performs EMA updates on minibatches, has one hyperparameter: the EMA weight)
3. Linde-Buzo-Gray (LBG) with minibatch Lloyd refinement, a classic algorithm for computing quantization codebooks. Its main trick is to start with a single cluster and progressively double the codebook size by perturbing the current centroids and refining them with k-means.
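A minimal sketch of what one minibatch Lloyd step could look like, not the code behind the plot. The function name `minibatch_lloyd_step`, the convention that `ema` is the weight kept on the old centroid, and the brute-force distance computation are all assumptions made for illustration.

```python
import numpy as np

def minibatch_lloyd_step(centroids, batch, ema=0.9):
    """One step: assign the batch to centroids, then EMA-update the centroids that got data."""
    centroids = np.asarray(centroids, dtype=float)
    batch = np.asarray(batch, dtype=float)

    # Squared L2 distance from every frame to every centroid, shape (B, K).
    d2 = ((batch[:, None, :] - centroids[None, :, :]) ** 2).sum(-1)
    assign = d2.argmin(axis=1)
    loss = d2[np.arange(len(batch)), assign].mean()  # quantization loss on this batch

    # Per-centroid sums and counts of the assigned frames.
    counts = np.bincount(assign, minlength=len(centroids))
    sums = np.zeros_like(centroids)
    np.add.at(sums, assign, batch)

    # Assumption: `ema` weights the old centroid, (1 - ema) the batch mean.
    hit = counts > 0
    updated = centroids.copy()
    updated[hit] = ema * centroids[hit] + (1.0 - ema) * sums[hit] / counts[hit, None]
    return updated, loss
```

Centroids that receive no frames in a batch are left untouched, which is one simple choice among several.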

I have 11 million frames from Common Voice uk 10.0, so the Lloyd algorithm (classical k-means) and SVD/QR are out: they require materializing matrices that are a bit too big for my MacBook.

On the plot, the x axis is the number of steps and the y axis is the quantization loss.

I found that LBG is very compute efficient: it spends most of its time running k-means on small codebooks, so in terms of wall clock it is much faster than plain minibatch Lloyd (ballpark 10x?) in my CPU-only pure-NumPy implementation with efficient L2 distance computation. It also seems to be more data efficient: I couldn't get better results with Lloyd even when I ran it for more steps.
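The post does not say which trick "efficient L2 distance computation" refers to. One common option in NumPy, shown here as an assumption, is to expand the squared distance as ||x||² - 2x·c + ||c||² so that the heavy part becomes a single matrix multiply, and to process frames in chunks so the (N, K) matrix never has to exist in full. The names `pairwise_sq_l2` and `assign_in_chunks` and the chunk size are hypothetical.

```python
import numpy as np

def pairwise_sq_l2(x, c):
    """Squared L2 distances between rows of x (N, D) and rows of c (K, D), shape (N, K)."""
    x_sq = (x * x).sum(axis=1)[:, None]   # (N, 1)
    c_sq = (c * c).sum(axis=1)[None, :]   # (1, K)
    d2 = x_sq - 2.0 * (x @ c.T) + c_sq    # one matmul instead of an (N, K, D) tensor
    return np.maximum(d2, 0.0)            # clip tiny negatives caused by rounding

def assign_in_chunks(x, c, chunk=65536):
    """Nearest-centroid index for every row of x, computed chunk by chunk."""
    out = np.empty(len(x), dtype=np.int64)
    for start in range(0, len(x), chunk):
        stop = start + chunk
        out[start:stop] = pairwise_sq_l2(x[start:stop], c).argmin(axis=1)
    return out
```

With 16384 centroids and 13 dimensions, a chunk of 65536 float64 frames needs roughly 8 GB for the distance matrix, so a smaller chunk is probably the practical choice on a laptop.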

I didn't bother tuning the EMA learning rate much and settled on 0.9. Progressive scaling is all we need?

[Video, 0:14]

Alignment self-training can work even with a single utterance. In this video, expectation maximization for a GMM acoustic model with a linear chain HMM finds a plausible alignment in 30 steps.

My algorithm only updates the GMM mixture coefficients in the maximization step. The expectation step uses my implementation of the scaled forward-backward recursions.
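For readers who have not seen the scaled recursions, here is a minimal sketch of the textbook version, not my actual implementation. It assumes `trans[i, j]` is the probability of moving from state i to state j and that `obs_lik[t, s]` already holds the GMM likelihood of frame t under state s. The function name is hypothetical.

```python
import numpy as np

def forward_backward(init, trans, obs_lik):
    """Scaled forward-backward. Returns state posteriors gamma (T, S) and the log-likelihood."""
    T, S = obs_lik.shape
    alpha = np.zeros((T, S))
    beta = np.ones((T, S))
    scale = np.zeros(T)

    # Forward pass: renormalize alpha at every frame and remember the scale.
    a = init * obs_lik[0]
    scale[0] = a.sum()
    alpha[0] = a / scale[0]
    for t in range(1, T):
        a = (alpha[t - 1] @ trans) * obs_lik[t]
        scale[t] = a.sum()
        alpha[t] = a / scale[t]

    # Backward pass: reuse the forward scales so alpha * beta is already normalized.
    for t in range(T - 2, -1, -1):
        beta[t] = (trans @ (obs_lik[t + 1] * beta[t + 1])) / scale[t + 1]

    gamma = alpha * beta
    return gamma, np.log(scale).sum()
```

The scaling keeps every quantity in a safe numeric range without switching to log space. The sketch does not guard against a frame that has zero likelihood under every reachable state.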

The 1024 GMM means are pretrained using LBG, and the HMM prior is a linear chain, similar to what you see on the right-hand side of the plot.

Making regularized updates to the transition matrix for 10 more steps gets the silence boundaries much better. Without the regularization, which gently mixes the prior and the posterior, the update would completely derail the optimization.
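One way such a mixing update could look, as a sketch only. The function name, the mixing weight `mix=0.1`, and the choice to fall back to the prior for rows with no expected counts are all assumptions, not details from the experiment above.

```python
import numpy as np

def regularized_transition_update(prior, expected_counts, mix=0.1, eps=1e-12):
    """Blend the prior transition matrix with the posterior estimate from expected counts."""
    row_sums = expected_counts.sum(axis=1, keepdims=True)
    # Posterior estimate: row-normalized expected transition counts from the E-step.
    # Rows without any counts keep the prior (assumption).
    posterior = np.where(row_sums > eps,
                         expected_counts / np.maximum(row_sums, eps),
                         prior)
    # Gentle mix: mostly prior, a little posterior.
    updated = (1.0 - mix) * prior + mix * posterior
    return updated / updated.sum(axis=1, keepdims=True)
```

Because both inputs have zeros outside the linear chain pattern, the mix keeps the chain structure intact.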

Still using just one example.

My holiday's ending so I'll share one last observation about HMMs and post an animation of LBG clustering.

A linear chain HMM has a transition matrix that allows every state to go to itself or to the next one in the list. It's a transition design pattern used in CTC, Transducer, Kaldi and some clustering models: you have a sequence of T characters and you want the decoder to go through all of them, so you make an HMM that corresponds to a regex with a + sign after each character.
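A small sketch of such a transition matrix. The function name `linear_chain_transitions` and the choice to make the final state absorbing are assumptions for illustration.

```python
import numpy as np

def linear_chain_transitions(num_states, stay=0.5):
    """Each state loops on itself with probability `stay` or moves to the next state."""
    trans = np.zeros((num_states, num_states))
    idx = np.arange(num_states - 1)
    trans[idx, idx] = stay            # self-loop
    trans[idx, idx + 1] = 1.0 - stay  # advance to the next state
    trans[-1, -1] = 1.0               # assumption: the final state absorbs
    return trans
```

For the characters "abc" this corresponds to the regex `a+b+c+`, with one state per character.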

The transition matrix also requires transition probabilities, but what do you put there? 0.5 0.5? 0.9 0.1? I used to default to the latter; however, my alignment procedure then spent most of its time in the final state, failing to align the rest.

Turns out the transition probabilities directly affect state duration: the expected duration of each state is E[d] = 1/(1-p), where p is the self-transition probability (the probability of staying in the state).

This quantity is the mean of the geometric distribution. It arises when you run a sequence of coin flips that come up "stay" with probability p and count the flips up to and including the first "leave". On average that takes 1/(1-p) flips.
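For completeness, the mean follows from a standard series identity. Here d counts the frames spent in a state and p is the probability of staying in it for one more frame:

```latex
P(d = k) = p^{k-1}(1 - p), \quad k = 1, 2, \dots
\qquad
E[d] = \sum_{k=1}^{\infty} k \, p^{k-1} (1 - p) = \frac{1 - p}{(1 - p)^2} = \frac{1}{1 - p}
```

So p = 0.9 gives an expected duration of 10 frames and p = 3/4 gives 4 frames.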

In the screenshot I start with a trained acoustic model but set the transition probabilities to 3/4 1/4 (printed at the top).

The decoder assigns one frame to each of the first two states and then spends all of its time in the final state (second ali plot on the bottom), even though the observation map looks reasonable (see right plot). The error goes away when I initialize state durations by solving E[d] = 1/(1-p) for p, i.e. p = 1 - 1/d, letting d be the total duration of the sequence divided by the number of states.
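The duration-based initialization fits in a few lines. This is a sketch under the assumption that every state gets the same expected duration. The function name and the handling of sequences shorter than the number of states are hypothetical.

```python
def stay_probability(num_frames, num_states):
    """Self-transition probability whose expected state duration is num_frames / num_states."""
    d = num_frames / num_states
    if d <= 1.0:
        # Assumption: with at most one frame per state there is no room to stay.
        return 0.0
    return 1.0 - 1.0 / d
```

For example, 300 frames over 3 states gives d = 100 and a stay probability of about 0.99, much closer to 1 than the 0.9 default.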

Animating the Linde-Buzo-Gray clustering algorithm. It starts with a single cluster describing all the data (i.e. just taking the mean), doubles the codebook using a small perturbation and refines the clustering using the Lloyd (aka "k means") algorithm. I am plotting the reconstruction loss, the entropy of the assignments (it should be close to log K) and the codebook utilization. At the end there are 5 dead entries in the codebook. This problem gets bigger with larger K and needs to be worked on.

Each picture is the codebook after doubling the number of clusters.
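A compact sketch of the loop described above, together with the three plotted quantities. It is not the code behind the animation: it uses full-batch Lloyd on small data, an additive perturbation scaled by the per-dimension standard deviation, and a target size that is a power of two. All function names and defaults are assumptions.

```python
import numpy as np

def assign(x, codebook):
    """Nearest-entry index for each row of x and the mean squared reconstruction error."""
    d2 = ((x[:, None, :] - codebook[None, :, :]) ** 2).sum(-1)
    idx = d2.argmin(axis=1)
    return idx, d2[np.arange(len(x)), idx].mean()

def lloyd(x, codebook, iters=10):
    """Full-batch k-means refinement. Entries with no data stay where they are."""
    codebook = codebook.copy()
    for _ in range(iters):
        idx, _ = assign(x, codebook)
        counts = np.bincount(idx, minlength=len(codebook))
        sums = np.zeros_like(codebook)
        np.add.at(sums, idx, x)
        live = counts > 0
        codebook[live] = sums[live] / counts[live, None]
    return codebook

def codebook_stats(x, codebook):
    """Reconstruction loss, assignment entropy in nats, and fraction of entries in use."""
    idx, loss = assign(x, codebook)
    counts = np.bincount(idx, minlength=len(codebook))
    p = counts[counts > 0] / counts.sum()
    entropy = -(p * np.log(p)).sum()  # equals log K when usage is uniform
    utilization = (counts > 0).mean()  # dead entries lower this below 1
    return loss, entropy, utilization

def lbg(x, target_size, perturb=1e-2, iters=10):
    """Start from the global mean, then repeatedly split every entry in two and refine."""
    codebook = x.mean(axis=0, keepdims=True)
    eps = perturb * x.std(axis=0)  # assumption: additive perturbation
    while len(codebook) < target_size:
        codebook = np.concatenate([codebook + eps, codebook - eps])
        codebook = lloyd(x, codebook, iters)
    return codebook
```

Calling `codebook_stats` after each doubling gives the three curves. Dead entries show up as utilization below 1 and entropy below log K.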

Kazuki Irie's new paper, "Key-value memory in the brain": the brain uses keys to access memories but cannot access keys themselves https://arxiv.org/abs/2501.02950

Pretty cool that, in a progressive superposition of MLPs, scaling the activations of the older MLP by a single large value is enough to restore access to the old task knowledge written in the "old" MLP.

SGD seems to be incentivized to find larger weights for the second MLP in the continual training scenario so its activations always "beat" the activations of the first network.

I am referring to this experiment: https://github.com/kazuki-irie/kv-memory-brain/blob/master/Forgetting_and_recovery.ipynb

When training neural networks with SGD, the learning rate critically depends on the batch size. If you change one parameter, you usually need to sweep the other. This is a heat map that measures log(1-accuracy) on the test set as a function of width, batch size and learning rate.

Decided to collect a few tokens from telegram for the UNLP shared task pretraining. Will 1B be enough? At 350M words now.

baseline.png

532.6 KB

I ran a baseline for the UNLP 2025 shared task using gpt-4o-mini-2024-07-18 and gpt-4o-2024-08-06 with a basic prompt and structured outputs asking for reasoning and labels as binary flags. The macro F1 scores for technique detection are 0.32 and 0.34 respectively; the mini model tends to trade precision for extra recall.
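For reference, this is how macro F1 over binary flags is usually computed: one F1 per label, then an unweighted mean. This is a sketch and not the shared task's official scorer. The function name and the choice to score a label with no positives and no predictions as 0 are assumptions.

```python
def macro_f1(y_true, y_pred):
    """y_true, y_pred: lists of rows, each row a list of 0/1 flags with one flag per label."""
    num_labels = len(y_true[0])
    scores = []
    for k in range(num_labels):
        tp = fp = fn = 0
        for t, p in zip(y_true, y_pred):
            tp += t[k] * p[k]        # predicted and present
            fp += (1 - t[k]) * p[k]  # predicted but absent: hurts precision
            fn += t[k] * (1 - p[k])  # present but missed: hurts recall
        denom = 2 * tp + fp + fn
        # Assumption: a label with no positives and no predictions scores 0.
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / num_labels
```

A model that over-predicts a label raises `fp` and lowers `fn`, which is the precision-for-recall trade mentioned above.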

> So, everyone has already registered for the UNLP shared task, what are you waiting for?

> Register for the UNLP shared task, or you'll be a loser

> So why would you tag words in a sentence with BERT? It's just an encoder. Encoders are a limited architecture, you can't build AGI with them
