Vol Building AGI
Past topics: speech synthesis, transformers, LSTM, recurrence
Why you want high-precision accumulation while doing low-precision computations
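A small sketch of the effect, assuming numpy and a toy workload (10,000 float16 values of about 0.01) that is not from the original post. The helper name `accumulate` is hypothetical. It sums the same low-precision inputs twice, once with a float16 running total and once with a float32 one.

```python
import numpy as np

def accumulate(values, acc_dtype):
    """Sum values sequentially, rounding the running total to acc_dtype after every add."""
    acc = acc_dtype(0.0)
    for v in values:
        acc = acc_dtype(acc + acc_dtype(v))
    return float(acc)

# Low-precision inputs (toy data, an assumption for illustration).
values = np.full(10_000, 0.01, dtype=np.float16)

reference = float(values.astype(np.float64).sum())
low = accumulate(values, np.float16)   # low-precision accumulator
high = accumulate(values, np.float32)  # higher-precision accumulator

print(f"reference={reference:.3f} fp16_acc={low:.3f} fp32_acc={high:.3f}")
```

Once the float16 total grows large enough that half of its spacing between representable numbers exceeds the addend, every further add rounds back to the same total and the sum stalls. The float32 accumulator stays close to the reference.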
I installed ghostty so my terminal can render images and have fragment shaders applied on the whole window.
Let's animate the process of extracting a spectrogram of 13 Mel-Frequency Cepstral Coefficients (MFCC) from an MP3 file.
To update Gaussian mixture models with 16384 components on 13-dimensional MFCC frames using expectation maximization, I need to initialize the mixtures. The simplest data-driven initializer for GMMs is taking cluster centroids.
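A minimal numpy sketch of the usual MFCC steps (framing, windowing, power spectrum, mel filterbank, log, DCT), assuming the MP3 has already been decoded to a mono float waveform; decoding is not shown. The frame length, hop, number of mel filters and the unnormalized DCT-II are illustrative assumptions, not necessarily the settings used for the animation.

```python
import numpy as np

def hz_to_mel(hz):
    return 2595.0 * np.log10(1.0 + hz / 700.0)

def mel_to_hz(mel):
    return 700.0 * (10.0 ** (mel / 2595.0) - 1.0)

def mel_filterbank(num_filters, n_fft, sample_rate):
    """Triangular filters spaced evenly on the mel scale, shape (num_filters, n_fft // 2 + 1)."""
    mel_points = np.linspace(hz_to_mel(0.0), hz_to_mel(sample_rate / 2), num_filters + 2)
    hz_points = mel_to_hz(mel_points)
    fft_freqs = np.fft.rfftfreq(n_fft, d=1.0 / sample_rate)
    bank = np.zeros((num_filters, len(fft_freqs)))
    for i in range(num_filters):
        lo, mid, hi = hz_points[i], hz_points[i + 1], hz_points[i + 2]
        rising = (fft_freqs - lo) / (mid - lo)
        falling = (hi - fft_freqs) / (hi - mid)
        bank[i] = np.maximum(0.0, np.minimum(rising, falling))
    return bank

def mfcc(signal, sample_rate, n_fft=400, hop=160, num_filters=40, num_coeffs=13):
    """Return an array of shape (num_frames, num_coeffs)."""
    num_frames = 1 + (len(signal) - n_fft) // hop
    idx = np.arange(n_fft)[None, :] + hop * np.arange(num_frames)[:, None]
    frames = signal[idx] * np.hanning(n_fft)            # windowed frames
    power = np.abs(np.fft.rfft(frames, axis=1)) ** 2    # power spectrum per frame
    bank = mel_filterbank(num_filters, n_fft, sample_rate)
    log_mel = np.log(power @ bank.T + 1e-10)            # log mel energies
    # Unnormalized DCT-II over the mel axis, keeping the first num_coeffs rows.
    n = np.arange(num_filters)
    k = np.arange(num_coeffs)
    basis = np.cos(np.pi / num_filters * (n[None, :] + 0.5) * k[:, None])
    return log_mel @ basis.T
```

For a one-second 16 kHz waveform `x`, `mfcc(x, 16000)` gives one 13-dimensional row per 10 ms hop.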

I decided to compare three algorithms for clustering:

1. random sampling (usually a decent init for other algorithms, no hyperparameters)
2. minibatch Lloyd (performs EMA updates on mini batches, has one EMA weight hyperparameter)
3. Linde-Buzo-Gray (LBG) with minibatch Lloyd refinement — a classic algorithm for computing quantization codebooks. Its main trick is to start with a cluster of size 1, and progressively double the clustering size by perturbing the original set and refining it with k-means.
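One way the minibatch Lloyd step from item 2 could look in numpy. This is a sketch under assumptions: the function name `minibatch_lloyd_step` is hypothetical, and `ema` is taken to be the weight kept on the old centroid, which may differ from the convention used in the experiment.

```python
import numpy as np

def minibatch_lloyd_step(centroids, batch, ema=0.9):
    """One minibatch k-means step with exponential-moving-average centroid updates.

    centroids: (K, D) array, batch: (B, D) array.
    Returns the updated centroids and the batch quantization loss
    (mean squared distance to the nearest centroid, measured before the update).
    """
    dists = ((batch[:, None, :] - centroids[None, :, :]) ** 2).sum(-1)  # (B, K)
    nearest = dists.argmin(1)
    updated = centroids.copy()
    for k in np.unique(nearest):
        batch_mean = batch[nearest == k].mean(0)
        # Assumption: ema is the weight on the old centroid.
        updated[k] = ema * centroids[k] + (1.0 - ema) * batch_mean
    return updated, dists.min(1).mean()
```

Centroids that receive no points in a batch are left unchanged, which is one simple way to handle empty clusters.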

I have 11 million frames from Common Voice uk 10.0, so the full-batch Lloyd algorithm (classical k-means) and SVD/QR are out: they require materializing matrices that are a bit too big for my MacBook.

On the plot the X axis is the number of steps and the Y axis is the quantization loss.
I found that LBG is very compute efficient: it spends most of the time running k-means for small clusterings, so it is faster in wall-clock terms (ballpark 10x?) in my CPU-only pure-numpy implementation with efficient L2 distance computation. It also seems to be more data efficient: I couldn't get better results with Lloyd when I ran it for more steps.
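The post does not say how the L2 distances are computed. A common numpy approach, shown here as an assumption rather than the author's code, expands ||x - c||² = ||x||² - 2 x·c + ||c||² so the heavy part becomes one matrix multiply instead of a (N, K, D) intermediate. The name `pairwise_sq_dists` is hypothetical.

```python
import numpy as np

def pairwise_sq_dists(x, c):
    """Squared Euclidean distances between rows of x (N, D) and rows of c (K, D)."""
    x_sq = (x * x).sum(1)[:, None]   # (N, 1)
    c_sq = (c * c).sum(1)[None, :]   # (1, K)
    d = x_sq - 2.0 * (x @ c.T) + c_sq
    # Rounding can push tiny distances slightly below zero, so clip.
    return np.maximum(d, 0.0)
```

Nearest-centroid assignment is then `pairwise_sq_dists(x, c).argmin(1)`.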

I didn't bother tuning the EMA learning rate too much and settled at 0.9. Progressive scaling is all we need?
Alignment self-training can work even with a single utterance. In this video expectation maximization for a GMM acoustic model with a linear chain HMM successfully finds a plausible alignment in 30 steps.

My algorithm only updates the GMM mixture coefficients in the maximization step. The expectation step uses my implementation of scaled forward-backward recursions.
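For reference, a textbook version of the scaled forward-backward recursions in numpy. This is a sketch, not the implementation from the post: it assumes per-frame observation likelihoods are already computed (for example from the GMM), and the function name and argument layout are hypothetical.

```python
import numpy as np

def forward_backward(obs_lik, trans, init):
    """Scaled forward-backward pass.

    obs_lik: (T, S) likelihood of each frame under each state.
    trans:   (S, S) row-stochastic transition matrix.
    init:    (S,) initial state distribution.
    Returns per-frame state posteriors gamma (T, S) and the log-likelihood.
    """
    T, S = obs_lik.shape
    alpha = np.zeros((T, S))
    beta = np.zeros((T, S))
    scale = np.zeros(T)

    alpha[0] = init * obs_lik[0]
    scale[0] = alpha[0].sum()
    alpha[0] /= scale[0]
    for t in range(1, T):
        alpha[t] = (alpha[t - 1] @ trans) * obs_lik[t]
        scale[t] = alpha[t].sum()   # rescale every frame to avoid underflow
        alpha[t] /= scale[t]

    beta[-1] = 1.0
    for t in range(T - 2, -1, -1):
        beta[t] = (trans @ (obs_lik[t + 1] * beta[t + 1])) / scale[t + 1]

    gamma = alpha * beta            # rows sum to 1 with this scaling
    log_lik = np.log(scale).sum()
    return gamma, log_lik
```

The scaling factors double as the per-frame likelihood terms, so their logs sum to the sequence log-likelihood.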

1024 GMM means are pretrained using LBG, the HMM prior is a linear chain — similar to what you see on the right hand side in the plot.
Making regularized updates to the transition matrix for 10 more steps makes it get the silence bounds much better. Without the regularization, which gently mixes the prior and the posterior, the update would completely derail the optimization.

Still using just one example.
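A sketch of what "gently mixing the prior and the posterior" could look like for the transition matrix. The exact regularizer is not given in the post, so the linear interpolation, the `mix` weight and the function name below are all assumptions.

```python
import numpy as np

def regularized_transition_update(prior, expected_counts, mix=0.1):
    """Interpolate the prior transition matrix with the re-estimated one.

    prior:           (S, S) row-stochastic matrix (e.g. the linear chain).
    expected_counts: (S, S) expected transition counts from the E-step.
    mix:             weight on the re-estimated matrix (hypothetical parameter).
    """
    counts = np.asarray(expected_counts, dtype=float)
    row_sums = counts.sum(1, keepdims=True)
    # Rows with no expected transitions fall back to the prior row.
    posterior = np.divide(counts, row_sums, out=prior.copy(), where=row_sums > 0)
    return (1.0 - mix) * prior + mix * posterior
```

Because both matrices are row-stochastic, the mixture is too, and transitions that are zero in both stay zero.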
My holiday's ending so I'll share one last observation about HMMs and post an animation of LBG clustering.

A linear chain HMM has a transition matrix that allows every state to go to itself or to the next one in the list. It's a transition design pattern used in CTC, Transducer, Kaldi and some clustering models: you have a sequence of T characters and you want the decoder to go through all of them, so you make an HMM that corresponds to the regex with a + sign after each character.
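The structure described above can be sketched in numpy as follows. The function name, the single shared `p_stay` and the absorbing final state are assumptions made for illustration.

```python
import numpy as np

def linear_chain_transitions(num_states, p_stay=0.5):
    """Transition matrix where each state either stays or moves to the next state."""
    trans = np.zeros((num_states, num_states))
    for s in range(num_states - 1):
        trans[s, s] = p_stay            # self-loop: the "+" in the regex
        trans[s, s + 1] = 1.0 - p_stay  # advance to the next character
    trans[-1, -1] = 1.0                 # assumption: the final state is absorbing
    return trans

print(linear_chain_transitions(3, p_stay=0.5))
```

Everything outside the main diagonal and the first superdiagonal is zero, so the decoder can never skip a state or go backwards.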

The transition matrix also requires transition probabilities, but what do you put there? 0.5 0.5? 0.9 0.1? I used to default to the latter, but then my alignment procedure spent most of its time in the final state and failed to align the rest.
Turns out the transition probabilities directly affect state duration: the expected duration of each state is E[d] = 1/(1-p), where p is the probability of staying in the state (the self-transition probability).

This quantity is the mean of the geometric distribution. It arises when you flip a coin that says "stay" with probability p and count the flips up to and including the first "leave". On average that takes 1/(1-p) flips.
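Spelled out, with p taken to be the probability of staying in the state on each frame (the convention under which the formula holds), the duration d follows a geometric distribution:

```latex
P(d = k) = p^{\,k-1}(1 - p), \qquad k = 1, 2, \dots

\mathbb{E}[d] = \sum_{k=1}^{\infty} k \, p^{\,k-1} (1 - p)
             = \frac{1 - p}{(1 - p)^2}
             = \frac{1}{1 - p}
```

So p = 0.9 means about 10 frames per state on average, and p = 3/4 means 4.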

On the screenshot I start with a trained acoustic model but set the transition probabilities to 3/4 1/4 (printed at the top).

The decoder assigns one frame to each of the first two states and then spends all of its time in the final state (second ali plot on the bottom), even though the observation map looks reasonable (see right plot). The error goes away when I initialize state durations by solving for p letting d be the total duration of the sequence divided by the number of states.
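The duration-based initialization described above amounts to inverting E[d] = 1/(1-p). A minimal sketch, with hypothetical function names and p interpreted as the self-transition probability:

```python
def expected_duration(p_stay):
    """Mean number of frames spent in a state with self-transition probability p_stay."""
    return 1.0 / (1.0 - p_stay)

def stay_probability(num_frames, num_states):
    """Pick p_stay so each state's expected duration is num_frames / num_states."""
    d = num_frames / num_states
    if d <= 1.0:
        raise ValueError("need more frames than states for a self-loop probability in (0, 1)")
    return 1.0 - 1.0 / d

# Example with made-up sizes: 300 frames aligned to 3 states.
p = stay_probability(300, 3)
print(p, expected_duration(p))
```

With this choice the self-loop probability grows with the utterance length, so long utterances no longer rush through the early states.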
Animating the Linde-Buzo-Gray clustering algorithm. It starts with a single cluster describing all the data (i.e. just taking the mean), doubles it using a small perturbation and refines the clustering using the Lloyd (aka "k means") algorithm. I am plotting the reconstruction loss, the entropy (entropy should be close to log K) and the codebook utilization. At the end there are 5 dead entries in the codebook — this problem gets bigger with larger K and needs to be worked on.

Each picture is the codebook after doubling the number of clusters.
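A compact numpy sketch of the procedure and the three plotted quantities. It is not the code behind the animation: it uses full-batch Lloyd refinement (fine for toy data), an additive perturbation of size `eps` times the per-dimension standard deviation, and assumes the target size is a power of two. All names are hypothetical.

```python
import numpy as np

def assign(data, codebook):
    """Nearest-codeword index per point and the mean squared reconstruction loss."""
    d = ((data[:, None, :] - codebook[None, :, :]) ** 2).sum(-1)
    return d.argmin(1), d.min(1).mean()

def lloyd(data, codebook, iters=10):
    """Full-batch k-means refinement; empty clusters keep their old codeword."""
    codebook = codebook.copy()
    for _ in range(iters):
        idx, _ = assign(data, codebook)
        for k in range(len(codebook)):
            members = data[idx == k]
            if len(members) > 0:
                codebook[k] = members.mean(0)
    return codebook

def lbg(data, target_size, eps=1e-2, iters=10):
    """Start from the global mean, then repeatedly split every codeword and refine."""
    codebook = data.mean(0, keepdims=True)
    delta = eps * data.std(0)
    while len(codebook) < target_size:
        codebook = np.concatenate([codebook + delta, codebook - delta])
        codebook = lloyd(data, codebook, iters)
    return codebook

def codebook_stats(data, codebook):
    """Reconstruction loss, assignment entropy (at most log K), and utilization."""
    idx, loss = assign(data, codebook)
    counts = np.bincount(idx, minlength=len(codebook))
    probs = counts / counts.sum()
    nonzero = probs[probs > 0]
    entropy = -(nonzero * np.log(nonzero)).sum()
    utilization = (counts > 0).mean()   # fraction of codewords that are not dead
    return loss, entropy, utilization
```

A utilization below 1 corresponds to dead codebook entries, and the entropy reaches log K only when all codewords are used equally often.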
Pretty cool that in a progressive superposition of MLPs scaling activations for older MLPs just by a single large value is enough to restore access to the old task knowledge written in the "old" MLP.

SGD seems to be incentivized to find larger weights for the second MLP in the continual training scenario so its activations always "beat" the activations of the first network.

I am referring to this experiment: https://github.com/kazuki-irie/kv-memory-brain/blob/master/Forgetting_and_recovery.ipynb
When training neural networks with SGD, the learning rate critically depends on the batch size. If you change one parameter you usually need to sweep the other. This is a heat map that measures log(1-accuracy) on the test set as a function of width, batch size and learning rate.
Decided to collect a few tokens from Telegram for the UNLP shared task pretraining. Will 1B be enough? At 350M words now.
baseline.png
532.6 KB
I ran a baseline for the UNLP2025 shared task using gpt-4o-mini-2024-07-18 and gpt-4o-2024-08-06 with a basic prompt and structured outputs asking for reasoning and labels as binary flags. The macro F1 scores for technique detection are 0.32 and 0.34, respectively; the mini model tends to trade precision for extra recall.
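For readers unfamiliar with the metric, a small pure-Python sketch of macro F1 over binary technique flags. The label names "A" and "B" are placeholders, and scoring a label with no gold and no predicted positives as 0 is an assumption; the shared task's official scorer may differ.

```python
def f1(tp, fp, fn):
    """F1 from counts; defined as 0.0 when there are no positives at all (assumption)."""
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0

def macro_f1(gold, pred, labels):
    """Average per-label F1. gold and pred are lists of label sets, one set per example."""
    scores = []
    for label in labels:
        tp = sum(1 for g, p in zip(gold, pred) if label in g and label in p)
        fp = sum(1 for g, p in zip(gold, pred) if label not in g and label in p)
        fn = sum(1 for g, p in zip(gold, pred) if label in g and label not in p)
        scores.append(f1(tp, fp, fn))
    return sum(scores) / len(scores)

# Toy example with placeholder labels.
gold = [{"A"}, {"A", "B"}, set()]
pred = [{"A"}, {"B"}, {"B"}]
print(macro_f1(gold, pred, ["A", "B"]))
```

Because every label counts equally, rare techniques weigh as much as frequent ones, which is why a model that over-predicts (more recall, less precision) can land at a similar macro score.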
> So, everyone has already registered for the UNLP shared task, what are you waiting for?
> Register for the UNLP shared task, otherwise you'll be a loser
> So why would you tag words in a sentence with BERT, it's just an encoder. Encoders are a limited architecture, you can't build AGI with them