Vol Building AGI

Heatmap of additively smoothed log probabilities of character bigrams, log p(target | source), in Common Voice uk 10.0. Every word is padded with a space on each side.

що looks like a very popular word. ї is missing entirely! You can see vowels making up distinct columns and rows.
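A minimal sketch of how such a bigram table could be computed, assuming plain add-alpha smoothing over the observed alphabet. The function name `bigram_logprobs` and the value of `alpha` are illustrative assumptions; the post doesn't state the smoothing constant.

```python
import numpy as np

def bigram_logprobs(words, alpha=1.0):
    # Pad every word with a space on each side, as described in the post.
    padded = [f" {w} " for w in words]
    chars = sorted(set("".join(padded)))
    index = {c: i for i, c in enumerate(chars)}
    counts = np.zeros((len(chars), len(chars)))
    for w in padded:
        for src, tgt in zip(w, w[1:]):
            counts[index[src], index[tgt]] += 1
    # Additive smoothing: add alpha (assumed value) to every cell,
    # then normalize each source row into a distribution over targets.
    smoothed = counts + alpha
    return chars, np.log(smoothed / smoothed.sum(axis=1, keepdims=True))

chars, logp = bigram_logprobs(["що", "що", "це"])
```

Rows are source characters and columns are target characters, so `logp[i, j]` is log p(chars[j] | chars[i]).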


Plotting columns of the DFT basis with matplotlib looks very vibrant out of the box. A small enough ratio of the window size to the sampling rate allows for stripes to reveal underlying higher frequency filters at the back.


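import numpy as np

def dft(size=512, rate=16000, low=50):
    # Analysis frequencies in Hz, from `low` up to (but excluding) Nyquist.
    k = np.linspace(low, rate / 2, size, endpoint=False)
    # Sample times in seconds for one window of `size` samples.
    t = np.arange(size) / rate
    return np.exp(-2j * np.pi * k[:, None] * t)  # (size, size) complex basis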


Marcus Hutter has provided a recipe to build an agent that provably solves any problem. He wrote a new book about it: https://x.com/mhutter42/status/1871426793380688255?s=46&t=qNUYWfgTfF4u1RKfN0ir3A

X (formerly Twitter)

Marcus Hutter (@mhutter42) on X

Santa Arrived! The PDF (of a colorful Xmas version) of the "Introduction to Universal AI" book is now freely available online at https://t.co/r9COEBnf4S Wishing you all joyful reading, Merry Xmas & a :-) New Year.


Why you want high-precision accumulation when doing low-precision computations
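A small illustration of the effect, assuming a naive running sum (this is not the plot from the post). The helper `accumulate` is a hypothetical name. The inputs are float16 in both runs; only the accumulator dtype changes.

```python
import numpy as np

def accumulate(values, acc_dtype):
    # Naive running sum with the accumulator kept in acc_dtype.
    acc = acc_dtype(0)
    for v in values:
        acc = acc_dtype(acc + acc_dtype(v))
    return float(acc)

values = np.full(10_000, 0.01, dtype=np.float16)  # low-precision inputs
low = accumulate(values, np.float16)    # low-precision accumulator
high = accumulate(values, np.float32)   # high-precision accumulator
print(low, high)
```

The true total is about 100. Once the float16 accumulator is large enough that 0.01 is less than half the gap between neighbouring representable values, every further addition rounds back to the same number, so the sum stalls far below the true value. The float32 accumulator stays close to 100.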


I installed ghostty so my terminal can render images and have fragment shaders applied on the whole window.


Let's animate the process of extracting a 13-coefficient Mel-Frequency Cepstral Coefficient (MFCC) spectrogram from an MP3 file.
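For reference, a rough NumPy-only sketch of the usual MFCC pipeline: framing, windowing, power spectrum, mel filterbank, log, DCT. All parameter values (`n_fft`, `hop`, `n_mels`) and function names are assumptions for illustration, not the settings behind the animation, and MP3 decoding is left out, so the input is a raw sample array.

```python
import numpy as np

def hz_to_mel(f):
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mel_filterbank(n_mels, n_fft, rate):
    # Triangular filters whose edges are equally spaced on the mel scale.
    edges = mel_to_hz(np.linspace(0.0, hz_to_mel(rate / 2), n_mels + 2))
    freqs = np.fft.rfftfreq(n_fft, d=1.0 / rate)
    lo, mid, hi = edges[:-2, None], edges[1:-1, None], edges[2:, None]
    up = (freqs - lo) / (mid - lo)
    down = (hi - freqs) / (hi - mid)
    return np.clip(np.minimum(up, down), 0.0, None)  # (n_mels, n_fft // 2 + 1)

def mfcc(signal, rate=16000, n_fft=512, hop=160, n_mels=40, n_ceps=13):
    # Slice the signal into overlapping frames and apply a Hann window.
    n_frames = 1 + (len(signal) - n_fft) // hop
    idx = np.arange(n_fft) + hop * np.arange(n_frames)[:, None]
    frames = signal[idx] * np.hanning(n_fft)
    # Power spectrum -> mel energies -> log (small constant avoids log(0)).
    power = np.abs(np.fft.rfft(frames, axis=1)) ** 2
    logmel = np.log(power @ mel_filterbank(n_mels, n_fft, rate).T + 1e-10)
    # Unnormalized DCT-II over the mel axis; keep the first n_ceps coefficients.
    n = np.arange(n_mels)
    dct = np.cos(np.pi * (n + 0.5) * np.arange(n_ceps)[:, None] / n_mels)
    return logmel @ dct.T  # (n_frames, n_ceps)

t = np.arange(16000) / 16000
features = mfcc(np.sin(2 * np.pi * 440.0 * t))  # one second of a 440 Hz tone
```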


To fit Gaussian mixture models with 16384 components on 13-dimensional MFCC frames using expectation maximization, I need to initialize the mixtures. The simplest data-driven initializer for GMMs is taking cluster centroids.

I decided to compare three algorithms for clustering:

1. random sampling (usually a decent init for other algorithms, no hyperparameters)
2. minibatch Lloyd (performs EMA updates on minibatches, has one EMA weight hyperparameter)
3. Linde-Buzo-Gray (LBG) with minibatch Lloyd refinement, a classic algorithm for computing quantization codebooks. Its main trick is to start with a codebook of size 1 and progressively double the number of clusters by perturbing the current set and refining it with k-means.
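A minimal sketch of the second option, minibatch Lloyd with EMA updates. The function name, the batch size, and the reading of the EMA weight as a decay (how much of the old centroid is kept) are all assumptions, not taken from the actual implementation. Distances use plain broadcasting for readability.

```python
import numpy as np

def minibatch_lloyd(data, centroids, steps=100, batch=256, decay=0.9, seed=0):
    rng = np.random.default_rng(seed)
    centroids = np.array(centroids, dtype=float)  # work on a float copy
    for _ in range(steps):
        x = data[rng.integers(0, len(data), size=batch)]
        # Assign every point in the minibatch to its nearest centroid.
        dists = ((x[:, None, :] - centroids[None, :, :]) ** 2).sum(-1)
        nearest = dists.argmin(axis=1)
        # Move each centroid that received points toward the minibatch mean.
        for k in np.unique(nearest):
            batch_mean = x[nearest == k].mean(axis=0)
            centroids[k] = decay * centroids[k] + (1.0 - decay) * batch_mean
    return centroids
```

Random sampling, the first option, would then just be `data[rng.choice(len(data), K, replace=False)]` passed in as `centroids`.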

I have 11 million frames from Common Voice uk 10.0, so Lloyd's algorithm (classical k-means) and SVD/QR are out: they require materializing matrices that are a bit too big for my MacBook.

In the plot, the x-axis is the number of steps and the y-axis is the quantization loss.


I found that LBG is very compute-efficient: it spends most of its time running k-means on small clusterings, so it wins in terms of wall clock (ballpark 10x faster?) in my CPU-only, pure-NumPy implementation with efficient L2 distance computation. It also seems to be more data-efficient: I couldn't get better results with Lloyd even when I ran it for more steps.
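The post doesn't say which trick the distance computation uses. A common one, assumed here, expands the squared norm so the whole distance matrix comes out of a single matrix multiply instead of an (N, K, D) intermediate. The name `sq_dists` is illustrative.

```python
import numpy as np

def sq_dists(x, c):
    # ||x - c||^2 = ||x||^2 - 2 x.c + ||c||^2 for every (point, centroid) pair.
    x_sq = (x * x).sum(axis=1)[:, None]   # (N, 1)
    c_sq = (c * c).sum(axis=1)[None, :]   # (1, K)
    # Clamp tiny negatives caused by floating-point cancellation.
    return np.maximum(x_sq - 2.0 * (x @ c.T) + c_sq, 0.0)  # (N, K)

rng = np.random.default_rng(0)
frames = rng.normal(size=(1000, 13))    # stand-in for 13-dimensional MFCC frames
codebook = rng.normal(size=(64, 13))
nearest = sq_dists(frames, codebook).argmin(axis=1)
```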

I didn't bother tuning the EMA learning rate too much and settled at 0.9. Progressive scaling is all we need?


[Video, 0:14]

Alignment self-training can work even with a single utterance. In this video, expectation maximization for a GMM acoustic model with a linear chain HMM successfully finds a plausible alignment in 30 steps.

My algorithm only updates the GMM mixture coefficients in the maximization step. The expectation step uses my implementation of the scaled forward-backward recursions.

The 1024 GMM means are pretrained using LBG; the HMM prior is a linear chain, similar to what you see on the right-hand side of the plot.
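For readers who haven't seen them, here is a sketch of the textbook scaled forward-backward recursions. This is the standard formulation with per-frame normalizers, not the actual implementation from the video. `obs[t, s]` is assumed to hold the likelihood p(x_t | state s), which in this setup would come from the GMM.

```python
import numpy as np

def forward_backward(init, trans, obs):
    # init: (S,), trans: (S, S) row-stochastic, obs: (T, S) frame likelihoods.
    T, S = obs.shape
    alpha = np.zeros((T, S))
    beta = np.ones((T, S))
    scale = np.zeros(T)
    # Forward pass: renormalize at every frame and remember the normalizer.
    alpha[0] = init * obs[0]
    scale[0] = alpha[0].sum()
    alpha[0] /= scale[0]
    for t in range(1, T):
        alpha[t] = (alpha[t - 1] @ trans) * obs[t]
        scale[t] = alpha[t].sum()
        alpha[t] /= scale[t]
    # Backward pass reuses the same normalizers so the products stay bounded.
    for t in range(T - 2, -1, -1):
        beta[t] = trans @ (obs[t + 1] * beta[t + 1]) / scale[t + 1]
    gamma = alpha * beta  # state posteriors; each row sums to 1
    return gamma, np.log(scale).sum()  # posteriors and log-likelihood
```

The rows of `gamma` are the soft alignment: the posterior over HMM states for each frame.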


Making regularized updates to the transition matrix for 10 more steps gets the silence boundaries much better. Without the regularization, which gently mixes the prior and the posterior, the update would completely derail the optimization.
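One way such a mixing update could look, as a guess: a convex combination of the prior transition matrix and the row-normalized expected transition counts from the E-step. The function name and the `weight` value are hypothetical; the post doesn't give the exact form of the regularizer.

```python
import numpy as np

def regularized_transition_update(prior, expected_counts, weight=0.1):
    # Row-normalize the expected transition counts into a posterior estimate.
    row_sums = expected_counts.sum(axis=1, keepdims=True)
    posterior = np.where(
        row_sums > 0,
        expected_counts / np.maximum(row_sums, 1e-12),
        prior,  # rows with no expected mass fall back to the prior
    )
    # Convex combination: rows still sum to 1, zeros of the chain stay zero.
    return (1.0 - weight) * prior + weight * posterior

prior = np.array([[0.75, 0.25], [0.0, 1.0]])
counts = np.array([[9.0, 1.0], [0.0, 0.0]])
updated = regularized_transition_update(prior, counts)
```

A small `weight` keeps each step close to the prior, which matches the "gently mixing" description.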

Still using just one example.


My holiday's ending so I'll share one last observation about HMMs and post an animation of LBG clustering.

A linear chain HMM has a transition matrix that allows every state to go to itself or to the next one in the list. It's a transition design pattern used in CTC, Transducer, Kaldi and some clustering models: you have a sequence of T characters and you want the decoder to go through all of them, so you make an HMM that corresponds to the regex with a + sign after each character.

The transition matrix also requires transition probabilities, but what do you put there? 0.5 0.5? 0.9 0.1? I used to default to the latter; however, my alignment procedure then spent most of its time in the final state, failing to align the rest.
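A minimal sketch of that transition matrix. The name `linear_chain` and the choice to make the final state absorbing (self-loop with probability 1) are assumptions made so that every row sums to 1.

```python
import numpy as np

def linear_chain(n_states, stay=0.5):
    # Every state either repeats (probability `stay`) or moves to the next one.
    trans = np.zeros((n_states, n_states))
    idx = np.arange(n_states - 1)
    trans[idx, idx] = stay
    trans[idx, idx + 1] = 1.0 - stay
    trans[-1, -1] = 1.0  # assumed: the last state can only repeat
    return trans

print(linear_chain(3, stay=0.9))
```

For the three characters "abc" this is the automaton for the regex `a+b+c+`: each character is emitted at least once and the path can never skip or go back.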


Turns out the transition probabilities directly affect state duration: the expected duration of each state is E[d] = 1/(1-p), where p is the self-transition probability, i.e. the probability of staying in the same state.

This quantity is the mean of the geometric distribution. It arises when you flip a coin that says "stay" with probability p and count the number of flips until it first says "leave". On average it's 1/(1-p).

In the screenshot I start with a trained acoustic model but set the transition probabilities to 3/4 and 1/4 (printed at the top).

The decoder assigns one frame to each of the first two states and then spends all of its time in the final state (second ali plot at the bottom), even though the observation map looks reasonable (see right plot). The error goes away when I initialize state durations by solving for p, letting d be the total duration of the sequence divided by the number of states.
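Solving E[d] = 1/(1-p) for p gives p = 1 - 1/d. A small sketch of that initialization, with a quick simulation of the geometric mean. The function names are illustrative, and the 400 frames and 4 states are made-up numbers.

```python
import numpy as np

def expected_duration(stay):
    # Mean number of frames spent in a state with self-transition probability `stay`.
    return 1.0 / (1.0 - stay)

def stay_probability(n_frames, n_states):
    # Target duration per state, then invert E[d] = 1 / (1 - p).
    d = n_frames / n_states
    if d < 1.0:
        raise ValueError("need at least one frame per state")
    return 1.0 - 1.0 / d

rng = np.random.default_rng(0)
# Frames spent in a state = number of trials until the first "leave" event,
# where leaving has probability 1 - 0.75.
simulated = rng.geometric(1.0 - 0.75, size=200_000).mean()

print(expected_duration(0.75), simulated, stay_probability(400, 4))
```

With 3/4 and 1/4 each state expects to last only 4 frames, far too short for a long utterance with a handful of states, which is consistent with the decoder rushing to the final state.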


Animating the Linde-Buzo-Gray clustering algorithm. It starts with a single cluster describing all the data (i.e. just taking the mean), doubles it using a small perturbation and refines the clustering using the Lloyd (aka "k means") algorithm. I am plotting the reconstruction loss, the entropy (entropy should be close to log K) and the codebook utilization. At the end there are 5 dead entries in the codebook — this problem gets bigger with larger K and needs to be worked on.

Each picture is the codebook after doubling the number of clusters.
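A compact sketch of the loop described above, under some simplifying assumptions: full-batch Lloyd refinement instead of the minibatch variant, a fixed additive perturbation `eps`, and dead codebook entries simply left where they are. All function names are illustrative.

```python
import numpy as np

def assign(data, codebook):
    # Nearest codebook entry per point and the mean squared reconstruction error.
    d = ((data[:, None, :] - codebook[None, :, :]) ** 2).sum(-1)
    return d.argmin(axis=1), d.min(axis=1).mean()

def lloyd(data, codebook, iters=10):
    codebook = codebook.copy()
    for _ in range(iters):
        labels, _ = assign(data, codebook)
        for k in range(len(codebook)):
            members = data[labels == k]
            if len(members):  # dead entries keep their old position
                codebook[k] = members.mean(axis=0)
    return codebook

def lbg(data, n_doublings=3, eps=1e-2, iters=10):
    # Start from one cluster (the global mean), then split and refine.
    codebook = data.mean(axis=0, keepdims=True)
    for _ in range(n_doublings):
        codebook = np.concatenate([codebook + eps, codebook - eps])
        codebook = lloyd(data, codebook, iters)
    return codebook

def usage_stats(data, codebook):
    # Reconstruction loss, assignment entropy, and number of dead entries.
    labels, loss = assign(data, codebook)
    counts = np.bincount(labels, minlength=len(codebook))
    p = counts / counts.sum()
    entropy = -(p[p > 0] * np.log(p[p > 0])).sum()
    return loss, entropy, int((counts == 0).sum())
```

`usage_stats` returns the three quantities tracked in the animation; the entropy reaches log K only when every entry is used equally often.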


Kazuki Irie’s new paper: the brain uses keys to access memories but cannot access keys themselves https://arxiv.org/abs/2501.02950

arXiv.org

Key-value memory in the brain

Classical models of memory in psychology and neuroscience rely on similarity-based retrieval of stored patterns, where similarity is a function of retrieval cues and the stored patterns. While...
