Professor Tom Griffiths of Princeton University explores the mathematical principles that form the foundation of both human and artificial intelligence, bridging the gap between two contrasting views of the human mind. While psychologists often highlight human irrationality and biases, computer scientists see human cognition as an inspiration for AI. Griffiths' work seeks to reconcile these perspectives by framing human intelligence as a rational adaptation to significant constraints.
The Laws of Thought: A Mathematical Theory of Mind
The core idea of Griffiths' book, The Laws of Thought, is that just as mathematical laws of nature describe the external, physical world, a complementary set of mathematical principles can describe our internal, mental world.
⦁ From Behaviorism to Cognitive Science: Early psychology struggled to scientifically study internal thoughts, leading to the rise of behaviorism, which focused only on observable behaviors. The cognitive revolution was made possible by the development of computers and mathematical concepts like logic and probability, which provided a new, rigorous language to form and test hypotheses about the mind.
⦁ Research Methodology: Modern cognitive science research often involves large-scale online experiments. In Griffiths' lab, participants are presented with problems that require them to make inferences or decisions from data. By analyzing the responses from thousands of participants using modern machine learning tools like neural networks, researchers can develop and refine computational models of human cognition.
Human vs. AI: A Tale of Two Intelligences
A key distinction between human and artificial intelligence lies in the constraints they operate under. Humans are limited by time (a finite lifespan), computation (a few pounds of neural tissue), and communication bandwidth. In contrast, AI systems can be scaled with more data and compute and can transfer information perfectly. This leads to fundamentally different problem-solving approaches.
⦁ Inductive Bias and The Data Gap: A human child learns a language in about five years, whereas an LLM requires the equivalent of thousands of years of text data. This vast difference highlights the powerful inductive biases, or priors, built into human cognition. These biases provide a starting framework that makes learning from sparse data possible.
⦁ The Machine Learning Paradigm: Since the success of AlexNet in 2012, the dominant paradigm in machine learning has been one of weak inductive biases and massive datasets. The philosophy is that with enough data, a sufficiently complex model can learn the necessary features and solutions without human-engineered priors. This is the opposite of the human approach.
⦁ Engineering Inductive Bias: To create more human-like AI, we may need to engineer these biases. Meta-learning is one such technique, where a model learns an optimal set of initial weights by being trained on a wide variety of tasks. This provides a "soft bias" that guides the model toward effective solutions without rigidly constraining it, making it better at few-shot learning.
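To make the "soft bias" idea concrete, here is a minimal first-order meta-learning sketch in the spirit of Reptile, fit to toy sine-wave tasks. The feature map, learning rates, and task distribution are illustrative assumptions, not details from the talk.

```python
# Minimal first-order meta-learning sketch (Reptile-style) on toy sine-wave
# tasks. The learned initialization w_meta acts as a "soft" inductive bias:
# it does not hard-code the solution, but it lets a few gradient steps on a
# new task go a long way. All hyperparameters here are illustrative.
import numpy as np

rng = np.random.default_rng(0)
D = 64
W_feat = rng.normal(size=(1, D))   # fixed random features, so the model is linear
b_feat = rng.normal(size=D)

def features(x):
    return np.tanh(x[:, None] * W_feat + b_feat)

def sample_task():
    """Each task is a sine wave with its own amplitude and phase."""
    amp, phase = rng.uniform(0.5, 2.0), rng.uniform(0, np.pi)
    return lambda x: amp * np.sin(x + phase)

def adapt(w, task, steps=10, lr=0.02, n=10):
    """Inner loop: a few gradient steps on a handful of examples from one task."""
    for _ in range(steps):
        x = rng.uniform(-3, 3, n)
        y = task(x)
        phi = features(x)
        w = w - lr * phi.T @ (phi @ w - y) / n
    return w

w_meta = np.zeros(D)
for _ in range(2000):                     # outer loop over many tasks
    w_adapted = adapt(w_meta.copy(), sample_task())
    w_meta += 0.1 * (w_adapted - w_meta)  # nudge the init toward fast adapters
```

After meta-training, calling adapt() on a brand-new sine task starting from w_meta should converge noticeably faster than starting from zeros, which is the few-shot effect described above.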
Deconstructing Large Language Models
Griffiths' research provides a scientific lens for understanding the behavior of LLMs.
⦁ Deductive vs. Inductive Problems: Early symbolic AI excelled at deductive problems (e.g., logic, chess), where all necessary information is provided. However, it struggled with inductive problems—the cornerstone of human intelligence—where conclusions must be drawn from incomplete information. Probability theory, particularly Bayes' rule, provides the mathematical framework for induction.
⦁ "Embers of Autoregression": LLMs are trained to predict the next token in a sequence, which makes them highly sensitive to the statistical patterns in their training data. This can lead to counter-intuitive behavior. For example, an LLM might be mo...
Full story
How A Team Of 7 Keeps Breaking AI Benchmark Records
Poetiq, a startup by former DeepMind researchers, has developed a recursive self-improvement meta-system that builds "reasoning harnesses" on top of existing LLMs. This approach avoids the costly "fine-tuning trap" and has achieved state-of-the-art results on benchmarks like ARC-AGI and Humanity's Last Exam by automatically optimizing prompts and discovering novel reasoning strategies.
Poetiq is building a recursively self-improving system that acts as a "reasoning harness" for large language models (LLMs). The core insight is a method for recursive self-improvement—where an AI makes itself smarter—that is significantly faster and cheaper than traditional approaches, which typically require retraining a new LLM from scratch at a cost of hundreds of millions of dollars.
The Challenge: The "Fine-Tuning Trap"
Many companies building on LLMs face a significant challenge: the "fine-tuning trap." The conventional approach involves:
1. Collecting a large dataset (tens of thousands of examples).
2. Spending a great deal on compute to fine-tune a frontier or open-weights model.
3. Achieving improved performance on a specific task.
However, this process is vulnerable to the "bitter lesson." Soon after, a new, more powerful base model is released (e.g., GPT-4o) that outperforms the specialized, fine-tuned model out of the box. The company is then faced with the choice of repeating the expensive fine-tuning process or going out of business.
Poetiq's Solution: "Stilts" for LLMs
Poetiq offers a different paradigm. Instead of fine-tuning, it provides an agentic system, or "harness," that sits on top of one or more LLMs. This harness enhances the base model's capabilities, acting like "stilts" to make it perform better on specific, hard problems.
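To make the "harness" idea concrete, here is a deliberately simplified, hypothetical sketch of a reasoning harness wrapped around an arbitrary base model. The call_llm function is a placeholder for any chat-completion API, and the sample-and-vote strategy is a stand-in for the far richer strategies Poetiq's meta-system discovers automatically.

```python
# Hypothetical sketch of a "reasoning harness": a thin, model-agnostic layer
# that wraps whatever base LLM is available. call_llm() is a stand-in for any
# provider API; the sample-and-vote strategy is illustrative only.
from collections import Counter

def call_llm(prompt: str, temperature: float = 0.8) -> str:
    raise NotImplementedError("plug in any chat-completion API here")

def harness(problem: str, n_samples: int = 8) -> str:
    """Ask the base model several times, then pick the most consistent answer."""
    prompt_template = (
        "Solve the problem step by step, then give the final answer "
        "on its own line prefixed with 'ANSWER:'.\n\nProblem: {p}"
    )
    answers = []
    for _ in range(n_samples):
        reply = call_llm(prompt_template.format(p=problem))
        for line in reply.splitlines():
            if line.startswith("ANSWER:"):
                answers.append(line[len("ANSWER:"):].strip())
                break
    # Majority vote: a crude stand-in for the verification and reasoning
    # strategies that the meta-system generates automatically.
    return Counter(answers).most_common(1)[0][0] if answers else ""
```

Because the harness only talks to the model through a generic call, swapping in a newer base model requires no retraining, which is the model-agnostic property described below.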
Key advantages of this approach include:
⦁ Model Agnostic: When a new frontier model is released, the same harness remains compatible and provides an immediate performance boost.
⦁ Cost-Effective: The optimization process is vastly cheaper than fine-tuning.
⦁ Continuous Improvement: The harness can be further optimized for the new model to achieve even greater performance, ensuring the system always outperforms the underlying base models.
How the Meta-System Works
Poetiq's core technology is a recursively self-improving meta-system. The output of this meta-system is not just a solution, but systems that solve hard problems. This automated optimization process can generate a complete reasoning harness—comprising code, prompts, and data—from scratch.
Furthermore, if a startup has already built its own agent, Poetiq can ingest that system and optimize its components, such as prompts or reasoning strategies. The meta-system analyzes the problem data to discover failure modes and identify robust reasoning paths, effectively outsourcing the deep, manual data analysis to the AI itself.
Automating Prompt and Reasoning Engineering
The system moves beyond simple prompt optimization. While automated prompt tuning (like the popular DSPy framework) provides some gains, the most significant improvements come from discovering novel reasoning strategies that are implemented in code.
In one example from a previous project at DeepMind, manual prompt optimization on a very hard task brought performance to only 5%. Adding optimized reasoning strategies, however, lifted it from 5% to 95%. The Poetiq meta-system automates this discovery process, often generating non-intuitive prompts and strategies that a human might not devise.
Demonstrated Success on Key Benchmarks
Poetiq has validated its approach by achieving top rankings on difficult AI benchmarks:
⦁ ARC-AGI: Shortly after a new model set the state-of-the-art at 45%, Poetiq's system achieved 54% accuracy. Notably, it did so by building on a cheaper base model, costing only $32 per problem compared to the previous SOTA's $70+.
⦁ Humanity's Last Exam: On this set of 2,500 expert-level questions, Poetiq achieved a score of 55%, surpassing the previous record of 53.1% set by Anthropic's Claude Opus. The entire optimization run for this achievement cost less than $100,000, a fraction of the cost of training a foundation model.
These results demonstrate the system's ability to enhance both complex reasoning (ARC...
Full story
How AI is changing Software Engineering: A Conversation with Gergely Orosz, @pragmaticengineer
Gergely Orosz, author of The Pragmatic Engineer, discusses the bizarre trend of 'token maxing' in Big Tech, the evolving role of software engineers in the AI era, and why companies are heavily investing in internal AI infrastructure despite uncertain productivity gains.
The Rise of "Token Maxing" in Big Tech
A strange cultural phenomenon known as "token maxing" has emerged within large tech companies like Meta, Microsoft, and Salesforce. It stems from these companies measuring developers' usage of internal AI tools, often through public leaderboards or spend tracking. At Salesforce, for example, there's a tool to see how many dollars colleagues have spent on AI tokens, with some teams having a target minimum spend of around $175 per month.
This measurement, coupled with industry-wide job insecurity, has led engineers to feel pressured to increase their token count to avoid being perceived as low performers. This pressure results in counterproductive behaviors:
⦁ Artificial Usage: Engineers run autonomous agents to "build junk" or ask AI assistants to summarize documentation (even if the AI does a poor job) simply to generate tokens and stay out of the bottom percentile.
⦁ Weaponized Metrics: While token count is just one of many data points in performance reviews, it can be "weaponized." A low performer with a low token count is seen as "not even trying," while a high performer with a high token count is seen as an innovator.
This trend is reminiscent of earlier flawed metrics like "lines of code," but it's now being driven by top tech companies. It's a response to an initial push from leadership who, seeing the success of AI-native companies like Anthropic, wanted to force adoption among skeptical, experienced engineers. The situation at Coinbase was an extreme example, where the CEO mandated AI tool usage under the threat of termination.
Is AI Actually Making Engineers More Productive?
While individual productivity is certainly increasing, team-level productivity is more of a question mark. It has proven difficult to retrofit AI into established workflows. An interesting perspective is that the most significant productivity gain may not be for engineers themselves, but for their non-technical collaborators. Equipped with coding agents, those collaborators no longer have to wait for an engineer, effectively becoming "serverless developers" and unlocking organizational productivity.
Mastering these AI tools requires a new mindset:
⦁ Continuous Learning: There is no manual for AI tools. It takes a long time to get good, and workflows are constantly changing.
⦁ Practice Over Theory: Unlike in traditional computer science, understanding the underlying theory of the models doesn't necessarily make you a better user. Hands-on experience is key.
⦁ Open-Mindedness: Success requires a low-ego approach, being "open to leaving your priors behind" and experimenting.
The Changing Role of the Software Engineer
AI is accelerating a trend that was already underway: the expansion of the software engineer's role. The role has already absorbed responsibilities from dedicated tester and DevOps teams. Now, it's beginning to incorporate product skills, giving rise to the "product engineer."
Expectations for seniority and business awareness are increasing even for early-career engineers. As a result, teams are shrinking. A VP of Engineering at John Deere noted their "two-pizza teams" are becoming "one-pizza teams," a direct result of these new tools and expanded roles.
The idea that engineers are becoming "engineering managers for AI agents" is a flawed analogy. The role is more akin to a Tech Lead or someone operating a "mech suit" (an analogy from DHH). You orchestrate tasks and can do more, faster, without the people-related challenges of management—the drama, conflicts, and slow feedback loops. With agents, the feedback loop is immediate.
Big Tech's Massive Investment in Internal AI Infrastructure
Many large tech companies like Uber, Airbnb, and Meta are investing heavily in building bespoke internal AI infrastructure, even if it hasn't yet translated to a dramatic increase in external product features. They...
Full story
Building Generative Image & Video models at Scale - Sander Dieleman (Veo and Nano Banana)
Sander Dieleman from Google DeepMind provides a behind-the-scenes look at the key components of training large-scale diffusion models for audio-visual data. The talk covers the entire pipeline, from the critical role of data curation and latent representations to the mechanics of diffusion, network architectures, sampling with guidance, and advanced control signals.
Diffusion models have become the dominant paradigm for generating high-quality audio-visual data, differing from the auto-regressive models that are prevalent in language modeling. This is a comprehensive overview of the entire process, from data to sampling.
Data Curation and Representation
A critical, yet often underrated, aspect of training large-scale models is meticulous data curation. In contrast to academic research, where incentives favor standard benchmarks, real-world success depends on actively curating and cleaning the training data. Time spent improving the dataset is often a better investment than tweaking the model architecture or optimizer.
Training on raw pixel data is infeasible for high-resolution or long-duration content due to immense memory requirements. The solution is to work in a compressed latent space.
⦁ Autoencoder-based Compression: A custom autoencoder is trained to compress data into a compact latent representation and then decode it back to pixel space. The diffusion model is then trained exclusively on these latents.
⦁ Preserving Structure: Unlike standard codecs (e.g., JPEG), these learned latents preserve the spatial grid structure of the original data, albeit at a much lower resolution (e.g., a 256x256 image might become a 32x32 latent grid). This is crucial because the neural network architectures have strong inductive biases that rely on this grid structure.
⦁ Efficiency: This approach can reduce data size by up to two orders of magnitude, making it possible to fit training examples in memory. The latents abstract away fine-grained local textures while preserving the core semantic content of the image or video.
The Diffusion Modeling Mechanism
Diffusion is an iterative refinement process. It works by first defining a forward (corruption) process, where Gaussian noise is gradually added to an image until all information is destroyed. The model then learns a reverse (denoising) process to remove that noise.
Generation starts with pure noise and iteratively refines it:
1. The denoiser model is given a noisy input x_t and predicts the original, clean image x_0.
2. Because noise removes information, this is an ill-posed problem. The model's prediction is effectively the average of all possible source images consistent with the noisy input, resulting in a blurry output. This blurry prediction provides a direction for the next step.
3. A small step is taken in this predicted direction.
4. (Optional) A small amount of new noise is added back. This makes the process stochastic and more robust to the accumulation of the model's own errors.
5. This process is repeated. As noise is removed, the model has more information, its predictions become sharper, and the sample quality improves until a clean image is generated.
From a frequency perspective, this process can be viewed as spectral auto-regression. The noise corruption process obscures high-frequency details first, followed by lower-frequency global structures. The reverse denoising process, therefore, generates the image from coarse to fine, starting with low frequencies (global layout) and progressively adding high-frequency details.
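The refinement loop above can be written down in a few lines. This is a minimal sketch of a sampler for a denoiser trained to predict the clean latent; the linear noise schedule, step count, and churn term are illustrative assumptions rather than Veo's actual configuration.

```python
# Minimal sketch of the iterative refinement loop described above, assuming a
# denoiser network trained to predict the clean latent x0 from a noisy input.
# The linear noise schedule and step count are illustrative choices.
import torch

@torch.no_grad()
def sample(denoiser, shape, num_steps=50, churn=0.1, device="cpu"):
    sigmas = torch.linspace(1.0, 0.0, num_steps + 1, device=device)  # noise levels, high to low
    x = torch.randn(shape, device=device) * sigmas[0]                # 1. start from pure noise
    for i in range(num_steps):
        sigma, sigma_next = sigmas[i], sigmas[i + 1]
        x0_pred = denoiser(x, sigma)            # 2. blurry estimate of the clean latent
        d = (x - x0_pred) / sigma               #    direction implied by that estimate
        x = x + (sigma_next - sigma) * d        # 3. take a small step toward the estimate
        if churn > 0 and sigma_next > 0:        # 4. optionally re-inject a little noise
            x = x + churn * sigma_next * torch.randn_like(x)
    return x                                     # 5. repeat until (approximately) clean
```

At the final step the noise level reaches zero, so the update collapses onto the model's last clean-image prediction, matching the coarse-to-fine behavior described above.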
Network Architecture and Training
⦁ Backbones: While early models like Stable Diffusion used U-Net architectures, the field has largely shifted to Transformers. This allows leveraging the extensive knowledge and tooling developed for scaling Large Language Models (LLMs), using adaptations like bidirectional attention instead of causal masking.
⦁ Video Models: There are two main approaches for video:
1. Joint Diffusion: Treat the entire 3D video volume (space and time) as a single entity to be noised and denoised.
2. Hybrid Approach: Use auto-regression in the time dimension (generating frame-by-frame) while using diffusion to generate each individual frame. This is useful for applications like...
Full story
Open Models at Google DeepMind — Cassidy Hardin, Google DeepMind
Cassidy Hardin from Google DeepMind introduces Gemma 4, a new family of open-weight models with significant architectural and performance improvements. This summary covers the four new models (31B Dense, 26B MoE, and two "Effective" on-device models), deep dives into architectural changes like mixed global/local attention and Per-Layer Embeddings (PLE), and details the new native multimodal capabilities for vision and audio.
Introducing the Gemma 4 Family
Gemma 4 is the latest addition to Google's family of open-source models, setting a new standard for performance at various scales under a more accessible Apache 2.0 license. The family includes four models designed for different use cases, from powerful cloud-based applications to efficient on-device execution.
The family consists of:
⦁ 31B Dense Model: A state-of-the-art multimodal model built for advanced reasoning and autonomous workflows. It ranks #3 on the LMSys Arena leaderboard, outperforming models over 20 times its size. It features a 256k context length with native support for function calling and structured JSON outputs.
⦁ 26B Mixture-of-Experts (MoE) Model: The first MoE model in the Gemma family, designed for efficiency. It utilizes only 3.8 billion active parameters during any forward pass. It has a total of 128 experts, with a router activating 8 experts during inference.
⦁ Effective 4B (E4B) & 2B (E2B) Models: These smaller models are optimized for on-device applications, capable of running locally on phones and laptops. They are multimodal, supporting text, vision, and now audio inputs, while remaining text-output models.
Architectural Innovations
Gemma 4's performance gains are driven by several key architectural improvements.
Attention Mechanism Enhancements
To balance performance and computational cost, Gemma 4 employs a hybrid attention strategy across all models:
⦁ Interleaved Local and Global Attention: The models use a mix of local and global attention layers, typically in a 5:1 ratio (4:1 for the E2B model). The final layer is always a global layer, ensuring it attends to all preceding tokens.
⦁ Sliding Window Attention: Local layers use a sliding window (1,024 tokens for large models, 512 for smaller ones) to efficiently process context without attending to every single token.
⦁ Grouped Query Attention (GQA): To manage the high memory cost of global layers, GQA is used. In local layers, two queries share a key/value head. In the more expensive global layers, eight queries share a single key/value head. To compensate for potential performance loss, the key/value head length in global layers is doubled to 512 (from 256 in local layers).
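A small sketch of how such an interleaved schedule and a sliding-window mask could be constructed. The layer count, the exact placement of global layers, and the mask convention are assumptions for illustration, not Gemma 4's actual code.

```python
# Illustrative sketch of the interleaved local/global layer schedule and a
# sliding-window (local) attention mask, following the ratios quoted above.
import torch

def layer_schedule(num_layers: int, locals_per_global: int = 5) -> list[str]:
    """locals_per_global local layers followed by one global layer, repeated;
    the final layer is always global so it attends to all preceding tokens."""
    kinds = ["global" if (i + 1) % (locals_per_global + 1) == 0 else "local"
             for i in range(num_layers)]
    kinds[-1] = "global"
    return kinds

def sliding_window_mask(seq_len: int, window: int = 1024) -> torch.Tensor:
    """Boolean mask: position i may attend only to positions (i - window, i]."""
    i = torch.arange(seq_len)[:, None]
    j = torch.arange(seq_len)[None, :]
    return (j <= i) & (j > i - window)

print(layer_schedule(12))
# ['local', 'local', 'local', 'local', 'local', 'global', ..., 'global']
```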
"Effective" Models and Per-Layer Embeddings (PLE)
The "Effective" 2B and 4B models introduce a novel architecture to maximize on-device performance. The term "effective" refers to the number of parameters required for operation, which is much lower than the total "representational" parameters. For example, the E2B model has 2.3B effective parameters but a representational depth of 5.1B parameters.
This is achieved through Per-Layer Embeddings (PLE):
1. Standard Embedding Table: The model still has a standard token embedding table that maps a token ID to its main embedding vector (e.g., dimension 1536 for E2B). This is stored in VRAM.
2. Per-Layer Embedding Table: A second, dedicated embedding table is introduced. This table stores a much smaller embedding vector (dimension 256) for every token at each layer of the model.
3. Flash Memory Storage: Crucially, this large PLE table is stored in flash memory, not VRAM. This bypasses the primary memory constraint on mobile devices and laptops.
4. In-Practice Workflow: At the end of each decoder block, the model performs a lookup in the PLE table to get the token's 256-dimension embedding for that specific layer. This smaller embedding is then projected up to the full embedding size expected by the model. This allows the model's understanding of a token to evolve and improve as it progresses through the layers without incurring a massive VRAM cost.
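A hypothetical sketch of the lookup-and-project step, using the dimensions quoted above (256-d per-layer embeddings projected to a 1536-d model width). The module structure and the way the result is combined with the hidden state are assumptions, not the released implementation.

```python
# Hypothetical sketch of a per-layer embedding lookup following the workflow
# described above. Dimensions match the quoted E2B figures; everything else
# is an assumption for illustration.
import torch
import torch.nn as nn

class PerLayerEmbedding(nn.Module):
    def __init__(self, vocab_size: int, num_layers: int,
                 ple_dim: int = 256, model_dim: int = 1536):
        super().__init__()
        # One small embedding table per layer; in deployment these weights
        # would be streamed from flash storage rather than held in VRAM.
        self.tables = nn.ModuleList(
            nn.Embedding(vocab_size, ple_dim) for _ in range(num_layers)
        )
        self.proj = nn.ModuleList(
            nn.Linear(ple_dim, model_dim, bias=False) for _ in range(num_layers)
        )

    def forward(self, token_ids: torch.Tensor, layer_idx: int) -> torch.Tensor:
        """Look up the layer-specific 256-d embedding and project it to model width."""
        small = self.tables[layer_idx](token_ids)   # (batch, seq, 256)
        return self.proj[layer_idx](small)          # (batch, seq, 1536)

# At the end of decoder block l, the hidden state could be combined with this
# signal, e.g. h = h + ple(token_ids, l); the combination rule is assumed.
```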
Native Multimodal Capabilities
Gemma 4 was designed to be natively multimodal from the ground up, with significant advancements in vision and the introduction of audio.
Vision Processing
Key improvements over Ge...
Full story
Demis Hassabis on Building DeepMind, AlphaFold, and the Final Stretch to AGI
Demis Hassabis, CEO of Google DeepMind, outlines the path to AGI, which he predicts by 2030. He discusses the profound impact of AI on science, particularly in revolutionizing drug discovery with systems like AlphaFold, and posits that AI will enable new forms of simulation-based science. Hassabis also delves into the philosophical underpinnings of his work, viewing information as the universe's most fundamental quantity and advocating for developing AGI as a powerful tool before tackling the deeper questions of consciousness.
The Path to DeepMind: A Common Thread
The journey to founding DeepMind was guided by a long-held interest in AI, which served as a common thread through seemingly disparate fields like chess, game development, and neuroscience.
⦁ Games as a Proving Ground: In the 1990s, the video game industry was at the forefront of technology, particularly in graphics and the development of GPUs. Games like Theme Park were essentially complex economic and AI simulations. The delight people took in interacting with these AI systems reinforced the potential of the field.
⦁ A Lesson in Ambition: The first startup, Elixir Studios, aimed to simulate an entire country on a home PC in the late 90s. The experience taught a crucial lesson: "You want to be 5 years ahead of your time, not 50 years ahead." Being too far ahead means the enabling technology and market may not be ready.
⦁ Neuroscience for Inspiration: Studying the brain provided key algorithmic ideas and principles, including the concept that reinforcement learning could eventually scale to AGI.
Founding DeepMind and the AGI Mission
DeepMind was founded in 2010 with a clear, two-step mission statement: "Step one, solve intelligence, i.e., build AGI. Step two, use it to solve everything else."
At the time, the idea of pursuing AGI was met with skepticism in both academia and industry. The core insight was to combine two siloed fields: deep learning, which had recently emerged, and reinforcement learning. This novel combination, along with the increasing power of accelerated computing (GPUs), created a sense that "we were keepers of a secret." The conviction was that even if the approach failed, it would fail in a new and original way compared to the symbolic AI efforts of the 1990s.
AI for Science: The Ultimate Tool
The primary motivation for building AGI has always been to use it as the ultimate tool to advance science and medicine. The victory of AlphaGo was the crucial turning point, demonstrating that the algorithms were powerful and general enough to be applied to major scientific challenges. This led to the formal creation of the AI for Science division at DeepMind. The goal is to use AI to cure diseases, create healthier lifespans, and make breakthroughs in materials science, energy, and the environment.
Revolutionizing Biology and Drug Discovery
AI is poised to fundamentally reshape biology, a field perfectly suited for machine learning.
⦁ AlphaFold: This system solved the 50-year-old grand challenge of protein folding, a critical step in understanding biological mechanisms.
⦁ Isomorphic Labs: This newer venture builds on AlphaFold's success, developing adjacent technologies to design chemical compounds that bind to specific protein targets.
⦁ The In Silico Dream: The ultimate goal is to shift 99% of drug discovery exploration into the digital realm ("in silico"). This would reduce the process from a 10-year average down to months, weeks, or perhaps even days. Such a breakthrough would put all diseases within reach and make personalized medicine a practical reality.
The Emergence of New Sciences
Beyond accelerating existing fields, AI may create entirely new sciences.
⦁ AI-Powered Simulations: Many fields, like economics and social sciences, are not as rigorous as physics because it is impossible to run repeated, controlled experiments. AI can create highly accurate "learned simulators" for complex, emergent systems where the underlying mathematics is unknown or too complex.
⦁ A New Language for Biology: Machine learning is the "perfect description language for biology," just as mathematics is for physics. Biological systems involve countless weak signals and correlations in vast datasets, a structure that machine learning is uniquely equipped to analyze. A "virtual cell" is one such simulation being explored.
⦁ Extracting New Equations: Once an accurate, implicit simulat...
Full story
The Four Pillars of Training an LLM from Scratch
Training a large language model (LLM) from scratch is often seen as an endeavor exclusive to large labs with massive compute resources. However, as Angelos Perivolaropoulos of ElevenLabs shows in this workshop, the fundamental principles can be applied on a much smaller scale, even on a local machine. The process revolves around four key building blocks: the tokenizer, the model architecture, the training loop, and the inference logic.
1. The Tokenizer: The Language of the Model
The first and one of the most critical decisions is choosing a tokenizer, which converts raw text into numerical representations (vectors) that the model can process. For this hands-on project, a character-level tokenizer is used.
⦁ Simplicity and Small Vocabulary: This approach is ideal for small-scale training because the vocabulary is tiny—only 65 unique characters in the Shakespeare dataset. This means the model has fewer possible combinations to learn (e.g., 65 * 65 = 4,225 possible bigrams), making it converge faster on a small dataset.
⦁ The Trade-off: The major drawback is that character-level tokenizers don't scale well. Sequences become much longer, and the model struggles to learn relationships between semantically linked words (e.g., "sky" and "blue") from individual characters. It's computationally expensive and less effective for building powerful, general-purpose models.
⦁ Advanced Alternative: In contrast, large-scale models use more sophisticated tokenizers like Byte Pair Encoding (BPE). BPE is trained on the entire dataset to find and merge common character patterns into single tokens (e.g., "ing", "the", "for"). This creates a much larger vocabulary but allows the model to learn meaningful relationships between word parts and concepts more effectively.
The tokenizer's vocabulary size directly impacts the model's embedding table, which is the first layer of the network. A large vocabulary (like GPT-2's 50,000 tokens) would create an embedding table with millions of parameters, dwarfing the size of the small model being trained.
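A character-level tokenizer of the kind described is only a few lines. This sketch assumes the Shakespeare text sits in a local shakespeare.txt file (the path is a placeholder).

```python
# Minimal character-level tokenizer along the lines described above.
text = open("shakespeare.txt", encoding="utf-8").read()

chars = sorted(set(text))               # e.g. 65 unique characters
stoi = {ch: i for i, ch in enumerate(chars)}
itos = {i: ch for ch, i in stoi.items()}

def encode(s: str) -> list[int]:
    return [stoi[c] for c in s]

def decode(ids: list[int]) -> str:
    return "".join(itos[i] for i in ids)

vocab_size = len(chars)                 # sets the number of rows in the embedding table
print(vocab_size, encode("To be"), decode(encode("To be")))
```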
2. The Transformer Architecture: The Brains of the Operation
The workshop focuses on a GPT-2-style, decoder-only transformer architecture. While modern architectures have numerous optimizations for scale and context length, the core components remain remarkably consistent.
⦁ Multi-Head Self-Attention: This is the core mechanism that allows the model to weigh the importance of different tokens in the input sequence when producing an output. It helps the model understand relationships, like how "sky" and "blue" are correlated. Different "heads" can learn to focus on different aspects of the text, such as grammar, punctuation, or semantic concepts.
⦁ Feed-Forward Network (MLP): After the attention mechanism identifies relationships between tokens, the MLP (Multi-Layer Perceptron) processes this information, organizing the context in a way that helps the model generate the final logits for the next token.
⦁ Residual Connections: These are simple but crucial additions (x = x + attention(x)). Instead of each layer completely transforming the input, it only adds a modification. This stabilizes training by preventing the model's internal representations from changing too drastically from one layer to the next.
⦁ Layer Normalization: This component keeps the activations (the outputs of each layer) within a controlled range. It prevents values from "exploding" as they pass through successive layers, which is another key factor in maintaining training stability.
These components are assembled into a Block, which represents one layer of the transformer. The final model is a stack of these blocks, preceded by an embedding layer and followed by a final linear layer (the LM Head) that produces the probability distribution over the entire vocabulary for the next token.
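Putting the pieces together, here is a minimal sketch of one such Block. The use of PyTorch's nn.MultiheadAttention, the 384-dimensional width, and the dropout-free MLP are illustrative assumptions rather than the workshop's exact code.

```python
# Minimal GPT-2-style decoder block assembling the components described above
# (pre-norm layout; causal masking is passed in via attn_mask).
import torch.nn as nn

class Block(nn.Module):
    def __init__(self, d_model: int = 384, n_heads: int = 6):
        super().__init__()
        self.ln1 = nn.LayerNorm(d_model)
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.ln2 = nn.LayerNorm(d_model)
        self.mlp = nn.Sequential(
            nn.Linear(d_model, 4 * d_model),
            nn.GELU(),
            nn.Linear(4 * d_model, d_model),
        )

    def forward(self, x, attn_mask=None):
        # Residual connections: each sub-layer only adds a modification.
        h = self.ln1(x)
        attn_out, _ = self.attn(h, h, h, attn_mask=attn_mask, need_weights=False)
        x = x + attn_out
        x = x + self.mlp(self.ln2(x))
        return x
```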
3. The Training Loop: The Learning Process
The training l...
Full story