Why Every Brain Metaphor in History Has Been Wrong [SPECIAL EDITION]
An exploration of scientific simplification, questioning the metaphors we use to understand the brain and intelligence. This summary delves into the tension between creating useful models and mistaking them for reality, featuring insights on the mind-as-software debate, the limits of prediction versus understanding, and the philosophical underpinnings of our quest for AGI.
Science operates by simplifying complex reality, but this necessary act raises a fundamental question: have we found a deep truth about the world, or are we mistaking our simplified model for the actual thing? This tension is embodied by the "spherical cow" joke in physics and is central to modern neuroscience and AI. As Professor Mazviita Chirimuuta explains in her book, The Brain Abstracted, we are limited creatures who must build models and leave things out. The critical disagreement, however, is over what the success of those models implies about reality itself.
This can be framed as a conflict between two perspectives:
⦁ Simplicius: Believes that science works because the universe is fundamentally simple and orderly underneath its apparent complexity. An elegant equation reflects reality.
⦁ Ignorantio: Argues that we simplify because we are cognitively limited. Our models are useful fictions—maps, not the territory—that work for our specific purposes, which doesn't prove that nature itself is simple.
Chirimuuta aligns with "learned ignorance" (docta ignorantia), the idea that true learning includes understanding the limits of what you know.
The Kaleidoscope Hypothesis: Is Reality Fundamentally Code?
François Chollet proposes the "kaleidoscope hypothesis," suggesting that beneath the messy surface of reality lies an intrinsic, underlying structure composed of simple, repeating "atoms of meaning." Much like a kaleidoscope creates infinite complexity from a few pieces of colored glass, the world is generated by the repetition and composition of these fundamental elements. Intelligence, in this view, is the process of mining experience to extract these abstractions.
Chirimuuta frames this not as a scientific certainty but as a philosophical bet, akin to Plato's theory of Forms. It's a wager that "real reality is neat, mathematical, and decomposable" beneath the complicated world of appearances.
The Ultimate Metaphor: Is the Mind Software?
The most pervasive simplification today is the idea that the mind is a computer running software. This has moved from a metaphor to what many consider a literal truth. Joscha Bach argues provocatively that this is not a metaphor at all: "Software is spirit." He posits that abstract patterns, like software or money, have real causal power, independent of their physical substrate. A program produces the same effects whether on a Mac, a PC, or potentially even neurons, because the causal power lies in the invariance of the pattern itself.
The counterargument is that this "sameness" is not inherent in nature but is imposed by a human observer. Physically, completely different things are happening inside different computer chips. The invariance exists only in our description. The causal power of money, for example, isn't in the paper or electrons but in the shared social agreements and interpretive practices of humans. The critique is that this view mistakes an elegant description for the fundamental structure of reality.
Historically, our metaphors for the brain have always tracked our most advanced technology:
⦁ Descartes: Hydraulic pumps in French royal gardens.
⦁ 19th Century: A telegraph network.
⦁ 20th Century: A telephone switchboard.
⦁ 21st Century: A digital computer.
As Jeff Beck bluntly states, "It will always be the case that our explanation for how the brain works will be by analogy to the most sophisticated technology that we have."
Ontology vs. Metaphysics: It Depends on Why You're Asking
Professor Luciano Floridi offers a framework to navigate this, distinguishing between metaphysics (reality as it is in itself, which is inaccessible) and ontology (the structure we impose on reality for a specific purpose). Our models of the world are not absolutely true or false; their value is relational.
Is it the same ship of Theseus? The question is a mistake. It provides no interface, what computer scientis...
"We Made a Dream Machine That Runs on Your Gaming PC"
Shahbuland Matiana and Andrew Lapp from Overworld Labs introduce Waypoint 1, a 2 billion-parameter open-source world simulation model designed to run on consumer hardware at 60 FPS. They discuss its novel architecture, which combines a causal language model with an image diffusion model to denoise frames in real-time based on user prompts and controller inputs, emphasizing low-latency interaction and the importance of local execution for user privacy.
Overworld Labs has introduced Waypoint 1, a 2 billion-parameter world simulation model designed to run efficiently on consumer hardware. Unlike large-scale projects like Google's Genie, which rely on massive cloud infrastructure, Waypoint 1 is optimized for local execution on gaming PCs (e.g., NVIDIA 3070s, 4090s) and soon, Apple Silicon. The model, whose weights are being open-sourced, is capable of generating interactive, explorable worlds from text or image prompts at 60 frames per second.
The Vision: Sharable Lucid Dreams
The core motivation behind Overworld is to create a way to record and share the kinds of immersive, dynamic experiences found in dreams. Co-founder Shahbuland Matiana described a personal lucid dream that modern game engines cannot replicate:
"I was in this house floating in space and there was a giant dragon circling the house... I draw a katana from my waist and I parry the dragon's teeth as it tries to bite me. I feel a clang reverberate through my whole body. The floorboards crack beneath my feet. The windows shatter around me."
The goal of Waypoint 1 is to enable the creation of such fully immersive experiences where the world bends and reacts to the user's actions, and then allow those experiences to be shared with others. This technology aims to be a "killer application" for AI, moving beyond static video generation into truly interactive entertainment.
Technical Architecture: A Real-Time Diffusion Transformer
Waypoint 1's architecture is a novel hybrid of a causal language model and an image diffusion model, optimized for real-time interaction.
1. Image Compression: The process begins with an autoencoder that compresses video frames (e.g., 360p) into a much smaller latent representation, such as a 32x32 grid. The model operates entirely in this compressed latent space, not on raw pixels.
2. Frame Generation: The core of the system is a transformer model. However, instead of autoregressively predicting the next token like a standard LLM, it denoises the next 256 tokens (representing one full frame) in a single forward pass.
3. Conditioning: Each frame is generated conditioned on a history of preceding frames, a text prompt, and controller inputs from the last 1/60th of a second. This conditioning is managed through cross-attention mechanisms within the transformer blocks.
4. Low Latency: To ensure playability and responsiveness, the model generates only one frame at a time. This is a key distinction from many video diffusion models that use temporal autoencoders to compress multiple frames together, which saves computation but introduces significant input lag (e.g., only accepting input every 4th frame).
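The one-frame-at-a-time generation loop described above can be sketched in miniature. This is a toy NumPy stand-in, not Waypoint 1's code: the array sizes, the summed conditioning signal, and the `denoise_frame` helper are all illustrative inventions (the real model conditions via cross-attention inside transformer blocks).

```python
import numpy as np

rng = np.random.default_rng(0)

TOKENS_PER_FRAME = 256   # one frame = 256 latent tokens, per the text above
DIM = 8                  # toy latent channel dimension

def denoise_frame(history, prompt_emb, controls, steps=4):
    """Stand-in for the transformer: denoise one full frame per call.
    Conditioning (history, prompt, controller input) is just summed here;
    the real model injects it through cross-attention."""
    x = rng.standard_normal((TOKENS_PER_FRAME, DIM))   # start from noise
    cond = history[-1] + prompt_emb + controls         # toy conditioning signal
    for _ in range(steps):
        v = cond - x              # a real model predicts a learned velocity
        x = x + v / steps         # drift the latent toward the conditioned target
    return x

# a short interactive "session": one frame per 1/60 s tick
prompt_emb = rng.standard_normal((TOKENS_PER_FRAME, DIM))
history = [np.zeros((TOKENS_PER_FRAME, DIM))]
for tick in range(3):
    controls = 0.1 * rng.standard_normal((TOKENS_PER_FRAME, DIM))  # latest input
    frame = denoise_frame(history, prompt_emb, controls)
    history.append(frame)         # each generated frame conditions the next

print(len(history) - 1, history[-1].shape)  # 3 (256, 8)
```

The point of the sketch is the loop structure: controller input from the most recent tick enters the conditioning of the very next frame, which is why skipping the temporal autoencoder avoids input lag.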
Optimization and Distillation
Achieving 60 FPS on consumer hardware requires significant optimization. The team uses a four-step rectified flow model with an Euler sampler. In this process, the model starts with random noise and, over four steps, predicts the vector that moves the latent representation closer to the "clean," ideal frame.
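The four-step Euler sampling can be made concrete with a minimal rectified-flow sketch. All names here are illustrative; the key simplification is that we use the oracle velocity (which a trained network only approximates), under which straight-line paths integrate exactly.

```python
import numpy as np

# Rectified flow learns a velocity field v(x, t); sampling integrates
# dx/dt = v(x, t) from t=0 (noise) to t=1 (data) with an Euler scheme.
# For the straight-line coupling x_t = (1 - t) * x0 + t * x1, the ideal
# velocity is constant: v = x1 - x0.

rng = np.random.default_rng(0)
x1 = rng.standard_normal(256)   # the "clean" frame latent (toy target)
x0 = rng.standard_normal(256)   # pure noise

def velocity(x, t):
    return x1 - x0               # oracle velocity; a network approximates this

x, t, steps = x0.copy(), 0.0, 4
dt = 1.0 / steps
for _ in range(steps):
    x = x + dt * velocity(x, t)  # Euler update
    t += dt

print(np.allclose(x, x1))        # True: straight paths integrate exactly
```

With a learned (imperfect) velocity field the paths curve slightly, which is why step count still matters; the text's observation is that the cost of fewer steps shows up mainly as reduced diversity.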
A key insight is that reducing the number of diffusion steps primarily sacrifices diversity, not quality. For an autoregressive model like Waypoint 1, this is an acceptable trade-off. The strong conditioning from previous frames and user input already constrains the output, so the inherent diversity from a high-step diffusion process is less critical.
This speed is further enhanced by diffusion distillation (e.g., using methods like Distribution Matching Distillation or DMD), where a "student" model is trained to replicate the output of a larger model in fewer steps. This process effectively "bakes in" parameters like the classifier-free guidance (CFG) scale, which avoids the need for multiple forward passes during inference and dramatically speeds up generation.
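Why "baking in" CFG halves the per-step work can be seen in a toy linear model. The weights below are hypothetical, and real distillation trains a student network rather than solving for weights in closed form; the linear case just makes the equivalence exact.

```python
import numpy as np

rng = np.random.default_rng(1)
W_c = rng.standard_normal((8, 8))   # toy "conditional" model weights
W_u = rng.standard_normal((8, 8))   # toy "unconditional" model weights
scale = 3.0                          # CFG guidance scale

def teacher_cfg(x):
    # classifier-free guidance: TWO forward passes per denoising step
    cond, uncond = W_c @ x, W_u @ x
    return uncond + scale * (cond - uncond)

# Distillation "bakes in" the guidance. In this linear toy, the student's
# single weight matrix reproduces the guided output exactly.
W_student = W_u + scale * (W_c - W_u)

def student(x):
    return W_student @ x             # ONE forward pass

x = rng.standard_normal(8)
print(np.allclose(teacher_cfg(x), student(x)))  # True
```

Combined with cutting the step count, this is how the per-frame compute budget fits inside a 1/60 s deadline on consumer GPUs.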
Privacy and the Future
The team strongly advocates for ...
This Startup Beat Gemini 3 on ARC-AGI — at Half the Cost
Poetic, a startup by ex-DeepMind researchers, has significantly advanced performance on the ARC-AGI benchmark by applying a recursive self-improvement system to Gemini 3. Co-founder Ian Fisher discusses how their approach of automating prompt and system engineering provides a substantial performance boost without needing access to model weights, and explores its potential as a path toward AGI.
Poetic, a new startup founded by former DeepMind researchers, has achieved a significant breakthrough on the ARC-AGI benchmark. By layering their proprietary system on top of Gemini 3, they achieved a 54% score on the private test set, a substantial leap from Gemini 3's baseline of approximately 33%, and surpassed even the more advanced Gemini 3 Deep Think's 45% at half the cost.
The Core Technology: Recursive Self-Improvement
The central idea behind Poetic's success is a form of recursive self-improvement (RSI), which co-founder Ian Fisher describes as "the holy grail of AI." The goal is to create a system where the AI actively makes itself smarter.
Unlike methods that require fine-tuning or access to model weights, Poetic's approach operates purely at the system and prompt level. This is a crucial advantage when working with closed-source models available only through APIs. The methodology involves:
⦁ Ensemble Methods: The system calls the underlying model (e.g., Gemini 3) multiple times.
⦁ Independent Refinement: Each member of the ensemble works independently to refine its own answer.
⦁ Advanced Voting Schemes: The refined answers are combined using a sophisticated voting mechanism to produce a final, more accurate solution.
This system-level optimization is what differentiates Poetic from other prompt engineering frameworks like DSPy; it incorporates what Fisher calls "trade secret insights" that yield a significant performance difference. The entire ARC-AGI solver was an output of their system, which was trained on ARC-1 and then applied to ARC-2 without any specific training on the latter.
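The ensemble-refine-vote shape of the methodology can be sketched generically. Poetic's actual refinement and voting schemes are trade secrets, so everything below (the toy `call_model`, `refine`, and plurality vote) is an invented stand-in that only illustrates why the loop helps: independent noisy answers plus a final vote beat a single call.

```python
import random
from collections import Counter

random.seed(0)

def call_model(task):
    """Stand-in for an API call to the underlying model (e.g. Gemini 3):
    returns the right answer 60% of the time, otherwise a random guess."""
    return task["answer"] if random.random() < 0.6 else random.randint(0, 9)

def refine(task, answer):
    """Each ensemble member independently improves its own draft;
    here, modeled as one extra chance to self-correct."""
    return task["answer"] if random.random() < 0.5 else answer

def solve(task, n=9):
    drafts = [call_model(task) for _ in range(n)]    # ensemble of model calls
    refined = [refine(task, d) for d in drafts]      # independent refinement
    votes = Counter(refined)                         # simple plurality vote
    return votes.most_common(1)[0][0]

print(solve({"answer": 7}))
```

Note that nothing here touches model weights: the whole loop lives at the prompt/system level, which is why it works through an API alone.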
The Gemini 3 Catalyst
The release of Gemini 3 was a pivotal moment. While Poetic's system showed promising results on ARC-1 with other models (reaching 89%), switching to Gemini 3 pushed their performance to 95%. When they applied this new combination to the more challenging ARC-2, they had a "holy cow moment" as the performance jumped to the state-of-the-art 54%.
Fisher attributes this leap to Gemini 3's exceptional ability to generate code for visual problem-solving, a capability that surpassed previous models. He also notes that other powerful models like Anthropic's Opus can be swapped in for Gemini 3 to achieve similar results, albeit at a higher cost.
A Path to AGI and Practical Applications
Fisher views RSI as both a practical tool for immediate performance gains and a credible path toward AGI.
⦁ Immediate Value: The performance "bump" from Poetic's system can be highly valuable. On the ARC-AGI benchmark, which allows for two solution submissions, their method provided a single, higher-quality solution that outperformed the underlying model's two submissions, sometimes at a lower overall cost.
⦁ Long-Term Vision: While not the only path, Fisher believes RSI is "the most exciting path to AGI and beyond." The process on ARC-AGI was stopped manually due to cost constraints, suggesting that with more resources, the performance could have "hill-climbed" even further.
Automating the Prompt Engineer
The broader vision for Poetic is to automate the complex and often manual process of prompt engineering and agent creation. Fisher draws an analogy to the evolution of deep learning, which automated the manual process of feature engineering.
"We are quite intentionally automating ourselves, automating prompt engineers, automating people who are building agents. It's a power tool."
He contrasts their previous manual work at DeepMind—akin to building a car by hand—with Poetic's technology, which is like "building a factory to build cars." The goal is to create a system that automatically discovers the optimal prompts and system configurations, removing the human from the tedious trial-and-error loop. While continuing their research and targeting other high-impact benchmarks, the six-person team is now also focusing on bringing t...
She Raised $64M to Build an AI Math Prodigy | Carina Hong, CEO of Axiom
Carina Hong, Founder & CEO of Axiom, discusses building a self-improving AI reasoning engine that combines generation and verification. Starting with formal mathematics, Axiom's system has achieved superhuman results on the notoriously difficult Putnam Exam by leveraging formal languages like Lean to overcome the probabilistic and unverifiable nature of standard LLMs. Hong explores how this technology can solve major bottlenecks in hardware and software verification, code migration, and database consistency, and what it means for the future of mathematical research.
Axiom's mission is to build a self-improving reasoning engine that uniquely combines generation and verification, an often-overlooked component in the current AI landscape. The company starts with an "AI mathematician" as a testing ground for this self-improvement loop, using formal languages like Lean to ground its natural language capabilities.
The Architecture of a Reasoning Engine
Axiom's system is built on three core components that interact with each other:
⦁ Prover: A system that can prove theorems.
⦁ Conjecturer: A system that proposes interesting and novel conjectures.
⦁ Knowledge Base: A database of what has already been proven, which both the prover and conjecturer can reference.
Tying these components together is auto-formalization, the process of converting natural language mathematics into a formal language. This is a core technology for Axiom, viewed as being as challenging and important as theorem proving itself.
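The three-component loop can be sketched in miniature. The class names follow the text; the toy arithmetic "theorems" and every method body are invented for illustration only (Axiom's prover emits machine-checkable Lean proofs, not Python checks).

```python
# Minimal sketch of the prover / conjecturer / knowledge-base loop.

class KnowledgeBase:
    """What has already been proven; read by both other components."""
    def __init__(self):
        self.proven = set()

    def add(self, stmt):
        self.proven.add(stmt)

class Conjecturer:
    def propose(self, kb):
        # A toy stand-in for "interesting and novel": propose the next
        # statement in a family, sized by how much is already known.
        n = len(kb.proven) + 1
        return f"{n} + {n} = {2 * n}"

class Prover:
    def prove(self, conjecture, kb):
        # A real prover searches for a formal proof; here we just
        # verify the toy arithmetic claim.
        lhs, rhs = conjecture.split("=")
        a, _, b = lhs.split()
        return int(a) + int(b) == int(rhs)

kb, conj, prover = KnowledgeBase(), Conjecturer(), Prover()
for _ in range(3):
    c = conj.propose(kb)        # conjecturer reads the knowledge base
    if prover.prove(c, kb):     # prover attempts the conjecture
        kb.add(c)               # proven results feed back into the base

print(sorted(kb.proven))        # ['1 + 1 = 2', '2 + 2 = 4', '3 + 3 = 6']
```

The self-improvement comes from the feedback edge: each proven statement enlarges the knowledge base that both the prover and the conjecturer draw on next round.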
Superhuman Performance on the Putnam Exam
Axiom's prover has demonstrated remarkable capabilities on the Putnam Mathematical Competition, an infamously difficult exam for undergraduates where the median score is often zero.
⦁ Axiom's system solved 8 out of 12 problems within the official time limit, a score that would place it in the top five (Putnam Fellow). A ninth problem was solved shortly after.
⦁ This performance significantly surpasses that of Axiom's founder, Carina Hong, who scored 4 out of 12.
⦁ This success showcases the power of combining deterministic, formal tools with probabilistic systems. Formal systems cannot "hand-wave" through difficult steps, forcing a level of rigor that informal LLMs lack. For instance, the AI prover might spend significant effort generating detailed code to rigorously prove convergence or limits, something a human might take for granted.
AI vs. Human Problem-Solving
While LLMs can seem impressive on some math problems, they often fail on seemingly simpler brain teasers because they lack true reasoning and verification. They generate solutions statistically without a guarantee of soundness.
⦁ Formal Verification's Role: Axiom's use of formal languages like Lean ensures that a proof is sound. Unlike a natural language proof from an LLM, which can have subtle flaws that are hard to spot, a Lean proof is machine-verifiable.
⦁ Interpretability: While the AI may generate proofs that are structured differently from human proofs, they are ultimately interpretable. The formal code of each step can be inspected and converted back to natural language, a significantly easier task than the initial formalization. The AI may find solutions that are convergent with what a human would find, acting like a collaborator with a different style, akin to the discovery of a self-taught genius like Ramanujan.
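As a generic illustration of what "machine-verifiable" means (standard Lean 4, not Axiom's code): the kernel checks every tactic step, so an unjustified step is rejected rather than hand-waved past.

```lean
-- A Lean 4 proof that addition of naturals is commutative, by induction.
-- Each step must be justified; the kernel rejects anything unproven.
theorem add_comm' (m n : Nat) : m + n = n + m := by
  induction n with
  | zero => simp
  | succ k ih => rw [Nat.add_succ, Nat.succ_add, ih]
```

A flawed natural-language proof can hide a gap for years; a Lean proof with a gap simply fails to compile.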
Applications Beyond Pure Mathematics
The core technology of generation paired with verification has profound implications for high-stakes commercial applications where correctness is critical. Formal verification is a major bottleneck in many industries, often consuming years of effort.
⦁ Hardware and Software Verification: In chip design, verification teams can be three to four times larger than design teams, with verification cycles taking years. AI-powered formal verification can dramatically reduce this time and lower the expertise required. AWS, for example, took five years to manually formalize just one component of its hypervisor.
⦁ Code Migration and Equivalence: When upgrading legacy systems, it's crucial to ensure the new code is perfectly equivalent to the old code. Formal methods can prove this equivalence, preventing regressions in critical business functions.
⦁ Database Consistency: Formal verification can be used to prove the consistency of database protocols, such as solving the Byzantine Generals Problem, ensuring reliability even in the presence of bad act...
Full story
The Architecture of a Reasoning Engine
Axiom's system is built on three core components that interact with each other:
⦁ Prover: A system that can prove theorems.
⦁ Conjecturer: A system that proposes interesting and novel conjectures.
⦁ Knowledge Base: A database of what has already been proven, which both the prover and conjecturer can reference.
Tying these components together is auto-formalization, the process of converting natural language mathematics into a formal language. This is a core technology for Axiom, viewed as being as challenging and important as theorem proving itself.
Superhuman Performance on the Putnam Exam
Axiom's prover has demonstrated remarkable capabilities on the Putnam Mathematical Competition, an infamously difficult exam for undergraduates where the median score is often zero.
⦁ Axiom's system solved 8 out of 12 problems within the official time limit, a score that would place it in the top five (Putnam Fellow). A ninth problem was solved shortly after.
⦁ This performance significantly surpasses that of Axiom's founder, Carina Hong, who scored 4 out of 12.
⦁ This success showcases the power of combining deterministic, formal tools with probabilistic systems. Formal systems cannot "hand-wave" through difficult steps, forcing a level of rigor that informal LLMs lack. For instance, the AI prover might spend significant effort generating detailed code to rigorously prove convergence or limits, something a human might take for granted.
AI vs. Human Problem-Solving
While LLMs can seem impressive on some math problems, they often fail on seemingly simpler brain teasers because they lack true reasoning and verification. They generate solutions statistically without a guarantee of soundness.
⦁ Formal Verification's Role: Axiom's use of formal languages like Lean ensures that a proof is sound. Unlike a natural language proof from an LLM, which can have subtle flaws that are hard to spot, a Lean proof is machine-verifiable.
⦁ Interpretability: While the AI may generate proofs that are structured differently from human proofs, they are ultimately interpretable. The formal code of each step can be inspected and converted back to natural language, a significantly easier task than the initial formalization. The AI may find solutions that are convergent with what a human would find, acting like a collaborator with a different style, akin to the discovery of a self-taught genius like Ramanujan.
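The soundness guarantee can be made concrete with a tiny Lean example. Every step below is checked by Lean's kernel; if any step were wrong, the file simply would not compile, which is exactly why a Lean proof cannot "hand-wave":

```lean
-- Checked by Lean 4's kernel: a flawed step fails to compile.
theorem n_add_zero (n : Nat) : n + 0 = n := rfl

theorem sum_comm (a b : Nat) : a + b = b + a := Nat.add_comm a b
```

A natural-language proof with a subtle gap would pass human review far more easily than this would pass the type checker.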
Applications Beyond Pure Mathematics
The core technology of generation paired with verification has profound implications for high-stakes commercial applications where correctness is critical. Formal verification is a major bottleneck in many industries, often consuming years of effort.
⦁ Hardware and Software Verification: In chip design, verification teams can be three to four times larger than design teams, with verification cycles taking years. AI-powered formal verification can dramatically reduce this time and lower the expertise required. AWS, for example, took five years to manually formalize just one component of its hypervisor.
⦁ Code Migration and Equivalence: When upgrading legacy systems, it's crucial to ensure the new code is perfectly equivalent to the old code. Formal methods can prove this equivalence, preventing regressions in critical business functions.
⦁ Database Consistency: Formal verification can be used to prove the consistency of database protocols, such as solving the Byzantine Generals Problem, ensuring reliability even in the presence of bad act...
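The code-equivalence claim can be illustrated with a toy check. Here the input space is small enough to enumerate exhaustively; formal tools (SMT solvers, Lean) establish the same property symbolically for all inputs rather than by enumeration. The functions below are invented stand-ins for a "legacy" and a "migrated" implementation:

```python
# Toy equivalence check between a legacy and a migrated implementation,
# over the full 8-bit input space (formal methods generalize this symbolically).

def old_impl(x: int) -> int:
    return (x * 2) & 0xFF      # legacy: multiply, truncated to 8 bits

def new_impl(x: int) -> int:
    return (x << 1) & 0xFF     # migrated: shift, truncated to 8 bits

counterexamples = [x for x in range(256) if old_impl(x) != new_impl(x)]
print(counterexamples)         # [] -- no input distinguishes the two
```

An SMT solver would prove the 32- or 64-bit version of this in milliseconds, where enumeration is infeasible.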
Full story
tokenless.tech
She Raised $64M to Build an AI Math Prodigy | Carina Hong, CEO of Axiom | Tokenless
Carina Hong, Founder & CEO of Axiom, discusses building a self-improving AI reasoning engine that combines generation and verification. Starting with formal mathematics, Axiom's system has achieved superhuman results on the notoriously difficult Putnam Exam…
Inference at Scale: Breaking the Memory Wall
Sid Sheth, CEO of d-matrix, details their memory-centric approach to AI inference hardware, focusing on their Digital In-Memory Compute (DIMC) architecture. He explains how DIMC, an augmented SRAM technology, minimizes data movement to solve the memory bottleneck, delivering significant gains in latency and energy efficiency, particularly for the 'decode' phase of large language models.
The Bet on Cloud Inference and Memory-Centric Design
Founded in 2019, before the rise of ChatGPT, d-matrix made a contrarian bet on data center and cloud inference. While many startups focused on edge computing or the highly competitive training market dominated by NVIDIA, d-matrix identified a gap for a dedicated, efficient inference solution in the cloud.
The founding team anticipated that AI models, particularly transformers like BERT and the emerging GPT-3, would continue to grow in size, making memory access the primary bottleneck. Their first-principles analysis of the inference workload revealed it to be a repetitive, parallel compute problem heavily dependent on memory access. This led to their core strategy: integrating memory and compute as closely as possible to build a fundamentally more efficient architecture.
The Memory Bottleneck: HBM vs. SRAM
The choice of memory technology is critical for AI hardware. Sid Sheth provides a clear breakdown of the trade-offs:
⦁ High-Bandwidth Memory (HBM): Originally developed for High-Performance Computing (HPC) and later adopted for AI training, HBM acts like a "highway with many lanes," providing high-bandwidth access to a processor. While effective for the massive, parallel data needs of training, HBM is a poor fit for mainstream inference due to three key factors:
⦁ Cost: It remains an expensive technology.
⦁ Energy: It is very power-hungry.
⦁ Bandwidth Limits: The pace of AI model growth is outstripping HBM's ability to scale its bandwidth, making it "not fast anymore" for cutting-edge inference needs.
⦁ SRAM (Static RAM): d-matrix, along with other early players like Groq and Cerebras, initially focused on SRAM for its speed. However, on-chip SRAM capacity is limited. Recognizing that models would quickly outgrow a single chip, d-matrix designed its system with a two-tiered memory approach from the start, using a large on-chip SRAM tier and a second, larger LPDDR memory tier to accommodate extremely large models and the exploding KV-cache sizes associated with long contexts.
Prefill vs. Decode: The Two Phases of Generative Inference
Generative AI models operate in two distinct phases, which have different hardware requirements:
1. Prefill (The "Thinking" Phase): When a model receives a prompt, it processes the input and generates the internal context (KV cache). This phase is compute-intensive.
2. Decode (The "Speaking" Phase): The model then generates the response token by token. Each new token requires accessing the entire KV cache. This phase is memory-intensive and highly sensitive to latency. A slow decode phase results in a poor user experience, with long delays between words.
d-matrix's architecture is particularly well-suited for accelerating the memory-bound decode phase, where low latency is paramount.
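A back-of-envelope roofline calculation makes the decode bottleneck concrete. The figures below are illustrative HBM-class numbers, not d-matrix's specs; the shape of the result is what matters — at batch size 1, every weight must be streamed from memory for each generated token:

```python
# Illustrative roofline estimate: why batch-1 decode is memory-bound.
# All numbers are hypothetical, not d-matrix's or any vendor's specs.
params = 70e9                      # hypothetical 70B-parameter model
weight_bytes = params * 2          # FP16: 2 bytes per parameter
mem_bw = 3.35e12                   # ~3.35 TB/s, an HBM-class bandwidth
peak_flops = 1e15                  # ~1 PFLOP/s dense FP16 compute

flops_per_token = 2 * params       # 2 FLOPs (multiply + add) per parameter
t_memory = weight_bytes / mem_bw   # time to stream the weights once
t_compute = flops_per_token / peak_flops

print(f"memory-limited:  {1 / t_memory:.0f} tok/s")
print(f"compute-limited: {1 / t_compute:.0f} tok/s")
```

Memory is the binding constraint by two orders of magnitude here, which is why minimizing data movement (rather than adding raw FLOPs) is the lever that matters for decode.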
Digital In-Memory Compute (DIMC): The Core Innovation
d-matrix's key technology is Digital In-Memory Compute (DIMC). It's a novel architecture that turns memory itself into a compute fabric.
⦁ How it Works: A traditional SRAM cell uses six transistors (6T) to store one bit of data. d-matrix augmented this design, creating a ten-transistor (10T) cell that can both store a bit and perform a single-bit multiplication.
⦁ The Benefit: By embedding compute directly within the memory array, model parameters (weights) can be stored and used for matrix math calculations without being moved. This minimization of data movement is the key to efficiency. It saves a tremendous amount of time and energy, directly addressing the three most precious resources: money, time, and energy.
This approach allows all rows of the SRAM to be activated simultaneously, creating a dataflow engine with much higher throughput than a traditional SRAM.
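A toy bit-level model conveys the idea (this is a conceptual sketch, not d-matrix's actual circuit): each augmented cell stores one weight bit and ANDs it with an incoming activation bit, and the array sums the one-bit products:

```python
# Conceptual model of digital in-memory compute: a 1-bit multiply is an AND,
# and the array reduces the partial products with a popcount-style sum.
# This is a sketch of the idea, not d-matrix's actual 10T cell design.

def dimc_dot(weight_bits, activation_bits):
    """All 'rows' fire at once: AND per cell, then sum the 1-bit products."""
    return sum(w & a for w, a in zip(weight_bits, activation_bits))

w = [1, 0, 1, 1, 0, 1, 0, 1]   # one column of stored weight bits
a = [1, 1, 0, 1, 0, 1, 1, 0]   # broadcast activation bits
print(dimc_dot(w, a))          # binary dot product computed "in place"
```

Multi-bit multiplies are then assembled from these 1-bit partial products by shifting and adding — the weights themselves never leave the memory array.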
System, Scale, and Performance Trade-offs
The d-matrix solution is bu...
Full story
A Philosophy of Building for the Future
The core development principle at Anthropic, and for Claude Code specifically, is to not build for the model of today, but for the model that will exist in six months. This forward-looking approach anticipates the rapid, exponential improvement in model capabilities. Builders are advised to identify the current frontiers where a model is weak, with the confidence that it will become proficient in those areas over time.
This philosophy is heavily influenced by Rich Sutton's "The Bitter Lesson," which posits that general models that leverage computation will ultimately outperform more specialized, human-designed systems. Consequently, the Claude Code team is cautious about building what they call "scaffolding"—product features or code that compensates for a model's current shortcomings. This scaffolding often provides a temporary 10-20% performance gain but is rendered obsolete by the next model iteration.
"Never bet against the model... We could also just wait like a couple of months and the model can probably just do the thing instead."
This results in a dynamic and ephemeral codebase. Virtually no part of Claude Code that existed six months ago is still in the product today. The entire application is constantly being written, rewritten, and refactored as model capabilities advance, with tools and features being added and removed every couple of weeks.
The Accidental Genius of the Terminal
Claude Code's existence as a command-line interface (CLI) was not a grand design but an accident. It began as a simple terminal-based chat application built by Boris Cherny to familiarize himself with the Anthropic API. The initial goal was simply to explore what a coding product could be.
The "aha!" moment came when the model was given a bash tool. When asked, "What music am I listening to?" the model, Sonnet 3.5 at the time, independently wrote and executed AppleScript to query the user's music player. This demonstrated an innate desire to use tools and interact with the world, which became a foundational insight for the product's direction.
The terminal, chosen for its simplicity and lack of UI overhead, proved to be a surprisingly effective and enduring form factor. Its constraints fostered an elegant and powerful developer experience that resonated deeply with engineers, leading to rapid, viral adoption within Anthropic long before its public release.
Features Born from Latent Demand
A key product principle is to identify and serve "latent demand"—making it easier for users to do what they are already trying to do. Many of Claude Code's core features originated from observing user workarounds and desires.
CLAUDE.md
The concept for CLAUDE.md emerged when developers were observed writing their own markdown files with instructions and context, which they would then feed to the model. This behavior was formalized into a feature that allows teams to maintain a shared set of instructions and context checked into their codebase. The advice for maintaining these files is to be minimal; if a CLAUDE.md becomes too long or complex, it's often best to delete it and start fresh, adding instructions back only as needed, as newer models require less guidance.
Plan Mode
Plan Mode was created in a 30-minute coding session on a Sunday night in response to observing users explicitly asking the model to "plan this out but don't write any code yet." The implementation is deceptively simple: it just adds a single sentence to the prompt, "please don't code." While currently a heavily used feature to ensure the model is on the right track before execution, Boris predicts it may have a limited lifespan as models become capable enough to generate and execute a correct plan from a single prompt.
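The "single sentence" claim can be sketched in a few lines. The function name and exact wording below are illustrative, not Claude Code's source:

```python
# Conceptually, Plan Mode is one appended instruction. The function name
# and wording here are illustrative stand-ins, not Claude Code's source.

def apply_plan_mode(system_prompt: str, plan_mode: bool) -> str:
    if plan_mode:
        return system_prompt + "\nPlease don't write any code yet; produce a plan first."
    return system_prompt

print(apply_plan_mode("You are a coding agent.", plan_mode=True))
```

The leverage comes entirely from the model honoring the instruction, which is why the feature may age out as models plan well unprompted.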
From Solo Agent to Agent Swarms
The architecture of work is evolving from single-agent interactions to multi-agent collaboration. Claude Code heavil...
Full story
tokenless.tech
Boris Cherny: How We Built Claude Code | Tokenless
Boris Cherny, creator of Claude Code, shares the development philosophy behind the AI coding tool, emphasizing building for future models, leveraging latent user demand, and the surprising longevity of the terminal interface.
The Laws of Thought: The Math of Minds and Machines, with Prof. Tom Griffiths
Princeton Professor Tom Griffiths discusses his book "The Laws of Thought," exploring the mathematical models that govern both biological and artificial intelligence. He details the fundamental differences between human and machine cognition, rooted in their vastly different constraints, and explains how concepts like inductive bias, probability, and curiosity can bridge the gap between cognitive science and modern AI.
Professor Tom Griffiths of Princeton University explores the mathematical principles that form the foundation of both human and artificial intelligence, bridging the gap between two contrasting views of the human mind. While psychologists often highlight human irrationality and biases, computer scientists see human cognition as an inspiration for AI. Griffiths' work seeks to reconcile these perspectives by framing human intelligence as a rational adaptation to significant constraints.
The Laws of Thought: A Mathematical Theory of Mind
The core idea of Griffiths' book, The Laws of Thought, is that just as mathematical laws of nature describe the external, physical world, a complementary set of mathematical principles can describe our internal, mental world.
⦁ From Behaviorism to Cognitive Science: Early psychology struggled to scientifically study internal thoughts, leading to the rise of behaviorism, which focused only on observable behaviors. The cognitive revolution was made possible by the development of computers and mathematical concepts like logic and probability, which provided a new, rigorous language to form and test hypotheses about the mind.
⦁ Research Methodology: Modern cognitive science research often involves large-scale online experiments. In Griffiths' lab, participants are presented with problems that require them to make inferences or decisions from data. By analyzing the responses from thousands of participants using modern machine learning tools like neural networks, researchers can develop and refine computational models of human cognition.
Human vs. AI: A Tale of Two Intelligences
A key distinction between human and artificial intelligence lies in the constraints they operate under. Humans are limited by time (a finite lifespan), computation (a few pounds of neural tissue), and communication bandwidth. In contrast, AI systems can be scaled with more data and compute and can transfer information perfectly. This leads to fundamentally different problem-solving approaches.
⦁ Inductive Bias and The Data Gap: A human child learns a language in about five years, whereas an LLM requires the equivalent of thousands of years of text data. This vast difference highlights the powerful inductive biases, or priors, built into human cognition. These biases provide a starting framework that makes learning from sparse data possible.
⦁ The Machine Learning Paradigm: Since the success of AlexNet in 2012, the dominant paradigm in machine learning has been one of weak inductive biases and massive datasets. The philosophy is that with enough data, a sufficiently complex model can learn the necessary features and solutions without human-engineered priors. This is the opposite of the human approach.
⦁ Engineering Inductive Bias: To create more human-like AI, we may need to engineer these biases. Meta-learning is one such technique, where a model learns an optimal set of initial weights by being trained on a wide variety of tasks. This provides a "soft bias" that guides the model toward effective solutions without rigidly constraining it, making it better at few-shot learning.
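The meta-learning idea can be sketched in miniature. The toy below is a Reptile-style loop on trivial one-parameter tasks — an illustration of "learn an initialization that adapts fast," not Griffiths' or any production method:

```python
# Toy Reptile-style meta-learning: learn an initial weight w0 that adapts
# quickly across a family of tasks y = a * x. Illustrative sketch only.

def inner_sgd(w, a, lr=0.1, steps=10):
    # Task: fit y = a * x with model w * x, on the single point x = 1.
    for _ in range(steps):
        grad = 2 * (w - a)                     # d/dw of (w - a)^2
        w -= lr * grad
    return w

def reptile(tasks, meta_lr=0.5, epochs=100):
    w0 = 0.0                                   # meta-learned initialization
    for _ in range(epochs):
        for a in tasks:
            w_adapted = inner_sgd(w0, a)
            w0 += meta_lr * (w_adapted - w0)   # nudge the init toward the solution
    return w0

w_star = reptile([2.0, 4.0, 6.0])
print(w_star)   # settles inside the task range, a "soft bias" toward all tasks
```

The learned initialization is the soft prior: it does not solve any one task outright, but it puts every task within a few gradient steps.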
Deconstructing Large Language Models
Griffiths' research provides a scientific lens for understanding the behavior of LLMs.
⦁ Deductive vs. Inductive Problems: Early symbolic AI excelled at deductive problems (e.g., logic, chess), where all necessary information is provided. However, it struggled with inductive problems—the cornerstone of human intelligence—where conclusions must be drawn from incomplete information. Probability theory, particularly Bayes' rule, provides the mathematical framework for induction.
⦁ "Embers of Autoregression": LLMs are trained to predict the next token in a sequence, which makes them highly sensitive to the statistical patterns in their training data. This can lead to counter-intuitive behavior. For example, an LLM might be mo...
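The Bayesian framing of induction can be shown with a minimal worked example (a toy of my own, not one from the talk): Bayes' rule draws a strong conclusion from incomplete evidence, which is exactly what deductive systems cannot do:

```python
from fractions import Fraction

# Bayes' rule on a toy induction problem: is a coin fair or double-headed,
# given only that every flip so far came up heads? Illustrative example.
prior   = {"fair": Fraction(1, 2), "double-headed": Fraction(1, 2)}
p_heads = {"fair": Fraction(1, 2), "double-headed": Fraction(1, 1)}

def update(prior, n_heads):
    # posterior(h) is proportional to prior(h) * likelihood(data | h)
    unnorm = {h: prior[h] * p_heads[h] ** n_heads for h in prior}
    z = sum(unnorm.values())
    return {h: p / z for h, p in unnorm.items()}

post = update(prior, 3)            # observed: three heads in a row
print(post["double-headed"])       # 8/9 -- confident, yet revisable by one tails
```

No finite run of heads deductively entails a double-headed coin, but the posterior quantifies exactly how much the incomplete evidence should move us.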
Full story
How A Team Of 7 Keeps Breaking AI Benchmark Records
Poetiq, a startup by former DeepMind researchers, has developed a recursive self-improvement meta-system that builds "reasoning harnesses" on top of existing LLMs. This approach avoids the costly "fine-tuning trap" and has achieved state-of-the-art results on benchmarks like ARC-AGI and Humanity's Last Exam by automatically optimizing prompts and discovering novel reasoning strategies.
Poetiq is building a recursively self-improving system that acts as a "reasoning harness" for large language models (LLMs). The core insight is a method for recursive self-improvement—where an AI makes itself smarter—that is significantly faster and cheaper than traditional approaches, which typically require retraining a new LLM from scratch at a cost of hundreds of millions of dollars.
The Challenge: The "Fine-Tuning Trap"
Many companies building on LLMs face a significant challenge: the "fine-tuning trap." The conventional approach involves:
1. Collecting a large dataset (tens of thousands of examples).
2. Spending a great deal on compute to fine-tune a frontier or open-weights model.
3. Achieving improved performance on a specific task.
However, this process is vulnerable to the "bitter lesson." Soon after, a new, more powerful base model is released (e.g., GPT-4o) that outperforms the specialized, fine-tuned model out of the box. The company is then faced with the choice of repeating the expensive fine-tuning process or going out of business.
Poetiq's Solution: "Stilts" for LLMs
Poetiq offers a different paradigm. Instead of fine-tuning, it provides an agentic system, or "harness," that sits on top of one or more LLMs. This harness enhances the base model's capabilities, acting like "stilts" to make it perform better on specific, hard problems.
Key advantages of this approach include:
⦁ Model Agnostic: When a new frontier model is released, the same harness remains compatible and provides an immediate performance boost.
⦁ Cost-Effective: The optimization process is vastly cheaper than fine-tuning.
⦁ Continuous Improvement: The harness can be further optimized for the new model to achieve even greater performance, ensuring the system always outperforms the underlying base models.
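The harness pattern reduces to generate-then-verify. The sketch below uses stand-in `model` and `verify` callables, not Poetiq's actual interfaces; the base model is swappable precisely because the harness only depends on this narrow contract:

```python
# Minimal sketch of a model-agnostic "harness": sample candidate answers
# from any base model and keep the first one that passes a verifier.
# `model` and `verify` are stand-ins, not Poetiq's actual interfaces.

def harness(model, verify, prompt, attempts=8):
    for _ in range(attempts):
        candidate = model(prompt)
        if verify(candidate):      # e.g. run tests, check a proof, re-derive
            return candidate
    return None                    # nothing verified: escalate or fall back

# Toy demo: a "model" that emits guesses, and a verifier that actually checks.
guesses = iter([3, 10, 7, 2])
model = lambda prompt: next(guesses)
verify = lambda answer: answer * answer == 49
result = harness(model, verify, "What is the square root of 49?")
print(result)   # 7 -- the third guess is the first to verify
```

Swapping in a stronger base model changes only `model`; the harness and verifier carry over unchanged, which is the "model agnostic" advantage above.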
How the Meta-System Works
Poetiq's core technology is a recursively self-improving meta-system. Its output is not just a single solution but entire systems that solve hard problems. This automated optimization process can generate a complete reasoning harness—comprising code, prompts, and data—from scratch.
Furthermore, if a startup has already built its own agent, Poetiq can ingest that system and optimize its components, such as prompts or reasoning strategies. The meta-system analyzes the problem data to discover failure modes and identify robust reasoning paths, effectively outsourcing the deep, manual data analysis to the AI itself.
Automating Prompt and Reasoning Engineering
The system moves beyond simple prompt optimization. While automated prompt tuning (like the popular DSPy framework) provides some gains, the most significant improvements come from discovering novel reasoning strategies that are implemented in code.
In one example from a previous project at DeepMind, manual prompt optimization on a very hard task took performance to 5%. However, by adding optimized reasoning strategies, performance jumped from 5% to 95%. The Poetiq meta-system automates this discovery process, often generating non-intuitive prompts and strategies that a human might not devise.
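At its simplest, automated prompt optimization is a search loop: score candidate prompts on a dev set and keep the winner. The sketch below is in the spirit of tools like DSPy but uses none of its API; the "model" is a deterministic stand-in:

```python
# Toy automated prompt search: evaluate each candidate prompt on a dev set
# and keep the best. Illustrative only; not DSPy's or Poetiq's interfaces.

def evaluate(prompt, dev_set, model):
    return sum(model(prompt, x) == y for x, y in dev_set) / len(dev_set)

def optimize(candidates, dev_set, model):
    return max(candidates, key=lambda p: evaluate(p, dev_set, model))

# Stand-in "model": only answers correctly when told to think step by step.
model = lambda prompt, x: x * 2 if "step by step" in prompt else 0
dev_set = [(1, 2), (3, 6), (5, 10)]
best = optimize(["Answer directly.", "Think step by step."], dev_set, model)
print(best)   # Think step by step.
```

The meta-system's extra leap is searching over reasoning strategies expressed as code, not just over prompt strings — which is where the 5% to 95% jump came from.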
Demonstrated Success on Key Benchmarks
Poetiq has validated its approach by achieving top rankings on difficult AI benchmarks:
⦁ ARC-AGI: Shortly after a new model set the state-of-the-art at 45%, Poetiq's system achieved 54% accuracy. Notably, it did so by building on a cheaper base model, costing only $32 per problem compared to the previous SOTA's $70+.
⦁ Humanity's Last Exam: On this set of 2,500 expert-level questions, Poetiq achieved a score of 55%, surpassing the previous record of 53.1% set by Anthropic's Claude Opus. The entire optimization run for this achievement cost less than $100,000, a fraction of the cost of training a foundation model.
These results demonstrate the system's ability to enhance both complex reasoning (ARC...
Full story
The Challenge: The "Fine-Tuning Trap"
Many companies building on LLMs face a significant challenge: the "fine-tuning trap." The conventional approach involves:
1. Collecting a large dataset (tens of thousands of examples).
2. Spending a great deal on compute to fine-tune a frontier or open-weights model.
3. Achieving improved performance on a specific task.
However, this process is vulnerable to the "bitter lesson." Soon after, a new, more powerful base model is released (e.g., GPT-4o) that outperforms the specialized, fine-tuned model out of the box. The company is then faced with the choice of repeating the expensive fine-tuning process or going out of business.
Poetiq's Solution: "Stilts" for LLMs
Poetiq offers a different paradigm. Instead of fine-tuning, it provides an agentic system, or "harness," that sits on top of one or more LLMs. This harness enhances the base model's capabilities, acting like "stilts" to make it perform better on specific, hard problems.
Key advantages of this approach include:
⦁ Model Agnostic: When a new frontier model is released, the same harness remains compatible and provides an immediate performance boost.
⦁ Cost-Effective: The optimization process is vastly cheaper than fine-tuning.
⦁ Continuous Improvement: The harness can be further optimized for the new model to achieve even greater performance, ensuring the system always outperforms the underlying base models.
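The harness idea can be illustrated with a minimal sketch. All names here (`Harness`, `toy_llm`, the self-consistency strategy) are hypothetical stand-ins, not Poetiq's actual system; the point is that the harness treats the base model as a swappable black box:

```python
from typing import Callable

# A hypothetical reasoning harness: it wraps ANY base model (a plain
# text-in/text-out callable), so upgrading the model is a one-line swap.
class Harness:
    def __init__(self, llm: Callable[[str], str], strategy_prompt: str):
        self.llm = llm                          # base model, treated as a black box
        self.strategy_prompt = strategy_prompt  # optimized instructions

    def solve(self, problem: str, n_attempts: int = 3) -> str:
        # Sample several candidate answers and keep the most common one
        # (self-consistency), a simple example of a reasoning strategy.
        attempts = [self.llm(f"{self.strategy_prompt}\n\n{problem}")
                    for _ in range(n_attempts)]
        return max(set(attempts), key=attempts.count)

# Stub standing in for a real LLM API call.
def toy_llm(prompt: str) -> str:
    return "42" if "meaning" in prompt else "unknown"

harness = Harness(toy_llm, "Think step by step, then answer concisely.")
print(harness.solve("What is the meaning of life?"))  # prints 42
```

Because `Harness` depends only on a text-in/text-out callable, swapping in a new frontier model is a one-line change, while the optimized strategy carries over.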
How the Meta-System Works
Poetiq's core technology is a recursively self-improving meta-system. Its output is not just a single solution, but entire systems that solve hard problems. This automated optimization process can generate a complete reasoning harness (code, prompts, and data) from scratch.
Furthermore, if a startup has already built its own agent, Poetiq can ingest that system and optimize its components, such as prompts or reasoning strategies. The meta-system analyzes the problem data to discover failure modes and identify robust reasoning paths, effectively outsourcing the deep, manual data analysis to the AI itself.
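A minimal sketch of such an optimization loop, under the assumption that it reduces to proposing candidate configurations and scoring them on problem data. The search space and scoring function below are toy stand-ins; a real meta-system would propose prompts, code, and strategies with an LLM:

```python
from itertools import product

# Stub scorer: pretend configs with more attempts and a chain-of-thought
# prompt score higher. A real system would run each config on held-out
# problems and measure accuracy.
def evaluate(config: dict, problems: list) -> float:
    score = 0.2 * config["n_attempts"] + (0.3 if "step" in config["prompt"] else 0.0)
    return min(score, 1.0)

def optimize(problems: list) -> dict:
    # Enumerate a tiny search space of (prompt, attempt-count) configs
    # and keep the best-scoring one.
    prompts = ["Answer directly.", "Think step by step."]
    return max(
        ({"prompt": p, "n_attempts": n} for p, n in product(prompts, range(1, 5))),
        key=lambda c: evaluate(c, problems),
    )

best = optimize(problems=[])
print(best)  # the step-by-step prompt with the most attempts wins
```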
Automating Prompt and Reasoning Engineering
The system moves beyond simple prompt optimization. While automated prompt tuning (like the popular
DSPy framework) provides some gains, the most significant improvements come from discovering novel reasoning strategies that are implemented in code.
tokenless.tech
How A Team Of 7 Keeps Breaking AI Benchmark Records | Tokenless
Poetiq, a startup by former DeepMind researchers, has developed a recursive self-improvement meta-system that builds "reasoning harnesses" on top of existing LLMs. This approach avoids the costly "fine-tuning trap" and has achieved state-of-the-art results…
How AI is changing Software Engineering: A Conversation with Gergely Orosz, @pragmaticengineer
Gergely Orosz, author of The Pragmatic Engineer, discusses the bizarre trend of 'token maxing' in Big Tech, the evolving role of software engineers in the AI era, and why companies are heavily investing in internal AI infrastructure despite uncertain productivity gains.
The Rise of "Token Maxing" in Big Tech
A strange cultural phenomenon known as "token maxing" has emerged within large tech companies like Meta, Microsoft, and Salesforce. It stems from these companies measuring developers' usage of internal AI tools, often through public leaderboards or spend tracking. At Salesforce, for example, there's a tool to see how many dollars colleagues have spent on AI tokens, with some teams having a target minimum spend of around $175 per month.
This measurement, coupled with industry-wide job insecurity, has led engineers to feel pressured to increase their token count to avoid being perceived as low performers. This pressure results in counterproductive behaviors:
⦁ Artificial Usage: Engineers run autonomous agents to "build junk" or ask AI assistants to summarize documentation (even if the AI does a poor job) simply to generate tokens and stay out of the bottom percentile.
⦁ Weaponized Metrics: While token count is just one of many data points in performance reviews, it can be "weaponized." A low performer with a low token count is seen as "not even trying," while a high performer with a high token count is seen as an innovator.
This trend is reminiscent of earlier flawed metrics like "lines of code," but it's now being driven by top tech companies. It's a response to an initial push from leadership who, seeing the success of AI-native companies like Anthropic, wanted to force adoption among skeptical, experienced engineers. The situation at Coinbase was an extreme example, where the CEO mandated AI tool usage under the threat of termination.
Is AI Actually Making Engineers More Productive?
While individual productivity is certainly increasing, team-level productivity is more of a question mark. It's proven difficult to retrofit AI into established workflows. An interesting perspective is that the most significant productivity gain may not be for engineers themselves, but for their non-technical collaborators. By enabling them with coding agents, they no longer have to wait for an engineer, effectively creating "serverless developers" and unlocking organizational productivity.
Mastering these AI tools requires a new mindset:
⦁ Continuous Learning: There is no manual for AI tools. It takes a long time to get good, and workflows are constantly changing.
⦁ Practice Over Theory: Unlike traditional computer science, understanding the underlying theory of models doesn't necessarily make you a better user. Hands-on experience is key.
⦁ Open-Mindedness: Success requires a low-ego approach, being "open to leaving your priors behind" and experimenting.
The Changing Role of the Software Engineer
AI is accelerating a trend that was already underway: the expansion of the software engineer's role. The role has already absorbed responsibilities from dedicated tester and DevOps teams. Now, it's beginning to incorporate product skills, giving rise to the "product engineer."
Expectations for seniority and business awareness are increasing even for early-career engineers. As a result, teams are shrinking. A VP of Engineering at John Deere noted their "two-pizza teams" are becoming "one-pizza teams," a direct result of these new tools and expanded roles.
The idea that engineers are becoming "engineering managers for AI agents" is a flawed analogy. The role is more akin to a Tech Lead or someone operating a "mech suit" (an analogy from DHH). You orchestrate tasks and can do more, faster, without the people-related challenges of management—the drama, conflicts, and slow feedback loops. With agents, the feedback loop is immediate.
Big Tech's Massive Investment in Internal AI Infrastructure
Many large tech companies like Uber, Airbnb, and Meta are investing heavily in building bespoke internal AI infrastructure, even if it hasn't yet translated to a dramatic increase in external product features. They...
Full story
Building Generative Image & Video models at Scale - Sander Dieleman (Veo and Nano Banana)
Sander Dieleman from Google DeepMind provides a behind-the-scenes look at the key components of training large-scale diffusion models for audio-visual data. The talk covers the entire pipeline, from the critical role of data curation and latent representations to the mechanics of diffusion, network architectures, sampling with guidance, and advanced control signals.
Diffusion models have become the dominant paradigm for generating high-quality audio-visual data, differing from the auto-regressive models that are prevalent in language modeling. This is a comprehensive overview of the entire process, from data to sampling.
Data Curation and Representation
A critical, yet often underrated, aspect of training large-scale models is meticulous data curation. In contrast to academic research which incentivizes using standard benchmarks, real-world success depends on actively curating and cleaning the training data. Time spent improving the dataset is often a better investment than tweaking model architecture or optimizers.
Training on raw pixel data is infeasible for high-resolution or long-duration content due to immense memory requirements. The solution is to work in a compressed latent space.
⦁ Autoencoder-based Compression: A custom autoencoder is trained to compress data into a compact latent representation and then decode it back to pixel space. The diffusion model is then trained exclusively on these latents.
⦁ Preserving Structure: Unlike standard codecs (e.g., JPEG), these learned latents preserve the spatial grid structure of the original data, albeit at a much lower resolution (e.g., a 256x256 image might become a 32x32 latent grid). This is crucial because the neural network architectures have strong inductive biases that rely on this grid structure.
⦁ Efficiency: This approach can reduce data size by up to two orders of magnitude, making it possible to fit training examples in memory. The latents abstract away fine-grained local textures while preserving the core semantic content of the image or video.
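The size reduction is easy to verify with back-of-the-envelope arithmetic. The 256x256 → 32x32 spatial figures come from the example above; the channel counts (3 RGB channels in, 4 latent channels out) are assumed values in the spirit of common latent-diffusion setups:

```python
# Back-of-the-envelope size reduction from working in a latent space.
# Spatial sizes match the 256x256 -> 32x32 example above; channel
# counts (3 in, 4 out) are illustrative assumptions.
pixel_values = 256 * 256 * 3    # raw RGB image
latent_values = 32 * 32 * 4     # compact latent grid (same spatial layout)

ratio = pixel_values / latent_values
print(pixel_values, latent_values, round(ratio))  # 196608 4096 48
```

For video the factor applies per frame, so the savings compound over clip length; with more aggressive channel or temporal compression, the reduction can approach the two orders of magnitude mentioned above.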
The Diffusion Modeling Mechanism
Diffusion is an iterative refinement process. It works by first defining a forward (corruption) process, where Gaussian noise is gradually added to an image until all information is destroyed. The model then learns a reverse (denoising) process to remove that noise.
Generation starts with pure noise and iteratively refines it:
1. The denoiser model is given a noisy input x_t and predicts the original, clean image x_0.
2. Because noise removes information, this is an ill-posed problem. The model's prediction is effectively the average of all possible source images, resulting in a blurry output. This blurry prediction provides a direction for the next step.
3. A small step is taken in this predicted direction.
4. (Optional) A small amount of new noise is added back. This makes the process stochastic and more robust to the accumulation of the model's own errors.
5. This process is repeated. As noise is removed, the model has more information, its predictions become sharper, and the sample quality improves until a clean image is generated.
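The five steps above can be sketched as a toy stochastic sampler. The `denoise` stub stands in for the trained network, and the noise schedule and churn amount are arbitrary illustrative choices:

```python
import numpy as np

def denoise(x_t, sigma):
    # Stand-in for the trained denoiser, which would predict the clean
    # image x_0. Shrinking toward zero mimics the "blurry average" prediction.
    return x_t / (1.0 + sigma ** 2)

def sample(shape, n_steps=50, sigma_max=10.0, churn=0.1, seed=0):
    rng = np.random.default_rng(seed)
    sigmas = np.linspace(sigma_max, 0.0, n_steps + 1)
    x = sigmas[0] * rng.standard_normal(shape)       # start from pure noise
    for sigma, sigma_next in zip(sigmas[:-1], sigmas[1:]):
        x0_hat = denoise(x, sigma)                   # steps 1-2: blurry x_0 estimate
        d = (x - x0_hat) / sigma                     # direction toward the estimate
        x = x + (sigma_next - sigma) * d             # step 3: small step along it
        if sigma_next > 0:                           # step 4: re-inject a little noise
            x = x + churn * sigma_next * rng.standard_normal(shape)
    return x                                         # step 5: repeat until clean

img = sample((8, 8))
print(img.shape)  # (8, 8)
```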
From a frequency perspective, this process can be viewed as spectral auto-regression. The noise corruption process obscures high-frequency details first, followed by lower-frequency global structures. The reverse denoising process, therefore, generates the image from coarse to fine, starting with low frequencies (global layout) and progressively adding high-frequency details.
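This frequency ordering can be checked numerically: natural signals concentrate power at low frequencies (roughly a 1/f spectrum), while Gaussian noise is flat across frequencies, so the signal-to-noise ratio collapses at high frequencies first. A small 1D demonstration with a synthetic signal and illustrative constants:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 1024
freqs = np.fft.rfftfreq(n)[1:]          # skip the DC component

# Synthesize a "natural" 1D signal with a 1/f amplitude spectrum.
phases = rng.uniform(0, 2 * np.pi, freqs.size)
spectrum = (1.0 / freqs) * np.exp(1j * phases)
signal = np.fft.irfft(np.concatenate(([0], spectrum)), n=n)

noise = 0.05 * rng.standard_normal(n)   # white noise: flat power everywhere

def band_power(x, lo, hi):
    mag = np.abs(np.fft.rfft(x))[1:]
    band = (freqs >= lo) & (freqs < hi)
    return np.mean(mag[band] ** 2)

low_snr = band_power(signal, 0.0, 0.05) / band_power(noise, 0.0, 0.05)
high_snr = band_power(signal, 0.4, 0.5) / band_power(noise, 0.4, 0.5)
print(low_snr > high_snr)  # True: high frequencies drown in the noise first
```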
Network Architecture and Training
⦁ Backbones: While early models like Stable Diffusion used U-Net architectures, the field has largely shifted to Transformers. This allows leveraging the extensive knowledge and tooling developed for scaling Large Language Models (LLMs), using adaptations like bidirectional attention instead of causal masking.
⦁ Video Models: There are two main approaches for video:
1. Joint Diffusion: Treat the entire 3D video volume (space and time) as a single entity to be noised and denoised.
2. Hybrid Approach: Use auto-regression in the time dimension (generating frame-by-frame) while using diffusion to generate each individual frame. This is useful for applications like...
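The hybrid approach can be sketched as a simple loop: auto-regress over time, running a per-frame sampler conditioned on the frames generated so far. The `diffusion_sample_frame` stub below just blends the previous frame with fresh noise; a real system would run the full denoising loop there:

```python
import numpy as np

def diffusion_sample_frame(prev_frames, shape, rng):
    # Stub per-frame sampler. A real model would run iterative denoising
    # conditioned on prev_frames; blending the last frame with fresh noise
    # merely mimics temporally coherent generation.
    noise = rng.standard_normal(shape)
    if not prev_frames:
        return noise
    return 0.9 * prev_frames[-1] + 0.1 * noise

def generate_video(n_frames, frame_shape=(16, 16), seed=0):
    rng = np.random.default_rng(seed)
    frames = []
    for _ in range(n_frames):          # auto-regression over the time axis...
        frames.append(diffusion_sample_frame(frames, frame_shape, rng))
    return np.stack(frames)            # ...with (stubbed) diffusion per frame

video = generate_video(8)
print(video.shape)  # (8, 16, 16)
```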
Full story
Open Models at Google DeepMind — Cassidy Hardin, Google DeepMind
Cassidy Hardin from Google DeepMind introduces Gemma 4, a new family of open-weight models with significant architectural and performance improvements. This summary covers the four new models (31B Dense, 26B MoE, and two "Effective" on-device models), deep dives into architectural changes like mixed global/local attention and Per-Layer Embeddings (PLE), and details the new native multimodal capabilities for vision and audio.