Post-training best-in-class models in 2025
An expert overview of post-training techniques for language models, covering the entire workflow from data generation and curation to advanced algorithms like Supervised Fine-Tuning (SFT), Direct Preference Optimization (DPO), and Reinforcement Learning (RL), along with practical advice on evaluation and iteration.
Post-training is the crucial process of transforming a pre-trained base model, which can only perform token completion, into a sophisticated model capable of following instructions and answering questions. This process is an iterative cycle of data curation, training, and evaluation, essential for creating specialized and high-performing models.
Supervised Fine-Tuning (SFT): The Foundation
The first step in post-training is Supervised Fine-Tuning (SFT). This involves training the base model on a large, high-quality dataset of instruction-answer pairs, often exceeding one million samples for general-purpose models.
Data Quality and Structure
The quality of the SFT dataset is paramount. A good dataset must be:
⦁ Accurate: The answers must be factually correct.
⦁ Diverse: It should cover a wide range of topics and tasks.
⦁ Complex: The tasks should be challenging enough to facilitate model learning.
The typical data structure includes an optional system prompt, a user instruction, and the expected model output. During training, the loss is calculated only on the model's generated output, making the quality of the provided answers critically important. A common data generation pipeline involves using a powerful LLM to generate responses to seed prompts with specific constraints, followed by automated checks, filtering, and decontamination to prevent training on test data.
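To make the structure and loss masking concrete, here is a minimal sketch of one SFT record and how its labels might be masked, assuming a Hugging Face-style tokenizer; the field names and chat template are illustrative, not a prescribed format.

```python
# Illustrative SFT record and loss mask, assuming a Hugging Face-style tokenizer.
# The chat template and field names are invented for this sketch.

sample = {
    "system": "You are a helpful assistant.",      # optional system prompt
    "instruction": "Explain LoRA in one sentence.",
    "output": "LoRA freezes the base weights and trains small low-rank adapter matrices.",
}

def build_training_example(sample, tokenizer, ignore_index=-100):
    """Concatenate prompt and answer; mask prompt tokens so loss covers only the answer."""
    prompt = f"{sample['system']}\n\nUser: {sample['instruction']}\nAssistant: "
    answer = sample["output"] + tokenizer.eos_token

    prompt_ids = tokenizer(prompt, add_special_tokens=False)["input_ids"]
    answer_ids = tokenizer(answer, add_special_tokens=False)["input_ids"]

    input_ids = prompt_ids + answer_ids
    # Tokens labeled -100 are ignored by the cross-entropy loss,
    # so only the answer tokens contribute to the gradient.
    labels = [ignore_index] * len(prompt_ids) + answer_ids
    return {"input_ids": input_ids, "labels": labels}
```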
SFT Techniques and Parameters
⦁ Full Fine-Tuning: Updates all model parameters, maximizing potential quality but requiring significant computational resources.
⦁ Parameter-Efficient Fine-Tuning (PEFT):
⦁ LoRA (Low-Rank Adaptation): Freezes the base model's weights and introduces small, trainable matrices (adapters). This drastically reduces the number of trainable parameters (e.g., to 0.1%), saving VRAM and speeding up training.
⦁ QLoRA (Quantized LoRA): Further reduces memory requirements by loading a quantized (e.g., 4-bit) version of the model before applying LoRA. This is a trade-off, as it can lead to a degradation in quality compared to standard LoRA.
The most critical hyperparameter to tune is the learning rate. An excessively high learning rate can cause "loss spikes," leading to a collapse in model performance. Monitoring the training loss for a smooth, descending curve is a key indicator of a successful run.
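As a rough illustration of the PEFT options above, a LoRA fine-tune might be configured with the Hugging Face peft library as follows; the rank, target modules, and learning rate are placeholder values rather than recommendations.

```python
# Sketch of a LoRA setup with Hugging Face peft; values are illustrative placeholders.
from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model

model = AutoModelForCausalLM.from_pretrained("my-base-model")  # hypothetical checkpoint

lora_config = LoraConfig(
    r=16,                                  # rank of the adapter matrices
    lora_alpha=32,                         # scaling factor
    lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"],   # which projections receive adapters
    task_type="CAUSAL_LM",
)

model = get_peft_model(model, lora_config)
model.print_trainable_parameters()  # typically well under 1% of all parameters

# A conservative learning rate; too high a value is what produces the "loss spikes"
# mentioned above, so watch for a smooth, descending loss curve.
learning_rate = 1e-4
```

QLoRA would additionally load the base model in quantized form (e.g. 4-bit via bitsandbytes) before applying the same adapter configuration, trading some quality for memory.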
Preference Alignment with DPO
Direct Preference Optimization (DPO) is a powerful technique for aligning a model's behavior and style with human preferences. It moves beyond simple instruction-following to refine the nuances of the model's responses.
DPO uses a preference dataset composed of prompts, "chosen" (preferred) answers, and "rejected" (less preferred) answers. The training objective is contrastive: it increases the likelihood of the model generating responses similar to the chosen examples while decreasing the likelihood of generating those similar to the rejected ones.
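A single preference record therefore typically carries three fields; the example below is invented purely for illustration.

```python
# One illustrative DPO preference record; the content is invented for this sketch.
preference_example = {
    "prompt": "Summarize the main risks of deploying an unreviewed LLM agent.",
    "chosen": "Key risks include silent failures, confidently wrong answers, and missed "
              "security issues; mitigations include human review and automated verification.",
    "rejected": "There are no real risks; LLM agents can be deployed as-is.",
}
```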
A key hyperparameter in DPO is beta, which controls how closely the model must adhere to the reference model. A low beta allows for more exploration, while a high beta keeps the model's behavior constrained.
DPO is highly effective at creating models that humans prefer, as measured by metrics like the Chatbot Arena Elo score. However, it's important to note that human preference is often weakly correlated with performance on academic benchmarks for tasks like math or reasoning.
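The contrastive objective and the role of beta can be written compactly. The sketch below follows the standard DPO loss formulation, where each argument is a summed log-probability of a response under the trainable policy or the frozen reference model.

```python
import torch.nn.functional as F

def dpo_loss(policy_chosen_logps, policy_rejected_logps,
             ref_chosen_logps, ref_rejected_logps, beta=0.1):
    """Standard DPO objective on a batch of preference pairs.

    beta scales how strongly the policy is pushed away from (or held close to)
    the reference model.
    """
    chosen_logratio = policy_chosen_logps - ref_chosen_logps
    rejected_logratio = policy_rejected_logps - ref_rejected_logps
    logits = beta * (chosen_logratio - rejected_logratio)
    return -F.logsigmoid(logits).mean()
```

In practice, libraries such as TRL provide a trainer that implements this objective directly, so the loss rarely needs to be written by hand.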
Advanced Reasoning with Reinforcement Learning (RL)
For complex reasoning tasks like math and coding, Reinforcement Learning (RL) offers a powerful training paradigm. A popular approach, used for models like DeepSeek, involves a multi-stage process.
1. SFT Warm-up: The model is first fine-tuned on a specialized dataset where each answer is preceded by a "reasoning trace" or chain-of-thought. This teaches the model to structure its thinking process before providi...
Humanoid Robots: Hype vs. Reality
A deep dive into the key takeaways from CES 2026, covering the surge in humanoid robotics and the evolution of software-defined vehicles, followed by a nuanced analysis of the shifting US-China export controls on advanced AI chips.
Ben Lorica and Evangelos Simoudis of Synapse Partners analyze the major themes from CES 2026 and the shifting landscape of US-China AI chip export policies.
Key Takeaways from CES 2026
The Humanoid Robotics Explosion
CES 2026 was overwhelmingly a "humanoids conference," with over 30 companies, predominantly from China, showcasing their latest developments. However, a closer look reveals the industry is still in its early stages.
⦁ State of the Technology: Many demonstrations were teleoperated, and the robots are not yet ready for deployment in real-world environments. The term "autonomous" is applied loosely, as most units had human "minders" nearby.
⦁ The Dexterity Challenge: A significant hurdle remains the "magnificence of the hand." While robots may have up to 20 degrees of freedom, they lack the sophisticated sensing capabilities of human hands, limiting their dexterity and practical application.
⦁ Applications and Investment: There is a strong emphasis on humanoids within the broader category of adaptive robots, perhaps disproportionately so. Real-world applications remain largely conceptual, with discussions centered on manufacturing and logistics. The recent influx of venture capital into companies like Figure, Apptronik, and 1X seems driven by the "shiny object" appeal, popularized by Elon Musk's Optimus, rather than a deep analysis of market applications. The CEO of 1X admitted their robots are still mostly teleoperated and perform a very narrow range of tasks.
⦁ Market Impact: A notable development is Hyundai's plan to manufacture 20,000-30,000 Atlas robots over the next few years for its new factory in Georgia. This move signals a potential shift in manufacturing and raises questions about future employment, as automation could curtail the number of jobs created by these new industrial investments.
For a more grounded perspective, the speakers recommend Rodney Brooks' recent essay, which offers critical observations from a veteran in the field of robotics.
Software-Defined Vehicles (SDVs) and In-Cabin AI
The automotive sector at CES showcased a significant push towards software-centric vehicles and integrating AI into the user experience.
⦁ Defining the SDV: An SDV is a vehicle where hardware and software are cleanly separated, allowing for most of the vehicle's functionality to be defined and updated via software. This is exemplified by Tesla and Rivian.
⦁ Architectural Evolution: The industry is transitioning from domain-based architectures to more advanced zonal architectures. The key advantages of a zonal approach include using fewer, more powerful hardware components and enabling a much larger portion of the vehicle (upwards of 80%) to be updated over-the-air (OTA).
⦁ AI in the Cabin: The focus of automotive AI is expanding beyond autonomous driving to the in-cabin experience. Major automakers announced significant partnerships:
⦁ Mercedes-Benz: Collaborating with Google and Microsoft on in-cabin agents.
⦁ BMW: Partnering with Amazon for its "Alexa Plus" integration.
⦁ Personal Autonomy: There was a greater emphasis on Level 4 autonomous vehicles for personal use, not just for robotaxi fleets. Nvidia deepened its long-standing relationship with Mercedes, and Ford announced an ambitious plan to deliver a Level 3 autonomous vehicle by 2028 for around $30,000.
⦁ Open Source Initiative: In a potentially transformative move, 30 automakers and tier-one suppliers announced a consortium to develop open-source software for SDVs, which could accelerate innovation and standardization across the industry.
The Shifting AI Chip Export Controls
The conversation shifted to the recent relaxation of US export controls on AI chips to China, a policy announced by the Trump administration in late 2025. The policy itself is ambiguous, seemingly based on a Truth Social post rather than a formal do...
Structured Dissent Patterns for Agentic Production Reliability
This talk introduces 'structured dissent,' a multi-agent orchestration pattern where believer, skeptic, and neutral agents debate decisions to overcome the 'confidently wrong' failure mode of single-agent LLM systems, improving reliability for high-stakes…
The Problem: Single LLMs Fail Silently
Single-agent Large Language Model (LLM) systems present a significant challenge in production environments: they fail silently and are often "confidently wrong." When a single LLM misses a critical detail, such as a hard-coded key or a SQL injection vulnerability, it doesn't express uncertainty. Instead, it provides a definitive, and incorrect, answer. This behavior stems from several inherent limitations:
⦁ No Uncertainty Quantification: A single agent doesn't communicate its level of confidence. It presents every answer as 100% certain.
⦁ Lack of Alternative Viewpoints: The output is confined to the perspective of the single model being used, with no mechanism to introduce alternative or challenging viewpoints.
⦁ No Self-Correction: Without an external challenge, a single agent has no impetus to reconsider its conclusions, even if they are flawed. As the speaker notes, "if it misses it, it's not going to tell you."
Structured Dissent: A Multi-Agent Debate Swarm
To address these failures, a multi-agent orchestration pattern called Structured Dissent is proposed. The core idea is to create a "Think Tank"—a Socratic debate for AI—where agents with opposing viewpoints discuss and challenge decisions before reaching a consensus. This introduces nuance and a mechanism for adversarial verification.
The swarm is typically composed of three distinct agent personas:
⦁ Believers: The optimists. They are solution-focused, seeking opportunities and positive outcomes.
⦁ Skeptics: The paranoids. They focus on failure modes, risks, and hidden costs, effectively acting as a security team.
⦁ Neutrals: The facilitators. They work to prevent groupthink, synthesize the arguments from believers and skeptics, and build a balanced consensus.
The Three-Phase Debate Process
The system operates in a structured, multi-round debate. The default configuration uses five agents (two believers, two skeptics, one neutral) engaged in a three-phase process:
1. Phase 1: Parallel Analysis: Each agent independently analyzes the initial input (e.g., a security scan report) and forms its initial opinion based on its persona.
2. Phase 2: Adversarial Debate: The agents see each other's analyses and begin to argue. Skeptics challenge the believers' optimistic timelines by pointing out complexities, while believers might counter with potential solutions. This is "adversarial verification in real time," where the agents act as judges for each other's reasoning.
3. Phase 3: Synthesis and Reporting: After the debate rounds, the agents present their final conclusions. The neutral agent, acting as a "foreperson," synthesizes these into a final report.
The output is not a simple binary answer. It includes:
⦁ A majority opinion.
⦁ A confidence score indicating the swarm's certainty.
⦁ A summary of resolved and unresolved conflicts.
⦁ Key minority opinions, ensuring that dissenting views are not lost.
If the confidence score falls within a certain range (e.g., 50-75%), the system flags the issue for human review, acknowledging that it needs "an adult."
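A minimal sketch of this orchestration pattern is shown below; the ask_llm wrapper, persona prompts, and confidence parsing are hypothetical stand-ins for the real system, which is only described at a high level here.

```python
# Minimal sketch of the three-phase debate loop; `ask_llm` is a hypothetical wrapper
# around whatever chat API the swarm uses, and persona prompts are heavily abbreviated.
from dataclasses import dataclass, field

PERSONAS = {
    "believer_1": "You are an optimist. Look for solutions and opportunities.",
    "believer_2": "You are an optimist. Look for solutions and opportunities.",
    "skeptic_1": "You are paranoid. Focus on failure modes, risks, and hidden costs.",
    "skeptic_2": "You are paranoid. Focus on failure modes, risks, and hidden costs.",
    "neutral": "You are a facilitator. Prevent groupthink and build a balanced consensus.",
}

@dataclass
class Verdict:
    majority_opinion: str
    confidence: float            # 0.0 - 1.0
    minority_opinions: list = field(default_factory=list)
    needs_human_review: bool = False

def ask_llm(system_prompt: str, user_prompt: str) -> str:
    raise NotImplementedError("hypothetical call into the underlying model API")

def run_debate(report: str, rounds: int = 2) -> Verdict:
    # Phase 1: each persona analyzes the input independently.
    positions = {name: ask_llm(p, report) for name, p in PERSONAS.items()}

    # Phase 2: adversarial debate: every agent sees the others' arguments and responds.
    for _ in range(rounds):
        transcript = "\n\n".join(f"{n}: {a}" for n, a in positions.items())
        positions = {name: ask_llm(p, f"{report}\n\nOther agents said:\n{transcript}")
                     for name, p in PERSONAS.items()}

    # Phase 3: the neutral agent synthesizes a final report with a confidence estimate.
    summary = ask_llm(PERSONAS["neutral"],
                      "Synthesize a majority opinion, minority opinions, and a confidence "
                      "score (0-1) from:\n" + "\n\n".join(positions.values()))
    confidence = 0.6  # in practice, parsed from the synthesis; hard-coded here for brevity
    return Verdict(majority_opinion=summary, confidence=confidence,
                   needs_human_review=0.50 <= confidence <= 0.75)
```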
Use Case: MCP Server Security Analysis
The primary demonstration involves a security swarm built to analyze findings from open-source tools (like Bandit, Semgrep, Syft) on MCP (Model Context Protocol) servers.
⦁ Input: Reports from static analysis and dependency vulnerability scans (approx. 35,000 characters).
⦁ Process: The swarm debates the findings to assess the security posture of the MCP server.
⦁ Performance: A typical analysis takes 3-5 minutes and costs around $15 in API calls. This is a significant improvement over a manual security analyst review, which could take hours and cost thousands of dollars.
⦁ Output: The system generates an "executive appropriate" report (approx. 10,000 characters)...
How Ricursive Intelligence’s Founders are Using AI to Shape The Future of Chip Design
Anna Goldie and Azalia Mirhoseini of Ricursive Intelligence discuss how their work on Google's AlphaChip, which used AI to design TPUs, is now being extended to automate the entire chip design process. They explain their vision for a 'designless' industry…
The Asymmetric Design Cycle: AI's Compute Bottleneck
The fundamental bottleneck holding back AI progress is the asymmetric design cycle between AI models and the chips they run on. While new AI methods can be developed rapidly, designing and manufacturing the next generation of chips is a multi-year, multi-hundred-million-dollar process. This mismatch prevents the effective co-design and co-evolution of hardware, software, and AI workloads. The current paradigm often involves repurposing existing hardware, like GPUs originally designed for graphics, for AI tasks. While effective at matrix multiplication, these chips are not co-optimized for the specific neural network models being run. The vision is to dramatically shorten the chip design cycle, enabling a world where custom silicon can be created in tandem with new AI applications, bending the curve of the scaling laws that govern AI progress.
The Genesis: AlphaChip and the TPU Team
The journey began with the AlphaChip project at Google, which ultimately helped design four successive generations of Tensor Processing Units (TPUs). The project started by applying Reinforcement Learning (RL) to chip placement, a critical stage in the physical design process known as floorplanning.
The initial collaboration with Google's TPU team was met with significant skepticism. The research team, coming from an AI background, initially optimized for academic metrics like "half-perimeter wire length." The TPU engineers, however, were quick to point out that these metrics were irrelevant to them. They cared about a complex set of real-world constraints:
⦁ Routed wire length
⦁ Horizontal and vertical congestion
⦁ Timing violations
⦁ Power consumption
⦁ Area (power, performance, and area together form the standard "PPA" metrics)
To gain trust, the AlphaChip team adopted a highly iterative, customer-obsessed approach. They met with the TPU team weekly for years, showing them new data and working collaboratively to build cost functions that approximated the metrics the engineers truly valued. This deep partnership was crucial. For an engineer to choose an AI-generated layout over their own, they had to be convinced it was superior on every single metric they cared about, as they were ultimately responsible for the block's performance.
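The exact reward AlphaChip used is not given here; the toy function below only illustrates the idea of a weighted proxy cost built from the metrics the TPU engineers cared about, with made-up weights.

```python
# Toy proxy cost in the spirit described above; weights and metric values are
# illustrative placeholders, not AlphaChip's actual reward.
def placement_cost(wirelength: float, congestion: float, density: float,
                   w_wire: float = 1.0, w_cong: float = 0.5, w_dens: float = 0.5) -> float:
    """Lower is better: a weighted blend of proxy metrics an RL agent can optimize."""
    return w_wire * wirelength + w_cong * congestion + w_dens * density

# The RL agent's reward is then simply the negative cost of the finished placement.
reward = -placement_cost(wirelength=123.4, congestion=0.21, density=0.63)
```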
A New Paradigm for Chip Design
The technical approach for AlphaChip was fundamentally different from traditional Electronic Design Automation (EDA) methods. Instead of using classical combinatorial optimization solvers, the team trained an RL agent to place the millions of components on a chip.
⦁ Learning from Experience: The RL agent learns through trial and error, interacting with a simulated environment. It learns from both positive and negative placement examples, iteratively improving its strategy. This ability to learn from experience allows the model to self-improve, much like a human expert who gets better with each new design, but at a vastly greater scale.
⦁ Superhuman and Unconventional Designs: The AI began to produce layouts that were radically different from human-designed ones. As Anna Goldie noted, "We saw these very strange like curved placements... donut shapes as well." Humans tend to create highly regular, grid-like layouts. The AI, however, discovered that curved and non-uniform shapes could reduce wire length, thereby improving power consumption and timing, even if they appeared counter-intuitive and complex.
The project's success was validated when the first chip designed with AlphaChip's help was taped out and returned from the fab fully functional. With each subsequent TPU generation, the AI's layouts were adopted across more of the chip, and the performance delta between the AI's design and the human baseline grew, demonstrating AI's ability to scale with more data and experience.
Ricursive Intelligence: From Fabless to Designless
The success of AlphaChip inspired the founding of Ricursive Int...
Why Every Brain Metaphor in History Has Been Wrong [SPECIAL EDITION]
An exploration of scientific simplification, questioning the metaphors we use to understand the brain and intelligence. This summary delves into the tension between creating useful models and mistaking them for reality, featuring insights on the mind-as-software debate, the limits of prediction versus understanding, and the philosophical underpinnings of our quest for AGI.
Science operates by simplifying complex reality, but this necessary act raises a fundamental question: have we found a deep truth about the world, or are we mistaking our simplified model for the actual thing? This tension is embodied by the "spherical cow" joke in physics and is central to modern neuroscience and AI. As Professor Mazviita Chirimuuta explains in her book, The Brain Abstracted, we are limited creatures who must build models and leave things out. The critical disagreement, however, is what this success implies about reality itself.
This can be framed as a conflict between two perspectives:
⦁ Simplicius: Believes that science works because the universe is fundamentally simple and orderly underneath its apparent complexity. An elegant equation reflects reality.
⦁ Ignorantio: Argues that we simplify because we are cognitively limited. Our models are useful fictions—maps, not the territory—that work for our specific purposes, which doesn't prove that nature itself is simple.
Chirimuuta aligns with "learned ignorance" (docta ignorantia), the idea that true learning includes understanding the limits of what you know.
The Kaleidoscope Hypothesis: Is Reality Fundamentally Code?
Francois Chollet proposes the "kaleidoscope hypothesis," suggesting that beneath the messy surface of reality lies an intrinsic, underlying structure composed of simple, repeating "atoms of meaning." Much like a kaleidoscope creates infinite complexity from a few pieces of colored glass, the world is generated by the repetition and composition of these fundamental elements. Intelligence, in this view, is the process of mining experience to extract these abstractions.
Chirimuuta frames this not as a scientific certainty but as a philosophical bet, akin to Plato's theory of Forms. It's a wager that "real reality is neat, mathematical, and decomposable" beneath the complicated world of appearances.
The Ultimate Metaphor: Is the Mind Software?
The most pervasive simplification today is the idea that the mind is a computer running software. This has moved from a metaphor to what many consider a literal truth. Joscha Bach argues provocatively that this is not a metaphor at all: "Software is spirit." He posits that abstract patterns, like software or money, have real causal power, independent of their physical substrate. A program produces the same effects whether on a Mac, a PC, or potentially even neurons, because the causal power lies in the invariance of the pattern itself.
The counterargument is that this "sameness" is not inherent in nature but is imposed by a human observer. Physically, completely different things are happening inside different computer chips. The invariance exists only in our description. The causal power of money, for example, isn't in the paper or electrons but in the shared social agreements and interpretive practices of humans. The critique is that this view mistakes an elegant description for the fundamental structure of reality.
Historically, our metaphors for the brain have always tracked our most advanced technology:
⦁ Descartes: Hydraulic pumps in French royal gardens.
⦁ 19th Century: A telegraph network.
⦁ 20th Century: A telephone switchboard.
⦁ 21st Century: A digital computer.
As Jeff Beck bluntly states, "It will always be the case that our explanation for how the brain works will be by analogy to the most sophisticated technology that we have."
Ontology vs. Metaphysics: It Depends on Why You're Asking
Professor Luciano Floridi offers a framework to navigate this, distinguishing between metaphysics (reality as it is in itself, which is inaccessible) and ontology (the structure we impose on reality for a specific purpose). Our models of the world are not absolutely true or false; their value is relational.
Is it the same ship of Theseus? The question is a mistake. It provides no interface, what computer scientis...
"We Made a Dream Machine That Runs on Your Gaming PC"
Shahbuland Matiana and Andrew Lapp from Overworld Labs introduce Waypoint 1, a 2 billion-parameter open-source world simulation model designed to run on consumer hardware at 60 FPS. They discuss its novel architecture, which combines a causal language model with an image diffusion model to denoise frames in real-time based on user prompts and controller inputs, emphasizing low-latency interaction and the importance of local execution for user privacy.
Overworld Labs has introduced Waypoint 1, a 2 billion-parameter world simulation model designed to run efficiently on consumer hardware. Unlike large-scale projects like Google's Genie, which rely on massive cloud infrastructure, Waypoint 1 is optimized for local execution on gaming PCs (e.g., NVIDIA 3070s, 4090s) and soon, Apple Silicon. The model, whose weights are being open-sourced, is capable of generating interactive, explorable worlds from text or image prompts at 60 frames per second.
The Vision: Sharable Lucid Dreams
The core motivation behind Overworld is to create a way to record and share the kinds of immersive, dynamic experiences found in dreams. Co-founder Shahbuland Matiana described a personal lucid dream that modern game engines cannot replicate:
"I was in this like house floating in space and there was a giant like dragon circling the the house... I draw a katana from my like waist and I parry the dragon's teeth as it goes try to bite me. I feel a clang reverberate through my whole body. The floorboards crack beneath my feet. The window shatter around me."
The goal of Waypoint 1 is to enable the creation of such fully immersive experiences where the world bends and reacts to the user's actions, and then allow those experiences to be shared with others. This technology aims to be a "killer application" for AI, moving beyond static video generation into truly interactive entertainment.
Technical Architecture: A Real-Time Diffusion Transformer
Waypoint 1's architecture is a novel hybrid of a causal language model and an image diffusion model, optimized for real-time interaction.
1. Image Compression: The process begins with an autoencoder that compresses video frames (e.g., 360p) into a much smaller latent representation, such as a 32x32 grid. The model operates entirely in this compressed latent space, not on raw pixels.
2. Frame Generation: The core of the system is a transformer model. However, instead of autoregressively predicting the next token like a standard LLM, it denoises the next 256 tokens (representing one full frame) in a single forward pass.
3. Conditioning: Each frame is generated conditioned on a history of preceding frames, a text prompt, and controller inputs from the last 1/60th of a second. This conditioning is managed through cross-attention mechanisms within the transformer blocks.
4. Low Latency: To ensure playability and responsiveness, the model generates only one frame at a time. This is a key distinction from many video diffusion models that use temporal autoencoders to compress multiple frames together, which saves computation but introduces significant input lag (e.g., only accepting input every 4th frame).
Optimization and Distillation
Achieving 60 FPS on consumer hardware requires significant optimization. The team uses a four-step rectified flow model with an Euler sampler. In this process, the model starts with random noise and, over four steps, predicts the vector that moves the latent representation closer to the "clean," ideal frame.
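As a sketch of what such a sampler could look like for a single frame, the code below runs four Euler steps of a rectified-flow model conditioned on frame history, a prompt embedding, and controller input; the velocity_model interface and tensor shapes are assumptions, not Waypoint 1's actual API.

```python
import torch

def generate_next_frame(velocity_model, history_latents, prompt_emb, controls,
                        latent_shape=(1, 16, 32, 32), num_steps=4):
    """Single-frame sampling with a plain Euler integrator over a rectified flow.

    velocity_model is assumed to predict the velocity that moves the noisy latent
    toward the clean frame, conditioned on frame history, prompt, and controller input.
    """
    x = torch.randn(latent_shape)               # start from pure noise
    t = torch.zeros(latent_shape[0])            # flow "time" in [0, 1]
    dt = 1.0 / num_steps
    for _ in range(num_steps):
        v = velocity_model(x, t, history_latents, prompt_emb, controls)
        x = x + dt * v                          # Euler step along the predicted flow
        t = t + dt
    return x                                    # clean latent; decode to pixels downstream
```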
A key insight is that reducing the number of diffusion steps primarily sacrifices diversity, not quality. For an autoregressive model like Waypoint 1, this is an acceptable trade-off. The strong conditioning from previous frames and user input already constrains the output, so the inherent diversity from a high-step diffusion process is less critical.
This speed is further enhanced by diffusion distillation (e.g., using methods like Distribution Matching Distillation or DMD), where a "student" model is trained to replicate the output of a larger model in fewer steps. This process effectively "bakes in" parameters like the classifier-free guidance (CFG) scale, which avoids the need for multiple forward passes during inference and dramatically speeds up generation.
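To make the "baked-in CFG" point concrete: standard classifier-free guidance needs two forward passes per denoising step, while a distilled student needs one. The snippet below is a generic illustration with a hypothetical velocity model, not Overworld's implementation.

```python
# Classifier-free guidance costs two forward passes per denoising step ...
def guided_velocity(velocity_model, x, t, cond, cfg_scale=3.0):
    v_uncond = velocity_model(x, t, cond=None)   # unconditional pass
    v_cond = velocity_model(x, t, cond=cond)     # conditional pass
    return v_uncond + cfg_scale * (v_cond - v_uncond)

# ... while a distilled student (e.g. via DMD) has the guidance behaviour baked into
# its weights, so one forward pass per step suffices at inference time.
def distilled_velocity(student_model, x, t, cond):
    return student_model(x, t, cond=cond)
```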
Privacy and the Future
The team strongly advocates for ...
This Startup Beat Gemini 3 on ARC-AGI — at Half the Cost
Poetic, a startup by ex-DeepMind researchers, has significantly advanced performance on the ARC-AGI benchmark by applying a recursive self-improvement system to Gemini 3. Co-founder Ian Fisher discusses how their approach of automating prompt and system engineering provides a substantial performance boost without needing access to model weights, and explores its potential as a path toward AGI.
Poetic, a new startup founded by former DeepMind researchers, has achieved a significant breakthrough on the ARC-AGI benchmark. By layering their proprietary system on top of Gemini 3, they achieved a 54% score on the private test set, a substantial leap from Gemini 3's baseline of approximately 33% and even surpassing the more advanced Gemini 3 Deep Think's 45% at half the cost.
The Core Technology: Recursive Self-Improvement
The central idea behind Poetic's success is a form of recursive self-improvement (RSI), which co-founder Ian Fisher describes as "the holy grail of AI." The goal is to create a system where the AI actively makes itself smarter.
Unlike methods that require fine-tuning or access to model weights, Poetic's approach operates purely at the system and prompt level. This is a crucial advantage when working with closed-source models available only through APIs. The methodology involves:
⦁ Ensemble Methods: The system calls the underlying model (e.g., Gemini 3) multiple times.
⦁ Independent Refinement: Each member of the ensemble works independently to refine its own answer.
⦁ Advanced Voting Schemes: The refined answers are combined using a sophisticated voting mechanism to produce a final, more accurate solution.
This system-level optimization is what differentiates Poetic from other prompt engineering frameworks like DSPy; it incorporates what Fisher refers to as "trade secret insights" that yield a significant performance difference. The entire ARC-AGI solver was an output of their system, which was trained on ARC-1 and then applied to ARC-2 without any specific training on the latter.
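Since the actual prompts and voting scheme are proprietary, the sketch below shows only the generic ensemble, independent-refinement, and voting skeleton described above, with a hypothetical ask_model wrapper around the underlying API.

```python
# Generic ensemble -> independent refinement -> vote pattern described above;
# Poetic's actual prompts and voting scheme are proprietary, so this is only the skeleton.
from collections import Counter

def ask_model(prompt: str) -> str:
    raise NotImplementedError("hypothetical call to the underlying model, e.g. Gemini 3")

def solve_with_ensemble(task: str, n_members: int = 5, refine_rounds: int = 2) -> str:
    # Each ensemble member drafts its own answer.
    answers = [ask_model(f"Solve this task:\n{task}") for _ in range(n_members)]

    # Each member refines its own answer independently (no cross-talk).
    for _ in range(refine_rounds):
        answers = [ask_model(f"Task:\n{task}\n\nYour previous answer:\n{a}\n\n"
                             "Improve it if you can, otherwise repeat it.")
                   for a in answers]

    # Simple majority vote; the real system reportedly uses a more advanced scheme.
    winner, _count = Counter(answers).most_common(1)[0]
    return winner
```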
The Gemini 3 Catalyst
The release of Gemini 3 was a pivotal moment. While Poetic's system showed promising results on ARC-1 with other models (reaching 89%), switching to Gemini 3 pushed their performance to 95%. When they applied this new combination to the more challenging ARC-2, they had a "holy cow moment" as the performance jumped to the state-of-the-art 54%.
Fisher attributes this leap to Gemini 3's exceptional ability to generate code for visual problem-solving, a capability that surpassed previous models. He also notes that other powerful models like Anthropic's Opus can be swapped in for Gemini 3 to achieve similar results, albeit at a higher cost.
A Path to AGI and Practical Applications
Fisher views RSI as both a practical tool for immediate performance gains and a credible path toward AGI.
⦁ Immediate Value: The performance "bump" from Poetic's system can be highly valuable. On the ARC-AGI benchmark, which allows for two solution submissions, their method provided a single, higher-quality solution that outperformed the underlying model's two submissions, sometimes at a lower overall cost.
⦁ Long-Term Vision: While not the only path, Fisher believes RSI is "the most exciting path to AGI and beyond." The process on ARC-AGI was stopped manually due to cost constraints, suggesting that with more resources, the performance could have "hill-climbed" even further.
Automating the Prompt Engineer
The broader vision for Poetic is to automate the complex and often manual process of prompt engineering and agent creation. Fisher draws an analogy to the evolution of deep learning, which automated the manual process of feature engineering.
"We are quite intentionally automating ourselves, automating prompt engineers, automating people who are building agents. It's a power tool."
He contrasts their previous manual work at DeepMind—akin to building a car by hand—with Poetic's technology, which is like "building a factory to build cars." The goal is to create a system that automatically discovers the optimal prompts and system configurations, removing the human from the tedious trial-and-error loop. While continuing their research and targeting other high-impact benchmarks, the six-person team is now also focusing on bringing t...
Full story
The Core Technology: Recursive Self-Improvement
The central idea behind Poetic's success is a form of recursive self-improvement (RSI), which co-founder Ian Fisher describes as "the holy grail of AI." The goal is to create a system where the AI actively makes itself smarter.
Unlike methods that require fine-tuning or access to model weights, Poetic's approach operates purely at the system and prompt level. This is a crucial advantage when working with closed-source models available only through APIs. The methodology involves:
⦁ Ensemble Methods: The system calls the underlying model (e.g., Gemini 3) multiple times.
⦁ Independent Refinement: Each member of the ensemble works independently to refine its own answer.
⦁ Advanced Voting Schemes: The refined answers are combined using a sophisticated voting mechanism to produce a final, more accurate solution.
This system-level optimization is what differentiates Poetic from other prompt engineering frameworks like DSPy, containing what Fisher refers to as "trade secret insights" that yield a significant performance difference. The entire ARC-AGI solver was an output of their system, which was trained on ARC-1 and then applied to ARC-2 without any specific training on the latter.
The Gemini 3 Catalyst
The release of Gemini 3 was a pivotal moment. While Poetic's system showed promising results on ARC-1 with other models (reaching 89%), switching to Gemini 3 pushed their performance to 95%. When they applied this new combination to the more challenging ARC-2, they had a "holy cow moment" as the performance jumped to the state-of-the-art 54%.
Fisher attributes this leap to Gemini 3's exceptional ability to generate code for visual problem-solving, a capability that surpassed previous models. He also notes that other powerful models like Anthropic's Opus can be swapped in for Gemini 3 to achieve similar results, albeit at a higher cost.
A Path to AGI and Practical Applications
Fisher views RSI as both a practical tool for immediate performance gains and a credible path toward AGI.
⦁ Immediate Value: The performance "bump" from Poetic's system can be highly valuable. On the ARC-AGI benchmark, which allows for two solution submissions, their method provided a single, higher-quality solution that outperformed the underlying model's two submissions, sometimes at a lower overall cost.
⦁ Long-Term Vision: While not the only path, Fisher believes RSI is "the most exciting path to AGI and beyond." The process on ARC-AGI was stopped manually due to cost constraints, suggesting that with more resources, the performance could have "hill-climbed" even further.
Automating the Prompt Engineer
The broader vision for Poetic is to automate the complex and often manual process of prompt engineering and agent creation. Fisher draws an analogy to the evolution of deep learning, which automated the manual process of feature engineering.
"We are quite intentionally automating ourselves, automating prompt engineers, automating people who are building agents. It's a power tool."
He contrasts their previous manual work at DeepMind—akin to building a car by hand—with Poetic's technology, which is like "building a factory to build cars." The goal is to create a system that automatically discovers the optimal prompts and system configurations, removing the human from the tedious trial-and-error loop. While continuing their research and targeting other high-impact benchmarks, the six-person team is now also focusing on bringing t...
She Raised $64M to Build an AI Math Prodigy | Carina Hong, CEO of Axiom
Carina Hong, Founder & CEO of Axiom, discusses building a self-improving AI reasoning engine that combines generation and verification. Starting with formal mathematics, Axiom's system has achieved superhuman results on the notoriously difficult Putnam Exam by leveraging formal languages like Lean to overcome the probabilistic and unverifiable nature of standard LLMs. Hong explores how this technology can solve major bottlenecks in hardware and software verification, code migration, and database consistency, and what it means for the future of mathematical research.
Axiom's mission is to build a self-improving reasoning engine that uniquely combines generation and verification, an often-overlooked component in the current AI landscape. The company starts with an "AI mathematician" as a testing ground for this self-improvement loop, using formal languages like Lean to ground its natural language capabilities.
The Architecture of a Reasoning Engine
Axiom's system is built on three core components that interact with each other:
⦁ Prover: A system that can prove theorems.
⦁ Conjecturer: A system that proposes interesting and novel conjectures.
⦁ Knowledge Base: A database of what has already been proven, which both the prover and conjecturer can reference.
Tying these components together is auto-formalization, the process of converting natural language mathematics into a formal language. This is a core technology for Axiom, viewed as being as challenging and important as theorem proving itself.
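To make the formal-language target concrete, here is what a fully auto-formalized statement can look like in Lean 4: the informal claim "addition of natural numbers is commutative" rendered as a machine-checkable theorem (a toy illustration, not Axiom's output):

```lean
-- Informal: "for any natural numbers a and b, a + b equals b + a"
theorem add_comm_example (a b : Nat) : a + b = b + a :=
  Nat.add_comm a b
```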
Superhuman Performance on the Putnam Exam
Axiom's prover has demonstrated remarkable capabilities on the Putnam Mathematical Competition, an infamously difficult exam for undergraduates where the median score is often zero.
⦁ Axiom's system solved 8 out of 12 problems within the official time limit, a score that would place it in the top five (Putnam Fellow). A ninth problem was solved shortly after.
⦁ This performance significantly surpasses that of Axiom's founder, Carina Hong, who scored 4 out of 12.
⦁ This success showcases the power of combining deterministic, formal tools with probabilistic systems. Formal systems cannot "hand-wave" through difficult steps, forcing a level of rigor that informal LLMs lack. For instance, the AI prover might spend significant effort generating detailed code to rigorously prove convergence or limits, something a human might take for granted.
AI vs. Human Problem-Solving
While LLMs can seem impressive on some math problems, they often fail on seemingly simpler brain teasers because they lack true reasoning and verification. They generate solutions statistically without a guarantee of soundness.
⦁ Formal Verification's Role: Axiom's use of formal languages like Lean ensures that a proof is sound. Unlike a natural language proof from an LLM, which can have subtle flaws that are hard to spot, a Lean proof is machine-verifiable.
⦁ Interpretability: While the AI may generate proofs that are structured differently from human proofs, they are ultimately interpretable. The formal code of each step can be inspected and converted back to natural language, a significantly easier task than the initial formalization. The AI may find solutions that are convergent with what a human would find, acting like a collaborator with a different style, akin to the discovery of a self-taught genius like Ramanujan.
Applications Beyond Pure Mathematics
The core technology of generation paired with verification has profound implications for high-stakes commercial applications where correctness is critical. Formal verification is a major bottleneck in many industries, often consuming years of effort.
⦁ Hardware and Software Verification: In chip design, verification teams can be three to four times larger than design teams, with verification cycles taking years. AI-powered formal verification can dramatically reduce this time and lower the expertise required. AWS, for example, took five years to manually formalize just one component of its hypervisor.
⦁ Code Migration and Equivalence: When upgrading legacy systems, it's crucial to ensure the new code is perfectly equivalent to the old code. Formal methods can prove this equivalence, preventing regressions in critical business functions (a toy sketch follows this list).
⦁ Database Consistency: Formal verification can be used to prove the consistency of database protocols, such as solving the Byzantine Generals Problem, ensuring reliability even in the presence of bad act...
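The code-equivalence idea above can be pictured in miniature: prove once, for every possible input, that the migrated implementation agrees with the legacy one. A toy Lean 4 example (hypothetical functions, assuming a recent toolchain where the omega tactic is available):

```lean
def legacy_double (n : Nat) : Nat := n + n   -- old implementation
def new_double (n : Nat) : Nat := 2 * n      -- migrated implementation

-- The migration is safe: both functions agree on every input.
theorem double_equiv (n : Nat) : legacy_double n = new_double n := by
  unfold legacy_double new_double
  omega
```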
Inference at Scale: Breaking the Memory Wall
Sid Sheth, CEO of d-matrix, details their memory-centric approach to AI inference hardware, focusing on their Digital In-Memory Compute (DIMC) architecture. He explains how DIMC, an augmented SRAM technology, minimizes data movement to solve the memory bottleneck, delivering significant gains in latency and energy efficiency, particularly for the 'decode' phase of large language models.
The Bet on Cloud Inference and Memory-Centric Design
Founded in 2019, before the rise of ChatGPT, d-matrix made a contrarian bet on data center and cloud inference. While many startups focused on edge computing or the highly competitive training market dominated by NVIDIA, d-matrix identified a gap for a dedicated, efficient inference solution in the cloud.
The founding team anticipated that AI models, particularly transformers like BERT and the emerging GPT-3, would continue to grow in size, making memory access the primary bottleneck. Their first-principles analysis of the inference workload revealed it to be a repetitive, parallel compute problem heavily dependent on memory access. This led to their core strategy: integrating memory and compute as closely as possible to build a fundamentally more efficient architecture.
The Memory Bottleneck: HBM vs. SRAM
The choice of memory technology is critical for AI hardware. Sid Sheth provides a clear breakdown of the trade-offs:
⦁ High-Bandwidth Memory (HBM): Originally developed for High-Performance Computing (HPC) and later adopted for AI training, HBM acts like a "highway with many lanes," providing high-bandwidth access to a processor. While effective for the massive, parallel data needs of training, HBM is a poor fit for mainstream inference due to three key factors:
⦁ Cost: It remains an expensive technology.
⦁ Energy: It is very power-hungry.
⦁ Bandwidth Limits: The pace of AI model growth is outstripping HBM's ability to scale its bandwidth, making it "not fast anymore" for cutting-edge inference needs.
⦁ SRAM (Static RAM): d-matrix, along with other early players like Groq and Cerebras, initially focused on SRAM for its speed. However, on-chip SRAM capacity is limited. Recognizing that models would quickly outgrow a single chip, d-matrix designed its system with a two-tiered memory approach from the start, using a large on-chip SRAM tier and a second, larger LPDDR memory tier to accommodate extremely large models and the exploding KV-cache sizes associated with long contexts.
Prefill vs. Decode: The Two Phases of Generative Inference
Generative AI models operate in two distinct phases, which have different hardware requirements:
1. Prefill (The "Thinking" Phase): When a model receives a prompt, it processes the input and generates the internal context (KV cache). This phase is compute-intensive.
2. Decode (The "Speaking" Phase): The model then generates the response token by token. Each new token requires accessing the entire KV cache. This phase is memory-intensive and highly sensitive to latency. A slow decode phase results in a poor user experience, with long delays between words.
d-matrix's architecture is particularly well-suited for accelerating the memory-bound decode phase, where low latency is paramount.
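A toy trace makes the asymmetry concrete: prefill touches the prompt once in a batched pass, while every decode step must re-read the entire, ever-growing KV cache to emit a single token (illustrative Python, with byte counts standing in for the real tensor math):

```python
def generate(prompt_tokens: list[int], max_new_tokens: int, d_model: int = 4096) -> None:
    # Prefill: one compute-heavy pass over the full prompt builds the KV cache.
    kv_entries = len(prompt_tokens)          # one cached (K, V) pair per token, per layer
    print(f"prefill: processed {kv_entries} prompt tokens in a single batched pass")

    # Decode: each new token must read everything cached so far.
    for step in range(max_new_tokens):
        bytes_read = kv_entries * d_model * 2 * 2   # K and V vectors, ~2 bytes each in fp16
        print(f"decode step {step}: ~{bytes_read:,} bytes of KV cache read per layer")
        kv_entries += 1                      # the new token's K/V are appended to the cache

generate(prompt_tokens=list(range(1024)), max_new_tokens=4)
```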
Digital In-Memory Compute (DIMC): The Core Innovation
d-matrix's key technology is Digital In-Memory Compute (DIMC). It's a novel architecture that turns memory itself into a compute fabric.
⦁ How it Works: A traditional SRAM cell uses six transistors (6T) to store one bit of data. d-matrix augmented this design, creating a ten-transistor (10T) cell that can both store a bit and perform a single-bit multiplication.
⦁ The Benefit: By embedding compute directly within the memory array, model parameters (weights) can be stored and used for matrix math calculations without being moved. This minimization of data movement is the key to efficiency. It saves a tremendous amount of time and energy, directly addressing the three most precious resources: money, time, and energy.
This approach allows all rows of the SRAM to be activated simultaneously, creating a dataflow engine with much higher throughput than a traditional SRAM.
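As a mental model only (a software caricature, not d-matrix's circuit), the sketch below emulates what a 10T array enables: each stored weight bit performs a one-bit multiply (an AND) against a broadcast input bit in place, all rows are summed at once, and shift-and-add recombines the partial products into a full dot product:

```python
import random

BITS = 4  # toy 4-bit unsigned weights and activations

def dot_in_memory(weights: list[int], inputs: list[int]) -> int:
    acc = 0
    for wb in range(BITS):          # bit-plane of the weights stored in the array
        for xb in range(BITS):      # bit of the activation broadcast to every row
            # Every "cell" multiplies its stored bit by the broadcast bit (an AND);
            # the column sum over all rows is produced in one shot.
            column_sum = sum(((w >> wb) & 1) & ((x >> xb) & 1)
                             for w, x in zip(weights, inputs))
            acc += column_sum << (wb + xb)   # shift-and-add recombination
    return acc

weights = [random.randrange(2 ** BITS) for _ in range(8)]
inputs = [random.randrange(2 ** BITS) for _ in range(8)]
assert dot_in_memory(weights, inputs) == sum(w * x for w, x in zip(weights, inputs))
```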
System, Scale, and Performance Trade-offs
The d-matrix solution is bu...
Boris Cherny: How We Built Claude Code
Boris Cherny, creator of Claude Code, shares the development philosophy behind the AI coding tool, emphasizing building for future models, leveraging latent user demand, and the surprising longevity of the terminal interface.
A Philosophy of Building for the Future
The core development principle at Anthropic, and for Claude Code specifically, is to not build for the model of today, but for the model that will exist in six months. This forward-looking approach anticipates the rapid, exponential improvement in model capabilities. Builders are advised to identify the current frontiers where a model is weak, with the confidence that it will become proficient in those areas over time.
This philosophy is heavily influenced by Rich Sutton's "The Bitter Lesson," which posits that general models that leverage computation will ultimately outperform more specialized, human-designed systems. Consequently, the Claude Code team is cautious about building what they call "scaffolding"—product features or code that compensates for a model's current shortcomings. This scaffolding often provides a temporary 10-20% performance gain but is rendered obsolete by the next model iteration.
"Never bet against the model... We could also just wait like a couple of months and the model can probably just do the thing instead."
This results in a dynamic and ephemeral codebase. Virtually no part of Claude Code that existed six months ago is still in the product today. The entire application is constantly being written, rewritten, and refactored as model capabilities advance, with tools and features being added and removed every couple of weeks.
The Accidental Genius of the Terminal
Claude Code's existence as a command-line interface (CLI) was not a grand design but an accident. It began as a simple terminal-based chat application built by Boris Cherny to familiarize himself with the Anthropic API. The initial goal was simply to explore what a coding product could be.
The "aha!" moment came when the model was given a
bash tool. When asked, "What music am I listening to?" the model, Sonnet 3.5 at the time, independently wrote and executed AppleScript to query the user's music player. This demonstrated an innate desire to use tools and interact with the world, which became a foundational insight for the product's direction.The terminal, chosen for its simplicity and lack of UI overhead, proved to be a surprisingly effective and enduring form factor. Its constraints fostered an elegant and powerful developer experience that resonated deeply with engineers, leading to rapid, viral adoption within Anthropic long before its public release.
Features Born from Latent Demand
A key product principle is to identify and serve "latent demand"—making it easier for users to do what they are already trying to do. Many of Claude Code's core features originated from observing user workarounds and desires.
CLAUDE.md
The concept for CLAUDE.md emerged when developers were observed writing their own markdown files with instructions and context, which they would then feed to the model. This behavior was formalized into a feature that allows teams to maintain a shared set of instructions and context checked into their codebase. The advice for maintaining these files is to be minimal; if a CLAUDE.md becomes too long or complex, it's often best to delete it and start fresh, adding instructions back only as needed, as newer models require less guidance.
Plan Mode
Plan Mode was created in a 30-minute coding session on a Sunday night in response to observing users explicitly asking the model to "plan this out but don't write any code yet." The implementation is deceptively simple: it just adds a single sentence to the prompt, "please don't code." While currently a heavily used feature to ensure the model is on the right track before execution, Boris predicts it may have a limited lifespan as models become capable enough to generate and execute a correct plan from a single prompt.
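A minimal sketch of just how thin that implementation is (a hypothetical helper, not Claude Code's actual source; the real product wires this into its own system prompt and permission model):

```python
def build_prompt(user_request: str, plan_mode: bool = False) -> str:
    """Construct the text sent to the model, optionally in 'plan mode'."""
    prompt = user_request
    if plan_mode:
        # Per Cherny, the entire feature boils down to one extra sentence.
        prompt += "\n\nPlease don't code."
    return prompt

print(build_prompt("Add pagination to the /users endpoint", plan_mode=True))
```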
From Solo Agent to Agent Swarms
The architecture of work is evolving from single-agent interactions to multi-agent collaboration. Claude Code heavil...
The Laws of Thought: The Math of Minds and Machines, with Prof. Tom Griffiths
Princeton Professor Tom Griffiths discusses his book "The Laws of Thought," exploring the mathematical models that govern both biological and artificial intelligence. He details the fundamental differences between human and machine cognition, rooted in their vastly different constraints, and explains how concepts like inductive bias, probability, and curiosity can bridge the gap between cognitive science and modern AI.
Professor Tom Griffiths of Princeton University explores the mathematical principles that form the foundation of both human and artificial intelligence, bridging the gap between two contrasting views of the human mind. While psychologists often highlight human irrationality and biases, computer scientists see human cognition as an inspiration for AI. Griffiths' work seeks to reconcile these perspectives by framing human intelligence as a rational adaptation to significant constraints.
The Laws of Thought: A Mathematical Theory of Mind
The core idea of Griffiths' book, The Laws of Thought, is that just as mathematical laws of nature describe the external, physical world, a complementary set of mathematical principles can describe our internal, mental world.
⦁ From Behaviorism to Cognitive Science: Early psychology struggled to scientifically study internal thoughts, leading to the rise of behaviorism, which focused only on observable behaviors. The cognitive revolution was made possible by the development of computers and mathematical concepts like logic and probability, which provided a new, rigorous language to form and test hypotheses about the mind.
⦁ Research Methodology: Modern cognitive science research often involves large-scale online experiments. In Griffiths' lab, participants are presented with problems that require them to make inferences or decisions from data. By analyzing the responses from thousands of participants using modern machine learning tools like neural networks, researchers can develop and refine computational models of human cognition.
Human vs. AI: A Tale of Two Intelligences
A key distinction between human and artificial intelligence lies in the constraints they operate under. Humans are limited by time (a finite lifespan), computation (a few pounds of neural tissue), and communication bandwidth. In contrast, AI systems can be scaled with more data and compute and can transfer information perfectly. This leads to fundamentally different problem-solving approaches.
⦁ Inductive Bias and The Data Gap: A human child learns a language in about five years, whereas an LLM requires the equivalent of thousands of years of text data. This vast difference highlights the powerful inductive biases, or priors, built into human cognition. These biases provide a starting framework that makes learning from sparse data possible (a toy numerical illustration follows this list).
⦁ The Machine Learning Paradigm: Since the success of AlexNet in 2012, the dominant paradigm in machine learning has been one of weak inductive biases and massive datasets. The philosophy is that with enough data, a sufficiently complex model can learn the necessary features and solutions without human-engineered priors. This is the opposite of the human approach.
⦁ Engineering Inductive Bias: To create more human-like AI, we may need to engineer these biases. Meta-learning is one such technique, where a model learns an optimal set of initial weights by being trained on a wide variety of tasks. This provides a "soft bias" that guides the model toward effective solutions without rigidly constraining it, making it better at few-shot learning.
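A tiny numerical illustration of the sparse-data point above (standard Bayesian coin-flip estimation, not taken from Griffiths' experiments): after observing only three flips, all heads, a learner with a strong "coins are roughly fair" prior stays near 0.5, while a purely data-driven estimate jumps to certainty.

```python
def posterior_mean_heads(heads: int, tails: int, prior_a: float = 10.0, prior_b: float = 10.0) -> float:
    """Posterior mean under a Beta(prior_a, prior_b) prior updated with observed flips.
    Beta(10, 10) encodes a fairly strong prior belief that coins are roughly fair."""
    return (prior_a + heads) / (prior_a + prior_b + heads + tails)

heads, tails = 3, 0
mle = heads / (heads + tails)                 # no prior: estimates P(heads) = 1.0
bayes = posterior_mean_heads(heads, tails)    # strong prior: estimates P(heads) ~= 0.57
print(f"Maximum-likelihood estimate: {mle:.2f}")
print(f"Posterior mean with fairness prior: {bayes:.2f}")
```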
Deconstructing Large Language Models
Griffiths' research provides a scientific lens for understanding the behavior of LLMs.
⦁ Deductive vs. Inductive Problems: Early symbolic AI excelled at deductive problems (e.g., logic, chess), where all necessary information is provided. However, it struggled with inductive problems—the cornerstone of human intelligence—where conclusions must be drawn from incomplete information. Probability theory, particularly Bayes' rule, provides the mathematical framework for induction (written out after this list).
⦁ "Embers of Autoregression": LLMs are trained to predict the next token in a sequence, which makes them highly sensitive to the statistical patterns in their training data. This can lead to counter-intuitive behavior. For example, an LLM might be mo...
How A Team Of 7 Keeps Breaking AI Benchmark Records
Poetiq, a startup by former DeepMind researchers, has developed a recursive self-improvement meta-system that builds "reasoning harnesses" on top of existing LLMs. This approach avoids the costly "fine-tuning trap" and has achieved state-of-the-art results on benchmarks like ARC-AGI and Humanity's Last Exam by automatically optimizing prompts and discovering novel reasoning strategies.