The Problem: Single LLMs Fail Silently
Single-agent Large Language Model (LLM) systems present a significant challenge in production environments: they fail silently and are often "confidently wrong." When a single LLM misses a critical detail, such as a hard-coded key or a SQL injection vulnerability, it doesn't express uncertainty. Instead, it provides a definitive, and incorrect, answer. This behavior stems from several inherent limitations:
⦁ No Uncertainty Quantification: A single agent doesn't communicate its level of confidence. It presents every answer as 100% certain.
⦁ Lack of Alternative Viewpoints: The output is confined to the perspective of the single model being used, with no mechanism to introduce alternative or challenging viewpoints.
⦁ No Self-Correction: Without an external challenge, a single agent has no impetus to reconsider its conclusions, even if they are flawed. As the speaker notes, "if it misses it, it's not going to tell you."
Structured Dissent: A Multi-Agent Debate Swarm
To address these failures, a multi-agent orchestration pattern called Structured Dissent is proposed. The core idea is to create a "Think Tank"—a Socratic debate for AI—where agents with opposing viewpoints discuss and challenge decisions before reaching a consensus. This introduces nuance and a mechanism for adversarial verification.
The swarm is typically composed of three distinct agent personas:
⦁ Believers: The optimists. They are solution-focused, seeking opportunities and positive outcomes.
⦁ Skeptics: The paranoids. They focus on failure modes, risks, and hidden costs, effectively acting as a security team.
⦁ Neutrals: The facilitators. They work to prevent groupthink, synthesize the arguments from believers and skeptics, and build a balanced consensus.
The Three-Phase Debate Process
The system operates in a structured, multi-round debate. The default configuration uses five agents (two believers, two skeptics, one neutral) engaged in a three-phase process:
1. Phase 1: Parallel Analysis: Each agent independently analyzes the initial input (e.g., a security scan report) and forms its initial opinion based on its persona.
2. Phase 2: Adversarial Debate: The agents see each other's analyses and begin to argue. Skeptics challenge the believers' optimistic timelines by pointing out complexities, while believers might counter with potential solutions. This is "adversarial verification in real time," where the agents act as judges for each other's reasoning.
3. Phase 3: Synthesis and Reporting: After the debate rounds, the agents present their final conclusions. The neutral agent, acting as a "foreperson," synthesizes these into a final report.
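The three phases above can be sketched as a simple orchestration loop. The personas here are plain stub functions rather than LLM calls, and every name (`run_debate`, `believer`, `skeptic`, `neutral`) is illustrative, not taken from any real implementation:

```python
# Minimal sketch of the three-phase debate loop. Agent personas are plain
# functions here; in a real system each call would be an LLM request.

def run_debate(finding, agents, rounds=2):
    # Phase 1: each agent forms an independent opinion (parallel analysis).
    opinions = {name: persona(finding, context=[]) for name, persona in agents.items()}

    # Phase 2: adversarial debate -- every agent sees the others' positions
    # and may revise its own over several rounds.
    for _ in range(rounds):
        shared = list(opinions.values())
        opinions = {name: persona(finding, context=shared)
                    for name, persona in agents.items()}

    # Phase 3: the neutral "foreperson" tallies verdicts into a final report.
    votes = [o["verdict"] for o in opinions.values()]
    majority = max(set(votes), key=votes.count)
    confidence = votes.count(majority) / len(votes)
    return {"majority": majority, "confidence": confidence,
            "minority": [o for o in opinions.values() if o["verdict"] != majority]}

# Toy personas: believers accept the risk, skeptics flag it, the neutral
# sides with whichever camp dominates the shared context.
def believer(finding, context):
    return {"verdict": "low_risk", "reason": "mitigations exist"}

def skeptic(finding, context):
    return {"verdict": "high_risk", "reason": "hard-coded key is exploitable"}

def neutral(finding, context):
    verdicts = [c["verdict"] for c in context]
    return {"verdict": max(set(verdicts), key=verdicts.count) if verdicts
            else "high_risk", "reason": "synthesis"}

agents = {"believer_1": believer, "believer_2": believer,
          "skeptic_1": skeptic, "skeptic_2": skeptic, "neutral": neutral}
report = run_debate("AWS key committed in config.py", agents)
```

With the two-believer, two-skeptic, one-neutral default described above, the neutral swings the vote, so the toy run yields a `high_risk` majority at 60% confidence with the two believer opinions preserved as the minority view.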
The output is not a simple binary answer. It includes:
⦁ A majority opinion.
⦁ A confidence score indicating the swarm's certainty.
⦁ A summary of resolved and unresolved conflicts.
⦁ Key minority opinions, ensuring that dissenting views are not lost.
If the confidence score falls within a certain range (e.g., 50-75%), the system flags the issue for human review, acknowledging that it needs "an adult."
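The escalation rule is simple to express. The 50-75% band comes from the example range in the text; the surrounding thresholds and labels are illustrative assumptions, not a documented implementation:

```python
# Sketch of confidence-based routing: mid-band scores (0.50-0.75, per the
# example range above) go to a human. Thresholds/labels are illustrative.

def route(confidence):
    if confidence < 0.50:
        return "reject"          # swarm could not agree; rerun or discard
    if confidence <= 0.75:
        return "human_review"    # the system "needs an adult"
    return "auto_accept"         # high consensus; proceed automatically

assert route(0.40) == "reject"
assert route(0.60) == "human_review"
assert route(0.90) == "auto_accept"
```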
Use Case: MCP Server Security Analysis
The primary demonstration involves a security swarm built to analyze findings from open-source tools (like Bandit, Semgrep, Syft) on MCP (Model Context Protocol) servers.
⦁ Input: Reports from static analysis and dependency vulnerability scans (approx. 35,000 characters).
⦁ Process: The swarm debates the findings to assess the security posture of the MCP server.
⦁ Performance: A typical analysis takes 3-5 minutes and costs around $15 in API calls. This is a significant improvement over a manual security analyst review, which could take hours and cost thousands of dollars.
⦁ Output: The system generates an "executive appropriate" report (approx. 10,000 characters)...
Full story
tokenless.tech
Structured Dissent Patterns for Agentic Production Reliability | Tokenless
This talk introduces 'structured dissent,' a multi-agent orchestration pattern where believer, skeptic, and neutral agents debate decisions to overcome the 'confidently wrong' failure mode of single-agent LLM systems, improving reliability for high-stakes…
The Asymmetric Design Cycle: AI's Compute Bottleneck
The fundamental bottleneck holding back AI progress is the asymmetric design cycle between AI models and the chips they run on. While new AI methods can be developed rapidly, designing and manufacturing the next generation of chips is a multi-year, multi-hundred-million-dollar process. This mismatch prevents the effective co-design and co-evolution of hardware, software, and AI workloads. The current paradigm often involves repurposing existing hardware, like GPUs originally designed for graphics, for AI tasks. While effective at matrix multiplication, these chips are not co-optimized for the specific neural network models being run. The vision is to dramatically shorten the chip design cycle, enabling a world where custom silicon can be created in tandem with new AI applications, bending the curve of the scaling laws that govern AI progress.
The Genesis: AlphaChip and the TPU Team
The journey began with the AlphaChip project at Google, which ultimately helped design four successive generations of Tensor Processing Units (TPUs). The project started by applying Reinforcement Learning (RL) to chip placement, a critical stage in the physical design process known as floorplanning.
The initial collaboration with Google's TPU team was met with significant skepticism. The research team, coming from an AI background, initially optimized for academic metrics like "half-perimeter wire length." The TPU engineers, however, were quick to point out that these metrics were irrelevant to them. They cared about a complex set of real-world constraints:
⦁ Routed wire length
⦁ Horizontal and vertical congestion
⦁ Timing violations
⦁ Power consumption
⦁ Chip area (together with power and performance, the "PPA" metrics)
To gain trust, the AlphaChip team adopted a highly iterative, customer-obsessed approach. They met with the TPU team weekly for years, showing them new data and working collaboratively to build cost functions that approximated the metrics the engineers truly valued. This deep partnership was crucial. For an engineer to choose an AI-generated layout over their own, they had to be convinced it was superior on every single metric they cared about, as they were ultimately responsible for the block's performance.
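The academic proxy metric the researchers first optimized, half-perimeter wire length, is cheap to compute, which is exactly why it diverges from the routed, congestion-aware metrics the TPU engineers cared about. A minimal sketch of HPWL over a toy netlist:

```python
# Half-perimeter wire length (HPWL): for each net, the half-perimeter of the
# bounding box around the pins it connects, summed over all nets. This is the
# academic proxy mentioned above; routed wire length, congestion, and timing
# require far more expensive analysis.

def hpwl(placements, nets):
    """placements: {cell: (x, y)}; nets: list of lists of cell names."""
    total = 0.0
    for net in nets:
        xs = [placements[c][0] for c in net]
        ys = [placements[c][1] for c in net]
        total += (max(xs) - min(xs)) + (max(ys) - min(ys))
    return total

cells = {"a": (0, 0), "b": (4, 0), "c": (0, 3)}
print(hpwl(cells, [["a", "b"], ["a", "c"], ["a", "b", "c"]]))  # 4 + 3 + 7 = 14.0
```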
A New Paradigm for Chip Design
The technical approach for AlphaChip was fundamentally different from traditional Electronic Design Automation (EDA) methods. Instead of using classical combinatorial optimization solvers, the team trained an RL agent to place the millions of components on a chip.
⦁ Learning from Experience: The RL agent learns through trial and error, interacting with a simulated environment. It learns from both positive and negative placement examples, iteratively improving its strategy. This ability to learn from experience allows the model to self-improve, much like a human expert who gets better with each new design, but at a vastly greater scale.
⦁ Superhuman and Unconventional Designs: The AI began to produce layouts that were radically different from human-designed ones. As Anna Goldie noted, "We saw these very strange like curved placements... donut shapes as well." Humans tend to create highly regular, grid-like layouts. The AI, however, discovered that curved and non-uniform shapes could reduce wire length, thereby improving power consumption and timing, even if they appeared counter-intuitive and complex.
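The trial-and-error dynamic described above can be illustrated at toy scale. This is nothing like AlphaChip's actual deep RL policy over millions of components; it is only a bandit-style agent learning, from reward alone, which ordering of three cells in a row minimizes wire length between connected cells:

```python
# Toy illustration of learning placement from experience: an epsilon-greedy
# agent tries permutations of three cells and keeps a running value estimate
# of each from the reward (negative wirelength). Purely pedagogical.
import itertools, random

random.seed(0)
cells = ["a", "b", "c"]
nets = [("a", "b"), ("b", "c")]          # a-b and b-c are connected

def wirelength(order):
    pos = {c: i for i, c in enumerate(order)}
    return sum(abs(pos[u] - pos[v]) for u, v in nets)

arms = list(itertools.permutations(cells))
value = {arm: 0.0 for arm in arms}        # running estimate of -wirelength
count = {arm: 0 for arm in arms}

for episode in range(200):
    # Explore occasionally, otherwise exploit the best-known placement.
    arm = (random.choice(arms) if random.random() < 0.2
           else max(arms, key=value.get))
    reward = -wirelength(arm)             # shorter wires => higher reward
    count[arm] += 1
    value[arm] += (reward - value[arm]) / count[arm]

best = max(arms, key=value.get)
print(best, wirelength(best))
```

The agent converges on a placement with "b" in the middle (total wirelength 2), the optimum for this netlist, purely from positive and negative placement feedback.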
The project's success was validated when the first chip designed with AlphaChip's help was taped out and returned from the fab fully functional. With each subsequent TPU generation, the AI's layouts were adopted across more of the chip, and the performance delta between the AI's design and the human baseline grew, demonstrating AI's ability to scale with more data and experience.
Ricursive Intelligence: From Fabless to Designless
The success of AlphaChip inspired the founding of Ricursive Int...
Full story
tokenless.tech
How Ricursive Intelligence’s Founders are Using AI to Shape The Future of Chip Design | Tokenless
Anna Goldie and Azalia Mirhoseini of Ricursive Intelligence discuss how their work on Google's AlphaChip, which used AI to design TPUs, is now being extended to automate the entire chip design process. They explain their vision for a 'designless' industry…
Why Every Brain Metaphor in History Has Been Wrong [SPECIAL EDITION]
An exploration of scientific simplification, questioning the metaphors we use to understand the brain and intelligence. This summary delves into the tension between creating useful models and mistaking them for reality, featuring insights on the mind-as-software debate, the limits of prediction versus understanding, and the philosophical underpinnings of our quest for AGI.
Science operates by simplifying complex reality, but this necessary act raises a fundamental question: have we found a deep truth about the world, or are we mistaking our simplified model for the actual thing? This tension is embodied by the "spherical cow" joke in physics and is central to modern neuroscience and AI. As Professor Mazviita Chirimuuta explains in her book, The Brain Abstracted, we are limited creatures who must build models and leave things out. The critical disagreement, however, is what this success implies about reality itself.
This can be framed as a conflict between two perspectives:
⦁ Simplicius: Believes that science works because the universe is fundamentally simple and orderly underneath its apparent complexity. An elegant equation reflects reality.
⦁ Ignorantio: Argues that we simplify because we are cognitively limited. Our models are useful fictions—maps, not the territory—that work for our specific purposes, which doesn't prove that nature itself is simple.
Chirimuuta aligns with "learned ignorance" (docta ignorantia), the idea that true learning includes understanding the limits of what you know.
The Kaleidoscope Hypothesis: Is Reality Fundamentally Code?
Francois Chollet proposes the "kaleidoscope hypothesis," suggesting that beneath the messy surface of reality lies an intrinsic, underlying structure composed of simple, repeating "atoms of meaning." Much like a kaleidoscope creates infinite complexity from a few pieces of colored glass, the world is generated by the repetition and composition of these fundamental elements. Intelligence, in this view, is the process of mining experience to extract these abstractions.
Chirimuuta frames this not as a scientific certainty but as a philosophical bet, akin to Plato's theory of Forms. It's a wager that "real reality is neat, mathematical, and decomposable" beneath the complicated world of appearances.
The Ultimate Metaphor: Is the Mind Software?
The most pervasive simplification today is the idea that the mind is a computer running software. This has moved from a metaphor to what many consider a literal truth. Joscha Bach argues provocatively that this is not a metaphor at all: "Software is spirit." He posits that abstract patterns, like software or money, have real causal power, independent of their physical substrate. A program produces the same effects whether on a Mac, a PC, or potentially even neurons, because the causal power lies in the invariance of the pattern itself.
The counterargument is that this "sameness" is not inherent in nature but is imposed by a human observer. Physically, completely different things are happening inside different computer chips. The invariance exists only in our description. The causal power of money, for example, isn't in the paper or electrons but in the shared social agreements and interpretive practices of humans. The critique is that this view mistakes an elegant description for the fundamental structure of reality.
Historically, our metaphors for the brain have always tracked our most advanced technology:
⦁ Descartes: Hydraulic pumps in French royal gardens.
⦁ 19th Century: A telegraph network.
⦁ 20th Century: A telephone switchboard.
⦁ 21st Century: A digital computer.
As Jeff Beck bluntly states, "It will always be the case that our explanation for how the brain works will be by analogy to the most sophisticated technology that we have."
Ontology vs. Metaphysics: It Depends on Why You're Asking
Professor Luciano Floridi offers a framework to navigate this, distinguishing between metaphysics (reality as it is in itself, which is inaccessible) and ontology (the structure we impose on reality for a specific purpose). Our models of the world are not absolutely true or false; their value is relational.
Is it the same ship of Theseus? The question is a mistake. It provides no interface, what computer scientis...
Full story
tokenless.tech
Why Every Brain Metaphor in History Has Been Wrong [SPECIAL EDITION] | Tokenless
"We Made a Dream Machine That Runs on Your Gaming PC"
Shahbuland Matiana and Andrew Lapp from Overworld Labs introduce Waypoint 1, a 2 billion-parameter open-source world simulation model designed to run on consumer hardware at 60 FPS. They discuss its novel architecture, which combines a causal language model with an image diffusion model to denoise frames in real-time based on user prompts and controller inputs, emphasizing low-latency interaction and the importance of local execution for user privacy.
Overworld Labs has introduced Waypoint 1, a 2 billion-parameter world simulation model designed to run efficiently on consumer hardware. Unlike large-scale projects like Google's Genie, which rely on massive cloud infrastructure, Waypoint 1 is optimized for local execution on gaming PCs (e.g., NVIDIA 3070s, 4090s) and soon, Apple Silicon. The model, whose weights are being open-sourced, is capable of generating interactive, explorable worlds from text or image prompts at 60 frames per second.
The Vision: Sharable Lucid Dreams
The core motivation behind Overworld is to create a way to record and share the kinds of immersive, dynamic experiences found in dreams. Co-founder Shahbuland Matiana described a personal lucid dream that modern game engines cannot replicate:
"I was in this like house floating in space and there was a giant like dragon circling the the house... I draw a katana from my like waist and I parry the dragon's teeth as it goes try to bite me. I feel a clang reverberate through my whole body. The floorboards crack beneath my feet. The window shatter around me."
The goal of Waypoint 1 is to enable the creation of such fully immersive experiences where the world bends and reacts to the user's actions, and then allow those experiences to be shared with others. This technology aims to be a "killer application" for AI, moving beyond static video generation into truly interactive entertainment.
Technical Architecture: A Real-Time Diffusion Transformer
Waypoint 1's architecture is a novel hybrid of a causal language model and an image diffusion model, optimized for real-time interaction.
1. Image Compression: The process begins with an autoencoder that compresses video frames (e.g., 360p) into a much smaller latent representation, such as a 32x32 grid. The model operates entirely in this compressed latent space, not on raw pixels.
2. Frame Generation: The core of the system is a transformer model. However, instead of autoregressively predicting the next token like a standard LLM, it denoises the next 256 tokens (representing one full frame) in a single forward pass.
3. Conditioning: Each frame is generated conditioned on a history of preceding frames, a text prompt, and controller inputs from the last 1/60th of a second. This conditioning is managed through cross-attention mechanisms within the transformer blocks.
4. Low Latency: To ensure playability and responsiveness, the model generates only one frame at a time. This is a key distinction from many video diffusion models that use temporal autoencoders to compress multiple frames together, which saves computation but introduces significant input lag (e.g., only accepting input every 4th frame).
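The data flow implied by the four points above can be sketched as a per-frame loop. Every name here (`generate_session`, `denoise_frame`, the stub model) is a placeholder standing in for a learned component, not Waypoint 1's real API:

```python
# Sketch of the autoregressive world loop: one latent frame per step,
# conditioned on history, the text prompt, and the freshest controller
# input. Shows the data flow only, not a real implementation.

def generate_session(prompt, get_controller_input, denoise_frame, n_frames=3):
    history = []                           # latent frames generated so far
    for _ in range(n_frames):
        controls = get_controller_input()  # sampled each 1/60 s tick
        # One full frame = 256 latent tokens, denoised in a single pass.
        frame = denoise_frame(history, prompt, controls)
        history.append(frame)
    return history

# Stub model: "denoises" to a frame whose tokens echo the control value,
# just to exercise the loop.
def stub_denoise(history, prompt, controls):
    return [controls] * 256

frames = generate_session("a house floating in space",
                          get_controller_input=lambda: 1.0,
                          denoise_frame=stub_denoise)
print(len(frames), len(frames[0]))  # 3 frames of 256 tokens each
```

Because the controller input is re-read before every frame, no input is ever more than one frame (1/60 s) stale, which is the low-latency property point 4 emphasizes.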
Optimization and Distillation
Achieving 60 FPS on consumer hardware requires significant optimization. The team uses a four-step rectified flow model with an Euler sampler. In this process, the model starts with random noise and, over four steps, predicts the vector that moves the latent representation closer to the "clean," ideal frame.
A key insight is that reducing the number of diffusion steps primarily sacrifices diversity, not quality. For an autoregressive model like Waypoint 1, this is an acceptable trade-off. The strong conditioning from previous frames and user input already constrains the output, so the inherent diversity from a high-step diffusion process is less critical.
This speed is further enhanced by diffusion distillation (e.g., using methods like Distribution Matching Distillation or DMD), where a "student" model is trained to replicate the output of a larger model in fewer steps. This process effectively "bakes in" parameters like the classifier-free guidance (CFG) scale, which avoids the need for multiple forward passes during inference and dramatically speeds up generation.
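The "baking in" of classifier-free guidance is worth unpacking: CFG normally combines an unconditional and a conditional prediction every step, v_guided = v_uncond + s * (v_cond - v_uncond), which costs two model calls. A distilled student is trained to emit the guided vector in one call. The lambdas below are stand-in models with made-up outputs:

```python
# Why distillation saves a forward pass: CFG needs two teacher calls per
# step; a student with the guidance scale baked in needs one.

def cfg_combine(v_uncond, v_cond, scale):
    return [u + scale * (c - u) for u, c in zip(v_uncond, v_cond)]

scale = 3.0
teacher_uncond = lambda x: [0.0, 0.0]     # stand-in unconditional prediction
teacher_cond   = lambda x: [1.0, -1.0]    # stand-in conditional prediction

# Two forward passes per step with the teacher:
x = [0.5, 0.5]
guided = cfg_combine(teacher_uncond(x), teacher_cond(x), scale)

# The student was trained to match the guided output directly: one pass.
student = lambda x: [3.0, -3.0]
print(guided, student(x))
```

Halving the model calls per diffusion step compounds with the four-step sampler, which is how the frame budget fits inside 1/60th of a second.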
Privacy and the Future
The team strongly advocates for ...
Full story
tokenless.tech
"We Made a Dream Machine That Runs on Your Gaming PC" | Tokenless
Shahbuland Matiana and Andrew Lapp from Overworld Labs introduce Waypoint 1, a 2 billion-parameter open-source world simulation model designed to run on consumer hardware at 60 FPS. They discuss its novel architecture, which combines a causal language model…
This Startup Beat Gemini 3 on ARC-AGI — at Half the Cost
Poetic, a startup by ex-DeepMind researchers, has significantly advanced performance on the ARC-AGI benchmark by applying a recursive self-improvement system to Gemini 3. Co-founder Ian Fisher discusses how their approach of automating prompt and system engineering provides a substantial performance boost without needing access to model weights, and explores its potential as a path toward AGI.
Poetic, a new startup founded by former DeepMind researchers, has achieved a significant breakthrough on the ARC-AGI benchmark. By layering their proprietary system on top of Gemini 3, they achieved a 54% score on the private test set, a substantial leap from Gemini 3's baseline of approximately 33% and even surpassing the more advanced Gemini 3 Deep Think's 45% at half the cost.
The Core Technology: Recursive Self-Improvement
The central idea behind Poetic's success is a form of recursive self-improvement (RSI), which co-founder Ian Fisher describes as "the holy grail of AI." The goal is to create a system where the AI actively makes itself smarter.
Unlike methods that require fine-tuning or access to model weights, Poetic's approach operates purely at the system and prompt level. This is a crucial advantage when working with closed-source models available only through APIs. The methodology involves:
⦁ Ensemble Methods: The system calls the underlying model (e.g., Gemini 3) multiple times.
⦁ Independent Refinement: Each member of the ensemble works independently to refine its own answer.
⦁ Advanced Voting Schemes: The refined answers are combined using a sophisticated voting mechanism to produce a final, more accurate solution.
This system-level optimization is what differentiates Poetic from other prompt engineering frameworks like DSPy, containing what Fisher refers to as "trade secret insights" that yield a significant performance difference. The entire ARC-AGI solver was an output of their system, which was trained on ARC-1 and then applied to ARC-2 without any specific training on the latter.
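Poetic's actual refinement and voting schemes are trade secrets, so the following is only the general shape of the ensemble pattern described above, with a hypothetical `call_model` wrapper and the simplest possible voting rule (majority over exact answers):

```python
from collections import Counter
from itertools import count

def solve_with_ensemble(call_model, task, n=5, refine_rounds=2):
    """Ensemble sketch: n members, each refining its own draft
    independently, then a vote over the final answers."""
    finals = []
    for _ in range(n):
        answer = call_model(f"Solve: {task}")
        for _ in range(refine_rounds):  # independent refinement
            answer = call_model(f"Solve: {task}\nDraft: {answer}\nImprove it.")
        finals.append(answer)
    return Counter(finals).most_common(1)[0][0]  # majority vote

# Deterministic stand-in model cycling through three canned replies;
# each member's final (third) call lands on "4".
calls = count()
fake_model = lambda prompt: ["4", "5", "4"][next(calls) % 3]
result = solve_with_ensemble(fake_model, "2 + 2")
print(result)  # "4"
```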
The Gemini 3 Catalyst
The release of Gemini 3 was a pivotal moment. While Poetic's system showed promising results on ARC-1 with other models (reaching 89%), switching to Gemini 3 pushed their performance to 95%. When they applied this new combination to the more challenging ARC-2, they had a "holy cow moment" as the performance jumped to the state-of-the-art 54%.
Fisher attributes this leap to Gemini 3's exceptional ability to generate code for visual problem-solving, a capability that surpassed previous models. He also notes that other powerful models like Anthropic's Opus can be swapped in for Gemini 3 to achieve similar results, albeit at a higher cost.
A Path to AGI and Practical Applications
Fisher views RSI as both a practical tool for immediate performance gains and a credible path toward AGI.
⦁ Immediate Value: The performance "bump" from Poetic's system can be highly valuable. On the ARC-AGI benchmark, which allows for two solution submissions, their method provided a single, higher-quality solution that outperformed the underlying model's two submissions, sometimes at a lower overall cost.
⦁ Long-Term Vision: While not the only path, Fisher believes RSI is "the most exciting path to AGI and beyond." The process on ARC-AGI was stopped manually due to cost constraints, suggesting that with more resources, the performance could have "hill-climbed" even further.
Automating the Prompt Engineer
The broader vision for Poetic is to automate the complex and often manual process of prompt engineering and agent creation. Fisher draws an analogy to the evolution of deep learning, which automated the manual process of feature engineering.
"We are quite intentionally automating ourselves, automating prompt engineers, automating people who are building agents. It's a power tool."
He contrasts their previous manual work at DeepMind—akin to building a car by hand—with Poetic's technology, which is like "building a factory to build cars." The goal is to create a system that automatically discovers the optimal prompts and system configurations, removing the human from the tedious trial-and-error loop. While continuing their research and targeting other high-impact benchmarks, the six-person team is now also focusing on bringing t...
Full story
She Raised $64M to Build an AI Math Prodigy | Carina Hong, CEO of Axiom
Carina Hong, Founder & CEO of Axiom, discusses building a self-improving AI reasoning engine that combines generation and verification. Starting with formal mathematics, Axiom's system has achieved superhuman results on the notoriously difficult Putnam Exam by leveraging formal languages like Lean to overcome the probabilistic and unverifiable nature of standard LLMs. Hong explores how this technology can solve major bottlenecks in hardware and software verification, code migration, and database consistency, and what it means for the future of mathematical research.
Axiom's mission is to build a self-improving reasoning engine that uniquely combines generation and verification, an often-overlooked component in the current AI landscape. The company starts with an "AI mathematician" as a testing ground for this self-improvement loop, using formal languages like Lean to ground its natural language capabilities.
The Architecture of a Reasoning Engine
Axiom's system is built on three core components that interact with each other:
⦁ Prover: A system that can prove theorems.
⦁ Conjecturer: A system that proposes interesting and novel conjectures.
⦁ Knowledge Base: A database of what has already been proven, which both the prover and conjecturer can reference.
Tying these components together is auto-formalization, the process of converting natural language mathematics into a formal language. This is a core technology for Axiom, viewed as being as challenging and important as theorem proving itself.
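For a sense of what a formal language buys here, a Lean 4 theorem (a generic example, not Axiom's code) is a statement whose proof the kernel checks mechanically; nothing unproven can slip through.

```lean
-- A machine-checkable statement and proof in Lean 4.
-- The kernel verifies every step; there is no room to "hand-wave".
theorem add_comm' (a b : Nat) : a + b = b + a :=
  Nat.add_comm a b
```

Auto-formalization is the step that turns a natural-language claim like "addition is commutative" into the formal statement above.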
Superhuman Performance on the Putnam Exam
Axiom's prover has demonstrated remarkable capabilities on the Putnam Mathematical Competition, an infamously difficult exam for undergraduates where the median score is often zero.
⦁ Axiom's system solved 8 out of 12 problems within the official time limit, a score that would place it in the top five (Putnam Fellow). A ninth problem was solved shortly after.
⦁ This performance significantly surpasses that of Axiom's founder, Carina Hong, who scored 4 out of 12.
⦁ This success showcases the power of combining deterministic, formal tools with probabilistic systems. Formal systems cannot "hand-wave" through difficult steps, forcing a level of rigor that informal LLMs lack. For instance, the AI prover might spend significant effort generating detailed code to rigorously prove convergence or limits, something a human might take for granted.
AI vs. Human Problem-Solving
While LLMs can seem impressive on some math problems, they often fail on seemingly simpler brain teasers because they lack true reasoning and verification. They generate solutions statistically without a guarantee of soundness.
⦁ Formal Verification's Role: Axiom's use of formal languages like Lean ensures that a proof is sound. Unlike a natural language proof from an LLM, which can have subtle flaws that are hard to spot, a Lean proof is machine-verifiable.
⦁ Interpretability: While the AI may generate proofs that are structured differently from human proofs, they are ultimately interpretable. The formal code of each step can be inspected and converted back to natural language, a significantly easier task than the initial formalization. The AI may find solutions that are convergent with what a human would find, acting like a collaborator with a different style, akin to the discovery of a self-taught genius like Ramanujan.
Applications Beyond Pure Mathematics
The core technology of generation paired with verification has profound implications for high-stakes commercial applications where correctness is critical. Formal verification is a major bottleneck in many industries, often consuming years of effort.
⦁ Hardware and Software Verification: In chip design, verification teams can be three to four times larger than design teams, with verification cycles taking years. AI-powered formal verification can dramatically reduce this time and lower the expertise required. AWS, for example, took five years to manually formalize just one component of its hypervisor.
⦁ Code Migration and Equivalence: When upgrading legacy systems, it's crucial to ensure the new code is perfectly equivalent to the old code. Formal methods can prove this equivalence, preventing regressions in critical business functions.
⦁ Database Consistency: Formal verification can be used to prove the consistency of database protocols, such as solving the Byzantine Generals Problem, ensuring reliability even in the presence of bad act...
Full story
Inference at Scale: Breaking the Memory Wall
Sid Sheth, CEO of d-matrix, details their memory-centric approach to AI inference hardware, focusing on their Digital In-Memory Compute (DIMC) architecture. He explains how DIMC, an augmented SRAM technology, minimizes data movement to solve the memory bottleneck, delivering significant gains in latency and energy efficiency, particularly for the 'decode' phase of large language models.
The Bet on Cloud Inference and Memory-Centric Design
Founded in 2019, before the rise of ChatGPT, d-matrix made a contrarian bet on data center and cloud inference. While many startups focused on edge computing or the highly competitive training market dominated by NVIDIA, d-matrix identified a gap for a dedicated, efficient inference solution in the cloud.
The founding team anticipated that AI models, particularly transformers like BERT and the emerging GPT-3, would continue to grow in size, making memory access the primary bottleneck. Their first-principles analysis of the inference workload revealed it to be a repetitive, parallel compute problem heavily dependent on memory access. This led to their core strategy: integrating memory and compute as closely as possible to build a fundamentally more efficient architecture.
The Memory Bottleneck: HBM vs. SRAM
The choice of memory technology is critical for AI hardware. Sid Sheth provides a clear breakdown of the trade-offs:
⦁ High-Bandwidth Memory (HBM): Originally developed for High-Performance Computing (HPC) and later adopted for AI training, HBM acts like a "highway with many lanes," providing high-bandwidth access to a processor. While effective for the massive, parallel data needs of training, HBM is a poor fit for mainstream inference due to three key factors:
⦁ Cost: It remains an expensive technology.
⦁ Energy: It is very power-hungry.
⦁ Bandwidth Limits: The pace of AI model growth is outstripping HBM's ability to scale its bandwidth, making it "not fast anymore" for cutting-edge inference needs.
⦁ SRAM (Static RAM): d-matrix, along with other early players like Groq and Cerebras, initially focused on SRAM for its speed. However, on-chip SRAM capacity is limited. Recognizing that models would quickly outgrow a single chip, d-matrix designed its system with a two-tiered memory approach from the start, using a large on-chip SRAM tier and a second, larger LPDDR memory tier to accommodate extremely large models and the exploding KV-cache sizes associated with long contexts.
Prefill vs. Decode: The Two Phases of Generative Inference
Generative AI models operate in two distinct phases, which have different hardware requirements:
1. Prefill (The "Thinking" Phase): When a model receives a prompt, it processes the input and generates the internal context (KV cache). This phase is compute-intensive.
2. Decode (The "Speaking" Phase): The model then generates the response token by token. Each new token requires accessing the entire KV cache. This phase is memory-intensive and highly sensitive to latency. A slow decode phase results in a poor user experience, with long delays between words.
d-matrix's architecture is particularly well-suited for accelerating the memory-bound decode phase, where low latency is paramount.
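Why decode is memory-bound follows from simple arithmetic: every generated token must re-read the entire KV cache (plus the weights). A back-of-the-envelope sketch with illustrative, roughly Llama-7B-like fp16 numbers, which are assumptions rather than d-matrix figures:

```python
def kv_cache_bytes(layers, kv_heads, head_dim, seq_len, bytes_per_el=2):
    """Size of the KV cache that decode re-reads for every new token:
    two tensors (K and V) per layer, each [kv_heads, seq_len, head_dim]."""
    return 2 * layers * kv_heads * head_dim * seq_len * bytes_per_el

# Illustrative config: 32 layers, 32 KV heads, head_dim 128, 4k context, fp16.
cache = kv_cache_bytes(layers=32, kv_heads=32, head_dim=128, seq_len=4096)
print(f"{cache / 2**30:.1f} GiB re-read per generated token (plus all weights)")
```

At 2.0 GiB of cache traffic per token, the arithmetic per token is trivial by comparison, which is why decode latency is set by memory bandwidth, not FLOPs.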
Digital In-Memory Compute (DIMC): The Core Innovation
d-matrix's key technology is Digital In-Memory Compute (DIMC). It's a novel architecture that turns memory itself into a compute fabric.
⦁ How it Works: A traditional SRAM cell uses six transistors (6T) to store one bit of data. d-matrix augmented this design, creating a ten-transistor (10T) cell that can both store a bit and perform a single-bit multiplication.
⦁ The Benefit: By embedding compute directly within the memory array, model parameters (weights) can be stored and used for matrix math calculations without being moved. This minimization of data movement is the key to efficiency. It saves a tremendous amount of time and energy, directly addressing the three most precious resources: money, time, and energy.
This approach allows all rows of the SRAM to be activated simultaneously, creating a dataflow engine with much higher throughput than a traditional SRAM.
System, Scale, and Performance Trade-offs
The d-matrix solution is bu...
Full story
A Philosophy of Building for the Future
The core development principle at Anthropic, and for Claude Code specifically, is to not build for the model of today, but for the model that will exist in six months. This forward-looking approach anticipates the rapid, exponential improvement in model capabilities. Builders are advised to identify the current frontiers where a model is weak, with the confidence that it will become proficient in those areas over time.
This philosophy is heavily influenced by Rich Sutton's "The Bitter Lesson," which posits that general models that leverage computation will ultimately outperform more specialized, human-designed systems. Consequently, the Claude Code team is cautious about building what they call "scaffolding"—product features or code that compensates for a model's current shortcomings. This scaffolding often provides a temporary 10-20% performance gain but is rendered obsolete by the next model iteration.
"Never bet against the model... We could also just wait like a couple of months and the model can probably just do the thing instead."
This results in a dynamic and ephemeral codebase. Virtually no part of Claude Code that existed six months ago is still in the product today. The entire application is constantly being written, rewritten, and refactored as model capabilities advance, with tools and features being added and removed every couple of weeks.
The Accidental Genius of the Terminal
Claude Code's existence as a command-line interface (CLI) was not a grand design but an accident. It began as a simple terminal-based chat application built by Boris Cherny to familiarize himself with the Anthropic API. The initial goal was simply to explore what a coding product could be.
The "aha!" moment came when the model was given a bash tool. When asked, "What music am I listening to?" the model, Sonnet 3.5 at the time, independently wrote and executed AppleScript to query the user's music player. This demonstrated an innate desire to use tools and interact with the world, which became a foundational insight for the product's direction.
The terminal, chosen for its simplicity and lack of UI overhead, proved to be a surprisingly effective and enduring form factor. Its constraints fostered an elegant and powerful developer experience that resonated deeply with engineers, leading to rapid, viral adoption within Anthropic long before its public release.
Features Born from Latent Demand
A key product principle is to identify and serve "latent demand"—making it easier for users to do what they are already trying to do. Many of Claude Code's core features originated from observing user workarounds and desires.
CLAUDE.md
The concept for CLAUDE.md emerged when developers were observed writing their own markdown files with instructions and context, which they would then feed to the model. This behavior was formalized into a feature that allows teams to maintain a shared set of instructions and context checked into their codebase. The advice for maintaining these files is to be minimal; if a CLAUDE.md becomes too long or complex, it's often best to delete it and start fresh, adding instructions back only as needed, as newer models require less guidance.
Plan Mode
Plan Mode was created in a 30-minute coding session on a Sunday night in response to observing users explicitly asking the model to "plan this out but don't write any code yet." The implementation is deceptively simple: it just adds a single sentence to the prompt, "please don't code." While currently a heavily used feature to ensure the model is on the right track before execution, Boris predicts it may have a limited lifespan as models become capable enough to generate and execute a correct plan from a single prompt.
From Solo Agent to Agent Swarms
The architecture of work is evolving from single-agent interactions to multi-agent collaboration. Claude Code heavil...
Full story
The core development principle at Anthropic, and for Claude Code specifically, is to not build for the model of today, but for the model that will exist in six months. This forward-looking approach anticipates the rapid, exponential improvement in model capabilities. Builders are advised to identify the current frontiers where a model is weak, with the confidence that it will become proficient in those areas over time.
This philosophy is heavily influenced by Rich Sutton's "The Bitter Lesson," which posits that general models that leverage computation will ultimately outperform more specialized, human-designed systems. Consequently, the Claude Code team is cautious about building what they call "scaffolding"—product features or code that compensates for a model's current shortcomings. This scaffolding often provides a temporary 10-20% performance gain but is rendered obsolete by the next model iteration.
"Never bet against the model... We could also just wait like a couple of months and the model can probably just do the thing instead."
This results in a dynamic and ephemeral codebase. Virtually no part of Claude Code that existed six months ago is still in the product today. The entire application is constantly being written, rewritten, and refactored as model capabilities advance, with tools and features being added and removed every couple of weeks.
The Accidental Genius of the Terminal
Claude Code's existence as a command-line interface (CLI) was not a grand design but an accident. It began as a simple terminal-based chat application built by Boris Cherny to familiarize himself with the Anthropic API. The initial goal was simply to explore what a coding product could be.
The "aha!" moment came when the model was given a
bash tool. When asked, "What music am I listening to?" the model, Sonnet 3.5 at the time, independently wrote and executed AppleScript to query the user's music player. This demonstrated an innate desire to use tools and interact with the world, which became a foundational insight for the product's direction.The terminal, chosen for its simplicity and lack of UI overhead, proved to be a surprisingly effective and enduring form factor. Its constraints fostered an elegant and powerful developer experience that resonated deeply with engineers, leading to rapid, viral adoption within Anthropic long before its public release.
Features Born from Latent Demand
A key product principle is to identify and serve "latent demand"—making it easier for users to do what they are already trying to do. Many of Claude Code's core features originated from observing user workarounds and desires.
CLAUDE.md
The concept for CLAUDE.md emerged when developers were observed writing their own markdown files with instructions and context, which they would then feed to the model. This behavior was formalized into a feature that allows teams to maintain a shared set of instructions and context checked into their codebase. The advice for maintaining these files is to be minimal; if a CLAUDE.md becomes too long or complex, it's often best to delete it and start fresh, adding instructions back only as needed, as newer models require less guidance.
Plan Mode
Plan Mode was created in a 30-minute coding session on a Sunday night in response to observing users explicitly asking the model to "plan this out but don't write any code yet." The implementation is deceptively simple: it just adds a single sentence to the prompt, "please don't code." While currently a heavily used feature to ensure the model is on the right track before execution, Boris predicts it may have a limited lifespan as models become capable enough to generate and execute a correct plan from a single prompt.
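A feature this small is easy to sketch. Below is an illustrative guess at the shape of the implementation (the exact wording and wiring in Claude Code may differ): Plan Mode just appends one instruction to the prompt.

```python
def build_prompt(user_request: str, plan_mode: bool = False) -> str:
    """Assemble the prompt sent to the model; Plan Mode adds one sentence."""
    prompt = user_request
    if plan_mode:
        # Per the talk, the entire feature is roughly this single sentence.
        prompt += "\n\nPlease don't code."
    return prompt

print(build_prompt("Add OAuth login to the app", plan_mode=True))
```

That a heavily used feature reduces to one line of prompt text is itself an argument against scaffolding: there was almost nothing to build, and almost nothing to throw away when the model no longer needs it.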
From Solo Agent to Agent Swarms
The architecture of work is evolving from single-agent interactions to multi-agent collaboration. Claude Code heavil...
Boris Cherny: How We Built Claude Code | Tokenless
Boris Cherny, creator of Claude Code, shares the development philosophy behind the AI coding tool, emphasizing building for future models, leveraging latent user demand, and the surprising longevity of the terminal interface.
The Laws of Thought: The Math of Minds and Machines, with Prof. Tom Griffiths
Princeton Professor Tom Griffiths discusses his book "The Laws of Thought," exploring the mathematical models that govern both biological and artificial intelligence. He details the fundamental differences between human and machine cognition, rooted in their vastly different constraints, and explains how concepts like inductive bias, probability, and curiosity can bridge the gap between cognitive science and modern AI.
Professor Tom Griffiths of Princeton University explores the mathematical principles that form the foundation of both human and artificial intelligence, bridging the gap between two contrasting views of the human mind. While psychologists often highlight human irrationality and biases, computer scientists see human cognition as an inspiration for AI. Griffiths' work seeks to reconcile these perspectives by framing human intelligence as a rational adaptation to significant constraints.
The Laws of Thought: A Mathematical Theory of Mind
The core idea of Griffiths' book, The Laws of Thought, is that just as mathematical laws of nature describe the external, physical world, a complementary set of mathematical principles can describe our internal, mental world.
⦁ From Behaviorism to Cognitive Science: Early psychology struggled to scientifically study internal thoughts, leading to the rise of behaviorism, which focused only on observable behaviors. The cognitive revolution was made possible by the development of computers and mathematical concepts like logic and probability, which provided a new, rigorous language to form and test hypotheses about the mind.
⦁ Research Methodology: Modern cognitive science research often involves large-scale online experiments. In Griffiths' lab, participants are presented with problems that require them to make inferences or decisions from data. By analyzing the responses from thousands of participants using modern machine learning tools like neural networks, researchers can develop and refine computational models of human cognition.
Human vs. AI: A Tale of Two Intelligences
A key distinction between human and artificial intelligence lies in the constraints they operate under. Humans are limited by time (a finite lifespan), computation (a few pounds of neural tissue), and communication bandwidth. In contrast, AI systems can be scaled with more data and compute and can transfer information perfectly. This leads to fundamentally different problem-solving approaches.
⦁ Inductive Bias and The Data Gap: A human child learns a language in about five years, whereas an LLM requires the equivalent of thousands of years of text data. This vast difference highlights the powerful inductive biases, or priors, built into human cognition. These biases provide a starting framework that makes learning from sparse data possible.
⦁ The Machine Learning Paradigm: Since the success of AlexNet in 2012, the dominant paradigm in machine learning has been one of weak inductive biases and massive datasets. The philosophy is that with enough data, a sufficiently complex model can learn the necessary features and solutions without human-engineered priors. This is the opposite of the human approach.
⦁ Engineering Inductive Bias: To create more human-like AI, we may need to engineer these biases. Meta-learning is one such technique, where a model learns an optimal set of initial weights by being trained on a wide variety of tasks. This provides a "soft bias" that guides the model toward effective solutions without rigidly constraining it, making it better at few-shot learning.
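The "soft bias" idea behind meta-learning can be shown with a toy Reptile-style loop (an illustrative sketch on a made-up task family, not a claim about any production system): the meta-learned initialization settles where a few gradient steps suffice for any task in the family.

```python
import random

random.seed(0)

def sgd_adapt(w, a, steps=10, lr=0.1):
    """Inner loop: adapt weight w to the task 'fit y = a*x' via SGD on squared error."""
    for _ in range(steps):
        x = random.uniform(-1, 1)
        grad = 2 * (w * x - a * x) * x   # d/dw of (w*x - a*x)^2
        w -= lr * grad
    return w

# Meta-training (Reptile): tasks are y = a*x with a drawn from U(2, 4).
# After each inner adaptation, nudge the shared init toward the adapted weights.
w_meta = 0.0
for _ in range(2000):
    a = random.uniform(2, 4)
    w_adapted = sgd_adapt(w_meta, a)
    w_meta += 0.1 * (w_adapted - w_meta)   # outer Reptile update

print(round(w_meta, 1))
```

The init converges near the centre of the task family (about 3.0 here), so a handful of inner steps reaches any task — a learned prior that guides without rigidly constraining, which is exactly the few-shot advantage described above.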
Deconstructing Large Language Models
Griffiths' research provides a scientific lens for understanding the behavior of LLMs.
⦁ Deductive vs. Inductive Problems: Early symbolic AI excelled at deductive problems (e.g., logic, chess), where all necessary information is provided. However, it struggled with inductive problems—the cornerstone of human intelligence—where conclusions must be drawn from incomplete information. Probability theory, particularly Bayes' rule, provides the mathematical framework for induction.
⦁ "Embers of Autoregression": LLMs are trained to predict the next token in a sequence, which makes them highly sensitive to the statistical patterns in their training data. This can lead to counter-intuitive behavior. For example, an LLM might be mo...
How A Team Of 7 Keeps Breaking AI Benchmark Records
Poetiq, a startup by former DeepMind researchers, has developed a recursive self-improvement meta-system that builds "reasoning harnesses" on top of existing LLMs. This approach avoids the costly "fine-tuning trap" and has achieved state-of-the-art results on benchmarks like ARC-AGI and Humanity's Last Exam by automatically optimizing prompts and discovering novel reasoning strategies.
Poetiq is building a recursively self-improving system that acts as a "reasoning harness" for large language models (LLMs). The core insight is a method for recursive self-improvement—where an AI makes itself smarter—that is significantly faster and cheaper than traditional approaches, which typically require retraining a new LLM from scratch at a cost of hundreds of millions of dollars.
The Challenge: The "Fine-Tuning Trap"
Many companies building on LLMs face a significant challenge: the "fine-tuning trap." The conventional approach involves:
1. Collecting a large dataset (tens of thousands of examples).
2. Spending a great deal on compute to fine-tune a frontier or open-weights model.
3. Achieving improved performance on a specific task.
However, this process is vulnerable to the "bitter lesson." Soon after, a new, more powerful base model is released (e.g., GPT-4o) that outperforms the specialized, fine-tuned model out of the box. The company is then faced with the choice of repeating the expensive fine-tuning process or going out of business.
Poetiq's Solution: "Stilts" for LLMs
Poetiq offers a different paradigm. Instead of fine-tuning, it provides an agentic system, or "harness," that sits on top of one or more LLMs. This harness enhances the base model's capabilities, acting like "stilts" to make it perform better on specific, hard problems.
Key advantages of this approach include:
⦁ Model Agnostic: When a new frontier model is released, the same harness remains compatible and provides an immediate performance boost.
⦁ Cost-Effective: The optimization process is vastly cheaper than fine-tuning.
⦁ Continuous Improvement: The harness can be further optimized for the new model to achieve even greater performance, ensuring the system always outperforms the underlying base models.
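Poetiq's internals aren't public, but the model-agnostic property is easy to illustrate with one common harness technique, self-consistency voting — offered here only as a guess at the *kind* of strategy such a harness might contain:

```python
from collections import Counter
from typing import Callable

def voting_harness(call_model: Callable[[str], str], prompt: str, n: int = 5) -> str:
    """Model-agnostic 'stilts': sample the base model n times, majority-vote.
    Upgrading to a newer model means swapping `call_model`; the harness is unchanged."""
    answers = [call_model(prompt) for _ in range(n)]
    return Counter(answers).most_common(1)[0][0]

# Hypothetical stand-in for a noisy base model: wrong on every third call.
calls = {"n": 0}
def flaky_model(prompt: str) -> str:
    calls["n"] += 1
    return "12" if calls["n"] % 3 else "13"

print(voting_harness(flaky_model, "What is 3 * 4?"))  # prints "12"
```

The point of the sketch is the separation of concerns: the base model is an interchangeable parameter, so the harness survives model releases rather than being obsoleted by them.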
How the Meta-System Works
Poetiq's core technology is a recursively self-improving meta-system. The output of this meta-system is not just a solution, but systems that solve hard problems. This automated optimization process can generate a complete reasoning harness—comprising code, prompts, and data—from scratch.
Furthermore, if a startup has already built its own agent, Poetiq can ingest that system and optimize its components, such as prompts or reasoning strategies. The meta-system analyzes the problem data to discover failure modes and identify robust reasoning paths, effectively outsourcing the deep, manual data analysis to the AI itself.
Automating Prompt and Reasoning Engineering
The system moves beyond simple prompt optimization. While automated prompt tuning (like the popular DSPy framework) provides some gains, the most significant improvements come from discovering novel reasoning strategies that are implemented in code.
In one example from a previous project at DeepMind, manual prompt optimization on a very hard task took performance to 5%. However, by adding optimized reasoning strategies, performance jumped from 5% to 95%. The Poetiq meta-system automates this discovery process, often generating non-intuitive prompts and strategies that a human might not devise.
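The simplest form of automated prompt search — a toy sketch with hypothetical candidates and a stand-in scorer, not Poetiq's method — is generate-score-select over a held-out dev set:

```python
def optimize_prompt(candidates, dev_set, call_model):
    """Score each candidate prompt on a dev set; return the best one.
    Real systems also mutate/recombine candidates and evolve reasoning code."""
    def score(prompt):
        hits = sum(call_model(prompt, q) == gold for q, gold in dev_set)
        return hits / len(dev_set)
    return max(candidates, key=score)

# Hypothetical stand-ins, for illustration only.
dev_set = [("2+2", "4"), ("3+3", "6")]
def call_model(prompt, question):
    # Pretend the model only gets arithmetic right when told to reason step by step.
    return str(eval(question)) if "step by step" in prompt else "?"

best = optimize_prompt(["Answer:", "Think step by step, then answer:"], dev_set, call_model)
print(best)
```

A meta-system extends this loop beyond prompt strings to the surrounding code — retry logic, decomposition, verification — which is where the reported 5%-to-95% jumps come from.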
Demonstrated Success on Key Benchmarks
Poetiq has validated its approach by achieving top rankings on difficult AI benchmarks:
⦁ ARC-AGI: Shortly after a new model set the state-of-the-art at 45%, Poetiq's system achieved 54% accuracy. Notably, it did so by building on a cheaper base model, costing only $32 per problem compared to the previous SOTA's $70+.
⦁ Humanity's Last Exam: On this set of 2,500 expert-level questions, Poetiq achieved a score of 55%, surpassing the previous record of 53.1% set by Anthropic's Claude Opus. The entire optimization run for this achievement cost less than $100,000, a fraction of the cost of training a foundation model.
These results demonstrate the system's ability to enhance both complex reasoning (ARC...
How AI is changing Software Engineering: A Conversation with Gergely Orosz, @pragmaticengineer
Gergely Orosz, author of The Pragmatic Engineer, discusses the bizarre trend of 'token maxing' in Big Tech, the evolving role of software engineers in the AI era, and why companies are heavily investing in internal AI infrastructure despite uncertain productivity gains.
The Rise of "Token Maxing" in Big Tech
A strange cultural phenomenon known as "token maxing" has emerged within large tech companies like Meta, Microsoft, and Salesforce. It stems from these companies measuring developers' usage of internal AI tools, often through public leaderboards or spend tracking. At Salesforce, for example, there's a tool to see how many dollars colleagues have spent on AI tokens, with some teams having a target minimum spend of around $175 per month.
This measurement, coupled with industry-wide job insecurity, has led engineers to feel pressured to increase their token count to avoid being perceived as low performers. This pressure results in counterproductive behaviors:
⦁ Artificial Usage: Engineers run autonomous agents to "build junk" or ask AI assistants to summarize documentation (even if the AI does a poor job) simply to generate tokens and stay out of the bottom percentile.
⦁ Weaponized Metrics: While token count is just one of many data points in performance reviews, it can be "weaponized." A low performer with a low token count is seen as "not even trying," while a high performer with a high token count is seen as an innovator.
This trend is reminiscent of earlier flawed metrics like "lines of code," but it's now being driven by top tech companies. It's a response to an initial push from leadership who, seeing the success of AI-native companies like Anthropic, wanted to force adoption among skeptical, experienced engineers. The situation at Coinbase was an extreme example, where the CEO mandated AI tool usage under the threat of termination.
Is AI Actually Making Engineers More Productive?
While individual productivity is certainly increasing, team-level productivity is more of a question mark. It's proven difficult to retrofit AI into established workflows. An interesting perspective is that the most significant productivity gain may not be for engineers themselves, but for their non-technical collaborators. By enabling them with coding agents, they no longer have to wait for an engineer, effectively creating "serverless developers" and unlocking organizational productivity.
Mastering these AI tools requires a new mindset:
⦁ Continuous Learning: There is no manual for AI tools. It takes a long time to get good, and workflows are constantly changing.
⦁ Practice Over Theory: Unlike traditional computer science, understanding the underlying theory of models doesn't necessarily make you a better user. Hands-on experience is key.
⦁ Open-Mindedness: Success requires a low-ego approach, being "open to leaving your priors behind" and experimenting.
The Changing Role of the Software Engineer
AI is accelerating a trend that was already underway: the expansion of the software engineer's role. The role has already absorbed responsibilities from dedicated tester and DevOps teams. Now, it's beginning to incorporate product skills, giving rise to the "product engineer."
Expectations for seniority and business awareness are increasing even for early-career engineers. As a result, teams are shrinking. A VP of Engineering at John Deere noted their "two-pizza teams" are becoming "one-pizza teams," a direct result of these new tools and expanded roles.
The idea that engineers are becoming "engineering managers for AI agents" is a flawed analogy. The role is more akin to a Tech Lead or someone operating a "mech suit" (an analogy from DHH). You orchestrate tasks and can do more, faster, without the people-related challenges of management—the drama, conflicts, and slow feedback loops. With agents, the feedback loop is immediate.
Big Tech's Massive Investment in Internal AI Infrastructure
Many large tech companies like Uber, Airbnb, and Meta are investing heavily in building bespoke internal AI infrastructure, even if it hasn't yet translated to a dramatic increase in external product features. They...
Building Generative Image & Video models at Scale - Sander Dieleman (Veo and Nano Banana)
Sander Dieleman from Google DeepMind provides a behind-the-scenes look at the key components of training large-scale diffusion models for audio-visual data. The talk covers the entire pipeline, from the critical role of data curation and latent representations to the mechanics of diffusion, network architectures, sampling with guidance, and advanced control signals.