Tokenless - The Best of AI, ML & CS Talks
Daily posts on the internals of AI, ML, and CS — straight from the experts. No hype, no bullshit news.

#AI #AInews #newsletter
The paper "Small Language Models are the Future of Agentic AI" posits that the trend toward ever-larger models may be misguided for agentic systems. Instead, it argues that a heterogeneous ecosystem of smaller, specialized models offers a more powerful, efficient, and economical path forward. The core intuition is to scale out by composing specialized "Lego" blocks (SLMs) rather than scaling up a single monolithic model (LLM).

The Three Pillars of the SLM Argument

The authors build their case on three primary arguments:

1. SLMs are Powerful Enough: Recent advancements have produced SLMs (e.g., Microsoft's Phi series) that achieve competitive performance on benchmarks for reasoning, language, and coding tasks when compared to models 10-20 times their size. For the vast majority of agentic tasks, which often involve a limited subset of an LLM's full capabilities, this level of performance is sufficient. The broad, general intelligence of a massive LLM can be "intelligence overkill" when an agent is only performing a narrow, specific function.

2. SLMs are Operationally Superior:
Performance: They offer significantly lower inference latency and require less memory, making them faster and easier to deploy.
Flexibility: Their small size allows for greater operational flexibility, including deployment on edge devices and consumer-grade GPUs without specialized infrastructure.
Behavioral Alignment: Agentic systems require predictable, structured interactions, often using formats like JSON or YAML. It's easier and more reliable to fine-tune an SLM to consistently produce a specific format, reducing the risk of hallucinations or formatting errors that can occur with a general-purpose LLM trained on countless formats.
Heterogeneity: Agentic workflows are naturally composed of diverse tasks with varying complexity. A system can dynamically choose the best model for each sub-task—a simple SLM for a simple task and perhaps a more powerful model only when necessary.

3. SLMs are More Economical:
Inference Costs: Serving a 7B parameter SLM can be 10-30 times cheaper than serving a 70B+ parameter LLM.
Operational Simplicity: SLMs avoid the complexities of multi-GPU and multi-node parallelization, simplifying infrastructure management and maintenance.
Fine-Tuning Agility: Fine-tuning an SLM requires only a few GPU hours, enabling rapid iteration and specialization, compared to the weeks and significant resources needed for large models.
Parameter Utilization: SLMs are fundamentally more efficient, activating a higher percentage of their parameters for a given task compared to the sparse activation in very large models.
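The heterogeneity argument lends itself to a small sketch: a router that dispatches each sub-task to the cheapest model whose capability covers it. The model names and the complexity scale below are illustrative assumptions, not from the paper.

```python
# Sketch of a heterogeneous model router for an agentic workflow.
# Model names and the complexity scale are illustrative, not from the paper.
from dataclasses import dataclass

@dataclass
class Task:
    prompt: str
    complexity: int  # 1 = trivial extraction ... 5 = open-ended reasoning

# Candidate models, ordered cheapest-first, with the highest
# task complexity each one is trusted to handle.
MODELS = [
    ("slm-extractor-1b", 2),
    ("slm-coder-7b", 4),
    ("llm-generalist-70b", 5),
]

def route(task: Task) -> str:
    """Return the cheapest model whose capability covers the task."""
    for name, max_complexity in MODELS:
        if task.complexity <= max_complexity:
            return name
    return MODELS[-1][0]  # fall back to the most capable model
```

A trivial extraction task routes to the 1B model, while an open-ended design task falls through to the 70B generalist; in this scheme the large model is invoked only when the task genuinely needs it.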

A compelling argument is that agentic systems naturally evolve toward SLMs. Each interaction an agent has (prompt, output, user feedback) generates valuable training data. Even if a system starts with an LLM, this continuous stream of task-specific data creates the perfect conditions for optimizing and distilling that capability into a smaller, expert model.
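That flywheel can be sketched as a logger that turns each interaction into a candidate fine-tuning record and keeps only the well-rated traces. The JSONL schema and the reward threshold here are illustrative assumptions, not from the paper.

```python
# Sketch: turning each agent interaction into candidate fine-tuning data.
# The JSONL schema and the reward threshold are illustrative assumptions.
import json

def log_interaction(log, prompt, output, feedback):
    """Append one (prompt, output, feedback) interaction as a JSONL line."""
    log.append(json.dumps({
        "prompt": prompt,
        "completion": output,
        "reward": feedback,  # e.g. user rating or task-success signal
    }))

def export_accepted(log, min_reward=0.8):
    """Keep only well-rated traces as distillation data for an SLM."""
    return [r for r in map(json.loads, log) if r["reward"] >= min_reward]
```

Over time, the accepted traces for a recurring task form exactly the specialized dataset needed to distill that capability out of the LLM and into an expert SLM.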

Counterarguments and Practical Challenges

The discussion also highlighted significant counterarguments and real-world barriers:

Scaling Laws and Generalization: LLMs benefit from scaling laws, giving them a more nuanced and abstract understanding of concepts, multi-linguality, and multi-modality that SLMs may lack. This deep generalization might be crucial for a top-level "supervisor" agent that needs to orchestrate complex tasks.
The Cost of a Fleet: While a single SLM is cheap, managing an entire fleet of specialized models introduces its own operational complexity and costs, including infrastructure, talent, and orchestration. Centralized LLM endpoints can benefit from higher utilization, potentially making them more cost-effective at scale than multiple, under-utilized SLM endpoints.
Real-World Ne...

Full story
The Day AI Solves My Puzzles Is The Day I Worry (Prof. Cristopher Moore)

Professor Cristopher Moore of the Santa Fe Institute discusses the surprising effectiveness of AI, arguing it stems from the rich, non-random structure of the real world. He explores the limits of current models, the nature of intelligence as creative problem-solving and abstraction, the importance of grounding and shared reality, and the profound implications of computational irreducibility and the need for algorithmic transparency in high-stakes applications.
Cristopher Moore, a self-described "frog" in the world of science, prefers diving deep into concrete problems over taking a high-level "bird's-eye view". This perspective informs his analysis of artificial intelligence, computational theory, and the nature of intelligence itself.

The Structure of the World and the Success of AI

The surprising effectiveness of large models like transformers stems not from a magical architecture, but from the nature of the data they are trained on. Real-world data is neither completely random nor adversarially designed to be difficult. Instead, it is filled with rich structure, patterns, and hierarchies. Moore argues: "the real world presents us with examples of these problems where there is so much rich structure to sink your teeth into."

Any sufficiently rich architecture can learn to exploit this structure. We will likely look back and realize that what truly matters is that "the world is structured and any architecture which is capable of capturing some of that structure is going to do well at prediction." This contrasts with theoretical work in computer science and statistical physics, which often proves hardness based on worst-case adversarial examples or purely random data models. While concepts like phase transitions—sharp shifts in problem difficulty based on signal-to-noise ratios, analogous to a magnet losing its field at a critical temperature—are powerful for understanding random problems, they don't capture the full picture of real-world AI performance.

Intelligence as Creative Problem-Solving

Despite their success, current models falter on tasks requiring novel reasoning and abstraction, such as modern Sudoku variants with complex, layered rules. These puzzles, designed by humans for humans, require insights and the creation of new logical constraints on the fly, a process current AI struggles with. Moore notes that the ability of AI to absorb rules and perform intelligent search "hasn't happened yet."

This highlights a deeper aspect of intelligence: the ability to transform hard problems into simpler ones. It's about inventing heuristics and new forms of "partial knowledge" to navigate a problem space. Humans fluidly switch their approach, asking "which piece can fit here?" and then "where can this piece go?" This process of formalization and mathematization is a crucial, creative step that often constitutes 90% of the work in scientific modeling. True intelligence involves inventing the variables and constraints to address a problem, not just solving a pre-defined one.

Grounding, Meaning, and Shared Reality

A significant limitation of current language models is their lack of grounding in the physical world. When asked to summarize a nuanced essay, a model might regress to the mean, producing a "lowest common denominator" summary based on common arguments about the topic, completely missing the author's unique, subtle point. This indicates a failure to grasp meaning beyond statistical correlation.

Moore, a self-professed Platonist, believes in a shared, objective reality of abstract concepts. When two people visualize a cube, they perceive the same object with 8 corners and 12 edges. This shared perception allows for meaningful agreement and correction. He suggests that once AI systems can utilize multimodal "workspaces"—to doodle, run code, and manipulate virtual objects—they will move closer to this kind of grounded understanding.

Computation, Universality, and Irreducibility

The conversation delves into the fundamental nature of computation and its relationship to intelligence.

Computational Irreducibility: Drawing on Stephen Wolfram's work, Moore discusses systems where there are no analytical shortcuts to predict a future state. To know the outcome, "you have to do the work" of simulating every intervening step. While our only method for proving a system is irreducible is to build a universal c...

Full story
921: NPUs vs GPUs vs CPUs for Local AI Workloads — with Dell’s Ish Shah and Shirish Gupta

Shirish Gupta and Ish Shah from Dell Technologies explore the evolving landscape of AI hardware. They discuss why Windows, enhanced by WSL 2, remains a dominant platform for developers, and delve into the distinct roles of CPUs, GPUs, and the increasingly important Neural Processing Units (NPUs). The conversation covers the trade-offs between local and cloud computing for AI workloads and introduces new hardware, like workstations with discrete NPUs, that are making on-device AI more powerful and accessible than ever.
The Operating System Debate: Windows for AI Development

While the AI and data science communities often gravitate towards Unix-based systems, Windows remains a formidable platform for development. Statistically, Windows is the most popular OS among software developers, used by approximately 64% of them. Its familiarity, user-friendliness, and compatibility with essential productivity applications make it an accessible starting point. For enterprise environments, Windows offers streamlined IT management and security integrations.

A common best practice is to develop in an environment that mirrors production, and since most large-scale ML deployments run on Linux, this has traditionally been a point of friction. However, the gap is closing significantly with Windows Subsystem for Linux (WSL 2), which allows a full Linux kernel to run directly on Windows. This provides developers with the "best of both worlds": the productivity and enterprise benefits of Windows alongside the native command-line tools and environment of Linux, eliminating the need for dual-booting.

A New Trinity of Processors: CPU, GPU, and NPU

The modern AI landscape requires a nuanced approach to processing, moving beyond a one-size-fits-all model. The hardware choice is no longer a simple decision; it has expanded into an "8x8" matrix of options tailored to specific needs. The key is using the right tool for the right job.

The Rise of the NPU (Neural Processing Unit)

An NPU is a specialized processor purpose-built to handle AI and ML workloads, particularly the vector math that forms the foundation of neural networks.
Efficiency is Key: NPUs are architected for maximum performance per watt. This is critical for mobile devices and laptops, where they can handle tasks like background blur or speech-to-text without draining the battery, offloading this work from the CPU.
Integrated vs. Discrete: NPUs can be integrated into the main chipset (SoC) or exist as powerful discrete cards. Dell has announced the Dell Pro Max mobile workstation, which will feature a discrete NPU, capable of running a 109-billion-parameter model locally. This is a game-changer for use cases requiring high performance in offline, secure, or latency-sensitive environments.

GPUs: The Powerhouse for Scalable Performance

GPUs remain the champions of parallel processing and are currently more versatile and scalable than NPUs for high-end AI tasks.
High-End Training and Inference: New hardware like the Nvidia Blackwell RTX Pro GPUs continues to push boundaries. For instance, the GB10 appliance will be capable of fine-tuning a 200-billion-parameter model locally.
Versatility: GPUs are not only for AI; they accelerate a wide range of tasks from CAD software to gaming, offering a dual-use benefit for many users. Their primary trade-off, especially on client devices, is higher power consumption compared to NPUs.

The CPU: The Enduring Workhorse

The CPU is still the core of the system, handling general-purpose tasks and running the operating system. If a system lacks a dedicated NPU or GPU, the CPU must also take on the new AI workloads, potentially impacting overall performance. Modern architectures like Intel's Lunar Lake are enhancing CPU capabilities with features like on-chip memory for faster data transfer and powerful integrated GPUs that can rival some entry-level discrete cards.

The Developer Experience: Bridging Hardware and Software

To simplify the complexity of targeting these different processors, Dell has introduced Dell Pro AI Studio. This software layer abstracts away the underlying toolchains (like Intel OpenVINO) required to run models on specific silicon (Intel, AMD, Qualcomm, etc.).
Democratizing Access: It dramatically reduces development time. In one case study, a task that took a team three months to complete using standard toolchains was accompl...

Full story
Why language models hallucinate, revisiting Amodei’s code prediction and AI in the job market

Experts discuss an OpenAI paper that reframes hallucinations as a feature driven by training incentives, not just a bug. The panel also revisits Dario Amodei's prediction on AI coding, explores AI's chaotic impact on the job market, and imagines the future of running LLMs on business-card-sized devices.
Reframing Language Model Hallucinations

A recent paper from OpenAI, "Why Language Models Hallucinate," suggests that the issue of hallucination is more complex than a simple bug to be fixed. The core argument is that the problem is inherent to the current model training paradigm.

Models are incentivized to guess rather than admit uncertainty. During training, particularly with reinforcement learning, a model receives a reward for a correct answer but gets zero points for stating "I don't know." This reward structure encourages the model to take a chance on an answer, as there's a possibility of being correct. This is compounded by the landscape of external evaluations and benchmarks, which often rely on binary (yes/no) scoring. Model providers, aiming for the highest possible scores on these leaderboards, are therefore disincentivized from training models that frequently express uncertainty.
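This incentive can be made concrete with a toy expected-value calculation; the scoring values here are illustrative, not taken from the paper.

```python
# Toy model of the scoring incentive: +1 for a correct answer, 0 for a
# wrong answer, 0 for abstaining ("I don't know"). Numbers illustrative.

def expected_reward_guess(p_correct, wrong_score=0.0):
    """Expected score of guessing with probability p_correct of being right."""
    return p_correct * 1.0 + (1.0 - p_correct) * wrong_score

ABSTAIN = 0.0  # binary benchmarks give no credit for "I don't know"

# Under binary scoring, even a 10%-confident guess beats abstaining:
low_confidence_guess = expected_reward_guess(0.10)               # 0.10

# A scheme that penalizes confident wrong answers flips the incentive,
# making abstention rational below 50% confidence:
penalized_guess = expected_reward_guess(0.10, wrong_score=-1.0)  # -0.80
```

Under the binary scheme, guessing strictly dominates abstention for any nonzero confidence, which is exactly the behavior the paper argues current training and leaderboards reward.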

The paper challenges the myth that simply increasing model accuracy will decrease hallucinations. The authors argue that accuracy and hallucination are different measures. The proposed solution is not to eliminate guessing but to achieve better calibration between accuracy and uncertainty through more sophisticated reward functions and evaluations.

This leads to a broader discussion on the role of hallucinations:

A Tool for Creativity: For certain use cases, such as generating creative text or adopting a persona (e.g., "act like a pirate"), hallucination is not a bug but a feature. It represents a form of creative inference, combining disparate concepts to generate novel outputs. A world without this capability would lead to rigid, uncreative models.
Need for a Better Definition: The community lacks a clear, agreed-upon definition of what constitutes a hallucination versus the model simply being incorrect due to conflicting data in its training set.
A Multi-Faceted Solution: Eliminating hallucinations entirely is likely impossible. The path forward involves a combination of better-calibrated models and a suite of external tools, including guardrails, symbolic approaches, and Retrieval-Augmented Generation (RAG), to verify model outputs against grounded context.

Revisiting the 90% AI Coding Prediction

In March, Anthropic's CEO Dario Amodei predicted that within six months, AI would be writing 90% of the code for software developers. With that timeframe now passed, the panel reflected on its accuracy.

The key distinction is between automation (replacing developers) and augmentation (assisting developers). AI has not replaced 90% of developers, but it is plausible that it now assists in generating a significant portion of code, perhaps approaching that 90% figure for developers who have fully integrated tools like GitHub Copilot.

The prediction might be correct in terms of technological capability, even if societal adoption and developer tooling haven't caught up yet. The discussion framed this shift as another layer of abstraction in software development, similar to the move to object-oriented programming or the adoption of ORMs that generate database code. However, there are still areas where these tools struggle, such as generating reliable and complex SQL, where understanding intricate database schemas remains a significant challenge.

AI's Impact on the "Hellish" Job Market

Referencing an article from The Atlantic, the discussion turned to the chaotic state of the job market, which has become an "arms race" between AI-powered tools.

Candidates use generative AI to automate job applications, tailor CVs, and even assist during interviews.
Recruiters use AI screening tools to filter the resulting flood of applications.

The outcome is a noisy, impersonal system where it's difficult for humans to connect. This has led to a renewed emphasis on "old-school" techniques like leveraging personal networks to find opportunities. ...

Full story
Fully autonomous robots are much closer than you think – Sergey Levine

Sergey Levine, co-founder of Physical Intelligence, outlines the path to general-purpose robots, predicting a 'self-improvement flywheel' could lead to fully autonomous household robots by 2030. He discusses the architecture of vision-language-action models, the critical role of embodiment in solving the data problem, and how robotics will scale faster than self-driving cars.
Sergey Levine, co-founder of Physical Intelligence and a professor at UC Berkeley, envisions a near future where general-purpose robots are commonplace, estimating a median timeline of 2030 for robots capable of autonomously running a household. The key is not a single breakthrough, but initiating a "self-improvement flywheel": deploying robots that are useful enough in narrow domains to begin collecting vast amounts of real-world experience, which is then used to improve the general model, enabling wider deployment and more data collection.

The Path to General-Purpose Robots: Flywheels and Common Sense

The central goal is to create robotic foundation models—general-purpose systems that can control any robot for any task. The initial challenge is not achieving full autonomy, but reaching a level of competence where the flywheel can start. This could begin with narrow, repetitive tasks and gradually expand in scope as the models improve, much like the evolution of coding assistants from simple autocompletion to generating entire pull requests.

Levine argues that robotics will scale faster than self-driving cars for several key reasons:
Learning from Mistakes: Many manipulation tasks are more forgiving than driving. A robot can make a mistake, like dropping a T-shirt, correct it, and learn from the experience. The consequences of a mistake in autonomous driving are far more severe, making this kind of trial-and-error learning difficult.
The Role of Common Sense: The advent of Large Language Models (LLMs) and Vision-Language Models (VLMs) provides a source of common sense that was absent in the early days of self-driving. A model can now be queried about abstract concepts ("What does a 'slippery floor' sign mean?") to infer potential outcomes without having to experience them firsthand.
Human-in-the-Loop Interaction: The feedback loop for improvement is more natural. A human supervising a robot can provide simple verbal instructions ("pick up the cup"), which serve as valuable training data. This seamless integration of human feedback accelerates on-the-job learning.

How Robotic Foundation Models Work

Physical Intelligence's models are built upon the architecture of pre-trained Vision-Language Models (VLMs). The core idea is to leverage the vast prior knowledge about the world embedded in these models.
Architecture: The model can be conceptualized as an LLM (like Google's open-source Gemma) with a "visual cortex" (a vision encoder) and a "motor cortex" (an action decoder) grafted onto it. It's an end-to-end transformer, often using a mixture-of-experts (MoE) structure.
Action Generation: The model takes in sensory data, performs internal chain-of-thought reasoning to break down a command (e.g., "clean the kitchen"), and then passes the final instruction to the action expert. Because motor control requires high-frequency, continuous outputs, the actions are not discrete tokens but are generated using techniques like diffusion or flow matching for precision.
Emergent Capabilities: This approach leads to compositional generalization and emergent behaviors not explicitly present in the training data. For instance, a robot tasked with folding laundry might encounter a second T-shirt accidentally picked up with the first. The model can reason to pick up the extraneous item and place it back in the bin before continuing its primary task. Another example is the robot turning shorts right-side out before folding them, demonstrating a deeper, compositional understanding of the task.

The Data and Representation Challenge

While scaling data is crucial, Levine emphasizes that it's not just about quantity. The challenge is identifying the right axes of scale to improve specific capabilities like robustness, efficiency, and edge-case handling.

A key problem in AI has been that video models have not been as effec...

Full story
Faster Science, Better Drugs

Erik Torenberg, Patrick Hsu (Arc Institute), and Jorge Conde (a16z) discuss Arc's moonshot to create 'virtual cells' using foundation models to simulate biology. They cover why science is slow, how AI can accelerate drug discovery by predicting cellular perturbations, and the remaining bottlenecks in clinical trials and capital intensity that the biotech industry faces.
The Core Problem: Why Science is Slow

Scientific progress, particularly in biology, is hindered by a "weird Gordian knot" of factors. Unlike AI research that moves at the speed of GPUs, biological research involves moving atoms and is constrained by the real-time processes of growing cells, tissues, and animals. The core issues are multifactorial, stemming from incentives, funding structures, and the training system.

A significant challenge is the increasing need for interdisciplinary collaboration. It is exceptionally difficult for individual research groups or companies to excel at more than two distinct domains simultaneously (e.g., computational biology and genomics). Modern problems require integrating five or more fields, such as neuroscience, immunology, machine learning, chemical biology, and genomics.

The Arc Institute's Approach: Fostering Collaboration

The Arc Institute was founded as an "organizational experiment" to address these challenges. By bringing experts from five distinct domains under one physical roof, the goal is to "increase the collision frequency" and unlock a new space of research problems. This model contrasts with a traditional university setting, where physical distance and misaligned incentives often discourage deep collaboration. In academia, researchers are primarily incentivized to publish their own papers and make their own discoveries, not necessarily to work on larger, collective flagship projects. Arc is designed to enable these larger projects, such as finding new Alzheimer's drug targets and building "virtual cells."

The Moonshot: Simulating Biology with "Virtual Cells"

The central moonshot at Arc is to create "virtual cells" to simulate human biology using foundation models. The ambition is to make these models the default tool for experimentalists, accelerating discovery to the speed of a neural network's forward pass.

However, modeling biology with AI is fundamentally harder than modeling language or images.
Lack of Native Intuition: Humans are native speakers of language and interpreters of images, making it easy to evaluate the output of models like GPT-4 or DALL-E. In contrast, we don't "speak the language of biology" and can only interpret it with a "thick accent." Evaluating a DNA foundation model's output is not intuitive.
The "Lab in the Loop" Bottleneck: The iteration cycle for biological models is slow because predictions must be validated with physical lab experiments. Increasing the speed and dimensionality of this experimental feedback is a critical challenge.
Incomplete Data: We are almost certainly not measuring all the critical components of a cell. While we can scale the measurement of transcriptional information (RNA), this is only a "lower resolution mirror" for what is happening at the protein or metabolic level. The strategy is to bet on what can be scaled today (genomics) and layer in other data modalities over time, trusting that scaling laws will help fill in the gaps.

Defining the Virtual Cell: Perturbation Prediction as the Goal

The most famous success of ML in biology is AlphaFold, which accurately predicts a protein's 3D structure from its amino acid sequence. The goal for virtual cells is to achieve a similar "AlphaFold moment" for cell biology.

At Arc, this is operationalized as perturbation prediction. The model's core task is to predict the necessary interventions to move a cell from one state to another across a manifold of cell states (e.g., from inflamed to quiescent, or from a fibroblast to a stem cell). This directly mirrors the process of drug discovery, which is fundamentally about finding a molecule (a perturbation) that shifts a cell from a disease state to a healthy one.

The objective is to create a practical "co-pilot for a wet lab biologist" that can suggest combinatorial perturbations and facilitate in-silico target identification, ultimately forming the basi...

Full story
Upwork's Radical Bet on Reinforcement Learning: Building RLEF from Scratch | Andrew Rabinovich (CTO)

Andrew Rabinovich, CTO and Head of AI at Upwork, details their strategy for building AI agents for digital work. He introduces a custom reinforcement learning approach called RLEF (Reinforcement Learning from Experience), explains why digital work marketplaces are ideal training grounds, and shares his vision for a future where AI delivers finished projects, orchestrated by a meta-agent named Uma.
Upwork's AI Strategy: From Matchmaking to Work Delivery

At Upwork, the AI strategy is centered around a meta-agent named Uma (Upwork's Mindful AI). Initially, Uma's role is to facilitate the connection between clients and freelancers by understanding a client's project needs and recommending the right talent. This represents a shift from a traditional marketplace to an AI-guided platform. The long-term vision, however, extends far beyond matchmaking to a point where a client describes a project to Uma, and Uma delivers the completed work.

A Hybrid AI Architecture: MoE, RAG, and Knowledge Graphs

Upwork employs a sophisticated, hybrid AI architecture rather than relying on a single monolithic model. The system is designed as a Mixture of Experts (MoE), where Uma possesses various specialized "skills," each fine-tuned for a specific task:
⦁ Creating a detailed job post from a client conversation.
⦁ Identifying and ranking suitable freelancers.
⦁ Assisting freelancers in drafting compelling proposals.
⦁ Helping clients evaluate and select freelancers based on those proposals.

To ground these skills in real-time platform data, Upwork heavily utilizes Retrieval-Augmented Generation (RAG). A crucial component of this is a knowledge graph that serves two purposes:
1. Routing: It directs queries to the appropriate data sources and RAG systems.
2. Inference and Query Expansion: It understands relationships between concepts, allowing for more intelligent search. For example, a search for a "front-end developer" can be automatically expanded to include related skills like "JavaScript" or "React," which are then used to retrieve a richer context for the language model.
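The expansion step can be sketched with a toy adjacency map standing in for the knowledge graph; the entries are illustrative and far simpler than a production graph.

```python
# Toy query expansion over an adjacency-map "knowledge graph".
# The graph contents are illustrative; a production graph is far richer.
KG = {
    "front-end developer": ["JavaScript", "React", "CSS"],
    "data scientist": ["Python", "pandas", "machine learning"],
}

def expand_query(query):
    """Return the original query plus related skills for richer retrieval."""
    return [query] + KG.get(query, [])
```

The expanded list is then handed to the retrieval layer, so a search for "front-end developer" also pulls documents matching "JavaScript" and "React" into the language model's context.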

A key advantage for tuning this RAG system is Upwork's vast amount of "self-curating" data. A successful contract between a client and a freelancer serves as a strong positive label, validating the effectiveness of the search and matching process. This feedback loop allows for continuous optimization of data sources and retrieval algorithms.

The Thesis: Digital Work as the Ultimate Agent Training Ground

Digital work marketplaces provide a unique and powerful environment for training AI agents, superior in many ways to traditional methods like simulation or game-based self-play.

If you allow agents to learn from an environment that is realistic and relevant to their real world, then the results are really incredible... digital work is the kind of domain where it is almost okay to make mistakes, so long as you can learn from them.

Unlike training self-driving cars where mistakes have severe consequences, or game environments like Go where the reward function is clearly defined, digital work offers a real-world setting with low-stakes failure. This allows agents to explore unconventional solutions—the equivalent of AlphaGo's "Move 37"—in creative and business tasks. The challenge in the real world is the absence of a predefined value function. Upwork solves this by leveraging its network of human experts who can evaluate the agents' outputs and provide the necessary reward signals.

RLEF: A New Paradigm Beyond RLHF

To train these agents, Upwork is developing a novel framework called Reinforcement Learning from Experience (RLEF), which diverges significantly from the more common Reinforcement Learning from Human Feedback (RLHF).

RLHF: Typically involves humans ranking a set of machine-generated outputs (e.g., A is better than B). This confines the model's learning to the boundaries of human preference and imagination.
RLEF: Allows an agent to freely explore a vast landscape of possible solutions. A human expert then provides a direct reward signal on the final output, similar to classical reinforcement learning. This encourages the agent to discover solutions that a human might never have conceived.

To overcome the sample inefficiency of RL, Upwork's approach leverages the...

Full story
Dion: The distributed orthonormal update revolution is here

Kwangjun Ahn from Microsoft Research introduces Dion, a next-generation optimizer that improves upon Muon by using amortized power iteration. Dion enables efficient, scalable training for massive models by orthonormalizing a low-rank subspace, reducing compute and communication overhead in distributed settings.
While Adam and its variants have long been the standard for training AI models, the "orthonormal updates revolution," led by optimizers like Muon, has demonstrated the potential for faster convergence, more stable training, and better performance with large batch sizes. This new class of optimizers has already been used in production-level models such as Kimi-K2 and GLM-4.5.

The Principle of Orthonormal Updates

Unlike a standard Stochastic Gradient Descent (SGD) step, which updates weights Xₜ directly along the negative gradient Gₜ (Xₜ = Xₜ₋₁ - ηGₜ), an orthonormal update first decomposes the gradient matrix Gₜ via Singular Value Decomposition (SVD) into UΣVᵀ. The optimizer then uses only the orthonormal component, UVᵀ, as the update direction.

The theoretical justification for this approach is that an orthonormal update transforms any input activation by an equal amount, which in practice leads to significant performance improvements. To avoid the prohibitive cost of a full SVD at every step, the Muon optimizer implements this orthonormalization using an iterative matrix multiplication method known as the Newton-Schulz iteration.
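Both routes to the orthonormal factor UVᵀ can be sketched in a few lines; note that the iteration below is the simple cubic Newton-Schulz variant, whereas Muon's production implementation uses a tuned quintic polynomial.

```python
# Sketch: computing the orthonormal factor U V^T of a gradient matrix,
# exactly via SVD and iteratively via a cubic Newton-Schulz variant
# (Muon's production implementation uses a tuned quintic instead).
import numpy as np

def orthonormalize_svd(G):
    """Exact but expensive: decompose G = U S V^T and keep U V^T."""
    U, _, Vt = np.linalg.svd(G, full_matrices=False)
    return U @ Vt

def orthonormalize_newton_schulz(G, steps=30):
    """X <- 1.5 X - 0.5 (X X^T) X drives every singular value toward 1.
    Normalizing by the Frobenius norm first keeps the iteration stable."""
    X = G / np.linalg.norm(G)
    for _ in range(steps):
        X = 1.5 * X - 0.5 * (X @ X.T) @ X
    return X

rng = np.random.default_rng(0)
G = rng.standard_normal((8, 4))  # stand-in for a gradient matrix
```

For a well-conditioned matrix the two agree to high precision, and the iterative path uses only matrix multiplications, which is what makes it GPU-friendly compared to a full SVD.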

Muon's Scalability Wall

Despite its benefits, Muon encounters a significant bottleneck when scaling to dense models with over 100 billion parameters. The core issue lies in the Newton-Schulz iteration, which requires dense matrix multiplications on the full weight matrices. This requirement clashes directly with common distributed training strategies like weight sharding (e.g., FSDP), where each GPU only holds a partial slice of the weights. To perform the full matrix multiplication, the system must either engage in heavy cross-shard communication to reconstruct the full matrix or perform redundant computations, both of which severely limit scalability.

Dion: Orthonormal Updates for Distributed Training

Dion is a next-generation optimizer designed to provide the benefits of orthonormal updates without the scalability limitations of Muon. The central design question it answers is: "Can we design orthonormal updates without full-matrix materialization?"

To achieve this, Dion replaces Muon's Newton-Schulz iteration with amortized power iteration. This technique is far more compatible with sharded weights, as it does not require the materialization of the full matrix. While preserving the convergence and stability benefits of Muon, Dion introduces several key features that enhance scalability.
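A simplified sketch of a single low-rank power-iteration step conveys the flavor. This is an assumption-laden illustration (variable names and structure are ours, and it omits Dion's error feedback and sharding logic), not the actual implementation:

```python
import numpy as np

# Sketch of a rank-r orthonormal update via one amortized power-iteration
# step, in the spirit of Dion (simplified; names are assumptions, and the
# real algorithm adds error feedback and operates on sharded weights).
def dion_step(G, Q):
    """G: (m, n) gradient; Q: (n, r) right factor carried between steps."""
    P, _ = np.linalg.qr(G @ Q)       # (m, r) orthonormal approx. left factor
    R, _ = np.linalg.qr(G.T @ P)     # (n, r) refreshed orthonormal right factor
    update = P @ R.T                 # rank-r update, all singular values one
    return update, R                 # R is carried over as the next step's Q
```

Because the warm-started factor Q persists across steps, a single cheap iteration per step suffices ("amortized"), and no operation ever materializes a full m×n matrix product beyond the gradient itself — which is what makes the scheme compatible with weight sharding.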

Key Features and Innovations

Low-Rank Fraction as a Scalability Lever: Instead of orthonormalizing the entire gradient, Dion operates on a top-r subspace. This "low-rank fraction" acts as a new scalability lever, allowing practitioners to trade off computational and communication costs. A lower rank means a cheaper optimizer step.
Error-Feedback Mechanism: To maintain the quality of the updates at lower ranks, Dion incorporates an error-feedback mechanism, ensuring that performance is not sacrificed for speed.
Efficiency at Scale: Empirical studies show that as model size increases, the performance gap between high-rank and low-rank Dion narrows. This suggests that for very large models, one can "get away with a smaller-rank fraction," making Dion increasingly efficient at scale.
Performance Benchmarks: Microbenchmarks demonstrate that Dion's time-per-step is significantly lower than Muon's, especially for large matrices. For a Llama 3 405-billion parameter dense model configuration, Dion is shown to be "a lot more tractable" and feasible, unlike Muon.
Compatibility and Flexibility: Dion is designed for the realities of modern large-scale training. The open-source implementation includes efficient support for both one-way and two-way weight sharding. The underlying algorithm also allows for greater flexibility, leading to variants like "Lazy-Dion" for further speedups.
...

Full story
From Vibe Coding to Vibe Researching: OpenAI’s Mark Chen and Jakub Pachocki

OpenAI’s Chief Scientist, Jakub Pachocki, and Chief Research Officer, Mark Chen, discuss the research behind GPT-5, the push toward long-horizon reasoning, and the grand vision of an automated researcher. They cover how OpenAI evaluates progress beyond saturated benchmarks, the surprising durability of reinforcement learning, and the culture required to protect fundamental research while shipping world-class products.
GPT-5: Integrating Reasoning into the Mainstream

GPT-5 represents a deliberate effort to merge two distinct lines of model development: the instant-response models of the GPT-2/3/4 series and the more contemplative "O series" models, which were designed to "think" for a longer time to produce the best possible answer. The goal was to eliminate the user's need to choose a mode by having the model intelligently determine the appropriate amount of reasoning for any given prompt. This fusion is a foundational step toward delivering more agentic behavior and making advanced reasoning a default capability.

While the model features improvements across the board, the primary focus was to make this reasoning mode accessible to a broader audience, paving the way for more sophisticated agentic systems.

The Future of Evaluation: From Benchmarks to Economic Impact

Traditional benchmarks and evaluations are becoming saturated, and inching up from 98% to 99% is no longer the most meaningful measure of progress. The research paradigm has shifted. Previously, a single pre-training recipe was applied, and evals served as a general yardstick. Now, with reinforcement learning focused on specific reasoning domains, models can be trained to become experts in a narrow area. This can lead to exceptional performance on some evals but doesn't guarantee broad generalization.

OpenAI acknowledges a "deficit of great evaluations" and is shifting focus to new frontiers:
Real-World Competitions: The most exciting progress has been in mathematics and programming competitions (e.g., AtCoder, the IMO). These are seen as valid proxies for success in future research, as many top human researchers have backgrounds in these contests.
Automated Discovery: The next generation of milestones and evaluations will focus on the model's ability to make novel discoveries and achieve results that are "economically relevant." The ultimate test is whether the model can generate new, valuable ideas.

The Grand Vision: An Automated Researcher

The central, long-term goal of OpenAI's research is to create an "automated researcher"—a system capable of automating the discovery of new ideas. While this includes the self-referential goal of automating machine learning research, the vision extends to accelerating progress in all other scientific fields.

A key metric for tracking progress toward this goal is the time horizon over which a model can autonomously reason and work on a problem. Current models are approaching mastery on tasks that take one to five hours, such as high school math competitions. The research is now focused on extending this horizon by enhancing the model's ability to plan over longer periods and retain memory. This vision transforms the nature of research from manual execution to a more intuitive process, which Jakub Pachocki calls "vibe researching."

The Surprising Power of Reinforcement Learning (RL)

For years, skeptics have predicted that the performance gains from RL would plateau due to challenges like mode collapse or generalization failures. However, RL has proven to be the "gift that keeps on giving." Its enduring success comes from combining the powerful, versatile learning paradigm of RL with the incredibly rich and nuanced environment provided by large-scale language pre-training.

For a long time, the main challenge for RL was finding the right environment. The breakthrough of language modeling provided a robust, complex world for RL agents to operate in, unlocking a vast number of new and promising research directions. For businesses looking to apply RL, the advice is to remain flexible; methods for tasks like reward modeling will evolve and become simpler, moving closer to "humanlike learning."

The Evolution of Coding with AI

The latest GPT-5 Codex model aims to translate the raw intelligence of reasoning models into practical utility for messy, real-world coding ...

Full story
Richard Sutton – Father of RL thinks LLMs are a dead end

Richard Sutton, a foundational figure in reinforcement learning, argues that Large Language Models (LLMs) are a flawed paradigm for achieving true intelligence. He posits that LLMs are mimics of human-generated text, lacking genuine goals, world models, and the ability to learn continually from experience. Sutton advocates for a return to the principles of reinforcement learning, where an agent learns from the consequences of its actions in the real world, a method he believes is truly scalable and fundamental to all animal and human intelligence.
Richard Sutton, a key architect of modern reinforcement learning (RL), presents a viewpoint that challenges the current enthusiasm for Large Language Models (LLMs). He argues that the LLM paradigm, focused on mimicking human-generated data, is a deviation from the fundamental principles of intelligence. True intelligence, in his view, is not about imitation but about learning to achieve goals through direct experience with the world.

LLMs vs. The Reinforcement Learning Paradigm

Sutton draws a sharp distinction between the goals of LLMs and RL.
LLMs as Mimics: He characterizes LLMs as systems designed to "mimic people." They learn from a static corpus of what humans have said or written, implicitly suggesting that the right action is to do what a person did in a similar situation.
RL as Experiential Learning: In contrast, RL is about an agent "figuring out what to do" on its own. It is grounded in the concept of learning from experience, where an agent takes actions, observes consequences, and updates its behavior to achieve a goal.

A core point of disagreement is the nature of the "world model" in LLMs. While LLMs can predict what a person might say next with high accuracy, Sutton argues this is not a true world model. A genuine world model predicts the consequences of actions in the world, not just patterns in a text corpus. An agent with a real world model would be "surprised" by an unexpected outcome and adjust its understanding accordingly, a capability he claims LLMs fundamentally lack in their current architecture.

The Necessity of Goals and Ground Truth

For Sutton, a goal is the essence of intelligence. He cites John McCarthy's definition: "intelligence is the computational part of the ability to achieve goals."
LLMs, he argues, lack a substantive goal related to the external world. Next-token prediction is a goal related to the data, not to influencing or achieving something in the environment.
In RL, the reward signal provides a clear definition of what is "right"—the action that maximizes future reward. This creates a ground truth. An agent can test its knowledge and actions against this ground truth continually during its interaction with the world.
LLMs lack this ground truth. Without a goal, there is no objective measure of a "right" or "wrong" response, only what is more or less probable according to the training data. This makes true continual learning impossible, as there is no signal to learn from during deployment.

The Bitter Lesson Revisited

Sutton's influential essay, "The Bitter Lesson," posits that general methods leveraging massive computation ultimately outperform approaches that rely on human knowledge. He sees the current LLM trend as another potential instance of this lesson.

While LLMs leverage massive compute, they are fundamentally dependent on a massive corpus of human knowledge (the internet). Sutton predicts that this reliance on human-generated data is a bottleneck. Systems that can learn directly from experience have access to a much more scalable source of data and will eventually supersede those limited to human text. He is skeptical of the idea of using LLMs as a "prior" for experiential learning, noting that historically, "people get locked into the human knowledge approach, and they get their lunch eaten by the methods that are truly scalable."

How Animals Learn: Experience over Imitation

Sutton firmly rejects the idea that imitation is the primary learning mechanism for humans or animals.
"It's obvious—if you look at animals and how they learn, and you look at psychology and our theories of them—that supervised learning is not part of the way animals learn. We don't have examples of desired behavior... Supervised learning is not something that happens in nature."

He argues that from infancy, humans and animals learn through active trial-and-error—waving their hands, moving their e...

Full story
29.4% ARC-AGI-2 🤯 (TOP SCORE!) - Jeremy Berman

Jeremy Berman, winner of the ARC-AGI v2 public leaderboard, discusses his novel evolutionary approach that refines natural language descriptions instead of code. He explores the idea of building AI that synthesizes new knowledge by constructing deductive "knowledge trees" rather than merely compressing data into "knowledge webs," touching on the fundamental challenges of reasoning, continual learning, and creativity in current models.
Jeremy Berman, a research scientist at Reflection AI, recently won the ARC-AGI v2 public leaderboard with an elegant evolutionary algorithm. In a significant shift from his v1 solution, which evolved Python programs, his new architecture evolves natural language descriptions of algorithms. This approach propelled him to the top of the leaderboard with approximately 30% accuracy, highlighting a move towards more expressive and general problem-solving frameworks.

From Python to Natural Language: A More Expressive Approach

Berman's initial success on ARC v1 involved generating and iteratively refining Python programs. He found that even for simple tasks, models struggled on the first attempt, but a revision loop that fed back errors significantly improved performance. Python was chosen for its deterministic nature and the ease of verifying a solution's correctness.

However, ARC v2 introduced more compositional tasks with multiple rules, which proved difficult to express concisely in Python. Berman observed that any ARC v2 task could be described in 5-10 bullet points of plain English. This led to the core innovation of his v2 solution: switching from evolving Python code to evolving natural language descriptions.

"Really what you want is more expressive program. And so that's why I switched from Python to English which is a much more expressive program. You can describe every single ARC v2 task in 10 bullet points of plain English, most of them in five bullet points."

This shift came with a trade-off. While natural language is more expressive and allows the model to leverage its inductive biases more fully, it isn't directly executable. This necessitates a "checker" agent to interpret the natural language instructions and generate the output grid for verification. Interestingly, Berman found that the checker agent needed to be a more powerful model than the instruction-creating agent to work effectively.
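The propose-score-revise structure described above can be sketched as a generic evolutionary loop. In the real system the `propose` and `revise` hooks would be LLM calls that generate and refine English descriptions, and `score` would be the stronger checker model executing a description against the training grids; here they are injected functions, so only the loop's structure is claimed, not Berman's implementation:

```python
import random

# Sketch of the evolutionary refinement loop (structure only; in Berman's
# system `propose`/`revise` are LLM calls producing natural-language
# descriptions, and `score` is the stronger checker model — assumptions here).
def evolve(propose, revise, score, population=8, generations=10, seed=0):
    rng = random.Random(seed)
    candidates = [propose(rng) for _ in range(population)]
    for _ in range(generations):
        candidates.sort(key=score, reverse=True)        # rank by checker score
        survivors = candidates[: population // 2]       # keep the best half
        children = [revise(d, rng) for d in survivors]  # revised variants
        candidates = survivors + children
    return max(candidates, key=score)
```

Note the asymmetry the text describes: the generator hook can be a weaker model, but the `score` function — the checker that must faithfully execute a natural-language description — needs to be the more powerful one.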

Knowledge Trees vs. Knowledge Webs

A central theme of the discussion is the distinction between memorized knowledge and deduced knowledge. Berman posits that pre-training treats all information as a "knowledge web"—a network of connected embeddings without a guaranteed causal structure. This is why models can feel like "stochastic parrots." As he memorably quoted from his paper:

"A parrot that lives in a courthouse will regurgitate more correct statements than a parrot that lives in a mad house."

True intelligence, he argues, is about compression through deduction. It involves building a "knowledge tree" from foundational axioms, where knowledge is causally and logically structured. Reasoning is the process of pruning the knowledge web and replacing it with this deductive tree. Reinforcement learning with verifiable rewards is a key process for this, as it forces the model's internal circuits to align with the deductive, causal structure of a correct solution.

Understanding is the possession of this knowledge tree.
Intelligence is the efficiency with which an agent can build and expand its garden of trees.
Reasoning is the meta-skill of building the tree, which can then be applied to learn all other skills.

Fundamental Challenges: Continual Learning and Creativity

The conversation delves into the core limitations of current AI systems, particularly catastrophic forgetting. When a model is fine-tuned on a new task, it risks losing its existing knowledge and capabilities.

"The ideal system would be we have a set of data. Our language model is bad at a certain thing. We can just give it this data and then all of a sudden it keeps all of its knowledge and then also gets really good at this new thing. We are not there yet. And that to me is like a fundamental missing part."

Berman suggests that future breakthroughs may involve making models more composable, perhaps by freezing expert layers or modules, creating an architecture that ca...

Full story