Dion: The distributed orthonormal update revolution is here
Kwangjun Ahn from Microsoft Research introduces Dion, a next-generation optimizer that improves upon Muon by using amortized power iteration. Dion enables efficient, scalable training for massive models by orthonormalizing a low-rank subspace, reducing compute and communication overhead in distributed settings.
While Adam and its variants have long been the standard for training AI models, the "orthonormal updates revolution," led by optimizers like Muon, has demonstrated the potential for faster convergence, more stable training, and better performance with large batch sizes. This new class of optimizers has already been used in production-level models such as Kimi-K2 and GLM-4.5.
The Principle of Orthonormal Updates
Unlike a standard Stochastic Gradient Descent (SGD) step, which updates weights Xₜ directly along the negative gradient Gₜ (Xₜ = Xₜ₋₁ − ηGₜ), an orthonormal update first decomposes the gradient matrix Gₜ via Singular Value Decomposition (SVD) into UΣVᵀ. The optimizer then uses only the orthonormal component, UVᵀ, as the update direction.
The theoretical justification for this approach is that an orthonormal update transforms any input activation by an equal amount, which in practice leads to significant performance improvements. To avoid the prohibitive cost of a full SVD at every step, the Muon optimizer implements this orthonormalization using an iterative matrix multiplication method known as the Newton-Schulz iteration.
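To make the two routes concrete, here is a minimal PyTorch sketch: an exact orthonormal update via SVD, and the classical cubic Newton-Schulz iteration that approximates the same UVᵀ factor using only matrix multiplications. Muon uses a tuned quintic variant rather than this textbook form, so the snippet is illustrative rather than Muon's implementation.

```python
import torch

def orthonormalize_svd(G: torch.Tensor) -> torch.Tensor:
    # Exact but expensive: G = U Σ Vᵀ, keep only the orthonormal factor U Vᵀ.
    U, _, Vh = torch.linalg.svd(G, full_matrices=False)
    return U @ Vh

def orthonormalize_newton_schulz(G: torch.Tensor, steps: int = 15) -> torch.Tensor:
    # Textbook cubic Newton-Schulz: X ← ½ X (3I − XᵀX) converges to U Vᵀ
    # when X₀'s singular values lie in (0, √3); scaling by the Frobenius
    # norm guarantees that. (Muon itself uses a tuned quintic variant.)
    X = G / G.norm()
    I = torch.eye(G.shape[1], dtype=G.dtype)
    for _ in range(steps):
        X = 0.5 * X @ (3 * I - X.T @ X)
    return X

G = torch.randn(64, 32)
# The two routes agree up to iteration error:
print(torch.dist(orthonormalize_svd(G), orthonormalize_newton_schulz(G)))
```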
Muon's Scalability Wall
Despite its benefits, Muon encounters a significant bottleneck when scaling to dense models with over 100 billion parameters. The core issue lies in the Newton-Schulz iteration, which requires dense matrix multiplications on the full weight matrices. This requirement clashes directly with common distributed training strategies like weight sharding (e.g., FSDP), where each GPU only holds a partial slice of the weights. To perform the full matrix multiplication, the system must either engage in heavy cross-shard communication to reconstruct the full matrix or perform redundant computations, both of which severely limit scalability.
Dion: Orthonormal Updates for Distributed Training
Dion is a next-generation optimizer designed to provide the benefits of orthonormal updates without the scalability limitations of Muon. The central design question it answers is: "Can we design orthonormal updates without full-matrix materialization?"
To achieve this, Dion replaces Muon's Newton-Schulz iteration with amortized power iteration. This technique is far more compatible with sharded weights, as it does not require the materialization of the full matrix. While preserving the convergence and stability benefits of Muon, Dion introduces several key features that enhance scalability.
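A heavily simplified sketch of the idea, under simplifications of my own (this is not the paper's exact algorithm): maintain a warm-started rank-r right factor Q, refine it with one power-iteration pass per optimizer step, and keep the unexplained residual around as error feedback. Only (m×r) and (n×r) matrices are ever formed, which is what makes the scheme shard-friendly.

```python
import torch

def amortized_power_iter_step(M: torch.Tensor, Q: torch.Tensor):
    # One cheap refinement pass; Q is warm-started from the previous step,
    # so a single pass per step suffices to track the top-r subspace.
    P, _ = torch.linalg.qr(M @ Q)      # (m, r): column-orthonormal left factor
    Q, _ = torch.linalg.qr(M.T @ P)    # (n, r): refreshed right factor
    return P, Q

m, n, r = 512, 256, 32
M = torch.randn(m, n)                  # stand-in for the momentum matrix
Q = torch.randn(n, r)                  # warm-started low-rank basis

P, Q = amortized_power_iter_step(M, Q)
update = P @ Q.T                       # rank-r orthonormal update direction
residual = M - (M @ Q) @ Q.T           # error feedback: what rank r missed
```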
Key Features and Innovations
⦁ Low-Rank Fraction as a Scalability Lever: Instead of orthonormalizing the entire gradient, Dion operates on a top-r subspace. This "low-rank fraction" acts as a new scalability lever, allowing practitioners to trade off computational and communication costs. A lower rank means a cheaper optimizer step.
⦁ Error-Feedback Mechanism: To maintain the quality of the updates at lower ranks, Dion incorporates an error-feedback mechanism, ensuring that performance is not sacrificed for speed.
⦁ Efficiency at Scale: Empirical studies show that as model size increases, the performance gap between high-rank and low-rank Dion narrows. This suggests that for very large models, one can "get away with a smaller-rank fraction," making Dion increasingly efficient at scale.
⦁ Performance Benchmarks: Microbenchmarks demonstrate that Dion's time-per-step is significantly lower than Muon's, especially for large matrices. For a Llama 3 405-billion-parameter dense model configuration, Dion is shown to be "a lot more tractable" and feasible, unlike Muon.
⦁ Compatibility and Flexibility: Dion is designed for the realities of modern large-scale training. The open-source implementation includes efficient support for both one-way and two-way weight sharding. The underlying algorithm also allows for greater flexibility, leading to variants like "Lazy-Dion" for further speedups.
...
Full story
From Vibe Coding to Vibe Researching: OpenAI’s Mark Chen and Jakub Pachocki
OpenAI’s Chief Scientist, Jakub Pachocki, and Chief Research Officer, Mark Chen, discuss the research behind GPT-5, the push toward long-horizon reasoning, and the grand vision of an automated researcher. They cover how OpenAI evaluates progress beyond saturated benchmarks, the surprising durability of reinforcement learning, and the culture required to protect fundamental research while shipping world-class products.
GPT-5: Integrating Reasoning into the Mainstream
GPT-5 represents a deliberate effort to merge two distinct lines of model development: the instant-response models of the GPT-2/3/4 series and the more contemplative "O series" models, which were designed to "think" for a longer time to produce the best possible answer. The goal was to eliminate the user's need to choose a mode by having the model intelligently determine the appropriate amount of reasoning for any given prompt. This fusion is a foundational step toward delivering more agentic behavior and making advanced reasoning a default capability.
While the model features improvements across the board, the primary focus was to make this reasoning mode accessible to a broader audience, paving the way for more sophisticated agentic systems.
The Future of Evaluation: From Benchmarks to Economic Impact
Traditional benchmarks and evaluations are becoming saturated, and inching up from 98% to 99% is no longer the most meaningful measure of progress. The research paradigm has shifted. Previously, a single pre-training recipe was applied, and evals served as a general yardstick. Now, with reinforcement learning focused on specific reasoning domains, models can be trained to become experts in a narrow area. This can lead to exceptional performance on some evals but doesn't guarantee broad generalization.
OpenAI acknowledges a "deficit of great evaluations" and is shifting focus to new frontiers:
⦁ Real-World Competitions: The most exciting progress has been in mathematics and programming competitions (e.g., AtCoder, the IMO). These are seen as valid proxies for success in future research, as many top human researchers have backgrounds in these contests.
⦁ Automated Discovery: The next generation of milestones and evaluations will focus on the model's ability to make novel discoveries and achieve results that are "economically relevant." The ultimate test is whether the model can generate new, valuable ideas.
The Grand Vision: An Automated Researcher
The central, long-term goal of OpenAI's research is to create an "automated researcher"—a system capable of automating the discovery of new ideas. While this includes the self-referential goal of automating machine learning research, the vision extends to accelerating progress in all other scientific fields.
A key metric for tracking progress toward this goal is the time horizon over which a model can autonomously reason and work on a problem. Current models are approaching mastery on tasks that take one to five hours, such as high school math competitions. The research is now focused on extending this horizon by enhancing the model's ability to plan over longer periods and retain memory. This vision transforms the nature of research from manual execution to a more intuitive process, which Jakub Pachocki calls "vibe researching."
The Surprising Power of Reinforcement Learning (RL)
For years, skeptics have predicted that the performance gains from RL would plateau due to challenges like mode collapse or generalization failures. However, RL has proven to be the "gift that keeps on giving." Its enduring success comes from combining the powerful, versatile learning paradigm of RL with the incredibly rich and nuanced environment provided by large-scale language pre-training.
For a long time, the main challenge for RL was finding the right environment. The breakthrough of language modeling provided a robust, complex world for RL agents to operate in, unlocking a vast number of new and promising research directions. For businesses looking to apply RL, the advice is to remain flexible; methods for tasks like reward modeling will evolve and become simpler, moving closer to "humanlike learning."
The Evolution of Coding with AI
The latest GPT-5 Codex model aims to translate the raw intelligence of reasoning models into practical utility for messy, real-world coding ...
Full story
Richard Sutton – Father of RL thinks LLMs are a dead end
Richard Sutton, a foundational figure in reinforcement learning, argues that Large Language Models (LLMs) are a flawed paradigm for achieving true intelligence. He posits that LLMs are mimics of human-generated text, lacking genuine goals, world models, and the ability to learn continually from experience. Sutton advocates for a return to the principles of reinforcement learning, where an agent learns from the consequences of its actions in the real world, a method he believes is truly scalable and fundamental to all animal and human intelligence.
Richard Sutton, a key architect of modern reinforcement learning (RL), presents a viewpoint that challenges the current enthusiasm for Large Language Models (LLMs). He argues that the LLM paradigm, focused on mimicking human-generated data, is a deviation from the fundamental principles of intelligence. True intelligence, in his view, is not about imitation but about learning to achieve goals through direct experience with the world.
LLMs vs. The Reinforcement Learning Paradigm
Sutton draws a sharp distinction between the goals of LLMs and RL.
⦁ LLMs as Mimics: He characterizes LLMs as systems designed to "mimic people." They learn from a static corpus of what humans have said or written, implicitly suggesting that the right action is to do what a person did in a similar situation.
⦁ RL as Experiential Learning: In contrast, RL is about an agent "figuring out what to do" on its own. It is grounded in the concept of learning from experience, where an agent takes actions, observes consequences, and updates its behavior to achieve a goal.
A core point of disagreement is the nature of the "world model" in LLMs. While LLMs can predict what a person might say next with high accuracy, Sutton argues this is not a true world model. A genuine world model predicts the consequences of actions in the world, not just patterns in a text corpus. An agent with a real world model would be "surprised" by an unexpected outcome and adjust its understanding accordingly, a capability he claims LLMs fundamentally lack in their current architecture.
The Necessity of Goals and Ground Truth
For Sutton, a goal is the essence of intelligence. He cites John McCarthy's definition: "intelligence is the computational part of the ability to achieve goals."
⦁ LLMs, he argues, lack a substantive goal related to the external world. Next-token prediction is a goal related to the data, not to influencing or achieving something in the environment.
⦁ In RL, the reward signal provides a clear definition of what is "right"—the action that maximizes future reward. This creates a ground truth. An agent can test its knowledge and actions against this ground truth continually during its interaction with the world (a toy example follows this list).
⦁ LLMs lack this ground truth. Without a goal, there is no objective measure of a "right" or "wrong" response, only what is more or less probable according to the training data. This makes true continual learning impossible, as there is no signal to learn from during deployment.
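As a toy illustration of the reward-as-ground-truth point (my own example, not Sutton's): even a two-armed bandit agent can test its estimates against consequences online, with no human-labeled data anywhere in the loop.

```python
import random

true_payout = {"a": 0.3, "b": 0.7}      # hidden environment
value = {"a": 0.0, "b": 0.0}            # agent's running estimates
alpha, epsilon = 0.1, 0.1

for _ in range(10_000):
    # explore occasionally, otherwise act greedily
    arm = random.choice(list(value)) if random.random() < epsilon \
          else max(value, key=value.get)
    reward = 1.0 if random.random() < true_payout[arm] else 0.0
    # the consequence corrects the estimate: continual learning from experience
    value[arm] += alpha * (reward - value[arm])

print(value)   # estimates drift toward the true payout probabilities
```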
The Bitter Lesson Revisited
Sutton's influential essay, "The Bitter Lesson," posits that general methods leveraging massive computation ultimately outperform approaches that rely on human knowledge. He sees the current LLM trend as another potential instance of this lesson.
While LLMs leverage massive compute, they are fundamentally dependent on a massive corpus of human knowledge (the internet). Sutton predicts that this reliance on human-generated data is a bottleneck. Systems that can learn directly from experience have access to a much more scalable source of data and will eventually supersede those limited to human text. He is skeptical of the idea of using LLMs as a "prior" for experiential learning, noting that historically, "people get locked into the human knowledge approach, and they get their lunch eaten by the methods that are truly scalable."
How Animals Learn: Experience over Imitation
Sutton firmly rejects the idea that imitation is the primary learning mechanism for humans or animals.
"It's obvious—if you look at animals and how they learn, and you look at psychology and our theories of them—that supervised learning is not part of the way animals learn. We don't have examples of desired behavior... Supervised learning is not something that happens in nature."
He argues that from infancy, humans and animals learn through active trial-and-error—waving their hands, moving their e...
Full story
29.4% ARC-AGI-2 🤯 (TOP SCORE!) - Jeremy Berman
Jeremy Berman, winner of the ARC-AGI v2 public leaderboard, discusses his novel evolutionary approach that refines natural language descriptions instead of code. He explores the idea of building AI that synthesizes new knowledge by constructing deductive "knowledge trees" rather than merely compressing data into "knowledge webs," touching on the fundamental challenges of reasoning, continual learning, and creativity in current models.
Jeremy Berman, a research scientist at Reflection AI, recently won the ARC-AGI v2 public leaderboard with an elegant evolutionary algorithm. In a significant shift from his v1 solution, which evolved Python programs, his new architecture evolves natural language descriptions of algorithms. This approach propelled him to the top of the leaderboard with approximately 30% accuracy, highlighting a move towards more expressive and general problem-solving frameworks.
From Python to Natural Language: A More Expressive Approach
Berman's initial success on ARC v1 involved generating and iteratively refining Python programs. He found that even for simple tasks, models struggled on the first attempt, but a revision loop that fed back errors significantly improved performance. Python was chosen for its deterministic nature and the ease of verifying a solution's correctness.
However, ARC v2 introduced more compositional tasks with multiple rules, which proved difficult to express concisely in Python. Berman observed that any ARC v2 task could be described in 5-10 bullet points of plain English. This led to the core innovation of his v2 solution: switching from evolving Python code to evolving natural language descriptions.
"Really what you want is more expressive program. And so that's why I switched from Python to English which is a much more expressive program. You can describe every single ARC v2 task in 10 bullet points of plain English, most of them in five bullet points."
This shift came with a trade-off. While natural language is more expressive and allows the model to leverage its inductive biases more fully, it isn't directly executable. This necessitates a "checker" agent to interpret the natural language instructions and generate the output grid for verification. Interestingly, Berman found that the checker agent needed to be a more powerful model than the instruction-creating agent to work effectively.
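A rough sketch of that loop as described in the discussion; llm_propose, llm_revise, and checker_apply are hypothetical placeholders for the actual model calls, and the real pipeline surely differs in its details.

```python
def solve_arc_task(task, generations=5, population=8):
    # Candidates are plain-English rule descriptions, not Python programs.
    candidates = [llm_propose(task.train_pairs) for _ in range(population)]
    best_desc, best_score = None, -1
    for _ in range(generations):
        scored = []
        for desc in candidates:
            # A stronger "checker" model interprets each description and
            # renders output grids for the training examples.
            preds = [checker_apply(desc, ex.input) for ex in task.train_pairs]
            score = sum(p == ex.output for p, ex in zip(preds, task.train_pairs))
            scored.append((score, desc))
            if score > best_score:
                best_score, best_desc = score, desc
        scored.sort(key=lambda s: s[0], reverse=True)
        survivors = [desc for _, desc in scored[: population // 2]]
        # Evolve: revise the fittest descriptions against their failures.
        candidates = survivors + [llm_revise(d, task.train_pairs) for d in survivors]
    return best_desc
```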
Knowledge Trees vs. Knowledge Webs
A central theme of the discussion is the distinction between memorized knowledge and deduced knowledge. Berman posits that pre-training treats all information as a "knowledge web"—a network of connected embeddings without a guaranteed causal structure. This is why models can feel like "stochastic parrots." As he memorably quoted from his paper:
"A parrot that lives in a courthouse will regurgitate more correct statements than a parrot that lives in a mad house."
True intelligence, he argues, is about compression through deduction. It involves building a "knowledge tree" from foundational axioms, where knowledge is causally and logically structured. Reasoning is the process of pruning the knowledge web and replacing it with this deductive tree. Reinforcement learning with verifiable rewards is a key process for this, as it forces the model's internal circuits to align with the deductive, causal structure of a correct solution.
⦁ Understanding is the possession of this knowledge tree.
⦁ Intelligence is the efficiency with which an agent can build and expand its garden of trees.
⦁ Reasoning is the meta-skill of building the tree, which can then be applied to learn all other skills.
Fundamental Challenges: Continual Learning and Creativity
The conversation delves into the core limitations of current AI systems, particularly catastrophic forgetting. When a model is fine-tuned on a new task, it risks losing its existing knowledge and capabilities.
"The ideal system would be we have a set of data. Our language model is bad at a certain thing. We can just give it this data and then all of a sudden it keeps all of its knowledge and then also gets really good at this new thing. We we are not there yet. And that to me is like a fundamental missing part."
Berman suggests that future breakthroughs may involve making models more composable, perhaps by freezing expert layers or modules, creating an architecture that ca...
Full story
How To Train An LLM with Anthropic's Head of Pretraining
Anthropic's Head of Pre-training, Nick Joseph, details the immense engineering and infrastructure challenges behind training frontier models like Claude. He covers the evolution from early-stage custom frameworks to debugging hardware at massive scale, balancing pre-training with RL, and the strategic importance of data quality and team composition.
The Core Thesis: Scaling Laws and the Compute Feedback Loop
The central thesis of pre-training has been consistent: scaling compute, data, and model parameters predictably yields more capable models. This principle is quantified by "scaling laws," which show that as you increase compute, the model's loss (a measure of its error in predicting the next word) decreases in a predictable power-law fashion.
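For reference (a formula choice of mine, not one stated in the talk), the compute scaling law is usually written in the power-law form of Kaplan et al. (2020), whose fitted compute exponent was roughly 0.05:

```latex
L(C) \approx \left(\frac{C_c}{C}\right)^{\alpha_C}, \qquad \alpha_C \approx 0.05
```

On log-log axes this is a straight line, which is what makes the trend extrapolable before committing a full training budget.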
This predictability created a powerful positive feedback loop that has driven progress over the last five years:
1. Train a large model using available compute.
2. Use the model to create a useful product that generates revenue.
3. Use the revenue to buy more compute.
4. Train an even better, larger model.
This cycle relies on a simple, scalable objective. Next-word prediction on the vast, unlabeled dataset of the internet proved to be the most effective. Unlike other objectives like masked language modeling (used by models like BERT), autoregressive next-word prediction has a significant advantage: it naturally enables generative product use cases through simple sampling, fitting perfectly into the feedback loop.
The Engineering Reality of Training at the Frontier
While the concept of scaling is simple, the implementation is an immense engineering challenge. The hardest problems in AI are often infrastructure problems, not ML problems.
Early Infrastructure and Efficiency
In the early days of Anthropic, the team felt they were among a small group of ~30 people in the world working at the frontier of large-scale training. To compete with less funding, they focused intensely on efficiency.
⦁ Custom Frameworks: Off-the-shelf open-source packages like PyTorch's distributed libraries were insufficient for the scale they were targeting. The team had to build their own distributed frameworks from the ground up, implementing techniques like data parallelism, pipelining, and tensor sharding (upsharding) themselves. This gave them the control needed to modify and optimize every component.
⦁ Hardware-Level Understanding: Using a cloud provider doesn't abstract away the physical hardware. The team had to understand the literal layout of GPUs in the data center, at one point running clustering algorithms on network latency data to reverse-engineer which chips were in which rooms to debug performance bottlenecks.
⦁ A Scientific Approach to Optimization: The process for improving efficiency was methodical:
1. Model the System: On paper, calculate the theoretical maximum efficiency (MFU/FLOPs utilization) by modeling the six or so key constraints, such as HBM bandwidth, CPU offloading, and network interconnects (a back-of-the-envelope sketch follows this list).
2. Implement: Write the code to execute the parallelization strategy.
3. Profile and Debug: Use profilers to measure the performance of every single operation. The goal is to match the actual performance to the theoretical model, identifying and fixing any discrepancies. This often required hacking existing single-GPU profilers to trace and combine data from thousands of GPUs simultaneously.
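Step 1 above lends itself to a quick sanity check. A back-of-the-envelope sketch with illustrative numbers of my own (not Anthropic's), using the standard approximation that transformer training costs about 6 FLOPs per parameter per token:

```python
n_params       = 70e9       # model size N (assumed)
tokens_per_sec = 1.0e6      # measured cluster-wide training throughput (assumed)
n_gpus         = 1024
peak_per_gpu   = 989e12     # e.g., H100 dense BF16 peak, FLOP/s

achieved = 6 * n_params * tokens_per_sec   # ~6 FLOPs per param per token (fwd + bwd)
peak     = n_gpus * peak_per_gpu           # what the hardware could theoretically do

print(f"MFU = {achieved / peak:.1%}")      # ~41% here; compare against the paper model
```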
The Challenge of Cursed Bugs and Unreliable Hardware
A surprising and frustrating challenge at scale is that the hardware itself can be the source of bugs. The conventional programmer's wisdom, "it's your code, not the computer," breaks down.
"My manager looked at it and was like, 'uh yeah, probably the computer's wrong.' And I was like, that seems unlikely. And sure enough, the computer was wrong. Turned out that the GPU was broken."
Teams must debug the entire stack, from the high-level Python code down to the physical hardware. A single faulty GPU, a misconfigured power supply, or a subtle networking issue can corrupt a training run. These "cursed bugs" can take months to solve, potentially derailing an entire model generation. This necessitates a rare engineering skill set: the ability to deep-dive any problem a...
Full story
Human Neurons are 1M x Energy Efficient than Digital AI Processors | Dr. Ewelina Kurtys | FinalSpark
Dr. Ewelina Kurtys of FinalSpark explains their pioneering work in building biocomputers from living human neurons, which are up to one million times more energy-efficient than traditional silicon chips. The conversation covers the technology of reprogramming skin cells into neurons, the company's growth strategy, and the profound ethical and philosophical questions, such as potential 'Matrix' scenarios, that arise from merging biology with AI.
The Mission: Solving AI's Energy Crisis with Biocomputing
The primary driver behind FinalSpark's research is the massive and exponentially increasing energy consumption of modern digital AI models. To build a better model today, one simply has to spend more money on energy. FinalSpark proposes a revolutionary alternative: using living human neurons as processors. These biological processors are approximately one million times more energy-efficient than their digital counterparts, presenting a potential future for AI that is sustainable and powerful. The long-term vision is to develop these biocomputers over the next 10 years, creating a new hardware paradigm for artificial intelligence.
The Technology Behind Neuron-Based Processors
FinalSpark's approach combines neuroscience, biology, and engineering to create a new type of computer.
⦁ Neuron Sourcing: Human neurons are sourced ethically and efficiently by reprogramming human skin cells into stem cells, which can then be differentiated into any cell type, including neurons. This method allows for the generation of a large supply of neurons.
⦁ Architecture: The core of the system involves arranging neurons in 3D structures, each containing about 10,000 neurons. These structures are placed on electrodes, which allows for sending electrical input signals and receiving the neurons' electrical responses (output). The fundamental challenge is to understand and program the relationship between these inputs and outputs.
⦁ Learning and Programming: The team is experimenting with both electrical and chemical signals, such as the neurotransmitter dopamine, to influence neuron behavior and facilitate learning. The next major milestone, projected for 2-3 years post-investment, is achieving "learning in vitro." This involves teaching the neurons simple tasks, like image recognition, by rewiring the connections between them, mirroring the learning process in the human brain.
⦁ The Butterfly Demo: A current practical demonstration of the technology is a web application where users can control a digital butterfly. The user's input via a mouse sends signals to the neurons in the lab. If the neurons' collective electrical activity surpasses a certain threshold, the butterfly moves, demonstrating a basic level of control and processing by the living neurons.
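The demo's control loop is simple enough to caricature in a few lines; every function name below is a hypothetical placeholder, not FinalSpark's API, and the threshold is arbitrary.

```python
THRESHOLD = 50.0   # illustrative firing-rate threshold (arbitrary units)

def control_step(mouse_event):
    send_stimulus(mouse_event)     # user input becomes an electrical signal to the culture
    activity = read_electrodes()   # collective response of the ~10,000-neuron structure
    if activity > THRESHOLD:
        move_butterfly()           # the neurons' output drives the on-screen butterfly
```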
Commercial Strategy and Growth
Founded in 2014, FinalSpark has been primarily self-funded by its founders. The company is now seeking $50 million in investment to scale its research and development team, which is crucial for solving the complex technical challenges of biocomputing.
The commercial strategy for this deep-tech venture differs from typical SaaS companies. The focus is less on immediate marketing and more on fundamental R&D. The belief is that once the technology works and can offer computational power that is 10 or 100 times cheaper, the value proposition will be so compelling that it won't require extensive marketing.
Currently, FinalSpark offers remote access to its laboratory as a tool for scientists and clients worldwide. Subscribers can conduct fundamental research on signal processing in neurons, getting familiar with the hardware of the future.
The Future of AI: A Biological Revolution
Dr. Kurtys posits that biocomputing isn't just an incremental improvement but a "total revolution" for AI. While digital AI sees small, gradual changes, using living neurons represents a complete shift in hardware. This technology is especially promising for applications where processing speed is not the most critical factor, such as generative AI.
It is not expected that biocomputing will completely replace digital silicon chips. Instead, the future will likely feature a greater variety of specialized chips and technologies tailored to different use cases, with bioprocessors occupying a significant role.
Ethical Considerations and Philos...
Full story
929: Dragon Hatchling: The Missing Link Between Transformers and the Brain — with Adrian Kosowski
Adrian Kosowski from Pathway introduces the Baby Dragon Hatchling (BDH), a groundbreaking, post-transformer architecture inspired by neuroscience. BDH leverages sparse, positive activation to mimic brain function, offering a path to limitless context, superior reasoning, and unprecedented computational efficiency, potentially solving key limitations of current large language models.
The Missing Link: Reconciling Transformers and Brain Function
The development of AI has seen a divergence between biologically-inspired architectures, like Recurrent Neural Networks (RNNs), and computationally-efficient models like the Transformer. The Transformer's attention mechanism, while powerful, is difficult to reconcile with biological processes in the brain. The new Baby Dragon Hatchling (BDH) architecture from Pathway aims to be the "missing link" by creating a more biologically plausible model that retains and extends the capabilities of transformers.
The core idea is to redesign the attention mechanism to be closer to natural systems. In the brain, attention operates at a micro-level, where a neuron's focus is on its immediate connections (synapses). This is a highly local and dynamic process, governed by principles like Hebbian Learning ("neurons that fire together, wire together"). In contrast, the Transformer's attention is a global lookup mechanism, searching across a context window for relevant information—a process designed for GPUs, not biological plausibility. BDH implements attention in a way that is fundamentally a massively parallel system of artificial neurons that communicate locally, providing a plausible model for how the brain might achieve complex reasoning.
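The local rule in question is easy to state: a synapse strengthens in proportion to the co-activity of the two neurons it connects. A toy NumPy version (illustrative only; BDH's actual dynamics are richer):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 8
w = np.zeros((n, n))     # synaptic weights
eta = 0.1

for _ in range(100):
    x = (rng.random(n) < 0.3).astype(float)  # sparse binary activity pattern
    w += eta * np.outer(x, x)                # Hebbian: Δw[i, j] = η · x[i] · x[j]
    np.fill_diagonal(w, 0.0)                 # no self-connections

print(w.round(1))   # co-active neurons accumulate strong mutual weights
```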
Core Innovation: Sparse, Positive Activation
The fundamental departure from the Transformer architecture is BDH's use of sparse and positive activation.
⦁ Sparsity: In a Transformer, every prompt activates nearly all neurons in the network (dense activation), which is computationally and energetically expensive. The human brain is sparsely activated, with only a small fraction of neurons firing at any time. BDH mirrors this, with approximately 95% of its artificial neurons silent at any given moment. This results in significant efficiency gains and is a key reason for its potential to outperform transformers on certain hardware. The current 1-billion-parameter BDH model performs on par with a comparably sized dense model like GPT-2, but with a fraction of the active computation.
⦁ Positive Space: Transformers operate in a dense vector space (governed by an L2 norm) where concepts can be represented by adding and subtracting vectors, including "negative" or "opposite" concepts. BDH operates in a sparse, positive space (closer to an L1 norm and probability distributions). In this paradigm, concepts are combined more like a "bag of words" or a "tag cloud," where elements are added together to form a whole. This avoids the non-intuitive idea of "negative" concepts (e.g., you can't "un-think" of the color blue) and better reflects how humans compose ideas. This approach also leads to a much simpler and cleaner mathematical foundation.
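The contrast between the two activation regimes fits in a few lines (numbers are illustrative, not measurements of BDH): a thresholded positive nonlinearity silences most units, while a signed squashing function leaves essentially all of them active.

```python
import numpy as np

rng = np.random.default_rng(1)
pre = rng.normal(size=10_000)          # pre-activations

dense  = np.tanh(pre)                  # signed, dense: nearly every unit nonzero
sparse = np.maximum(pre - 1.5, 0.0)    # thresholded ReLU: positive-only

print(f"dense  active: {np.mean(dense != 0):.1%}")    # ~100%
print(f"sparse active: {np.mean(sparse != 0):.1%}")   # ~7%, i.e. ~93% silent
```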
Overcoming Transformer Limitations
BDH is designed to address the key areas where transformers fall short.
⦁ Limitless Context and Lifelong Learning: While many post-transformer architectures claim infinite context, they often rely on aggressive compression that can lose information. BDH is architected to have an enormous state space (analogous to the brain's 100 trillion synapses) without being a bottleneck, allowing it to efficiently process billions of tokens of context. This enables true lifelong learning and the ability to reason over massive, enterprise-scale datasets, such as an entire technical documentation library or a large codebase.
⦁ Generalizing Reasoning: A major challenge for current LLMs is their inability to generalize reasoning beyond patterns seen in their training data. They struggle with more complex or longer chains of thought. By being more aligned with the brain's architecture, BDH is positioned to make significant breakthroughs in creating models that can generalize reasoning in a more human-like way.
⦁ Model Composability: A unique and powerful feature of BDH is its m...
Full story
Nick Lane – Life as we know it is chemically inevitable
Evolutionary biochemist Nick Lane presents a theory that the origin of life was a chemically inevitable continuation of the geochemistry in deep-sea hydrothermal vents. This framework explains why all life uses proton gradients for energy, the Krebs Cycle, and why simple bacteria dominated for billions of years. The true bottleneck for intelligent life, he argues, is the singular, chance event of endosymbiosis that created the complex eukaryotic cell, a prerequisite for large genomes, multicellularity, and even the evolution of two sexes.
The Geochemical Origins of Life
Life is not a spark of lightning in a primordial soup, but a continuous process emerging directly from Earth's geochemistry. The story begins in deep-sea alkaline hydrothermal vents, which act as natural electrochemical reactors. These vents are not violent "black smokers," but porous, sponge-like mineral structures.
This environment provides all the necessary ingredients for life's emergence:
⦁ Cell-like Compartments: The mineral pores act as precursors to cells, concentrating newly formed organic molecules and preventing them from diffusing into the ocean.
⦁ A Natural Proton Gradient: In the early Hadean Eon, the oceans were acidic (rich in protons from dissolved CO2), while the fluids emerging from the vents were alkaline. This created a natural proton gradient across the thin mineral walls of the pores—a chemiosmotic potential analogous to the one that powers all living cells today.
⦁ The Power of the Gradient: This natural voltage is immense at a molecular scale. A cell membrane is only five nanometers thick, so a potential of 150-200 millivolts creates a field on the order of 30 million volts per meter, equivalent to a bolt of lightning (the arithmetic is checked just after this list). That power was available from the start, driving the otherwise difficult reaction of combining hydrogen (H2, abundant in vent fluids) and carbon dioxide (CO2) to form organic molecules.
⦁ Geological Catalysts: The mineral walls of these pores were rich in metals like iron and nickel sulfides, which act as catalysts for these reactions, much like the metal-based enzymes that perform the same function in modern cells.
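The field-strength figure above follows from a one-line calculation. A quick check (our arithmetic, using the numbers quoted in the talk):

```python
# Field strength across a membrane = voltage / thickness (our arithmetic,
# using the figures quoted above).
membrane_potential_V = 0.150   # 150 millivolts
membrane_thickness_m = 5e-9    # 5 nanometers
field = membrane_potential_V / membrane_thickness_m
print(f"{field:.0e} V/m")      # 3e+07 V/m, i.e. 30 million volts per meter
```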
This process suggests that the core metabolism of life, such as the Krebs cycle, is not a biological invention but a thermodynamically favored chemical reaction path under these specific geological conditions. The Earth itself acted as a giant battery, producing small, living, cell-like batteries that recapitulated its fundamental electrochemical imbalance.
The Great Filter: From Simple Cells to Complex Eukaryotes
If simple life is a near-inevitable outcome of planetary chemistry, the real bottleneck for the evolution of intelligent life is the transition from simple prokaryotic cells (bacteria and archaea) to complex eukaryotic cells.
For two billion years, life on Earth consisted solely of bacteria and archaea. Despite their vast genetic diversity, they never evolved macroscopic complexity. The reason lies in an energy constraint. A bacterial cell generates energy across its outer membrane. As the cell gets bigger, its volume increases faster than its surface area, meaning it cannot generate enough energy per unit of volume to support a large, complex genome and internal machinery. Giant bacteria that do exist solve this by carrying tens of thousands of copies of their small genome, an inefficient strategy that prevents further complexity.
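The scaling argument is easy to verify numerically: for a sphere, surface area grows as r^2 while volume grows as r^3, so the energy-generating membrane area available per unit of cell volume falls as 1/r. A quick illustration (our numbers, chosen only for intuition):

```python
import numpy as np

# Surface-area-to-volume ratio of a spherical cell as it grows.
for r in [1.0, 2.0, 4.0, 8.0]:          # cell radius, arbitrary units
    surface = 4 * np.pi * r**2          # membrane that generates energy
    volume = (4 / 3) * np.pi * r**3     # cytoplasm and genome to power
    print(f"r={r:>3.0f}: surface/volume = {surface / volume:.2f}")  # equals 3/r
```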
The solution to this problem was a singular, chance event in life's history: endosymbiosis. An archaeal host cell engulfed a bacterium, which, instead of being digested, became an internal power generator. This endosymbiont evolved into the mitochondrion.
This event was revolutionary because it freed the host cell from the surface-area-to-volume constraint. With thousands of tiny, internal power packs, the cell had the energy to support a vastly larger genome and develop the complex internal structures—like the nucleus, endomembranes, and cytoskeleton—that define all eukaryotes, from amoebas to plants and animals. This singular origin explains why a plant cell and a human cell share the same fundamental "kit" of organelles; it was an adaptation to an internal struggle of integrating the endosymbiont, not an adaptation to a specific external lifestyle.
The Mitochondrial Legacy: Why We Have Two Sexes
Mitochondria explain not only the rise of complexity but also the evolution of sex. While prokaryotes exchange genes via lateral gene transf...
Full story
Sparse Activation is the Future of AI (with Adrian Kosowski)
Adrian Kosowski from Pathway explains their groundbreaking research on sparse activation in AI, moving beyond the dense architectures of transformers. Their model, Baby Dragon Hatchling (BDH), mimics the brain's efficiency by activating only a small fraction of its artificial neurons, enabling a new, more scalable, and compositional approach to reasoning that isn't confined by the vector space limitations of current models.
The architecture of current transformer-based models is characterized by dense activation, in which information flows through all, or at least large blocks of, the neurons in the network. This process is computationally and energetically expensive. The human brain, in contrast, operates on a principle of sparse activation, where only a small fraction of neurons are active at any given time. This efficiency is demonstrable through techniques like fMRI, which show localized brain activity during specific cognitive tasks.
The Two Worlds: Dense vs. Sparse Activation
There is a fundamental gap between two paradigms in neural network design:
1. The World of Dense Activations: This is the realm of transformers, including models from the GPT-2 and Llama families. In these architectures, every input prompt triggers a computational cascade through every neuron and connection in the network. While powerful, this approach has inherent scaling challenges.
2. The World of Sparse Positive Activations: This is the paradigm where Pathway's Baby Dragon Hatchling (BDH) model operates. It is inspired by biological brain function, where sparse activation has been a topic of study since the 1990s, particularly in sensory functions. BDH represents a novel application of this concept to complex reasoning tasks. In this model, approximately 95% of artificial neurons are silent at any given moment, drastically improving efficiency.
Despite its sparse nature and relatively small size (around 1 billion parameters, comparable to GPT-2), BDH demonstrates performance that rivals its dense counterparts and is designed to be GPU-efficient, especially for inference.
The Scaling Limitations of Transformers
A critical limitation has emerged in the scaling of transformers. While parameter counts and layer counts have kept increasing, the vector dimension used by the attention mechanism has stopped scaling, plateauing at around 1,000 dimensions. This creates a bottleneck: every concept the model works with must be mapped into this relatively small vector space, limiting the potential for more nuanced and complex reasoning.
A New Framework for Conceptual Reasoning
The distinction between dense and sparse models extends to how they represent and manipulate concepts.
⦁ Transformers and Vector Spaces: Dense models operate within a traditional linear vector space. Concepts are represented as vectors that can be added, subtracted, or negated. This allows for algebraic manipulation but may not fully capture the complexity of human thought.
⦁ BDH and Sparse Positive Spaces: Sparse models move away from linear combinations and towards a compositional framework. Concepts are formed more like a "bag of words" or an associative "tag cloud," where elements are put together to create a whole. This is analogous to how German compound nouns are formed or how words combine to form a sentence.
A key difference in this new framework is the absence of negatives or opposites. In human reasoning, there isn't a simple symmetry between being attracted to a concept and being repelled by it. The classic example is, "don't think about the color blue"—the instruction itself forces you to engage with the concept of "blue." BDH's architecture reflects this non-symmetrical nature of thought, suggesting a mechanism that is fundamentally different from the vector opposition found in dense models and potentially closer to how biological reasoning occurs. This allows for a mathematically cleaner architecture that may even help re-evaluate and understand the transformer as an approximation of this more fundamental, sparse model.
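As a toy illustration of this contrast (our construction, not BDH's internal representation), concepts in a sparse, positive space can be modeled as non-negative "tag clouds" over a feature vocabulary: composition is addition, and there is no well-defined negation, only removal.

```python
import numpy as np

vocab = ["sky", "blue", "water", "calm", "fire"]
sky = np.array([1.0, 0.8, 0.0, 0.3, 0.0])  # tag strengths are never negative
sea = np.array([0.0, 0.6, 1.0, 0.5, 0.0])

# Composition is additive, like assembling a compound noun:
scene = sky + sea
print(dict(zip(vocab, scene.round(1))))

# There is no "negative blue": clipping at zero can delete the tag,
# but it cannot represent an opposite of the concept.
not_blue = np.maximum(scene - np.array([0.0, 10.0, 0.0, 0.0, 0.0]), 0.0)
print(dict(zip(vocab, not_blue.round(1))))
```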
Machine Learning Explained: A Guide to ML, AI, & Deep Learning
A breakdown of Machine Learning (ML), its relationship with AI and Deep Learning, and its core paradigms: supervised, unsupervised, and reinforcement learning. The summary explores classic models and connects them to modern applications like Large Language Models (LLMs) and Reinforcement Learning with Human Feedback (RLHF).
Machine Learning (ML) is a subset of Artificial Intelligence (AI) that focuses on algorithms that learn patterns from training data to make accurate inferences about new, unseen data. It sits within a hierarchy where AI is the broadest field, ML is a subfield of AI, and Deep Learning (DL)—which uses neural networks with many layers—is a subfield of ML.
The central premise of ML involves model training, a process where a machine's performance is optimized on a dataset that resembles real-world tasks. A well-trained model can then apply the patterns it has learned to infer correct outputs for new data. Running the trained model in production, where it actively makes predictions on new, live data, is called AI inference.
The Three Learning Paradigms
Most machine learning can be grouped into three main paradigms:
1. Supervised Learning
Supervised learning trains a model to predict a correct output using labeled examples, often referred to as "ground truth." This process typically requires a human to provide the correctly labeled data.
⦁ Regression Models: Predict continuous numerical values, such as price predictions or temperature forecasts.
  ⦁ Linear Regression: Finds the best-fit straight line through data points.
  ⦁ Polynomial Regression: Captures non-linear relationships in the data.
⦁ Classification Models: Predict discrete classes or categories.
  ⦁ Binary Classification: Assigns an item to one of two categories (e.g., spam or not spam).
  ⦁ Multi-class Classification: Assigns an item to one of many categories.
  ⦁ Multi-label Classification: Assigns multiple relevant tags or labels to a single item.
Modern techniques often use ensemble methods, which combine multiple models to achieve higher accuracy.
A related approach is semi-supervised learning, which uses a small amount of labeled data along with a large pool of unlabeled data. This method allows the model to generalize from the labeled examples to the unlabeled data, reducing the need for costly and time-consuming data labeling.
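To ground the supervised-learning vocabulary above, here is a minimal sketch (illustrative only, not from the episode): fit a linear regression to labeled examples with NumPy, then run inference on new, unseen inputs.

```python
import numpy as np

rng = np.random.default_rng(0)

# Labeled training data ("ground truth"): y is roughly 2x + 1 plus noise.
x = rng.uniform(0, 10, size=50)
y = 2.0 * x + 1.0 + rng.normal(0, 0.5, size=50)

# Training: least-squares fit of slope and intercept (the best-fit line).
X = np.column_stack([x, np.ones_like(x)])
(slope, intercept), *_ = np.linalg.lstsq(X, y, rcond=None)

# Inference: apply the learned pattern to new, unseen inputs.
x_new = np.array([3.0, 7.5])
print(slope * x_new + intercept)   # close to [7.0, 16.0]
```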
2. Unsupervised Learning
Unsupervised learning works with unlabeled data to discover hidden structures and patterns on its own.
⦁ Clustering: Groups similar items together.
  ⦁ K-Means Clustering: Assigns items to a pre-determined number (k) of groups by repeatedly recalculating group averages (centroids) until they stabilize. This is useful for tasks like customer segmentation (e.g., bargain hunters vs. loyal customers); a bare-bones implementation follows this list.
  ⦁ Hierarchical Clustering: Builds a tree of clusters by starting with each item as its own group and progressively merging the most similar groups. This allows for creating broad or fine-grained clusters depending on where the tree is "cut," which is useful for organizing IT tickets into themes.
⦁ Dimensionality Reduction: Reduces the complexity of data by representing it with a smaller number of features while retaining meaningful characteristics. This is often used for data preprocessing, compression, and visualization. Common algorithms include Principal Component Analysis (PCA) and autoencoders.
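The K-Means loop described above fits in a few lines. A bare-bones NumPy sketch (illustrative only; real pipelines would typically use a library implementation such as scikit-learn's KMeans):

```python
import numpy as np

rng = np.random.default_rng(0)

def kmeans(points, k, n_iters=10):
    """Plain k-means: assign each point to its nearest centroid, then move
    each centroid to the mean of its assigned points, and repeat."""
    centroids = points[rng.choice(len(points), size=k, replace=False)]
    for _ in range(n_iters):
        dists = np.linalg.norm(points[:, None] - centroids[None, :], axis=2)
        labels = dists.argmin(axis=1)  # nearest-centroid assignment
        centroids = np.array([points[labels == j].mean(axis=0)
                              if np.any(labels == j) else centroids[j]  # keep empty clusters in place
                              for j in range(k)])
    return labels, centroids

# Two obvious customer segments in a 2-D feature space (e.g., spend, visits).
points = np.vstack([rng.normal([1, 1], 0.2, size=(20, 2)),
                    rng.normal([5, 5], 0.2, size=(20, 2))])
labels, centroids = kmeans(points, k=2)
print(centroids.round(1))  # one centroid near (1, 1), the other near (5, 5)
```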
3. Reinforcement Learning (RL)
In reinforcement learning, an agent interacts with an environment. The agent observes the current state, chooses an action, and receives a reward or penalty from the environment. Through trial and error, the agent learns a policy that maximizes its long-term rewards.
A key challenge in RL is balancing exploration (trying new actions) with exploitation (repeating actions that have worked well in the past). A classic example is a self-driving car, where the state comes from GPS and cameras, actions are steering and braking, and rewards are given for safe progress while penalties are applied for hard braking or collisions.
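A minimal sketch of that explore/exploit trade-off (our toy example, not from the episode): an epsilon-greedy agent learning the value of three actions in a simple bandit-style environment.

```python
import numpy as np

rng = np.random.default_rng(0)

true_rewards = [0.2, 0.5, 0.8]   # hidden payouts of the environment
estimates = np.zeros(3)          # agent's learned value of each action
counts = np.zeros(3)
epsilon = 0.1                    # fraction of steps spent exploring

for step in range(1_000):
    if rng.random() < epsilon:
        action = int(rng.integers(3))        # explore: try a random action
    else:
        action = int(estimates.argmax())     # exploit: best action so far
    reward = true_rewards[action] + rng.normal(0, 0.1)  # environment feedback
    counts[action] += 1
    estimates[action] += (reward - estimates[action]) / counts[action]  # running mean

print(estimates.round(2))  # approaches [0.2, 0.5, 0.8]; arm 2 is exploited most
```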
From Classic ML to Modern Applications
Techniques like regression, classification, and clustering a...
Full story