Tokenless - The Best of AI, ML & CS Talks
Daily posts on the internals of AI, ML, and CS — straight from the experts. No hype, no bullshit news.

#AI #AInews #newsletter
Jeremy Berman, a research scientist at Reflection AI, recently won the ARC-AGI v2 public leaderboard with an elegant evolutionary algorithm. In a significant shift from his v1 solution, which evolved Python programs, his new architecture evolves natural language descriptions of algorithms. This approach propelled him to the top of the leaderboard with approximately 30% accuracy, highlighting a move towards more expressive and general problem-solving frameworks.

From Python to Natural Language: A More Expressive Approach

Berman's initial success on ARC v1 involved generating and iteratively refining Python programs. He found that even for simple tasks, models struggled on the first attempt, but a revision loop that fed back errors significantly improved performance. Python was chosen for its deterministic nature and the ease of verifying a solution's correctness.

However, ARC v2 introduced more compositional tasks with multiple rules, which proved difficult to express concisely in Python. Berman observed that any ARC v2 task could be described in 5-10 bullet points of plain English. This led to the core innovation of his v2 solution: switching from evolving Python code to evolving natural language descriptions.

"Really what you want is more expressive program. And so that's why I switched from Python to English which is a much more expressive program. You can describe every single ARC v2 task in 10 bullet points of plain English, most of them in five bullet points."

This shift came with a trade-off. While natural language is more expressive and allows the model to leverage its inductive biases more fully, it isn't directly executable. This necessitates a "checker" agent to interpret the natural language instructions and generate the output grid for verification. Interestingly, Berman found that the checker agent needed to be a more powerful model than the instruction-creating agent to work effectively.
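
A minimal sketch of that loop, with hypothetical writer_llm and checker_llm helpers standing in for the two models (an illustration of the setup described above, not Berman's actual code):

import random

def writer_llm(prompt):
    # Hypothetical call to the cheaper instruction-writing model.
    raise NotImplementedError

def checker_llm(description, input_grid):
    # Hypothetical call to the stronger checker model: it interprets the
    # natural-language description and returns the output grid it implies.
    raise NotImplementedError

def fitness(description, train_pairs):
    # Fraction of training pairs the checker reproduces exactly.
    hits = sum(checker_llm(description, x) == y for x, y in train_pairs)
    return hits / len(train_pairs)

def evolve(task, generations=5, population=8, survivors=3):
    pool = [writer_llm(f"Describe the rule in 5-10 bullet points: {task['train']}")
            for _ in range(population)]
    for _ in range(generations):
        pool.sort(key=lambda d: fitness(d, task["train"]), reverse=True)
        best = pool[:survivors]
        if fitness(best[0], task["train"]) == 1.0:
            break                      # all training examples reproduced
        # Revision loop: feed failures back and ask for improved bullets.
        pool = best + [writer_llm(f"These bullets fail on some examples; revise them:\n{random.choice(best)}")
                       for _ in range(population - survivors)]
    return best[0]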

Knowledge Trees vs. Knowledge Webs

A central theme of the discussion is the distinction between memorized knowledge and deduced knowledge. Berman posits that pre-training treats all information as a "knowledge web"—a network of connected embeddings without a guaranteed causal structure. This is why models can feel like "stochastic parrots." As he memorably quoted from his paper:

"A parrot that lives in a courthouse will regurgitate more correct statements than a parrot that lives in a mad house."

True intelligence, he argues, is about compression through deduction. It involves building a "knowledge tree" from foundational axioms, where knowledge is causally and logically structured. Reasoning is the process of pruning the knowledge web and replacing it with this deductive tree. Reinforcement learning with verifiable rewards is a key process for this, as it forces the model's internal circuits to align with the deductive, causal structure of a correct solution.

Understanding is the possession of this knowledge tree.
Intelligence is the efficiency with which an agent can build and expand its garden of trees.
Reasoning is the meta-skill of building the tree, which can then be applied to learn all other skills.

Fundamental Challenges: Continual Learning and Creativity

The conversation delves into the core limitations of current AI systems, particularly catastrophic forgetting. When a model is fine-tuned on a new task, it risks losing its existing knowledge and capabilities.

"The ideal system would be we have a set of data. Our language model is bad at a certain thing. We can just give it this data and then all of a sudden it keeps all of its knowledge and then also gets really good at this new thing. We we are not there yet. And that to me is like a fundamental missing part."

Berman suggests that future breakthroughs may involve making models more composable, perhaps by freezing expert layers or modules, creating an architecture that ca...
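
A minimal PyTorch sketch of that composability idea in its simplest form: freeze a pretrained block so it cannot be overwritten, and train only a new module. This illustrates the general direction, not any specific proposal:

import torch
import torch.nn as nn

model = nn.Sequential(
    nn.Linear(512, 512), nn.ReLU(),   # stand-in for a pretrained "expert" block
    nn.Linear(512, 512),              # new module for the new skill
)

for p in model[0].parameters():
    p.requires_grad = False           # freeze old knowledge against forgetting

trainable = [p for p in model.parameters() if p.requires_grad]
optimizer = torch.optim.AdamW(trainable, lr=1e-4)   # only the new module updates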

Full story
How To Train An LLM with Anthropic's Head of Pretraining

Anthropic's Head of Pre-training, Nick Joseph, details the immense engineering and infrastructure challenges behind training frontier models like Claude. He covers the evolution from early-stage custom frameworks to debugging hardware at massive scale, balancing pre-training with RL, and the strategic importance of data quality and team composition.
The Core Thesis: Scaling Laws and the Compute Feedback Loop

The central thesis of pre-training has been consistent: scaling compute, data, and model parameters predictably yields more capable models. This principle is quantified by "scaling laws," which show that as you increase compute, the model's loss (a measure of its error in predicting the next word) decreases in a predictable power-law fashion.
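
As a toy illustration of what "predictable power-law fashion" means, one can fit L(C) = a * C^b on a handful of (compute, loss) points; the numbers below are invented:

import numpy as np

compute = np.array([1e18, 1e19, 1e20, 1e21])     # training FLOPs (made up)
loss    = np.array([3.20, 2.75, 2.36, 2.03])     # measured eval loss (made up)

# Fit log(loss) = b*log(C) + log(a) by least squares; b comes out negative.
b, log_a = np.polyfit(np.log(compute), np.log(loss), 1)
print(f"L(C) = {np.exp(log_a):.2f} * C^{b:.3f}")
print("extrapolated loss at 1e22 FLOPs:", np.exp(log_a) * 1e22 ** b)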

This predictability created a powerful positive feedback loop that has driven progress over the last five years:
1. Train a large model using available compute.
2. Use the model to create a useful product that generates revenue.
3. Use the revenue to buy more compute.
4. Train an even better, larger model.

This cycle relies on a simple, scalable objective. Next-word prediction on the vast, unlabeled dataset of the internet proved to be the most effective. Unlike other objectives like masked language modeling (used by models like BERT), autoregressive next-word prediction has a significant advantage: it naturally enables generative product use cases through simple sampling, fitting perfectly into the feedback loop.
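
In code, the autoregressive objective is just cross-entropy between position t's prediction and token t+1. A self-contained PyTorch sketch with random stand-in data:

import torch
import torch.nn.functional as F

vocab, batch, seq = 50_000, 2, 16
tokens = torch.randint(0, vocab, (batch, seq))   # a batch of token IDs
logits = torch.randn(batch, seq, vocab)          # stand-in for the model's output

# Each position predicts the *next* token, hence the one-step shift.
loss = F.cross_entropy(logits[:, :-1].reshape(-1, vocab),
                       tokens[:, 1:].reshape(-1))
print(loss.item())   # sampling from these logits is what enables generation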

The Engineering Reality of Training at the Frontier

While the concept of scaling is simple, the implementation is an immense engineering challenge. The hardest problems in AI are often infrastructure problems, not ML problems.

Early Infrastructure and Efficiency

In the early days of Anthropic, the team felt they were among a small group of ~30 people in the world working at the frontier of large-scale training. To compete with less funding, they focused intensely on efficiency.

Custom Frameworks: Off-the-shelf open-source packages like PyTorch's distributed libraries were insufficient for the scale they were targeting. The team had to build their own distributed frameworks from the ground up, implementing techniques like data parallelism, pipelining, and tensor sharding themselves. This gave them the control needed to modify and optimize every component.
Hardware-Level Understanding: Using a cloud provider doesn't abstract away the physical hardware. The team had to understand the literal layout of GPUs in the data center, at one point running clustering algorithms on network latency data to reverse-engineer which chips were in which rooms to debug performance bottlenecks.
A Scientific Approach to Optimization: The process for improving efficiency was methodical:
1. Model the System: On paper, calculate the theoretical maximum efficiency (MFU/FLOPs utilization) by modeling the six or so key constraints, such as HBM bandwidth, CPU offloading, and network interconnects (a back-of-envelope sketch follows this list).
2. Implement: Write the code to execute the parallelization strategy.
3. Profile and Debug: Use profilers to measure the performance of every single operation. The goal is to match the actual performance to the theoretical model, identifying and fixing any discrepancies. This often required hacking existing single-GPU profilers to trace and combine data from thousands of GPUs simultaneously.
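
A back-of-envelope version of step 1, using the common ~6N FLOPs-per-token approximation for a dense transformer; all numbers below are invented for illustration:

params         = 70e9        # model size N
tokens_per_sec = 2.0e5       # measured cluster throughput
gpus           = 1024
peak_per_gpu   = 989e12      # e.g. H100 BF16 dense peak, FLOP/s

achieved = 6 * params * tokens_per_sec           # FLOP/s actually delivered
mfu = achieved / (gpus * peak_per_gpu)
print(f"MFU = {mfu:.1%}")    # profile until measurement matches the paper model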

The Challenge of Cursed Bugs and Unreliable Hardware

A surprising and frustrating challenge at scale is that the hardware itself can be the source of bugs. The conventional programmer's wisdom, "it's your code, not the computer," breaks down.

"My manager looked at it and was like, 'uh yeah, probably the computer's wrong.' And I was like, that seems unlikely. And sure enough, the computer was wrong. Turned out that the GPU was broken."

Teams must debug the entire stack, from the high-level Python code down to the physical hardware. A single faulty GPU, a misconfigured power supply, or a subtle networking issue can corrupt a training run. These "cursed bugs" can take months to solve, potentially derailing an entire model generation. This necessitates a rare engineering skill set: the ability to deep-dive any problem a...

Full story
Human Neurons are 1M× More Energy-Efficient than Digital AI Processors | Dr. Ewelina Kurtys | FinalSpark

Dr. Ewelina Kurtys of FinalSpark explains their pioneering work in building biocomputers from living human neurons, which are up to one million times more energy-efficient than traditional silicon chips. The conversation covers the technology of reprogramming skin cells into neurons, the company's growth strategy, and the profound ethical and philosophical questions, such as potential 'Matrix' scenarios, that arise from merging biology with AI.
The Mission: Solving AI's Energy Crisis with Biocomputing

The primary driver behind FinalSpark's research is the massive and exponentially increasing energy consumption of modern digital AI models. To build a better model today, one simply has to spend more money on energy. FinalSpark proposes a revolutionary alternative: using living human neurons as processors. These biological processors are approximately one million times more energy-efficient than their digital counterparts, presenting a potential future for AI that is sustainable and powerful. The long-term vision is to develop these biocomputers over the next 10 years, creating a new hardware paradigm for artificial intelligence.

The Technology Behind Neuron-Based Processors

FinalSpark's approach combines neuroscience, biology, and engineering to create a new type of computer.

Neuron Sourcing: Human neurons are sourced ethically and efficiently by reprogramming human skin cells into stem cells, which can then be differentiated into any cell type, including neurons. This method allows for the generation of a large supply of neurons.
Architecture: The core of the system involves arranging neurons in 3D structures, each containing about 10,000 neurons. These structures are placed on electrodes, which allows for sending electrical input signals and receiving the neurons' electrical responses (output). The fundamental challenge is to understand and program the relationship between these inputs and outputs.
Learning and Programming: The team is experimenting with both electrical and chemical signals, such as the neurotransmitter dopamine, to influence neuron behavior and facilitate learning. The next major milestone, projected for 2-3 years post-investment, is achieving "learning in vitro." This involves teaching the neurons simple tasks, like image recognition, by rewiring the connections between them, mirroring the learning process in the human brain.
The Butterfly Demo: A current practical demonstration of the technology is a web application where users can control a digital butterfly. The user's input via a mouse sends signals to the neurons in the lab. If the neurons' collective electrical activity surpasses a certain threshold, the butterfly moves, demonstrating a basic level of control and processing by the living neurons.
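
FinalSpark has not published the demo's code, but the described logic is a simple threshold rule. A hypothetical sketch with an invented electrode API:

def butterfly_step(mouse_event, electrodes, threshold=50):
    # `electrodes` is a hypothetical interface to the lab hardware.
    electrodes.stimulate(mouse_event)      # user input becomes an electrical signal
    spikes = electrodes.read_activity()    # summed electrical response of the neurons
    return "move" if spikes > threshold else "stay"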

Commercial Strategy and Growth

Founded in 2014, FinalSpark has been primarily self-funded by its founders. The company is now seeking $50 million in investment to scale its research and development team, which is crucial for solving the complex technical challenges of biocomputing.

The commercial strategy for this deep-tech venture differs from typical SaaS companies. The focus is less on immediate marketing and more on fundamental R&D. The belief is that once the technology works and can offer computational power that is 10 or 100 times cheaper, the value proposition will be so compelling that it won't require extensive marketing.

Currently, FinalSpark offers remote access to its laboratory as a tool for scientists and clients worldwide. Subscribers can conduct fundamental research on signal processing in neurons, getting familiar with the hardware of the future.

The Future of AI: A Biological Revolution

Dr. Kurtys posits that biocomputing isn't just an incremental improvement but a "total revolution" for AI. While digital AI sees small, gradual changes, using living neurons represents a complete shift in hardware. This technology is especially promising for applications where processing speed is not the most critical factor, such as generative AI.

It is not expected that biocomputing will completely replace digital silicon chips. Instead, the future will likely feature a greater variety of specialized chips and technologies tailored to different use cases, with bioprocessors occupying a significant role.

Ethical Considerations and Philos...

Full story
929: Dragon Hatchling: The Missing Link Between Transformers and the Brain — with Adrian Kosowski

Adrian Kosowski from Pathway introduces the Baby Dragon Hatchling (BDH), a groundbreaking, post-transformer architecture inspired by neuroscience. BDH leverages sparse, positive activation to mimic brain function, offering a path to limitless context, superior reasoning, and unprecedented computational efficiency, potentially solving key limitations of current large language models.
The Missing Link: Reconciling Transformers and Brain Function

The development of AI has seen a divergence between biologically-inspired architectures, like Recurrent Neural Networks (RNNs), and computationally-efficient models like the Transformer. The Transformer's attention mechanism, while powerful, is difficult to reconcile with biological processes in the brain. The new Baby Dragon Hatchling (BDH) architecture from Pathway aims to be the "missing link" by creating a more biologically plausible model that retains and extends the capabilities of transformers.

The core idea is to redesign the attention mechanism to be closer to natural systems. In the brain, attention operates at a micro-level, where a neuron's focus is on its immediate connections (synapses). This is a highly local and dynamic process, governed by principles like Hebbian Learning ("neurons that fire together, wire together"). In contrast, the Transformer's attention is a global lookup mechanism, searching across a context window for relevant information—a process designed for GPUs, not biological plausibility. BDH implements attention in a way that is fundamentally a massively parallel system of artificial neurons that communicate locally, providing a plausible model for how the brain might achieve complex reasoning.
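
The contrast is easy to state in code: Hebbian micro-attention is a local update on individual synapses. A toy NumPy sketch, not the BDH implementation:

import numpy as np

rng = np.random.default_rng(0)
pre, post = rng.random(100), rng.random(100)   # activity of two neuron groups
W = np.zeros((100, 100))                       # synaptic weights

eta = 0.01
W += eta * np.outer(post, pre)                 # co-active pairs wire together; purely local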

Core Innovation: Sparse, Positive Activation

The fundamental departure from the Transformer architecture is BDH's use of sparse and positive activation.

Sparsity: In a Transformer, every prompt activates nearly all neurons in the network (dense activation), which is computationally and energetically expensive. The human brain is sparsely activated, with only a small fraction of neurons firing at any time. BDH mirrors this, with approximately 95% of its artificial neurons silent at any given moment. This results in significant efficiency gains and is a key reason for its potential to outperform transformers on certain hardware. The current 1-billion-parameter BDH model performs on par with a comparably sized dense model like GPT-2, but with a fraction of the active computation.

Positive Space: Transformers operate in a dense vector space (governed by an L2 norm) where concepts can be represented by adding and subtracting vectors, including "negative" or "opposite" concepts. BDH operates in a sparse, positive space (closer to an L1 norm and probability distributions). In this paradigm, concepts are combined more like a "bag of words" or a "tag cloud," where elements are added together to form a whole. This avoids the non-intuitive idea of "negative" concepts (e.g., you can't "un-think" of the color blue) and better reflects how humans compose ideas. This approach also leads to a much simpler and cleaner mathematical foundation.
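
A toy version of sparse, positive activation (illustrative only, not BDH itself): clamp activations to be non-negative, then keep the top 5% so roughly 95% of neurons stay silent:

import numpy as np

def sparse_positive(x, keep=0.05):
    x = np.maximum(x, 0.0)                     # positive space: no "negative concepts"
    k = max(1, int(len(x) * keep))
    cutoff = np.partition(x, -k)[-k]
    return np.where(x >= cutoff, x, 0.0)

acts = sparse_positive(np.random.default_rng(0).standard_normal(1000))
print((acts > 0).mean())                       # ~0.05: about 95% silent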

Overcoming Transformer Limitations

BDH is designed to address the key areas where transformers fall short.

Limitless Context and Lifelong Learning: While many post-transformer architectures claim infinite context, they often rely on aggressive compression that can lose information. BDH is architected to have an enormous state space (analogous to the brain's 100 trillion synapses) without being a bottleneck, allowing it to efficiently process billions of tokens of context. This enables true lifelong learning and the ability to reason over massive, enterprise-scale datasets, such as an entire technical documentation library or a large codebase.

Generalizing Reasoning: A major challenge for current LLMs is their inability to generalize reasoning beyond patterns seen in their training data. They struggle with more complex or longer chains of thought. By being more aligned with the brain's architecture, BDH is positioned to make significant breakthroughs in creating models that can generalize reasoning in a more human-like way.

Model Composability: A unique and powerful feature of BDH is its m...

Full story
Nick Lane – Life as we know it is chemically inevitable

Evolutionary biochemist Nick Lane presents a theory that the origin of life was a chemically inevitable continuation of the geochemistry in deep-sea hydrothermal vents. This framework explains why all life uses proton gradients for energy, the Krebs Cycle, and why simple bacteria dominated for billions of years. The true bottleneck for intelligent life, he argues, is the singular, chance event of endosymbiosis that created the complex eukaryotic cell, a prerequisite for large genomes, multicellularity, and even the evolution of two sexes.
The Geochemical Origins of Life

Life is not a spark of lightning in a primordial soup, but a continuous process emerging directly from Earth's geochemistry. The story begins in deep-sea alkaline hydrothermal vents, which act as natural electrochemical reactors. These vents are not violent "black smokers," but porous, sponge-like mineral structures.

This environment provides all the necessary ingredients for life's emergence:
Cell-like Compartments: The mineral pores act as precursors to cells, concentrating newly formed organic molecules and preventing them from diffusing into the ocean.
A Natural Proton Gradient: In the early Hadean Eon, the oceans were acidic (rich in protons from dissolved CO2), while the fluids emerging from the vents were alkaline. This created a natural proton gradient across the thin mineral walls of the pores—a chemiosmotic potential analogous to the one that powers all living cells today.
The Power of the Gradient: This natural voltage is immense at a molecular scale. A cell membrane is only five nanometers thick, so a potential of 150-200 millivolts creates a field of 30 million volts per meter, equivalent to a bolt of lightning. This immense power was available from the start, driving the difficult reaction of combining hydrogen (H2, abundant in vent fluids) and carbon dioxide (CO2) to form organic molecules.
Geological Catalysts: The mineral walls of these pores were rich in metals like iron and nickel sulfides, which act as catalysts for these reactions, much like the metal-based enzymes that perform the same function in modern cells.

This process suggests that the core metabolism of life, such as the Krebs cycle, is not a biological invention but a thermodynamically favored chemical reaction path under these specific geological conditions. The Earth itself acted as a giant battery, producing small, living, cell-like batteries that recapitulated its fundamental electrochemical imbalance.

The Great Filter: From Simple Cells to Complex Eukaryotes

If simple life is a near-inevitable outcome of planetary chemistry, the real bottleneck for the evolution of intelligent life is the transition from simple prokaryotic cells (bacteria and archaea) to complex eukaryotic cells.

For two billion years, life on Earth consisted solely of bacteria and archaea. Despite their vast genetic diversity, they never evolved macroscopic complexity. The reason lies in an energy constraint. A bacterial cell generates energy across its outer membrane. As the cell gets bigger, its volume increases faster than its surface area, meaning it cannot generate enough energy per unit of volume to support a large, complex genome and internal machinery. Giant bacteria that do exist solve this by carrying tens of thousands of copies of their small genome, an inefficient strategy that prevents further complexity.

The solution to this problem was a singular, chance event in life's history: endosymbiosis. An archaeal host cell engulfed a bacterium, which, instead of being digested, became an internal power generator. This endosymbiont evolved into the mitochondrion.

This event was revolutionary because it freed the host cell from the surface-area-to-volume constraint. With thousands of tiny, internal power packs, the cell had the energy to support a vastly larger genome and develop the complex internal structures—like the nucleus, endomembranes, and cytoskeleton—that define all eukaryotes, from amoebas to plants and animals. This singular origin explains why a plant cell and a human cell share the same fundamental "kit" of organelles; it was an adaptation to an internal struggle of integrating the endosymbiont, not an adaptation to a specific external lifestyle.

The Mitochondrial Legacy: Why We Have Two Sexes

Mitochondria explain not only the rise of complexity but also the evolution of sex. While prokaryotes exchange genes via lateral gene transf...

Full story
Sparse Activation is the Future of AI (with Adrian Kosowski)

Adrian Kosowski from Pathway explains their groundbreaking research on sparse activation in AI, moving beyond the dense architectures of transformers. Their model, Baby Dragon Hatchling (BDH), mimics the brain's efficiency by activating only a small fraction of its artificial neurons, enabling a new, more scalable, and compositional approach to reasoning that isn't confined by the vector space limitations of current models.
The architecture of current transformer-based models is characterized by dense activation, where information flows through all, or at least large blocks of, the neurons in the network. This process is computationally and energetically expensive. The human brain, in contrast, operates on a principle of sparse activation, where only a small fraction of neurons are active at any given time. This efficiency is demonstrable through techniques like fMRI, which show localized brain activity during specific cognitive tasks.

The Two Worlds: Dense vs. Sparse Activation

There is a fundamental gap between two paradigms in neural network design:

1. The World of Dense Activations: This is the realm of transformers, including models from the GPT-2 and Llama families. In these architectures, every input prompt triggers a computational cascade through every neuron and connection in the network. While powerful, this approach has inherent scaling challenges.

2. The World of Sparse Positive Activations: This is the paradigm where Pathway's Baby Dragon Hatchling (BDH) model operates. It is inspired by biological brain function, where sparse activation has been a topic of study since the 1990s, particularly in sensory functions. BDH represents a novel application of this concept to complex reasoning tasks. In this model, approximately 95% of artificial neurons are silent at any given moment, drastically improving efficiency.

Despite its sparse nature and relatively small size (around 1 billion parameters, comparable to GPT-2), BDH demonstrates performance that rivals its dense counterparts and is designed to be GPU-efficient, especially for inference.

The Scaling Limitations of Transformers

A critical limitation has emerged in the scaling of transformers. While parameters and layers have increased, the vector dimension of the attention head has converged and stopped scaling, remaining fixed at around 1,000 dimensions. This creates a bottleneck, as all concepts the model works with must be mapped into this relatively small vector space, limiting the potential for more nuanced and complex reasoning.

A New Framework for Conceptual Reasoning

The distinction between dense and sparse models extends to how they represent and manipulate concepts.

Transformers and Vector Spaces: Dense models operate within a traditional linear vector space. Concepts are represented as vectors that can be added, subtracted, or negated. This allows for algebraic manipulation but may not fully capture the complexity of human thought.

BDH and Sparse Positive Spaces: Sparse models move away from linear combinations and towards a compositional framework. Concepts are formed more like a "bag of words" or an associative "tag cloud," where elements are put together to create a whole. This is analogous to how German compound nouns are formed or how words combine to form a sentence.

A key difference in this new framework is the absence of negatives or opposites. In human reasoning, there isn't a simple symmetry between being attracted to a concept and being repelled by it. The classic example is, "don't think about the color blue"—the instruction itself forces you to engage with the concept of "blue." BDH's architecture reflects this non-symmetrical nature of thought, suggesting a mechanism that is fundamentally different from the vector opposition found in dense models and potentially closer to how biological reasoning occurs. This allows for a mathematically cleaner architecture that may even help re-evaluate and understand the transformer as an approximation of this more fundamental, sparse model.
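
The "bag of words" composition is easy to mimic with sparse positive counts (a toy illustration, not BDH's actual representation):

from collections import Counter

coffee = Counter(hot=1, drink=1, bitter=1)
milk = Counter(drink=1, white=1)
print(coffee + milk)    # composition adds elements into a whole
# Counter arithmetic drops non-positive entries, so there is no way to
# represent "the opposite of blue": only presence, never negation.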

Machine Learning Explained: A Guide to ML, AI, & Deep Learning

A breakdown of Machine Learning (ML), its relationship with AI and Deep Learning, and its core paradigms: supervised, unsupervised, and reinforcement learning. The summary explores classic models and connects them to modern applications like Large Language Models (LLMs) and Reinforcement Learning with Human Feedback (RLHF).
Machine Learning (ML) is a subset of Artificial Intelligence (AI) that focuses on algorithms that learn patterns from training data to make accurate inferences about new, unseen data. It sits within a hierarchy where AI is the broadest field, ML is a subfield of AI, and Deep Learning (DL)—which uses neural networks with many layers—is a subfield of ML.

The central premise of ML involves model training, a process where a machine's performance is optimized on a dataset that resembles real-world tasks. A well-trained model can then apply the patterns it has learned to infer correct outputs for new data. The deployment of this trained model is called AI inference, where it actively makes predictions on new, live data.

The Three Learning Paradigms

Most machine learning can be grouped into three main paradigms:

1. Supervised Learning
Supervised learning trains a model to predict a correct output using labeled examples, often referred to as "ground truth." This process typically requires a human to provide the correctly labeled data.

Regression Models: Predict continuous numerical values, such as price predictions or temperature forecasts.
Linear Regression: Finds the best-fit straight line through data points.
Polynomial Regression: Captures non-linear relationships in the data.
Classification Models: Predict discrete classes or categories.
Binary Classification: Assigns an item to one of two categories (e.g., spam or not spam).
Multi-class Classification: Assigns an item to one of many categories.
Multi-label Classification: Assigns multiple relevant tags or labels to a single item.

Modern techniques often use ensemble methods, which combine multiple models to achieve higher accuracy.
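
The whole supervised recipe fits in a few lines of scikit-learn (toy data; illustrative only):

from sklearn.linear_model import LinearRegression, LogisticRegression

X = [[1], [2], [3], [4]]                                 # one feature per item
reg = LinearRegression().fit(X, [2.1, 3.9, 6.2, 7.8])    # regression: numeric target
clf = LogisticRegression().fit(X, [0, 0, 1, 1])          # binary classification
print(reg.predict([[5]]), clf.predict([[5]]))            # infer on unseen data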

A related approach is semi-supervised learning, which uses a small amount of labeled data along with a large pool of unlabeled data. This method allows the model to generalize from the labeled examples to the unlabeled data, reducing the need for costly and time-consuming data labeling.

2. Unsupervised Learning
Unsupervised learning works with unlabeled data to discover hidden structures and patterns on its own.

Clustering: Groups similar items together.
K-Means Clustering: Assigns items to a pre-determined number (k) of groups by repeatedly calculating group averages (centroids) until they stabilize. This is useful for tasks like customer segmentation (e.g., bargain hunters, loyal customers); a minimal sketch follows this list.
Hierarchical Clustering: Builds a tree of clusters by starting with each item as its own group and progressively merging the most similar groups. This allows for creating broad or fine-grained clusters depending on where the tree is "cut," which is useful for organizing IT tickets into themes.
Dimensionality Reduction: Reduces the complexity of data by representing it with a smaller number of features while retaining meaningful characteristics. This is often used for data preprocessing, compression, and visualization. Common algorithms include Principal Component Analysis (PCA) and autoencoders.
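
The K-Means loop mentioned above, in plain NumPy (toy data):

import numpy as np

rng = np.random.default_rng(0)
points = rng.random((200, 2))                            # e.g. customers as 2-D features
centroids = points[rng.choice(200, 3, replace=False)]    # k = 3

for _ in range(20):
    # Assign every point to its nearest centroid ...
    labels = ((points[:, None] - centroids) ** 2).sum(-1).argmin(1)
    # ... then recompute each centroid as its group's average.
    centroids = np.array([points[labels == k].mean(0) for k in range(3)])
print(centroids)                                         # stabilized group centers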

3. Reinforcement Learning (RL)
In reinforcement learning, an agent interacts with an environment. The agent observes the current state, chooses an action, and receives a reward or penalty from the environment. Through trial and error, the agent learns a policy that maximizes its long-term rewards.

A key challenge in RL is balancing exploration (trying new actions) with exploitation (repeating actions that have worked well in the past). A classic example is a self-driving car, where the state comes from GPS and cameras, actions are steering and braking, and rewards are given for safe progress while penalties are applied for hard braking or collisions.
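
The exploration-exploitation trade-off reduces to a few lines, shown here as the standard epsilon-greedy rule (illustrative, not from the talk):

import random

def choose_action(q_values, epsilon=0.1):
    if random.random() < epsilon:
        return random.choice(list(q_values))       # explore: try something new
    return max(q_values, key=q_values.get)         # exploit: best known action

q = {"steer_left": 0.2, "steer_right": 0.5, "brake": -0.1}
print(choose_action(q))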

From Classic ML to Modern Applications

Techniques like regression, classification, and clustering a...

Full story
Marc Andreessen and Ben Horowitz on the State of AI

A discussion with Marc Andreessen and Ben Horowitz on the true nature of AI creativity, the limitations of intelligence in leadership, why the current AI boom is not a bubble, and the coming platform shifts and geopolitical race in robotics.
On AI, Creativity, and Intelligence

A common critique of Large Language Models (LLMs) is that they cannot produce genuinely new ideas or creative works, but rather remix their training data. Marc Andreessen challenges this by questioning the definition of human creativity and intelligence itself. He argues that true conceptual breakthroughs are exceedingly rare in human history; most progress, whether in technology or the arts, is the result of decades of prior work and "remixing" existing ideas. Even a genius like Beethoven was heavily influenced by his predecessors. The standard for AI, therefore, should not be a mythical ideal of pure invention but whether it can match or exceed the innovative capacity of the vast majority of humans. If a model can clear the bar of 99.99% of humanity, it represents a monumental leap.

Ben Horowitz echoes this sentiment from the world of music. Through his work with hip-hop legends, he notes that true "conceptual innovators" like Rakim or Dr. Dre are incredibly rare, representing a tiny fraction of all artists. Most artists, particularly in hip-hop, are interested in AI as a powerful creative tool, seeing it as a natural extension of their own methods of sampling and reinterpreting existing music to create something new.

Intelligence, Leadership, and Theory of Mind

The conversation challenges the assumption that superior intelligence inevitably leads to power or control. Andreessen points out the fallacy in this thinking by observing the world around us: "the PhDs all work for MBAs." While IQ shows a positive correlation (around 0.4 in social sciences) with successful life outcomes, it fails to explain the majority of what drives success and leadership.

Leadership requires a different set of skills beyond raw intellect. Horowitz emphasizes qualities like emotional understanding, courage, motivation, and the ability to navigate difficult conversations—seeing decisions through the eyes of the team rather than just one's own.

This leads to the concept of Theory of Mind: the ability to model the mental state of others. Andreessen highlights a fascinating finding from the U.S. military: a leadership problem arises if a leader's IQ is more than one standard deviation away from their subordinates, in either direction. A leader who is significantly smarter than their team can lose their "theory of mind" for them, becoming unable to model their thought processes and connect effectively. This suggests a superintelligent AI with a 1000 IQ might be too "alien" to manage human systems effectively.

Further, human cognition is not a disembodied process. Andreessen argues against mind-body dualism, suggesting that intelligence is a full-body experience involving everything from our gut biome to hormones. Today's AIs are "disembodied brains," and the true robotics revolution will begin when AI is integrated into physical forms that can experience and learn from the world.

The "AI Bubble" and Market Fundamentals

Ben Horowitz asserts that we are not in an AI bubble precisely because the question is still being debated. A true bubble is a psychological phenomenon characterized by "capitulation," where everyone, including skeptics, comes to believe it is not a bubble. Unlike the dot-com era where market size had to catch up to valuations, the AI space is currently characterized by immense, tangible short-term demand.

Andreessen brings the discussion back to two ground-truth fundamentals:
1. Does the technology actually work? Yes, it delivers on its promise.
2. Are customers paying for it? Yes, they are.

As long as these two conditions hold, the market is grounded in reality, not hype.

Platform Shifts and the Future of UX

While the current battle appears to be between incumbents like Google and new entrants like OpenAI, the ultimate product form factors for AI are still unknown. Andreessen draws a historical parallel to the...

Full story
The Mathematical Foundations of Intelligence [Professor Yi Ma]

Professor Yi Ma presents a unified mathematical theory of intelligence built on two principles: parsimony and self-consistency. He challenges the notion that large language models (LLMs) understand, arguing they are sophisticated memorization systems, and demonstrates how architectures like the Transformer can be derived from the first principle of compression.
Professor Yi Ma proposes that intelligence, both natural and artificial, can be understood scientifically through a mathematical framework built upon two fundamental principles: parsimony and self-consistency. This perspective aims to clarify common misunderstandings about AI, explain the true nature of deep learning models, and outline what is required to build genuinely intelligent systems.

The Two Pillars of Intelligence

The core of intelligence, particularly the kind responsible for forming memory and world models, revolves around discovering what is predictable and structured in the external world.

1. Parsimony (Compression): The first principle is the relentless pursuit of simplicity. Intelligence is the process of compressing high-dimensional sensory data to find its intrinsic low-dimensional structure. This is not merely data compression in a technical sense, but the fundamental act of extracting knowledge. Mechanisms like denoising, dimensionality reduction, and identifying statistical correlations are all manifestations of this principle. As Einstein said of science, the goal is to "make things as simple as possible, but not any simpler."

2. Self-Consistency (Closed-Loop Learning): The second principle, "not any simpler," ensures the learned model is faithful to reality. An intelligent agent must continuously use its internal model to predict future states, compare those predictions with new observations, and use any discrepancy (error) to correct and refine the model. This creates a closed feedback loop, an idea central to cybernetics. This process allows the model to become increasingly accurate and self-consistent with the world, enabling continual and lifelong learning without direct supervision on the "ground truth" error in the data space. The low-dimensional nature of the world's data is what makes this closed-loop correction possible.
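
Schematically, the closed loop is a three-step cycle: predict, compare, correct. A deliberately simple sketch of the principle (not Professor Ma's formulation):

def closed_loop_step(model, observe, lr=0.1):
    prediction = model["estimate"]        # use the internal model to predict
    observation = observe()               # compare against a new observation
    error = observation - prediction      # the discrepancy ...
    model["estimate"] += lr * error       # ... corrects and refines the model
    return error

m = {"estimate": 0.0}
for _ in range(50):
    closed_loop_step(m, lambda: 3.0)      # converges toward the observed world
print(m["estimate"])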

LLMs: Memorization Masquerading as Understanding

A central point of confusion in modern AI is the nature of Large Language Models (LLMs). Professor Ma argues that we are conflating the mechanism of learning with the act of understanding.

Language is Already a Compressed Code: Natural language is not raw data. It is the result of millennia of human intelligence compressing knowledge about the physical world into a symbolic code. Language is a set of pointers to shared, grounded simulations in our minds.
Applying the Wrong Tool: Current LLMs apply the same compression mechanism used to learn from raw sensory data (like vision) to the already-compressed code of language. This process is effective at identifying and memorizing the statistical structures within the vast corpus of human text.
Memorization vs. Understanding: The result is a system that can regenerate text that is statistically plausible, effectively emulating how humans solve logical problems. However, this is akin to memorizing the process rather than understanding the underlying logic. It's the difference between memorizing mathematical proofs and mastering the deductive mechanism of mathematics itself.

The Leap from Empirical Knowledge to Scientific Abstraction

Professor Ma identifies a crucial "phase transition" in the development of intelligence that current AI has yet to make: the leap from empirical knowledge to scientific abstraction.

Empirical Knowledge: This is gained through passive observation, trial-and-error, and compression of sensory data. This is the level at which animals and current AI systems operate.
Scientific Abstraction: This involves the ability to hypothesize, create abstract concepts (e.g., infinity, parallel lines that never meet), and use rigorous, deductive logic. This form of intelligence allows us to create knowledge that is not directly present in the observed data.

The key open question for the future of AI is identifying the mechanism that enables this transi...

Full story
NVIDIA’s Jensen Huang on Reasoning Models, Robotics, and Refuting the “AI Bubble” Narrative

NVIDIA CEO Jensen Huang discusses the state of AI as we begin 2026, covering rapid improvements in reasoning, the profitability of inference, why AI will increase productivity without taking jobs, the future of robotics, the importance of open source, and which sectors are poised for their 'ChatGPT moment'.
Reflecting on the biggest AI surprises of 2025, the rapid improvements in reasoning, grounding, and the connection of models to search tools stand out. The industry effectively addressed skepticism around hallucination by making significant leaps in improving the quality and accuracy of AI-generated answers. A particularly pleasant surprise was the rapid, exponential growth in profitable inference tokens. Companies are now generating tokens with such high value that customers are willing to pay good money for them, indicating the creation of real economic value.

AI's Economic Impact: Jobs, Labor, and Productivity

A common narrative suggests AI will lead to widespread job loss, but this overlooks several key factors.

New Infrastructure and Skilled Labor Demand
The rise of AI has created a need for new "AI factories" to generate tokens, which in turn has spurred the emergence of three new types of industrial plants:
Chip Plants: Manufacturing the fundamental silicon.
Computer Plants: Assembling new types of supercomputers, where an entire rack can function as a single GPU.
AI Factories: The data centers that run the models.

The construction and operation of these facilities are creating enormous demand for skilled labor, including construction workers, plumbers, electricians, and network engineers, leading to significant wage growth in these professions.

The Task vs. Purpose Framework
It's crucial to distinguish between the tasks of a job and its purpose. AI automates tasks, not purposes. The example of radiology is illustrative: years ago, it was predicted that AI would eliminate the need for radiologists. While today nearly 100% of radiology applications are AI-powered, the number of radiologists has actually increased.
The task of a radiologist is to study scans, but the purpose is to diagnose disease.

By automating the task of studying scans, AI allows radiologists to analyze more scans, more deeply, leading to better diagnoses. This increases the hospital's productivity, allowing them to serve more patients and generating more revenue, which in turn creates demand for more radiologists. The same principle applies to software engineering, where the purpose is to solve problems, and coding is just one of the tasks.

Solving Labor Shortages with Robotics
Physical AI and robotics are not primarily about replacing workers, but about solving severe labor shortages in areas like manufacturing and trucking, which are exacerbated by an aging global population. Furthermore, a future with a billion robots will create the largest repair and maintenance industry the world has ever seen, generating entirely new categories of jobs.

The AI Technology Stack and Ecosystem

To understand the dynamics of the industry, it's helpful to view AI through a framework.

The Five-Layer Cake
The technology stack enabling AI can be visualized as a five-layer cake:
1. Energy: The fundamental input.
2. Chips: The specialized processors.
3. Infrastructure: The hardware (data centers, supercomputers) and software (orchestration) stack.
4. Models: The AI itself, which is a diverse system of models for various modalities beyond human language, including biology, chemistry, and physics.
5. Applications: The industry-specific tools built on top of the models (e.g., Harvey for law, Open Evidence for medicine, Cursor for coding).

The Myth of "God AI" and the Importance of Open Source
The narrative of a single, monolithic "God AI" that does everything is unhelpful and distracts from the practical reality. AI is a diverse field, and different industries require specialized models. No single entity is close to creating an AI that has supreme understanding of human language, genomics, molecular biology, and physics simultaneously.

In this diverse ecosystem, open source is essential. Without it, innovation in countless industries—from healthcare to manufacturi...

Full story
Post-training best-in-class models in 2025

An expert overview of post-training techniques for language models, covering the entire workflow from data generation and curation to advanced algorithms like Supervised Fine-Tuning (SFT), Direct Preference Optimization (DPO), and Reinforcement Learning (RL), along with practical advice on evaluation and iteration.
Post-training is the crucial process of transforming a pre-trained base model, which can only perform token completion, into a sophisticated model capable of following instructions and answering questions. This process is an iterative cycle of data curation, training, and evaluation, essential for creating specialized and high-performing models.

Supervised Fine-Tuning (SFT): The Foundation

The first step in post-training is Supervised Fine-Tuning (SFT). This involves training the base model on a large, high-quality dataset of instruction-answer pairs, often exceeding one million samples for general-purpose models.

Data Quality and Structure
The quality of the SFT dataset is paramount. A good dataset must be:
Accurate: The answers must be factually correct.
Diverse: It should cover a wide range of topics and tasks.
Complex: The tasks should be challenging enough to facilitate model learning.

The typical data structure includes an optional system prompt, a user instruction, and the expected model output. During training, the loss is calculated only on the model's generated output, making the quality of the provided answers critically important. A common data generation pipeline involves using a powerful LLM to generate responses to seed prompts with specific constraints, followed by automated checks, filtering, and decontamination to prevent training on test data.
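
A sketch of the "loss only on the output" convention in PyTorch/Hugging Face style, where prompt-token labels are set to -100 so cross-entropy ignores them (shapes and values are illustrative):

import torch
import torch.nn.functional as F

vocab = 50_000
tokens = torch.tensor([[11, 42, 7, 99, 13, 5]])   # [system + instruction | answer]
labels = tokens.clone()
labels[:, :3] = -100                              # mask the first 3 (prompt) tokens

logits = torch.randn(1, 6, vocab)                 # stand-in for the model's output
loss = F.cross_entropy(logits[:, :-1].reshape(-1, vocab),
                       labels[:, 1:].reshape(-1),
                       ignore_index=-100)         # only answer tokens carry loss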

SFT Techniques and Parameters
Full Fine-Tuning: Updates all model parameters, maximizing potential quality but requiring significant computational resources.
Parameter-Efficient Fine-Tuning (PEFT):
LoRA (Low-Rank Adaptation): Freezes the base model's weights and introduces small, trainable matrices (adapters). This drastically reduces the number of trainable parameters (e.g., to 0.1%), saving VRAM and speeding up training; a minimal sketch follows this list.
QLoRA (Quantized LoRA): Further reduces memory requirements by loading a quantized (e.g., 4-bit) version of the model before applying LoRA. This is a trade-off, as it can lead to a degradation in quality compared to standard LoRA.
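
Typical LoRA usage with Hugging Face's peft library looks roughly like this (model choice and hyperparameters are illustrative, not a recommendation):

from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model

model = AutoModelForCausalLM.from_pretrained("gpt2")
config = LoraConfig(
    r=8,                          # rank of the adapter matrices
    lora_alpha=16,                # scaling applied to the adapter output
    target_modules=["c_attn"],    # GPT-2's attention projection
    lora_dropout=0.05,
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, config)
model.print_trainable_parameters()   # typically a fraction of a percent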

The most critical hyperparameter to tune is the learning rate. An excessively high learning rate can cause "loss spikes," leading to a collapse in model performance. Monitoring the training loss for a smooth, descending curve is a key indicator of a successful run.

Preference Alignment with DPO

Direct Preference Optimization (DPO) is a powerful technique for aligning a model's behavior and style with human preferences. It moves beyond simple instruction-following to refine the nuances of the model's responses.

DPO uses a preference dataset composed of prompts, "chosen" (preferred) answers, and "rejected" (less preferred) answers. The training objective is contrastive: it increases the likelihood of the model generating responses similar to the chosen examples while decreasing the likelihood of generating those similar to the rejected ones.

A key hyperparameter in DPO is beta, which controls how closely the model must adhere to the reference model. A low beta allows for more exploration, while a high beta keeps the model's behavior constrained.
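
Written out, the DPO objective is a single contrastive term. Given the summed log-probabilities of each answer under the policy and the frozen reference model:

import torch.nn.functional as F

def dpo_loss(pi_chosen, pi_rejected, ref_chosen, ref_rejected, beta=0.1):
    # How much more the policy prefers "chosen" over "rejected",
    # relative to the reference model.
    margin = (pi_chosen - ref_chosen) - (pi_rejected - ref_rejected)
    return -F.logsigmoid(beta * margin).mean()   # higher beta = tighter constraint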

DPO is highly effective at creating models that humans prefer, as measured by metrics like the Chatbot Arena Elo score. However, it's important to note that human preference is often weakly correlated with performance on academic benchmarks for tasks like math or reasoning.

Advanced Reasoning with Reinforcement Learning (RL)

For complex reasoning tasks like math and coding, Reinforcement Learning (RL) offers a powerful training paradigm. A popular approach, used for models like DeepSeek, involves a multi-stage process.

1. SFT Warm-up: The model is first fine-tuned on a specialized dataset where each answer is preceded by a "reasoning trace" or chain-of-thought. This teaches the model to structure its thinking process before providi...

Full story