Tokenless - The Best of AI, ML & CS Talks
Daily posts on the internals of AI, ML, and CS — straight from the experts. No hype, no bullshit news.

#AI #AInews #newsletter
AI Agents: Transforming Anomaly Detection & Resolution

Martin Keen explores how agentic AI can significantly reduce IT downtime and Mean Time To Repair (MTTR) by moving beyond naive data dumps and embracing context-aware analysis. The key lies in using topology-aware correlation to curate relevant data for an AI agent, which can then systematically identify the root cause, provide explainable insights, and generate actionable remediation steps, ultimately augmenting human SREs rather than replacing them.
The initial moments of an IT system anomaly are critical. With every minute of downtime costing thousands of dollars, the 22-minute average it takes for a woken-up Site Reliability Engineer (SRE) to reach full cognitive productivity—a phenomenon known as sleep inertia—can be incredibly expensive. Agentic AI offers a powerful solution to accelerate anomaly detection and resolution, but its application requires a nuanced approach.

The Pitfall of the Brute-Force AI Approach

A common but flawed idea is to pipe the massive volumes of telemetry data (logs, traces, metrics) generated by IT environments directly into a Large Language Model (LLM) and ask it to find the root cause. This "brute-force" method is doomed to fail for two main reasons:

Context Window Limitations: While LLMs have large context windows, they are not infinite. A single node cluster can generate gigabytes of log data per hour, easily overwhelming the model.
Noise-Induced Hallucination: Overfeeding an LLM with vast amounts of unrelated, noisy data causes it to "hallucinate." The model's goal is to predict plausible words based on statistical patterns, not to verify facts. It will confidently fabricate causal links between coincidental events like benign restarts and old warning logs, creating a neat but entirely imaginary narrative.
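
A rough back-of-the-envelope estimate makes the context window mismatch concrete. Every constant below is an assumption chosen for illustration (about 4 bytes of log text per token, a 1-million-token window, 2 GiB of logs per hour); none of these figures come from the talk.

```python
# Back-of-the-envelope estimate; every constant here is an illustrative assumption.
BYTES_PER_TOKEN = 4                 # rough average for English-like log text (assumed)
CONTEXT_WINDOW_TOKENS = 1_000_000   # hypothetical large context window
LOG_BYTES_PER_HOUR = 2 * 1024**3    # hypothetical: 2 GiB of logs per hour


def tokens_for(num_bytes: int, bytes_per_token: int = BYTES_PER_TOKEN) -> int:
    """Estimate how many tokens a blob of text would occupy."""
    return num_bytes // bytes_per_token


def windows_needed(num_tokens: int, window: int = CONTEXT_WINDOW_TOKENS) -> int:
    """Number of context windows required to hold num_tokens (ceiling division)."""
    return -(-num_tokens // window)


hourly_tokens = tokens_for(LOG_BYTES_PER_HOUR)
print(f"~{hourly_tokens:,} tokens per hour")
print(f"~{windows_needed(hourly_tokens):,} context windows to hold one hour of logs")
```

Under these assumptions, one hour of logs would fill hundreds of context windows, which is why curation has to happen before the model sees anything.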

The Solution: Context Curation and Topology-Aware Correlation

To effectively use AI, we must move from a brute-force dump to strategic context curation. Instead of feeding the AI a firehose of data, an AI agent should be given only the signals that matter for the specific incident at hand.

This is achieved through topology-aware correlation. A modern observability platform maintains a real-time dependency graph—a map of how all services connect and depend on each other. When an incident occurs, such as an authentication service failing:

⦁ The agent uses this dependency graph to pull telemetry data only from involved components (e.g., the user database it relies on, the Redis cache it uses for sessions).
⦁ It intelligently ignores data from unrelated components, even if they are running on the same cluster (e.g., a reporting microservice).
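
The selection step described above can be sketched as a graph traversal. This is a minimal illustration, not how any particular observability platform works: the service names, the `DEPENDENCIES` map, and the telemetry record shape are all hypothetical.

```python
from collections import deque

# Hypothetical dependency graph: service -> services it depends on.
DEPENDENCIES = {
    "auth-service": ["user-db", "redis-sessions"],
    "reporting-service": ["warehouse-db"],
    "user-db": [],
    "redis-sessions": [],
    "warehouse-db": [],
}


def involved_components(graph, failing_service):
    """Return the failing service plus everything it transitively depends on (BFS)."""
    seen = {failing_service}
    queue = deque([failing_service])
    while queue:
        current = queue.popleft()
        for dep in graph.get(current, []):
            if dep not in seen:
                seen.add(dep)
                queue.append(dep)
    return seen


def curate(telemetry, graph, failing_service):
    """Keep only telemetry records emitted by components involved in the incident."""
    scope = involved_components(graph, failing_service)
    return [record for record in telemetry if record["component"] in scope]


# Made-up telemetry records for illustration.
telemetry = [
    {"component": "auth-service", "message": "login failed: upstream timeout"},
    {"component": "redis-sessions", "message": "connection pool exhausted"},
    {"component": "reporting-service", "message": "nightly job restarted"},
]
curated = curate(telemetry, DEPENDENCIES, "auth-service")
print([r["component"] for r in curated])
```

The reporting service's restart never reaches the agent, even though it happened on the same cluster, because it is not reachable from the failing service in the graph.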

The Agentic Investigation Process

Once an incident alert is triggered, the AI agent begins its work using the curated MELT (Metrics, Events, Logs, Traces) data. The process follows a continuous feedback loop:

1. Perceive: The agent takes in the curated data related to the incident.
2. Reason: Using causal AI, it analyzes the data to form a hypothesis about the probable root cause.
3. Act: Based on its hypothesis, it systematically requests additional, specific data to validate or refine its understanding. For example, if a web service is slow, it might fetch logs, see a database connection error, then retrieve database metrics, notice a recent update, and finally check for configuration changes.
4. Observe: It analyzes the results of its actions and data requests, feeding this new information back into the loop.
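
The loop above can be sketched in a few lines. This is a toy illustration only: a static lookup table stands in for both the live data sources and the causal reasoning step, and the source names and findings in `EVIDENCE` are hypothetical, loosely following the slow-web-service example.

```python
# Hypothetical evidence store: each "tool" returns a finding and, optionally,
# the next place worth looking. A real agent would query live systems.
EVIDENCE = {
    "web_logs": ("database connection error", "db_metrics"),
    "db_metrics": ("latency spike after recent update", "config_changes"),
    "config_changes": ("connection pool size lowered in last deploy", None),
}


def investigate(start, evidence, max_steps=10):
    """Run a perceive/reason/act/observe loop and return the evidence trail."""
    trail = []
    next_source = start                              # perceive: initial curated signal
    for _ in range(max_steps):                       # guard: the loop may not converge
        if next_source is None or next_source not in evidence:
            break
        finding, follow_up = evidence[next_source]   # act: request specific data
        trail.append((next_source, finding))         # observe: record the result
        next_source = follow_up                      # reason: choose the next check
    return trail


trail = investigate("web_logs", EVIDENCE)
for source, finding in trail:
    print(f"{source}: {finding}")
print("probable root cause:", trail[-1][1])
```

Returning the whole trail rather than only the final finding mirrors the explainability requirement: the supporting evidence is kept alongside the conclusion.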

This iterative process leads to the identification of the probable root cause, backed by crucial explainability. The agent presents its chain of thought and the supporting evidence that led to its conclusion, allowing a human SRE to supervise and validate the analysis.

From Analysis to Action: AI-Assisted Resolution

Identifying the root cause is only half the battle. Agentic AI also assists the SRE in the resolution process in four key ways:

Validation: The agent generates specific steps an SRE can take to manually validate that the identified root cause is correct, ensuring a human remains in control before applying a fix to a production system.
Runbook Generation: It produces a step-by-step runbook—an ordered action plan to remediate the issue. This guides the SRE through the fix, which is especially helpful if they aren't an expert on that specific component.
Aut...

Aaron Levie and Steven Sinofsky on the AI-Worker Future

Experts from a16z, Box, and Microsoft debate the definition and future of AI agents. They explore the shift from monolithic AGI to specialized agent networks, the technical challenges of autonomous systems, and how this new platform will reshape enterprise software, workflows, and the very nature of work.
The ultimate end state of AI is not the conversational, back-and-forth form factor we first saw, but rather autonomous agents that run in the background to execute real work with minimal human intervention. The more work an AI can perform without needing input, the more "agentic" it becomes.

Defining AI Agents: Beyond the Chatbot

An AI agent can be defined by a few key characteristics. First, it is a long-running, autonomous process that executes tasks in the background. Steven Sinofsky draws a parallel to the ampersand (&) in Linux, which runs a command as a background job.

More importantly, a true agent possesses a feedback loop: it produces output that it then feeds back into itself as input to guide its next actions. This distinguishes it from a simple, long-running inference task. This capacity for self-correction and iterative improvement is a core technical challenge, as these non-linear control systems can be difficult to manage and their convergence is not guaranteed.

The Architecture of Agency: AGI vs. Specialized Networks

The discourse around Artificial General Intelligence (AGI) has shifted significantly. The initial concept of a single, monolithic, super-intelligent system is giving way to a more practical and effective architecture: a network of many specialized agents.

This approach mirrors the Unix philosophy of breaking down complex problems into smaller, manageable tools. Each agent becomes a deep expert on a specific task or domain. A higher-level system then orchestrates these agents to solve more complex problems. This "division of labor" offers several advantages:

Mitigates Context Rot: Large context windows can confuse models and degrade the quality of their output. By partitioning tasks among specialized agents, each one works with a more focused and relevant context. For example, some developers are creating a dedicated sub-agent for each microservice in their codebase, with its own "readme" file for instructions.
Drives Specialization: This trend is a counter-narrative to the idea of a single, all-knowing AGI. It suggests that progress lies in creating more agents that perform narrower tasks with more specific, complex instructions. Prompts are becoming more sophisticated, not less, as users provide detailed context to guide these specialized systems.

Reshaping Work: From Tool Adoption to Workflow Transformation

The introduction of AI agents is not just about automating existing tasks; it's about fundamentally changing how work is done, a pattern seen in previous platform shifts.

Initially, new tools often mimic old workflows. The first word processors were used to fill in pre-printed expense reports, rather than generating the entire document from scratch. Over time, however, the workflow itself adapts to the new technology's capabilities. We will likely see a similar evolution where we don't just ask agents to conform to our current processes, but instead redesign our processes to best leverage what agents can do.

A key debate is whether AI represents a more profound shift than previous platforms like the internet or the PC. For the first time, programs are "abdicating logic" to a third-party model, not just offloading resources like storage or device drivers. However, one can also view this as a new level of abstraction. When Windows introduced standardized print drivers, it abstracted away complex logic that developers previously had to build themselves, which in turn unlocked new opportunities for developers who embraced the platform.

Currently, agents are most effective as productivity multipliers for experts. An expert developer can use an AI coding assistant to achieve a 10x productivity gain because they can quickly verify the output, identify errors, and provide precise guidance. A novice, lacking this judgment, would be less effective and could even introduce flawed code.

The Business Landscape: Verti...

Context Engineering for Engineers

Jeff Huber of Chroma argues that building reliable AI systems hinges on 'Context Engineering'—the deliberate curation of information within the context window. He challenges the efficacy of long-context models, presenting a 'Gather and Glean' framework to maximize recall and precision, and discusses specific challenges and techniques for AI agents, such as intelligent compaction.
Building effective AI systems is not about mastering "prompt engineering" or the latest RAG technique, but about the disciplined practice of Context Engineering. An AI system can be understood simply as a program: it takes an instruction set, relevant information, and user input to produce an output. The core task of a builder is to control the "relevant information" that goes into the context window to create reliable, fast, and cheap software.

The Illusion of Long Context

The industry's push towards massive context windows—from one million to ten million tokens—is not the panacea it appears to be. While impressive on benchmarks, performance in practical applications degrades sharply long before these limits are reached.

A technical report from Chroma demonstrates this phenomenon. On simple tasks that a human could perform easily, model performance drops precipitously as the input length increases, with significant degradation observed as early as 10,000 tokens.

The primary benchmark used to substantiate long-context capabilities is the "Needle in a Haystack" test. This test is fundamentally flawed as a measure of real-world utility for two reasons:
1. Low Attention Requirement: By definition, the model only needs to find and pay attention to a single piece of information (the "needle"), ignoring the vast majority of the context.
2. Zero Reasoning Power: The task requires simple pattern matching, not complex reasoning. For example, finding the sentence "The best writing advice... was to write every week" in response to a question about the best writing advice.

Most valuable AI tasks, such as summarization or agentic workflows, require the model to pay attention to a large portion of the context and apply significant reasoning power. Relying on long context alone for these tasks leads to a significant drop in performance. Focused, curated context provides massive gains in performance compared to feeding the model the full, unfiltered context.

A Framework for Context Engineering: Gather and Glean

The central challenge is deciding what information, out of all the information in the universe, should be in the context window for any given turn. This can be approached with a two-stage model:

1. Gather (Maximize Recall): The first stage is to collect all potentially relevant information. The goal is to maximize recall, even at the expense of including some irrelevant data. This can involve creating a query plan that probes multiple sources:
⦁ Structured data (e.g., SQL queries)
⦁ Unstructured data (e.g., from a vector database like Chroma)
⦁ APIs and other tools
⦁ Web search results
⦁ Chat conversation history

2. Glean (Maximize Precision): The second stage is to filter the gathered data down to a pristine set of highly relevant, non-distracting information. This is a process of maximizing precision. Common techniques include:
⦁ Top-K vector similarity search.
⦁ Reciprocal Rank Fusion (RRF) to combine results from multiple retrievers.
⦁ Learning to Rank (LTR) models.
⦁ Dedicated reranking models.
⦁ Using LLMs themselves—often small, fast, and cheap models run in parallel—to "brute force" the search and curation process.
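
As one concrete instance of the glean stage, Reciprocal Rank Fusion can be sketched as follows. Each document's fused score is the sum of 1 / (k + rank) over the ranked lists it appears in; k = 60 is a commonly used default. The document ids and the two retriever result lists are made up for illustration.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings, k=60):
    """Fuse ranked lists: score(d) = sum over lists of 1 / (k + rank of d)."""
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first; ties broken by document id for determinism.
    return sorted(scores, key=lambda d: (-scores[d], d))


# Hypothetical results from two retrievers run during the gather stage.
vector_hits = ["doc_a", "doc_b", "doc_c"]
keyword_hits = ["doc_b", "doc_d", "doc_a"]

fused = reciprocal_rank_fusion([vector_hits, keyword_hits])
top_k = fused[:2]   # glean: keep only the best few for the context window
print(top_k)
```

Because RRF uses only rank positions, it can combine retrievers whose raw scores are not comparable, such as vector distances and keyword match scores.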

Context Engineering for AI Agents

These principles are even more critical for AI agents, where the gather-and-glean loop occurs repeatedly. The agent's conversation and action history becomes a major, and rapidly growing, component of the context window. The sheer volume of code, logs, and observations generated in a few turns can be impossible for any human to parse, let alone an AI.

An interesting finding in agent performance is the value of failure:
⦁ Giving an agent access to its past failure cases helps it break out of local minima and improve performance.
⦁ Conversely, giving it access to prior success cases can be detrimental, causing the agent to get "lazy" an...

The Top 100 Most Used AI Apps in 2025

In the fifth edition of the a16z Consumer AI 100, an analysis of the most-used AI-native products reveals a market that is beginning to stabilize after a period of chaotic growth. Key trends identified include the continued dominance of AI companionship and creative tools, the significant market entry of major players like Google and xAI's Grok, the rise of Chinese AI companies on the global stage, and the emergence of a powerful new category: "vibe coding." The data suggests a future of increased verticalization, prosumer tool adoption, and the development of more sophisticated network effects beyond simple data acquisition.
The consumer generative AI ecosystem is showing signs of maturation and stabilization, a significant shift from the "total chaos" of its early days. An analysis of the top 50 AI-native web and mobile products, ranked by monthly usage rather than revenue, highlights several key trends shaping the industry.

Market Stabilization and Dominant Categories
The pace of change is slowing, with only 11 new names on the web list compared to 17 in the previous six-month period. This suggests that the market is beginning to consolidate around established players and clear use cases. Two categories continue to dominate consumer attention:
AI Companionship: This remains a major segment, with platforms like Character.ai, Janitor, and Spicy Chat consistently ranking high. The list saw three new companionship apps join the ranks, indicating sustained interest and innovation in this area.
Creative Tools: This category, encompassing image, video, and audio generation, maintains a strong presence with mainstays like Midjourney, Leonardo, and ElevenLabs.

The Rise of "Vibe Coding"
A significant new trend is the emergence of "vibe coding" platforms, which allow users to build applications with natural language. Lovable and Replit both made the main web list, demonstrating rapid growth. Analysis of these platforms reveals strong underlying business models:
High Revenue Retention: Many leading vibe coding platforms show revenue retention of 100% or more in the first three months. This suggests that users are not just experimenting but are upgrading plans and deriving continuous value, pointing towards prosumer or even enterprise use cases.
Usage Patterns: Interestingly, traffic to the creation platforms themselves (e.g., loveable.ai) is significantly higher than traffic to the applications hosted on their subdomains. This could imply two things: serious users are deploying projects on custom domains, or a large number of users are building "personal software" for themselves or a small circle, which is highly valuable to the individual even without attracting mass traffic.

Competitive Landscape: Incumbents and International Players

Google's Strong Debut
With changes in how their traffic is tracked, four distinct Google properties made a significant debut on the web list:
1. Gemini: Ranked #2 on the web, capturing about 10% of ChatGPT's traffic. On mobile, it's much closer, with half of ChatGPT's traffic, driven primarily by Android users.
2. AI Studio: Google's developer-facing model sandbox landed in the top 10, showing strong adoption among builders.
3. NotebookLM: This research and writing assistant has maintained surprisingly strong, flat-to-increasing traffic since its launch, landing at #13.
4. Google Labs: Ranked #39, this consumer-facing sandbox saw a 15% traffic spike with the release of the Veo video model.

The Debut of Grok
xAI's Grok made a powerful entrance, debuting at #4 on the web list. Its integration into the X platform and its unique features have quickly attracted a large user base. Meta AI also began to make an appearance on the web list, signaling that the competition among large language model assistants is far from over.

The Multi-Faceted Role of Chinese AI
Chinese companies are making an impact in three distinct ways:
1. Domestic Focus: Products like Alibaba's Quark, ByteDance's Doubao, and Moonshot AI's Kimi rank high, serving the large domestic market where many Western AI products are unavailable.
2. Global Exports: A new wave of startups is developing AI for a global audience, particularly in the image and video generation space (e.g., Kling, PixVerse). These models are often distributed through their own properties or aggregated on US-based platforms.
3. Hybrid Model: Some companies, like Remini, successfully serve both domestic and international markets, with its top traffic sources bei...

Intelligence Isn't What You Think

Dr. Michael Timothy Bennett challenges conventional AI paradigms, arguing for a new approach inspired by the principles of living systems. He critiques the separation of software and hardware ("computational dualism"), redefines intelligence as efficient adaptation, and offers a novel theory of consciousness as a "tapestry of valence" essential for genuine intelligence.
Dr. Michael Timothy Bennett begins by challenging the conventional definitions and approaches to artificial intelligence, advocating for a perspective rooted in biology and embodied cognition. He favors Pei Wang's definition of intelligence as "adaptation with limited resources," emphasizing efficiency in terms of energy and data, a stark contrast to the "scale maxing" approach prevalent in Silicon Valley.

Critique of Formal Models and Computational Dualism

Bennett critiques formalisms like AIXI, which are based on Solomonoff induction and Occam's Razor (simplicity). While compelling, these models run into the problem of subjective complexity. The perceived simplicity of a model depends on the "interpreter" or the underlying language (abstraction layer) used by the agent. One can make an agent seem arbitrarily intelligent or stupid simply by changing the interpretative framework, making objective claims about performance difficult.

This leads to his central critique of modern AI, which he terms "computational dualism." He draws a provocative analogy to Cartesian dualism, where Descartes proposed the pineal gland as the interface between the non-physical mind and the physical body.
We have just replaced the pineal gland with a Turing machine.

He argues that treating intelligence as pure software, separate from its hardware and environment, is a fundamental mistake. The behavior of any software is contingent on the interpreter that executes it, all the way down to the physical laws governing the hardware. To understand intelligence, one must analyze the system as a whole, including its embodiment and environment—a concept known in cognitive science as enactive cognition. This view also aligns with the concept of mortal computation, where the physical substrate is inseparable from the computation itself, as opposed to the abstract, "immortal" nature of a theoretical Turing machine.

A Biologically-Inspired Vision for Intelligence

Bennett advocates for an AI that emulates the properties of living systems: self-organization, decentralization, and multi-scale adaptation.

Causality and Abstraction: True intelligence requires learning a causal model of the world, starting with a representation of the self as a causal agent. An agent must be able to distinguish between its own actions causing a change and the environment causing a change. This "causal identity for self" is fundamental to subjective experience.
The Law of the Stack: Bennett proposes a principle where the adaptability of a system's high-level abstractions (e.g., software) is contingent on the adaptability of its lower-level abstractions (e.g., hardware). Biological systems excel because they delegate adaptation down the stack, allowing for flexibility at all levels. Computers, in contrast, are like an "inflexible bureaucracy that makes decisions only at the top."
Decentralization and Constraints: Drawing on the work of Michael Levin, Bennett views systems like cancer as a failure of collective intelligence, where a cell becomes informationally isolated from the whole and reverts to primitive behavior. This can happen when a system is over-constrained. Imposing too much top-down control eliminates potentially correct policies and forces components to "break off." This suggests that AI safety should focus on designing the entire system with appropriate, minimal constraints rather than trying to rigidly align a single component.

Consciousness as a Necessary Adaptation

Bennett directly confronts the "hard problem of consciousness" by arguing that philosophical zombies—beings identical to humans but without subjective experience—are impossible in any conceivable world. He posits that consciousness is not an epiphenomenal, non-causal addition to information processing but a necessary feature of a sufficiently adaptive, intelligent system.

His theory frames subjective experien...

The Moonshot Podcast Deep Dive: Andrew Ng on Deep Learning and Google Brain

Andrew Ng, founder of Google Brain and DeepLearning.AI, discusses the history of neural networks and the foundational ideas that led to modern AI breakthroughs. He covers the controversial early bets on scale and general-purpose algorithms, the technical innovations behind Transformers, and the future democratizing effect of artificial intelligence.
The creation and success of Google Brain were driven by two core, and at the time, controversial hypotheses. The first was that scale matters. Around 2010, the prevailing academic view favored inventing novel algorithms over simply building bigger neural networks. Despite advice from senior figures that focusing on scale was not a good career move, the data generated by my students at Stanford showed a clear, undeniable trend: for every model we tried, performance improved as the model size increased. This data provided the confidence to pursue scale relentlessly.

The second core idea was the "one learning algorithm" hypothesis. Inspired by neuro-rewiring experiments in the brain, where one part of the brain tissue can learn a new function (e.g., learning to "see" after previously learning to "hear"), the question was whether we needed thousands of hand-engineered algorithms for different tasks. The hypothesis was that a single, general-purpose learning algorithm could, if fed different data (text, images, audio), learn to process each type effectively. This was heresy at the time in a field dominated by specialized models, but it has since become the foundation of modern AI.

The Early Days: Pushing Against the Current

In the early 2010s, neural networks were largely out of favor in the AI community, having been in the "wilderness" for a long time. The path to publishing in top conferences was through clever mathematical proofs, not demonstrating the power of scaled-up systems. This focus on scale was seen as lacking intellectual rigor. For researchers who had spent decades meticulously tweaking specific algorithms, the idea that a large model fed with massive amounts of data could outperform their work was emotionally wrenching.

The Google Brain project began at X after Sebastian Thrun, who deserves immense credit for its inception, encouraged me to pitch the idea of using Google's massive compute infrastructure to Larry Page. The partnership with Jeff Dean was crucial; he brought the deep computer systems expertise, while I brought the machine learning perspective. This combination allowed us to effectively leverage Google's infrastructure to scale our algorithms.

Technical Innovations and Breakthroughs

Hardware and Architecture
Initially, we were slower to embrace GPUs, partly because Google's CPU infrastructure was so brilliant and there were concerns about creating a heterogeneous and hard-to-manage compute environment. However, the need for parallel computation was undeniable.

This philosophy of designing for parallel hardware was a core, if sometimes underappreciated, aspect of the Transformer paper. Before Transformers, models for tasks like translation tried to ingest and memorize an entire sentence before generating the output. The Transformer's key innovation was the attention mechanism, which allowed the model to focus on specific, relevant parts of the input sentence as it generated the output. Crucially, the entire architecture was designed so that every step was highly parallelizable, making it a perfect fit for GPUs and TPUs. This design choice was what unlocked its ability to scale and become the foundation for today's large models.

The "Cat Video" Paper
Google Brain's "coming out moment" was the 2012 paper demonstrating unsupervised learning. We built what was likely the largest neural network in the world at the time and trained it by showing it unlabeled frames from YouTube videos. One day, my student Quoc Le showed me that a neuron in the network had learned to respond specifically to images of cats. The algorithm had, on its own and without any human labels, discovered the concept of a "cat." This was a massive breakthrough, proving that models could learn meaningful features from the world's vast stores of unlabeled data.

From Research to Real-World Application

To prove our value, we collaborated with teams across Google. Our early succe...

Moonshot Podcast Deep Dive: André Prager on Prototyping at Wing

André Prager, former Chief Engineer at Wing, discusses the core engineering philosophy of simplicity and cost-effectiveness that enabled the drone delivery service. He covers the design of key systems like the passive charging pad, the intelligent winch, the non-powered autoloader, and the iterative process of making the drones acoustically unobtrusive.
At the heart of Wing's development was a core philosophy: radical simplicity. The most scalable, robust, and affordable system is one with the fewest components. As former Chief Engineer André Prager puts it, "Everything that's not there doesn't need to be developed. Everything that's not there doesn't break." This principle guided the team away from complex, expensive solutions towards elegant, minimalist designs capable of scaling to a billion flights a year.

The Engineer as an Artist

Prager views engineering as more akin to art than pure mathematics—a creative process of discovery driven by curiosity. This mindset, rooted in childhood experiments like building an electric skateboard in the early '90s, emphasizes building unique things where the outcome isn't known beforehand. He argues that while math is easy to test, the creative and associative thinking required for innovative engineering is harder to measure but far more valuable. This approach seeks to find engineers who can make novel connections between different domains, a skill Prager believes is more difficult to find than pure mathematical proficiency.

The Real Challenge: The System, Not Just the Drone

When Prager joined Wing, the core architecture of the aircraft—a hybrid design with separate propellers for vertical hover and forward flight—was largely figured out. The fundamental challenge wasn't just keeping the drone in the air, but solving for "everything else." This included the entire operational ecosystem:

Payload Management: How to get packages onto and off the drone efficiently.
Infrastructure: Where the drones "sleep," how they charge, and how they are managed.
Air Traffic Management: How to coordinate a fleet of drones operating simultaneously.

Solving these problems required a relentless focus on cost and scalability, rejecting complex, demo-friendly prototypes in favor of systems that could be affordably deployed worldwide.

Engineering Simplicity in Action

Several key subsystems exemplify Wing's philosophy of minimalist, hardware-simple, and software-intelligent design.

The Charging Pad

The goal was a landing pad that required no complex robotics, manipulation, or precise alignment. The team avoided heavy wireless inductive charging coils due to weight constraints. Instead, they developed a passive contact-based system. The drone has small, conductive "feet," and the landing pad features a specific geometric pattern of positive and negative contacts. This geometry ensures that no matter how the drone lands on the 3x3 foot pad, its feet will complete a circuit and begin charging. The pad itself is simple, robust, and inexpensive, resembling a large printed circuit board (PCB).

The Winch and Delivery Hook

The system for lowering and retrieving packages could have been incredibly complex. The final design, however, features a hook with no moving parts and no electronics. This result was the product of a two-year development process with over 90 prototypes.

The intelligence resides in the winch motor and the software. By sensing force and position through the tether—much like sensing vibrations on a string—the system can infer a surprising amount of information:
⦁ When the package has touched the ground.
⦁ If a person is pulling on the line.
⦁ If the hook is snagged on an obstacle.

This allows for a simple, lightweight hardware implementation where intelligence and new features can be added over time through software updates.

The Autoloader

To automate the process of attaching a package, Prager developed the "autoloader," a device with no moving parts, no power, and no electronics. Inspired by the simplicity of a restaurant patio umbrella, the device consists of two arms that catch the drone's tether as it hovers nearby. The drone performs a slight sideways movement, which guides the hook through a channel and attaches it to the package via friction. This...

7 AI Terms You Need to Know: Agents, RAG, ASI & More

A deep dive into seven essential AI concepts shaping the future of intelligent systems, including Agentic AI, RAG, Mixture of Experts (MoE), and the theoretical frontier of Artificial Superintelligence (ASI).
As the field of artificial intelligence evolves at a breakneck pace, it's crucial for technology professionals to stay current with its core concepts. Here are seven essential AI terms that are shaping the future of the industry.

1. Agentic AI (AI Agents)
AI agents represent a shift from reactive models, like chatbots that only respond to one prompt at a time, to proactive systems that can reason and act autonomously to achieve goals. These agents operate in a continuous cycle:
1. Perceive: They assess their current environment.
2. Reason: They determine the next best steps to achieve a predefined goal.
3. Act: They execute the plan formulated during the reasoning stage.
4. Observe: They analyze the results of their actions and repeat the cycle.

This autonomous nature allows them to fulfill complex roles, such as a travel agent booking a trip, a data analyst identifying trends in reports, or a DevOps engineer detecting anomalies, testing fixes in containers, and rolling back faulty deployments.

2. Large Reasoning Models
AI agents are often powered by a specialized class of LLMs known as Large Reasoning Models (LRMs). Unlike standard LLMs that generate responses immediately, LRMs are fine-tuned to work through problems step by step. This methodical approach is essential for agents planning complex, multi-stage tasks.

The training process involves:
⦁ Using datasets with verifiably correct answers, such as math problems or code that can be checked by a compiler or a test suite.
⦁ Employing reinforcement learning to teach the model how to generate reasoning sequences that lead to correct final answers.
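
The role of verifiable answers can be illustrated with a toy reward function. This is a minimal sketch, assuming the model ends its reasoning with a line of the form "Answer: <value>"; that convention and the names extract_final_answer and verifiable_reward are hypothetical, and real training pipelines are considerably more involved.

```python
import re
from typing import Optional


def extract_final_answer(completion: str) -> Optional[str]:
    """Return the value after the last 'Answer:' marker, or None if absent."""
    matches = re.findall(r"Answer:\s*(.+)", completion)
    return matches[-1].strip() if matches else None


def verifiable_reward(completion: str, expected: str) -> float:
    """Score 1.0 when the final answer matches the known-correct one, else 0.0."""
    answer = extract_final_answer(completion)
    return 1.0 if answer == expected else 0.0
```

A reinforcement learning loop would use a score like this to favor reasoning sequences that end in correct answers, without needing a human to grade each intermediate step.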

When a chatbot pauses and displays a "thinking..." message, it's often an LRM at work, generating an internal chain of thought to deconstruct a problem before providing a coherent response.

3. Vector Databases
Vector databases are a fundamental component of modern AI infrastructure, particularly for handling unstructured data. Instead of storing data like text or images as raw files, an embedding model converts this data into vectors—long lists of numbers that capture the data's semantic meaning and context.

The primary advantage is that similarity searches become mathematical operations. By finding vectors that are numerically close to each other in the multi-dimensional "embedding space," the system can identify semantically similar content. For example, a search for a picture of a mountain can find other similar images, related text articles, or even thematically similar music files based on their vector proximity.
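
The idea of "numerically close" vectors can be made concrete with cosine similarity, one common closeness measure. This is a minimal sketch: the three-dimensional vectors below are made up for illustration (real embeddings come from an embedding model and have hundreds or thousands of dimensions), and the names EMBEDDINGS and nearest are hypothetical.

```python
import math


def cosine_similarity(a: list, b: list) -> float:
    """Cosine of the angle between two vectors: near 1.0 means same direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Made-up embeddings standing in for the output of an embedding model.
EMBEDDINGS = {
    "mountain photo": [0.9, 0.1, 0.0],
    "alpine hiking article": [0.8, 0.3, 0.1],
    "tax form": [0.0, 0.2, 0.9],
}


def nearest(query: list, k: int = 2) -> list:
    """Return the names of the k stored items most similar to the query vector."""
    ranked = sorted(
        EMBEDDINGS,
        key=lambda name: cosine_similarity(query, EMBEDDINGS[name]),
        reverse=True,
    )
    return ranked[:k]
```

A production vector database replaces the linear scan in nearest() with an index built for fast approximate search, but the comparison it performs is the same kind of arithmetic.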

4. Retrieval-Augmented Generation (RAG)
Retrieval-Augmented Generation (RAG) is a powerful technique that leverages vector databases to make LLM responses more accurate and context-aware. It enriches user prompts with relevant, external information before the LLM generates a response.

The process works as follows:
1. A user's input prompt is converted into a vector using an embedding model.
2. This vector is used to perform a similarity search in a vector database containing a specific knowledge base (e.g., internal company documents).
3. The relevant information retrieved from the database is then inserted into the original prompt.
4. This augmented prompt is sent to the LLM, which now has the necessary context to generate a factually grounded answer.

For instance, asking a question about company policy can trigger a RAG system to pull the relevant section from the employee handbook and include it in the prompt, ensuring the LLM's answer is accurate.
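
The four steps can be sketched end to end. This is a toy illustration under stated assumptions: embed() is a bag-of-words stand-in for a real embedding model, HANDBOOK is a hypothetical three-entry knowledge base, and the final LLM call is left out, so augment() only builds the prompt that would be sent.

```python
import math
import re
from collections import Counter

# Hypothetical knowledge base standing in for internal company documents.
HANDBOOK = [
    "Vacation policy: employees accrue 20 days of paid vacation per year.",
    "Expense policy: meals are reimbursed up to 50 dollars per day.",
    "Remote work policy: employees may work remotely three days per week.",
]


def embed(text: str) -> Counter:
    """Toy stand-in for an embedding model: a bag-of-words count vector."""
    return Counter(re.findall(r"[a-z]+", text.lower()))


def similarity(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[word] * b[word] for word in a)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def retrieve(question: str) -> str:
    query_vec = embed(question)  # Step 1: convert the prompt into a vector
    # Step 2: similarity search over the knowledge base
    return max(HANDBOOK, key=lambda doc: similarity(query_vec, embed(doc)))


def augment(question: str) -> str:
    context = retrieve(question)
    # Step 3: insert the retrieved text into the prompt.
    # Step 4 would send the returned string to the LLM.
    return f"Context: {context}\n\nQuestion: {question}"
```

Calling augment("How many vacation days do I get per year?") returns a prompt whose context is the vacation policy entry, which is what gives the LLM the material for a grounded answer.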

5. Model Context Protocol (MCP)
For LLMs to be truly useful, they must interact with a wide range of external data sources, services, and tools. The Model Context Protocol (MCP) is an emerging standard that gives these interactions a common interface.

Currently, developers often have to build custom, one-off connections for each new tool or database an LLM needs to access. MCP aims to solve th...

Small Language Models are the Future of Agentic AI Reading Group

This paper challenges the prevailing "bigger is better" narrative in AI, arguing that Small Language Models (SLMs) are not just sufficient but often superior for agentic AI tasks due to their efficiency, speed, and specialization. The discussion explores the paper's core arguments, counterarguments, and the practical implications of adopting a hybrid LLM-SLM approach.
The paper "Small Language Models are the Future of Agentic AI" posits that the trend toward ever-larger models may be misguided for agentic systems. Instead, it argues that a heterogeneous ecosystem of smaller, specialized models offers a more powerful, efficient, and economical path forward. The core intuition is to scale out by composing specialized "Lego" blocks (SLMs) rather than scaling up a single monolithic model (LLM).

The Three Pillars of the SLM Argument

The authors build their case on three primary arguments:

1. SLMs are Powerful Enough: Recent advancements have produced SLMs (e.g., Microsoft's Phi series) that achieve competitive performance on benchmarks for reasoning, language, and coding tasks when compared to models 10-20 times their size. For the vast majority of agentic tasks, which often involve a limited subset of an LLM's full capabilities, this level of performance is sufficient. The broad, general intelligence of a massive LLM can be "intelligence overkill" when an agent is only performing a narrow, specific function.

2. SLMs are Operationally Superior:
⦁ Performance: They offer significantly lower inference latency and require less memory, making them faster and easier to deploy.
⦁ Flexibility: Their small size allows for greater operational flexibility, including deployment on edge devices and consumer-grade GPUs without specialized infrastructure.
⦁ Behavioral Alignment: Agentic systems require predictable, structured interactions, often using formats like JSON or YAML. It's easier and more reliable to fine-tune an SLM to consistently produce a specific format, reducing the risk of hallucinations or formatting errors that can occur with a general-purpose LLM trained on countless formats.
⦁ Heterogeneity: Agentic workflows are naturally composed of diverse tasks with varying complexity. A system can dynamically choose the best model for each sub-task: a simple SLM for a simple task and perhaps a more powerful model only when necessary.

3. SLMs are More Economical:
⦁ Inference Costs: Serving a 7B parameter SLM can be 10-30 times cheaper than serving a 70B+ parameter LLM.
⦁ Operational Simplicity: SLMs avoid the complexities of multi-GPU and multi-node parallelization, simplifying infrastructure management and maintenance.
⦁ Fine-Tuning Agility: Fine-tuning an SLM requires only a few GPU hours, enabling rapid iteration and specialization, compared to the weeks and significant resources needed for large models.
⦁ Parameter Utilization: SLMs are fundamentally more efficient, activating a higher percentage of their parameters for a given task compared to the sparse activation in very large models.
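
The behavioral-alignment and heterogeneity points under the second pillar can be combined in a small routing sketch. This is one possible policy, assumed here for illustration and not taken from the paper: try the small model first, check that its output is a well-formed JSON tool call, and escalate to the large model only if the check fails. The Model interface, the "tool" and "arguments" keys, and the function names are all hypothetical.

```python
import json
from typing import Callable, Tuple

# Hypothetical model interface: prompt in, raw text out.
Model = Callable[[str], str]


def is_valid_tool_call(raw: str) -> bool:
    """Accept only a JSON object that has both 'tool' and 'arguments' keys."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError:
        return False
    return isinstance(data, dict) and {"tool", "arguments"}.issubset(data)


def route(prompt: str, small: Model, large: Model) -> Tuple[str, str]:
    """Return (which model answered, its raw output), preferring the small model."""
    raw = small(prompt)
    if is_valid_tool_call(raw):
        return "small", raw
    return "large", large(prompt)
```

Logging every escalation would also produce the kind of task-specific data that could later be used to fine-tune the small model on the cases it currently fails.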

A compelling argument is that agentic systems naturally evolve toward SLMs. Each interaction an agent has (prompt, output, user feedback) generates valuable training data. Even if a system starts with an LLM, this continuous stream of task-specific data creates the perfect conditions for optimizing and distilling that capability into a smaller, expert model.

Counterarguments and Practical Challenges

The discussion also highlighted significant counterarguments and real-world barriers:

⦁ Scaling Laws and Generalization: LLMs benefit from scaling laws, giving them a more nuanced and abstract understanding of concepts, multi-linguality, and multi-modality that SLMs may lack. This deep generalization might be crucial for a top-level "supervisor" agent that needs to orchestrate complex tasks.
⦁ The Cost of a Fleet: While a single SLM is cheap, managing an entire fleet of specialized models introduces its own operational complexity and costs, including infrastructure, talent, and orchestration. Centralized LLM endpoints can benefit from higher utilization, potentially making them more cost-effective at scale than multiple, under-utilized SLM endpoints.
⦁ Real-World Ne...

The Day AI Solves My Puzzles Is The Day I Worry (Prof. Cristopher Moore)

Professor Cristopher Moore of the Santa Fe Institute discusses the surprising effectiveness of AI, arguing it stems from the rich, non-random structure of the real world. He explores the limits of current models, the nature of intelligence as creative problem-solving and abstraction, the importance of grounding and shared reality, and the profound implications of computational irreducibility and the need for algorithmic transparency in high-stakes applications.
Cristopher Moore, a self-described "frog" in the world of science, prefers diving deep into concrete problems over taking a high-level "bird's-eye view". This perspective informs his analysis of artificial intelligence, computational theory, and the nature of intelligence itself.

The Structure of the World and the Success of AI

The surprising effectiveness of large models like transformers stems not from a magical architecture, but from the nature of the data they are trained on. Real-world data is neither completely random nor adversarially designed to be difficult. Instead, it is filled with rich structure, patterns, and hierarchies. Moore argues: "the real world presents us with examples of these problems where there is so much rich structure to sink your teeth into."

Any sufficiently rich architecture can learn to exploit this structure. We will likely look back and realize that what truly matters is that "the world is structured and any architecture which is capable of capturing some of that structure is going to do well at prediction." This contrasts with theoretical work in computer science and statistical physics, which often proves hardness based on worst-case adversarial examples or purely random data models. While concepts like phase transitions—sharp shifts in problem difficulty based on signal-to-noise ratios, analogous to a magnet losing its field at a critical temperature—are powerful for understanding random problems, they don't capture the full picture of real-world AI performance.

Intelligence as Creative Problem-Solving

Despite their success, current models falter on tasks requiring novel reasoning and abstraction, such as modern Sudoku variants with complex, layered rules. These puzzles, designed by humans for humans, require insights and the creation of new logical constraints on the fly, a process current AI struggles with. Moore notes that the ability of AI to absorb rules and perform intelligent search "hasn't happened yet."

This highlights a deeper aspect of intelligence: the ability to transform hard problems into simpler ones. It's about inventing heuristics and new forms of "partial knowledge" to navigate a problem space. Humans fluidly switch their approach, asking "which piece can fit here?" and then "where can this piece go?" This process of formalization and mathematization is a crucial, creative step that often constitutes 90% of the work in scientific modeling. True intelligence involves inventing the variables and constraints to address a problem, not just solving a pre-defined one.

Grounding, Meaning, and Shared Reality

A significant limitation of current language models is their lack of grounding in the physical world. When asked to summarize a nuanced essay, a model might regress to the mean, producing a "lowest common denominator" summary based on common arguments about the topic, completely missing the author's unique, subtle point. This indicates a failure to grasp meaning beyond statistical correlation.

Moore, a self-professed Platonist, believes in a shared, objective reality of abstract concepts. When two people visualize a cube, they perceive the same object with 8 corners and 12 edges. This shared perception allows for meaningful agreement and correction. He suggests that once AI systems can utilize multimodal "workspaces"—to doodle, run code, and manipulate virtual objects—they will move closer to this kind of grounded understanding.

Computation, Universality, and Irreducibility

The conversation delves into the fundamental nature of computation and its relationship to intelligence.

Computational Irreducibility: Drawing on Stephen Wolfram's work, Moore discusses systems where there are no analytical shortcuts to predict a future state. To know the outcome, "you have to do the work" of simulating every intervening step. While our only method for proving a system is irreducible is to build a universal c...

921: NPUs vs GPUs vs CPUs for Local AI Workloads — with Dell’s Ish Shah and Shirish Gupta

Shirish Gupta and Ish Shah from Dell Technologies explore the evolving landscape of AI hardware. They discuss why Windows, enhanced by WSL 2, remains a dominant platform for developers, and delve into the distinct roles of CPUs, GPUs, and the increasingly important Neural Processing Units (NPUs). The conversation covers the trade-offs between local and cloud computing for AI workloads and introduces new hardware, like workstations with discrete NPUs, that are making on-device AI more powerful and accessible than ever.