TechLead Bits

ReasoningBank

Currently AI agents have one major limitation: they cannot learn from their own experience or from the results of completed tasks. Once the model is trained, all we can do is tune our prompts or enrich the context with domain data via RAG.

Researchers from Google started exploring how to overcome this limitation and introduced a concept called ReasoningBank.

The overall idea is simple:
1. The agent writes down the result of successful or failed tasks into a dedicated md file.
2. During task execution, the agent searches the ReasoningBank and pulls relevant memories into the context.
3. Then it uses an LLM-as-a-judge approach to self-evaluate the result, analyze the trajectory of reasoning, and extract success insights or failure reasons.

Each file has the following structure (very similar to skills):
- Title: identifier of the core strategy.
- Description: short summary of the memory item.
- Content: reasoning steps, decision explanation, or operational insights extracted from past experience.
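
A minimal sketch of the memory item and the retrieval step, under stated assumptions: the class and function names are hypothetical, and the word-overlap scoring is a simple stand-in that I made up for illustration, not the retrieval method from the paper.

```python
from dataclasses import dataclass


@dataclass
class MemoryItem:
    title: str        # identifier of the core strategy
    description: str  # short summary of the memory item
    content: str      # reasoning steps or insights from past experience


def _tokens(text: str) -> set[str]:
    return set(text.lower().split())


def retrieve(bank: list[MemoryItem], task: str, top_k: int = 2) -> list[MemoryItem]:
    """Return up to top_k items that share the most words with the task.

    Word overlap is a hypothetical stand-in for a real relevance score.
    """
    task_words = _tokens(task)
    scored = [
        (len(task_words & _tokens(f"{item.title} {item.description}")), item)
        for item in bank
    ]
    relevant = [pair for pair in scored if pair[0] > 0]
    relevant.sort(key=lambda pair: pair[0], reverse=True)
    return [item for _, item in relevant[:top_k]]


def build_context(task: str, memories: list[MemoryItem]) -> str:
    """Prepend the retrieved memories to the task prompt."""
    notes = "\n".join(f"- {m.title}: {m.content}" for m in memories)
    return f"Relevant past experience:\n{notes}\n\nTask: {task}"


bank = [
    MemoryItem(
        "Check pagination first",
        "Listing pages may hide results on later pages",
        "Before concluding an item is missing, go through every page of the listing.",
    ),
    MemoryItem(
        "Run tests before commit",
        "Failed task caused by skipping tests",
        "Always run the test suite before committing a fix.",
    ),
]
hits = retrieve(bank, "find the order on the listing pages")
print(build_context("find the order on the listing pages", hits))
```

Only the first item is pulled into the context here, which is the point of the retrieval step: the bank can grow, but each task sees only the memories that look relevant.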

To be honest, benchmark results compared to other agent memory approaches do not look extremely impressive:

ReasoningBank without scaling outperformed memory-free agents by 8.3% on WebArena and 4.6% on SWE-Bench-Verified.

At the same time, this approach adds even more data to the context. And context, as we know, directly affects both model behavior quality and usage cost.

The official paper contains interesting research details, including particular prompts and measurements.

From my perspective, the idea and its implementation are very similar to skills or other long-term agent memories (e.g. in Claude Code). But the overall direction of making agents capable of learning from their own experience looks really promising.

#ai #engineering #news


AI Engineering

I strongly believe that if you want to use any technology effectively, you need to understand how it works under the hood. Especially in software engineering.

So if you haven’t looked into LLM internals yet, I’d highly recommend reading AI Engineering by Chip Huyen. The book was published in December 2024. And as AI is moving extremely fast, you might think it’s already outdated. Yes and no.

The book focuses on fundamentals. And they don’t really change that fast. You won’t find hype topics like skills, harnesses, or agent orchestration there. But for building a structured understanding of how AI works, you don't actually need them.

What I personally found useful:
🔸 Core LLM concepts: tokenization, training and post-training processes, dataset preparation. This part is very similar to the Machine Learning Crash Course from Google.
🔸 Model evaluation: quite complex but interesting topic about model output results and their comparison. The book covers ranking, model specialization, public benchmarks and AI-as-a-judge approach.
🔸 Prompt engineering: a good reference on context and prompting. The author also describes different security aspects of using prompts; that part really expanded my thinking about what can go wrong.
🔸 Finetuning: a deep dive into different ways to optimize models. You need to be a good mathematician to understand this part. So I was really glad I'm not an ML engineer 😃 (huge respect to all ML experts, it's really hard).
🔸 User feedback: basic patterns on how to collect feedback, what to measure and why, common pitfalls.

To sum up, this book is really great for structuring your knowledge about modern AI systems. Once you have that foundation, it becomes much easier to navigate all the new tools, patterns and paradigms that appear almost every month.

#booknook #ai #engineering

Amazon

AI Engineering: Building Applications with Foundation Models


Ralph Loop

The non-deterministic nature of AI sometimes produces very interesting engineering solutions. One such example is the Ralph (or Ralph Wiggum) Loop.

This is an AI coding pattern inspired by The Simpsons character Ralph Wiggum, known for saying weird things with high confidence 🙃.

The idea is simple: The agent can be dumb in a single iteration. But if it keeps retrying with feedback long enough, it eventually converges.

The loop steps:

Start a new agent/subagent -> load task + memory -> execute 1 selected task -> run validation -> save learnings -> commit progress -> repeat

The loop finishes when all tasks have passes: true or when it reaches the maximum number of iterations (default is 10).
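
The loop steps above can be sketched roughly like this. It is a simplified illustration, not the original implementation: run_agent and validate are hypothetical callbacks, the learnings list stands in for progress.txt and the memory files, and the commit step is left out.

```python
from typing import Callable


def ralph_loop(
    tasks: list[dict],
    run_agent: Callable[[dict, list[str]], str],
    validate: Callable[[dict], bool],
    max_iterations: int = 10,
) -> list[str]:
    """Repeat: pick one unfinished task, run a fresh agent, validate, save learnings."""
    learnings: list[str] = []  # state kept outside the agent's context
    for _ in range(max_iterations):
        pending = [t for t in tasks if not t["passes"]]
        if not pending:
            break  # every task has passes: true
        task = pending[0]                  # exactly one task per iteration
        note = run_agent(task, learnings)  # new session: only task + memory go in
        task["passes"] = validate(task)    # linters, type checks, tests, ...
        learnings.append(note)             # available to the next iteration
    return learnings


# Toy stand-ins: the "agent" needs two attempts per task to pass validation.
def fake_agent(task: dict, learnings: list[str]) -> str:
    task["attempts"] += 1
    return f"{task['name']}: attempt {task['attempts']}"


def fake_validate(task: dict) -> bool:
    return task["attempts"] >= 2


tasks = [
    {"name": "add-login", "passes": False, "attempts": 0},
    {"name": "fix-lint", "passes": False, "attempts": 0},
]
log = ralph_loop(tasks, fake_agent, fake_validate)
print(log)
```

Note that the loop itself stays trivial. Everything interesting happens in what gets passed into each fresh session and in how strict the validation is.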

But the real value of the technique is not in retries, it's in context engineering strategy under the hood:
🔸 One loop executes only one task. This keeps the agent focused.
🔸 Each iteration starts a new agent session. State lives outside the context, keeping it clean between iterations. State is stored in git history, progress.txt and long-term-memory files.
🔸 Tasks are delegated to subagents. The main context is not polluted with task execution details and validations.
🔸 AGENTS.md is updated on each iteration. It is a live artifact that contains discovered patterns, learnings and conventions so future iterations can benefit from those findings and do not repeat previous mistakes.
🔸 AGENTS.md contains explicit validations for the feedback loop. It usually defines linters and type checks, build and test execution commands.

Ralph Loop is a really powerful pattern to get things done: it just repeats the task until it succeeds, making agent execution more reliable. "Deterministically bad" but effective.

But this approach only works if you have good task decomposition, clear completion criteria, and mature SDLC practices with strong validations and feedback loops. Otherwise the agent will just generate a ton of mess.

#ai #engineering #patterns

Geoffrey Huntley

Ralph Wiggum as a "software engineer"

How Ralph Wiggum went from 'The Simpsons' to the biggest name in AI right now - Venture Beat



Skill Packaging

Engineering teams are actively building internal collections of skills for agents: code review, troubleshooting, design preparation, onboarding, security practices.

And it looks great until you hit the question: how do you distribute those skills across dozens of teams and multiple harnesses? For Claude you need to put skills into .claude, for Cursor into .cursor, for Gemini into .gemini, etc. And things become even messier when you need to roll out updates.

To solve this problem big companies mostly build their own in-house solutions. Smaller companies usually just copy files from some shared repository and manage this complexity manually.

I don’t like reinventing the wheel, so when my team faced the same problem, we started looking for an existing solution we could reuse. And the only actively maintained tool we managed to find was apm by Microsoft.

APM is a package manager for prompts, skills, and MCPs. In other words, it’s like Maven or Go modules, but for agents.

APM package structure:

my-package/
├── apm.yml                         
└── .apm/
    ├── instructions/
    │   └── my.instructions.md       
    ├── skills/
    │   └── my-skill/                
    │       └── SKILL.md  
    ├── agents/                      
    └── prompts/

To install the package in a target repo you need to define apm.yml with a list of required dependencies:

name: my-project
version: 1.0.0
targets:
  - claude
  - copilot
dependencies:
  apm:
  - <git-address>/my-package
  - <git-address>/another-package
  mcp: []

After that you just run:

apm install

and the required skills will be installed into the corresponding harness folders (.claude, .copilot, etc.).

The tool works with GitHub and on-prem git installations like GitLab.

APM is not perfect. It has some unpleasant (but not critical) issues, and sometimes you can really feel that it was heavily vibe-coded in Python.

But despite all that, the tool actually works: you have a spec to define your skill and prompt packages, and you can distribute and update them with a simple apm update. And on top of the apm dependency format it’s pretty easy to vibe-code your own internal skills marketplace.

#ai #engineering #agents

GitHub

GitHub - microsoft/apm: Agent Package Manager


The Fearless Organization

The most dangerous teams are the quiet teams. There are no disagreements, no bad news, no conflicts. Looks like harmony until the real incident.

That's the topic of the book The Fearless Organization: Creating Psychological Safety in the Workplace for Learning, Innovation, and Growth by Amy C. Edmondson.

Amy is a professor of Leadership and Management at the Harvard Business School. She has studied the phenomenon of psychological safety and its impact on team performance for many years across different organizations.

She defines psychological safety as follows:

a belief that one will not be punished or humiliated for speaking up with ideas, questions, concerns, or mistakes, and that the team is safe for inter-personal risk taking.

What it means in practice:
- people are not afraid to ask questions
- they do not hide problems
- they are not afraid to look stupid
- they do not avoid conflicts
- they can freely express their opinions and bring suggestions

Why does it matter?
There is a good example from the book that explains that. Imagine a doctor prescribes treatment for a child. A nurse notices that doctors usually prescribe drug A in such cases, but this time it is missing.
In a team with high psychological safety, the nurse will clarify this with the doctor and may help prevent a medical error.
In a team with low psychological safety, she may be afraid to ask. And the consequences can be dramatic.

The core idea is simple. But the book contains a lot of real stories where a low level of psychological safety leads to dramatic results (e.g. Volkswagen emission scandal, pilot mistakes that caused plane crashes). The author repeatedly highlights that the more complex and critical the profession is, the more important psychological safety becomes.

How does this relate to our daily work?
We as leaders are responsible for the psychological climate in the team: how well we listen to people, accept different opinions, react to questions, mistakes, or bad news. It's our daily routine that either helps the team become more effective, or leads people to hide problems and the real state of things.

Overall, I really liked the book. It explains the idea in simple language with many real examples. And what is important for me, all arguments and recommendations are supported by sociological research, experiments and practical psychology.
So psychological safety is not just an idea. It is a proven behavioral model and set of practices that can actually help leaders build better teams.

P.S. One of the best real examples of psychological safety is Pixar. I wrote about it earlier in my overview of Creativity, Inc.: Overcoming the Unseen Forces That Stand in the Way of True Inspiration: parts 1, 2, 3.

#booknook #softskills #leadership


Agent Readiness Framework

A few weeks ago I wrote that adopting coding agents requires strong engineering practices.
Test stability, linting, documentation, security controls matter much more than a particular harness or model.

The Agent Readiness framework is an attempt to formalize these criteria for a particular repository and define how much autonomy can be safely delegated to agents.

The framework evaluates repos across 8 dimensions:
- Style & Validation
- Build System
- Testing
- Documentation
- Dev Environment
- Code Quality
- Observability
- Security & Governance

Based on these dimensions, the framework defines 5 levels of repo maturity:
🔸 Level 1: Functional. Basic checks: README, linters, unit tests.
🔸 Level 2: Documented. Detailed documentation and basic automations: AGENTS.md, reproducible dev env, contribution guides.
🔸 Level 3: Standardized. E2E tests, observability, security scanning, maintained documentation.
🔸 Level 4: Optimized. Fast validation loops, canary deployments, build optimization. Process is optimized for fast feedback.
🔸 Level 5: Autonomous. Task decomposition, multi-service orchestration, self-healing logic, auto-remediation.
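
To make the level idea more concrete, here is a small sketch of how a maturity level could be derived from per-level check results. The scoring rule is my assumption for illustration (a level counts only if it and all lower levels pass at least 80% of their checks), and the function name and sample data are hypothetical. Check the framework description for the exact rules.

```python
LEVELS = ["Functional", "Documented", "Standardized", "Optimized", "Autonomous"]


def maturity_level(checks: dict[int, list[bool]], threshold: float = 0.8) -> int:
    """Return the highest level reached, or 0 if even Level 1 is not met.

    Assumed rule: levels are unlocked in order, and each needs a pass rate
    of at least `threshold`.
    """
    reached = 0
    for level in range(1, len(LEVELS) + 1):
        results = checks.get(level, [])
        if not results or sum(results) / len(results) < threshold:
            break  # a gap at this level blocks all higher levels
        reached = level
    return reached


# Hypothetical scan results: one boolean per criterion.
repo_checks = {
    1: [True, True, True],               # README, linters, unit tests
    2: [True, True, True, True, False],  # 4 of 5 criteria pass
    3: [True, False, False],             # E2E tests exist, the rest is missing
    4: [True, True],                     # ignored: Level 3 is not reached
}
level = maturity_level(repo_checks)
print(level, LEVELS[level - 1])
```

With this rule, strong results on Level 4 criteria do not help while Level 3 is still failing, which matches the idea that the basics come first.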

The idea is simple: the higher the maturity level, the more predictable and reliable the agent results. But looking at these levels, I can see that most repos are actually somewhere between Level 1 and Level 3.

The framework authors also provide a tool to automatically measure these criteria and the maturity level, but it's available only after registration and relies on proprietary APIs. You can find scanned examples at https://factory.ai/agent-readiness.

There is also an open-source alternative: https://github.com/kodustech/agent-readiness. The project doesn't look active, but it gets the job done. It analyzes the repo and generates a report with the overall maturity level, findings for each dimension, and suggestions for improvements. Some rules are not very accurate, and it looks like the project was mainly designed for Python and JS code verification. But anyway the tool gives you a good sense of what to pay attention to in your codebase.

What I like about this framework is that it shows that agent effectiveness is actually limited by the maturity of engineering practices. And it provides measurable and actionable results that are easy to convert into an improvement plan for a particular repo.

#ai #engineering


How Anthropic Writes Skills

Last week Anthropic published lessons learned on how they build agent skills internally. It's quite interesting to read recommendations from the company that introduced the concept in the first place.

Key ideas:
🔸 Don't be obvious. The model already knows how to code. A skill should provide instructions that change default agent behavior, not repeat the data the model was trained on.
🔸 Build a gotchas section. Add common mistakes and lessons learned. This helps the agent avoid repeating the same failures.
🔸 Use progressive disclosure. A skill is not just a SKILL.md. It can include additional files that are loaded on demand, reducing context overload.
🔸 Don't be too specific. Give the agent information it needs, but leave the flexibility to adapt to the situation.
🔸 Separate configuration from instructions. Store setup data in config.json or collect required input from the user.
🔸 Write the description for the model, not for humans. A description should help the model understand when the skill should be invoked.
🔸 Use long-term memory. Skills can maintain their own data in a subdirectory and reuse it across executions.
🔸 Automate where possible. Not everything should be a prompt. Some actions can be automated with helper scripts and functions.

Unfortunately, the article doesn't provide any guidelines on how to evaluate skill effectiveness. It's still not clear how to understand if a skill actually works, how to compare two versions of the same skill, or how to detect that a skill is no longer useful.

The recommendations provided are based mostly on observations of how popular internal skills are structured. It's useful but not measurable.

So despite the fact that skills are the most powerful agent extension now, evaluating them remains one of the hardest engineering problems.

But that's another story.

#ai #engineering

Claude

Lessons from building Claude Code: How we use skills | Claude by Anthropic

What we learned building and scaling hundreds of skills internally at Anthropic.


Project Hail Mary

Technical books and articles are great, but sometimes my brain needs a break. Especially now, when AI is generating more and more new things to learn every day. One of my favorite ways to recharge is reading fiction, and I recently finished the very popular Project Hail Mary by Andy Weir.

I'm not a big sci-fi fan, but I definitely enjoyed this book.
Thanks to the recent movie adaptation, the story is probably familiar to many.

A man wakes up alone on a spaceship with no memory of who he is or why he's there. As his memories gradually return, he discovers that he's a scientist on a mission in another star system.
Humanity is facing extinction. The Sun is losing energy because of a mysterious organism called Astrophage. Nearby stars are also infected except Tau Ceti. A crew is sent there to find out why it's different and, hopefully, save Earth.
Unfortunately, only the main character survives the journey.
He starts his scientific research of the star and eventually notices a spacecraft on his radar.

And then the almost impossible happens: first contact with an alien. The problem is, how do you communicate when you don't even share the same way of producing speech? The answer is physics. I really liked the idea that laws of physics are universal, making math and science the foundation for building communication between two civilizations.

I won't spoil the rest, but the story is really engaging. Despite being a disaster novel, it contains a good dose of humor and places a strong emphasis on friendship, kindness, and mutual help. And when you finish it, you're left with a surprisingly warm feeling.

I watched the movie after finishing the book, and for once I can say the adaptation is actually good. Of course, it's much more compact and some details are simplified, but it stays remarkably close to the original story while preserving its emotional depth.

Overall, I loved it. I highly recommend both the book and the movie.

#offtop #booknook


"AI won't take your job. Someone using AI will."

This quote caught my attention and made me watch A Leader’s Guide to Advanced Team Structures in an Agentic World from the recent AWS Summit Sydney.
It's a very sobering talk on the current state of the industry, AI adoption, and the future of engineering roles.

The central question of the talk is: "How should we build teams to work in this new AI world?"

To answer it, the author proposes a framework based on four elements:

Economics
The market has changed. Timelines are compressing. A small team of senior engineers can replace an entire existing product. This creates real risks for businesses that fail to adapt in time. That's why AI decisions should be driven by economics and business value, not hype.

Talent
Previously, career growth in tech was mostly about writing code and building features. Today, the most valuable skill is understanding the business, customers, and product. In other words, AI rewards expert generalists. One person can now handle analysis, backend, and frontend, reducing collaboration overhead and the need for deep specialization of large teams.
Another interesting point is the future of junior engineers. The speaker argues that we must keep the junior pipeline alive. Otherwise, we won't have senior expertise in 2034.

Structure
Current IT operations are optimized for determinism. But agents are non-deterministic. So the operating model has to shift: accept variance in execution, focus on outcomes, and put guardrails around the things you actually care about. The best operating model here is platform engineering.

Governance
The author highlights several areas that organizations need to address:
- Agent Identity Management. Every agent should have a verifiable identity traced to a named human.
- Risk Assessment. Clearly define what an agent is allowed to do and ensure it operates within those boundaries.
- Multi-agent Coordination. Control what happens when agents disagree, escalate, or show emergent behavior we don't expect.
- Deskilling Prevention. Employees should maintain core skills even if agents automate routine work. Someone still needs to validate results, audit actions, and take responsibility for decisions.

Overall, the talk is a good reality check on what is actually happening and how businesses and teams need to change to remain successful. Much of it resonates with my own observations, so I would definitely recommend watching the full video.

#ai #leadership #engineering

YouTube

A leader’s guide to advanced team structures in an agentic world | AWS Events

As AI agents transform the workplace, organizations must adapt their structures and methodologies to harness new opportunities. The probabilistic nature of AI requires continuous iteration and intelligent oversight, creating new ways of working across business…


A Few Words About Context

New major model releases regularly promise bigger context windows. Sounds great until you realize it's mostly marketing. A bigger context window doesn't mean better results. It often means more data, more noise, and more AI slop.

According to multiple studies, models effectively use only about 30–50% of their available context. For example, a model with a 200K-token context window may already show noticeable quality loss at around 50K tokens.
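
The arithmetic behind these numbers is easy to sketch. The helper names and the default limit below are hypothetical. The only inputs taken from the text are the 30–50% range and the 200K / 50K example.

```python
def effective_budget(
    window_tokens: int, low_pct: int = 30, high_pct: int = 50
) -> tuple[int, int]:
    """Token range a model is assumed to use well, as (low, high)."""
    return window_tokens * low_pct // 100, window_tokens * high_pct // 100


def should_compact(used_tokens: int, window_tokens: int, limit_pct: int = 30) -> bool:
    """True once usage reaches limit_pct of the window (integer math only)."""
    return used_tokens * 100 >= window_tokens * limit_pct


print(effective_budget(200_000))                      # (60000, 100000)
print(should_compact(50_000, 200_000))                # False
print(should_compact(50_000, 200_000, limit_pct=25))  # True
```

So for a 200K window the comfortable range is roughly 60K–100K tokens. The 50K example corresponds to 25% of the window, so a stricter limit is a reasonable default for long-running tasks.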

Why this happens:
🔸 Context rot. Output quality gradually degrades as the context grows.
🔸 Reasoning shift. The model spends less effort on reasoning. The answers sound more confident, but their quality often gets worse.
🔸 The lost-in-the-middle effect. Information in the middle of the context can be overlooked during later reasoning.
🔸 Attention dilution. The model's attention is spread across different instructions, making it harder to focus on what actually matters.

The practical takeaway is simple. Keep your context clean:
🔸 Start a new conversation for each new task (/new in Claude).
🔸 During long-running tasks, use /compact regularly to collapse intermediate reasoning and keep only the important things.
🔸 Store large data in long-term memory or relevant documentation, and bring it into the context only when it's actually needed.

Useful references:
- https://www.morphllm.com/context-rot
- https://www.zenml.io/llmops-database/context-rot-evaluating-llm-performance-degradation-with-increasing-input-tokens
- https://arxiv.org/html/2601.11564v1

#engineering #ai #tips
