<written by a human being>
Back when I had a regular job, I often had to explain my ideas, plans, concepts, tasks, and all the other joys of corporate life to colleagues and management. Whiteboards with markers or prepared presentations were the usual go-to - because obviously nobody's going to read a dry document, and making one isn't exactly a pleasure either.
These days I present that kind of thing to colleagues as HTML pages, which let you lay out information in a way that actually lands visually. I'm not much of a designer myself, so I hand that part off to the latest AI models, and they nail it every time.
There's not much prep work involved either - you can just throw a pile of inputs at the agent: various documents, meeting transcripts, your own raw thoughts, and a voice note of what you want the final presentation to actually say.
The agent structures all that chaos on its own, picks out only what matters, and synthesizes it into a clean, coherent layout for a clear visual result.
If you want, you can even throw in interactive elements that affect the output - sliders that adjust financial charts, buttons that change the result, whatever comes to mind.
The key thing here, obviously, is time saved. I used to spend a solid third of a workday on this kind of thing, at minimum. Now it's genuinely a 10-minute conversation with an agent.
<written by a human being>
A few months back, while designing the architecture for a potential system and planning the stack, I gave this task to three different models - Claude Opus, ChatGPT, and Grok, all in their top reasoning modes. I got three different recommendations that overlapped on individual modules, but each came with an impressive set of arguments for its proposed options.
Same input, different outputs. How do you choose between them? Present each model with the other two versions and ask it to critique all three - including its own - based on the consolidated reasoning from the others.
After a couple of iterations of this kind of triage, a consensus emerges somewhere in the middle.
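The cross-critique loop described above can be sketched roughly like this. It's a minimal sketch, not a real integration: `Model`, `build_critique_prompt`, and `cross_critique` are hypothetical names, and each "model" is just a callable that takes a prompt and returns text, standing in for whatever API client you actually use.

```python
from typing import Callable, Dict

# A "model" here is any callable that takes a prompt and returns text.
# In practice this would wrap a real API client; these are stand-ins (assumption).
Model = Callable[[str], str]


def build_critique_prompt(task: str, proposals: Dict[str, str]) -> str:
    """Combine the task and every model's proposal into one critique prompt."""
    parts = [f"Task:\n{task}", "Proposals to critique (including your own):"]
    for name in sorted(proposals):
        parts.append(f"--- {name} ---\n{proposals[name]}")
    parts.append("Critique all proposals, then give one revised recommendation.")
    return "\n\n".join(parts)


def cross_critique(task: str, models: Dict[str, Model], rounds: int = 2) -> Dict[str, str]:
    """Run an initial proposal pass, then `rounds` passes of mutual critique."""
    proposals = {name: ask(task) for name, ask in models.items()}
    for _ in range(rounds):
        prompt = build_critique_prompt(task, proposals)
        # Every model sees the same consolidated set of proposals in each round.
        proposals = {name: ask(prompt) for name, ask in models.items()}
    return proposals
```

Whether the answers actually converge after a couple of rounds is something you still have to judge by reading them - the code only handles the shuffling of texts between models.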
Fast forward a few months, and I'm kicking off development on a new system. PRDs are ready, functional and non-functional requirements are gathered, constraints are defined - time to start planning. Long hours of interviews, self-checks, spec analysis by independent agents, and the final architecture and stack are starting to take shape.
As always, with a healthy dose of skepticism toward any single model's decisions, I load the same set of input documents into two others - and the output is... practically identical recommendations for the stack. They diverge on a few individual elements, but that's genuinely 10% of the total volume.
Consensus reached without any cross-analysis between them. What is this - identical training data? Or is it actually the best choice for my case, and the models have gotten smart enough that they all arrived at the same conclusion independently? Have you noticed this?
SDD, aka Spec (Specification) Driven Development, or how vibe-coders reinvented systems analysis
For seasoned software developers, it's not news at all that before writing code, it's a good idea to have documentation of what's actually going to be built. Moreover, in enterprise environments this has long been a mandatory phase - and the practice of systems analysis, which is precisely what handles (or used to handle, in the pre-AI world) the development of that documentation, is a perfectly normal thing.
The moment it became possible to quickly write code that works the way the author intended, everyone rushed to do exactly that - and with predictable frustration discovered that things aren't quite as simple as advertised by the companies collecting real money for AI tokens.
Turns out the final product doesn't come together quickly, and rarely the way you'd want. A lot of things don't work, or turn out to be impossible despite the AI promising everything would fly straight to the moon. Features stumble at every step, you end up rewriting them a dozen times, burning through all your limits. And even if you managed to vibe-code something that more or less worked, it turned out there was a gaping security hole in it and hacking the thing was trivially easy. I actually talked about this last fall.
For those who've been in development for a long time, none of this is surprising. Pretty much the same result you'd get if you let a junior developer loose on a codebase. On the twentieth attempt, crooked and cobbled together, they might deliver something resembling a final product - but they'll miss a ton of details, including cybersecurity.
And lo and behold, the so-called SDD concept was born (or rather, people remembered the fundamentals). I think I've got material for several posts here, so stay tuned.
<written by a human being>
SDD, or Spec-Driven Development - the concept is dead simple. First you write the spec for what you're building, then, following that spec (that's the key part), you write the actual code.
It seems pretty logical, because nobody questions that before building a house you should first carefully prepare an architectural plan with all the details - including things that might seem like minor stuff you could just fix on the fly. But the more thorough the plan, the faster and smoother the build goes. This is exactly the "measure twice, cut once" moment (or "measure seven times," as the Russian saying goes).
I don't know where this myth came from that software development can work differently, but in practice the same thing happens with software. I'm talking about mature software, more enterprise-level - because yeah, you can build a shed without a detailed architectural plan. But it'll look like one too.
That's exactly why in a little while we're gonna see a whole wave of these shed-products, slapped together without a proper plan or solid architecture, that'll collapse at the first storm or any halfway serious load.
By the way, there's nothing wrong with that - honestly a lot of micro-software (not the Bill kind), built personally by a vibe-coder for some specific niche task, can totally get away without complex engineering. I write one-shot utilities for my own work and clients all the time, no plans, no architecture.
But when it comes to more or less serious systems - there's no way around doing the architecture first.
<written by a human being>
We figured out that SDD is the right and sensible approach to building software with AI agents. It's like drawing up the architectural plan before building a house. But what does it look like in practice?
Sooner or later I'll write my own skill for AI agents that writes specs to my requirements, but for now I'm using a skill from the Superpowers pack, which I talked about earlier. It's a Claude Code plugin that's basically a skill set, and it has a spec-writing skill. It's not tuned specifically for development - it's more of a generic one - but it works fine overall, especially if you spend a long time grinding through the first part with the questionnaire and demanding that the necessary details get added.
At the end you get a document in Markdown format, which I strongly recommend reading, because the AI can go sideways in some places, forget something, or generously add stuff at its own discretion.
I also recommend doing a double or even triple check of the spec. First - ask the AI to have an independent agent (in a separate session) validate the spec for errors, consistency, logic, and any other criteria that matter for this specific spec.
Second - take the spec that already went through that review-and-fix cycle and hand it off for validation to other models - ChatGPT, Grok, whatever else. Let them find the weak spots and inconsistencies. Different training data is really good at enabling that kind of unbiased take.
Only after that kind of multi-layer check should you start working with it. Specifically - break it into tasks and start executing.
Infinite development loop
Hit a problem I don't know how to get out of yet. Been developing with an AI agent for weeks now, the system is mostly written and works module by module, but it's still very rough and crooked. Talking about a LangGraph-based graph that mixes deterministic nodes with code scripts and LLM calls - which, of course, tend to deviate from what's actually required.
And that's exactly where the constant problems keep spawning. Either the criteria used to evaluate the LLM's output are too narrow, and the model is literally set up to never meet them. Or there's not enough context, but the deterministic nodes by their very nature can't surface that. Or the dev agent finds yet another bug whose fix is supposed to solve everything, but in practice it pulls things in a different direction, digging the tech debt deeper and pushing the final result further away.
Not the first time I've tried to change the approach, nudge the agent in what seems like the right direction - but we keep sliding back into an infinite loop of fixes. Every new fix spawns a few more "bugs," which keeps growing the task count that's supposed to shrink over time.
How to get out of this hole - no idea yet. And even though the graph system we're building is pretty complex, it's still not rocket science. But Opus 4.7 is doing a poor job navigating the development so far, which the lack of results makes obvious.
Once I figure it all out, I'll definitely share what I found. But for now - tell me, are you running into similar issues with agentic development?
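To make the first failure mode concrete (evaluation criteria the model can never meet), here's a minimal, framework-agnostic sketch of a generate-then-evaluate step with a retry cap. It's not LangGraph code and not a claim about how the system in this post is built: `run_with_retry_cap`, `LoopResult`, and the `generate`/`evaluate` callables are hypothetical stand-ins for an LLM node and a deterministic check.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class LoopResult:
    status: str                 # "accepted" or "escalated"
    output: str
    attempts: int
    failures: List[str] = field(default_factory=list)


def run_with_retry_cap(
    generate: Callable[[int, List[str]], str],   # stand-in for the LLM node (assumption)
    evaluate: Callable[[str], List[str]],        # deterministic check: returns failed criteria
    max_attempts: int = 3,
) -> LoopResult:
    """Retry the LLM step until it passes, but stop and escalate after a cap."""
    failures: List[str] = []
    output = ""
    for attempt in range(1, max_attempts + 1):
        output = generate(attempt, failures)     # previous failures go back in as context
        failures = evaluate(output)
        if not failures:
            return LoopResult("accepted", output, attempt)
    # Cap reached: the criteria may be unsatisfiable, or context may be missing.
    return LoopResult("escalated", output, max_attempts, failures)
```

The point of the cap is only to turn a silent loop into a visible "escalated" result that names the criteria that kept failing. It doesn't tell you whether the criteria or the context are at fault.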
[Video] The Tasks AI Still Can't Do (And Why It Pretends It Can)
<written by a human being>
So, you've written a bunch of specs for a system you're planning to build. What's next?
First, you need to make sense of all this stuff. In my case, for example, I ended up with 12 ADRs (Architecture Decision Records) and 13 accompanying specifications. All of this needs to be structurally organized within the future repository. There's actually a dedicated spec for that - one that locks down this very structure. That's what you start working from.
The next agent should plan the deployment of a working project environment for development. Draw up a plan of everything that'll be needed, and actually get to work in accordance with the specs - they should always stay in context.
On top of that, I spun up a separate agent to visualize the architecture and system topology for me in C4 (a format specifically designed for architectural diagrams). Claude is honestly pretty bad at this right now, but I'll get what I want.
I know you're already itching to start building - but it's genuinely too early for that. Not if you want to avoid ending up with vibe-coded, leaky slop, anyway. You still need to lock down all the rules for working with the repo, the code, the system, and the docs, learn how to preserve and pass context properly, and keep clean, organized track of project tasks.
I'm doing all of this in parallel with my posts, by the way - so I'm literally sharing a behind-the-scenes look straight from the source: what's actually happening in my VSCode.
<written by a human being>
A couple of days ago, a friend of mine who works at a bank asked me about existing tools for drawing C4-level diagrams - basically, visually representing the architectural topology of an information system.
I've been saying for a long time that reliable diagram visualization is a task AI currently handles quite poorly. Sure, it understands what charts and diagrams are, can recognize them and even "build" them - but only in theory, and only as text.
The one thing I actually managed to pull off was a BPMN diagram, and the saving grace there was that it's deterministic XML markup. So it's logical to assume that diagrams which are "drawn" with text should come naturally to AI.
But no such luck. For some reason, visual representation is exactly where they struggle the most. There's Structurizr DSL, for instance, which supposedly lets you generate the right diagrams through code. Except deploying it turned out to be a fairly labor-intensive task - which felt like overkill for a single diagram.
There are UML primitives that let you assemble what you need, but first of all it's not pure C4, and second of all it still comes out crooked.
In the end, the simplest thing that actually gives me the result I need - for now - is HTML visualization. Not without some tap-dancing, of course, but it gets the job done overall. And visually you can make it look pretty decent too.
For the rest - we're waiting for model updates where the training data will include more diagrams.
[Video] AI Delegation Is About Doing What You Never Could
<written by a human being>
I've talked about LangGraph a few times already and how I'm building my video editing system on top of it. A few days ago I started running into videos about ADK - Google's Agent Development Kit. It's a relatively fresh framework that literally two days ago added its own graph-based Workflow Runtime, which immediately puts it, if not on the same level as LangGraph, then at least closer to it.
And of course, already beaten up by the infinite development loop on LangGraph, I went to take a closer look at ADK and the potential of migrating my system to it. It's got a graph too, which lets you run deterministic nodes - meaning regular code scripts - and everything goes through familiar Python classes.
Spinning up a quick simple agent with a repeatable loop and strict procedures using ADK is gonna be way faster - you literally just grab ready-made components. In LangGraph you build everything from scratch, but that also gives you more control over what's happening.
But if you need to build a genuinely complex and detailed process (which is exactly what my system turned out to be) - LangGraph wins in a lot of ways, precisely because of the high boilerplate - state schema, nodes, edges. In ADK you literally define an agent with Agent(...) and that's it.
Another mismatch with my requirements is ADK's hard lock-in to the Google ecosystem. If you're already deep in there, it's only a plus. But my goal from the start was to build an independent system on my own host, with no strict vendor lock-in.
So for now I'm staying on LangGraph, but keeping one eye on ADK and how it develops. I'm sure with such an active community it'll catch up fast and get polished to production-ready.
Meta-agents
The structure of my work project pushed me toward using a concept I call Meta-agents.
I have one repository, from which I design a new repository for a future system from scratch. Me and the agents plan everything - architecture, stack, order of work, project management, infrastructure deployment, all of it.
We also plan the work of AI agents on developing this new system and, accordingly, their work in a repository separate from the current one. We prepared a set of instructions and even a skill set that the agents will use.
And so we gradually arrived at planning the work of those agents themselves. Now we run smoke tests with them, where the initial prompts to the agents sound something like "Complete task DSP-160" and that's it. The agent has to figure out everything else on its own, while me and the agent in the other repo watch its actions and analyze what went wrong and what to tweak in the instructions, rules, and skills.
So here's the picture we ended up with: we created an environment for agents, launch them, and with another Meta-agent we observe and fine-tune the habitat for the new ones. Ironic, right?
So far this approach has been working really well: literally after the first smoke run and one instruction tweak, the second session went almost without a hitch, which is great. Continuing to observe.
[Video] Give AI a Terminal - Genius. Give It Bubble - Disaster
<written by a human being>
A couple months ago I talked about how no-code solutions were still stuck in semi-manual mode for me - the Claude Chrome plugin worked unbearably slowly and ate an unbearable amount of tokens, so it was faster and easier to just make edits by hand.
AI stayed this wise algorithm and UX advisor, helping me debug stuff and quickly come up with design solutions.
The next evolution step was feeding exports and dumps to agents. Pretty much any no-code tool lets you get data about what's happening inside the app or database one way or another - as tables, JSON files, or their own formats that end up being machine-readable anyway.
Bubble, for example, lets you export the whole app in .bubble format, which is basically minified JSON, and if you clean it up into standard format a coding agent starts understanding it pretty effectively.
Or Directual, which I work with a lot - it lets you export all your scenarios or workflows, plus table structure, into clean JSON formats, which lets an AI agent go deep into any step of the algorithms.
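The "clean it up into standard format" step can be as small as re-serializing the export with indentation. A minimal sketch, assuming the export really does parse as plain JSON as described above; `prettify_export` and the file names in the usage note are hypothetical.

```python
import json
from pathlib import Path


def prettify_export(src: Path, dst: Path) -> int:
    """Rewrite a minified JSON export as indented JSON; return the top-level key count."""
    data = json.loads(src.read_text(encoding="utf-8"))
    dst.write_text(
        # Indentation and sorted keys make the file diffable and easier to search.
        json.dumps(data, indent=2, sort_keys=True, ensure_ascii=False),
        encoding="utf-8",
    )
    return len(data) if isinstance(data, dict) else 0
```

Usage would look something like `prettify_export(Path("app.bubble"), Path("app.pretty.json"))`, after which the agent can grep and read the file section by section instead of choking on one giant line.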
The last piece that closed the loop was Playwright CLI, which lets an AI agent interactively navigate apps exactly the same way you do manually in a browser, except straight from the command line - native territory for our smart assistants. It can even take screenshots, just like a regular browser, and analyze the final UX/UI!
And that already opens up a whole layer of possibilities - from hunting down the nastiest bugs to iteratively improving the design of apps built on top of no-code solutions.
Right now, for example, I'm closing out some pretty old hanging bugs that have been buried deep in the backlog as not-super-urgent but that still mess up users' lives. How long would those have taken me by hand? Probably weeks, since there'd already been a few attempts at fixing them with zero results. But for an AI agent it's just another routine task.
Lazy AI agents, or why attention to detail actually matters
I currently have two systems in development - a personal finance tracking system and a video editing system. Both have complex data processing pipelines.
And when an LLM runs through them, the agents have a tendency toward "lazy" solutions - taking the path of least resistance (just like humans, I guess). For example, there's a money transfer transaction to another account, recorded in a bank statement. By default, and logically, the system assigns it a transfer category. But at the same time it tries to find the other account in the database that the money was sent to. Except that transaction turned out to be a payment to a contractor - paid by transferring to their personal card.
Yeah, the system doesn't have enough data to make the right classification call, but that's exactly where I was counting on the LLM, not just a dumb algorithm - I figured it would stop and hand the case to me so I could decide what it actually is. But nope - it silently decides it's a standard transfer and the mismatch only surfaces during the final balance reconciliation, which means a long, painful unraveling of the entire data chain to find the source.
In moments like this you have to stop the slacking machine that supposedly has intelligence, and force it to walk through each transaction with you in detail to fix the pipeline design.
I've simplified the case a bit for clarity, but the point should be clear. It's still way too early to blindly trust LLM decisions when accuracy matters - like in finance. Especially when the data is incomplete.
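The behavior I was hoping for in the transfer case can be written down as a rule: only call something a transfer when the destination is a known account, otherwise stop and ask. A minimal sketch with hypothetical names and fields (`Transaction`, `Classification`, `classify_transfer`, the `needs_review` category) - the real pipeline is more involved than this.

```python
from dataclasses import dataclass
from typing import Optional, Set


@dataclass
class Transaction:
    description: str
    amount: float
    counterparty_account: Optional[str]  # destination from the statement, if present


@dataclass
class Classification:
    category: str   # "transfer" or "needs_review" in this sketch
    reason: str


def classify_transfer(tx: Transaction, own_accounts: Set[str]) -> Classification:
    """Label a transfer as internal only when the destination is a known own account."""
    if tx.counterparty_account in own_accounts:
        return Classification("transfer", "destination is one of the user's accounts")
    # Unknown destination: could be a contractor paid to a personal card.
    # Hand the case to a human instead of guessing.
    return Classification("needs_review", "destination account not found in database")
```

With a rule like this, the contractor payment shows up as a question at import time, not as a mismatch at the final balance reconciliation.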
<written by a human being>
Seems like the AI world has been unusually quiet lately - no major model drops, no talk of revolutionary quality leaps, image and video models are spinning their wheels too. Have we hit peak perfection?
Either way - it's actually a good time to stop chasing updates and go deep on the tools we already have.
After yesterday's dev session with AI agents, I realized that depth is exactly what I'm missing from their side. A genuine understanding of what we're actually doing - even when it's spelled out in the instructions and prompts. Somewhere around the middle of a session I suddenly realize we're not solving the root cause of a bug - we're skimming the surface, painting over a scratch, while completely missing the fact that underneath that paint there might be a rotting foundation that needs to be fixed first.
It reminds me of Elon Musk's first principles approach - the one that lets you cut through to the very essence of a problem, the foundation everything else is built on. How do you get AI agents to actually apply that same approach?
For now, the only way I've found is through a series of progressively deeper questions. But the agent still tries to patch the bug, move on, do something. It struggles to stop and ask itself - why am I doing this right now, what's the actual end goal?
That, to me, will be the real breakthrough in future models. Until then, you have to keep them on a short leash and watch them closely.
How not to drown in a constantly filling pool of AI tools, or your perfect AI stack
Recently I came across a video about AI stacks and realized that for a lot of people this can be an incredibly non-obvious choice - given the insane number of new tools that drop if not every day, then every week or two for sure.
How do you keep up with all of it? You don't! And there's not much point. Give it a bit more time and the market will consolidate, like it always does, and it'll become clear who's daddy - the leaders will split the market between them and there'll be a long tail of niche tools, each covering the needs of a specific segment and use case.
That's exactly the approach I use when building my own stack: I pick a couple of leaders I use on a daily basis and sprinkle in periodic use of niche tools.
For example, my go-to AI is Claude, and 97% of the time I work with it in CLI mode from VSCode. It covers a solid 80% of all tasks on the flagship model. I split some of the work with Codex too - mostly data processing, bug fixes, and client tasks, which it handles great.
And that's pretty much it. Everything else is niche usage. Like, I dictate long prompts into Wispr Flow, transcribe call recordings in ElevenLabs via API, generate images in ChatGPT.
If something comes up that needs a specific tool the flagships can't handle, I'll go find the right one. But honestly, over the past several months that just hasn't happened.
<written by a human being>
The AI-coding community is slowly starting to turn toward Codex, which lately seemed to have faded into the background and given up its lead to Claude Code and even Cursor's Composer.
Personally, in practice I've always used both, since I didn't see a major difference in the results - though some distinctions in specific cases were still noticeable. I've mentioned this a few times: tasks involving data, where accuracy matters, I more often hand off to Codex, and it handles them brilliantly. Opus handles them too, but typically over more iterations - meaning, roughly speaking, longer and more expensive.
Cost, by the way, is one of the key factors when choosing one tool or another, because when the output quality is roughly equal, price becomes the natural deciding edge. And the calculations one guy did a few months back, comparing equivalent subscription cost to API spend, put Claude in an undisputed lead: for a $100 monthly subscription you're buying around $1,300 worth of API value.
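For scale, the quoted numbers work out to a 13x multiple. A tiny sketch of that arithmetic, using only the two figures from the comparison above; `api_value_multiple` is a hypothetical helper name.

```python
def api_value_multiple(subscription_usd: float, api_equivalent_usd: float) -> float:
    """How many dollars of API-priced usage each subscription dollar buys."""
    if subscription_usd <= 0:
        raise ValueError("subscription price must be positive")
    return api_equivalent_usd / subscription_usd


print(api_value_multiple(100, 1300))  # 13.0
```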
That math was the key motivator for me to get the Max subscription on Claude, which ended up becoming my main coding agent.
But now people are saying that Codex's limits on the same $100 subscription far exceed Claude's and feel practically unlimited. I haven't put that to the test yet, but it definitely got me thinking. I think I'll need to spend some time testing them side by side to compare the feel of the results, the usage limits, and the quirks of each one on different kinds of tasks.
But if you don't yet have a subscription to the top-tier plans of the flagship coding AIs, the choice just got a lot harder.