SDD, aka Spec (Specification) Driven Development, or how vibe-coders reinvented systems analysis
For seasoned software developers, it's not news at all that before writing code, it's a good idea to have documentation of what's actually going to be built. Moreover, in enterprise environments this has long been a mandatory phase - and the practice of systems analysis, which is precisely what handles (or used to handle, in the pre-AI world) the development of that documentation, is a perfectly normal thing.
The moment it became possible to quickly write code that works the way the author intended, everyone rushed to do exactly that - and with predictable frustration discovered that things aren't quite as simple as advertised by the companies collecting real money for AI tokens.
Turns out the final product doesn't come together quickly, and rarely the way you'd want. A lot of things don't work, or turn out to be impossible despite the AI promising everything would fly straight to the moon. Features stumble at every step, you end up rewriting them a dozen times, burning through all your limits. And even if you managed to vibe-code something that more or less worked, it turned out there was a gaping security hole in it and hacking the thing was trivially easy. I actually talked about this last fall.
For those who've been in development for a long time, none of this is surprising. Pretty much the same result you'd get if you let a junior developer loose on a codebase. On the twentieth attempt, crooked and cobbled together, they might deliver something resembling a final product - but they'll miss a ton of details, including cybersecurity.
And lo and behold, the so-called SDD concept was born (or rather, people remembered the fundamentals). I think I've got material for several posts here, so stay tuned.
<written by a human being>
SDD, or Spec-Driven Development - the concept is dead simple. First you write the spec for what you're building, then, following that spec (that's the key part), you write the actual code.
It seems pretty logical, because nobody questions that before building a house you should first carefully prepare an architectural plan with all the details - including things that might seem like minor stuff you could just fix on the fly. But the more thorough the plan, the faster and smoother the build goes. This is exactly the "measure twice, cut once" moment (or "measure seven times", as the Russian saying goes).
I don't know where this myth came from that software development can work differently, but in practice the same thing happens with software. I'm talking about mature software, more enterprise-level - because yeah, you can build a shed without a detailed architectural plan. But it'll look like one too.
That's exactly why in a little while we're gonna see a whole wave of these shed-products, slapped together without a proper plan or solid architecture, that'll collapse at the first storm or any halfway serious load.
By the way, there's nothing wrong with that - honestly a lot of micro-software (not the Bill kind), built personally by a vibe-coder for some specific niche task, can totally get away without complex engineering. I write one-shot utilities for my own work and clients all the time, no plans, no architecture.
But when it comes to more or less serious systems - there's no way around doing the architecture first.
<written by a human being>
We figured out that SDD is the right and sensible approach to building software with AI agents. It's like drawing up the architectural plan before building a house. But what does it look like in practice?
Sooner or later I'll write my own skill for AI agents that writes specs to my requirements, but for now I'm using a skill from the Superpowers pack, which I talked about earlier. It's a plugin for Anthropic's Claude Code that's basically a skill set, and it has a spec-writing skill. It's not tuned specifically for development though - more of a generic type - but it works fine overall, especially if you spend a long time grinding through the first part with the questionnaire and demanding that the necessary details get added.
At the end you get a document in Markdown format, which I strongly recommend reading, because the AI can go sideways in some places, forget something, or generously add stuff at its own discretion.
I also recommend doing a double or even triple check of the spec. First - ask the AI to have an independent agent, running in a separate session, validate the spec for errors, consistency, logic, and any other criteria that matter for this specific spec.
Second - take the spec that already went through that review-and-fix cycle and hand it off for validation to other models - ChatGPT, Grok, whatever else. Let them find the weak spots and inconsistencies. Models trained on different data are really good at giving that kind of unbiased take.
Only after that kind of multi-layer check should you start working with it. Specifically - break it into tasks and start executing.
Infinite development loop
Hit a problem I don't know how to get out of yet. Been developing with an AI agent for weeks now, the system is mostly written and works module by module, but it's still very rough and crooked. Talking about a LangGraph-based graph that mixes deterministic nodes with code scripts and LLM calls - which, of course, tend to deviate from what's actually required.
And that's exactly where the constant problems keep spawning: either the criteria used to evaluate the LLM's output are too narrow and the model is literally set up to never meet them. Or there's not enough context, but the deterministic nodes can't surface that by their very nature. Or the dev agent seems to find yet another bug that's supposed to fix everything, but in practice pulls things in a different direction, digging the tech debt deeper and pushing the final result further away.
Not the first time I've tried to change the approach, nudge the agent in what seems like the right direction - but we keep sliding back into an infinite loop of fixes. Every new fix spawns a few more "bugs," which keeps growing the task count that's supposed to shrink over time.
How to get out of this hole - no idea yet. And even though the graph system we're building is pretty complex, it's still not rocket science. But Opus 4.7 is doing a poor job navigating the development so far, which the lack of results makes obvious.
Once I figure it all out, I'll definitely share what I found. But for now - tell me, are you running into similar issues with agentic development?
<written by a human being>
The Tasks AI Still Can't Do (And Why It Pretends It Can)
<written by a human being>
So, you've written a bunch of specs for a system you're planning to build. What's next?
First, you need to make sense of all this stuff. In my case, for example, I ended up with 12 ADRs (Architecture Decision Records) and 13 accompanying specifications. All of this needs to be structurally organized within the future repository. There's actually a dedicated spec for that - one that locks down this very structure. That's what you start working from.
The next agent should plan the deployment of a working project environment for development. Draw up a plan of everything that'll be needed, and actually get to work in accordance with the specs - they should always stay in context.
On top of that, I spun up a separate agent to visualize the architecture and system topology for me in C4 (a format specifically designed for architectural diagrams). Claude is honestly pretty bad at this right now, but I'll get what I want.
I know you're already itching to start building - but it's genuinely too early for that. Not if you want to avoid ending up with vibe-coded, leaky slop, anyway. You still need to lock down all the rules for working with the repo, the code, the system, and the docs, learn how to preserve and pass context properly, and keep clean, organized track of project tasks.
I'm doing all of this in parallel with my posts, by the way - so I'm literally sharing a behind-the-scenes look straight from the source: what's actually happening in my VSCode.
<written by a human being>
A couple of days ago, a friend of mine who works at a bank asked me about existing tools for drawing C4-level diagrams - basically, visually representing the architectural topology of an information system.
I've been saying for a long time that reliable diagram visualization is a task AI currently handles quite poorly. Sure, it understands what charts and diagrams are, can recognize them and even "build" them - but only in theory, and only as text.
The one thing I actually managed to pull off was a BPMN diagram, and the saving grace there was that it's deterministic XML markup. So it's logical to assume that diagrams which are "drawn" with text should come naturally to AI.
But no such luck. For some reason, visual representation is exactly where they struggle the most. There's Structurizr DSL, for instance, which supposedly lets you generate the right diagrams through code. Except deploying it turned out to be a fairly labor-intensive task - which felt like overkill for a single diagram.
There are UML primitives that let you assemble what you need, but first of all it's not pure C4, and second of all it still comes out crooked.
In the end, the simplest thing that actually gives me the result I need - for now - is HTML visualization. Not without some tap-dancing, of course, but it gets the job done overall. And visually you can make it look pretty decent too.
For the rest - we're waiting for model updates where the training data will include more diagrams.
AI Delegation Is About Doing What You Never Could
<written by a human being>
I've talked about LangGraph a few times already and how I'm building my video editing system on top of it. And a few days ago I started running into videos about ADK - Google's Agent Development Kit. It's a relatively fresh framework that literally two days ago added its own graph-based Workflow Runtime, which immediately puts it if not on the same level, then at least closer to LangGraph.
And of course, already beaten up by the infinite development loop on LangGraph, I went to take a closer look at ADK and the potential of migrating my system to it. It's got a graph too, which lets you run deterministic nodes - meaning regular code scripts - and everything goes through familiar Python classes.
Spinning up a quick simple agent with a repeatable loop and strict procedures using ADK is gonna be way faster - you literally just grab ready-made components. In LangGraph you build everything from scratch, but that also gives you more control over what's happening.
But if you need to build a genuinely complex and detailed process (which is exactly what my system turned out to be) - LangGraph wins in a lot of ways, precisely because of the high boilerplate - state schema, nodes, edges. In ADK you literally define an agent with Agent(...) and that's it.
Another mismatch with my requirements is ADK's hard lock-in to the Google ecosystem. If you're already deep in there, it's only a plus. But my goal from the start was to build an independent system on my own host, with no strict vendor lock.
So for now I'm staying on LangGraph, but keeping one eye on ADK and how it develops. I'm sure with such an active community it'll catch up fast and get polished to production-ready.
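To make the "state schema, nodes, edges" boilerplate a bit more concrete, here is a minimal plain-Python sketch of the idea. It deliberately does not use LangGraph's or ADK's actual API: the names State, NODES, EDGES and run are made up for illustration, and the summarize node is only a placeholder where a real system would call an LLM.

```python
from typing import Callable, Optional, TypedDict


class State(TypedDict, total=False):
    # The "state schema": every field a node may read or write.
    text: str
    word_count: int
    summary: str


def count_words(state: State) -> State:
    # Deterministic node: ordinary code, same output for the same input.
    return {"word_count": len(state["text"].split())}


def summarize(state: State) -> State:
    # Placeholder for an LLM call (assumption); here it just truncates the text.
    return {"summary": state["text"][:20]}


# Nodes are named functions; edges say which node runs next (None = stop).
NODES: dict[str, Callable[[State], State]] = {
    "count_words": count_words,
    "summarize": summarize,
}
EDGES: dict[str, Optional[str]] = {"count_words": "summarize", "summarize": None}


def run(entry: str, state: State) -> State:
    # Walk the graph from the entry node, merging each node's update into the state.
    current: Optional[str] = entry
    while current is not None:
        state = {**state, **NODES[current](state)}
        current = EDGES[current]
    return state


result = run("count_words", {"text": "specs first, code second"})
print(result)
```

Even in this toy form you can see the trade-off: more to write up front, but every step and every transition is explicit and easy to inspect.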
Meta-agents
The structure of my work project pushed me toward using a concept I call Meta-agents.
I have one repository, from which I design a new repository for a future system from scratch. Me and the agents plan everything - architecture, stack, order of work, project management, infrastructure deployment, all of it.
We also plan the work of AI agents on developing this new system and, accordingly, their work in a repository separate from the current one. We prepared a set of instructions and even a skill set that the agents will use.
From there we naturally evolved toward planning how those agents actually work. Now we run smoke tests with them, where the initial prompts to the agents sound something like "Complete task DSP-160" and that's it. The agent has to figure out everything else on its own, while me and the agent in the other repo watch its actions and analyze what went wrong and what to tweak in the instructions, rules, and skills.
So here's the picture we ended up with: we created an environment for agents, launch them, and with another Meta-agent we observe and fine-tune the habitat for the new ones. Ironic, right?
So far this approach has been working really well: literally after the first smoke run and one instruction tweak, the second session went almost without a hitch, which is great. Continuing to observe.
<written by a human being>
Give AI a Terminal - Genius. Give It Bubble - Disaster
<written by a human being>
A couple months ago I talked about how no-code solutions were still stuck in semi-manual mode for me - the Claude Chrome plugin worked unbearably slowly and ate an unbearable amount of tokens, so it was faster and easier to just make edits by hand.
AI stayed in the role of a wise algorithm and UX advisor, helping me debug stuff and quickly come up with design solutions.
The next evolution step was feeding exports and dumps to agents. Pretty much any no-code tool lets you get data about what's happening inside the app or database one way or another - as tables, JSON files, or their own formats that end up being machine-readable anyway.
Bubble, for example, lets you export the whole app in .bubble format, which is basically minified JSON, and if you clean it up into standard format a coding agent starts understanding it pretty effectively.
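The "clean it up into standard format" step can be as small as re-serializing the minified JSON with indentation. A rough sketch, assuming the export really is valid JSON as described above; the sample string and the function name prettify_export are made up for illustration and do not reflect the real structure of a .bubble file.

```python
import json


def prettify_export(raw: str) -> str:
    """Re-serialize minified JSON with indentation and a stable key order."""
    data = json.loads(raw)
    return json.dumps(data, indent=2, sort_keys=True, ensure_ascii=False)


# Hypothetical stand-in for the contents of an exported file.
minified = '{"pages":{"index":{"title":"Home"}},"name":"demo"}'
pretty = prettify_export(minified)
print(pretty)
```

Sorted keys and one value per line also make diffs between two exports readable, which helps when an agent has to figure out what changed.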
Or Directual, which I work with a lot - it lets you export all your scenarios or workflows, plus table structure, into clean JSON formats, which lets an AI agent go deep into any step of the algorithms.
The last piece that closed the loop was Playwright CLI, which lets an AI agent interactively navigate apps exactly the same way you do manually in a browser, except straight from the command line - native territory for our smart assistants. It can even take screenshots, just like a regular browser, and analyze the final UX/UI!
And that already opens up a whole layer of possibilities - from hunting down the nastiest bugs to iteratively improving the design of apps built on top of no-code solutions.
Right now, for example, I'm closing out some pretty old hanging bugs that have been buried deep in the backlog as not-super-urgent but that still mess up users' lives. How long would those have taken me - probably weeks, since there'd already been a few attempts at fixing them with zero results. But for an AI agent it's just another routine task.
Lazy AI agents, or why attention to detail actually matters
I currently have two systems in development - a personal finance tracking system and a video editing system. Both have complex data processing pipelines.
And when an LLM runs through them, the agents have a tendency toward "lazy" solutions - taking the path of least resistance (just like humans, I guess). For example, there's a money transfer transaction to another account, recorded in a bank statement. By default, and logically, the system assigns it a transfer category. But at the same time it tries to find the other account in the database that the money was sent to. Except that transaction turned out to be a payment to a contractor - paid by transferring to their personal card.
Yeah, the system doesn't have enough data to make the right classification call, but that's exactly where I was counting on the LLM, not just a dumb algorithm - I figured it would stop and hand the case to me so I could decide what it actually is. But nope - it silently decides it's a standard transfer and the mismatch only surfaces during the final balance reconciliation, which means a long, painful unraveling of the entire data chain to find the source.
In moments like this you have to stop the slacking machine that supposedly has intelligence, and force it to walk through each transaction with you in detail to fix the pipeline design.
I've simplified the case a bit for clarity, but the point should be clear. It's still way too early to blindly trust LLM decisions when accuracy matters - like in finance. Especially when the data is incomplete.
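One way to express the behavior I was expecting is an explicit "stop and ask" branch in the pipeline. This is a simplified sketch and not my actual system: Transaction, Decision, classify_transfer and the account IDs are all hypothetical names invented for the example.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Transaction:
    amount: float
    description: str
    counterparty_account: Optional[str]  # destination from the statement, if known


@dataclass
class Decision:
    category: str
    needs_review: bool
    reason: str


def classify_transfer(tx: Transaction, own_accounts: set[str]) -> Decision:
    """Label a transfer only when the destination is a known own account."""
    if tx.counterparty_account in own_accounts:
        return Decision("transfer", False, "destination is one of the user's accounts")
    # Not enough data: hand the case to a human instead of defaulting to "transfer".
    return Decision("unclassified", True, "destination account not found; ask the user")


own = {"ACC-1", "ACC-2"}
internal = classify_transfer(Transaction(500.0, "to savings", "ACC-2"), own)
contractor = classify_transfer(Transaction(300.0, "card transfer", "CARD-9"), own)
print(internal)
print(contractor)
```

The point of the design is that the uncertain case gets flagged at the moment of classification, not weeks later during balance reconciliation.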
<written by a human being>
Seems like the AI world has been unusually quiet lately - no major model drops, no talk of revolutionary quality leaps, image and video models are spinning their wheels too. Have we hit peak perfection?
Either way - it's actually a good time to stop chasing updates and go deep on the tools we already have.
After yesterday's dev session with AI agents, I realized that depth is exactly what I'm missing from their side. A genuine understanding of what we're actually doing - even when it's spelled out in the instructions and prompts. Somewhere around the middle of a session I suddenly realize we're not solving the root cause of a bug - we're skimming the surface, painting over a scratch, while completely missing the fact that underneath that paint there might be a rotting foundation that needs to be fixed first.
It reminds me of Elon Musk's first principles approach - the one that lets you cut through to the very essence of a problem, the foundation everything else is built on. How do you get AI agents to actually apply that same approach?
For now, the only way I've found is through a series of progressively deeper questions. But the agent still tries to patch the bug, move on, do something. It struggles to stop and ask itself - why am I doing this right now, what's the actual end goal?
That, to me, will be the real breakthrough in future models. Until then, you have to keep them on a short leash and watch them closely.
How not to drown in a constantly filling pool of AI tools, or your perfect AI stack
Recently I came across a video about AI stacks and realized that for a lot of people this can be an incredibly non-obvious choice - given the insane number of new tools that drop if not every day, then every week or two for sure.
How do you keep up with all of it? You don't! And there's not much point. Give it a bit more time and the market will consolidate, like it always does, and it'll become clear who's daddy - the leaders will split the market between them and there'll be a long tail of niche tools, each covering the needs of a specific segment and use case.
That's exactly the approach I use when building my own stack: I pick a couple of leaders I use on a daily basis and sprinkle in periodic use of niche tools.
For example my go-to AI is Claude and I work 97% of the time in CLI mode from VSCode. It covers a solid 80% of all tasks on the flagship model. I split some of the work with Codex too - mostly data processing, bug fixes, and client tasks, which it handles great.
And that's pretty much it. Everything else is niche usage. Like, I dictate long prompts into Wispr Flow, transcribe call recordings in ElevenLabs via API, generate images in ChatGPT.
If something comes up that needs a specific tool the flagships can't handle, I'll go find the right one. But honestly, over the past several months that just hasn't happened.
<written by a human being>
The AI-coding community is slowly starting to turn toward Codex, which lately seemed to have faded into the background and given up its lead to Claude Code and even Cursor's Composer.
Personally, in practice I've always used both, since I didn't see a major difference in the results - though some distinctions in specific cases were still noticeable. I've mentioned this a few times: tasks involving data, where accuracy matters, I more often hand off to Codex, and it handles them brilliantly. Opus handles them too, but typically over more iterations - meaning, roughly speaking, longer and more expensive.
Cost, by the way, is one of the key factors when choosing one tool or another, because when the output quality is roughly equal, price becomes the natural deciding edge. And the calculations one guy did a few months back, comparing equivalent subscription cost to API spend, put Claude in an undisputed lead: for a $100 monthly subscription you're buying around $1,300 worth of API value.
That math was the key motivator for me to get the Max subscription on Claude, which ended up becoming my main coding agent.
But now people are saying that Codex's limits on the same $100 subscription far exceed Claude's and feel practically unlimited. I haven't put that to the test yet, but it definitely got me thinking. I think I'll need to spend some time testing them side by side to compare the feel of the results, the usage limits, and the quirks of each one on different kinds of tasks.
But if you don't yet have a subscription to the top-tier plans of the flagship coding AIs, the choice just got a lot harder.
The One File That Stops AI Agents From Wasting Your Tokens
How many AI agents can you juggle at once?
AI agents give you an undeniable speed advantage on a lot of tasks. Over time I started spinning up multiple agents simultaneously, each working on a different project.
One's trying to hunt down a bug in a client's system. Another's editing a promo site for a new event for a different client. A third is spinning up a local dev environment for a new project.
And I'm switching between them, adding context, unblocking blockers and conflicts, reviewing output, giving the go-ahead or queuing the next task. Sounds pretty productive, right?
Except it has the opposite effect too - constant and frequent context switching, trying to hold the whole stack in your head while processing it all pretty fast, gives you the same feeling you get at a hard deadline when your ass is on fire, you can't keep up with anything and you just keep paddling to stay afloat.
And yet paradoxically, you get everything done and then some - more than you planned - but it doesn't feel that way. Apparently my cognitive hardware isn't used to operating in these conditions yet and still interprets it the old-fashioned way.
One task at a time - works great. But how do you deal with the fact that the AI is doing the task and you're just watching it? Sit there doing nothing waiting for the next iteration? No way, if there's time, might as well knock out other things in parallel! And that's how it starts spiraling.
You get this too? How do you deal?
<written by a human being>
I don't usually talk about new models, but today I'll join the trend, since I've already had a chance to get my hands on the new Opus - and I really liked it!
Literally from a single prompt it solved tasks - and did so faster, more precisely, without any hassle, deviations, or back-and-forth questions.
One of my automations broke and I asked it to fix it. Not only did it do that on the first try, but it also dug into the root cause - which turned out to be two consecutive system crashes that had corrupted the automation script file. On top of that, it kindly warned me that this isn't normal and suggested checking the system for serious errors. The previous Opus wasn't that perceptive.
Today I have several coding sessions ahead, and I'll be putting the new model through its paces in real battle conditions, for its intended purpose. If anything stands out beyond the noticeably improved thoughtfulness, I'll definitely share.
In the meantime, we're being promised that the new model has become more honest and accurate in its assessments. That was genuinely a problem before - it would give completely unrealistic timelines, like saying 2 days for a task it would finish in 20 minutes. And apparently it should be less of a yes-man now, and instead think harder about whether an action actually makes sense before doing it.
And the cherry on top - optimized token consumption, meaning the model got smarter without supposedly eating through more limits than the previous one. But all of that will show in practice. Let's go check it out!