Not boring, and a bit of a condescending prick
Semi-digested observations about our world, shared as soon as they are phrased well enough in my head to be shared more broadly.
So I implemented a fairly large end-to-end UI harness test using Playwright over the past several days.

Even got yet another compliment from the CEO that I’m indeed a weird engineer. Which is fair — most engineers can’t be made to write UI tests, and I literally volunteered to build one. Ask forgiveness, not permission. My take was and still is: if you actually care about data isolation across user accounts and system boundaries, end-to-end tests are the best tool we have.

Here comes the punchline though.

Playwright is enormously good in the age of AI. So good that I’m starting to think instrumented Chromium may be one of the most overlooked security risks.

Take online banking or brokerage accounts. Leaking a password is not that scary (sic!), because:

- there’s two-factor
- a new device or location triggers extra checks
- even with access, moving funds to new accounts requires more verification

Now imagine the attacker acting on your behalf from your own browser. Your own headless browser. Which most humans have no idea can exist.

Headless browsers can open your email, grab the 2FA code, complete the login, and delete that email.

And no alarm will ring. Because from the system’s perspective, this is your device. Your browser. Your session. We don’t use CAPTCHAs for bank logins, after all.

And you won’t notice anything. Until it’s way too late.

So, three thoughts. First: I’m scared.

Not so much for myself — my personal paranoia (separate browsers, isolated cookies, etc.) probably protects me from most unsophisticated attacks.

But I am scared for the industry. Once this kind of attack becomes widespread, it’s going to be a disaster.

Second: I’m annoyed.

Because this is exactly the kind of problem the Web3 folks solved at the protocol level a decade ago.

Air-gapped device. QR code. Explicit confirmation. Signed response.

You see exactly what you approve.
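
A minimal sketch of that approve-and-sign loop, under loud assumptions: the function names, the payload fields, and the shared `DEVICE_KEY` are all hypothetical, and a real wallet-style flow would use an asymmetric key pair held only on the offline device. HMAC stands in here purely because it ships with the Python standard library.

```python
import hashlib
import hmac
import json
import secrets
from typing import Callable, Optional

# Hypothetical secret provisioned on the offline device at enrollment.
# Assumption: real deployments would use an asymmetric signature instead.
DEVICE_KEY = secrets.token_bytes(32)


def make_request(action: str, details: dict) -> bytes:
    """Server side: serialize the exact action to approve (what the QR code would carry)."""
    payload = {"action": action, "details": details, "nonce": secrets.token_hex(16)}
    return json.dumps(payload, sort_keys=True).encode()


def approve_on_device(request: bytes, key: bytes,
                      user_confirms: Callable[[dict], bool]) -> Optional[bytes]:
    """Device side: show the decoded request to the human, sign only on explicit consent."""
    shown = json.loads(request)
    if not user_confirms(shown):
        return None
    return hmac.new(key, request, hashlib.sha256).digest()


def verify(request: bytes, signature: Optional[bytes], key: bytes) -> bool:
    """Server side: accept the action only if the signature covers these exact bytes."""
    if signature is None:
        return False
    expected = hmac.new(key, request, hashlib.sha256).digest()
    return hmac.compare_digest(expected, signature)


request = make_request("transfer", {"to": "ACME-123", "amount_usd": 250})
signature = approve_on_device(request, DEVICE_KEY, lambda shown: shown["action"] == "transfer")
tampered = request.replace(b"250", b"9500")
```

The point of the design is that the signature covers the exact bytes the human saw, so a compromised browser cannot swap the amount or the recipient after approval: `verify(tampered, signature, DEVICE_KEY)` fails even though the original request verifies.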

Why aren’t we doing this for GitHub commits, pull requests, AWS production changes — anything high impact?

No idea. Guess we’ll learn the hard way. The industry has framed the Web3 crowd as a bunch of unsophisticated enthusiasts, unwisely dismissing all the great things built there.

And third: the upside.

Security in the age of AI is going to become a huge deal, very quickly. And that is actually a good thing!

Because this is one of the few areas where first-principles thinking really matters. Security is always an arms race, and the ability to reason clearly about systems will be in very high demand.

As for me — with all due disrespect to things like Kubernetes and Terraform — I can kind of see where this is going.

Less writing code.

More defining invariants, reviewing (semi-AI-generated) rules, and building harnesses that ensure no higher-order policy can be violated by any lower-level implementation.

That seems like a good place to invest the time, energy, and passion of hardcore geeks like yours truly.
👍7
I learned about the concept of the Merchant of Record. And I can’t shake it off.

We have near-instant means of payment that cost fractions of a cent. And we’re working hard, as technologists, to make it faster, cheaper, and safer.

At the same time regulators are working hard to make it more difficult for legitimate businesses to collect payments that legitimate customers are consciously willing to pay for legitimate services.

And the official justification is that smaller countries got tired of Netflix et al. being tax-free since they operate from overseas, while local competitors have to pay those extra sales taxes.

First, why tax online services at all? Just tax the land, electricity, Internet, fire insurance, etc. Whether the servers are physically located in your country or region should be subject to market forces and market forces alone. Make the terms good and Netflix itself will build a data center in your region. Make it more difficult and your customers will be streaming from elsewhere. You can tax their internet traffic if you wish. But that’s how it should work.

And then: say your citizens are okay paying the “Netflix tax”. Fine. But why push it on to Netflix when it comes to collecting those payments? Have citizens declare those outbound credit card charges on their tax returns. Introduce the means to chase and fine offenders. But leave Netflix as the service provider the f**k alone.

It’s still beyond my comprehension how the entire international taxation model is designed. If it continues like this, books will end up requiring payment for every word I’ve read while being on the soil of country X. Insane? Indeed.

But to me it is just as insane that selling the same trivial online service is subject to price differentiation based on where the customer is, with the party collecting those payments held accountable for tracing their customers’ location.

Good thing most of the above does not apply to B2B invoicing. That’s one reason I believe Big Governments are against the Web3 space.

Because opening an international LLC should be a sub-millisecond operation that costs millicents, so that in a sane world 99+% of potentially taxable transactions would take place under this “limited liability” clause. Easy-peasy, and everybody wins, except the blood-sucking vampires that want to control every single drop of a perfectly consensual peer-to-peer financial transaction.
👍2
Been digging deeper into the Claude Code leak lately [1], and a few things are becoming clearer.

First, it doesn’t actually use a vector database. That confirms my earlier intuition, and honestly makes me feel better about still paying for Cursor. In practice, Opus via Cursor often feels faster and more responsive anyway. There’s now a Rust port/fork of Claude Code floating around, though — I’d expect that direction to eventually introduce some kind of retrieval or vector layer.

Second, Claude Code really isn’t designed around persistent external memory. It’s basically the model’s context window plus whatever lives in-repo (Markdown, notes, etc.). Even its “self-notes” just eat into context. That feels like a strange design choice, especially given how aggressively it uses sub-agents. You’d think it would evolve lightweight internal rules or mini-linters over time — but not really.

Third, philosophically, it’s not very “model-first.” In fact, it’s the opposite. Claude Code wraps the model in a heavy harness with lots of guardrails and restricted autonomy — which is ironic, given Anthropic builds some of the safest models out there.

Compare that to OpenCode — which basically trusts the model and lets it operate more freely. If you assume a properly sandboxed environment, you could even argue that approach is safer long-term. Less rigid scaffolding, more adaptive behavior.

It raises a bigger question: where is all of this heading?

Do we end up with every major company building its own agentic coding framework?

Or do we converge toward full-blown “agentic operating systems” for development — the Linux / FreeBSD / Windows / macOS equivalents of AI-native coding environments?

Personally, I’m still leaning toward curated, manually reviewed extensions layered on top of these systems — scoped per repo or per org. Not fully open (at least for now), but composable and controlled.

Either way — this space is getting interesting fast.

[1] https://arxiv.org/pdf/2604.14228v1
3
Folks, a geopolitical question that I would not be comfortable asking on my other social media.

Assuming the US is interested in slowing down China's growth by constraining oil supply (Hormuz, then Malacca), what about nuclear?

In my mind, this question splits into two major ones:

1. Does China need to import raw materials to derive nuclear power from? If yes, is this something that is plausible to attempt to control? (I mean, those will not be huge and heavy tankers, after all.)

2. Could it happen that starving a big country of oil would just result in them advancing on nuclear faster? (Much like cutting Assange et al. off from wire transfers pushed the leaking crowd towards crypto, ultimately yielding the opposite result.)

Of course, if there are other major directions / sub-questions to ask, I'd love to hear about them.

Truly curious, and thx in advance!
I literally wondered whether ChatGPT could generate this, and the first thing it said was "Hey, check out this new image generation model!"

Well, looks good indeed. Me happy.
TIL: most asteroids are not solid at all!

Decades of spacecraft visits and spin-rate surveys have revealed that the vast majority of asteroids larger than a few hundred meters are rubble piles — loose aggregates of rock, dust, and boulders held together almost entirely by their own weak self-gravity, with essentially zero tensile strength. Think of a gravel pile in space, not a boulder.

The key evidence:

The spin barrier. When astronomers plot spin rate vs. size for thousands of asteroids, there's a sharp cutoff: essentially no asteroid larger than ~200 meters spins faster than once per ~2.2 hours. That's exactly the rate at which a cohesionless rubble pile would fly apart under centrifugal force. If large asteroids were monoliths, we'd see plenty spinning faster. We don't. They're rubble.
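
Where the ~2.2 hours comes from can be sketched with a textbook simplification (my assumption, not a claim from any specific survey): treat the asteroid as a strengthless uniform sphere and ask when centrifugal acceleration at the equator equals surface gravity. Setting ω²R = GM/R² with M = (4/3)πρR³ gives a critical period P = sqrt(3π / (Gρ)), which depends only on bulk density, not on size. The function name and the sample densities below are illustrative.

```python
import math

G = 6.674e-11  # gravitational constant, m^3 kg^-1 s^-2


def critical_spin_period_hours(bulk_density_kg_m3: float) -> float:
    """Shortest rotation period before loose equatorial material lifts off
    a strengthless, uniform sphere of the given bulk density."""
    period_s = math.sqrt(3 * math.pi / (G * bulk_density_kg_m3))
    return period_s / 3600


# Illustrative densities: a Bennu-like value, a mid-range value, and solid-ish rock.
for density in (1190, 2250, 3000):
    print(f"{density} kg/m^3 -> {critical_spin_period_hours(density):.2f} h")
```

A bulk density around 2,250 kg/m³ lands at about 2.2 hours, which is why the observed cutoff reads as a gravity-only limit. Lower-density bodies would have to spin even more slowly to stay together.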

Direct imaging. Missions like Hayabusa2 (Ryugu) and OSIRIS-REx (Bennu) found surfaces covered in boulders, with bulk densities well below solid rock: Bennu is ~1,190 kg/m³ vs. ~3,000 for its constituent material, implying roughly 50 to 60% porosity. When OSIRIS-REx touched Bennu, the surface behaved like a ball pit; the spacecraft sank in. Ryugu was similar.
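
The porosity figure is just one minus the ratio of bulk density to grain density. A quick sketch, where the 2,400 kg/m³ alternative grain density is my own illustrative assumption for a lighter carbonaceous material:

```python
def porosity(bulk_density: float, grain_density: float) -> float:
    """Fraction of the volume that is empty space, given densities in the same units."""
    return 1.0 - bulk_density / grain_density


# Bennu's bulk density against two assumed grain densities (kg/m^3).
for grain_density in (3000, 2400):
    print(f"grain {grain_density}: porosity {porosity(1190, grain_density):.0%}")
```

With 3,000 kg/m³ the ratio gives about 60% void space; with the lighter 2,400 kg/m³ it gives about 50%. Either way, roughly half of the body is empty.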

Shapes. Many are "spinning top" shapes (Bennu, Ryugu, Didymos) — the equatorial bulge you'd get from a gravitationally-bound pile of gravel slowly redistributing itself under rotation.

Are there any solid ones?

Yes, but with caveats:

Small monoliths (< ~200 m) are common and genuinely solid — they're fragments of larger collisions.

Metallic (M-type) asteroids like 16 Psyche (~220 km) are thought to be exposed cores of differentiated planetesimals — likely much more coherent, possibly solid iron-nickel. Psyche is the current NASA target precisely to find out.

A few near-Earth asteroids show signs of some internal cohesion — spinning faster than the rubble-pile limit — suggesting they have at least modest tensile strength (maybe a few hundred Pa to a few kPa, nowhere near solid rock's ~10 MPa). These are "weakly bound" rather than truly monolithic.

What's actually near Earth at 50 km scale?

Near-Earth asteroids (NEAs) in the 50-km class basically don't exist. The largest NEA is 1036 Ganymed at ~38 km, and it's the only one above ~20 km. Almost all NEAs are sub-kilometer. The 50-km-radius (100-km diameter) bodies live in the main belt — Ceres, Vesta, Pallas territory — and those are dwarf-planet-scale objects with their own geology.
👍2
Paraphrasing a co-worker: the C in harness stands for clarity.
Something's broken, I definitely used Cursor extensively in the past few days.

But oh my, these numbers are insane.
😱2
Thought of the day.

In the age of AI, it's the LLM tokens that are expensive.

Compute, as in EC2 or ECS or Hetzner, is merely collateral damage. Nobody cares about those costs as long as the LLM tokens are burned with high utility.

AI did to compute what compute did to storage.

Which also means there's tons of money to be made in compute in the years to come — much like there's tons of money to be made in storage as of the past 5+ years.

My bet is that consistency and durability are what will sell well. With storage, that has been true for a few years already. With compute, it is starting about now, since compute is becoming the fungible auxiliary unit next to LLM tokens utilized at scale.
👍3
Weird how Elon says he [co-]founded OpenAI because he wanted a counter to Google’s approach to AI: closed, private, for-profit.

And as of now the largest beneficiary of the ongoing trial is indeed Google, which has softened up since, though all of Musk's original arguments against it clearly still hold true.
👍3
Whoa, I asked Claude to show me state income tax rates in 2015, 2020, and 2025. Was expecting to see tax hikes, but there are far more states that actually lowered their taxes.

A pleasant surprise indeed!
A friend is hiring a Staff for Uber @ Amsterdam, for the intersection of AI and security, on the harness / guardrails side I assume. Super interesting.

https://www.uber.com/global/en/careers/list/159033

Happy to help arrange the call if I worked with you before and/or know your work via one handshake.
1
66.88%. 80.1%. 85%. 90.79%. 93%. 100%.

These are all SOTA scores on agentic memory benchmarks. None of them tell you whether the system will work in production.

The deeper problem isn't the data — it's that we often misunderstand what these numbers actually measure. In our recent whitepaper we open-sourced datasets that target specific memory functions. Today we published a follow-up that explains why we think the well-known agentic memory benchmarks (LoCoMo, LongMemEval) miss the mark for production systems, and what we measure instead.

https://xmemory.ai/chasing-sota-in-ai-memory/

We're in a field that is measuring itself against itself.

The real question is not “are we beating last week's leaderboard?” It's “are we building something that makes people's work meaningfully better?”

That's harder to measure. It's also the only thing that matters.
Given how insecure our systems are proving to be, perhaps it’s time to admit a custom-built Web3-signed commands processor is safer than ssh?

Just run a small in-house blockchain spanning the nodes one needs access to, and have a few whitelisted public keys so that the holders of their private counterparts can issue shell commands and read all the output. Voilà: zero-day Linux vulnerabilities are gone, or at least the attack surface shrinks ~100x.
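
To make the idea concrete, here is a rough sketch of only the ledger half: an append-only, hash-chained command log gated by a key whitelist. Everything in it is hypothetical (the key names, the entry layout, the function names), and it deliberately skips the signature check itself, since the Python standard library has no asymmetric signatures; a real version would verify a per-entry signature against the whitelisted public key before appending.

```python
import hashlib
import json

GENESIS = "0" * 64
# Hypothetical whitelist of operator key identifiers.
ALLOWED_KEYS = {"ops-alice", "ops-bob"}


def entry_hash(prev_hash: str, key_id: str, command: str) -> str:
    """Hash an entry together with its predecessor's hash, forming the chain."""
    body = json.dumps({"prev": prev_hash, "key": key_id, "cmd": command}, sort_keys=True)
    return hashlib.sha256(body.encode()).hexdigest()


def append(chain: list, key_id: str, command: str) -> None:
    """Record a command; assumption: signature verification would happen here too."""
    if key_id not in ALLOWED_KEYS:
        raise PermissionError(f"key {key_id!r} is not whitelisted")
    prev_hash = chain[-1]["hash"] if chain else GENESIS
    chain.append({"prev": prev_hash, "key": key_id, "cmd": command,
                  "hash": entry_hash(prev_hash, key_id, command)})


def is_valid(chain: list) -> bool:
    """Recompute every link; any edited, reordered, or foreign entry breaks the chain."""
    prev_hash = GENESIS
    for entry in chain:
        if entry["key"] not in ALLOWED_KEYS or entry["prev"] != prev_hash:
            return False
        if entry["hash"] != entry_hash(entry["prev"], entry["key"], entry["cmd"]):
            return False
        prev_hash = entry["hash"]
    return True


chain = []
append(chain, "ops-alice", "uptime")
append(chain, "ops-bob", "df -h")
```

Each node would execute only commands that appear in a chain it can validate, so rewriting history (say, changing `chain[0]["cmd"]` after the fact) is detectable by every other node.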

Why aren’t big security players such as VPN providers offering this service already?
🔥1
Speaking of prompt optimization techniques in particular and of slop in general:

Jury instructions are frequently not understood. Studies dating back to the 1970s and continuing through today consistently find that jurors comprehend only somewhere between 40% and 70% of standard jury instructions, even after hearing them read aloud. A linguistics expert at Northeastern, Janet Randall, found that comprehension was significantly worse when instructions contained passive voice, presupposed information, and legal jargon. A more recent finding is striking: the complexity of legal concepts, rather than the complexity of language, is the primary cause of difficulties with comprehension, meaning that even rewriting instructions in plain English only modestly improves things, because the underlying concepts (mens rea, proximate cause, scienter, reasonable doubt as distinct from "beyond a shadow of a doubt") are genuinely hard.
👍5
Engineers aren't tech leads of agent swarms. They're founders of agent tribes.

Three thoughts that turned out to be the same thought.

I. Constitutions, not values.

Any moral opinion becomes a political stance the moment you scale it across millions of imperfect people over centuries. Politics is the friction you get when a moral idea meets a hundred million people and a hundred years.

At the other end of that spectrum sit constitutions: not statements of values, but the harness that lets imperfect people carry values across generations without burning the place down. A system of checks designed so that one bad decision — even one the majority loves — can't ruin the whole country.

That reframes how I think about moral opinions. Not "is this right?", but "what kind of system does this turn into when you scale it across imperfect people over a century, and does that system survive?"

A rule that only works when good people enforce it isn't a rule. It's a wish.

II. Agentic systems are the same problem, on a faster clock.

Engineers using Claude Code, Codex, and the rest aren't dispatching tickets to direct reports. We're designing meta-harnesses — constitutions — that turn a goal into coordinated agent work whose sum is greater than its parts.

The failure modes rhyme exactly with failed states: anarchy on one end, brittle over-centralized rigidity on the other. The job is to find the middle.

Today's tools are early tribal chiefs: brilliant while the chief is competent, fragile the moment the chief is replaced. They aren't broken — they're wonderful — but they're tied to the particular engineer running them. Agents are improving too fast for that to be the long-term answer.

III. Writing the right words is literally wealth creation.

The U.S. founding fathers' artifact was a set of words. Give them a few percent share of 250 years of American GDP and you'll easily land at "the largest value creators in the history of humankind."

Mechanically, they sat in rooms and argued about phrasing.

That's the skill the best engineers now need. Not prompting by hand. Not shaving tokens. Writing the words that turn capable-but-aimless agents into something that creates lasting value.

Every harness rule should express clean intent. The intent should cut deep — into what's actually being built and why. The system around the rules should ask questions when intent is unclear, and accumulate understanding instead of starting fresh every time.

Conclusions.

The right metric isn't tokens saved. It's universality — how much of your company's intent the harness can absorb and carry forward without you in the loop on every decision.

Tokens are cheap and getting cheaper. Feedback loops have collapsed from days to minutes. Tweaking prompts to save a few hundred tokens is debating the font of the parchment.

The work that matters is constitutional. Saving 50% of tokens means nothing compared to a harness that talks to your whole company nonstop, learns the shape of your organization within weeks, and keeps producing the work indefinitely — as a productive team member that carries your vision through time.

The founding fathers would understand.

Full post: https://dimakorolev.substack.com/p/founding-fathers-of-agents
🔥4
As we've definitely entered the era of smart AI agents, I think it's time we go back to the roots with shell scripting.

Say no to long shell scripts. Say no to complex business logic in shell scripts. Instead, this logic belongs in plain English!

Replace one script with ten options with ten shallow do-one-thing-and-do-it-well scripts. Or fifteen, if some of them have modes and start / stop / check functionality.

Then write a plain English file, likely markdown, outlining for the agent what these scripts are, how to use them, and what to keep in mind. A human may also read this English file, but who are we kidding?

And then invoke these scripts via a decent AI agent, such as Cursor. In your workflow, as needed, it will present a concise summary, with tables where needed.

If uncertain about the output, just ask it again, in plain English. And if some script output needs proper visualization, that's definitely not the job of the script itself, but of its outer harness.

And this outer harness is the human operator and an AI model. A model so small that it'll be running locally in just one or two generations of our laptops.

What a time to be alive.
👍6