Not boring, and a bit of a condescending prick
Semi-digested observations about our world, right after they are phrased well enough in my head to be shared more broadly.
Been digging deeper into the Claude Code leak lately [1], and a few things are becoming clearer.

First, it doesn’t actually use a vector database. That confirms my earlier intuition, and honestly makes me feel better about still paying for Cursor. In practice, Opus via Cursor often feels faster and more responsive anyway. There’s now a Rust port/fork of Claude Code floating around, though — I’d expect that direction to eventually introduce some kind of retrieval or vector layer.

Second, Claude Code really isn’t designed around persistent external memory. It’s basically the model’s context window plus whatever lives in-repo (Markdown, notes, etc.). Even its “self-notes” just eat into context. That feels like a strange design choice, especially given how aggressively it uses sub-agents. You’d think it would evolve lightweight internal rules or mini-linters over time — but not really.

Third, philosophically, it’s not very “model-first.” In fact, it’s the opposite. Claude Code wraps the model in a heavy harness with lots of guardrails and restricted autonomy — which is ironic, given Anthropic builds some of the safest models out there.

Compare that to OpenCode — which basically trusts the model and lets it operate more freely. If you assume a properly sandboxed environment, you could even argue that approach is safer long-term. Less rigid scaffolding, more adaptive behavior.

It raises a bigger question: where is all of this heading?

Do we end up with every major company building its own agentic coding framework?

Or do we converge toward full-blown “agentic operating systems” for development — the Linux / FreeBSD / Windows / macOS equivalents of AI-native coding environments?

Personally, I’m still leaning toward curated, manually reviewed extensions layered on top of these systems — scoped per repo or per org. Not fully open (at least for now), but composable and controlled.

Either way — this space is getting interesting fast.

[1] https://arxiv.org/pdf/2604.14228v1
Folks, a geopolitical question that I would not be comfortable asking in my other social media.

Assuming the US is interested in slowing down China's growth by constraining oil supply (Hormuz, then Malacca), what about nuclear?

In my mind, this question splits into two major ones:

1. Does China need to import the raw materials for nuclear power? If yes, is controlling that supply something plausible to attempt? (I mean, those will not be huge and heavy tankers, after all.)

2. Could starving a big country of oil just result in it advancing on nuclear faster? (Much like cutting Assange et al. off from wire transfers pushed the leaking crowd towards crypto, ultimately yielding the opposite result.)

Of course, if there are other major directions or sub-questions to ask, I'd love to hear about them.

Truly curious, and thx in advance!
I literally wondered whether ChatGPT could generate this, and the first thing it said was "Hey, check out this new image generation model!"

Well, looks good indeed. Me happy.
TIL: most asteroids are not solid at all!

Decades of spacecraft visits and spin-rate surveys have revealed that the vast majority of asteroids larger than a few hundred meters are rubble piles — loose aggregates of rock, dust, and boulders held together almost entirely by their own weak self-gravity, with essentially zero tensile strength. Think of a gravel pile in space, not a boulder.

The key evidence:

The spin barrier. When astronomers plot spin rate vs. size for thousands of asteroids, there's a sharp cutoff: essentially no asteroid larger than ~200 meters spins faster than once per ~2.2 hours. That's exactly the rate at which a cohesionless rubble pile would fly apart under centrifugal force. If large asteroids were monoliths, we'd see plenty spinning faster. We don't. They're rubble.

Direct imaging. Missions like Hayabusa2 (Ryugu) and OSIRIS-REx (Bennu) found surfaces covered in boulders, with bulk densities well below solid rock: Bennu is ~1,190 kg/m³ vs. ~3,000 kg/m³ for its constituent material, implying roughly 50-60% porosity (1 − 1,190/3,000 ≈ 0.6). When OSIRIS-REx touched Bennu, the surface behaved like a ball pit; the spacecraft sank in. Ryugu was similar.

Shapes. Many are "spinning top" shapes (Bennu, Ryugu, Didymos) — the equatorial bulge you'd get from a gravitationally-bound pile of gravel slowly redistributing itself under rotation.
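
A quick sanity check on the spin barrier, as a minimal sketch. For a strengthless sphere, gravity at the equator has to supply the centripetal acceleration, and the radius cancels out, so the critical period depends only on bulk density: P = sqrt(3π / (Gρ)). The function name and the 2,200 kg/m³ density below are assumptions picked for illustration, not figures from the text above.

```python
import math

G = 6.674e-11  # gravitational constant, m^3 kg^-1 s^-2


def critical_spin_period_hours(bulk_density_kg_m3: float) -> float:
    """Shortest rotation period a cohesionless sphere can sustain.

    Equatorial balance: G*M/R^2 = omega^2 * R with M = (4/3)*pi*R^3*rho.
    R cancels, leaving P = 2*pi/omega = sqrt(3*pi / (G*rho)).
    """
    period_s = math.sqrt(3.0 * math.pi / (G * bulk_density_kg_m3))
    return period_s / 3600.0


if __name__ == "__main__":
    # 1190 is Bennu's bulk density from the text; 2200 is an assumed
    # "typical" rubble-pile value; 3000 stands in for solid rock.
    for rho in (1190.0, 2200.0, 3000.0):
        hours = critical_spin_period_hours(rho)
        print(f"rho = {rho:6.0f} kg/m^3 -> critical period = {hours:.2f} h")
```

With the assumed 2,200 kg/m³ the limit lands close to the ~2.2-hour cutoff, and because size drops out of the formula, the same barrier applies across the whole size range. A fluffier body like Bennu has a longer critical period, so it must spin even more slowly to stay together.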

Are there any solid ones?

Yes, but with caveats:

Small monoliths (< ~200 m) are common and genuinely solid: they're fragments left over from collisions between larger bodies.

Metallic (M-type) asteroids like 16 Psyche (~220 km) are thought to be exposed cores of differentiated planetesimals — likely much more coherent, possibly solid iron-nickel. Psyche is the current NASA target precisely to find out.

A few near-Earth asteroids show signs of some internal cohesion — spinning faster than the rubble-pile limit — suggesting they have at least modest tensile strength (maybe a few hundred Pa to a few kPa, nowhere near solid rock's ~10 MPa). These are "weakly bound" rather than truly monolithic.

What's actually near Earth at 50 km scale?

Near-Earth asteroids (NEAs) in the 50-km class basically don't exist. The largest NEA is 1036 Ganymed at ~38 km, and it's the only one above ~20 km. Almost all NEAs are sub-kilometer. The 50-km-radius (100-km diameter) bodies live in the main belt — Ceres, Vesta, Pallas territory — and those are dwarf-planet-scale objects with their own geology.
Paraphrasing a co-worker, C in harness stands for clarity.
Something's broken: I definitely used Cursor extensively in the past few days.

But oh my, these numbers are insane.
Thought of the day.

In the age of AI, it's the LLM tokens that are expensive.

Compute, as in EC2 or ECS or Hetzner, is merely collateral damage. Nobody cares about those costs as long as the LLM tokens are burned with high utility.

AI did to compute what compute did to storage.

Which also means there's tons of money to be made in compute in the years to come — much like there's tons of money to be made in storage as of the past 5+ years.

My bet is that consistency and durability are what will sell well: with storage, as they have for a few years already, and with compute, starting about now, as compute becomes the fungible auxiliary unit next to LLM tokens utilized at scale.
Weird how Elon says he [co-]founded OpenAI because he wanted a counter to Google’s approach to AI: closed, private, for-profit.

And as of now the largest beneficiary of the ongoing trial is indeed Google, which has softened up since, though all of Musk's original arguments against it clearly still hold true.
Whoa, I asked Claude to show me state income tax rates in 2015, 2020, and 2025. Was expecting to see tax hikes, but there are far more states that actually lowered their taxes.

A pleasant surprise indeed!
A friend is hiring a Staff for Uber @ Amsterdam, for the intersection of AI and security, on the harness / guardrails side I assume. Super interesting.

https://www.uber.com/global/en/careers/list/159033

Happy to help arrange a call if I worked with you before and/or know your work via one handshake.
66.88%. 80.1%. 85%. 90.79%. 93%. 100%.

These are all SOTA scores on agentic memory benchmarks. None of them tell you whether the system will work in production.

The deeper problem isn't the data — it's that we often misunderstand what these numbers actually measure. In our recent whitepaper we open-sourced datasets that target specific memory functions. Today we published a follow-up that explains why we think the well-known agentic memory benchmarks (LoCoMo, LongMemEval) miss the mark for production systems, and what we measure instead.

https://xmemory.ai/chasing-sota-in-ai-memory/

We're in a field that is measuring itself against itself.

The real question is not “are we beating last week's leaderboard?” It's “are we building something that makes people's work meaningfully better?”

That's harder to measure. It's also the only thing that matters.
Given how insecure our systems are proving to be, perhaps it’s time to admit a custom-built Web3-signed commands processor is safer than ssh?

Just run a small in-house blockchain spanning the nodes one needs access to, and whitelist a few public keys so that the holders of the matching private keys can issue shell commands and read all the output. Voilà: zero-day Linux vulnerabilities are gone, or at least the attack surface shrinks ~100x.

Why aren’t big security players such as VPN providers offering this service already?
Speaking of prompt optimization techniques in particular and of slop in general:

Jury instructions are frequently not understood. Studies dating back to the 1970s and continuing through today consistently find that jurors comprehend only somewhere between 40% and 70% of standard jury instructions, even after hearing them read aloud. A linguistics expert at Northeastern, Janet Randall, found that comprehension was significantly worse when instructions contained passive voice, presupposed information, and legal jargon. A more recent finding is striking: the complexity of legal concepts, rather than the complexity of language, is the primary cause of difficulties with comprehension. Even rewriting instructions in plain English only modestly improves things, because the underlying concepts (mens rea, proximate cause, scienter, reasonable doubt as distinct from "beyond a shadow of a doubt") are genuinely hard.
Engineers aren't tech leads of agent swarms. They're founders of agent tribes.

Three thoughts that turned out to be the same thought.

I. Constitutions, not values.

Any moral opinion becomes a political stance the moment you scale it across millions of imperfect people over centuries. Politics is the friction you get when a moral idea meets a hundred million people and a hundred years.

At the other end of that spectrum sit constitutions: not statements of values, but the harness that lets imperfect people carry values across generations without burning the place down. A system of checks designed so that one bad decision — even one the majority loves — can't ruin the whole country.

That reframes how I think about moral opinions. Not "is this right?", but "what kind of system does this turn into when you scale it across imperfect people over a century, and does that system survive?"

A rule that only works when good people enforce it isn't a rule. It's a wish.

II. Agentic systems are the same problem, on a faster clock.

Engineers using Claude Code, Codex, and the rest aren't dispatching tickets to direct reports. We're designing meta-harnesses — constitutions — that turn a goal into coordinated agent work whose sum is greater than its parts.

The failure modes rhyme exactly with failed states: anarchy on one end, brittle over-centralized rigidity on the other. The job is to find the middle.

Today's tools are early tribal chiefs: brilliant while the chief is competent, fragile the moment the chief is replaced. They aren't broken — they're wonderful — but they're tied to the particular engineer running them. Agents are improving too fast for that to be the long-term answer.

III. Writing the right words is literally wealth creation.

The U.S. founding fathers' artifact was a set of words. Give them a few percent share of 250 years of American GDP and you'll easily land at "the largest value creators in the history of humankind."

Mechanically, they sat in rooms and argued about phrasing.

That's the skill the best engineers now need. Not prompting by hand. Not shaving tokens. Writing the words that turn capable-but-aimless agents into something that creates lasting value.

Every harness rule should express clean intent. The intent should cut deep — into what's actually being built and why. The system around the rules should ask questions when intent is unclear, and accumulate understanding instead of starting fresh every time.

Conclusions.

The right metric isn't tokens saved. It's universality — how much of your company's intent the harness can absorb and carry forward without you in the loop on every decision.

Tokens are cheap and getting cheaper. Feedback loops have collapsed from days to minutes. Tweaking prompts to save a few hundred tokens is debating the font of the parchment.

The work that matters is constitutional. Saving 50% of tokens means nothing compared to a harness that talks to your whole company nonstop, learns the shape of your organization within weeks, and keeps producing the work indefinitely — as a productive team member that carries your vision through time.

The founding fathers would understand.

Full post: https://dimakorolev.substack.com/p/founding-fathers-of-agents
As we've definitely entered the era of smart AI agents, I think it's time we go back to the roots with shell scripting.

Say no to long shell scripts. Say no to complex business logic in shell scripts. Instead, this logic belongs in plain English!

Replace one script that has ten options with ten shallow do-one-thing-and-do-it-well scripts. Or fifteen, if some of them have modes and start / stop / check functionality.

Then write a plain English file, likely markdown, outlining for the agent what these scripts are, how to use them, and what to keep in mind. A human may also read this English file, but who are we kidding?

And then invoke these scripts via a decent AI agent, such as Cursor. In your workflow, as needed, it will present a concise summary, with tables where needed.

If uncertain about the output, just ask it again, in plain English. And if some script output needs proper visualization, that's definitely not the job of the script itself, but of its outer harness.

And this outer harness is the human operator and an AI model. A model so small that it'll be running locally in just one or two generations of our laptops.

What a time to be alive.
The Short Interval Between Two Eras

Here's the post I forgot to write a few weeks ago, right after watching Project Hail Mary.

(The book is excellent too.)

One distinct aftertaste of the movie is how precisely the short interval between the book and the screenplay coincided with the short interval between the pre-AI and post-AI adoption eras.

Just a few years ago, computer interfaces were presumed to be incredibly difficult to manage. And computer programming — a.k.a. software engineering — was widely understood to be a difficult discipline.

Andy Weir is famously a "science maximalist": his books are rife with the idea that The Science has solutions to all problems. Yet, reading The Martian, I could not help but wonder how it is possible that a botanist navigates the intricacies of a complex software system so flawlessly.

Reading Project Hail Mary, I found myself thinking along the same lines. Being good at first-principles physics does not automatically make one proficient with bleeding-edge tech — which, inevitably, is what an interstellar spacecraft is full of.

And yet! By the time the movie is released, we know for a fact how powerful human-first, natural language interfaces can be.

Quite literally just five years ago, the thought that an interface could be both powerful and intuitive was unimaginable. Sure, sci-fi authors had been talking about this possibility forever. But the tech community — yours truly included — was rather sceptical.

AI-assisted coding has changed this in a matter of single-digit years. Astonishing product progress, if you ask me.

Moreover, Neuralink et al. no longer look impossible. Muscle memory is tricky, so it'll be a while until one can wake up like Neo in The Matrix, knowing Kung Fu after a short session.

But for intellectual tasks — for virtually all of them — the problem can largely be declared solved. To my taste, if one can understand the domain, articulate the desired outcome, and answer a few clarifying questions, then even today's AI models are quite capable of making things happen at astronomical scale. Literally astronomical: tasks such as planning extraterrestrial space travel are a piece of cake for a mid-sized AI model equipped with just a few tools.

What a time to be alive (c)
Important question: is it true that an LLM is better than other LLMs at suggesting how to optimize prompts for itself?

Say I have a process which involves querying the model again and again. And as the good ML/AI citizen I am, I have labeled data — journaled and marked-down records of where the model performed well, and where it might need improvement.

Fine-tuning and post-training the model are of course plausible directions. But that's slow and expensive. A much cheaper alternative is tweaking the prompts.

So I have a process to incrementally run some end-to-end test. Or, in pure SGD terms, I have a process to incrementally approximate some gradient stochastically.

As a result of this process I get a suggestion on what can be improved in the prompts, so that the result gets Pareto-better. Presumably. So: check and repeat.

I also use other models to cross-check those improvement suggestions, to make sure they don't overfit to the very problem — since it's often generalizing a particular failure mode from just a few examples.
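
The loop described above can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not my actual pipeline: `run_model`, `propose` and `cross_check` are hypothetical callables standing in for real LLM calls, and "Pareto-better" is simplified to "no labeled example regresses and at least one improves".

```python
from typing import Callable, List, Sequence, Tuple

Example = Tuple[str, str]  # (input, expected output) from the labeled journal


def score(prompt: str, data: Sequence[Example],
          run_model: Callable[[str, str], str]) -> List[bool]:
    """Per-example pass/fail of `prompt` on the labeled data."""
    return [run_model(prompt, x) == y for x, y in data]


def pareto_better(new: List[bool], old: List[bool]) -> bool:
    """True if no example regresses and at least one improves."""
    pairs = list(zip(new, old))
    return all(n >= o for n, o in pairs) and any(n > o for n, o in pairs)


def optimize_prompt(prompt: str,
                    data: Sequence[Example],
                    run_model: Callable[[str, str], str],
                    propose: Callable[[str, List[Example]], str],
                    cross_check: Callable[[str, str], bool],
                    rounds: int = 5) -> str:
    """Hypothetical closed loop: evaluate, propose, cross-check, accept."""
    current = score(prompt, data, run_model)
    for _ in range(rounds):
        failures = [ex for ex, ok in zip(data, current) if not ok]
        if not failures:
            break  # nothing left to fix on this dataset
        candidate = propose(prompt, failures)   # the "improver" model
        if not cross_check(prompt, candidate):  # a second model vetoes overfit edits
            continue
        new = score(candidate, data, run_model)
        if pareto_better(new, current):
            prompt, current = candidate, new
    return prompt
```

The point of keeping `propose` as a plain parameter is that the improver model becomes swappable: the same loop can run with model A improving prompts for model A, or with a different provider's model in that seat, and the labeled data decides which one actually helps.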

This is plausible. This works. This is cost-effective.

But one thing doesn't let me sleep well at night. We have this belief that model A is better at improving prompts for model A. What is this belief based on? Is it even true?

It sounds wise to use Anthropic models to improve prompts that are later fed to Anthropic models. Replace "Anthropic" with any LLM provider here. But do we know whether it's true at all?

Perhaps the next big thing is some Grok training a model that can prompt some Qwen better? Asking for a friend, of course.

Seen in this light, all in all, the idea of a proper bench of models and model ensembles in a closed-loop system starts to look more and more lucrative.