Not boring, and a bit of a condescending prick
Semi-digested observations about our world, shared right after they are phrased well enough in my head to be shared more broadly.
Many years ago, back at Google, the junior me suggested a crazy idea: every line of code has a half-life.

In other words, the company is committed to keeping no legacy code, by constantly updating its codebase.

Say, your team shipped 20K lines of code in late 2008. At most half of them are allowed to still be part of the codebase by the end of 2010, at most half of the half can remain by the end of 2012, etc. If your code is business-critical, then make sure you re-implement at least half of its logic every two years.
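To put a formula behind the example above: the rule is just exponential decay with a two-year half-life. A toy sketch, with a function name of my own choosing:

```typescript
// Toy sketch of the "half-life of code" rule: given a two-year half-life,
// how many lines from a given vintage are still allowed to survive later on.
function allowedSurvivingLines(
  shippedLines: number,
  shippedYear: number,
  currentYear: number,
  halfLifeYears: number = 2,
): number {
  const halvings = (currentYear - shippedYear) / halfLifeYears;
  return Math.floor(shippedLines * Math.pow(0.5, halvings));
}

// 20K lines shipped in 2008: at most 10K may remain by 2010, 5K by 2012, etc.
console.log(allowedSurvivingLines(20_000, 2008, 2010)); // 10000
console.log(allowedSurvivingLines(20_000, 2008, 2012)); // 5000
```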

Back then we concluded that it's not too bad of an idea, especially if we can keep the tests. Which we, of course, can.

Obviously, companies (and, especially, managers!) would not endorse such an approach. They are mostly in the "move fast, break things" mode regardless, with tech debt piling up year after year being the rule, not the exception.

I still think it's a good idea. If your friends are building a company this way, let me know — I'd love to talk to them, at least.

~ ~ ~

These days I am having déjà vu when it comes to infrastructure as code.

How about introducing the term half-life for infra? At least half of the servers you have configured for prod use today must be re-configured a month from now.

If you are cloud-first, you are, in a way, already halfway there. Servers die, or get decommissioned, and you have to have a process by which your code can be re-staged on fresh machines and fully connected to the rest of the fleet.

Admittedly, in most companies there will be a few servers with fine-tuned manual configuration, for which re-staging takes far more than a few scripted actions. Those are exactly the spots that get the least attention these days, and those are exactly the ones, IMHO, that a new solution, once it emerges, would improve the most.

A big reason I like the cloud is that it's easier to be disciplined if your deployment has to follow certain rules and constraints.

For on-prem deployments, this "luxury" is not there, which, in a way, makes the DevOps job much harder in the long run.

~ ~ ~

Curiously, this is where the crypto community might be ahead of the curve.

The crypto folks don't have the luxury of relying on some cloud infra. At the same time, crypto solutions cannot afford manual configuration, mostly because they are, well, decentralized by design.

But designing software to run in a 100% decentralized fashion doesn't necessarily solve these problems right away. Sure, decentralized solutions are a lot harder when it comes to reliability, low latency, and safety from bad actors. Still, far more down-to-earth problems, such as fault tolerance and node discoverability, have to be taken care of.

~ ~ ~

I am going to make a strong claim — and a falsifiable prediction! — that if a new and effective configuration / server management solution for one's own cloud, or for a hybrid cloud, were to emerge, it would come not from the people who are thinking 24/7 about how to improve Chef or Ansible or Kubernetes or CloudFormation.

Rather, it would be designed by those looking for better ways to speed up their smart contracts by letting users run their own support nodes, for a modest income in certain tokens.
Consistent Hashing is the second episode of the Educational Series of the System Design Meetup.
Monoliths train strong engineers
Strong engineers design microservice architectures
Microservice architectures create weak engineers
Weak engineers build monoliths

(c) paraphrased from Slack comments of System Design Meetup
A crazy thought: why aren't software decisions made jury-style?

Imagine the question being discussed is whether to pick product A, go with vendor B, or take option C: building the solution in-house. Here's how the "pure" process could look.

• Screen the developers, who don't make the ultimate decision, away from all the conversations about that decision. They run the experiments, measure the numbers, read the code and documentation, make estimates, etc.

• Dedicate (twelve) jurors to ultimately make the decision. Make sure they are not exposed to any of the underlying work at all, i.e. that they are not having lunches or drinks with the developers running the experiments.

• Have advocates, a.k.a. prosecutors/defendants, for each option. Since good senior engineers and architects should be able to advocate both for and against any idea, these roles can and should rotate over several days (e.g. Alice advocates for B on Monday, then for C on Tuesday).

• Have a high-level moderator (a Head of Engineering or Data, or a CTO, acting as the "judge"), whose job is limited to a) making sure the right questions are asked and answered, b) making sure the thresholds against which the numerical results of the engineering experiments are compared get discussed, agreed upon beforehand, and communicated to the jurors accordingly, and c) preventing off-topic, biased, or under-justified opinions from becoming too vocal in the [court] room.

• Hold the process of making the engineering decision (or "the trial") over several days, making sure that all the decisions are ultimately made by the jurors, with the jurors only receiving the relevant input datapoints, and no noise.

Obviously, for simple decisions the above is unnecessary. But when it comes to company strategy, when people are biased, and when the technological future five years ahead is at stake, I'd love to watch such a process, participate in it, or even moderate one!
The programming language of the mid-21st century will:

⒜ Have a primitively-recursively-evaluated static type system.
⒝ Focus on the very business logic / application code being imperative.
⒞ Enable the tests to be functional and/or declarative.

On ⒜, the point is that strong typing doesn't have to be limited to numbers vs. strings and objects vs. arrays. Whatever invariant can be enforced statically is worth enforcing statically. Just think of an IDE highlighting that you forgot an important corner case, not on the level of forgetting one switch / match option, but on the level of breaking some one-to-one contract as part of a presumably atomic mutation / transaction.
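To make ⒜ a bit more concrete, here is a minimal TypeScript sketch of that deeper level of static enforcement; the Order type and markShipped() are made up for illustration. The contract "an order is shipped if and only if it has a shipping timestamp" is encoded in the types, so violating it as part of the mutation simply does not compile:

```typescript
// An order can be "shipped" only together with a timestamp, never without one.
type PendingOrder = { status: "pending"; shippedAt?: never };
type ShippedOrder = { status: "shipped"; shippedAt: Date };
type Order = PendingOrder | ShippedOrder;

// The presumably atomic mutation: the type system will not let this function
// return a "shipped" order while forgetting to fill in the timestamp.
function markShipped(order: PendingOrder): ShippedOrder {
  return { status: "shipped", shippedAt: new Date() };
}

// And the classic, shallower level: forgetting a case fails to compile.
function describe(order: Order): string {
  switch (order.status) {
    case "pending":
      return "not shipped yet";
    case "shipped":
      return `shipped at ${order.shippedAt.toISOString()}`;
    default: {
      const unreachable: never = order;
      return unreachable;
    }
  }
}
```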

For ⒝, I'm becoming more pragmatic as I get older, and we just won't go functional any time soon. The hardware that runs our code is imperative (don't get me started on OOP, it really mostly is syntactic sugar in its present shape & form, and actors-reactors or lock-free queues are imperative AF). Besides, the complexity of code is no longer within individual "app features", but rather in the contracts between various components (a.k.a. microservices and DB/cache/LB layers). So, keep it simple, and write strongly typed and easy-to-follow imperative code.

Finally, ⒞ is where we need the most help, I believe. We've had something of a golden age of testing over the past ~10 years, with the explosion of unit testing frameworks, mocking, shimming, model-based testing, etc. All of these are hopelessly behind these days, IMHO. Not that they are bad; they just haven't adjusted to modern needs. Most real production problems these days are nearly impossible to catch with unit tests, unless the amount of thought and work put into testing is far greater than that put into development. Configure your message broker wrong, mishandle idempotency in one place, or rely on the order of messages being as expected — and, voila, a few years later, under an increased load, those pesky production issues emerge like mushrooms after rain.
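To illustrate what I mean by ⒞, here is a minimal sketch of a declarative, invariant-style test aimed at exactly those failure modes; applyEvent() is a hypothetical idempotent reducer, and the stated invariant is that duplicated and re-ordered deliveries must not change the final state:

```typescript
// Hypothetical application code: a reducer that dedupes messages by id.
type Msg = { id: string; amount: number };
type State = { processed: Set<string>; total: number };

function applyEvent(state: State, m: Msg): State {
  if (state.processed.has(m.id)) return state; // idempotency by message id
  return {
    processed: new Set(state.processed).add(m.id),
    total: state.total + m.amount,
  };
}

function run(delivered: Msg[]): number {
  let state: State = { processed: new Set(), total: 0 };
  for (const m of delivered) state = applyEvent(state, m);
  return state.total;
}

// Fisher-Yates shuffle, to simulate arbitrary delivery order.
function shuffle<T>(xs: T[]): T[] {
  const a = [...xs];
  for (let i = a.length - 1; i > 0; i--) {
    const j = Math.floor(Math.random() * (i + 1));
    [a[i], a[j]] = [a[j], a[i]];
  }
  return a;
}

// The declarative part: state the invariant, then check it over many
// generated delivery schedules (duplicates + arbitrary order), instead of
// scripting one happy-path scenario.
const base: Msg[] = [
  { id: "a", amount: 10 },
  { id: "b", amount: 20 },
  { id: "c", amount: 30 },
];
const expected = run(base);
for (let trial = 0; trial < 1000; trial++) {
  const delivered = shuffle([...base, ...base]); // a broker may redeliver
  if (run(delivered) !== expected) {
    throw new Error("idempotency / ordering invariant violated");
  }
}
console.log("invariant holds over 1000 randomized deliveries");
```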

I wonder how many people are thinking along the same lines. The nerd in me would love to help the world get there, but it's not clear how I can best start, aside from aligning my worldview with what other smart people have to say, and educating the people who work with me.
Forwarded from SysDesign Meetup
Figured I could start executing on 2022 plans for the System Design Meetup before 2022 has started: https://dimakorolev.substack.com/p/happy-2021-holidays

TL;DR: Yours truly and the System Design Meetup now have our own Substack.
Got the permission to publish some of our real work's research, and decided my Substack would be a good place for it: https://dimakorolev.substack.com/p/open-policy-agent-musings

If you know about the topic, or know someone who knows about the topic, or know someone who knows someone, I'd appreciate a forward — the text and the research should be deep enough to be worthy of a few minutes of an experienced person who's in this space. Thanks!
Re. the Lex Fridman & Peter Wang podcast.

Folks, you both are missing the point.

There is inherent beauty in the fact that human-to-human communication is messy and ambiguous. This beauty comes from the fact that when we communicate with humans we develop an emotional connection.

There's a certain non-fungible pleasure in "clicking" with another person or with an audience. Over time, assuming two humans, or a group of humans, choose to and do communicate more and more, important higher-order effects emerge, ref. "finishing each other's sentences". These higher-order effects largely are what, in my book, qualifies as "building a relationship", and the way humans are wired makes (most of) us long for quality relationships.

A seasoned dog owner would tell you their relationship with their dog follows a similar pattern, and activates similar neural and hormonal "pathways" in the dog owner's self.

The inescapable property of this relationship being built and developed is that not just one human, but all the parties change over time; not only in our ideas and how we express them, but also in how we align the higher-order concepts in our heads, which topics keep popping up, which are best avoided, etc., etc. It's hard to imagine a dog owner sharing their beloved dog with a hundred more dog owners -- there would just not be enough room (enough "intimacy" if you wish) for this dog to "contain" the nuanced relationships with all those humans.

Now, the argument that made me write this post is that "ML-powered" (machine-learning-powered) programming is the approach that takes software development further away from the "formal", "robotic" process that it is today, and thus closer to "human to human" collaboration.

I see a lot wrong with this argument, but the major point I have in mind is the following. If you want to build software by means of a language that is inherently ambiguous, you'd have to build a human-level relationship with the "ML-powered tool" that you are partnering with in order to build that software.

In other words, the very possibility of ML-based "messy" programming languages taking off is predicated on a critical mass of human beings who are both capable of and willing to establish human-style emotional bonds with non-living entities.

Which I personally do not view as something impossible. On the cybernetics/computability level, I would argue that it already is relatively easy to "engineer" and then "simulate" a "partner" that would be able to successfully act as a high-quality replacement for, say, a pet.

But, in the grand scheme of things, it is my understanding that very few people would truly be comfortable taking such a "replacement" over the "real thing". After all, the primary reason humans keep pets is to fulfil their emotional needs more effectively.

To continue with the dogs analogy, think of the K9 dogs that help patrol our airports. The human being who is working in tandem with that dog has to establish a strong emotional bond for this partnership to work. How many humans are capable of playing this role?

So, I think with ML-powered "programming" the state of the art would be about the same. In some specific cases — similar to K9 dogs helping to keep our airports safe — tandems of humans + adaptive ML models would emerge as a more effective solution, on the social scale. Much like the customer support department of virtually any large company is now augmented with robots answering voice calls and with chatbots that take a good share of the job away from human representatives.

But we all know those chatbots can only take us that far. *Even* for the domains where the scope of what can be done is remarkably limited and well mapped.

And this is far before we enter the territory of thinking about the state of a piece of software evolving over time. Which requires brain wiring of a level absolutely unreachable even for the top-notch ML systems of today. The example of asking GPT-3 to talk to you in sentences of an odd number of words comes in handy here; and that is still about "imagining" one step ahead, not a potentially infinite & unbounded evolution of the data in some DB and the UI/UX of some application.

And any nontrivial piece of software absolutely requires reasoning on this "infinite evolution of data" level, where the mutations of this data form chains and sequences, and inevitably depend on one another.

Now, maybe you can imagine a machine learning system that can "pretend" to "understand" you (by correctly reading between the words and "completing your sentences"), and thus convert your unfiltered & unstructured thoughts into fixed-form software instructions -- all the power to you! Maybe we'll get there eventually.

But, personally, I just can't imagine us, humankind, developing such a technology in the next 20+ years.

Because, honestly, building correctly functioning software from incomplete specifications is literally the hardest possible AGI task I can think of.

Before we can get there, show me virtual drivers, virtual teachers or coaches, and virtual doctors who consistently outperform humans. The experts in those professions — who most certainly are extremely talented and intelligent and hard-working — still operate in far more constrained environments compared to software developers. Simply because the spaces of all possible driving conditions, of all possible ways a student may have trouble grasping some topic, or of human body malfunctions — these spaces are still rather bounded in nature, while the space of all possible algorithms and data structures is literally unbounded, and it is expanding at an increasing pace as we speak.
To my taste, this is an extremely important post: https://blog.roblox.com/2022/01/roblox-return-to-service-10-28-10-31-2021/

<rant>

First, it talks about a real problem at a real company, one that affected millions of users for an extended period of time.

Second, at least to my taste, it's quite poorly written. Interestingly, at the same time, I believe the authors of the post do not think so!

It's easy to see through the post that the people who were debugging the issue do not see the forest for the trees.

The median latency that used to be 500ms and now is 2 seconds? Why is it even important? Should 2 seconds, or even 10 seconds, not be good enough for service discovery? How often do you even have new nodes to discover, to begin with?

Shouldn't the whole fleet work just fine if the service mesh / service discovery subsystem is down and not accepting updates? At least while no machines are failing, the previously "discovered" ones should just talk to each other as if nothing happened, no?

Shouldn't there be a read-only cache present for the cases where the mutation path (the write path) is slow? Shouldn't the Consul folks think about this? Especially when introducing the "optimized" access pattern, being fully aware that, under certain circumstances that are obviously not well understood by most users, this "optimized" access pattern would inevitably be a disaster in the making?

The folks who wrote this blog post seem to not even consider the above questions important; at least it doesn't show through the text at all.

I get it, it is important to keep it cool, to praise the players who "DNS-hacked" the system to begin playing a bit sooner than others. And it is important to keep a good relationship with the Consul team. And it is important to maintain the image of a transparent, developer-first organization.

Nonetheless, to me, the post does read as an affirmation that software system design at scale is experiencing a major crisis. It's the kind of crisis that can have a cascading domino effect of failures. Across multiple companies.

<more_rant>

Heck, in my professional life I do system design interviews and coaching. And a lot of people sincerely believe Mongo and/or Cassandra guarantee data availability and consistency across the whole spectrum of possible failure modes, while local hash maps are a big no-no "because Redis is better". Now Consul, which, presumably, does offer stronger consistency guarantees than Mongo or Redis, is behind a massive outage. And nobody appears to want to ring a bell; people are just praising each other and the overall spirit of teamwork.

I get it, if Roblox et al. go down for a few hours every other year, the world will not come to a stop. But it's the same people who design the air traffic control systems, or our healthcare or elections software, or the semi-autonomous weapons that our military is planning to adopt as we speak. Doesn't this concern you at all?

</more_rant>

Okay, maybe it's me whining. But I seriously don't know what big conclusions are right to draw from this Roblox post. To me it reads as a poor post whose authors not only fail to explain the situation well, but appear genuinely clueless about how to build systems of the scale they happened to have built.

I do hope to be wrong; I do hope that at Roblox there are more people, behind the scenes, who have nothing to do with this post, but have a lot to do with helping clean up this particular mess and prevent future ones. But, from my experience, I can assure you that this is not the case generally, even if at Roblox in particular the situation is not as bad as I'm reading between the lines.

Because the hipsters do appear to run the show, at least when it comes to microservices / SOA these days, and the overall system design situation does look like the early stages of Idiocracy to my taste. With the conversation about using Mongo and Redis as a DB and a cache resembling the "it's got electrolytes" scenes more and more.

</rant>
So, in a company I'm helping with the design, we couldn't come up with a memorable name for the core component of our service.

Partly out of desperation, I looked into Sci-Fi for inspiration, and suggested we call it Arrakis. Somehow, the folks liked it.

That was late last week. Now, over the weekend, I'm thinking that the piece that grabs data from all other places and feeds it into our core should be called a [Spice] Harvester.
Regulation-heavy folks, I have a question about how the modern-day data residency rules affect core services such as authorization.

Say, your service is the one that ultimately decides if a user can only view a Google Doc, leave comments there, or edit it directly.

Now, you're a European user, subject to GDPR, of course, and you are finding yourself in the US.

Surely it is not the case that every single time you attempt to change something in this Google Doc, a request goes out to a European server (where "your data" "resides") so that that European server can confirm you are indeed granted that permission?

~ ~ ~

My limited understanding of data residency regulations is that they only affect the storage of PII (personally identifiable information). In other words:

• If you need a cache of permissions for a particular user token, for example, you are absolutely allowed to keep it anywhere, as long as that user token does not contain this user's PII, right?

• If your data is ephemeral (you just store it, in memory, on your own server), and you never respond back with it, you are absolutely allowed to serve the requests that use this data, right?

~ ~ ~

For example, your legal name is PII, so we shouldn't store it anywhere, outside the servers in Europe. Still, if you're opening your Google account, and then its settings, from America, it's an American "frontend server" that is doing the rendering of this page, right?

Thus, once you, as the user, have requested this data, it is allowed, in some form, to pass through a Google data center in America. There might be constraints on not storing this data in any caches, or any web server logs, for instance, or constraints on keeping it encrypted while in transit, but it's not that this data can't cross the country border. Or am I wrong?

~ ~ ~

On the other hand, the tricky case is age. Say, you are a European user, and you have consciously consented to share your age with YouTube in Europe. Now you are in the US, and are about to watch some videos.

Say, the publisher of a video can make it restricted to an arbitrary age. I am making it up, just bear with me. In order to answer the question whether you can watch a certain video, the age restriction on the very video would have to be compared to your age.

Thus, a "malicious" actor can create a hundred videos, 1+, 2+, 3+, ..., 99+, 100+, and then issue requests of the kind of "can this user watch this video?" The authorization service would never disclose the age, but, as all computer scientists understand, it would take seven requests to get to know this user's age down to a year.

~ ~ ~

What does the regulation say here?

If a user from Europe is opening YouTube in the US, on which side of the Atlantic should the very check happen?

For how long can this result be cached?

And is the service liable if, say, a change in important data did not propagate to the other side of the ocean quickly enough, causing the user to mistakenly be allowed to access, or be prevented from accessing, certain content?
I recall how many years ago Microsoft was made to divorce Internet Explorer from, well, the components of Windows that are integral to the OS. And that was good.

Now -- yes, I do reboot into Windows every now and then -- when I click on the text to expand what that beautiful login screen picture is about ...

... I get to choose between "Edge" and "Edge", with the third option being "Choose from Microsoft Store". And this choice begins with the "search query" of, drum rolls, microsoft-edge.

Just so that we are clear: I have Brave, Firefox, and Chrome installed. And I need to open a URL. But Windows would not show me that URL unless I "agree" to use Edge to open it. For real, just a "copy URL" option would work for me -- but it is an "essential OS feature", apparently, to tell me in which exact part of the Indian Ocean that photo was taken.

I don't even know what to make of it; I can't even convince myself to dislike Microsoft due to this. It's the rules of the game that companies play by these days. Somehow we're back to square one, where these rules are nowhere near being pro Us, The People, and they are, again, almost entirely pro corporations.

Seriously, in the US, the only "recent" example of an IT regulation I can think of that made life better for an average Joe or Jane was that one can keep their phone number when changing carriers. That's about it.

Back to Microsoft, I don't use macOS, but I do use an iPhone. When that iPhone wants to install its update, I trust Apple to do it. When my Windows 10 "recommends" upgrading to Windows 11, I'm finding myself automatically looking for an option to never offer it again.

And here we are, with the same old company, employing the same old tricks, sitting at the top of the world capitalization-wise. It's a really well-run company. It cares about its employees' diversity and inclusion. It cares about climate. It cares about privacy. The only thing that seems to be missing is caring about those simple things that ruin the experiences of some end users.

I'm going to go all in though and postulate that it's the right strategy for the company. Because the people who truly demand these features from an OS do settle for Linux or for macOS, and are not the target audience of Microsoft Windows. Personally, I wouldn't boot into Windows at all if my Razer camera worked as well on Ubuntu -- 60 FPS Full HD with proper lighting correction and background blur -- as it does on Windows.
Our next #SystemDesignMeetup episode should be a real good one, especially if you are into stream crunching and data processing: comparing #Kafka and #RabbitMQ.

This actually is the first half of what I have to share about the Distributed Task Queue problem. We just decided to split this into two, as there's quite some content on the topic that is not specific to the very problem, but rather talks about message queues used out there in prod.
I just merged in a pull request from dependabot that bumps the versions of some npm modules. And then five more emerged, as if this bot is a hydra. And I've merged in all of them.

This, most likely, is a good thing by itself. But I can smell a lot of possible bad things coming out of this. I mean, this approach is not security through obscurity, but it does create a lot of room for security problems down the road.

Just imagine developers all over the world learning to blindly click "merge" on three-line pull requests to their package-lock.json files. What if a package is hijacked, and updating it results in running malicious code on your production box, or on your home+work laptop? What if a zero-day vulnerability, in node or in npm, or elsewhere, is found, and this kind of automatic pull request is [ab]used to trigger an avalanche of people updating, by accident or through second-order malicious intent?

Finally, what percentage of people would blindly accept a +3 lines change from a user that is not @dependabot, but some seemingly unnoticeable alteration of the name? I'd bet at least a few percent. And how many bad things can be squeezed into three lines? An unbounded number of them. And what are a few percent of developers? 10K+ repositories? 100K+?
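There is no silver bullet here, but not every review of such a bump has to be blind. As one cheap sanity check, here is a sketch (assuming a lockfileVersion 2+ package-lock.json) that flags any dependency whose resolved URL does not point at the official npm registry:

```typescript
// Scan package-lock.json and warn about dependencies resolved from anywhere
// other than the official npm registry.
import { readFileSync } from "node:fs";

type LockEntry = { resolved?: string };

const lock = JSON.parse(readFileSync("package-lock.json", "utf8"));
const packages: Record<string, LockEntry> = lock.packages ?? {};

let suspicious = 0;
for (const [name, info] of Object.entries(packages)) {
  if (info.resolved && !info.resolved.startsWith("https://registry.npmjs.org/")) {
    console.warn(`suspicious resolved URL for ${name || "(root)"}: ${info.resolved}`);
    suspicious++;
  }
}
console.log(suspicious === 0 ? "all resolved URLs look fine" : `${suspicious} suspicious entries`);
```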

If anything, this human factor of potential (and real) over-reliance on semi-automated "improvers" of our code is the strongest argument I have today in favor of using several repositories. Monorepo FTW is my motto in many situations, but, at least from a purely damage control perspective, multiple repos help keep some boundaries tight.

Yesterday I could argue that if some TypeScript code and some C++ code are sharing the same data schema, then it is quite beneficial to keep this code, and the relevant schema, in the same repository.

Today I would think twice before allowing package[-lock].json anywhere in a repository where the main language is C++, and where various pre-submit scripts and post-merge actions do build and run this code on the servers that I care for security-wise.

Or maybe this post just signifies that I've bought into the religion of GitHub actions at scale. Yes, they are quite wasteful, true that. Yes, they often introduce extra complexity. But in our unstable world they do seem to be the safest technique of loose coupling in environments where zero trust is quickly becoming the norm, especially when it comes to popular, widely depended upon, and somewhat fragile open source projects.
For a while, my understanding of what WebAssembly, wasm, is good for fit roughly the following pattern:

• There is an isolated, mostly algorithmic, task.
• This task is best solved on the client side, in the browser.
• One could think of writing an npm module that solves this task.
• But an implementation in another language already exists, on the backend (BE).
• So just take this implementation, wrap it into wasm, "ship" the very function to the frontend (FE), and you're done.

One big benefit of this approach is that the FE itself becomes fully decoupled from the implementation of this magic function. There is no dependency on the FE developer, on the build process of the FE repo, or the like. Local iterations are fast and local, just as I like them.

Plus, the tests for the existing code effectively test the wasm-transpiled code as well. In my case, when the code is C++, tested with googletest, and the transpiler is Emscripten (emsdk), everything just works out of the box. I only needed to add a build target so that, among other things, my freshly-built BE also exposes the very transpiled wasm function.
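For illustration, the FE side of this can be as small as the sketch below. I am assuming the C++ is built with Emscripten's MODULARIZE / ES6 output and exposed via embind; the module name and solveMagicFunction() are made up, and the glue's type declarations are hand-waved away:

```typescript
// Emscripten's MODULARIZE output default-exports a factory that loads the
// .wasm file and resolves to the initialized module.
// @ts-expect-error: no type declarations for the generated glue in this sketch
import createSolverModule from "./solver.js";

async function solveOnce(input: string): Promise<string> {
  const module = await createSolverModule(); // fetches & instantiates solver.wasm
  return module.solveMagicFunction(input);   // the embind-exposed C++ function
}

// The FE only knows the function's signature, not its implementation
// or its build process.
solveOnce("payload").then(console.log);
```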

On the team dynamics level, it is awesome that the FE people can focus on making the product look & feel great to the user, while the BE people can do the data heavy lifting. And this way of using wasm does the job indeed.

But the above is Level One of wasm. I'm enlightened now, and can fast track you through the full journey.

Level Two: Ship the "demo" of your SPA with 100% of its BE code served by browser-run wasm.

Chances are, your BE is already implemented in a language that can be transpiled into wasm (node.js, Java, C++, Go, all fit). So, it's a one-weekend hackathon for your BE team to make sure 100% of the necessary code is shipped as wasm, and then another one-weekend hackathon for the FE team to integrate it.

The win of this approach might look minor, but, trust me, it is oh so important for a great demo. It is that your web app becomes extremely, insanely responsive.

Chances are, especially if you are a startup, that you have taken quite a few shortcuts here and there. Extra network round-trips are being made, and/or some operations that could be cleverly combined into synchronous updates (ref. BFF) are now asynchronous. Well, with wasm — assuming the amount of data you are crunching is small, of course — consider these problems non-existent. If all you need in order to serve a response from the BE is a few quick for-loops, trust me, your wasm engine will likely do it in single-digit milliseconds, which is indistinguishable from instant for most web app users these days.

Level Three: Automagically add "offline mode" to any SPA ("single-page [web] app"), with no code change. It is surprisingly easy once this very SPA already contains an initializer for the wasm wrapper. Once it is there, literally all one needs to make an SPA offline-friendly is to:

• Build the whole BE, including the data it needs for the demo, into a single wasm bundle,
• Add logic, during the SPA's wasm initialization, to intercept the outgoing XMLHttpRequest / API calls, and then
• Whenever the FE code calls the API, don't make an outgoing HTTP call, but call the now-local wasm function instead (roughly as sketched below).
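A minimal sketch of that last step, under the same assumptions; it intercepts fetch() rather than raw XMLHttpRequest to keep things short, and both the backend wasm module and its handleApiRequest() export are hypothetical:

```typescript
// Hypothetical Emscripten-style factory for the wasm-bundled "backend".
// @ts-expect-error: no type declarations for the generated glue in this sketch
import createBackendModule from "./backend.js";

async function enableOfflineMode(): Promise<void> {
  const backend = await createBackendModule();
  const realFetch = window.fetch.bind(window);

  window.fetch = async (input: RequestInfo | URL, init?: RequestInit): Promise<Response> => {
    const url = input instanceof Request ? input.url : String(input);
    if (url.startsWith("/api/")) {
      // Serve the API call from the in-browser "BE" -- no network round-trip.
      const body = backend.handleApiRequest(url, typeof init?.body === "string" ? init.body : "");
      return new Response(body, { status: 200, headers: { "Content-Type": "application/json" } });
    }
    return realFetch(input, init); // everything else goes out as usual
  };
}

enableOfflineMode();
```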

I have not done this myself; and I would probably not go this far with automating the process. But the idea is just too brilliant for me to not share it.

Who knows, maybe soon we would learn to launch full Docker containers via wasm. There is a fully functioning wasm Linux distro, after all, so the path is paved. And then someone could ship an npm module that would just take a Docker container, wrap it into wasm, and then surgically intercept the calls to the BE to now be served within the browser, probably running this Docker container in a separate Web Worker.

This someone might be me, but I've got more things to build now. So, the idea is up for grabs — feel free to!