Not boring, and a bit of a condescending prick
Semi-digested observations about our world, right after they are phrased well enough in my head to be shared more broadly.
A lot of ideas that float around with respect to AGI Alignment seem to converge to a rather trivial dichotomy:

Do you want to help people better achieve their goals / live happier lives as they themselves would like to? (1)

Or:

Do you believe people, en masse, are not good enough at making decisions for themselves, so that, to a certain degree, some form of "positive propaganda" would be a push for the greater good? (2)

If you had asked me ~ten years ago, I would most definitely have argued that (2) is a recipe for disaster, while (1), although quite hard, is the way to go.

Looking at the world today though, I am far less convinced.
Because even if — even if! — we believe human beings are good at deciding for themselves, and even if — even if! — we believe in some form of a well-functioning democracy, so that the will of the majority of people is what drives society forward ... I have major questions about the sustainability of such a model in the long run.

To put it bluntly: I am not positive I would definitively pick such a society, however much of a utopia it may sound like, over a different one, more similar to the 1990s / 2000s USA.

Because I am legitimately afraid such a society, however much of a utopia it might be, would not enable me to execute on all my desires. Despite the fact that, objectively, none of my desires would abuse or in any way harm the other citizens / inhabitants of such a seemingly perfect society.

Not sure what to make of the above thought. But to me it does, at the very least, serve as a good reminder that we are nowhere near being able to postulate the AGI Alignment problem well enough. Because, with a perfect AI or with no AI, we, humans, are having major trouble deciding for ourselves what our lives should be in the long run.

Go figure ¯\_(ツ)_/¯
I’ve used this argument a lot, but have never seen it employed seriously before. Since I’m not likely to be a genius, I must have missed something major in tech and/or sci-fi, so do enlighten me pls.

The argument is: in the context of sustainable human colonies (underwater, in orbit, on Mars, anywhere), why do we need large populations to retain large gene pools?

We can print the desired DNA molecules today, and we have plenty of genomes sequenced and stored. This data is virtually impossible to lose and/or to erase.

And we can control the DNA of an embryo. I.e. we can have a woman conceive and carry to term a human child with a desired DNA.

Thus, even with no artificial wombs whatsoever, the human race as we know it today may well, in theory, proliferate in conditions as harsh as a spacecraft traveling for thousands of years with a very scarce food supply.

Because we literally only need a single-digit number of women to be alive at any given time. Plus a hard drive with plenty of further DNAs to choose from. Plus the technology itself and the means to operate it, which is also a single-digit number of humans, and they don’t even have to be women.

Not hundreds of thousands. Nowhere near this huge number.

This, or what am I missing?
Forwarded from SysDesign Meetup
One of the ideas we are flirting with is to sketch out some “open source design” of a nontrivial architectural component.

If done well, not only could it serve as a reference document on how to prepare such designs, but also as a blueprint from which we could implement a prototype.

I think there’s room for such a design out there today, from contracts and protocol specifications (HTTP JSON to gRPC to binary) all the way to components’ implementation and unit/integration/load testing.

As an example problem I keep coming back to either the Total Order Broadcast engine or a simple De-Duplicator.

The former can be viewed as an extension of a service mesh proxy, which inserts a sequence ID into every request, guaranteeing that each and every ID is unique.

The latter can be viewed as something similar, but operating less on the transport level, focusing instead on respecting the logic of idempotency tokens.
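To make the latter concrete, here is a minimal sketch of the de-duplicator contract, in C++, with all names being mine and purely hypothetical: the first response per idempotency token is recorded, so a retry replays the original response instead of re-executing a possibly non-idempotent handler.

#include <functional>
#include <mutex>
#include <string>
#include <unordered_map>

// Hypothetical de-duplicator sketch: remembers the first response per
// idempotency token; a retry carrying the same token gets the recorded
// response back instead of re-executing the handler.
class DeDuplicator {
 public:
  std::string Handle(const std::string& idempotency_token,
                     const std::function<std::string()>& handler) {
    std::lock_guard<std::mutex> lock(mutex_);
    auto it = seen_.find(idempotency_token);
    if (it != seen_.end()) {
      return it->second;  // A retry: replay the original response.
    }
    const std::string response = handler();  // First time: run for real.
    seen_[idempotency_token] = response;
    return response;
  }

 private:
  std::mutex mutex_;
  std::unordered_map<std::string, std::string> seen_;
};

A real design would, of course, need TTLs / eviction, persistence, and would not hold a global lock across the handler call; the above is only about pinning down the contract.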

Do you think this would be a good idea? Would you be interested in participating in parts of such “open designs” one way or another? Let me know pls.
Continuing with the "crowdsourced open source design" idea, after a few review iterations, a set of principles on how we can develop and publish them is now public: https://github.com/SysDesignMeetup/sdm

I promise not to spam this channel much, but I want to get the word out that this thought has now graduated from a mere idea to something that we're executing upon.

My dream of some Architect's Corner born out of the @SysDesignMeetup and our community has taken a major step towards becoming reality!
Publishing Enriched Data Update Events
"When caching and "poor man's replication" play well together."

I've put together quite a few thoughts, and would appreciate external wisdom on how to approach them from the larger eng org standpoint.

https://dimakorolev.substack.com/p/publishing-enriched-data-update-events
Shared some thoughts on LinkedIn about being more explicit in how APIs are best structured.

My take:

• Vague names are indeed bad,
• Use the type system to prevent mistakes,
• Make state machines explicit, and
• Keep data mutations as atomic as possible!

Vague names are indeed bad: It is always a good heuristic to make sure the readers of the code are not surprised by naming conventions. Writers too, but the code is read far more often than written.

Use the type system to prevent mistakes: Let's talk about programmatically pushing code to a Git repository, or about programmatically starting Docker containers. There is a natural flow of events: commit, then push, for Git; create an image, then a container, then start it, for Docker. A good API will make it very clear to the reader at which stage of the process a particular entity is; a good API will also make sure that it is physically impossible to use it the wrong way. First create the commit, then use .push(), but only on the object that is returned by .commit(); the same logic applies to Docker containers and the order of their startup / teardown.
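A sketch of what I mean, with a made-up Git wrapper API (the names below are hypothetical, not any real library):

#include <string>
#include <utility>

// The only way to obtain a PushableCommit is via Repo::Commit(), so
// "push before commit" is not merely discouraged: it does not compile.
class PushableCommit {
 public:
  void Push() { /* ... talk to the remote ... */ }

 private:
  friend class Repo;
  explicit PushableCommit(std::string hash) : hash_(std::move(hash)) {}
  std::string hash_;
};

class Repo {
 public:
  PushableCommit Commit(const std::string& message) {
    // ... write the commit object, compute its hash ...
    return PushableCommit("placeholder-hash");
  }
};

// Usage:
//   Repo repo;
//   repo.Commit("fix typo").Push();  // OK.
//   PushableCommit c;                // Does not compile: can't forge one.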

Make state machines explicit: The example in the previous paragraph illustrates a trivial flow of A -> B -> C. This approach can be generalized. For instance, if an object can be transformed into one of two incompatible states, with different sets of further legal actions on them, it is best for these incompatible states to be of their own types. This way both the compiler/linter and the IDE would be quick to highlight what exactly is being performed, as well as to throw an error when what is being attempted is not possible in the first place. Static typing FTW, and this applies not only to general types, but also to dataflows and logical invariants.
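Sketching the "incompatible states get their own types" point, again with hypothetical names:

#include <string>
#include <utility>

// A DraftOrder can be transformed into one of two incompatible states.
// The resulting types deliberately share no methods, so shipping a
// rejected order is a compile-time error rather than a runtime check.
class ApprovedOrder {
 public:
  void Ship() { /* ... */ }
};

class RejectedOrder {
 public:
  explicit RejectedOrder(std::string reason) : reason_(std::move(reason)) {}
  const std::string& Reason() const { return reason_; }

 private:
  std::string reason_;
};

class DraftOrder {
 public:
  ApprovedOrder Approve() { return ApprovedOrder(); }
  RejectedOrder Reject(std::string reason) {
    return RejectedOrder(std::move(reason));
  }
};

// DraftOrder().Approve().Ship() compiles;
// DraftOrder().Reject("no stock").Ship() does not.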

Keep data mutations as atomic as possible! This reminds me of the difference between `adduser` and `useradd`. Admins know it by heart, but for mere mortals it's nothing but an extra source of confusion. If a student can be present or absent in some group or class or whatnot, .Add() and .Remove() are perfectly fine names to my taste.

Personally, if I see .Enroll() instead of .Add(), I will right away begin thinking about what the omitted side effects of this operation are. Maybe some other data structure (or some other database!) now has a record that in order to graduate this student must pass this test or take this exam? Or maybe this student now has extra prerequisites to complete before being eligible to begin the next year? My experience, as the reader of the code, will be suboptimal, because a seemingly better word makes me wonder whether it means something other than what I think it means.

Sure, .Enroll() is, at times, better than .Add(). But an even better way, IMHO, would be to pack the act of enrollment into a single CQRS command, or into a single domain event, if you wish.

The code would then read — bear in mind, I'm a C++ guy — along the lines of:

university_db.ApplyMutation(
    StudentEnrollmentRequest(some_student)
        .IntoClass(some_class)
        .RequireExam(true)
        .NoPrerequisites());

This way, on the type system level, forgetting to add `.RequireExam()` or `.NoPrerequisites()` for certain types of classes would result in a compile-time error.

Or in a deprecation warning, which is also useful. The deprecation warning at compile time, I mean.
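To make the compile-time-error claim concrete, here is one way, of several, to pull it off; a minimal typestate sketch, with every name below hypothetical:

#include <string>

struct Student { std::string id; };
struct Class { std::string id; };

// The only type ApplyMutation() accepts: by construction it can only be
// obtained once both the exam policy and the prerequisites are spelled out.
class CompleteEnrollmentRequest {};

class EnrollmentNeedsPrerequisites {
 public:
  CompleteEnrollmentRequest NoPrerequisites() { return {}; }
};

class EnrollmentNeedsExamPolicy {
 public:
  EnrollmentNeedsPrerequisites RequireExam(bool required) { return {}; }
};

class StudentEnrollmentRequest {
 public:
  explicit StudentEnrollmentRequest(Student student) {}
  EnrollmentNeedsExamPolicy IntoClass(Class c) { return {}; }
};

class UniversityDB {
 public:
  void ApplyMutation(const CompleteEnrollmentRequest& request) {
    // ... journal the command, apply the mutation transactionally ...
  }
};

Now the snippet above compiles exactly as written, while dropping .RequireExam(...) or .NoPrerequisites() leaves you with an intermediate type that ApplyMutation() refuses to accept.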

Moreover, once the company is ready to introduce business events, this particular StudentEnrollmentRequest is an atomic (!), transactional (!!), and ready-to-be-replayed (!!!) mutation of the data. The response to it, some StudentEnrollmentResult, is also a strongly typed message, which can be journaled into an adjacent stream with the same offset ID, and these two streams — of commands and of responses — form a 100% event-sourcing-friendly design.
My new public post: Miro & Styra: Battle-testing Enterprise OPA.

Learn how Miro extensively benchmarked Styra’s Enterprise OPA and found it was 10,000 times more performant than the open source container.

#miro #opa #rego #styra
Crazy thought, but I can't shake it off.

Imagine "Independence Day". Some Jeff Goldblum, with the help of some Will Smith, needs to find a way to reverse-engineer, hack, and then penetrate some alien spacecraft, some older version of which crashed a long time ago in what we now know as some Area 51.

What tools will the software people assigned to this project use?

Linux? Most likely.

Python? Possibly.

Hadoop? Maybe.

Rust / C++ for high-volume data crunching and SQL / Spark for real-time querying? Sounds legit.

But will there be Kafka? I seriously doubt it.

Because if there's a large volume of data that needs to be crunched in real time, we would need a custom-built solution. If the data needs to be crunched at near real time -- even more so. If no real-time updates are needed, just save snapshot files every five seconds, aggregated into every minute, then hour, then day.

So, let me get this straight. Isn't it true that Kafka is solving a problem that is not fundamental to software architecture? In other words, isn't it true that had we had properly designed databases and event buses before Kafka emerged, there would just be no market for Kafka to enter?

And, if so, isn't Kafka today a liability for future generations? As in, could it be that we are already past the point where it makes more sense to work around Kafka, as opposed to trying to fit into it, if our goal is to build long-lived software?

Just curious.
Is Event Sourcing about historic data or historical data?


ChatGPT

Event sourcing is a design pattern used in software architecture that deals with persisting data as a sequence of events. Instead of storing the current state of the data in a database, it stores the sequence of actions or events that led to that state.

This might seem like a subtle difference, but it has profound implications. Storing the current state only allows you to answer the question "what is the current state?" However, by storing events, you can also answer questions like "how did we get to this state?" and "what is the history of changes to this state?" This ability to navigate through time and understand the history and progression of changes can be very powerful.

So, to answer your question, event sourcing is about historic data in the sense that it records the series of events or changes over time. It doesn't necessarily deal with "historical data" in the sense of historical facts or information, unless the system it's being used on happens to deal with historical information. In general, the term "historic data" refers to a record of past events, which fits the idea of event sourcing quite well.

Looks like GPT-4 is wrong here though. Event logs, and event sourcing in general, are about historical data, much like past stock quotes are.

Historic moments are the special ones; "historic" is somewhat synonymous with epic or majestic. "Historical" is the right word when it's the stream that matters: the order is crucial to maintain, the events are important, and removing any one of them breaks the whole thing.


Interestingly, GPT-4 makes an even stranger statement:

Dima:


Would the Bitcoin ledger be the one of historic data or of historical data?

GPT-4:


The Bitcoin ledger, also known as the blockchain, is essentially a record of historic data ...


"Historical data," in a broader context, often refers to information about the past, often of significant or impactful events, which doesn't quite fit how we think about the Bitcoin ledger, despite the significant impact Bitcoin has had. That said, there might be cases where "historical" could be used to refer to the Bitcoin ledger, especially if looking at it from the perspective of its role in the history of digital currency or blockchain technology.

Personally, I've settled this question for myself by looking at the number of Google hits for {Bitcoin, $TICKER, Dow Jones} "historic data" vs. "historical data". Historical it is for event logs.
Not sure who needs to see this, but as a vim user — RIP Bram! — I very much dislike bulky IDEs and heavy build tools. So, here's a small Makefile repo that can run Kotlin code with make.

If you need modules, one-module-per-line definitions in deps.txt do the job. A make install target is available to do what npm i or pip3 install -r requirements.txt do. (Hmm, maybe I should rename this target to some make modules or make deps?)

I used mvn behind the scenes, just because it was the easiest to get working in this "ad-hoc" manner of downloading packages one by one into a specific directory.

There are also GitHub Actions in the repo to confirm that the stuff does indeed do what it should. And a few examples are there as well, including typed and untyped JSON operations, as well as an HTTP server.

PS: No, Gradle is not minimalistic to my taste. The amount of boilerplate code to import one dependency is just absurd. It's probably the largest crap-to-value ratio I've seen in a while.

PS2: This kotlin-stdlib.jar not being found by default just sucks. Why there's no kotlinc --print-kotlin-stdlib-jar-path is beyond me. I just ended up searching for it in a few places as part of the make process. If you are on a Mac, pointers appreciated.

PS3: No, I'm not learning Kotlin. It's a compromise, and the lesser of several evils. Besides, one has to exercise their coding muscle in new languages sometimes, right? I played with Golang several months ago, and am playing with Kotlin now. Rust next, of course; I'm just waiting for an opportunity to use it, and be paid for it, to arise =)

PS4: It's actually quite fun to learn something as terrible as Kotlin with ChatGPT. Far better than with the Google + StackOverflow combo. Might be the novelty effect, but I think it's actually real, and, for once, am looking forward to doing more crap like this in the months to come.
I have come to the point when I universally admit the value of GPT models for myself.

(For the record: I have seen their value as a business, as well as their value for humankind, from day one. But I am well aware my personal needs are often far from standard, so the very statement I am making is, in a way, quite a bit stronger!)

For queries that are either evergreen ("Locke's views", "Prove asymptotic complexity of comparison sort") or have to do with a well-established domain ("Traverse a JSON object recursively in Kotlin", "Dockerfile to switch away from the root user"), their answers are very much spot on.

I have yet to form an opinion on whether GPT models are useful at replacing humans. But as a savvy colleague who's well educated and always available for quick advice, it's irreplaceable for me at the moment. Each such query literally saves me at least minutes, and, more likely, dozens of minutes of collating data from various online sources.

So, yes, GPT models can and should kill search engines. In fact, they can do it "as is" already, as long as they are kept up to date.

The very "bias towards too popular, and often untrue content" is pure BS. That's how the world works. Whether or not GPT models can spark a positive against-fake-news and pro-true statements revolution, and/or whether being able to correctly cite sources and to properly rank them by their own biases and track records -- that's besides the point to me.

As a human being interacting with such a system, I will always use my own judgement regardless; and my own judgement will be biased all the time, which is normal and healthy and human nature overall. The right question to postulate is not whether GPT models can be trained in a neutral way; the right question is whether they can be trained to be less biased than what modern-day search engines tell us -- not in a vacuum, but as perceived by a conscious human being who is exercising their judgement while interpreting the results. And, evidently, the bar of beating, say, Google, at being unbiased is quite low, and I'd comfortably bet on GPT models clearing this bar already.

Last but not least: While I am not a huge fan of regulation, it does worry me that we may end up living in a world where such technology is only available to a select few, and/or is severely restricted or censored for the majority of the human population. The modern trends of open-sourcing more and more models are, of course, amazing, and I do hope this is a false alarm.

Still, in the back of my mind, I am terrified by the prospect that some twenty years from now, building one's own LLM at home will be a matter of single-digit thousands of dollars and single-digit days or weeks of investment, and yet in many parts of the world doing so might be against the law. Much like, for instance, it was and remains against the law in certain places to possess certain books. Or to keep an ML model trained to remove clothing.

As a classical liberal, I see absolutely no harm that can possibly be dealt to a member of a healthy society if they get access to a certain book; the same goes for them having access to a certain ML model. Yet, the practice of banning books is real. Well-trained GPT models may well (and absolutely will!) make banning books per se ineffective. Thus the urge of the strong & powerful of our species to try to control these models is, at least, not groundless.

Of course, Generative Pre-trained Transformer models are just the tip of the iceberg. What I am saying above applies to future kinds of models as well.

Also, note that I am saying nothing of consciousness, or of the "motivation" of those models. That's a totally separate topic. All I am saying is that every single human being having access to a somewhat accurately restored author or historical figure, so that instead of speculating about what they may have meant we could just ask them to outline an argument on any topic -- that is ultimately a Good Thing. And the "danger" that may stem from people realizing what real people used to think about real issues is only a "danger" in a dysfunctional society, whereas the society I would like to help push forward is pretty much the opposite of a dysfunctional one.

PS: "Summarize as a table" is awesome, and "Summarize as a JSON" is even more useful for data collation purposes if later on this dataset is to be used programmatically. You're welcome.
For decades now, I have been talking about the pain of not being able to perform atomic "transactions" on the contents of the file system.

Something as trivial as "as long as these files exist, and as long as this pattern matches exactly this number of files, perform this atomic rename of these files" would be awesome to have.

If this were supported by RAIDs, database design would become far more straightforward. (Okay, we'd need to standardize what fsync actually does, but that's another can of worms.)

Now, in the world of S3, the ask for transactionality is even more valid. A lot of SysDesign-s would become very clean if the cloud filesystem could be used in a transactional way, albeit with a very limited scope of what these transactions could offer.
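To make the ask concrete, the contract of such a wrapper could look roughly like this. This is a sketch of a hypothetical interface, not an existing library, and the implementation (presumably some versioned manifest object serving as the commit point) is where all the hard work would hide:

#include <string>
#include <vector>

// All-or-nothing batch over an object store: a set of preconditions plus
// a set of mutations; either every precondition holds and every mutation
// becomes visible atomically, or nothing is visible to readers at all.
struct Precondition {
  std::string key;
  std::string expected_etag;  // "This object exists and has not changed."
};

struct Mutation {
  enum class Op { kPut, kRename, kDelete } op;
  std::string key;
  std::string new_key_or_body;  // Target key for kRename, body for kPut.
};

class TransactionalObjectStore {
 public:
  virtual ~TransactionalObjectStore() = default;
  // Returns true iff the transaction committed.
  virtual bool Commit(const std::vector<Precondition>& preconditions,
                      const std::vector<Mutation>& mutations) = 0;
};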

Maybe it's a worthy standalone project? Build a wrapper around S3 to enable simple "transactions" and to guarantee their atomicity, isolation, and linearizability? 🤔
Observation of the day: Google Drive storage is the same price as S3.

I was under the impression S3 would be ~10x to ~100x cheaper. But they both cost about $2.50 per 100GB per month.