Crazy thought, but I can't shake it off.
Imagine "Independence Day". Some Jeff Goldblum, with the help of some Will Smith, needs to find a way to reverse-engineer, hack, and then penetrate some alien spacecraft, some older version of which crashed a long time ago in what we now know as some Area 51.
What tools will the software people assigned to this project use?
Linux? Most likely.
Python? Possibly.
Hadoop? Maybe.
Rust / C++ for high-volume data crunching and SQL / Spark for real-time querying? Sounds legit.
But will there be Kafka? I seriously doubt it.
Because if there's a large volume of data that needs to be crunched in real time, we would need a custom-built solution. If the data needs to be crunched at near real time -- even more so. If no real-time updates are needed, just save snapshot files every five seconds, aggregated into every minute, then hour, then day.
So, let me set this straight. Isn't it true that Kafka is solving a problem that is not fundamental to software architecture? In other words, isn't it true that had we had properly designed databases and event buses before Kafka emerged, there would just be no market for Kafka to enter?
And, if so, isn't Kafka today a liability for future generations? As in, could it be that we are already past the point where it makes more sense to try to work around Kafka as opposed to trying to fit into the system if our goal is to build long-lived software?
Just curious.
Imagine "Independence Day". Some Jeff Goldblum, with the help of some Will Smith, needs to find a way to reverse-engineer, hack, and then penetrate some alien spacecraft, some older version of which crashed a long time ago in what we now know as some Area 51.
What tools will the software people assigned to this project use?
Linux? Most likely.
Python? Possibly.
Hadoop? Maybe.
Rust / C++ for high-volume data crunching and SQL / Spark for real-time querying? Sounds legit.
But will there be Kafka? I seriously doubt it.
Because if there's a large volume of data that needs to be crunched in real time, we would need a custom-built solution. If the data needs to be crunched at near real time -- even more so. If no real-time updates are needed, just save snapshot files every five seconds, aggregated into every minute, then hour, then day.
So, let me set this straight. Isn't it true that Kafka is solving a problem that is not fundamental to software architecture? In other words, isn't it true that had we had properly designed databases and event buses before Kafka emerged, there would just be no market for Kafka to enter?
And, if so, isn't Kafka today a liability for future generations? As in, could it be that we are already past the point where it makes more sense to try to work around Kafka as opposed to trying to fit into the system if our goal is to build long-lived software?
Just curious.
Is Event Sourcing about historic data or historical data?
ChatGPT
Event sourcing is a design pattern used in software architecture that deals with persisting data as a sequence of events. Instead of storing the current state of the data in a database, it stores the sequence of actions or events that led to that state.
This might seem like a subtle difference, but it has profound implications. Storing the current state only allows you to answer the question "what is the current state?" However, by storing events, you can also answer questions like "how did we get to this state?" and "what is the history of changes to this state?" This ability to navigate through time and understand the history and progression of changes can be very powerful.
So, to answer your question, event sourcing is about historic data in the sense that it records the series of events or changes over time. It doesn't necessarily deal with "historical data" in the sense of historical facts or information, unless the system it's being used on happens to deal with historical information. In general, the term "historic data" refers to a record of past events, which fits the idea of event sourcing quite well.
Looks like GPT-4 is wrong here though. Event logs, and event sourcing in general, are about historical data, much like past stock quotes are.
Historic moments are the special ones, somewhat synonymous with epic, majestic data. If it's the stream that matters, where the order is crucial to maintain, and the events are important, at removing any of them breaks the whole thing.
Interestingly, GPT-4 is making an even more strange statement:
Dima:
Would the Bitcoin ledger be the one of historic data or of historical data?
GPT-4:
The Bitcoin ledger, also known as the blockchain, is essentially a record of historic data ...
"Historical data," in a broader context, often refers to information about the past, often of significant or impactful events, which doesn't quite fit how we think about the Bitcoin ledger, despite the significant impact Bitcoin has had. That said, there might be cases where "historical" could be used to refer to the Bitcoin ledger, especially if looking at it from the perspective of its role in the history of digital currency or blockchain technology.
Personally, I've settled this question for myself by looking at the number of Google hits for {Bitcoin, $TICKER, Dow Jones} "historic data" vs. "historical data". Historical it is for event logs.
Not sure who needs to see this, but as a
If you need modules, one-module-per-line definitions of them in
I used
There are also Github actions in the repo to confirm that the stuff does indeed do what it should. And a few examples are there as well, including typed and untyped JSON operations, as well as an HTTP server.
PS: No
PS2: This
PS3: No, I'm not learning Kotlin. It's a compromise, and the lesser of several evils. Besides, one has to exercise their coding muscle in new languages sometimes, right? I've played with Golang several months ago, and am playing with Kotlin now. Rust next, of course, just looking forward for an opportunity to use it and be paid for it to arise =)
PS4: It's actually quite fun to learn something as terrible as Kotlin with ChatGPT. Far better than with the Google + StackOverflow combo. Might be the novelty effect, but I think it's actually real, and, for once, am looking forward to doing more crap like this in the months to come.
vim
user — RIP Bram! — I very much dislike bulky IDEs and heavy build tools. So, here's a small Makefile repo that can run Kotlin code with make
.If you need modules, one-module-per-line definitions of them in
deps.txt
do the job. A make install
target is available to do what npm i
or pip3 install -r requirements.txt
do. (Hmm, maybe I should rename this target into some make modules
or make deps
?)I used
mvn
behind the scenes, just because it was cleaner to get to work in this "ad-hoc" manner of downloading packages one by one, into a specific directory.There are also Github actions in the repo to confirm that the stuff does indeed do what it should. And a few examples are there as well, including typed and untyped JSON operations, as well as an HTTP server.
PS: No
gradle
is not minimalistic to my taste. The amount of boilerplate code to import one dependency is just absurd. It's probably the largest crap-to-value ratio I've seen in a while.PS2: This
kotlin-stdlib.jar
file not found by default just sucks. How come there's no kotlinc --print-kotlin-stdlib-jar-path
is beyond me. I just ended up searching for it in a few places as part of the make
process. If you are on a Mac, pointers appreciated.PS3: No, I'm not learning Kotlin. It's a compromise, and the lesser of several evils. Besides, one has to exercise their coding muscle in new languages sometimes, right? I've played with Golang several months ago, and am playing with Kotlin now. Rust next, of course, just looking forward for an opportunity to use it and be paid for it to arise =)
PS4: It's actually quite fun to learn something as terrible as Kotlin with ChatGPT. Far better than with the Google + StackOverflow combo. Might be the novelty effect, but I think it's actually real, and, for once, am looking forward to doing more crap like this in the months to come.
GitHub
GitHub - dkorolev/kt-makefile: Simplified Kotlin builds for the "no-IDE" setup.
Simplified Kotlin builds for the "no-IDE" setup. Contribute to dkorolev/kt-makefile development by creating an account on GitHub.
I have come to the point when I universally admit the value of GPT models for myself.
(For the record: I have seen their value as a business, as well as their value for the humankind, from day one. But I am well aware my personal needs are often far from standard, so the very statement I am making is, in a way, quite stronger!)
For queries that are either evergreen ("Locke's views", "Prove asymptotic complexity of comparison sort") or have to do with well-established domain ("Traverse a JSON object recursively in Kotlin", "Dockerfile to switch away from the root user"), its answers are very much spot on.
I have yet to form an opinion on whether GPT models are useful at replacing humans. But as a savvy colleague who's well educated and always available for a quick advice, it's irreplaceable for me at the moment. For each of such queries, it literally saves me at least minutes, and, more likely, dozens of minutes of collating data from various online sources.
So, yes, GPT models can and should kill search engines. In fact, they can do it "as is" already, as long as they are kept up to date.
The very "bias towards too popular, and often untrue content" is pure BS. That's how the world works. Whether or not GPT models can spark a positive against-fake-news and pro-true statements revolution, and/or whether being able to correctly cite sources and to properly rank them by their own biases and track records -- that's besides the point to me.
As a human being interacting with such a system, I will always use my own judgement regardless; and my own judgement will be biased all the time, that's normal and healthy and human nature overall. The right question to postulate is not whether GPT models can be trained in a neutral way; the right question is whether they can be trained to be less biased than what modern-day search engines tell us, and not in a vacuum but to a conscious human being who is exercising their judgement while interpreting the results. And, evidently, the bar of beating, say, Google, in being unbiased is quite low, and I'd comfortably bet on GPT models beating this bar already.
Last but not least: While I am not a huge fan of regulation, it does worry me that we may live in the world where such technology is only available for select few, and/or is severely restricted or censored for the majority of the human population. The modern trends of open-sourcing more and more models are, of course, amazing, and I do hope this is a false alarm.
Still, in the back of my mind, I am terrified by the prospect that in some twenty years from now building one's own LLM at home would be a matter of a single-thousands-dollars and single-days or single-weeks amount of investment, and yet in many parts of the world doing so might be against the law. Much like, for instance, it was and remains against the law in certain places to possess certain books. Or to keep an ML model trained to remove clothing.
As a classical liberal, I see absolutely no harm that can possibly be dealt to a member of a healthy society if they get access to a certain book; same goes to them having access to a certain ML model. Yet, the practice of banning books is real. Well-trained GPT models may well (and absolutely will!) make banning books per se ineffective. Thus the fear of the strong & powerful of our species to try to control these models is, at least, not groundless.
Of course, Generative Pre-training Transformer models are just the tip of the iceberg. What I am saying above applies to future kinds of models as well.
Also, note that I am saying nothing of consciousness, as well as of the "motivation" of those models. That's a totally separate topic. All I am saying is that every single human being having access to a somewhat accurately-restored author or historical figure, so that instead of speculating what they may have meant we may well just ask it to outline an argument on any topic -- that is ultimately a Good Thing. And the "danger" that may stem from people realizing what real people used to think about real issue is only a
(For the record: I have seen their value as a business, as well as their value for the humankind, from day one. But I am well aware my personal needs are often far from standard, so the very statement I am making is, in a way, quite stronger!)
For queries that are either evergreen ("Locke's views", "Prove asymptotic complexity of comparison sort") or have to do with well-established domain ("Traverse a JSON object recursively in Kotlin", "Dockerfile to switch away from the root user"), its answers are very much spot on.
I have yet to form an opinion on whether GPT models are useful at replacing humans. But as a savvy colleague who's well educated and always available for a quick advice, it's irreplaceable for me at the moment. For each of such queries, it literally saves me at least minutes, and, more likely, dozens of minutes of collating data from various online sources.
So, yes, GPT models can and should kill search engines. In fact, they can do it "as is" already, as long as they are kept up to date.
The very "bias towards too popular, and often untrue content" is pure BS. That's how the world works. Whether or not GPT models can spark a positive against-fake-news and pro-true statements revolution, and/or whether being able to correctly cite sources and to properly rank them by their own biases and track records -- that's besides the point to me.
As a human being interacting with such a system, I will always use my own judgement regardless; and my own judgement will be biased all the time, that's normal and healthy and human nature overall. The right question to postulate is not whether GPT models can be trained in a neutral way; the right question is whether they can be trained to be less biased than what modern-day search engines tell us, and not in a vacuum but to a conscious human being who is exercising their judgement while interpreting the results. And, evidently, the bar of beating, say, Google, in being unbiased is quite low, and I'd comfortably bet on GPT models beating this bar already.
Last but not least: While I am not a huge fan of regulation, it does worry me that we may live in the world where such technology is only available for select few, and/or is severely restricted or censored for the majority of the human population. The modern trends of open-sourcing more and more models are, of course, amazing, and I do hope this is a false alarm.
Still, in the back of my mind, I am terrified by the prospect that in some twenty years from now building one's own LLM at home would be a matter of a single-thousands-dollars and single-days or single-weeks amount of investment, and yet in many parts of the world doing so might be against the law. Much like, for instance, it was and remains against the law in certain places to possess certain books. Or to keep an ML model trained to remove clothing.
As a classical liberal, I see absolutely no harm that can possibly be dealt to a member of a healthy society if they get access to a certain book; same goes to them having access to a certain ML model. Yet, the practice of banning books is real. Well-trained GPT models may well (and absolutely will!) make banning books per se ineffective. Thus the fear of the strong & powerful of our species to try to control these models is, at least, not groundless.
Of course, Generative Pre-training Transformer models are just the tip of the iceberg. What I am saying above applies to future kinds of models as well.
Also, note that I am saying nothing of consciousness, as well as of the "motivation" of those models. That's a totally separate topic. All I am saying is that every single human being having access to a somewhat accurately-restored author or historical figure, so that instead of speculating what they may have meant we may well just ask it to outline an argument on any topic -- that is ultimately a Good Thing. And the "danger" that may stem from people realizing what real people used to think about real issue is only a
"danger" in a dysfunctional society, whereas the society I would like to help to push forward is pretty much the opposite of a dysfunctional one.
PS: "Summarize as a table" is awesome, and "Summarize as a JSON" is even more useful for data collation purposes if later on this dataset is to be used programmatically. You're welcome.
PS: "Summarize as a table" is awesome, and "Summarize as a JSON" is even more useful for data collation purposes if later on this dataset is to be used programmatically. You're welcome.
For decades now, I keep talking about the pain of not being able to perform atomic "transactions" with the contents of the file system.
Something as trivial as "as long as these files exist, and as long as this pattern matches exactly this number of files, perform this atomic rename of these files" would be awesome to have.
If this would be supported by RAIDs, database design would become far more straightforward. (Okay, we'll need to standardize what
Now, in the world of S3, the ask of transactionality is even more valid. A lot of SysDesign-s become very clean if the cloud filesystem could be used in a transactional way, albeit with a very limited scope of what these transactions could offer.
Maybe it's a worthy standalone project? Build a wrapper around S3 to enable simple "transactions" and to guarantee their atomicity, isolation, and linearazability? 🤔
Something as trivial as "as long as these files exist, and as long as this pattern matches exactly this number of files, perform this atomic rename of these files" would be awesome to have.
If this would be supported by RAIDs, database design would become far more straightforward. (Okay, we'll need to standardize what
fsync
actually does, but that's another can of worms.)Now, in the world of S3, the ask of transactionality is even more valid. A lot of SysDesign-s become very clean if the cloud filesystem could be used in a transactional way, albeit with a very limited scope of what these transactions could offer.
Maybe it's a worthy standalone project? Build a wrapper around S3 to enable simple "transactions" and to guarantee their atomicity, isolation, and linearazability? 🤔
Forwarded from SysDesign Meetup
The next — Educational — episode of the meetup is up.
Never confuse functional, integration, regression, and end-to-end tests again!
Moreover, don't take this very question seriously. For if you want to set things straight, offense is the best defense!
Video: https://tinyurl.com/sdm-testing-video
Slides: https://tinyurl.com/sdm-testing-slides
Never confuse functional, integration, regression, and end-to-end tests again!
Moreover, don't take this very question seriously. For if you want to set things straight, offense is the best defense!
Video: https://tinyurl.com/sdm-testing-video
Slides: https://tinyurl.com/sdm-testing-slides
YouTube
Testing:: Educational SysDesign Meetup :: 2023-August-26
Today, in this educational episode, we talk about testing software.
Timekeys:
0:00 Intro
0:39 Goals
4:32 ChatGPT screenshots
5:33 A test that is failing is a good test!
6:56 Types of tests
8:34 Dimensions to the types of tests
13:24 Who is the stakeholder…
Timekeys:
0:00 Intro
0:39 Goals
4:32 ChatGPT screenshots
5:33 A test that is failing is a good test!
6:56 Types of tests
8:34 Dimensions to the types of tests
13:24 Who is the stakeholder…
Forwarded from SysDesign Meetup
https://www.youtube.com/watch?v=y8OnoxKotPQ -- a really well done pun, many thanks to the kind soul in our Slack who shared it!
YouTube
Microservices
it's because of the way our backend works
// merch: https://merch.krazam.tv
// https://www.instagram.com/krazam.tv
// https://twitter.com/krazamtv
// merch: https://merch.krazam.tv
// https://www.instagram.com/krazam.tv
// https://twitter.com/krazamtv
Observation of the day: Google Drive storage is same price as S3.
I was under the impression S3 would be ~10x to ~100x cheaper. But they both cost $2.5 USD per 100GB per month.
I was under the impression S3 would be ~10x to ~100x cheaper. But they both cost $2.5 USD per 100GB per month.
CERN has its own https://onlyfans.web.cern.ch/ page!
I've published a longread about Event Sourcing, in the "Gentle Introduction" way.
The introduction is gentle both from the engineering standpoint and from the grounds of convincing the business to begin making the first steps.
If you struggle with justifying it to the stakeholders that the company needs to upgrade its data stack, or if you already have a soft buy-in but are unsure how to best proceed -- I hope you'd find this post helpful and insightful.
The introduction is gentle both from the engineering standpoint and from the grounds of convincing the business to begin making the first steps.
If you struggle with justifying it to the stakeholders that the company needs to upgrade its data stack, or if you already have a soft buy-in but are unsure how to best proceed -- I hope you'd find this post helpful and insightful.
Dima Korolev
Event Sourcing V1
Gentle introduction to the Event Sourcing paradigm.
Tom from IGotAnOffer interviewed me, and here's a video of Dima designing Uber.
In hindsight, I should really work on mastering the art of explaining nontrivial things in a simple way.
In hindsight, I should really work on mastering the art of explaining nontrivial things in a simple way.
YouTube
Uber system design: mock interview walk-through with Dima Korolev (ex-Google)
GET 1-to-1 COACHING for system design interviews: https://igotanoffer.com/en/interview-coaching/type/system-design-interview
Today's mock interview: "Design Uber".
Our special guest candidate: Dima Korolev (ex-Google SWE) gives us a walk-through of how…
Today's mock interview: "Design Uber".
Our special guest candidate: Dima Korolev (ex-Google SWE) gives us a walk-through of how…
Me:
Github: "This commit does not belong to any branch on this repository, and may belong to a fork outside of the repository."
Me:
Github: Ah, right! Here you go.
It's from my fork, yes. But you can do better, Github!
https://github.com/$org/$repo/commit/$sha
Github: "This commit does not belong to any branch on this repository, and may belong to a fork outside of the repository."
Me:
https://github.com/$org/$repo/pull/$n/commits/$sha
Github: Ah, right! Here you go.
It's from my fork, yes. But you can do better, Github!
Is there a decent out-of-the-box Android emulator for Linux?
As a joke, some weekend last year, I built
(One thing I did use it for is to run Google Authenticator on my laptop, in a save-and-restore-able state, with a few OTP tokens. Clever-ish, but still a toy.)
Now, the more I think about responsive software and CPU-efficiency of my workflows, the more I like the idea of being able to open lightweight "mobile" apps on my laptop:
• Total isolation, as in those apps don't have access to my files.
• Pre-downloaded apps, which means far less traffic.
• Controlled network, so that I can use any pre-configured proxy or VPN.
I can totally see myself using such a solution for:
• Online banking apps (now I re-login into a dedicated user for them, which is slow)
• The browser for lightweight browsing (i.e. news or ChatGPT queries)
• Messengers (I use them in the browser already, and native mobile experience is both smoother and faster, for virtually every messenger I'm aware of)
• Stocks tracking apps (they are slick on phones/tablets, and suck in browsers)
• Travel (Airline and train ticket apps, as well as Expedia / Sixt / Marriott / IHG are far better as native apps)
• Social media (I'm deliberately logged out of everything but LinkedIn in my laptop browser)
• Podcasts and music (I'm happy with my phone for them, but just in case)
• The "apps" where one needs to type in my personal details and/or addresses (food delivery, co-workings memberships, ETAs)
• The OTP apps (because reaching out for my phone every single time is a pain)
So, is there a quick way to install something on Linux with one command, download a bunch of `.apk`-s and roll with it? Thanks in advance!
PS: Please don't suggest Chrome apps. One of the goals is reducing my ties with Google. In fact, I'll probably switch to Firefox next time I switch laptops, so this option is out.
As a joke, some weekend last year, I built
anbox
in a Docker container to ssh -X
into it. It was fun while it lasted, but it was also impractical.(One thing I did use it for is to run Google Authenticator on my laptop, in a save-and-restore-able state, with a few OTP tokens. Clever-ish, but still a toy.)
Now, the more I think about responsive software and CPU-efficiency of my workflows, the more I like the idea of being able to open lightweight "mobile" apps on my laptop:
• Total isolation, as in those apps don't have access to my files.
• Pre-downloaded apps, which means far less traffic.
• Controlled network, so that I can use any pre-configured proxy or VPN.
I can totally see myself using such a solution for:
• Online banking apps (now I re-login into a dedicated user for them, which is slow)
• The browser for lightweight browsing (i.e. news or ChatGPT queries)
• Messengers (I use them in the browser already, and native mobile experience is both smoother and faster, for virtually every messenger I'm aware of)
• Stocks tracking apps (they are slick on phones/tablets, and suck in browsers)
• Travel (Airline and train ticket apps, as well as Expedia / Sixt / Marriott / IHG are far better as native apps)
• Social media (I'm deliberately logged out of everything but LinkedIn in my laptop browser)
• Podcasts and music (I'm happy with my phone for them, but just in case)
• The "apps" where one needs to type in my personal details and/or addresses (food delivery, co-workings memberships, ETAs)
• The OTP apps (because reaching out for my phone every single time is a pain)
So, is there a quick way to install something on Linux with one command, download a bunch of `.apk`-s and roll with it? Thanks in advance!
PS: Please don't suggest Chrome apps. One of the goals is reducing my ties with Google. In fact, I'll probably switch to Firefox next time I switch laptops, so this option is out.
Played around with with CPU-level investigation for once.
Task: Detect uint64 overflow with no extra overhead.
Compared this:
To this:
Results:
• They generate identical code.
• Both on Ubuntu and on macOS.
Links:
• The Github repo + the test run.
• The LinkedIn thread.
This was fun. Now back to real work.
Task: Detect uint64 overflow with no extra overhead.
Compared this:
while (true) {
++iteration;
step = (step / 8) + step;
value += step;
if (value < step) {
break;
}
}
To this:
while (true) {
++iteration;
step = (step / 8) + step;
if (__builtin_add_overflow(value, step, &value)) {
break;
}
}
Results:
• They generate identical code.
• Both on Ubuntu and on macOS.
Links:
• The Github repo + the test run.
• The LinkedIn thread.
This was fun. Now back to real work.
GitHub
GitHub - dkorolev/clang_builtin_add_overflow
Contribute to dkorolev/clang_builtin_add_overflow development by creating an account on GitHub.
SQL injections were a real thing in the late 90s.
Expectation: By mid 2010s we've built all the right frameworks to make these kinds of silly mistakes impossible. And have happily moved on since.
Reality: I'm sitting through a Github Copilot workshop, and the example they use to illustrate its power is pointing out that a certain piece of code is unsafe with respect to a possible SQL injection. And this Copilot piece of AI is suggesting a fix to that code.
Humankind is presenting the value of a multi-dozen-billion-dollars business by using an example of what should absolutely not exist as of 15+ years by now.
This literally feels like being explained that a new 2024 Toyota now has the mechanics to prevent you from switching gears and damaging your gearbox if the clutch pedal is not fully pressed down. Me facepalm.
Expectation: By mid 2010s we've built all the right frameworks to make these kinds of silly mistakes impossible. And have happily moved on since.
Reality: I'm sitting through a Github Copilot workshop, and the example they use to illustrate its power is pointing out that a certain piece of code is unsafe with respect to a possible SQL injection. And this Copilot piece of AI is suggesting a fix to that code.
Humankind is presenting the value of a multi-dozen-billion-dollars business by using an example of what should absolutely not exist as of 15+ years by now.
This literally feels like being explained that a new 2024 Toyota now has the mechanics to prevent you from switching gears and damaging your gearbox if the clutch pedal is not fully pressed down. Me facepalm.
Forwarded from SysDesign Meetup
https://open.substack.com/pub/dimakorolev/p/the-oltp-language -- collected my thoughts on how can we accelerate the retirement of SQL as the meant to implement OLTP transactions.
Dima Korolev
The OLTP Language
Time to deprecate SQL as the means to implement business logic.