Shad Intensive. Evaluation.
It seems I’ll be eating this mammoth in small pieces for ages. So for now, here’s a link to a nice post on agent evaluation.
Telegram
Поляков считает: AI, код и кейсы
How to test AI agents: notes in the margins of the ShAD lectures
Continuing ShAD’s Agents Week. The fourth lecture is about how to check the quality of agents: the topic everyone keeps postponing, and the one that does the most damage to the reputation of AI in production, or of the integrator.
📋 What the lecture recommends
…
Are we doing meme shitposting today?
Anonymous Poll
81% - Memes! Yeah!
6% - I'll unsubscribe immediately
0% - I've already unsubscribed
25% - TGIF!!!
TGIF. Meme
First of all, let's check the results of the poll. One subscriber promised to unsubscribe, and the other 9 voted for memes and shitposting. A naive approach would be to think that if I publish the meme, I'll lose 1 subscriber. But you can solve the proportion: one unsubscriber out of 10 voters, scaled to all 276 subscribers, gives -27.6. The result is stunning, so let's see. I expect to drop to 248.4 subscribers.
The topic of today's Friday meme is The Soup.
Before anything else: I stole these memes from The Wizard. It is a magnificent channel, and I ask you to promote it as widely as you can.
And now to the chase.
Dad's soup
My dad cooks absolutely hellish food.
It’s a sort of averaged recipe, because there are lots of variations.
He takes soup — but reheating it is not my dad’s style.
He pours the soup into a frying pan and starts frying it.
He adds a huge amount of onion, garlic, tomato paste, flour for thickness, and mayonnaise on top.
The whole thing fries until smoke starts coming out.
Then he takes it off the heat, lets it cool on the balcony, brings it back, pours on even more mayonnaise, and starts eating.
He eats straight from the pan, scraping it with a spoon, muttering under his breath, “oh, damn.”
Sweat beads on his forehead.
Sometimes he politely offers me some, but I refuse.
Needless to say, the aftermath is monstrous.
The stench is so intense that the wallpaper peels off the walls.
P.S. To be honest, all this subscribe/unsubscribe stuff is starting to get to me. I’d really appreciate some support — even a couple of emojis wouldn’t hurt.
P.P.S. Tomorrow I’ll try to pull myself together and write something clever. Probably continue the “Titanic” line.
Titanic: family bonds
Nowadays, it is hard to imagine an ML person without a Jupyter notebook. So let’s think a little outside the box and consider other options.
Quite recently, I discovered an interesting option for CSV table analysis. It consists of two instruments: DBeaver and DuckDB. The former is a Swiss army knife for database access, and the latter is an in-process analytical database (quite powerful, or so I was told) that can also open CSV files directly.
So, you can create a new DuckDB connection in DBeaver (ducks and beavers, yeah...), select the titanic.csv file you got from Kaggle, and start running queries.
SibSp is, so to speak, a horizontal factor in the family tree: the number of siblings and spouses aboard. Parch is vertical in these coordinates: parents and children.
The first two queries show us the strength of these factors separately. The third uses the nested query technique to produce a derived factor, family size; a sketch of it is below. One can see that the raw factors are useful, but the derived factor gives a stronger depletion/enrichment effect. This time, we were lucky enough to engineer a new feature.
I would say that family sizes 2, 3, and 4 vote for survival; the others vote against it.
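The queries themselves aren't reproduced here, so below is roughly what the third one could look like, rewritten for the duckdb Python package instead of DBeaver. A minimal sketch, assuming titanic.csv from Kaggle sits in the working directory with its standard column names (SibSp, Parch, Survived):

```python
import duckdb

# Derive family size in a nested query, then check how the survival
# rate is depleted or enriched within each group.
print(duckdb.sql("""
    SELECT family, COUNT(*) AS n, AVG(Survived) AS survival_rate
    FROM (
        SELECT SibSp + Parch + 1 AS family, Survived
        FROM 'titanic.csv'
    ) AS t
    GROUP BY family
    ORDER BY family
"""))
```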
Let’s rock ROC
All this time, it bothered me that the ROC in the post contains only three points. I’ve been thinking about this situation for a while, and I think I’ve found a small but genuinely new idea. I’m going to explain it slowly in a series of upcoming posts. But first, I want to check whether this thought is actually trivial. Please vote in the poll below, and feel free to leave comments.
Quick recap:
We have a model that gives scores of -1, 0, and 1. When we plot the ROC curve, we get a graph with 4 points and an AUROC of 56%.
The question is: can we calculate the variance of AUROC in this case?
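To make the recap concrete, here is a toy version in Python. The labels and scores below are made up for illustration (not the data behind the 56% figure); what matters is that a score taking only the values -1, 0, and 1 collapses the ROC curve to 4 points, ties and all.

```python
import numpy as np
from sklearn.metrics import roc_curve, roc_auc_score

# Made-up labels and a 3-valued score; not the original post's data.
y_true = np.array([1, 0, 1, 0, 1, 0, 0, 1])
y_score = np.array([1, 0, 1, -1, 0, 0, 1, -1])

fpr, tpr, thresholds = roc_curve(y_true, y_score)
print(list(zip(fpr, tpr)))             # exactly 4 (FPR, TPR) points
print(roc_auc_score(y_true, y_score))  # a single AUROC number... with what variance?
```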
Telegram
Algorithms. Physics. Mathematics. Machine Learning.
Titanic. Age.
We started talking about the Titanic dataset. Let's discuss the Age factor. We saw that Sex, Pclass, and Embarked are strong features, and to study them we used the fact that they are categorical with low cardinality. Age is different. It's…
Can we calculate the variance of AUROC in this case?
Anonymous Poll
31% - Of course, it's 0. Obviously.
54% - AUROC is a metric; it has no variance
38% - Your question is both stupid and offensive
31% - Ouch. I think I have an idea. (And will share in comments)
31% - Shut up and post more memes.
Plan: from school-level math to a March 2026 arXiv preprint
The stakes turned out to be higher than I expected. Our quiet little discussion of the Titanic dataset may have suddenly wandered into publication-level territory. You can check the comments under the poll.
So now I have a difficult choice to make: either post more TFWR and beginner-level programming, or start a short series about the picture above. Let’s postpone TFWR for about a week and focus on ROC with ties.
LLMs make this kind of thing harder to understand. They help you solve the immediate problem quickly, but they often blur the core idea. So let’s try to chase the idea itself.
The plan:
👉 Digitization of objects
👉 What a Score is in ML
👉 ML Models
👉 EML as ML model
👉 ROC and AUROC
👉 Stochastic wandering
👉 Binomial distribution
👉 Variance of AUROC for low-cardinality model outputs
Are you ready for Friday trash?
Anonymous Poll
64% - Yeah! Friday trash!
9% - No! Give me my stochastic wandering math.
9% - I want to read more about agents.
0% - Please give me a small beginner-level Python program with an explanation
27% - What about Gradient Boosted Decision Trees with Extrapolation?
18% - Whatever...
Friday. Trash. Rescue in Midair.
It’s an ancient story. I probably read it sometime around 1993, and it’s probably apocryphal. It doesn’t matter. The storytelling is magnificent.
A demonstration performance by parachutists from the local flying club at some major aviation festival.
The team mechanic, Petrovich, having seen off the last group of athletes in bright jumpsuits aboard the old An-2, decided that he could finally relax.
He went into a shabby little 2-by-2 shed at the edge of the airfield, where all sorts of useless junk was stored, carefully shut the door, pulled a bottle of port wine out of its hiding place, wiped his hands on his bright yellow jumpsuit (the same kind the rest of the team wore), and began slicing tomatoes.
The final act of the demonstration program was a stunt called “Rescue in Midair.”
The idea was as follows: a dummy in a jumpsuit, pretending to be either a passenger who had fallen out along the way or a parachutist with a failed parachute, was thrown out of the plane. Then an athlete jumped after it, caught up with the dummy in midair, embraced it, opened his own parachute, and both landed to the audience’s thunderous applause.
The spectators gathered on the airfield watched the aviators’ tricks with delight, washing down the spectacle with kebabs, soft drinks, and stronger beverages.
At last came the final act. A man separated from the airplane and flew toward the ground; another jumped out after him and rushed in pursuit. The crowd froze. The second parachutist skillfully caught up with the first and grabbed him by the hand. At that moment, whether because of a gust of wind or for some other reason, they were torn apart. There was no time left. The second one, waving goodbye to his comrade, opened his parachute.
The crowd, not suspecting any trick, went rigid with horror. The first body hurtled toward the ground and at enormous speed smashed into the decrepit shed at the edge of the airfield. Clouds of dust, pieces of slate roofing, and rotten boards shot up from the site of the shed. An ambulance, siren blaring, raced toward the scene of the tragedy, not really expecting to be able to help anyone anymore. People ran after it. In front of the huge pile of boards, everyone stopped in uncertainty.
Suddenly the boards began to move, and out from underneath them crawled Petrovich in his bright yellow flight overalls, drenched in port wine and smeared with tomatoes. Wildly looking around and spewing curses, he shook his fist at the departing airplane:
“Some rescuers you are! If you don’t know how to catch, don’t make fools of people! I’m not working for you bastards anymore!”
They say that after these words, the ambulance doctor fainted.
Recorded almost from Petrovich’s own words.
Integration: Codex and JupyterLab
I tried a few ways of working with an LLM inside JupyterLab: magic commands, editing files through Codex, even using Cursor. All of them felt like different shades of “this is not what I actually want.”
Finally, I found something that works. More or less.
In the console, I can just tell Codex in plain natural language what I want, and it gives me the result. No notebook reloading, no extra manual juggling.
I can’t say this solution is ideal. To be honest, it’s still not completely stable. But so far, it’s the best option I’ve found.
If you want to try it, just clone the GitHub repo and follow the instructions. At the very least, it may save you some pain in the neck while figuring out Python library compatibility.
Feel free to share this post with friends who want to use an agentic assistant in JupyterLab.
Digitization of objects
content|next »
The simplest things are the hardest.
In this post, I want to start discussing the stochasticity of ROC. This is a very simple but very important post for me, because it gives me instruments to discuss complex ideas using very simple language.
We use machine learning to make decisions, and these decisions deal with real-world objects. A computer cannot operate on the objects directly, but it can operate on their descriptions: features.
In our Titanic problem, we already know these features: sex, age, pclass, and so on.
A typical step is to organize these features into a table, X, where each row is an object (a person) and each column corresponds to one feature.
Y is the target we want to predict. In our case, it is survival. Let me be a little inconsistent here and draw happy and unhappy faces. I am not punk enough to draw deceased passengers with crosses instead of eyes.
To the chase. Instead of saying “let’s split the object table,” I want to use statements like this:
“We have a crowd with an average survival rate of 0.38. If we split it into male and female crowds, their survival rates are 0.19 and 0.74 respectively.”
Now, instead of splitting and sorting a table, we can tell people to form different crowds or line up in a rank. I think this is much easier to understand.
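If you want to check the crowd numbers yourself, a few lines of pandas will do. A sketch assuming titanic.csv from Kaggle and its standard column names:

```python
import pandas as pd

df = pd.read_csv("titanic.csv")
print(df["Survived"].mean())                 # the whole crowd: ~0.38
print(df.groupby("Sex")["Survived"].mean())  # female ~0.74, male ~0.19
```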
content|next »
The simplest things are the hardest.
In this post, I want to start discussing the stochasticity of ROC. This is a very simple but very important post for me, because it gives me instruments to discuss complex ideas using very simple language.
We use machine learning to make decisions. And these decisions deal with real-world objects. A computer cannot operate with the objects directly, but it can operate with their descriptions. Features.
In our Titanic problem, we already know these features: sex, age, pclass, and so on.
A typical step is to organize these features into a table (X), where each row is an object (a person), and each column corresponds to one feature.
(Y) is the target we want to predict. In our case, it is survival. Let me be a little inconsistent here and draw happy and unhappy faces. I am not punk enough to draw deceased passengers with crosses instead of eyes.
To the chase. Instead of saying “let’s split the object table,” I want to use statements like this:
“We have a crowd with an average survival rate of 0.38. If we split it into male and female crowds, their survival rates are 0.19 and 0.74 respectively.”
Now, instead of splitting and sorting a table, we can tell people to form different crowds or line up into a rank. I think this is much easier to understand.
The Score
« prev|content|next »
👹 I totally don’t like the statement that a model gives you the probability of an object being in some class. We want it. We are trying to achieve it. But you have to put TONS of effort into getting it.
📋 Also. For Titanic, we know that women were more likely to survive than men, first-class passengers more likely than third-class ones, and so on. Let’s just sum up good omens and subtract the bad ones. For five features, we will get a number in the range from -5 to 5. Is it useful? Yes. Is it a probability? No.
🎰🎰🎰 It is the Score. 🎰🎰🎰
For me, the Score is a very important concept. In a binary classification task, it is some quantity that is large if we expect the object to belong to the positive class and small if we expect it to belong to the negative class.
Now let's look at the whole picture. We have an object. We digitize it and send its features to a model. Then we turn all the levers and knobs of the model so that it gives a low score to objects from the negative class and a high score to objects from the positive class.
👻 And here we are. We trained our model. Is that it? No.
We still have to set a threshold for decision-making and measure the quality of the model.
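As a toy illustration of the omen-counting score, here is a sketch in Python. The first two omens follow the post (sex and class); the other three are my own guesses, since the post doesn't fix a specific list. Each omen votes +1 or -1, so the score lands in the range [-5, 5].

```python
import pandas as pd

df = pd.read_csv("titanic.csv")

def omen_score(row):
    """Sum up the good omens, subtract the bad ones."""
    s = 0
    s += 1 if row["Sex"] == "female" else -1                     # from the post
    s += 1 if row["Pclass"] == 1 else -1                         # from the post
    s += 1 if row["Embarked"] == "C" else -1                     # my guess
    s += 1 if 2 <= row["SibSp"] + row["Parch"] + 1 <= 4 else -1  # family of 2-4
    s += 1 if row["Age"] < 16 else -1                            # my guess; NaN age counts as bad
    return s

df["score"] = df.apply(omen_score, axis=1)
print(df.groupby("score")["Survived"].agg(["count", "mean"]))
```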
The model, a model...
« prev|content|next »
Totally forgot to include a few words about models in the previous post. So, a separate post.
The picture is quite traditional. Formally speaking, a model is a function f = f(x, α). In our terms, x is the feature values of an object, and α is the parameters: all the knobs and handles we can adjust in order to fit the dataset.
Of course, you know the linear model y = kx + c. It has parameters k and c. You have probably also heard about decision trees. There, the parameters are thresholds in the nodes and values in the leaves. In neural networks, parameters are the numbers in the weight matrices.
All these complex things are common knowledge by now. For me personally, it was quite a fresh idea that each feature is itself a model. So you can estimate its quality individually, using ROC AUC, MSE, or anything else you like. That immediately gives you an idea of how important a feature may be, without training complex models and without SHAP analysis.
From this point of view, feature engineering like family = sibsp + parch + 1 can be considered a kind of stacking of models.
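Here is what "each feature is a model" looks like in code: feed the raw column into roc_auc_score as if it were the model's score. A minimal sketch, again assuming the Kaggle titanic.csv; an AUC below 0.5 just means the feature ranks in the opposite direction.

```python
import pandas as pd
from sklearn.metrics import roc_auc_score

df = pd.read_csv("titanic.csv")
for col in ["Pclass", "Age", "SibSp", "Parch", "Fare"]:
    sub = df[[col, "Survived"]].dropna()   # Age has missing values
    auc = roc_auc_score(sub["Survived"], sub[col])
    print(f"{col}: AUC = {auc:.3f}")
```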
And finally. [0.23, -1.18, 3.25] looks like a normal set of model parameters. But what if I tell you that ((1 x) ((1 1) x)) is a model description as well? Let’s discuss it next time.