Titanic: family bonds
Nowadays, it is hard to imagine an ML person without a Jupyter notebook. So let’s think a little outside the box and consider other options.
Quite recently, I discovered an interesting option for CSV table analysis. It consists of two instruments: DBeaver and DuckDB. The former is a Swiss army knife for database access, and the latter is an embedded analytical database (quite powerful, or so I was told) that can also open CSV files directly.
So, you can create a new DuckDB connection in DBeaver (ducks and beavers, yeah...), select the titanic.csv file you got from Kaggle, and start running queries.
SibSp is, so to speak, a horizontal factor in the family tree: the number of siblings and spouses aboard. Parch is vertical in these coordinates: parents and children.
The first two queries show us the strength of these factors separately. The third uses a nested query to produce a derived factor, family size. One can see that these factors are useful on their own, but the derived factor gives a stronger depletion/enrichment effect. This time, we were lucky enough to engineer a new feature.
I would say that family sizes 2, 3, and 4 vote for survival; the others vote against it.
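The nested-query trick is easy to sketch. Here is a minimal, self-contained version using Python's sqlite3 with a handful of made-up rows (not the real Kaggle data); the same nested SELECT works in a DuckDB connection against titanic.csv:

```python
import sqlite3

# Toy, in-memory illustration of the nested-query trick:
# derive family = sibsp + parch + 1, then group by it.
# The rows below are invented; the same SQL runs in DuckDB
# against the real titanic.csv.
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE titanic (survived INT, sibsp INT, parch INT)")
con.executemany(
    "INSERT INTO titanic VALUES (?, ?, ?)",
    [(0, 0, 0), (0, 0, 0), (1, 1, 0), (1, 1, 1), (1, 0, 2), (0, 5, 2)],
)
rows = con.execute("""
    SELECT family, AVG(survived) AS survival_rate, COUNT(*) AS n
    FROM (SELECT survived, sibsp + parch + 1 AS family FROM titanic)
    GROUP BY family
    ORDER BY family
""").fetchall()
for family, rate, n in rows:
    print(family, rate, n)
```

On the real dataset, the same GROUP BY is what reveals the bump at family sizes 2, 3, and 4.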
Let’s rock ROC
All this time, it bothered me that the ROC in the post contains only three points. I’ve been thinking about this situation for a while, and I think I’ve found a small but genuinely new idea. I’m going to explain it slowly in a series of upcoming posts. But first, I want to check whether this thought is actually trivial. Please vote in the poll below, and feel free to leave comments.
Quick recap:
We have a model that gives scores of -1, 0, and 1. When we plot the ROC curve, we get a graph with 4 points and an AUROC of 56%.
The question is: can we calculate the dispersion of AUROC in this case?
Can we calculate the dispersion of AUROC in this case?
Anonymous Poll
29%
Of course, it's 0. Obviously.
50%
AUROC is a metric; it has no dispersion.
43%
Your question is both stupid and offensive.
29%
Ouch. I think I have an idea. (And will share in comments)
29%
Shut up and post more memes.
Plan: from school-level math to a March 2026 arXiv
The stakes turned out to be higher than I expected. Our quiet little discussion of the Titanic dataset may have suddenly wandered into publication-level territory. You can check the comments under the poll.
So now I have a difficult choice to make: either post more TFWR and beginner-level programming, or start a short series about the picture above. Let’s postpone TFWR for about a week and focus on ROC with ties.
LLMs make this kind of thing harder to understand. They help you solve the immediate problem quickly, but they often blur the core idea. So let’s try to chase the idea itself.
The plan:
👉 Digitization of objects
👉 What a Score is in ML
👉 ML Models
👉 EML as an ML model
👉 ROC and AUROC
👉 Stochastic wandering
👉 Gaussian binomial distribution
👉 Sources for this topic
Are you ready for some Friday trash?
Anonymous Poll
54%
Yeah! Friday trash!
15%
No! Give me my stochastic wandering math.
8%
I want to read more about agents.
0%
Please give me a small beginner-level Python program with an explanation
23%
What about Gradient Boosted Decision Trees with Extrapolation?
23%
Whatever...
Friday. Trash. Rescue in Midair.
It’s an ancient story. I probably read it sometime around 1993, and it’s probably apocryphal. It doesn’t matter. The storytelling is magnificent.
A demonstration performance by parachutists from the local flying club at some major aviation festival.
The team mechanic, Petrovich, having seen off the last group of athletes in bright jumpsuits aboard the old An-2, decided that he could finally relax.
He went into a shabby little 2-by-2 shed at the edge of the airfield, where all sorts of useless junk was stored, carefully shut the door, pulled a bottle of port wine out of its hiding place, wiped his hands on his bright yellow jumpsuit like the rest of the team wore, and began slicing tomatoes.
The final act of the demonstration program was a stunt called “Rescue in Midair.”
The idea was as follows: a dummy in a jumpsuit, pretending to be either a passenger who had fallen out along the way or a parachutist with a failed parachute, was thrown out of the plane. Then an athlete jumped after it, caught up with the dummy in midair, embraced it, opened his own parachute, and both landed to the audience’s thunderous applause.
The spectators gathered on the airfield watched the aviators’ tricks with delight, washing down the spectacle with kebabs, soft drinks, and stronger beverages.
At last came the final act. A man separated from the airplane and flew toward the ground; another jumped out after him and rushed in pursuit. The crowd froze. The second parachutist skillfully caught up with the first and grabbed him by the hand. At that moment, whether because of a gust of wind or for some other reason, they were torn apart. There was no time left. The second one, waving goodbye to his comrade, opened his parachute.
The crowd, unaware that anything was wrong, went rigid with horror. The first body hurtled toward the ground and at enormous speed smashed into the decrepit shed at the edge of the airfield. Clouds of dust, pieces of slate roofing, and rotten boards shot up from the site of the shed. An ambulance, siren blaring, raced toward the scene of the tragedy, not really expecting to be able to help anyone anymore. People ran after it. In front of the huge pile of boards, everyone stopped in uncertainty.
Suddenly the boards began to move, and out from underneath them crawled Petrovich in his bright yellow flight overalls, drenched in port wine and smeared with tomatoes. Wildly looking around and spewing curses, he shook his fist at the departing airplane:
“Some rescuers you are! If you don’t know how to catch, don’t make fools of people! I’m not working for you bastards anymore!”
They say that after these words, the ambulance doctor fainted.
Recorded almost from Petrovich’s own words.
Integration: Codex and JupyterLab
I tried a few ways of working with an LLM inside JupyterLab: magic commands, editing files through Codex, even using Cursor. All of them felt like different shades of “this is not what I actually want.”
Finally, I found something that works. More or less.
In the console, I can just tell Codex in plain natural language what I want, and it gives me the result. No notebook reloading, no extra manual juggling.
I can’t say this solution is ideal. To be honest, it’s still not completely stable. But so far, it’s the best option I’ve found.
If you want to try it, just clone the GitHub repo and follow the instructions. At the very least, it may save you some pain in the neck while figuring out Python library compatibility.
Feel free to share this post with friends who want to use an agentic assistant in JupyterLab.
Digitization of objects
content|next »
The simplest things are the hardest.
In this post, I want to start discussing the stochasticity of ROC. This is a very simple but very important post for me, because it gives me instruments to discuss complex ideas using very simple language.
We use machine learning to make decisions. And these decisions deal with real-world objects. A computer cannot operate with the objects directly, but it can operate with their descriptions. Features.
In our Titanic problem, we already know these features: sex, age, pclass, and so on.
A typical step is to organize these features into a table (X), where each row is an object (a person), and each column corresponds to one feature.
(Y) is the target we want to predict. In our case, it is survival. Let me be a little inconsistent here and draw happy and unhappy faces. I am not punk enough to draw deceased passengers with crosses instead of eyes.
Let’s cut to the chase. Instead of saying “let’s split the object table,” I want to use statements like this:
“We have a crowd with an average survival rate of 0.38. If we split it into male and female crowds, their survival rates are 0.19 and 0.74 respectively.”
Now, instead of splitting and sorting a table, we can tell people to form different crowds or line up into a rank. I think this is much easier to understand.
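The crowd language maps directly to code. A minimal pure-Python sketch with a hand-made toy crowd (the six people below are invented for illustration; on the real Kaggle titanic.csv the rates come out near 0.38 overall, 0.19 for men, and 0.74 for women):

```python
# Objects as dicts of features plus the target; "crowds" as plain lists.
crowd = [
    {"sex": "male", "survived": 0},
    {"sex": "male", "survived": 0},
    {"sex": "male", "survived": 1},
    {"sex": "female", "survived": 1},
    {"sex": "female", "survived": 1},
    {"sex": "female", "survived": 0},
]

def survival_rate(people):
    # Average of the target over a crowd.
    return sum(p["survived"] for p in people) / len(people)

whole = survival_rate(crowd)
men = survival_rate([p for p in crowd if p["sex"] == "male"])
women = survival_rate([p for p in crowd if p["sex"] == "female"])
print(whole, men, women)
```

Splitting a table and forming two crowds are the same operation; the second one is just easier to say out loud.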
The Score
« prev|content|next »
👹 I totally don’t like the statement that a model gives you the probability of an object being in some class. We want it. We are trying to achieve it. But you have to put TONS of effort into getting it.
📋 Also. For Titanic, we know that women were more likely to survive than men, first-class passengers more likely than third-class ones, and so on. Let’s just sum up good omens and subtract the bad ones. For five features, we will get a number in the range from -5 to 5. Is it useful? Yes. Is it a probability? No.
🎰🎰🎰 It is the Score. 🎰🎰🎰
For me, Score is a very important thing. In a binary classification task, it is some quantity that is large if we expect the object to belong to the positive class and small if we expect it to belong to the negative class.
Now let’s look at the whole picture. We have an object. We digitize it and send its features to a model. Then we move all these levers and knobs of the model so that it gives a low score to objects from the negative class and a high score to objects from the positive class.
👻 And here we are. We trained our model. Is that it? No.
We still have to set a threshold for decision-making and measure the quality of the model.
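The omen-summing score fits in a few lines. The feature encodings below are my own illustrative guesses, not an exact recipe from the posts above:

```python
# A hypothetical "omen score": +1 per good omen, -1 per bad one.
# Five binary omens give an integer score in [-5, 5].
# The encodings are illustrative assumptions, not a fitted model.
def omen_score(p):
    omens = [
        p["sex"] == "female",      # women survived more often
        p["pclass"] == 1,          # first class over third
        p["age"] < 16,             # children first
        2 <= p["family"] <= 4,     # family sizes 2-4 vote for survival
        p["embarked"] == "C",      # Cherbourg passengers fared better
    ]
    return sum(1 if good else -1 for good in omens)

rose = {"sex": "female", "pclass": 1, "age": 17, "family": 2, "embarked": "S"}
jack = {"sex": "male", "pclass": 3, "age": 20, "family": 1, "embarked": "S"}
print(omen_score(rose), omen_score(jack))
```

The output is a useful ordering of passengers, but nothing about it is a probability, which is exactly the point.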
The model, a model...
« prev | content | next »
Totally forgot to include a few words about models in the previous post. So, a separate post.
The picture is quite traditional. Formally speaking, a model is a function f = f(x, α). In our terms, x is the values of features for an object, and α is the parameters. All our knobs and handles that we can adjust in order to fit the dataset.
Of course, you know the linear model y = kx + c. It has parameters k and c. You have probably also heard about decision trees. There, the parameters are thresholds in the nodes and values in the leaves. In neural networks, parameters are the numbers in the weight matrices.
All these complex things are by now familiar and simple. For me personally, it was quite a fresh idea that each feature is itself a model. So you can estimate its quality individually using ROC AUC, MSE, or anything else you like. That immediately gives you an idea of how important a feature may be, without training complex models and without SHAP analysis.
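Scoring a single feature as a model needs nothing more than an AUROC routine. A minimal pure-Python sketch, using midranks so tied feature values are handled correctly (the very ties this series is about); labels are assumed to be 0/1:

```python
def auroc(scores, labels):
    # Rank-based (Mann-Whitney) AUROC with midranks for tied scores.
    order = sorted(range(len(scores)), key=lambda i: scores[i])
    ranks = [0.0] * len(scores)
    i = 0
    while i < len(order):
        # Find the whole group of equal scores and give it one midrank.
        j = i
        while j + 1 < len(order) and scores[order[j + 1]] == scores[order[i]]:
            j += 1
        midrank = (i + j) / 2 + 1  # 1-based midrank of the tie group
        for k in range(i, j + 1):
            ranks[order[k]] = midrank
        i = j + 1
    pos = [r for r, y in zip(ranks, labels) if y == 1]
    n_pos, n_neg = len(pos), len(labels) - len(pos)
    return (sum(pos) - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)
```

Feed it a raw feature column as `scores` and the target as `labels`, and you get that feature's quality as a standalone model.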
From this point of view, feature engineering like family = sibsp + parch + 1 can be considered a kind of stacking of models.
And finally. [0.23, -1.18, 3.25] looks like a normal set of model parameters. But what if I tell you that ((1 x) ((1 1) x)) is a model description as well? Let’s discuss it next time.
Models. EML in ML
« prev|content|next »
Three weeks ago, hype erupted around the EML function. Three articles appeared almost at once. The first stated that all we need is a "two-button calculator": one button for "1", one for "EML", and that's enough to express everything we need to describe the world around us. Here EML(x, y) = exp(x) - ln(y). The second applied this approach to modeling battery charge cycles, and the third mixed traditional neural-network elements with a new gate.
I have mixed feelings here. On the one hand, intuition says that to be useful, these trees have to be really deep: 10 or 20 nodes, say. Too much for brute force. On the other hand, the usual optimization methods do not work out of the box. Backpropagation on tree structures? No. Gradient descent over a correct bracket sequence? No. To be honest, even conversion between EML notation and the usual symbolic notation is a pain in the a... neck.
Then I had a thought. An EML expression is an expression like eml(eml(1, x), 1). First of all, we can see that there is always "eml(", so we can just use "(". This way it becomes "((1x)1)". Looks interesting. In the one-variable case, we have only four token types to encode all EML expressions: (, ), 1, and x. Would it be beneficial for transformers to have only four tokens to predict? Interesting thought, but not for this post. Maybe later...
But now we are ready to understand the cryptic expression from the previous post. ((1 x) ((1 1) x)) is a function, encoded by EML notation. You can decipher it recursively. (1 1) is (exp(1) - ln(1)) = e. Then ((1 1) x) = (e x) = exp(e) - ln(x) = e**e - ln(x). And so on.
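The recursive deciphering can be written down directly. A minimal sketch of an evaluator for the one-variable bracket notation described above, where (a b) means EML(a, b) = exp(a) - ln(b):

```python
import math

def eval_eml(expr, x):
    # Recursive-descent evaluator for the bracket EML notation.
    # Tokens: "(", ")", "1", "x"; (a b) means exp(a) - ln(b).
    tokens = expr.replace("(", " ( ").replace(")", " ) ").split()
    pos = 0

    def parse():
        nonlocal pos
        tok = tokens[pos]
        pos += 1
        if tok == "1":
            return 1.0
        if tok == "x":
            return x
        assert tok == "("
        a = parse()
        b = parse()
        assert tokens[pos] == ")"
        pos += 1
        return math.exp(a) - math.log(b)

    return parse()
```

For example, eval_eml("((1 1) x)", x) reproduces the hand derivation above: (1 1) = e, so the whole thing is e**e - ln(x).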
I conducted an experiment. I took the Titanic dataset and normalized all features to the range [0, 1]. Then I ran a selection of 3-level EML trees with the best ROC AUC over pairs of features. On top of these candidates, I ran a genetic algorithm to find the best combination by ROC AUC and logloss. The result is an EML tree that solves the Titanic dataset almost as well as the specialized CatBoost package does.
This exercise is just for fun, because the algorithm is extremely computationally expensive and can't be scaled at all.
But I totally think it was worth it. As a result, we have something as precious as a talking frog: an EML tree that solves the Titanic dataset, and the understanding that notation like (1(x (x (1 1)))) enumerates mathematical expressions and therefore ML models.
👉👉Know someone who likes math, ML, and weird experiments? Share this post with them!👈👈
To check:
🛸 EML tree for Titanic
(((ps_log ((sa_log sf_auc) ((ps_prob sf_prob) sx_fs_sx))) (sf_prob sa_log)) ((sf_prob ((ps_log (ps_log pc_ag_mx)) (ps_prob (pc_ag_mx pc_fr_mx)))) (sf_prob ((sf_prob (sa_log sx_fs_sx)) sx_fs_sx))))
🤿 GitHub - code of EML solving the Titanic
SCI BOT
Have you ever wanted to chat with an LLM that has access to a vast body of scientific knowledge? Now you can.
The famous archive that gives free access to tons of scientific papers now lets you chat with them via a bot.