Three out of eight
Let’s inspect three of the eight components (the colored figure shows the full title). The first component captures the “hats” of the T’s and the upper parts of B and O; the second picks up interior points that happen to lie on steep slopes; the third follows portions of the vertical strokes.
The Absolute Evil of Default Parameters
Default parameters are evil. Seriously. It’s very convenient evil—so convenient that it crawled into my library.
In the first image there is a call to train a model with:
🐴n_stages = 1: one step of boosting (one tree)
🐴learning_rate = 1.0: jump immediately into the minimum of the current parabola
🐴max_depth = 3: use three static features to separate objects into groups (8 groups)
And that seems to be it. So why do we have a non-zero loss? No clue.
God bless ChatGPT. I uploaded the dataset and explained the hypothesis: there’s something odd—check how well the extra parameters can be fitted to the target within each group defined by the static features. Then I went to brush my teeth. When I was back, the answer was ready: the dataset is OK, the MSE is about 1e-32—zero in terms of double precision.
I scratched my head, zipped my whole repository, uploaded it to ChatGPT, and asked: “Find me the source of 0.0007 RMSE.” Then I started my daily English routine. Fifteen minutes later the answer was ready:
F**ING DEFAULT PARAMETERS
My iron friend dug out this line:
🐴reg_lambda: float = 1e-4
This L2 regularization parameter forced the model to miss points in the training dataset and caused that small but annoying mismatch.
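To see why even a tiny reg_lambda rules out an exact fit, here is a minimal sketch, assuming the usual regularized leaf-value formula sum(residuals) / (count + reg_lambda); the helper and the numbers are illustrative, not code from my library:

def leaf_value(residuals, reg_lambda):
    # optimal leaf value for squared error with L2 regularization
    return sum(residuals) / (len(residuals) + reg_lambda)

targets = [2.0, 2.0, 2.0]           # a group the tree could otherwise fit exactly
print(leaf_value(targets, 0.0))     # 2.0 -> zero training error
print(leaf_value(targets, 1e-4))    # ~1.99993 -> small but non-zero RMSE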
I can imagine myself trying to debug this by hand. My iron friend saved me about two days of checking different parts of my system again and again. And, most likely, everything would have been put on hold again.
Everybody wants to be like ewe
Track on YouTube
Just a small reminder:
🐑Homophones - "you" and "ewe": same pronunciation, different meanings
🐑Homographs - same spelling, different meanings, sometimes with different pronunciations (wind the noun vs. wind the verb); "cat" the noun and "cat" the verb share both spelling and sound, so they are homonyms rather than heteronyms
🐑Homonyms (broad sense) - the umbrella term covering same-sound and/or same-spelling pairs
And that's not all:
🐑Oronyms: phrases that sound alike — “ice cream” / “I scream”.
🐑Homographs: same spelling, different meanings — lead (metal) / lead (guide)
🐑Heteronyms: same spelling, different pronunciations and meanings — wind (noun) / wind (verb).
🐑Capitonyms: meaning (and sometimes pronunciation) changes with capitalization — Polish / polish, March / march.
🐑Heterographs: same pronunciation, different spellings — you / ewe
🐑Polysemy: one word with related senses — head (of a person, of a department)
🐑Auto-antonyms (contronyms/Janus words): one word with opposite meanings — sanction (“approve” / “penalize”)
Mishearings
🐑Mondegreen: misheard lyric — “’Scuse me while I kiss this guy” for “kiss the sky.”
🐑Eggcorn: plausible but wrong substitution — “for all intensive purposes” for “intents and purposes.”
🐑Malapropism: wrong word, similar sound — “dance the flamingo” for flamenco.
🐑Spoonerism: swapping sounds — “You have hissed all my mystery lectures.”
The Farmer Was Replaced
In one of the ML channels I read, I found a reference to the game "The Farmer Was Replaced". In it you write programs, in a language that is quite similar to Python but not quite Python, to solve farm puzzles.
Some of them are quite basic: chop grass and collect hay, grow bushes and trees and collect wood. Carrots are totally normal: you just spend some wood to seed them. Weird things start with pumpkins: they rot with 20% probability, but when there is a patch of ready-to-harvest pumpkins, they merge together, and when you harvest this mega-pumpkin, you get a bonus harvest.
Then things go totally off the rails: in the picture, a dinosaur is chasing an apple. Basically, you have to program an autopilot for the famous "snake" game.
There are also cacti. You have to sort them on a two-dimensional grid to get a bonus.
I really like it because it gives a chance to practice medium programming tasks, at the Meta/Google 20-minute level, while playing a game.
It also brought back some memories I thought I'd forgotten. I will try to share them in this channel.
Memories Awakened by “The Farmer Was Replaced”
It seems like a very trivial question: “What should a drone do if it is on the lower (South) edge of the field and it has a move(South) command to perform?” But, strangely, this question leads to quite an interesting set of consequences.
In The Farmer Was Replaced the drone teleports to the other end of the field. It means that the upper edge of the field is glued to the lower edge. As I drew it, a square with upper and lower edges glued together becomes a cylinder from the topological point of view. Then the same goes for the left and right boundaries, and we have a torus.
There are other options. For instance, if the drone disappears on the lower edge and then appears on the left one (and similarly with upper and right), we have a sphere.
But that’s not all. You can play with the orientation of edges. If the drone appears at the beginning of the lower edge and then at the end of the upper edge, it means that we reversed the orientation. After we finish gluing edges, we get a Klein bottle or the projective plane. Unfortunately, you can’t embed these figures in 3D without self-intersection.
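Here is a tiny sketch of how these gluings look as coordinate rules for the drone (my own illustration, not TFWR code; only the torus rule matches what the game actually does):

n = 8  # world size, just for the example

def step_torus(x, y, dx, dy):
    # opposite edges glued with the same orientation: plain wrap-around
    return (x + dx) % n, (y + dy) % n

def step_klein(x, y, dx, dy):
    # top and bottom glued with reversed orientation: crossing them mirrors x
    nx, ny = x + dx, y + dy
    if ny < 0 or ny >= n:
        nx = n - 1 - nx
    return nx % n, ny % n

print(step_torus(3, 0, 0, -1))   # (3, 7): reappears on the opposite edge
print(step_klein(3, 0, 0, -1))   # (4, 7): reappears mirrored left to right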
About memories. I first read about this in Anatoly Fomenko’s book. He also wrote “History: Fiction or Science?” (New Chronology) and, while he’s quite famous for that, he’s also a mathematician; his “Visual Geometry and Topology” is an interesting read. It even contains illustrations for “The Master and Margarita.” What I explained about gluing the edges of a square together is a simplified version of the introductory section of that book.
TFWR. Navigation.
In order to navigate the drone, there are get_pos_x(), get_pos_y(), and move() functions. move() takes East (x + 1), North (y + 1), West (x − 1), and South (y − 1) commands. get_world_size() is useful too. The world is square, so a single number (the side length) gives you everything you need.
nav(x, y) is a nice-to-have function. It moves the drone to position (x, y).
In the screenshot, a naive navigation function decides where to go—positive or negative direction—and moves accordingly. Why naive? Because it doesn't take into account the toroidal nature of the world (wrap-around at the boundaries).
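For example, a one-row sweep needs nothing beyond these primitives, and the wrap-around means the drone ends where it started. A small usage sketch (can_harvest() and harvest() stand in for whatever per-tile action you need):

n = get_world_size()
for _ in range(n):
    if can_harvest():
        harvest()
    move(East)
# after n moves East the wrap-around brings the drone back to its starting column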
Hello everyone!
I really appreciate that all of you have signed up for my channel. It's kinda touching, and I feel a little bit awkward just writing about GBDT and TFWR (but what else could I write about…).
So, I’d like to have a small talk with you. Let’s discuss a few things in the comments:
🤖How has the LLM revolution affected you?
🤖What do you expect in the near future?
🤖How do you see your career plans, taking into account all this hubbub?
To kick things off, I'll share a few thoughts of my own without biasing the discussion, and I'll add more when (if) the conversation heats up.
Eggcorn of the day
Bone apple tea → bon appétit
Some time ago a friend of mine, who worked with native English speakers, told me that while they were working with a colleague on a piece of technical documentation, the word “she” suddenly appeared, referring to the “user.” They discussed it, and it turned out that the word's gender in English is the opposite of the gender of its Russian translation.
When I was on Vocabulary.com, I found the phrase above and recalled that “she-user” conversation. ChatGPT assures me that “user” is gender-neutral, as is “soldier” in the example above.
I tried to find cases of gender reversal, in the form “word”: (English gender) / (Russian gender).
the sea: she / neuter
a city: she / masculine
a ship: she / masculine
a car: she / masculine for “автомобиль”, feminine for “машина”
a country: she
a hurricane: she
This is ChatGPT’s opinion. Do you agree? Did you have some clashes like that in your practice?
Dawn of the day
It dawned on me today that Harlequin ↔️ Harley Quinn. An example of paronomasia (a pun)—more specifically, it works like an eggcorn/mondegreen for a proper name.
Torus navigation
Problem statement: issue control commands on a torus map to navigate the bot to a given point.
Essence
* the Pythonic (x2 - x1) % n gives the eastward toroidal distance
* it is easier to think in "direction first" terms
* go_best(East, West, (x2 - x1) % n, (x1 - x2) % n) is a solution
About three years ago I met this challenge on the CodinGame platform. When I tried to solve it again in TFWR, I had totally forgotten the approach, so here I want to note what works well and what leads to cumbersome constructions. I always try to memorize approaches rather than solutions. In this post I want to compare different ways of thinking about this problem and check the programs we end up with.
My first approach
def navigate(x, y):
    dx = x - get_pos_x()
    dy = y - get_pos_y()
    if dx:
        if dx > 0:
            for _ in range(dx):
                move(East)
        else:
            for _ in range(-dx):
                move(West)
    if dy:
        if dy > 0:
            for _ in range(dy):
                move(North)
        else:
            for _ in range(-dy):
                move(South)
It's a straightforward way to think about this problem: select a direction and walk in that direction. It's not bad in general, though it could be improved with helper functions. Then I tried to extend this approach to handle the torus topology. It worked, but the result was (obviously) so cumbersome that I decided not to publish it.
The next idea was to reverse the direction. The logic is: "we reverse our decision to go East or West if it doesn't give the best distance on the torus." I think this trick is funny, so let's look at it more closely.
if (dx > 0) ^ (n - dx < dx):
    go(East, abs(dx))
else:
    go(West, abs(dx))
Here we use an interesting property of the XOR gate: it inverts the first operand when the second one is true. When I realized that ^ doesn't work in TFWR, I used a comparison instead: (a ^ b) is the same as (a != b).
But this way is kinda creepy, so I dwelled for a while on what the right way to think about this problem is. The results:
1) We are working with remainders modulo n. Indeed, the possible coordinates are 0, ..., (n-1), and n is equivalent to 0.
2) We can think of our positions as vertices of an n-gon.
3) If we have two points x1 and x2, they divide the n-gon into two arcs.
4) In Python we can calculate the arc lengths as (x2 - x1) % n for the east direction and (x1 - x2) % n for the west direction.
And that is it!
def go_best(dr, dl, nr, nl):
    # pick the (direction, steps) pair with the smaller toroidal distance
    d, s = [(dr, nr), (dl, nl)][nl < nr]
    for _ in range(s):
        move(d)

def nav(x2, y2):
    n = get_world_size()
    x1, y1 = get_pos_x(), get_pos_y()
    go_best(East, West, (x2 - x1) % n, (x1 - x2) % n)
    go_best(North, South, (y2 - y1) % n, (y1 - y2) % n)
P.S.
Here I use the "boolean as index" trick because there is no (v1 if cond else v2) construction in the TFWR dialect.
Note that it matters that (x2 - x1) % n is a Pythonic expression: in C++, for example, operator % behaves slightly differently for negative operands, and there you have to write something like ((r % n) + n) % n to get the same result.
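A quick illustration of that difference: Python's % already returns a value in [0, n) for a positive n, which is exactly the toroidal distance we want.

n = 10
x1, x2 = 7, 3
print((x2 - x1) % n)   # 6: steps East from 7 to 3, wrapping around
print((x1 - x2) % n)   # 4: steps West from 7 to 3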
CodinGame
Practice Conditions with the exercise "Power of Thor - Episode 1"
Want to practice coding? Try to solve this easy puzzle "Power of Thor - Episode 1" (25+ languages supported).
Liquid Nitrogen Station
An attentive reader of one of my previous posts about the diamond plate in an electron microscope could notice a repetitive note: let's pour liquid nitrogen here, let's pour liquid nitrogen there. Sounds like a lot of liquid nitrogen. And it is. When I worked at the Physical Institute of the Russian Academy of Sciences, it was an everyday workout: going out with a special Dewar vessel and bringing 17 liters of liquid nitrogen back to the laboratory. About 13 liters went to pumping the air out of the microscope and cooling the specimen during the measurement. The remaining 4 liters evaporated during the night.
It's quite an interesting liquid and, when you have it in abundance, you can have a lot of fun. I heard that our colleagues froze ice cream with it. I myself conducted all the well-known experiments: freeze and cleave a piece of rubber, dip your hand into it. The most amazing thing was to drop in a piece of the porous plastic that was widespread in the Soviet Union for packing scientific equipment. Under normal conditions it's a springy material: you can compress it and it restores its original shape. But after a dip into liquid nitrogen it becomes extremely brittle, and when you squeeze it, it turns into a small heap of fine powder.
One more experiment I conducted inadvertently. When I was pouring liquid nitrogen into the photodetector bowl, a narrow stream escaped and soaked my jeans. So, when I unbent my leg, my jeans cracked. Jeans. Cracked. It was fun.
Big MSE dataset for GBDT
In the previous post I demonstrated a small dataset that shows how GBDT works. Now I want to present a much bigger one: it has 10 000 points.
I wasn't happy with the quality of the text in the small dataset, so I decided to repeat the algorithm and take more points this time.
While it's quite hard to set many points with a pen, it's easy to ask an iron friend to sample them.
At this point I got stuck for a while with an unexpected problem: I didn't like the fonts. I work in Linux, and mostly fonts don't bother me. Until now. I wanted a nice and interesting font for this task. I have no clue what "interesting" means here—probably a fat, rounded font. I don't know. And then a strange thing happened. When I googled something like "try different fonts online tool", I got nothing. Probably I was banned by Google that day, or I was extremely unlucky, but all the pages I came up with were non-functional or solved some other problem. At last, somehow I got to the... ta-da... fonts.google.com page. But the chain of thought wasn't straight.
Then progress started to go faster. My iron friend quickly rendered the title and scattered 10 000 points across it. Separation into components wasn't hard either. So I ended up with a dataset of 10 000 points separated into 128 groups.
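For the curious, here is a rough sketch of that step, assuming the usual Pillow + NumPy route (the title text, font, and sizes below are placeholders, not the ones I actually used):

import numpy as np
from PIL import Image, ImageDraw, ImageFont

W, H = 1200, 300
img = Image.new("L", (W, H), 0)
draw = ImageDraw.Draw(img)
font = ImageFont.truetype("DejaVuSans-Bold.ttf", 180)    # any fat, rounded font
draw.text((40, 40), "TITLE", fill=255, font=font)        # placeholder title text

mask = np.array(img) > 0
rng = np.random.default_rng(0)

points = []                     # rejection-sample 10 000 points inside the glyphs
while len(points) < 10_000:
    x, y = int(rng.integers(0, W)), int(rng.integers(0, H))
    if mask[y, x]:
        points.append((x, y))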
Let's pause here. There's quite an interesting thing about tree height and the number of groups. I want to talk about it slowly and clearly, so let's do it in the next post.
Big MSE dataset. Depth of trees.
There are 128 groups in the Big dataset. And to distinguish them perfectly, it's necessary to use exactly 7 binary features. Why is it so? Because when we add one more level to a decision tree, we change one leaf into two, replacing it with a decision rule. So, if we start with one root and build 7 levels over it, we will have exactly 128 leaves—one leaf for each group.
This illustrates a common phrase about GBDT: "Height of trees is the number of factors we want to work together." I have always heard this phrase, but it was just a theory for me, because in real‑world datasets you have no idea how many factors should work together.
In this dataset we know the exact number of factors to work together—7—and we can test that intuition. And that's exactly the story these three plots are telling us. At the top—trees have only two levels. The model has a limited ability to learn, and we see classic learning curves: the model grasps some generic knowledge and both train and test curves go down. Then it starts to learn noise and the test curve goes up. We can see that, because of the low depth, it can't fit the train set with good quality. It's exactly what we saw in the previous post on the plot "predicted values on train data."
If we check the lower graph, we can see that with depth 7 the MSE becomes much lower and we see a better correspondence between the dataset and the model's inference.
The third image is about the learning rate: lr for the upper plots is 0.3 and for the lower ones 0.1. You can see that the learning curves are less steep. But it doesn't help the asymptotic value of the train curve. When points are wrongly attributed to groups in the initial steps, there is no way to re‑attribute them later.
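The plots above come from my own library, but the shape of these curves is easy to reproduce with scikit-learn's GradientBoostingRegressor as a stand-in. A sketch with synthetic stand-in data (7 binary features, 128 group means, a bit of noise on top):

import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X = rng.integers(0, 2, size=(10_000, 7))         # 7 binary features -> 128 groups
group_means = rng.normal(size=128)
y = group_means[X @ (1 << np.arange(7))] + rng.normal(scale=0.1, size=10_000)

X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)
for depth in (2, 7):                             # the two depths compared above
    model = GradientBoostingRegressor(max_depth=depth, n_estimators=200,
                                      learning_rate=0.3)
    model.fit(X_tr, y_tr)
    print(depth,
          mean_squared_error(y_tr, model.predict(X_tr)),
          mean_squared_error(y_te, model.predict(X_te)))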
GBDT log‑loss dataset
In this post I want to solemnly declare: I'm not a mathematician. My friends who are, I'm totally sure, would solve this problem without effort using Bayes' formalism. I can only wave my hands.
So, what's the fuss?
I want to generate a synthetic dataset with properties similar to those of the initial fraudulent‑users dataset. And I want to control how much information the features bring about the target value. Moreover, I want to introduce a new variable to the dataset—t (time)—and add time dependence to the dataset's statistical properties.
Because of my math inaptitude, I use my intuition from physics problem‑solving. First, let's draw the figure you see in the picture. The horizontal axis stands for the binary factor f (feature). The vertical axis stands for the binary target l (label). The height of the horizontal line that splits l=0 and l=1 has a well‑defined meaning—it's the average value of our target, α. Let α = 0.5. That's the second equation, because the first is a + b + c + d = 1.
Then we can think about the average value of the factor. I want to have 16 factors in my dataset, and I want them, in total, to give slightly less information than would allow 100% recovery of the target. So the average factor value β should be 1/16 = 0.0625. β is the coverage—how often the factor equals 1. Third equation.
And finally, the lift. It's a ratio: in the numerator is the probability that the target equals 1 when the factor equals 1, and in the denominator is the average target value. In terms of our variables, d/(d+b) is the average target when f = 1, and α is the average target value.
When lift = 1, the factor gives no information about the target. When it's > 1, it shows how much stronger our position is when using this factor. For example, a lift of 1.3 shows that we would catch 30% more credit‑fraud users when using this factor. It's convenient to use log‑lift: it is 0 when there is no gain and has the same sign as the correlation between the target and the factor.
For my dataset I want time to run from 0 to 1, and I want two groups of factors: with the lift going up and with the lift going down. The expressions for these lifts are quite simple:
lift_up = 0.25 + 0.5*t
lift_down = 0.75 - 0.5*t
Now we have four variables and four equations to determine them. I solved the system using Gaussian elimination, and the result is in the lower picture.
I'm going to implement these expressions in the synthetic dataset generation script. It's already done, but I wanted to recap the logic behind it. Next time—the dataset.
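For reference, here is a small sketch of that elimination done numerically. I assume the quadrant convention a = P(f=0, l=0), b = P(f=1, l=0), c = P(f=0, l=1), d = P(f=1, l=1), so coverage is b + d, the target mean is c + d, and the lift equation d / (b + d) = lift * alpha becomes d = lift * alpha * beta:

import numpy as np

def cell_probs(alpha, beta, lift):
    # unknowns ordered (a, b, c, d)
    A = np.array([
        [1.0, 1.0, 1.0, 1.0],   # a + b + c + d = 1
        [0.0, 0.0, 1.0, 1.0],   # c + d = alpha  (target mean)
        [0.0, 1.0, 0.0, 1.0],   # b + d = beta   (coverage)
        [0.0, 0.0, 0.0, 1.0],   # d = lift * alpha * beta
    ])
    rhs = np.array([1.0, alpha, beta, lift * alpha * beta])
    return np.linalg.solve(A, rhs)

alpha, beta = 0.5, 1 / 16
for t in (0.0, 0.5, 1.0):
    lift_up = 0.25 + 0.5 * t          # the time-dependent lift from above
    print(t, cell_probs(alpha, beta, lift_up))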