Hello everyone!
This is a place for me to share my pet projects and everything that bothers or fascinates me right now. It can be almost anything: a new book, a half-baked DIY idea, a mathematical trick, or part of a long-term project. If things like 3D printing, electronics, sci-fi, or vinyl record optical restoration resonate with you, join and enjoy the journey.
Today I’m thinking about gradient-boosted decision trees with extrapolation. Who knows what it will be tomorrow.
Here it is— a historical picture of a modification my friend and I made to gradient-boosted decision trees. It all started at Equifax. In 2016—almost ten years ago—I tried to become part of a credit-scoring project. At the time, the well-known XGBoost package (2014) was only two years old, and everyone was trying to apply it in their domain. In credit scoring it gave quite a boost, especially compared with linear-regression models, which are widely used in banks for their interpretability. But GBDT models quickly degrade because of the dynamic nature of the fraud market. The picture shows the main idea: put extrapolators in the leaves of the GBDT, turning it into an extrapolator rather than an interpolator, unlike the majority of well-known ML models.
Credit scoring problem statement
In the previous article, I mentioned the credit-scoring task and noted that results can be unstable and model quality degrades over time.
Let’s dive a little deeper. When I try to recall the algorithm details and share the math, I can’t help adding a few side notes—like using Oracle PL/SQL for dataset preparation. I hope that’s interesting too.
When we need to decide whether to give a loan to an applicant, we have their application. We can find their previous applications and calculate a bunch of factors:
* how many different mobile phone numbers they used;
* how many different street names were mentioned;
* and so on.
Gradient Boosted Decision Trees (GBDT), in a nutshell, divide our customers into groups by thresholding features and assign each group a constant score. A high score means the group is more likely to default; a low score suggests reliable customers who will repay.
The problem is that factors change their strength—and even their meaning—over time.
Take the feature “exactly three phone numbers in a user’s history.” We checked three consecutive years: 2014, 2015, 2016. The average fraud rate was 1% in each year. We ran a simple uplift check: selected the group with exactly three phone numbers and computed the average target value for each year. Results:
* 2014: 1.5%
* 2015: 1%
* 2016: 0.5%
In 2014, this factor was solid evidence of higher fraud risk. In 2015, it became neutral. In 2016, its polarity reversed: it suggested a more lawful customer.
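The uplift check above can be sketched with pandas (a minimal sketch over a toy table; the column names `n_phones` and `is_fraud` are my own, not Equifax's schema):

```python
import pandas as pd

# Toy dataset: one row per application, with the year,
# the number of distinct phone numbers seen, and the fraud target.
df = pd.DataFrame({
    "year":     [2014] * 4 + [2015] * 4 + [2016] * 4,
    "n_phones": [3, 3, 1, 2,   3, 3, 1, 2,   3, 3, 1, 2],
    "is_fraud": [1, 1, 0, 0,   1, 0, 0, 0,   0, 0, 1, 0],
})

# Uplift check: average target inside the "exactly three phones" group,
# computed separately for each year.
uplift = (
    df[df["n_phones"] == 3]
    .groupby("year")["is_fraud"]
    .mean()
)
print(uplift)
```

On real data, a factor whose per-year means drift like this is exactly the kind of instability described above.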
Next time I'm going to talk about dataset preparation and why I'm not happy with Oracle PL/SQL.
Intro
In previous posts I explained why using gradient-boosted decision trees “as is” is problematic for credit scoring and gave a glimpse of the factors used in this area. Today let’s discuss the approach Equifax used to create datasets—and how to apply extrapolating gradient-boosted decision trees to this problem.
Dataset
About the target. We check whether an applicant paid their loan back within three months: paid back → “0”, didn’t → “1”. That creates an extra headache: by the time the target is known, the dataset is already three months out of date.
Equifax had several million applications in an Oracle database. As I vaguely recall, there were hundreds of thousands of unique applicants. Our task was to run a massive GROUP BY over the database and compute counts like the number of unique addresses, phone numbers, etc. That wasn’t the last step: for each “raw” factor we also derived a set of binary factors—so-called “flags.”
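A hedged sketch of that pipeline in pandas (toy table, hypothetical column names; the real job ran inside Oracle):

```python
import pandas as pd

# Hypothetical applications table: applicant id plus contact details.
apps = pd.DataFrame({
    "applicant_id": [1, 1, 1, 2, 2],
    "phone":   ["a", "b", "b", "c", "c"],
    "address": ["x", "x", "y", "z", "z"],
})

# "Raw" factors: per-applicant counts of unique values -- the big GROUP BY.
raw = apps.groupby("applicant_id").agg(
    n_phones=("phone", "nunique"),
    n_addresses=("address", "nunique"),
)

# Derived binary "flags": thresholded versions of the raw factors.
# The thresholds here are illustrative, not the ones used in production.
flags = pd.DataFrame({
    "flag_many_phones":    (raw["n_phones"] >= 2).astype(int),
    "flag_many_addresses": (raw["n_addresses"] >= 2).astype(int),
})
```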
Map-Reduce over Oracle
It turned out the speed of this procedure depended non-linearly on chunk size, so I wrote the longest PL/SQL script of my life—about 400 lines. It pulled chunks of ~100,000 records, processed them, and stored the results. After the whole table was processed, all chunks were joined together. So to speak, MapReduce over Oracle. The calculations took about five days. Coming from Yandex’s MapReduce platform, I was in quiet disbelief: a GROUP BY over a couple of million records should be five minutes, not five days. I still think we’re missing tools for meso-scale data—something between a pandas DataFrame and full-blown MapReduce—but that’s a topic for another post.
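Stripped of all PL/SQL specifics, the chunked pattern looks roughly like this in Python (a sketch of the map-then-join shape, not the original 400-line script):

```python
CHUNK_SIZE = 100_000  # records per chunk, as in the original script

def process_in_chunks(records, chunk_size=CHUNK_SIZE):
    """Map step: aggregate each chunk independently, store the partials."""
    partials = []
    for start in range(0, len(records), chunk_size):
        chunk = records[start:start + chunk_size]
        # Toy per-chunk aggregation: count records per key.
        counts = {}
        for key, _value in chunk:
            counts[key] = counts.get(key, 0) + 1
        partials.append(counts)
    return partials

def merge_partials(partials):
    """Reduce step: join the per-chunk results into one table."""
    total = {}
    for counts in partials:
        for key, n in counts.items():
            total[key] = total.get(key, 0) + n
    return total

# Tiny demo with a small chunk size.
records = [("a", 1)] * 3 + [("b", 1)] * 2
result = merge_partials(process_in_chunks(records, chunk_size=2))
```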
Obstacles
With this dataset in hand, we started experiments. A few notable moments:
* I found a bug in XGBoost (something about importing data from Python arrays).
* GBDT on the binary flag features had the same quality as linear regression (on the raw features, boosting could find better thresholds than the ones frozen into the flags).
* Predictions were unstable; quality degraded quickly over time.
Unfortunately, overcoming these obstacles didn’t build a strong relationship with my manager. On the contrary, it led to the termination of my employment at Equifax at the end of the trial period.
The birth of the approach
By the end of that period I had an idea: if the system is unstable over time, we can leverage that by trying to learn temporal dependencies. That’s exactly the idea behind Gradient-Boosted Decision Trees with Extrapolation. I remember the moment clearly—it was Friday the 13th, April 13, 2018.
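A toy sketch of the contrast (my illustration, not the actual implementation): a vanilla GBDT leaf stores one constant, while an extrapolating leaf fits a function of time and can project the trend forward:

```python
import numpy as np

t = np.array([0.0, 1.0, 2.0, 3.0])  # observation times within one leaf
y = np.array([1.5, 1.0, 0.5, 0.0])  # target drifting over time

# Vanilla GBDT leaf: a single constant score.
const_leaf = y.mean()

# Extrapolating leaf: least-squares fit of y ~ a + b * t.
X = np.column_stack([np.ones_like(t), t])
a, b = np.linalg.lstsq(X, y, rcond=None)[0]

# Prediction one step beyond the observed data -- extrapolation,
# which the constant leaf cannot do.
pred_future = a + b * 4.0
```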
To finalize the dataset description: a binary target with ~1% positive rate; a dozen or two “raw” features like “number of unique phone numbers in previous applications”; and about 75 derived “flag” features built from those raw features.
I said goodbye to that dataset, but the idea was too interesting to drop—I was eager to implement it.
Math in Telegram
Intro
Change of subject. I’m feeling, probably, like Knuth felt 50 years ago. I started typing a Telegram post, and when it felt appropriate to add a simple mathematical expression, I hit a brick wall. As my friend and I discussed ten years ago, there’s still no easy way to type math expressions in Telegram.
This post is a new style for me—maybe a “struggle story.” Everyone can tell a success story, and we’re all a bit fed up with that. My friend, who runs the channel, told me the struggle itself could be interesting. Let’s test this hypothesis.
Recap
Brief recap: I fixed a nasty issue in my gradient-boosted decision trees repository, started improving it and writing about the process in this channel, and realized I can’t write math expressions in Telegram posts. So instead of improving GBDT-E, I’m trying to figure out this math hiccup.
Math in Telegram, SOTA
First, I decided to diagnose the problem. I asked our “iron” friend, ChatGPT, about a convenient way to include math in Telegram posts. It gave a long explanation about making links and sharing PDFs. None of that satisfied me. I just want to type text and have inline inserts—like when I’m writing with a pen.
Next, I checked several math-heavy channels—contest problems, ML and DS lectures. Almost all of them use the same ugly approach: copy-pasted PDFs and pictures.
Instant View
During these attempts I noticed “Instant View” and started digging into this technology. The Telegram documentation says it can parse a blog article or site, cache it on Telegram’s servers, and show it instantly inside the app. Sounds promising. I put the beginning of my article in Markdown, asked a Codex agent to convert it to HTML and publish it on GitHub Pages. The first results were quite satisfying: the HTML looked close to what I wanted, and a nice Instant View was generated. I’ll show those results. But then the struggle began.
LaTeX in Telegram
I came up with a temporary solution for publishing texts with math expressions included. It doesn't allow formulas in the middle of a line and is quite cumbersome to use, but it works.
The current approach:
* write an article in Markdown;
* use agentic workflows to create the HTML and pictures;
* use the template editor to create an IV link (this link probably requires my authorization);
* insert the link into the message; it shows an Instant View button with a preview.
Notes for future development:
* improve the build script to make the conversion from .md to HTML + PNG smoother;
* investigate why wide pictures refuse to load in Instant View (IV) while narrow pictures are totally fine;
* come up with a GitHub Pages-based solution I can share, so that someone can clone the repository and work with it;
* try to build a "Telegraph with LaTeX" service.
Forging the algorithm
The main idea behind vanilla gradient-boosted decision trees:
* loss function;
* split procedure;
* Taylor expansion;
* expression for the minimum.
I have just tested one more way to publish Instant View articles. It uses
U+2060 WORD JOINER (copy between brackets) []
The trick is to copy the invisible character from between the brackets, set a link on it with Ctrl+K, and publish the article.
This way there is no visible text besides the Instant View text itself.
Great news everyone!
I turned on auto-translation on this channel, so everyone can read posts in their favorite language.
The same post in a slightly different format. Let's try it and see what works better for us.
Designing an MSE Dataset
Intro
Today I want to break our strict posting sequence and jump to a current problem.
I want to open my GitHub project and give users a nice dataset to test our method. There’s already a good dataset for the log-loss function, and I was struggling to design one for MSE. So here we are: we have a model that can group objects and find the best linear combination of basis functions to minimize a loss. How do we create a nice, interesting dataset for such a model? Buckle up—let’s dive in.
DIY Dataset
First of all, I started with a small bit of handcrafting and wrote “Extra Boost” with dots on a piece of paper. Then I uploaded it to ChatGPT and started a vibe-coding session. In no time I turned the dots into a scatter plot. But, as you know, screen and math coordinate systems differ in the direction of the Y-axis, so my title was upside down.
It seems trivial—barely worth mentioning. And, probably, it is in the world of traditional programming. But in vibe-coding it became quite a nuisance. Like two silly servants in an old Soviet cartoon, the model did two things at once: it changed the dataset and flipped the sign of Y. As a result, I spent quite a while staring at an upside-down title. After a while I figured out what was going on and got a normal scatter plot.
“And now what?” — probably the question you have in your head right now. Don’t worry—I had the same question. Then I started to think about properties of my approach:
* it can group objects by static features;
* it can find the best fit for several basis functions.
So I needed some interesting basis functions. OK: [1, t, sin(kt), cos(kt)]. The constant lets you move the plot up and down—always useful. t is for trends. The pair sin(kt) and cos(kt) lets us place a harmonic component where we want it; with the right amplitudes you can shift it left or right.
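A quick sketch of fitting weights for this basis by least squares with NumPy (synthetic data; the true weights and noise level here are my own choices):

```python
import numpy as np

k = 50.0  # the harmonic scale chosen later in the experiments
rng = np.random.default_rng(0)

# Synthetic signal built from the basis [1, t, sin(kt), cos(kt)].
t = np.linspace(0.0, 1.0, 200)
A, B, C, D = 0.2, 1.0, 0.5, -0.3
y = A + B * t + C * np.sin(k * t) + D * np.cos(k * t)
y_noisy = y + rng.normal(0.0, 0.01, size=t.shape)

# Design matrix with one column per basis function; least squares
# recovers the weights (A, B, C, D).
X = np.column_stack([np.ones_like(t), t, np.sin(k * t), np.cos(k * t)])
weights, *_ = np.linalg.lstsq(X, y_noisy, rcond=None)
```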
Let’s stop here. Where these basis functions show up in our “Extra Boost” title—I’ll explain in the next post.
Which of the two formats is more convenient for you to read? (Anonymous poll with options: Instant View ⚡️; post + pictures after it 🦥; Who is here? ❓; I need more info: what are these approaches? ❓)
How to find linear superposition in chaos
Now we have a set of points which, while fairly random from a mathematical point of view, give us a depiction of the “Extra Boost” sign. For my method, I need to find several groups, each represented by a linear combination of basis functions. I set time \(t\) to go from left to right, from 0 to 1. The basis functions are [1, t, sin(kt), cos(kt)], so the extrapolating function is (Expression below).
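To make the referenced expression explicit (my LaTeX reconstruction from the basis and weights named here):

```latex
y(t) = A + B\,t + C\,\sin(kt) + D\,\cos(kt), \qquad t \in [0, 1]
```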
Weights (A,B,C,D) can be estimated from the dataset using least squares, but we still need to pick k. After a set of experiments I chose k=50: it gives a convenient scale—the wavelength is roughly the width of a letter.
With this setup I obtained the picture you see at the beginning of the article. Then I decided the tolerance was too large and reduced the band width.
Here we are: a narrow band.
Next, I removed points within the tolerance range and repeated the process. To my surprise, after the first iteration nothing changed.
You can see that the dots disappeared, but the curve didn’t change. After a while I understood why. It was vibe-coding: I asked my iron friend to find a curve that captures the highest number of points; instead, it wrote code that minimizes MSE. That approach has an interesting property: when you delete points lying on the curve, the MSE is unchanged, so the same curve remains optimal.
I told the iron friend that, instead of minimizing squared distance to the points, it should maximize the number of captured points. It proposed the RANSAC approach, which was new to me: repeatedly select four random points, fit the curve, count captured points, and keep the candidate with the most inliers. It worked.
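A hedged sketch of that RANSAC loop (my reconstruction, with an assumed tolerance and iteration count; four sample points because there are four basis functions):

```python
import numpy as np

def design(t, k=50.0):
    """Design matrix for the basis [1, t, sin(kt), cos(kt)]."""
    return np.column_stack([np.ones_like(t), t, np.sin(k * t), np.cos(k * t)])

def ransac_curve(t, y, tol=0.05, n_iter=500, seed=0):
    """Keep the 4-point candidate curve that captures the most inliers."""
    rng = np.random.default_rng(seed)
    best_w, best_inliers = None, -1
    for _ in range(n_iter):
        idx = rng.choice(len(t), size=4, replace=False)
        try:
            w = np.linalg.solve(design(t[idx]), y[idx])  # exact fit to 4 points
        except np.linalg.LinAlgError:
            continue  # degenerate sample, skip it
        inliers = np.abs(design(t) @ w - y) < tol
        if inliers.sum() > best_inliers:
            best_w, best_inliers = w, int(inliers.sum())
    return best_w, best_inliers

# Demo on noiseless points lying exactly on one curve.
t = np.linspace(0.0, 1.0, 100)
y = 0.1 + 0.5 * t + 0.2 * np.sin(50.0 * t) - 0.1 * np.cos(50.0 * t)
w_hat, n_inliers = ransac_curve(t, y)
```

On the real drawing, each RANSAC pass peels off one component; removing its inliers and repeating yields the decomposition described below.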
I ran the process iteratively, and it decomposed the figure into a superposition of functions. Unfortunately, the upper half of “B” wasn’t captured. I suspected the issue was the different heights of lowercase and uppercase letters and created a second version of the drawing.
The same procedure gave me the sign decomposed into eight components, each a superposition of the basis functions.
Finally, I encoded the group number as a 0–1 vector of static features f1,f2,f3 and exported the dataset as CSV. Hooray — now we have data to test the MSE mode of the EXTRA BOOST model.