Hello everyone!
This is a place for me to share my pet projects and everything that bothers or fascinates me right now. It can be almost anything: a new book, a half-baked DIY idea, a mathematical trick, or part of a long-term project. If things like 3D printing, electronics, sci-fi, or vinyl record optical restoration resonate with you, join and enjoy the journey.
Today I’m thinking about gradient-boosted decision trees with extrapolation. Who knows what it will be tomorrow.
Here it is: a historical picture of a modification my friend and I made to gradient-boosted decision trees. It all started at Equifax. In 2016, almost ten years ago, I tried to become part of a credit-scoring project. At the time, the well-known XGBoost package (2014) was only two years old, and everyone was trying to apply it in their domain. In credit scoring it gave quite a boost, especially compared with the linear-regression models widely used in banks for their interpretability. But GBDT models quickly degrade because of the dynamic nature of the fraud market. The picture shows the main idea: put extrapolators in the leaves of the GBDT, turning the model into an extrapolator rather than an interpolator, unlike the majority of well-known ML models.
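To make the idea concrete, here is a minimal Python sketch, not the implementation we actually used: plain gradient boosting with squared loss, except each leaf stores a linear trend over application time instead of a constant, so the ensemble can extrapolate into future periods. Every name and parameter below is illustrative.

```python
# Minimal sketch (illustrative only): gradient boosting where each tree leaf
# holds a linear trend over application time instead of a constant score.
import numpy as np
from sklearn.tree import DecisionTreeRegressor
from sklearn.linear_model import LinearRegression


def fit_leaf_extrapolators(tree, X, t, residual):
    """Fit a residual-vs-time linear model inside every leaf of a fitted tree.

    X: 2-D feature array, t: 1-D array of application times (e.g. year as float).
    """
    leaves = tree.apply(X)
    return {
        leaf: LinearRegression().fit(t[leaves == leaf].reshape(-1, 1),
                                     residual[leaves == leaf])
        for leaf in np.unique(leaves)
    }


def predict_stage(tree, leaf_models, X, t):
    """Route samples to leaves, then evaluate each leaf's trend at time t."""
    leaves = tree.apply(X)
    out = np.zeros(len(X))
    for leaf, model in leaf_models.items():
        mask = leaves == leaf
        if mask.any():
            out[mask] = model.predict(t[mask].reshape(-1, 1))
    return out


def fit_boosted_extrapolator(X, t, y, n_trees=50, lr=0.1, max_depth=3):
    """Ordinary squared-loss boosting loop; only the leaf values differ."""
    pred = np.zeros(len(y), dtype=float)
    stages = []
    for _ in range(n_trees):
        residual = y - pred  # negative gradient of squared loss
        tree = DecisionTreeRegressor(max_depth=max_depth,
                                     min_samples_leaf=50).fit(X, residual)
        leaf_models = fit_leaf_extrapolators(tree, X, t, residual)
        pred += lr * predict_stage(tree, leaf_models, X, t)
        stages.append((tree, leaf_models))
    return stages


def predict_boosted(stages, X, t, lr=0.1):
    pred = np.zeros(len(X), dtype=float)
    for tree, leaf_models in stages:
        pred += lr * predict_stage(tree, leaf_models, X, t)
    return pred
```

The per-leaf model can be anything that extrapolates (a line in time, an exponential decay, and so on); the sketch uses a plain linear trend only to keep the code short.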
Credit scoring problem statement
In the previous article, I mentioned the credit-scoring task and noted that results can be unstable and model quality degrades over time.
Let’s dive a little deeper. When I try to recall the algorithm details and share the math, I can’t help adding a few side notes—like using Oracle PL/SQL for dataset preparation. I hope that’s interesting too.
When we need to decide whether to give a loan to an applicant, we have their application. We can find their previous applications and calculate a bunch of factors (a small sketch follows the list below):
* how many different mobile phone numbers they used;
* how many different street names were mentioned;
* and so on.
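In the spirit of the dataset preparation mentioned above (done in Oracle PL/SQL back then), here is a pandas sketch of how such history features could be computed; the column names are made up for illustration.

```python
# Illustrative sketch: per-applicant history features from past applications.
# Column names (person_id, phone, street) are assumptions, not the real schema.
import pandas as pd

applications = pd.DataFrame({
    "person_id": [1, 1, 1, 2, 2],
    "phone":     ["111", "222", "222", "333", "333"],
    "street":    ["Oak", "Oak", "Pine", "Elm", "Elm"],
})

features = applications.groupby("person_id").agg(
    n_phones=("phone", "nunique"),    # how many different mobile numbers
    n_streets=("street", "nunique"),  # how many different street names
).reset_index()

print(features)
#    person_id  n_phones  n_streets
# 0          1         2          2
# 1          2         1          1
```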
Gradient Boosted Decision Trees (GBDT), in a nutshell, divide our customers into groups by thresholding features and assign each group a constant score. A high score means the group is more likely to default; a low score suggests reliable customers who will repay.
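A toy illustration of that view (made-up feature and data): a single shallow tree splits applicants by a threshold on one feature, and each leaf ends up holding one constant, the mean target of the training samples that fall into it.

```python
# Toy illustration: thresholds define groups, each leaf stores a constant score.
import numpy as np
from sklearn.tree import DecisionTreeRegressor, export_text

rng = np.random.default_rng(0)
n_phones = rng.integers(1, 6, size=1000).reshape(-1, 1)  # feature: distinct phone numbers
is_fraud = (rng.random(1000) < 0.01 * n_phones.ravel()).astype(float)  # synthetic target

tree = DecisionTreeRegressor(max_depth=2).fit(n_phones, is_fraud)
print(export_text(tree, feature_names=["n_phones"]))  # shows thresholds and leaf constants
```

A real GBDT stacks many such trees, but the leaf of every tree is still just a constant.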
The problem is that factors change their strength—and even their meaning—over time.
Take the feature “exactly three phone numbers in a user’s history.” We checked three consecutive years: 2014, 2015, 2016. The average fraud rate was 1% in each year. We ran a simple uplift check: selected the group with exactly three phone numbers and computed the average target value for each year. Results:
* 2014: 1.5%
* 2015: 1%
* 2016: 0.5%
In 2014, this factor was solid evidence of higher fraud risk. In 2015, it became neutral. In 2016, its polarity reversed: it suggested a more lawful customer.
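A minimal sketch of that check, on made-up numbers (the column names and data are illustrative, not the real bureau data):

```python
# Illustrative uplift check: fraud rate of the "exactly three phones" group
# versus the overall rate, year by year. Data below is synthetic.
import pandas as pd

df = pd.DataFrame({
    "year":     [2014]*4 + [2015]*4 + [2016]*4,
    "n_phones": [3, 3, 1, 2,   3, 3, 2, 1,   3, 3, 1, 2],
    "is_fraud": [1, 0, 0, 0,   0, 0, 0, 1,   0, 0, 1, 0],
})

overall = df.groupby("year")["is_fraud"].mean()                     # baseline rate per year
group = df[df["n_phones"] == 3].groupby("year")["is_fraud"].mean()  # target group rate
print(pd.DataFrame({"overall": overall, "three_phones": group}))
```

If the group rate drifts from above the baseline to below it across years, the factor's sign has flipped, which is exactly what we observed.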
Next time I'm going to talk about dataset preparation and why I'm not happy with Oracle PL/SQL.
