Algorithms. Physics. Mathematics. Machine Learning.
DIY projects, fun with 3d, electronics, programming, vibe coding, math, ML algorithms.
Intro
In previous posts I explained why using gradient-boosted decision trees “as is” is problematic for credit scoring and gave a glimpse of the factors used in this area. Today let’s discuss the approach Equifax used to create datasets—and how to apply extrapolating gradient-boosted decision trees to this problem.

Dataset
About the target. We check whether an applicant paid their loan back within three months: paid back → “0”, didn’t → “1”. That creates an extra headache: by the time the target is known, the dataset is already three months out of date.

Equifax had several million applications in an Oracle database. As I vaguely recall, there were hundreds of thousands of unique applicants. Our task was to run a massive GROUP BY over the database and compute counts like the number of unique addresses, phone numbers, etc. That wasn’t the last step: for each “raw” factor we also derived a set of binary factors—so-called “flags.”
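As an illustration of that flag-derivation step, here is a minimal pandas sketch; the thresholds and column names are hypothetical, not the ones Equifax actually used:

```python
import pandas as pd

def add_flags(df: pd.DataFrame, raw: str, thresholds=(1, 2, 5, 10)) -> pd.DataFrame:
    """Derive binary 'flag' features from a raw count column."""
    for t in thresholds:
        df[f"{raw}_ge_{t}"] = (df[raw] >= t).astype(int)
    return df

# e.g. turn a raw count into flags unique_phones_ge_1, ..., unique_phones_ge_10
applications = pd.DataFrame({"unique_phones": [1, 3, 12]})
applications = add_flags(applications, "unique_phones")
```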

Map-Reduce over Oracle
It turned out the speed of this procedure depended non-linearly on chunk size, so I wrote the longest PL/SQL script of my life—about 400 lines. It pulled chunks of ~100,000 records, processed them, and stored the results. After the whole table was processed, all chunks were joined together. So to speak, MapReduce over Oracle. The calculations took about five days. Coming from Yandex’s MapReduce platform, I was in quiet disbelief: a GROUP BY over a couple of million records should be five minutes, not five days. I still think we’re missing tools for meso-scale data—something between a pandas DataFrame and full-blown MapReduce—but that’s a topic for another post.
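For flavor, here is roughly the same chunked map-reduce idea sketched in Python/pandas rather than PL/SQL; the query, the column names, and the `engine` connection string are all assumptions for the sketch:

```python
import pandas as pd
from sqlalchemy import create_engine

engine = create_engine("oracle+oracledb://user:pass@host/db")  # hypothetical DSN

distinct_parts = []
query = "SELECT applicant_id, address, phone FROM applications"
for chunk in pd.read_sql(query, engine, chunksize=100_000):
    # "map" step: deduplicate within each ~100k-record chunk
    distinct_parts.append(chunk.drop_duplicates())

# "reduce" step: deduplicate across chunks, then count unique values per applicant
all_rows = pd.concat(distinct_parts).drop_duplicates()
counts = all_rows.groupby("applicant_id").agg(
    unique_addresses=("address", "nunique"),
    unique_phones=("phone", "nunique"),
)
```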

Obstacles
With this dataset in hand, we started experiments. A few notable moments:

* I found a bug in XGBoost (something about importing data from Python arrays).
* GBDT on binary flag features had the same quality as linear regression (the thresholds are frozen into the flags; given the raw features, boosting could find better ones).
* Predictions were unstable; quality degraded quickly over time.

Unfortunately, overcoming these obstacles didn’t build a strong relationship with my manager. On the contrary, it led to the termination of my employment at Equifax at the end of the trial period.

The birth of the approach
By the end of that period I had an idea: if the system is unstable over time, we can leverage that by trying to learn temporal dependencies. That’s exactly the idea behind Gradient-Boosted Decision Trees with Extrapolation. I remember the moment clearly—it was Friday the 13th, April 13, 2018.

To finalize the dataset description: a binary target with ~1% positive rate; a dozen or two “raw” features like “number of unique phone numbers in previous applications”; and about 75 derived “flag” features built from those raw features.

I said goodbye to that dataset, but the idea was too interesting to drop—I was eager to implement it.
Math in Telegram

Intro

Change of subject. I’m feeling, probably, like Knuth felt 50 years ago. I started typing a Telegram post, and when it felt appropriate to add a simple mathematical expression, I hit a brick wall. As my friend and I discussed ten years ago, there’s still no easy way to type math expressions in Telegram.

This post is a new style for me—maybe a “struggle story.” Everyone can tell a success story, and we’re all a bit fed up with that. My friend, who runs the channel, told me the struggle itself could be interesting. Let’s test this hypothesis.

Recap

Brief recap: I fixed a nasty issue in my gradient-boosted decision trees repository, started improving it and writing about the process in this channel, and realized I can’t write math expressions in Telegram posts. So instead of improving GBDT-E, I’m trying to figure out this math hiccup.

Math in Telegram, SOTA

First, I decided to diagnose the problem. I asked our “iron” friend, ChatGPT, about a convenient way to include math in Telegram posts. It gave a long explanation about making links and sharing PDFs. None of that satisfied me. I just want to type text and have inline inserts—like when I’m writing with a pen.

Next, I checked several math-heavy channels—contest problems, ML and DS lectures. Almost all of them use the same ugly approach: copy-pasted PDFs and pictures.

Instant View

During these attempts I noticed “Instant View” and started digging into this technology. The Telegram documentation says it can parse a blog article or site, cache it on Telegram’s servers, and show it instantly inside the app. Sounds promising. I put the beginning of my article in Markdown and asked a Codex agent to convert it to HTML and publish it on GitHub Pages. The first results were quite satisfying: the HTML looked close to what I wanted, and a nice Instant View was generated. I’ll show those results. But then the struggle began.
LaTeX in Telegram


I came up with a temporary solution for publishing texts with math expressions included. It doesn’t allow formulas in the middle of a line and is quite cumbersome to use, but it works. You can check out my current approach below.

The current approach:

* Write an article in Markdown
* Use an agentic workflow to create the HTML and pictures
* Use the template editor to create an IV link (this link probably requires my authorization)
* Insert the link into the message; it produces an Instant View button with a preview

Notes for future development
* improve the build script to make the conversion from .md to HTML + PNG smoother
* investigate why wide pictures don’t load in Instant View (IV) when narrow pictures are totally OK
* come up with a GitHub-Pages-based solution that I can share: someone can clone the repository and work with it
* try to build a “Telegraph with LaTeX” service
Forging the algorithm

The main idea behind vanilla gradient-boosted decision trees (see the formulas after this list):

* Loss function
* Split procedure
* Taylor expansion
* Expression for the minimum
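For reference, here is the standard second-order machinery behind those bullets, in the usual XGBoost-style notation (my summary, not text from the repo). At step \(t\) the loss is \(\mathcal{L}^{(t)} = \sum_i l\bigl(y_i, \hat{y}_i^{(t-1)} + f_t(x_i)\bigr) + \Omega(f_t)\); the Taylor expansion to second order, with \(g_i = \partial_{\hat{y}} l\) and \(h_i = \partial^2_{\hat{y}} l\), gives

\[
\mathcal{L}^{(t)} \approx \sum_i \Bigl[ g_i\, f_t(x_i) + \tfrac{1}{2} h_i\, f_t^2(x_i) \Bigr] + \Omega(f_t).
\]

For a leaf \(j\) with instance set \(I_j\) and \(\Omega = \gamma T + \tfrac{\lambda}{2} \sum_j w_j^2\), the expression for the minimum is

\[
w_j^{\ast} = -\frac{\sum_{i \in I_j} g_i}{\sum_{i \in I_j} h_i + \lambda},
\qquad
\mathcal{L}_j^{\ast} = -\frac{1}{2} \frac{\bigl(\sum_{i \in I_j} g_i\bigr)^2}{\sum_{i \in I_j} h_i + \lambda},
\]

and the split procedure greedily picks the split that maximizes \(\mathcal{L}_{\text{parent}}^{\ast} - \mathcal{L}_{\text{left}}^{\ast} - \mathcal{L}_{\text{right}}^{\ast}\).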

👉 I have just tested one more way to publish Instant View articles. It uses U+2060 WORD JOINER (copy it from between the brackets) [⁠]

The trick is to copy the invisible character from between the brackets, set a link on it with Ctrl+K, and publish the article.

This way there is no visible text besides the Instant View text itself.
Congratulations!

We just defied the laws of math: over 100% of our subscribers read this.
Great news everyone!

I turned on auto-translate on this channel, so everyone can read the posts in their favorite language.
The same post in a slightly different format. Let’s try it and see what works better for us.

Designing an MSE Dataset

Intro
Today I want to break our strict posting sequence and jump to a current problem.

I want to open my GitHub project and give users a nice dataset to test our method. There’s already a good dataset for the log-loss function, and I was struggling to design one for MSE. So here we are: we have a model that can group objects and find the best linear combination of basis functions to minimize a loss. How do we create a nice, interesting dataset for such a model? Buckle up—let’s dive in.

DIY Dataset

First of all, I started with a small bit of handcrafting and wrote “Extra Boost” with dots on a piece of paper. Then I uploaded it to ChatGPT and started a vibe-coding session. In no time I turned the dots into a scatter plot. But, as you know, screen and math coordinate systems differ in the direction of the Y-axis, so my title was upside down.

It seems trivial—barely worth mentioning. And, probably, it is in the world of traditional programming. But in vibe-coding it became quite a nuisance. Like two silly servants in an old Soviet cartoon, the model did two things at once: it changed the dataset and flipped the sign of Y. As a result, I spent quite a while staring at an upside-down title. After a while I figured out what was going on and got a normal scatter plot.
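For the record, the fix is a single flip at conversion time; here is a minimal sketch with made-up pixel coordinates:

```python
import numpy as np
import matplotlib.pyplot as plt

# hypothetical pixel coordinates of the dots (screen Y grows downward)
x_px = np.array([10.0, 20.0, 30.0])
y_px = np.array([40.0, 25.0, 40.0])
img_height = 50.0

# flip once when converting to math coordinates (Y grows upward)
x, y = x_px, img_height - y_px

plt.scatter(x, y)
plt.show()
```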

“And now what?” — probably the question you have in your head right now. Don’t worry—I had the same question. Then I started to think about properties of my approach:

* it can group objects by static features;
* it can find the best fit for several basis functions.

So I needed some interesting basis functions. OK: [1, t, sin(kt), cos(kt)]. The constant lets you move the plot up and down—always useful. t is for trends. The pair sin(kt) and cos(kt) lets us place a harmonic component where we want it; with the right amplitudes you can shift it left or right, since \(C\sin(kt) + D\cos(kt) = R\sin(kt + \varphi)\) with \(R = \sqrt{C^2 + D^2}\) and \(\varphi = \operatorname{atan2}(D, C)\).

Let’s stop here. Where these basis functions show up in our “Extra Boost” title—I’ll explain in the next post.
Scatter - normal orientation
How to find a linear superposition in chaos

Now we have a set of points which, while fairly random from a mathematical point of view, give us a depiction of the “Extra Boost” sign. For my method, I need to find several groups, each represented by a linear combination of basis functions. I set time \(t\) to go from left to right, from 0 to 1. The basis functions are [1, t, sin(kt), cos(kt)], so the extrapolating function is \(f(t) = A + B\,t + C\sin(kt) + D\cos(kt)\).

Weights (A,B,C,D) can be estimated from the dataset using least squares, but we still need to pick k. After a set of experiments I chose k=50: it gives a convenient scale—the wavelength is roughly the width of a letter.
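A minimal numpy sketch of that least-squares estimate; the arrays `t` and `y` stand in for the point coordinates, and the variable names are mine:

```python
import numpy as np

k = 50.0

def design(t):
    # basis columns: [1, t, sin(kt), cos(kt)]
    return np.column_stack([np.ones_like(t), t, np.sin(k * t), np.cos(k * t)])

t = np.linspace(0.0, 1.0, 200)                 # stand-in data for the sketch
y = 0.5 + 0.1 * t + 0.05 * np.sin(50.0 * t)

# least-squares estimate of the weights (A, B, C, D)
A, B, C, D = np.linalg.lstsq(design(t), y, rcond=None)[0]
```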

With this setup I obtained the picture you see at the beginning of the article. Then I decided the tolerance was too large and reduced the band width.

Here we are: a narrow band.

Next, I removed points within the tolerance range and repeated the process. To my surprise, after the first iteration nothing changed.

You can see that the dots disappeared, but the curve didn’t change. After a while I understood why. It was vibe-coding: I asked my iron friend to find a curve that captures the highest number of points; instead, it wrote code that minimizes MSE. That approach has an interesting property: when you delete points lying on the curve, the MSE is unchanged, so the same curve remains optimal.

I told the iron friend that, instead of minimizing squared distance to the points, it should maximize the number of captured points. It proposed the RANSAC approach, which was new to me: repeatedly select four random points, fit the curve, count captured points, and keep the candidate with the most inliers. It worked.
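A sketch of that RANSAC loop, reusing `design`, `t`, and `y` from the previous snippet; the iteration count and tolerance are illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)

def ransac_curve(t, y, n_iter=2000, tol=0.02):
    """Keep the (A, B, C, D) that captures the most points within `tol`."""
    best_w, best_count = None, -1
    for _ in range(n_iter):
        idx = rng.choice(len(t), size=4, replace=False)  # 4 points -> exact solve
        try:
            w = np.linalg.solve(design(t[idx]), y[idx])
        except np.linalg.LinAlgError:
            continue  # degenerate sample, skip it
        captured = int((np.abs(design(t) @ w - y) < tol).sum())
        if captured > best_count:
            best_w, best_count = w, captured
    return best_w, best_count

weights, n_inliers = ransac_curve(t, y)
```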

I ran the process iteratively, and it decomposed the figure into a superposition of functions. Unfortunately, the upper half of “B” wasn’t captured. I suspected the issue was the different heights of lowercase and uppercase letters and created a second version of the drawing.

The same procedure gave me the sign decomposed into eight components, each a superposition of the basis functions.

Finally, I encoded the group number as a 0–1 vector of static features f1,f2,f3 and exported the dataset as CSV. Hooray — now we have data to test the MSE mode of the EXTRA BOOST model.
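A final sketch of the export, assuming the 0–1 vector is the 3-bit binary expansion of the group index 0–7 (the file and column names are mine):

```python
import numpy as np
import pandas as pd

# stand-in data: point coordinates plus a group label 0..7 per point
t = np.array([0.1, 0.2, 0.3])
y = np.array([0.5, 0.6, 0.4])
group = np.array([0, 3, 7])

bits = (group[:, None] >> np.arange(3)) & 1  # 3-bit encoding of the group number
df = pd.DataFrame({"t": t, "y": y,
                   "f1": bits[:, 0], "f2": bits[:, 1], "f3": bits[:, 2]})
df.to_csv("extra_boost_mse.csv", index=False)
```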
The second iteration of the RANSAC approach