#ml #metrics #regression #masters
For regression tasks, Masters advises reserving, in addition to the test set, a separate confidence set, representative of the general population, on which to compute confidence intervals for the errors. If the error distribution is not normal, one can simply use quantiles. In addition, he computes the probability that the confidence intervals themselves will not be violated, using the incomplete beta distribution.
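A minimal sketch of this recipe, under my own assumptions (toy errors and my choice of order statistic; not Masters's code): take an empirical quantile of the held-out errors as the bound, then get the probability that the bound really covers the nominal fraction of future errors via the regularized incomplete beta function, i.e. the beta distribution in scipy.stats.

```python
# Sketch: quantile error bound on a held-out confidence set, plus the
# probability that the bound holds, via the incomplete beta distribution.
import numpy as np
from scipy.stats import beta

rng = np.random.default_rng(0)
errors = np.abs(rng.standard_t(df=3, size=500))  # toy, non-normal errors

n, q = len(errors), 0.95                # nominal coverage level
k = int(np.ceil(q * (n + 1)))           # order statistic used as the bound
upper = np.sort(errors)[k - 1]          # empirical ~95% error bound

# F(X_(k)) ~ Beta(k, n - k + 1), so the probability that the k-th order
# statistic covers at least a fraction q of future errors is the survival
# function of that beta distribution (a regularized incomplete beta):
p_holds = beta.sf(q, k, n - k + 1)
print(f"error bound: {upper:.3f}, P(coverage >= {q:.0%}) = {p_holds:.3f}")
```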
#ml #ensembling #regression #masters
Tim has an interesting RMSE comparison of ensembling methods for a regression task.
"Model Quality:
• Three moderately good models
• These, plus a fourth completely worthless model
• These, plus a fifth good but strongly biased model
Number of training cases:
• A small dataset consisting of 20 cases
• A large dataset consisting of 200 cases
Noise contamination:
• A clean dataset, in which the true values are totally uncontaminated
• A noisy dataset, in which the true values are heavily contaminated with random noise
The columns in the summary table are labeled as follows:
Raw: Mean squared error of a single model
Avg: The predictions are simply averaged
Uncons: Unconstrained linear regression
Unbias: Constrained linear regression with no bias offset term
Bias: Constrained linear regression including a bias offset term
VarWt: Variance weighting
Bag: Bagging, as discussed in Chapter 5
GRNN: General regression neural network smoothing"
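To make the table concrete, here is a hedged sketch of three of these combiners (Avg, Uncons, VarWt) on simulated predictions; the function names and data are mine, and this is a textbook reconstruction rather than Masters's code.

```python
# Sketch of three ensemble combiners from the table: Avg, Uncons, VarWt.
import numpy as np
from sklearn.linear_model import LinearRegression

def combine_avg(preds):
    """Avg: simply average the component models' predictions."""
    return preds.mean(axis=1)

def fit_uncons(preds, y):
    """Uncons: unconstrained linear regression on the predictions."""
    return LinearRegression().fit(preds, y)

def varwt_weights(preds, y):
    """VarWt: weight each model by the inverse of its error variance."""
    mse = ((preds - y[:, None]) ** 2).mean(axis=0)
    w = 1.0 / np.maximum(mse, 1e-12)
    return w / w.sum()

# Toy demo: three moderately good models of the same target.
rng = np.random.default_rng(1)
y = rng.normal(size=200)
preds = y[:, None] + rng.normal(scale=[0.3, 0.5, 0.7], size=(200, 3))

rmse = lambda p: np.sqrt(((p - y) ** 2).mean())
print("Avg RMSE:   ", rmse(combine_avg(preds)))
print("Uncons RMSE:", rmse(fit_uncons(preds, y).predict(preds)))
print("VarWt RMSE: ", rmse(preds @ varwt_weights(preds, y)))
```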
#ml #regression #astroml #cylowess #locfit #gaussianprocess
How to make Linear, Ridge, and Lasso nonlinear by mapping the inputs into a Gaussian basis.
https://notebook.community/DanielAndreasen/Programmers-Club/notebook/Regression%20-%20an%20example%20notebook
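A minimal sketch of the notebook's idea (the centers, width, and regularization strength here are my own choices): expand a 1-D input into Gaussian (RBF) features and fit an ordinary linear model on them; Ridge and Lasso drop in the same way.

```python
# Sketch: make a linear model nonlinear via a Gaussian basis expansion.
import numpy as np
from sklearn.linear_model import Ridge

def gaussian_basis(x, centers, width):
    """Map x of shape (n,) to features exp(-(x - c)^2 / (2 * width^2))."""
    return np.exp(-0.5 * ((x[:, None] - centers[None, :]) / width) ** 2)

rng = np.random.default_rng(0)
x = np.sort(rng.uniform(0, 10, 100))
y = np.sin(x) + 0.1 * rng.standard_normal(100)

centers = np.linspace(0, 10, 20)          # basis function locations
X = gaussian_basis(x, centers, width=0.5)
model = Ridge(alpha=0.1).fit(X, y)        # Lasso/LinearRegression work too
print("train R^2:", model.score(X, y))
```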
Forwarded from Artem Ryblov’s Data Science Weekly (Artem Ryblov)
Thinking Clearly with Data: A Guide to Quantitative Reasoning and Analysis by Ethan Bueno de Mesquita, Anthony Fowler
An introduction to data science or statistics shouldn’t involve proving complex theorems or memorizing obscure terms and formulas, but that is exactly what most introductory quantitative textbooks emphasize. In contrast, Thinking Clearly with Data focuses, first and foremost, on critical thinking and conceptual understanding in order to teach students how to be better consumers and analysts of the kinds of quantitative information and arguments that they will encounter throughout their lives.
Among much else, the book teaches how to assess whether an observed relationship in data reflects a genuine relationship in the world and, if so, whether it is causal; how to make the most informative comparisons for answering questions; what questions to ask others who are making arguments using quantitative evidence; which statistics are particularly informative or misleading; how quantitative evidence should and shouldn’t influence decision-making; and how to make better decisions by using moral values as well as data.
- An ideal textbook for introductory quantitative methods courses in data science, statistics, political science, economics, psychology, sociology, public policy, and other fields
- Introduces the basic toolkit of data analysis―including sampling, hypothesis testing, Bayesian inference, regression, experiments, instrumental variables, differences in differences, and regression discontinuity
- Uses real-world examples and data from a wide variety of subjects
- Includes practice questions and data exercises
Link: https://www.amazon.com/Thinking-Clearly-Data-Quantitative-Reasoning/dp/0691214352
Navigational hashtags: #armknowledgesharing #armbooks
General hashtags: #datascience #correlation #regression #causation #randomizedexperiments #statistics
@data_science_links
#boostings #regression #trees #compositeregressor
In light of recent disappointments over tree-based models' inability to forecast linear combinations of features well, I wrote my own scikit-learn-compatible implementation of CompositeRegressor, which is meant to solve this problem (and does): it composes a linear model first, and then a nonlinear tree-based model on its residuals.
Moreover, the linear model can be built not on all the original features but only on a "stable" subset of them (so that it does not intrude on the nonlinear part, which is better left to the top-level model). A sketch of the idea is shown below.
I am now writing an accompanying article and (despite my laziness and denseness) covering the module with tests, since I want to show readers not only a technique that is useful from a DS standpoint, but also a competent programming implementation ready for production deployment, including at the enterprise level.
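The author's actual CompositeRegressor is not shown here; the following is a minimal sketch of the described idea under my own naming and default model choices.

```python
# Sketch: linear model on a chosen feature subset, tree model on residuals.
import numpy as np
from sklearn.base import BaseEstimator, RegressorMixin, clone
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.linear_model import LinearRegression

class CompositeRegressorSketch(BaseEstimator, RegressorMixin):
    def __init__(self, linear=None, tree=None, linear_features=None):
        self.linear = linear                    # base linear model
        self.tree = tree                        # residual tree model
        self.linear_features = linear_features  # indices of the "stable" subset

    def fit(self, X, y):
        X, y = np.asarray(X), np.asarray(y)
        self.linear_ = clone(self.linear) if self.linear is not None else LinearRegression()
        self.tree_ = clone(self.tree) if self.tree is not None else GradientBoostingRegressor()
        self.cols_ = self.linear_features if self.linear_features is not None else slice(None)
        self.linear_.fit(X[:, self.cols_], y)
        residuals = y - self.linear_.predict(X[:, self.cols_])
        self.tree_.fit(X, residuals)            # tree model sees all features
        return self

    def predict(self, X):
        X = np.asarray(X)
        return self.linear_.predict(X[:, self.cols_]) + self.tree_.predict(X)

# usage (hypothetical): CompositeRegressorSketch(linear_features=[0, 1]).fit(X, y)
```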
#calibration #regression
All right then, moving on: can this logic be generalized to regression? The electronic dunce agrees.
Yes! The same logic that makes post-calibration beneficial for binary classification can also be applied to regression tasks—though it is rarely discussed. Here’s why post-calibration can be valuable for regression and how it can be implemented effectively.
Why Would Regression Benefit from Post-Calibration?
Regression models output continuous values, but just like classification models, they can suffer from systematic miscalibration in their predictions. The key idea is that a model’s predicted values might be biased or have the wrong spread compared to the actual target distribution.
There are two main ways a regression model can be miscalibrated:
1. Mean Shift (Bias)
The model's predictions systematically overestimate or underestimate the true values.
Example: A house price prediction model always underestimates expensive houses.
2. Spread Distortion (Misestimated Variance)
The model's predictions have the wrong level of dispersion:
• Too narrow → underestimated uncertainty (overconfident)
• Too wide → overestimated uncertainty (underconfident)
Example: A temperature prediction model might always predict values within ±2°C, even when actual deviations are much larger.
How to Perform Post-Calibration in Regression?
Instead of calibrating probabilities like in classification, you adjust the magnitude and distribution of the predicted values in regression. This can be done in a few ways:
1. Post-hoc Mean Correction (Bias Calibration)
• Take a held-out calibration set (separate from validation).
• Compute the difference between the predicted and actual values.
• Fit a simple correction model (e.g., linear regression) to adjust the predictions.
This works like Platt scaling in classification but for correcting systematic bias.
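A minimal sketch of this step, with illustrative names and toy shapes (the calibration arrays are assumed to be 1-D numpy arrays):

```python
# Sketch: post-hoc bias calibration of regression predictions.
import numpy as np
from sklearn.linear_model import LinearRegression

def fit_bias_correction(pred_calib, y_calib):
    """Regress true values on held-out predictions; returns the corrector."""
    return LinearRegression().fit(np.asarray(pred_calib).reshape(-1, 1), y_calib)

# usage: corrector = fit_bias_correction(pred_cal, y_cal)
#        corrected = corrector.predict(np.asarray(pred_test).reshape(-1, 1))
```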
2. Quantile Regression Calibration (Fixing Spread)
Instead of just predicting the mean, we fit a secondary model to predict quantiles (e.g., 10th, 50th, 90th percentile).
This helps in cases where the model’s uncertainty is wrong.
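A minimal sketch of this step using scikit-learn's GradientBoostingRegressor with its quantile loss (one of several ways to fit quantile models); the quantile levels are the ones named above.

```python
# Sketch: fit secondary quantile models on a held-out calibration set.
from sklearn.ensemble import GradientBoostingRegressor

def fit_quantile_band(X_calib, y_calib, quantiles=(0.1, 0.5, 0.9)):
    """One quantile-loss model per requested percentile."""
    return {q: GradientBoostingRegressor(loss="quantile", alpha=q).fit(X_calib, y_calib)
            for q in quantiles}

# usage: bands = {q: m.predict(X_test) for q, m in fit_quantile_band(X_cal, y_cal).items()}
```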