Aspiring Data Science

#ml #featureengineering #geofeatures #advicewanted Есть задачка на генерацию геофичей. Юзер логинится в приложение в разных точках города, Известны его координаты при логине и метки времени. Какие бы интересные фичи построить из графа его перемещений? Пока…

#ml #gbm #catboost #quantileloss

https://towardsdatascience.com/a-new-way-to-predict-probability-distributions-e7258349f464

Medium

A New Way to Predict Probability Distributions

Exploring multi-quantile regression with Catboost

59 views07:56

Aspiring Data Science

#ml #uncertainty #catboost #medicine #blood

Всё-таки иногда попадаются и качественные научные работы ML-тематики. Зацените строгость подхода, всё сделано по лучшим практикам.

"Code for the analysis can be found at https://github.com/oizin/glucose-data-driven-prediction.
Model validation

The dataset is randomly split into a 70% training (13 279 ICU admissions) and 30% test (5682 ICU admissions) sets. Sample splits are performed by ICU admission ID to avoid potential information leakage. We evaluate all models on the test set only after finalization of hyperparameter settings to ensure unbiased assessments of model generalizability. As the algorithms were computationally expensive to train, we perform hyperparameter tuning by randomly splitting the training set into 80% development and 20% validation sets."

Ну разве что до SHAP всё-таки не дотянули. А сама работа меня заинтересовала тем, что там сравнивается мультиквантильная регрессия с "регрессией с неопределённостью" :

We develop 2 ML approaches using the Catboost gradient boosting library.39 These models were chosen as they present alternative approaches to predicting both a point estimate and uncertainty quantification through probabilistic forecasting. The first is a Catboost regression model with dual estimation of the expected outcome and the standard deviation of the prediction distribution, the ‘uncertainty regression’ model.43 This form of estimation can be performed using the class CatBoostRegressor with the argument loss_function=“RMSEWithUncertainty” in the Python version of Catboost 2.4. The second model is a combination of quantile regressions with models for quantiles of 0.025, 0.5, and 0.975, the “quantile regression” model.

Квантили дали вот какое преимущество:

In order to have clinical utility, it is important that the model can detect hyperglycemia and hypoglycemia. Detection of hyperglycemia was only slightly worse than values in the ICU normal blood glucose range. However, similar to previous research, our point estimates were unable to detect hypoglycemia at 2-hour forecasts.35 However, by forecasting an interval, we increase the potential to flag circumstances in which hypoglycemia is a risk, with 41% of hypoglycemic events captured within the prediction intervals.

Если Вы использовали одну из таких функций потерь в работе, буду рад, если поделитесь выводами об их полезности.

https://www.ncbi.nlm.nih.gov/pmc/articles/PMC8324237/

GitHub

GitHub - oizin/glucose-data-driven-prediction: Code for paper: Incorporating real-world evidence into the development of patient…

Code for paper: Incorporating real-world evidence into the development of patient blood glucose prediction algorithms for the ICU - GitHub - oizin/glucose-data-driven-prediction: Code for paper: In...

124 viewsAnatoly Alekseev, edited 22:01

Aspiring Data Science

#ml #catboost #metrics #bugs

Утро прошло в жарких спорах о точности. Нашёл предположительный баг в том, как катбуст считает precision.

https://github.com/catboost/catboost/issues/2422

GitHub

Precision calculation error in Early Stopping. Request to add pos_label. · Issue #2422 · catboost/catboost

Problem: catboost version: 1.2 Operating System: Win CPU: + GPU: + Я думаю, в коде catboost вычисляющем precision где-то перепутаны предсказания и истинные значения, поэтому ранняя остановка по точ...

121 viewsAnatoly Alekseev, edited 11:11

Aspiring Data Science

#catboost

В Катбусте тоже всем пофигу на баги, похоже. Уже вторую неделю висит issue, что с Precision и F1 в early stopping модели не обучаются из-за неправильного дефолта при расчёте точности. Всем насрать, хотя и в чате у них этот вопрос обсудили, и даже с другим юзером из чата сами нашли причину. На производительность тоже пофиг, roc_auc у них считается даже немного медленнее, чем в sklearn. На мой пост о том, что с помощью numba и алгоритма из fastauc можно запросто ускорить её расчёт в 8 раз никто из команды не отреагировал. Я был об этой команде лучшего мнения, видимо, зря.

GitHub

Precision calculation error in Early Stopping. Request to add pos_label. · Issue #2422 · catboost/catboost

109 viewsAnatoly Alekseev, 19:41

Aspiring Data Science

#catboost

Спустя почти полгода в катбусте признали и исправили ошибку, о которой зарепортил ещё в июле и которая не позволяла валидироваться на точность (precision) - согласитесь, это жизненно необходимо для многих задач. Что делать, смеяться или плакать? И это наши лучшие разработчики. Такая скорость реакции при том, что я это в катбустовый чат ещё написал, и они это видели. И мы с другим пользователем чата нашли прямо там причину.

GitHub

Precision calculation error in Early Stopping. Request to add pos_label. · Issue #2422 · catboost/catboost

89 viewsAnatoly Alekseev, edited 19:02

About

Blog

Apps

Platform