Aspiring Data Science

#ml #metrics #brier

As is well known, the Brier score for a binary classifier is essentially the mean squared error between the actual outcomes and the predicted probabilities. In theory it is a number between 0 and 1, where 0 is usually taken to mean perfect calibration (of all events predicted with probability 25%, exactly 25% occurred, and so on). I look at this metric a lot at work, because model calibration matters a great deal, especially when business decisions are made on probabilities. And today I learned something new. I wondered what one can expect at all, in terms of the Brier score, from a model that predicts probabilities perfectly. To find out, let's craft realizations of a million events that follow probabilities known in advance:

import numpy as np

# "true" probabilities of one million events
probs = np.random.uniform(size=1_000_000)
# each event occurs with its own probability
realizations = (np.random.uniform(size=len(probs)) < probs).astype(np.int8)

In theory, we now have an array of ones and zeros, realizations, generated by the "true" probabilities probs. If we flip the situation and treat probs as the probabilities predicted by a machine learning model, and realizations as what we actually observed in real life, then such accuracy should be any ML practitioner's dream!
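A minimal sketch of the natural next step, assuming the plain mean-squared-error definition of the Brier score given above. The seed and the use of `np.random.default_rng` are choices made here for reproducibility, not part of the original snippet.

```python
import numpy as np

rng = np.random.default_rng(42)  # arbitrary seed, an assumption for reproducibility

# "True" probabilities and the outcomes drawn from them, as in the snippet above
probs = rng.uniform(size=1_000_000)
realizations = (rng.uniform(size=probs.size) < probs).astype(np.int8)

# Brier score: mean squared difference between predicted probability and outcome
brier_ideal = float(np.mean((probs - realizations) ** 2))

# For p ~ U(0, 1) the expected value is E[p * (1 - p)] = 1/2 - 1/3 = 1/6
print(f"Brier score of the 'ideal' model: {brier_ideal:.4f} (theory: {1 / 6:.4f})")
```

Even a model that knows the true probabilities exactly does not score 0 here: each outcome is still random, so the score settles near 1/6.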

Wikipedia

Brier score

measure of the accuracy of probabilistic predictions

69 views, edited 22:03

Aspiring Data Science

#ml #classification #probabilistic #brierscore

Coming back to the recent post about the Brier score, to summarize:

1) Brier=0 is reached not merely when the probabilities are perfectly calibrated. For the "zero" examples the predicted probabilities must be exactly zero, and for the "one" examples exactly one.
2) in a real task the Brier score of even a very good model will never reach 0
3) moreover, every task has its own target distribution, so the minimum and maximum achievable Brier scores are DIFFERENT. For example, for the uniform distribution mentioned above, the Brier score of the ideal model tends to 0.166 (1/6), of an irrelevant model to 0.333, and of an "anti-model" to 0.5
4) things get stranger when the target distribution changes. For the "non-normal" and certainly non-uniform target from the picture in the comments, the Brier score of the ideal model is 0.221, of shuffled examples 0.238, and of DummyClassifier (which always predicts the actual frequency of class 1) 0.230.

That is, the absolute difference in Brier scores can be tiny even though what is being compared is an ideal model and "almost random" guessing.

Conclusion: in every case, estimate the bounds of the Brier score, at least by indirect methods, before judging the quality of a model.
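One way to estimate such bounds can be sketched as follows, assuming you can supply (or simulate) a plausible set of true probabilities for your task. The helper name `brier_reference_points` and the uniform example are illustrative assumptions, not the author's code.

```python
import numpy as np


def brier(p, y):
    """Brier score: mean squared difference between probabilities and outcomes."""
    p = np.asarray(p, dtype=np.float64)
    y = np.asarray(y, dtype=np.float64)
    return float(np.mean((p - y) ** 2))


def brier_reference_points(probs, rng):
    """Hypothetical helper: reference Brier scores for a given set of true probabilities."""
    # Simulate outcomes that follow the supplied probabilities
    y = (rng.uniform(size=probs.size) < probs).astype(np.int8)
    return {
        "ideal": brier(probs, y),                          # predicts the true probabilities
        "shuffled": brier(rng.permutation(probs), y),      # same values, unrelated to outcomes
        "anti": brier(1.0 - probs, y),                     # inverted probabilities
        "dummy": brier(np.full(probs.size, y.mean()), y),  # constant observed base rate
    }


rng = np.random.default_rng(0)  # arbitrary seed
bounds = brier_reference_points(rng.uniform(size=1_000_000), rng)
for name, value in bounds.items():
    print(f"{name:>8}: {value:.3f}")
```

For uniform probabilities the first three values should land close to the 1/6, 1/3 and 1/2 quoted above, and the constant predictor near 0.25. Feeding in a different `probs` array shows how narrow the usable range becomes for another target distribution.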


44 views, edited 04:41

Aspiring Data Science

#ml #catboost #metrics #bugs

The morning was spent in heated arguments about precision. I found a suspected bug in how CatBoost computes precision.

https://github.com/catboost/catboost/issues/2422

GitHub

Precision calculation error in Early Stopping. Request to add pos_label. · Issue #2422 · catboost/catboost

Problem: catboost version: 1.2 Operating System: Win CPU: + GPU: + I think that somewhere in the catboost code that computes precision the predictions and the true values are swapped, so early stopping on prec...
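To illustrate why swapped arguments matter, here is a small NumPy sketch. It is not a reproduction of CatBoost's code. The function `precision` and its `pos_label` parameter are hypothetical and written for this example only.

```python
import numpy as np


def precision(y_true, y_pred, pos_label=1):
    """Precision for class pos_label: TP / (TP + FP)."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    predicted_pos = y_pred == pos_label
    if not predicted_pos.any():
        return 0.0  # convention chosen here for "no positive predictions"
    return float(np.mean(y_true[predicted_pos] == pos_label))


y_true = np.array([1, 1, 1, 1, 0, 0, 0, 0, 0, 0])
y_pred = np.array([1, 0, 0, 0, 1, 0, 0, 0, 0, 0])

p_correct = precision(y_true, y_pred)              # TP=1, FP=1 -> 0.5
p_swapped = precision(y_pred, y_true)              # swapped arguments: TP=1, FN=3 -> 0.25
p_class0 = precision(y_true, y_pred, pos_label=0)  # 5 of 8 predicted zeros are correct -> 0.625
print(p_correct, p_swapped, p_class0)
```

With the arguments swapped the function silently returns recall instead of precision, and choosing a different positive class changes the value again, so a metric used for early stopping can end up tracking something other than what was intended.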

121 views, Anatoly Alekseev, edited 11:11

Aspiring Data Science

#optimization #ml #metrics #python #numba #codegems

In short, the sklearn metrics turned out to be too slow, so I had to rewrite them in numba. Here is an example of classification_report that runs about a thousand times faster and supports almost all of the functionality (except weights and micro averaging). I also optimized the auc metric (the algorithm is taken from fastauc) and calibration (I compute predicted vs actual bins, then the mae/std of their differences). On 8M samples everything runs in ~30 milliseconds, except auc, which takes ~300 ms. For comparison, the scikit-learn versions take from several seconds to several tens of seconds.

import numpy as np
from numba import njit


@njit()
def fast_classification_report(
    y_true: np.ndarray, y_pred: np.ndarray, nclasses: int = 2, zero_division: int = 0
):
    """Custom classification report, proof of concept."""

    N_AVG_ARRAYS = 3  # precisions, recalls, f1s

    # storage inits
    weighted_averages = np.empty(N_AVG_ARRAYS, dtype=np.float64)
    macro_averages = np.empty(N_AVG_ARRAYS, dtype=np.float64)
    supports = np.zeros(nclasses, dtype=np.int64)
    allpreds = np.zeros(nclasses, dtype=np.int64)
    misses = np.zeros(nclasses, dtype=np.int64)
    hits = np.zeros(nclasses, dtype=np.int64)

    # count stats
    for true_class, predicted_class in zip(y_true, y_pred):
        supports[true_class] += 1
        allpreds[predicted_class] += 1
        if predicted_class == true_class:
            hits[predicted_class] += 1
        else:
            misses[predicted_class] += 1

    # main calcs
    accuracy = hits.sum() / len(y_true)
    balanced_accuracy = np.nan_to_num(hits / supports, copy=True, nan=zero_division).mean()

    recalls = hits / supports
    precisions = hits / allpreds
    f1s = 2 * (precisions * recalls) / (precisions + recalls)

    # fix nans & compute averages
    i = 0
    for arr in (precisions, recalls, f1s):
        np.nan_to_num(arr, copy=False, nan=zero_division)
        weighted_averages[i] = (arr * supports).sum() / len(y_true)
        macro_averages[i] = arr.mean()
        i += 1

    return hits, misses, accuracy, balanced_accuracy, supports, precisions, recalls, f1s, macro_averages, weighted_averages
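The calibration metric mentioned in this post (predicted vs actual bins, then MAE/std of their differences) is not shown in the post. A plain NumPy sketch of one possible reading follows. The equal-width binning, the default of 10 bins and the name `calibration_mae_std` are assumptions, and this version is not numba-compiled.

```python
import numpy as np


def calibration_mae_std(y_true, y_prob, nbins=10):
    """Hypothetical helper: MAE and std of (observed frequency - mean predicted probability) over bins."""
    y_true = np.asarray(y_true, dtype=np.float64)
    y_prob = np.asarray(y_prob, dtype=np.float64)

    # Equal-width bins on [0, 1]; a probability of exactly 1.0 goes into the last bin
    bin_idx = np.minimum((y_prob * nbins).astype(np.int64), nbins - 1)

    counts = np.bincount(bin_idx, minlength=nbins)
    pred_sums = np.bincount(bin_idx, weights=y_prob, minlength=nbins)
    true_sums = np.bincount(bin_idx, weights=y_true, minlength=nbins)

    filled = counts > 0  # empty bins are skipped
    diffs = true_sums[filled] / counts[filled] - pred_sums[filled] / counts[filled]
    return float(np.abs(diffs).mean()), float(diffs.std())


rng = np.random.default_rng(1)  # arbitrary seed
p = rng.uniform(size=200_000)
y = (rng.uniform(size=p.size) < p).astype(np.int8)

mae_good, std_good = calibration_mae_std(y, p)     # calibrated probabilities
mae_bad, std_bad = calibration_mae_std(y, p ** 2)  # deliberately distorted probabilities
print(f"calibrated: mae={mae_good:.4f}, distorted: mae={mae_bad:.4f}")
```

The calibrated predictions should give a MAE close to zero, while the distorted ones give a clearly larger value.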

153 views, Anatoly Alekseev, edited 19:00

Aspiring Data Science

#sklearn #metrics #optimization #numba

A discussion has started on sklearn's GitHub about whether fast metrics, or even the use of Numba in sklearn, are needed. Perhaps you have an opinion of your own?

GitHub

Speed up classification_report · Issue #26808 · scikit-learn/scikit-learn

Describe the workflow you want to enable I'm concerned with slow execution speed of the classification_report procedure which makes it barely suitable for production-grade workloads. On a 8M sa...

114 views, Anatoly Alekseev, edited 18:41

Aspiring Data Science

#ml #mlops #mlflow #me #metrics #multimodel

This talk really resonated with me. I am currently building exactly such a system, with multiple metrics and several models of different classes. I am even adding ensembles right away. The point about ME (Maximum Error) as a mandatory regression metric seems very useful; I had never heard of it before. For my part, I would add something calibration-related to the mandatory classification metrics: MAE/std over the bins of the calibration curve, for example.
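For reference, ME is simply the largest absolute residual. A minimal NumPy sketch with made-up numbers follows (scikit-learn also provides `sklearn.metrics.max_error`; the function below is a standalone illustration):

```python
import numpy as np


def max_error(y_true, y_pred):
    """Maximum Error (ME): the largest absolute residual."""
    y_true = np.asarray(y_true, dtype=np.float64)
    y_pred = np.asarray(y_pred, dtype=np.float64)
    return float(np.max(np.abs(y_true - y_pred)))


# Illustrative values only
y_true = np.array([3.0, 5.0, 2.5, 7.0])
y_pred = np.array([2.5, 5.0, 4.0, 12.0])

me = max_error(y_true, y_pred)                  # residuals 0.5, 0, 1.5, 5 -> 5.0
mae = float(np.mean(np.abs(y_true - y_pred)))   # 1.75
print(me, mae)
```

The example shows why ME complements averaged metrics: a MAE of 1.75 hides a single prediction that is off by 5.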

https://www.youtube.com/watch?v=VJWrSTAlxEs

YouTube

Andrey Zubkov - Things you can't live without when running ML in production

Data Fest 2023:
https://ods.ai/events/datafestonline2023
"MLOps" track:
https://ods.ai/tracks/df23-mlops

Our social networks:
Telegram: https://t.me/datafest
VKontakte: https://vk.com/datafest

178 views, Anatoly Alekseev, edited 21:53

Aspiring Data Science

#metrics #mse

Why is MSE so popular? The reasons are mostly based on theoretical properties, although there are a few properties that have value in some situations. Here are some of the main advantages of MSE as a measure of the performance of a model:

• It is fast and easy to compute.
• It is continuous and differentiable in most applications. Thus, it will be well behaved for most optimization algorithms.
• It is very intuitive in that it is simply an average of errors. Moreover, the squaring causes large errors to have a larger impact than small errors, which is good in many situations.
• Under commonly reasonable conditions (the most important being that the distribution is normal or a member of a related family), parameter estimates computed by minimizing MSE also have the desirable statistical property of being maximum likelihood estimates. This loosely means that of all possible parameter values, the one computed is the most likely to be correct.

We see that MSE satisfies the theoretical statisticians who design models, it satisfies the numerical analysts who design the training algorithms, and it satisfies the intuition of the users. All of the bases are covered.
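The maximum likelihood property in the last bullet can be spelled out in a short derivation. The notation is introduced here as an assumption: a model f(x; θ) with independent, normally distributed errors of constant variance σ².

```latex
% Assumption: y_i = f(x_i; \theta) + \varepsilon_i, with \varepsilon_i \sim \mathcal{N}(0, \sigma^2) i.i.d.
\begin{align}
  \log L(\theta)
    &= \sum_{i=1}^{n} \log\!\left[ \frac{1}{\sqrt{2\pi\sigma^2}}
       \exp\!\left( -\frac{\bigl(y_i - f(x_i;\theta)\bigr)^2}{2\sigma^2} \right) \right] \\
    &= -\frac{n}{2}\log\bigl(2\pi\sigma^2\bigr)
       - \frac{1}{2\sigma^2} \sum_{i=1}^{n} \bigl(y_i - f(x_i;\theta)\bigr)^2 , \\
  \hat{\theta}_{\mathrm{ML}}
    &= \arg\max_{\theta} \log L(\theta)
     = \arg\min_{\theta} \frac{1}{n} \sum_{i=1}^{n} \bigl(y_i - f(x_i;\theta)\bigr)^2
     = \arg\min_{\theta} \mathrm{MSE}(\theta).
\end{align}
```

The first term does not depend on θ and the factor 1/(2σ²) is positive, so maximizing the likelihood and minimizing MSE select the same parameters.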

122 views, Anatoly Alekseev, edited 17:21

Aspiring Data Science

#ml #metrics #masters

It has been seen that good average or total performance in the training set is not the only important optimization consideration. Consistent performance is also important. It encourages good performance outside the training set, and it provides stability as models are evolved by selective updating. An often easy way to encourage consistency is to stratify the training set, compute for each stratum a traditional optimization criterion like one of those previously discussed, and then let the final optimization criterion be a function of the values for the strata.

The power of stratification can sometimes be astounding. It is surprisingly easy to train a model that seems quite good on a training set only to discover later that its performance is spotty. It may be that the good average training performance was based on spectacular performance in part of the training set and mediocre performance in the rest. When the conditions that engender mediocrity appear in real life and the model fails to perform up to expectations, it is only then that the researcher may study its historical performance more closely and discover the problem. It is always better to discover this sort of problem early in the design process.
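A minimal sketch of such a stratified criterion, assuming MSE as the per-stratum criterion and the worst stratum as the combining function. Both choices, and the name `stratified_criterion`, are illustrative rather than prescribed by the text.

```python
import numpy as np


def stratified_criterion(y_true, y_pred, strata, combine=np.max):
    """Hypothetical helper: per-stratum MSE combined into one value (worst stratum by default)."""
    y_true = np.asarray(y_true, dtype=np.float64)
    y_pred = np.asarray(y_pred, dtype=np.float64)
    strata = np.asarray(strata)
    # One MSE value per distinct stratum label, in sorted label order
    per_stratum = np.array([
        np.mean((y_true[strata == s] - y_pred[strata == s]) ** 2)
        for s in np.unique(strata)
    ])
    return float(combine(per_stratum)), per_stratum


# Illustrative data: perfect on stratum 0, poor on stratum 1
y_true = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
y_pred = np.array([1.0, 2.0, 3.0, 6.0, 3.0, 8.0])
strata = np.array([0, 0, 0, 1, 1, 1])

overall_mse = float(np.mean((y_true - y_pred) ** 2))              # 2.0
worst, per_stratum = stratified_criterion(y_true, y_pred, strata)  # worst = 4.0
print(overall_mse, worst, per_stratum)
```

The overall MSE of 2.0 averages a flawless stratum with a poor one, while the stratified value of 4.0 exposes the weak region. Passing another `combine` function (for example `np.mean`) gives a softer criterion.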

120 views, Anatoly Alekseev, 10:01

Aspiring Data Science

#ml #metrics #regression #masters

For regression tasks, Masters advises reserving, in addition to the test set, a separate confidence set that is representative of the population, and computing confidence intervals for the errors on it. If the error distribution is not normal, one can simply use quantiles. In addition, he computes the probabilities that the confidence intervals themselves will not be violated, using the incomplete beta function for this.
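A sketch of one possible reading of this procedure. It relies on the standard order-statistic result that the population fraction lying below the m-th smallest of n samples follows Beta(m, n - m + 1). It is not necessarily Masters' exact method. The errors are simulated stand-ins, and the name `prob_coverage_below` is hypothetical.

```python
import math

import numpy as np


def prob_coverage_below(n, m, c):
    """P(true coverage of the m-th smallest of n errors is below c), for 0 < c < 1.

    The coverage follows Beta(m, n - m + 1); its CDF (the regularized incomplete
    beta function) equals the binomial tail P(Bin(n, c) >= m), summed here in log space.
    """
    log_c, log_1c = math.log(c), math.log1p(-c)
    total = 0.0
    for k in range(m, n + 1):
        log_pmf = (math.lgamma(n + 1) - math.lgamma(k + 1) - math.lgamma(n - k + 1)
                   + k * log_c + (n - k) * log_1c)
        total += math.exp(log_pmf)
    return total


rng = np.random.default_rng(7)  # arbitrary seed
# Stand-in for absolute errors on the confidence set (heavy-tailed, not normal)
abs_errors = np.abs(rng.standard_t(df=3, size=1000))

n, m = abs_errors.size, 950
bound = float(np.sort(abs_errors)[m - 1])  # m-th smallest error: empirical 95% bound

p_fail = prob_coverage_below(n, m, 0.93)   # chance the bound really covers less than 93%
print(f"error bound: {bound:.3f}, P(true coverage < 93%): {p_fail:.4f}")
```

The quantile gives a distribution-free error bound, and the beta-based probability says how much to trust it: with 1000 confidence cases, the chance that a nominal 95% bound actually covers less than 93% of future errors should be small.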

130 views, Anatoly Alekseev, 16:45

Aspiring Data Science

#trading #backtesting #metrics

https://www.youtube.com/watch?v=3WXHFjQFGYs

YouTube

4 - Performance Metrics | Quant Trading in Futures

How do we evaluate a trading strategy?
What metrics measure risk-adjusted returns in terms of volatility?
What metrics measure risk-adjusted returns in terms of worst-case-scenario loss?
What metrics measure risk-adjusted returns in terms of correlation to…

127 views, Anatoly Alekseev, 13:09
