Aspiring Data Science

#ensembling #hpo #hpt #autosklearn

Вот какой интересный метод ансамблирования опробовали авторы оптимизатора auto-sklearn:

"Two important problems in AutoML are that (1) no single machine learning method performs best on all datasets and (2) some machine learning methods (e.g., non-linear SVMs) crucially rely on hyperparameter optimization.

While Bayesian hyperparameter optimization is data-efficient in finding the bestperforming hyperparameter setting, we note that it is a very wasteful procedure when the goal is simply to make good predictions: all the models it trains during the course of the search are lost, usually including some that perform almost as well as the best.

Rather than discarding these models, we propose to store them and to use an efficient post-processing method (which can be run in a second process on-the-fly) to construct an ensemble out of them. This automatic ensemble
construction avoids to commit itself to a single hyperparameter setting and is thus more robust (and less prone to overfitting) than using the point estimate that standard hyperparameter optimization yields. To our best knowledge, we are the first to make this simple observation, which can be applied to improve any Bayesian hyperparameter optimization method.

It is well known that ensembles often outperform individual models [24, 31], and that effective ensembles can be created from a library of models [9, 10]. Ensembles perform particularly well if the models they are based on (1) are individually strong and (2) make uncorrelated errors [6]. Since this is much more likely when the individual models are different in nature, ensemble building is particularly well suited for combining strong instantiations of a flexible ML framework.

However, simply building a uniformly weighted ensemble of the models found by Bayesian optimization does not work well. Rather, we found it crucial to adjust these weights using the predictions of all individual models on a hold-out set. We experimented with different approaches to optimize these weights: stacking [44], gradient-free numerical optimization, and the method ensemble selection [10].

While we found both numerical optimization and stacking to overfit to the validation set and to be computationally costly, ensemble selection was fast and robust . In a nutshell, ensemble selection (introduced by Caruana et al. [10]) is a greedy procedure that starts from an empty ensemble and then iteratively adds the model that minimizes ensemble validation loss (with uniform weight, but allowing for repetitions). We used this technique in all our experiments—building an ensemble of size 50 using selection with replacement [10]. We calculated the ensemble loss using the same validation set that we use for Bayesian optimization."

100 viewsAnatoly Alekseev, edited 05:48

#autosklearn #hpo #hpt #automl

"Fig. 6.3 Average rank of all four Auto-sklearn variants (ranked by balanced test error rate (BER)) across 140 datasets.

Note that ranks are a relative measure of performance (here, the rank of all methods has to add up to 10), and hence an improvement in BER of one method can worsen the rank of another.

Due to the small additional overhead that meta-learning and ensemble selection cause, vanilla Auto-sklearn is able to achieve the best rank within the first 10s as it produces predictions before the other Auto-sklearn variants finish training their first model. After this, meta-learning quickly takes off."

Получается, лучше всего "докидывает" как раз мета обучение, а потом ещё и ансамблирование.

109 viewsAnatoly Alekseev, edited 15:41

Aspiring Data Science

#timeseries #ensembling #hetboost #pmdarima #todo

Вот попался классный пример, где идея "гетерогенного бустинга" отлично отрабатывает.

Лектор на синтетике сравнивает ариму и ансамбль линрег+дерево.

В задачах на временные ряды декомпозиция на тренд, сезонность и остаточные нерегулярный сигнал очевидна и необходима, но можно посмотреть на проблему в общем - классы моделей имеют свои ограничения (деревянные модели регрессии, к примеру, плохо моделируют линейные зависимости), и обучение модели одного класса на невязках модели другого класса способно показать отличные результаты.

В то же время, сейчас самыми распространёнными методами ансамблирования являются стэкинг (когда для моделей последующего уровня меняется признаковое пространство) и гомогенный бустинг (например, градиентный над деревьями в catboost/xgboost/lightgbm), а вот идею бустинга гетерогенного как будто никто и не рассматривает, и как будто бы нет опенсорсных реализаций.

Истоки такого предубеждения, похоже, растут из ранних статей о бустинговом подходе над "слабыми моделями" (weak learners). Выбор именно слабых моделей аргументировался контролем переобучения, равномерностью шагов процесса обучения, фокусом на сложных для предсказания примерах (которые более сильная модель могла бы просто запомнить).

Мне кажется, "слабость" и одинаковость участников ансамбля не всегда благо, и на практике есть смысл в каждой конкретной задаче проверять (на CV) наиболее выгодный подход к ансамблированию, от простого усреднения моделей и ensemble selection (который мы недавно рассматривали) до стэкинга и двух видов бустинга, одно- и разнородного.

На этот год планирую сравнительное исследование )

Видимо, относительно небольшая статья о том, как стать лучше в DS, которую я подготовил, столкнувшись с неспособностью современных библиотек градиентного бустинга хорошо смоделировать простую зависимость Y=X, вырастет в большое сравнение алгоритмов ансамблирования.

Постараюсь захватить Ensemble Selection (1, 2, 3), опции ансамблирования рассмотренные в #kagglebook (1, 2, 3), и Cascade Generalization/Cascading/Delegating (or Selective Routing)/Arbitrating.

Aspiring Data Science

🔥4👍1

136 viewsAnatoly Alekseev, edited 18:14

About

Blog

Apps

Platform