Aspiring Data Science
An economist's notes on programming, forecasting, decision-making, and the scientific method.
Contact: @fingoldo

I call myself a data scientist because I know just enough math, economics & programming to be dangerous.
#featureselection #masters #mlgems

In Tim Masters' book "Data Mining Algorithms in C++" I came across this curious modification of forward selection:

Forward Selection Preserving Subsets

"There is a straightforward extension of forward stepwise selection that can often produce a significant improvement in performance at little cost. We simply preserve the best few candidates at each step, rather than preserving just the single best. For example, we may find that X4, X7, and X9 are the three best single variables. (Three is an arbitrary choice made by the developer, considering the trade-off between quality and compute time.) We then test X4 paired with each remaining candidate, X7 paired with each, and finally X9 paired with each. Of these many pairs tested, we identify the best three pairs. These pairs will each be tested with the remaining candidates as trios, and so forth. The beauty of this algorithm is that we gain a lot with relatively little cost. The chance of missing an important combination is greatly reduced, while compute time goes up linearly, not exponentially. I highly recommend this approach."
#featureselection #masters #mlgems #chisquare #cramerv

The chi-square test need not be restricted to categorical variables. It is legitimate to partition the range of numeric variables into bins and treat these bins as if they were categories. Of course, this results in some loss of information because variation within each bin is ignored. But if the data is noisy or if one wants to detect relationship patterns of any form without preconceptions, a chi-square formulation may be appropriate.

Chi-squared itself has little intuitive meaning in terms of its values. It is highly dependent on the number of cases and the number of bins for each variable, so any numeric value of chi-squared is essentially uninterpretable. This can be remedied by a simple monotonic transformation to produce a quantity called Cramer’s V.
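
For reference, V = sqrt(chi2 / (n * (min(rows, cols) - 1))), which maps chi-square into roughly the [0, 1] range regardless of sample size and binning. A minimal sketch of the binned formulation, assuming quantile bins and scipy's chi2_contingency; the toy data and the 10x10 binning are my choices.

```python
import numpy as np
from scipy.stats import chi2_contingency

rng = np.random.default_rng(0)
x = rng.normal(size=5000)
y = np.sin(x) + rng.normal(scale=0.5, size=5000)  # nonlinear, non-monotonic link

# Partition each numeric variable's range into 10 quantile bins.
x_bins = np.digitize(x, np.quantile(x, np.linspace(0, 1, 11)[1:-1]))
y_bins = np.digitize(y, np.quantile(y, np.linspace(0, 1, 11)[1:-1]))
table = np.zeros((10, 10))
for i, j in zip(x_bins, y_bins):
    table[i, j] += 1  # contingency table of bin co-occurrences

chi2, p, dof, _ = chi2_contingency(table)
n = table.sum()
# Cramer's V: a monotonic transformation of chi-square that is comparable
# across sample sizes and bin counts, unlike raw chi-square.
v = np.sqrt(chi2 / (n * (min(table.shape) - 1)))
print(f"chi2={chi2:.1f}, p={p:.3g}, Cramer's V={v:.3f}")
```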
#ml #masters #cv #refit

I am reading Tim Masters' books "Data Mining Algorithms in C++" and "Assessing and Improving Prediction and Classification". He is not a typical ML practitioner; it seems he worked everything out on his own, and at times his ideas run very deep.

"The importance of consistent performance is often ignored, with average performance being the focal point instead. However, in most cases, a model that performs fairly well across the majority of training cases will ultimately outperform a model that performs fabulously most of the time but occasionally fails catastrophically. Properly designed training sets and optimization criteria can take consistency into account."

And it's true: who actually looks at the spread of metrics across CV folds? Practically no one. Even the procedures in the model_selection module (GridSearchCV and so on) by default take the plain arithmetic mean of the metrics over the test splits. Yet consistency of metrics can be a very important property in real applications. Probably the best approach when comparing models is to subtract from the mean metric its standard deviation (scaled by some coefficient, e.g. divided by 2). In sklearn this can be done by passing a custom refit callable to the model_selection procedure, as sketched below.
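
A minimal sketch of such a refit callable, assuming a penalty coefficient of 0.5; the toy data, model, and grid are illustrative.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV


def consistency_refit(cv_results):
    """Rank candidates by mean test score minus half its std across folds."""
    score = cv_results["mean_test_score"] - 0.5 * cv_results["std_test_score"]
    return int(np.argmax(score))  # GridSearchCV expects the best candidate's index


X, y = make_classification(n_samples=500, random_state=0)
search = GridSearchCV(
    RandomForestClassifier(random_state=0),
    param_grid={"max_depth": [2, 4, 8], "n_estimators": [50, 100]},
    refit=consistency_refit,  # callable receiving cv_results_, returning best_index_
    cv=5,
)
search.fit(X, y)
print(search.best_index_, search.best_params_)
```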
#ml #masters #bayes #hypothesistesting

More interesting thoughts.

"Many applications require not only a prediction but a measure of confidence in the decision as well. Some developers prefer a hypothesis-testing approach, while others favor Bayesian methods. The truth is that whenever possible, both methods should be used, as they provide very different types of information. And most people ignore a critical additional step: computing confidence in the confidence figure!"

"Often, an essential part of development is estimating model parameters for examination. The utility of these estimates is greatly increased if you can also compute the bias and variance of the estimates, or even produce confidence intervals for them."
#ml #masters #ensembling #featureengineering #entropy

Continuing.

"A common procedure is to train several competing models on the same training set and then choose the best performer for deployment. However, it is usually advantageous to use all of the competing models and intelligently combine their predictions or class decisions to produce a consensus opinion."

"It is not widely known that the entropy of a predictor variable can have a profound impact on the ability of many models to make effective use of the variable. Responsible researchers will compute the entropy of every predictor and take remedial action if any predictor has low entropy."

The first idea is not new: in competitions everyone stacks models. But again, it is still not standard practice in ML, and sklearn, for one, simply discards all models except the "best" one; there isn't even an option to keep the rest, let alone use them jointly.

The entropy-based approach to selecting and preprocessing predictors, however, is original; I have not come across this idea anywhere else. What does the classic recipe offer us? Generate as many candidate features of arbitrary nature as you can, until your model chokes on resources. But one can act smarter. The idea also applies when combining several features: for example, keep only those combinations whose entropy exceeds that of their parents. A sketch of the entropy check follows.
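
A minimal sketch, assuming equal-width bins and flagging predictors whose entropy falls below half the maximum; the bin count and threshold are arbitrary choices, not Masters' numbers.

```python
import numpy as np


def relative_entropy(x, n_bins=20):
    """Entropy of the binned variable as a fraction of its maximum, log2(n_bins)."""
    counts, _ = np.histogram(x, bins=n_bins)
    p = counts[counts > 0] / counts.sum()
    return -(p * np.log2(p)).sum() / np.log2(n_bins)


rng = np.random.default_rng(0)
predictors = {
    "uniform": rng.uniform(size=10_000),                 # high entropy
    "lognormal": rng.lognormal(sigma=3.0, size=10_000),  # heavy tail, low entropy
    "mostly_constant": (rng.uniform(size=10_000) > 0.99).astype(float),
}
for name, x in predictors.items():
    h = relative_entropy(x)
    flag = "  <- low entropy, consider remedial action (e.g. a log transform)"
    print(f"{name}: {h:.3f}" + (flag if h < 0.5 else ""))
```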
#trading #masters #aronson

I am reading David Aronson and Timothy Masters' book "Statistically Sound Machine Learning for Algorithmic Trading of Financial Instruments". They give quite compelling reasons why a trading system should be automated.

• Intelligently designed automated trading systems can and often do outperform human-driven systems. An effective data-mining program can discover subtle patterns in market behavior that most humans would not have a chance of seeing.

• An automated system is absolutely repeatable, while a human-driven system is subject to human whims. Consistency of decision-making is a vital property of a system that can consistently show a profit. Repeatability is also valuable because it allows examination of trades in order to study operation and perhaps improve performance.

• Most properly designed automated trading systems are amenable to rigorous statistical analysis that can assess performance measures such as expected future performance and the probability that the system could have come into existence due to good luck rather than true power.

• Unattended operation is possible.
#ensembling #masters

An explanation of why ensembles are useful:

"Because the different models will generally be based on different training data, the patterns of noise will be different in the training sets. As a result, if the models are overly powerful and thereby learn to predict noise in addition to authentic patterns, they will generally make different predictions of the noise component. When these predictions are combined via a committee, the noise predictions will tend to cancel, while the predictions based on authentic patterns will tend to reinforce."

This is especially relevant to CV, and it is exactly why the "intermediate" models from the CV folds should not be thrown away; see the sketch below.
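
A minimal sketch of keeping the fold models, assuming sklearn's cross_validate with return_estimator=True; the boosted model and toy data are illustrative.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import cross_validate, train_test_split

X, y = make_regression(n_samples=1000, noise=10.0, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

cv = cross_validate(
    GradientBoostingRegressor(random_state=0),
    X_train, y_train, cv=5,
    return_estimator=True,  # keep the "intermediate" fold models
)
# Committee prediction: average over the five fold models. The noise each model
# learned from its own training folds tends to cancel in the average.
pred = np.mean([est.predict(X_test) for est in cv["estimator"]], axis=0)
print("committee RMSE:", np.sqrt(np.mean((pred - y_test) ** 2)))
```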
#ml #metrics #masters

It has been seen that good average or total performance in the training set is not the only important optimization consideration. Consistent performance is also important.
It encourages good performance outside the training set, and it provides stability as models are evolved by selective updating. An often easy way to encourage consistency is to stratify the training set, compute for each stratum a traditional optimization criterion like one of those previously discussed, and then let the final optimization criterion be a function of the values for the strata.

The power of stratification can sometimes be astounding. It is surprisingly easy to train a model that seems quite good on a training set only to discover later that its performance is spotty. It may be that the good average training performance was based on spectacular performance in part of the training set and mediocre performance in the rest. When the conditions that engender mediocrity appear in real life and the model fails to perform up to expectations, it is only then that the researcher may study its historical performance more closely and discover the problem. It is always better to discover this sort of problem early in the design process.
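
A minimal sketch of a stratified criterion, assuming per-stratum accuracy and a mean-minus-std combination rule; the strata and the deliberately spoiled third segment are synthetic.

```python
import numpy as np
from sklearn.metrics import accuracy_score


def stratified_criterion(y_true, y_pred, strata):
    """Mean per-stratum accuracy, penalized by its spread across strata."""
    scores = np.array([
        accuracy_score(y_true[strata == s], y_pred[strata == s])
        for s in np.unique(strata)
    ])
    return scores.mean() - scores.std()


rng = np.random.default_rng(0)
y_true = rng.integers(0, 2, size=600)
y_pred = y_true.copy()
strata = np.repeat([0, 1, 2], 200)  # e.g. market regime, time period, segment
flip = (rng.uniform(size=200) < 0.4).astype(y_pred.dtype)
y_pred[strata == 2] ^= flip  # the model is only mediocre in stratum 2

print("plain accuracy:", accuracy_score(y_true, y_pred))  # looks fine on average
print("stratified criterion:", stratified_criterion(y_true, y_pred, strata))
```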
#ml #metrics #regression #masters

For regression tasks, Masters advises reserving, in addition to the test set, a separate confidence set, representative of the population, on which to compute confidence intervals for the errors. If the error distribution is not normal, plain quantiles can be used. On top of that, he computes the probability that the confidence intervals themselves will hold, using the incomplete beta distribution for this.
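
A minimal sketch of that logic: the true coverage of the k-th smallest of n errors follows a Beta(k, n - k + 1) distribution, so the regularized incomplete beta function (scipy's beta distribution) gives the probability that the empirical bound really holds. The heavy-tailed toy errors and the 95% level are illustrative; the exact computation in the book may differ.

```python
import numpy as np
from scipy.stats import beta

rng = np.random.default_rng(0)
errors = rng.standard_t(df=3, size=500)  # heavy-tailed, clearly non-normal errors
n = errors.size

q = 0.95
k = int(np.ceil(q * n))          # rank of the order statistic used as the bound
upper = np.sort(errors)[k - 1]   # empirical 95% error bound from the confidence set
print(f"95% error bound: {upper:.3f}")

# True coverage of this bound is Beta(k, n - k + 1) distributed; the probability
# that it really covers at least a fraction q of future errors:
p_ok = beta.sf(q, k, n - k + 1)
print(f"P(true coverage >= {q}): {p_ok:.3f}")
```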
#ml #modelling #evaluation #masters

How do we estimate the probability that a trained model achieved its results by mere chance?

Masters suggests retraining the model N times, shuffling the target each time, and counting the fraction of runs in which the "permuted" model's metric beats the original one's. Something like Boruta, but applied to the target rather than the predictors.

Moreover, he decomposes the gain from using a model into three components: true predictive ability, the inherent bias of the data, and the bias introduced by training:

TrainingGain = Ability + InherentBias + TrainingBias

Since the "permuted" models are given no authentic relationships between the predictors and the target, the first component is absent for them:

PermutedGain = InherentBias + TrainingBias

Accordingly, by comparing the metrics of the original and permuted models, we can estimate the true predictive ability and the contribution of noise to the result already on the training set.

Often the inherent bias can be obtained analytically; in financial problems, for instance, it may be the return of buy & hold in a rising market. Then one can estimate

TrainingBias = PermutedGain - InherentBias

"It tells us how much the process of training optimistically inflates the gain. If the TrainingBias is large, this constitutes evidence that the model may be too powerful relative to the amount of noise in the data. This information can be particularly handy early in model development, when it is used to compare some competing modeling methodologies to judge whether additional data is required.

Remember that permutation can also be used with cross validation, walk-forward testing, and most other out-of-sample evaluation methods in order to compute the probability that good observed out-of-sample results could have arisen from a worthless model. The principle is exactly the same, although of course it makes no sense to eliminate training bias from out-of-sample results."
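
A minimal sketch of the target-permutation test on the training set, assuming in-sample accuracy as the gain; the logistic model and 100 permutations are my illustrative choices.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

X, y = make_classification(n_samples=400, random_state=0)
rng = np.random.default_rng(0)


def training_gain(X, y):
    """In-sample accuracy, standing in for Masters' per-case gain."""
    model = LogisticRegression(max_iter=1000).fit(X, y)
    return accuracy_score(y, model.predict(X))


original = training_gain(X, y)
permuted = np.array([training_gain(X, rng.permutation(y)) for _ in range(100)])

# p-value: probability that a worthless model does at least this well by luck;
# the +1s count the unpermuted run itself among the permutations.
p_value = (1 + (permuted >= original).sum()) / (1 + len(permuted))
print(f"original gain = {original:.3f}")
print(f"mean permuted gain = {permuted.mean():.3f}  (= InherentBias + TrainingBias)")
print(f"estimated Ability = {original - permuted.mean():.3f}, p-value = {p_value:.3f}")
```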

A worked example with a weak classifier:

"Called fraud 141 of 100000 (0.14 percent)
Actual fraud 0.94 percent
p = 0.57600
Original gain = 0.55370 with original inherent bias = 0.52818
Mean permuted gain = 0.56517
Mean permuted inherent bias = 0.50323
Training bias = 0.06194 (0.56517 minus 0.50323)
Unbiased actual gain = 0.49176 (0.55370 minus 0.06194)
Unbiased gain above inherent bias = -0.01147 (0.49176 minus 0.50323)

In this no-power example, the number of cases called fraud (0.14 percent) is quite different from the actual number (0.94 percent). Such variation is common when there is little predictive power. The most important number in this table is the p-value, the probability that a worthless model could have performed as well or better just by luck. This p-value is 0.576, a strong indication that the model could be worthless. You really want to see a tiny p-value, ideally 0.01 or less, in order to have solid confidence in the model.

The per-case gain of the unpermuted model is 0.55370, which may seem decent until you notice that 0.52818 of it is inherent bias. So even after optimization of performance, it barely beats what a coin-toss model would achieve. Things get even worse when you see that the mean gain for permuted data is 0.56517, better than the original model!"