#books #kagglebook #ctf
Reading The Kaggle Book — it turns out Anthony Goldbloom is an economist by training, just like me :)
And Jeremy Howard worked at Kaggle before founding fast.ai.
Hinton didn't escape Kaggle competitions either, and even won the Merck Molecular Activity Challenge. Promising already!
"Professor Donoho does not refer to Kaggle specifically, but to all data science competition platforms. Quoting computational linguist Mark Liberman, he refers to data science competitions and platforms as being part of a Common Task Framework (CTF) paradigm that has been silently and steadily progressing data science in many fields during the last decades. He states that a CTF can work incredibly well at improving the solution of a problem in data science from an empirical point of view, quoting the Netflix competition and many DARPA competitions as successful examples. The CTF paradigm has contributed to reshaping the best-in-class solutions for problems in many fields.
The system works the best if the task is well defined and the data is of good quality. In the long run, the performance of solutions improves by small gains until it reaches an asymptote. The process can be sped up by allowing a certain amount of sharing among participants (as happens on Kaggle by means of discussions, and sharing Kaggle Notebooks and extra data provided by the datasets found in the Datasets section). According to the CTF paradigm, competitive pressure in a competition suffices to produce always-improving solutions. When the competitive pressure is paired with some degree of sharing among participants, the improvement happens at an even faster rate – hence why Kaggle introduced many incentives for sharing."
#books #kagglebook
Finished reading The Kaggle Book (English edition). General remarks on the book:
A lot of space is wasted on questionable "academic breadth". Why spend dozens of pages defining metrics? It would have been better to cover the trick of ensembling models tuned against different metrics instead. The authors themselves say at the start that the book assumes a certain baseline and will not be explaining "the basics of linear regression".
And the tuners? Why include example code for GridSearchCV? What is the point of this academic breadth — wouldn't it have been better to recommend a specific tuner and explain its practical advantages? Why promote skopt, which at the time the book was written had seen no commits for 2 years (and by now, 5)?
Fine — but if you have spent dozens of pages describing these tuners (80% of which nobody will actually use) along with code examples, why not at least run them all on some dataset and compare them, if only as an illustration?
As for their explanations of which boosting hyperparameters to tune — honestly, that is schoolkid level, not Kaggle grandmaster level.
At the same time, some chapters really are full of valuable personal experience and advice, especially the chapter on ensembles — exactly what I was hoping for.
I liked the computer vision (CV) and natural language processing (NLP) chapters: the former pays a lot of attention to image augmentation, and the latter gives good examples of pipelines.
Overall the strengths outweigh the weaknesses, and I recommend the book for beginner and intermediate levels in DS.
Next I will post several excerpts with ideas that I liked, found useful, or found surprising, sometimes with my comments. Most of the content is in English.
Custom losses in boosting
Metrics, Dimensionality reduction, Pseudo-labeling
Denoising with autoencoders, Neural networks for tabular competitions
Ensembling, part 1
Ensembling, part 2
Stacking variations
I also liked the series of posts/mini-interviews with Kaggle grandmasters; here are the highlights:
Part 1
Part 2
Part 3
Part 4
#books #kagglebook #interviews
Paweł Jankiewicz
I tend to always build a framework for each competition that allows me to create as many experiments as possible.
You should create a framework that allows you to change the most sensitive parts of the pipeline quickly.
I am also trying to build my own framework so that I don't start from scratch every time; for DS this inevitably turns into AutoML.
What Kaggle competitions taught me is the importance of validation, data leakage prevention, etc. For example, if data leaks happen in so many competitions, when people who prepare them are the best in the field, you can ask yourself what percentage of production models have data leaks in training; personally, I think 80%+ of production models are probably not validated correctly, but don’t quote me on that.
Software engineering skills are probably underestimated a lot. Every competition and problem is slightly different and needs some framework to streamline the solution (look at https://github.com/bestfitting/instance_level_recognition and how well their code is organized). Good code organization helps you to iterate faster and eventually try more things.
Andrew Maranhão
While libraries are great, I also suggest that at some point in your career you take the time to implement it yourself. I first heard this advice from Andrew Ng and then from many others of equal calibre. Doing this creates very in-depth knowledge that sheds new light on what your model does and how it responds to tuning, data, noise, and more.
Over the years, the things I wished I realized sooner the most were:
1. Absorbing all the knowledge at the end of a competition
2. Replication of winning solutions in finished competitions
This is a very powerful idea. Taking it further: you should also learn from finished competitions you did NOT take part in, and even from synthetic ones you create yourself!
In the pressure of a competition drawing to a close, you can see the leaderboard shaking more than ever before. This makes it less likely that you will take risks and take the time to see things in all their detail. When a competition is over, you don’t have that rush and can take as long as you need; you can also replicate the rationale of the winners who made their solutions known.
If you have the discipline, this will do wonders for your data science skills, so the bottom line is: stop when you are done, not when the competition ends. I have also heard this advice from an Andrew Ng keynote, where he recommended replicating papers as one of his best ways to develop yourself as an AI practitioner.
Martin Henze
In many cases, after those first few days we’re more than 80% on the way to the ultimate winner’s solution, in terms of scoring metric. Of course, the fun and the challenge of Kaggle are to find creative ways to get those last few percent of, say, accuracy. But in an industry job, your time is often more efficiently spent in tackling a new project instead.
I don’t know how often a hiring manager would actually look at those resources, but I frequently got the impression that my Grandmaster title might have opened more doors than my PhD did. Or maybe it was a combination of the two. In any case, I can much recommend having a portfolio of public Notebooks.
Even if you’re a die-hard Python aficionado, it pays off to have a look beyond pandas and friends every once in a while. Different tools often lead to different viewpoints and more creativity.
Andrada Olteanu
I believe the most overlooked aspect of Kaggle is the community. Kaggle has the biggest pool of people, all gathered in one convenient place, from which one could connect, interact, and learn from. The best way to leverage this is to take, for example, the first 100 people from each Kaggle section (Competitions, Datasets, Notebooks – and if you want, Discussions), and follow on Twitter/LinkedIn everybody that has this information shared on their profile. This way, you can start interacting on a regular basis with these amazing people, who are so rich in insights and knowledge.
#books #kagglebook #interviews
Yifan Xie
In terms of techniques, I have built up a solid pipeline of machine learning modules that allow me to quickly apply typical techniques and algorithms on most data problems. I would say this is a kind of competitive advantage for me: a focus on standardizing, both in terms of work routine and technical artifacts over time. This allows for quicker iteration and in turn helps improve efficiency when conducting data experiments, which is a core component of Kaggle.
I am a very active participant on Numerai. For me, based on my four reasons to do data science, it is more for profit, as they provide a payout via their cryptocurrency. It is more of a solitary effort, as there is not really an advantage to teaming; they don’t encourage or forbid it, but it is just that more human resources don’t always equate to better profit on a trading competition platform like Numerai.
Ryan Chesler
For me, error analysis is one of the most illuminating processes; understanding where the model is failing and trying to find some way to improve the model or input data representation to address the weakness.
I started from very little knowledge and tried out a Kaggle competition without much success at first. I went to a local meetup and found people to team up with and learn from. At the time, I got to work with people of a much higher skill level than me and we did really well in a competition, 3rd/4500+ teams. After this, the group stopped being as consistent and I wanted to keep the community going, so I made my own group and started organizing my own events. I’ve been doing that for almost 4 years and I get to be on the opposite side of the table teaching people and helping them get started.
Bojan Tunguz
For a while I was really into the NLP competitions, but those have always been rare on Kaggle. One constant over the years, though, has been my interest in tabular data problems. Those used to be the quintessential Kaggle competition problems but have unfortunately become extinct. I am still very interested in that area of ML and have moved into doing some basic research in this domain. Compared to the other areas of ML/DL, there has been very little progress on improving ML for tabular data, and I believe there is a lot of opportunity here.
Some of Kaggle techniques are also applicable to my day-to-day modeling, but there one important aspect is missing – and that’s the support and feedback from the community and the leaderboard. When you are working on your own or with a small team, you never know if what you are building is the best that can be done, or if a better solution is possible.
That is why I think there should be one more stage in the lifecycle of an ML model (or even of the business problem): putting it up on Kaggle or a similar platform, with good prizes, to understand the limits of what is achievable.
The single biggest impact on your model’s performance will come from very good features. Unfortunately, feature engineering is more of an art than a science and is usually very model- and dataset-dependent. Most of the more interesting feature engineering tricks and practices are rarely, if ever, taught in standard ML courses or resources. Many of them cannot be taught and are dependent on some special problem-specific insights. But the mindset of looking into feature engineering as default is something that can be cultivated. It will usually take many years of practice to get good at it.
#books #kagglebook #interviews #todo
Jean-François Puget
I like competitions with a scientific background, or a background I can relate to. I dislike anonymous data and synthetic data, unless the data is generated via a very precise physics simulation. More generally, I like Kaggle competitions on domains I don’t know much about, as this is where I will learn the most. It is not the most effective way to get ranking points, but it is the one I entertain most.
What I often do is plot samples using two features or derived features on the x and y axis, and a third feature for color coding samples. One of the three features can be the target. I use lots of visualization, as I believe that human vision is the best data analysis tool there is.
This advice deserves to be taken seriously. I myself saw how, in one competition, Giba noticed something odd about repeating values in far-away columns of a huge matrix, which led to uncovering a leak. Who at work ever actually looks at the data table, let alone a huge one?
Hyperparameter tuning is one of the best ways to overfit, and I fear overfitting a lot.
Kazuki Onodera
-In your experience, what do inexperienced Kagglers often overlook?
-Target analysis.
-What mistakes have you made in competitions in the past?
-Target analysis. Top teams always analyze the target better than others.
Xavier Conort
-Tell us about a particularly challenging competition you entered, and what insights you used to tackle the task.
-My favorite competition is GE Flight Quest, a competition organised by GE where competitors had to predict arrival time of domestic flights in the US.
I was very careful to exclude the name of the airport from my primary feature lists. Indeed, some airports hadn’t experienced bad weather conditions during the few months of history. So, I was very concerned that my favorite ML algorithm, GBM, would use the name of the airport as a proxy for good weather and then fail to predict well for those airports in the private leaderboard. To capture the fact that some airports are better managed than others and improve my leaderboard score slightly, I eventually did use the name of the airport, but as a residual effect only. It was a feature of my second layer of models that used as an offset the predictions of my first layer of models. This approach can be considered a two-step boosting, where you censor some information during the first step. I learnt it from actuaries applying this approach in insurance to capture geospatial residual effects.
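As I understand it, this two-step boosting with an offset maps naturally onto XGBoost's base_margin. A hedged sketch on synthetic stand-in data (the feature split and round counts are my assumptions, not Conort's actual setup):

import numpy as np
import xgboost as xgb

rng = np.random.default_rng(0)
X_weather = rng.normal(size=(1000, 5))                      # stand-in weather features
X_airport = rng.integers(0, 30, (1000, 1)).astype(float)    # stand-in airport id
y = X_weather @ rng.normal(size=5) + 0.1 * X_airport[:, 0] + rng.normal(size=1000)

# step 1: first-layer model trained WITHOUT the airport feature
d1 = xgb.DMatrix(X_weather, label=y)
m1 = xgb.train({"objective": "reg:squarederror"}, d1, num_boost_round=100)

# step 2: airport-only model boosted from the step-1 predictions via base_margin,
# so it can only capture the residual airport effect
d2 = xgb.DMatrix(X_airport, label=y)
d2.set_base_margin(m1.predict(d1))
m2 = xgb.train({"objective": "reg:squarederror"}, d2, num_boost_round=50)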
I would advise inexperienced Kagglers not to look at the solutions posted during the competition but to try to find good solutions on their own. I am happy that competitors didn’t share code during the early days of Kaggle. It forced me to learn the hard way.
-What mistakes have you made in competitions in the past?
-One mistake is to keep on competing in competitions that are badly designed with leaks. It is just a waste of time. You don’t learn much from those competitions.
-What’s the most important thing someone should keep in mind or do when they’re entering a competition?
-Compete to learn. Compete to connect with other passionate data scientists. Don’t compete only to win.
Chris Deotte
I enjoy competitions with fascinating data and competitions that require building creative novel models.
My specialty is analyzing trained models to determine their strengths and weaknesses. Feature engineering and quick experimentation are important when optimizing tabular data models. In order to accelerate the cycle of experimentation and validation, using NVIDIA RAPIDS cuDF and cuML on GPU are essential.
Laura Fink
My favorite competitions are those that want to yield something good to humanity. I especially like all healthcare-related challenges.
#books #kagglebook #interviews
Shotaro Ishihara
Kaggle assumes the use of advanced machine learning, but this is not the case in business. In practice, I try to find ways to avoid using machine learning. Even when I do use it, I prefer working with classical methods such as TF-IDF and linear regression rather than advanced methods such as BERT.
Gilberto Titericz
I always end a competition with at least one Gradient Boosted Tree model and one deep learning-based approach. A blend of such diverse approaches is very important to increase diversity in the predictions and boost the competition metric.
Headhunters all around the world look at Kaggle to find good matches for their positions and the knowledge and experience gained from competitions can boost any career.
The number of medals I have in Kaggle competitions is my portfolio; up to now (11/2021) it’s 58 gold and 47 silver, which summarizes well the ML experience I got from Kaggle. Taking into account that each competition runs for at least 1 month, this is more than 105 consecutive months of experience doing competitive ML.
The data scientist must take into account how the model is going to be used in the future, and make the validation as close as possible to that.
Keep in mind that what makes a competition winner is not just replicating what everyone else is doing, but thinking out of the box and coming up with novel ideas, strategies, architectures, and approaches.
Gabriel Preda
There are also potential employers that make very clear that they do not consider Kaggle relevant. I disagree with this view; personally, before interviewing candidates, I normally check their GitHub and Kaggle profiles. I find them extremely relevant.
A good Kaggle profile will demonstrate not only technical skills and experience with certain languages, tools, techniques, or problem-solving skills, but also how well someone is able to communicate through discussions and Notebooks. This is a very important quality for a data scientist.
Jeong-Yoon Lee
In 2012, I used Factorization Machine, which was introduced by Steffen Rendle at KDD Cup 2012, and improved on prediction performance by 30% over an existing SVM model in a month after I joined a new company. At a start-up I co-founded, our main pitch was the ensemble algorithm to beat the market-standard linear regression. At Uber, I introduced adversarial validation to address covariate shifts in features in the machine learning pipelines.
The more ideas you try, the better chance you have to do well in a competition. The principle applies to my day-to-day work as well.
#books #kagglebook #todo
Now let's walk through the main content and see what was interesting there.
Custom losses in boosting
If you need to create a custom loss in LightGBM, XGBoost, or CatBoost, as indicated in their respective documentation, you have to code a function that takes as inputs the prediction and the ground truth, and that returns as outputs the gradient and the hessian. It occurred to me that it would be good to account for different losses when writing my own hyperparameter tuner.
From a code implementation perspective, all you have to do is to create a function, using closures if you need to pass more parameters beyond just the vector of predicted labels and true labels. Here is a simple example of a focal loss (a loss that aims to heavily weight the minority class in the loss computations, as described in Lin, T-Y. et al., Focal loss for dense object detection: https://arxiv.org/abs/1708.02002) function that you can use as a model for your own custom functions:
from scipy.misc import derivative  # note: removed in SciPy 1.12
import numpy as np  # missing in the book's snippet
import xgboost as xgb

def focal_loss(alpha, gamma):
    def loss_func(y_pred, y_true):
        a, g = alpha, gamma
        def get_loss(y_pred, y_true):
            # binary focal loss computed on the raw margin y_pred
            p = 1 / (1 + np.exp(-y_pred))
            loss = (
                -(a * y_true + (1 - a) * (1 - y_true))
                * ((1 - (y_true * p + (1 - y_true) * (1 - p))) ** g)
                * (y_true * np.log(p) + (1 - y_true) * np.log(1 - p))
            )
            return loss
        partial_focal = lambda y_pred: get_loss(y_pred, y_true)
        # numeric first and second derivatives with respect to the predictions
        grad = derivative(partial_focal, y_pred, n=1, dx=1e-6)
        hess = derivative(partial_focal, y_pred, n=2, dx=1e-6)
        return grad, hess
    return loss_func

clf = xgb.XGBClassifier(objective=focal_loss(alpha=0.25, gamma=1))  # renamed from xgb so the module isn't shadowed
In the above code snippet, we have defined a new cost function, focal_loss, which is then fed into an XGBoost instance’s objective parameter. The example is worth showing because the focal loss requires the specification of some parameters in order to work properly on your problem (alpha and gamma). The more simplistic solution of having their values directly coded into the function is not ideal, since you may have to change them systematically as you are tuning your model. Instead, in the proposed function, when you input the parameters into the focal_loss function, they reside in memory and they are referenced by the loss_func function that is returned to XGBoost. The returned cost function, therefore, will work, referring to the alpha and gamma values that you have initially instantiated.
Another interesting aspect of the example is that it really makes it easy to compute the gradient and the hessian of the cost function by means of the derivative function from SciPy. If your cost function is differentiable, you don’t have to worry about doing any calculations by hand. However, creating a custom objective function requires some mathematical knowledge and quite a lot of effort to make sure it works properly for your purposes. You can read about the difficulties that Max Halford experienced while implementing a focal loss for the LightGBM algorithm, and how he overcame them, here: https://maxhalford.github.io/blog/lightgbm-focal-loss/. Despite the difficulty, being able to conjure up a custom loss can really determine your success in a Kaggle competition where you have to extract the maximum possible result from your model.
#books #kagglebook #todo
Metrics
what you care the most about when using RMSLE is the scale of your predictions with respect to the scale of the ground truth.
By far, at the moment, RMSLE is the most used evaluation metric for regression in Kaggle competitions.
In terms of downside, using MAE as an objective function results in much slower convergence, since you are actually optimizing for predicting the median of the target (also called the L1 norm), instead of the mean (also called the L2 norm), as occurs by MSE minimization. This results in more complex computations for the optimizer, so the training time can even grow exponentially based on your number of training cases (see, for instance, this Stack Overflow question: https://stackoverflow.com/questions/57243267/why-is-training-a-random-forest-regressor-with-mae-criterion-so-slow-compared-to).
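For reference, a minimal sketch of RMSLE under its usual definition, sqrt(mean((log1p(pred) - log1p(true))^2)), via scikit-learn (the toy numbers are mine):

import numpy as np
from sklearn.metrics import mean_squared_log_error

y_true = np.array([10, 100, 1000])
y_pred = np.array([12, 90, 1100])

# errors are relative: being off by 100 at 1000 costs less than being off by 2 at 10
rmsle = np.sqrt(mean_squared_log_error(y_true, y_pred))
print(rmsle)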
Dimensionality reduction
t-SNE and UMAP are two techniques, often used by data scientists, that allow you to project multivariate data into lower dimensions. They are often used to represent complex sets of data in two dimensions. 2-D UMAP and t-SNE plots can reveal the presence of outliers and relevant clusters for your data problem.
In fact, if you can plot the scatter graph of the resulting 2-D projection and color it by target value, the plot may give you hints about possible strategies for dealing with subgroups. You can find a useful example of applying both UMAP and t-SNE with a RAPIDS implementation and a GPU for data exploration purposes.
t-SNE/UMAP should also be used (as preprocessors) when writing a "smart" hyperparameter optimizer.
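A minimal sketch of the kind of 2-D projection colored by target described above, assuming the umap-learn package and a toy dataset:

import matplotlib.pyplot as plt
import umap  # pip install umap-learn
from sklearn.datasets import load_digits

X, y = load_digits(return_X_y=True)
emb = umap.UMAP(n_components=2, random_state=42).fit_transform(X)

# clusters and stray points become visible once colored by the target
plt.scatter(emb[:, 0], emb[:, 1], c=y, s=5, cmap="Spectral")
plt.colorbar()
plt.show()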
Pseudo-labeling
In competitions where the number of examples used for training can make a difference, pseudo-labeling can boost your scores by providing further examples taken from the test set. The idea is to add examples from the test set whose predictions you are confident about to your training set.
Unfortunately, you cannot know for sure beforehand whether or not pseudo-labeling will work in a competition (you have to test it empirically), though plotting learning curves may provide you with a hint as to whether having more data could be useful.
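A hedged sketch of the loop on synthetic data; the 0.95 confidence threshold is my assumption and has to be tuned per problem:

import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=2000, random_state=0)
X_train, X_test, y_train, _ = train_test_split(X, y, test_size=0.5, random_state=0)

# 1) fit on the labeled data, 2) score the unlabeled "test" pool
model = GradientBoostingClassifier().fit(X_train, y_train)
proba = model.predict_proba(X_test)

# 3) keep only the rows the model is confident about
confident = proba.max(axis=1) > 0.95
X_aug = np.vstack([X_train, X_test[confident]])
y_aug = np.concatenate([y_train, proba.argmax(axis=1)[confident]])

# 4) retrain on the augmented training set
model = GradientBoostingClassifier().fit(X_aug, y_aug)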
#books #kagglebook #noise #augmentation #todo
Denoising with autoencoders
In his famous post (https://www.kaggle.com/c/porto-seguro-safe-driver-prediction/discussion/44629), Michael Jahrer describes how a DAE can not only remove noise but also automatically create new features, so the representation of the features is learned in a similar way to what happens in image competitions. In the post, he mentions the secret sauce for the DAE recipe, which is not simply the layers, but the noise you put into the data in order to augment it. He also made clear that the technique requires stacking together training and test data, implying that the technique would not have applications beyond winning a Kaggle competition.
In order to help train any kind of DAE, you need to inject noise that helps to augment the training data and avoid the overparameterized neural network just memorizing inputs (in other words, overfitting). In the Porto Seguro competition, Michael Jahrer added noise by using a technique called swap noise, which he described as follows: Here I sample from the feature itself with a certain probability “inputSwapNoise” in the table above. 0.15 means 15% of features replaced by values from another row.
What is described is basically an augmentation technique called mixup (which is also used in image augmentation: https://arxiv.org/abs/1710.09412). In mixup for tabular data, you decide a probability for mixing up. Based on that probability, you change some of the original values in a sample, replacing them with values from a more or less similar sample from the same training data.
A big question with no answer: why is augmentation always used in computer vision tasks and never with tabular data? Yet it is an excellent way of creating diversity. This needs a proper study.
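A minimal sketch of swap noise as described: with probability p, each cell is replaced by the value from the same column of a random other row (my own implementation, not Jahrer's code):

import numpy as np

def swap_noise(X, p=0.15, seed=0):
    rng = np.random.default_rng(seed)
    X = np.asarray(X, dtype=float)
    mask = rng.random(X.shape) < p                        # which cells to corrupt
    donors = rng.integers(0, X.shape[0], size=X.shape)    # donor row, per cell
    X_noisy = X.copy()
    rows, cols = np.where(mask)
    X_noisy[rows, cols] = X[donors[rows, cols], cols]
    return X_noisy

X = np.arange(20).reshape(5, 4)
print(swap_noise(X))  # roughly 15% of cells come from other rows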
Neural networks for tabular competitions
Use activations such as GeLU, SeLU, or Mish instead of ReLU; they are quoted in quite a few papers as being more suitable for modeling tabular data and our own experience confirms that they tend to perform better.
Use augmentation with mixup (discussed in the section on autoencoders).
Use quantile transformation on numeric features and force, as a result, uniform or Gaussian distributions.
Leverage embedding layers, but also remember that embeddings do not model everything. In fact, they miss interactions between the embedded feature and all the others (so you have to force these interactions into the network with direct feature engineering).
Remember that embedding layers are reusable. In fact, they consist only of a matrix multiplication that reduces the input (a sparse one-hot encoding of the high cardinality variable) to a dense one of lower dimensionality. By recording and storing away the embedding of a trained neural network, you can transform the same feature and use the resulting embeddings in many other different algorithms, from gradient boosting to linear models.
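A hedged PyTorch sketch of two of these tips combined — quantile-transforming skewed numeric features to a Gaussian and embedding a high-cardinality categorical (all sizes are illustrative):

import numpy as np
import torch
import torch.nn as nn
from sklearn.preprocessing import QuantileTransformer

rng = np.random.default_rng(0)
num = rng.lognormal(size=(1000, 3))        # skewed numeric features
cat = rng.integers(0, 500, size=1000)      # high-cardinality categorical

# force the numeric features into a Gaussian shape
num_gauss = QuantileTransformer(output_distribution="normal").fit_transform(num)

# dense, reusable representation of the categorical
emb = nn.Embedding(num_embeddings=500, embedding_dim=16)
cat_dense = emb(torch.as_tensor(cat))

x = torch.cat([torch.as_tensor(num_gauss, dtype=torch.float32), cat_dense], dim=1)
print(x.shape)  # torch.Size([1000, 19]) -> input to the tabular network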
#books #kagglebook #ensembling
Ensembling
Ensemble of tuned models always performs better than an ensemble of untuned ones.
The logarithmic mean: Analogous to the geometric mean, you take the logarithms of your submissions, average them together, and take the exponentiation of the resulting mean.
The mean of powers: Where you take the average of the nth power of the submissions, then you take the 1/nth power of the resulting average.
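A minimal sketch of both averages, assuming strictly positive predictions (e.g., probabilities) so the logarithms are defined:

import numpy as np

preds = np.array([[0.2, 0.8, 0.5],     # submission 1
                  [0.4, 0.6, 0.7]])    # submission 2

log_mean = np.exp(np.mean(np.log(preds), axis=0))      # logarithmic mean
n = 2
power_mean = np.mean(preds ** n, axis=0) ** (1 / n)    # mean of the nth powers
print(log_mean, power_mean)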
Giving more importance to more uncorrelated predictions is an ensembling strategy that is often successful. Even if it only provides slight improvements, this could suffice to turn the competition to your advantage.
In blending, the kind of meta-learner you use can make a great difference. The most common choices are to use a linear model or a non-linear one. One limit to using these kinds of meta-learners is that they may assign some models a negative contribution, as you will be able to see from the value of the coefficient in the model. When you encounter this situation, the model is usually overfitting, since all models should be contributing positively to the building of the ensemble (or, at worst, not contributing at all). The most recent versions of Scikit-learn allow you to impose only positive weights and to remove the intercept. These constraints act as a regularizer and prevent overfitting.
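A short sketch of that constrained linear meta-learner on stand-in OOF predictions; positive=True requires scikit-learn >= 0.24:

import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
oof_preds = rng.random((500, 3))   # OOF predictions of 3 base models (stand-in data)
y = oof_preds @ np.array([0.5, 0.3, 0.2]) + rng.normal(scale=0.05, size=500)

# non-negative weights, no intercept: both constraints act as regularizers
blender = LinearRegression(positive=True, fit_intercept=False).fit(oof_preds, y)
print(blender.coef_)  # a zero weight effectively drops a model from the blend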
Non-linear models as meta-learners are less common because they tend to overfit in regression and binary classification problems, but they often shine in multiclass and multilabel classification problems since they can model the complex relationships between the classes present. They also generally perform better if, aside from the models’ predictions, you also provide them with the original features, since they can spot any useful interactions that help them correctly select which models to trust more.
#books #kagglebook #ensembling
Ensembling
An alternative to using a linear or non-linear model as a meta-learner is provided by the ensemble selection technique formalized by Caruana, Niculescu-Mizil, Crew, and Ksikes. (We discussed this method not so long ago.)
The ensemble selection is actually a weighted average, so it could simply be considered analogous to a linear combination. However, it is a constrained linear combination (because it is part of a hill-climbing optimization) that will also make a selection of models and apply only positive weights to the predictions. All this minimizes the risk of overfitting and ensures a more compact solution, because the solution will involve a model selection. From this perspective, ensemble selection is recommended in all problems where the risk of overfitting is high (for instance, because the training cases are few in number or the models are too complex) and in real-world applications because of its simpler yet effective solution.
When using a meta-learner, you are depending on the optimization of its own cost function, which may differ from the metric adopted for the competition. Another great advantage of ensemble selection is that it can be optimized to any evaluation function, so it is mostly suggested when the metric for the competition is different from the canon of those typically optimized in machine learning models.
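A hedged sketch of greedy ensemble selection in the spirit of Caruana et al.: repeatedly add, with replacement, the model whose inclusion most improves the metric on the OOF predictions. My simplified version runs a fixed number of iterations instead of early stopping:

import numpy as np
from sklearn.metrics import roc_auc_score

def ensemble_selection(oof_preds, y, n_iters=50):
    # oof_preds: dict {model_name: OOF prediction vector}
    picked, blend = [], np.zeros(len(y))
    for _ in range(n_iters):
        scores = {name: roc_auc_score(y, (blend * len(picked) + p) / (len(picked) + 1))
                  for name, p in oof_preds.items()}      # any evaluation metric fits here
        best = max(scores, key=scores.get)
        picked.append(best)
        blend = (blend * (len(picked) - 1) + oof_preds[best]) / len(picked)
    # weights are selection frequencies: non-negative by construction
    return {m: picked.count(m) / len(picked) for m in set(picked)}

rng = np.random.default_rng(0)
y = rng.integers(0, 2, 500)
oof = {"gbm": y * 0.7 + rng.random(500) * 0.3,
       "rf": y * 0.5 + rng.random(500) * 0.5,
       "knn": rng.random(500)}   # weak model; selection should nearly ignore it
print(ensemble_selection(oof, y))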
Having obtained OOF predictions for all your models, you can proceed to build a meta-learner that predicts your target based on the OOF predictions (first-level predictions), or you can keep on producing further OOF predictions on top of your previous OOF predictions (second- or higher-level predictions), thus creating multiple stacking layers. This is compatible with an idea presented by Wolpert himself: by using multiple meta-learners, you are actually imitating the structure of a fully connected feedforward neural network without backpropagation, where the weights are optimally calculated in order to maximize the predictive performance at the level of each layer separately. From a practical point of view, stacking multiple layers has proven very effective and works very well for complex problems where single algorithms are unable to obtain the best results.
Moreover, one interesting aspect of stacking is that you don’t need models of comparable predictive power, as in averaging and often in blending. In fact, even worse-performing models may be effective as part of a stacking ensemble. A k-nearest neighbors model may not be comparable to a gradient boosting solution, but when you use its OOF predictions for stacking it may contribute positively and increase the predictive performance of the ensemble.
When you have trained all the stacking layers, it is time to predict. As for producing the predictions used at the various stacking stages, it is important to note that there are two ways to do this.
The original Wolpert paper suggests re-training your models on all your training data and then using those re-trained models for predicting on the test set. In practice, many Kagglers don’t retrain, but directly use the models created for each fold and make multiple predictions on the test set that are averaged at the end. In our experience, stacking is generally more effective with complete re-training on all the available data before predicting on the test set when you are using a low number of k-folds. In these cases, the sample consistency may really make a difference in the quality of the prediction, because training on less data means getting more variance in the estimates. As we discussed in Chapter 6, when creating OOF predictions it is always better to use a high number of folds, between 10 and 20. This limits the number of examples that are held out and, without re-training on all the data, you can simply use the average of the predictions obtained from the cross-validation models as your prediction on the test set.
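A sketch of both routes on made-up data; the fold loop produces the OOF predictions and keeps the fold models for the averaging route, while the last line shows the full re-training route:

```python
# OOF predictions for one base model plus the two test-prediction routes:
# averaging the k fold models versus re-training on all the training data.
import numpy as np
from sklearn.model_selection import KFold
from sklearn.ensemble import GradientBoostingRegressor

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 10))
y = X[:, 0] + rng.normal(scale=0.1, size=1000)
X_test = rng.normal(size=(200, 10))

kf = KFold(n_splits=10, shuffle=True, random_state=0)
oof = np.zeros(len(y))
fold_test_preds = []
for train_idx, valid_idx in kf.split(X):
    model = GradientBoostingRegressor(random_state=0)
    model.fit(X[train_idx], y[train_idx])
    oof[valid_idx] = model.predict(X[valid_idx])   # first-level OOF preds
    fold_test_preds.append(model.predict(X_test))  # route 1: keep fold models

test_pred_avg = np.mean(fold_test_preds, axis=0)   # route 1: average them
# Route 2 (Wolpert's original suggestion): re-train on all training data.
test_pred_full = GradientBoostingRegressor(random_state=0).fit(X, y).predict(X_test)
```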
#books #kagglebook #ensembling
Stacking variations
The main variations on stacking involve changing how test data is processed across the layers, whether to use only stacked OOF predictions or also the original features in all the stacking layers, what model to use as the last one, and various tricks in order to prevent overfitting.
Here we discuss some of the most effective ones that we have personally experimented with:
• Optimization may or may not be used. Some solutions do not care too much about optimizing single models; others optimize only the last layers; others optimize the first layers. Based on our experience, optimization of single models is important and we prefer to do it as early as possible in our stacking ensemble.
• Models can differ at the different stacking layers, or the same sequence of models can be repeated at every stacking layer. Here we don’t have a general rule, as it really depends on the problem. The kind of models that are more effective may vary according to the problem. As a general suggestion, putting together gradient boosting solutions and neural networks has never disappointed us.
• At the first level of the stacking procedure, just create as many models as possible.
For instance, you can try a regression model if your problem is a classification one, and vice versa. You can also use different models with different hyperparameter settings, thus avoiding overly extensive optimization, because the stacking will decide for you. If you are using neural networks, just changing the random initialization seed could suffice to create a diverse bag of models. You can also try models using different feature engineering and even use unsupervised learning (like Mike Kim did when he used t-SNE dimensions in one of his solutions: https://www.kaggle.com/c/otto-group-product-classification-challenge/discussion/14295). The idea is that the selection of all such contributions is done during the second level of the stacking. This means that, at that point, you do not have to experiment any further and you just need to focus on a narrower set of better-performing models. By applying stacking, you can re-use all your experiments and let the stacking decide to what degree you should use something in your modeling pipeline.
• Some stacking implementations carry all the features, or a selection of them, over to further stages, reminiscent of skip layers in neural networks. We have noticed that bringing in features at later stages in the stacking can improve your results, but be careful: it also brings in more noise and a risk of overfitting (see the sketch after this list).
• Ideally, your OOF predictions should be made from cross-validation schemes with a high number of folds, that is, between 10 and 20, though we have also seen solutions working with a lower number, such as 5 folds.
• For each fold, bagging the data (resampling with replacement) multiple times for the same model and then averaging all the results from the model (OOF predictions and test predictions) helps to avoid overfitting and produces better results in the end.
The possibilities are endless. Once you have grasped the basic concept of this ensembling technique, all you need to do is apply your creativity to the problem at hand.
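As a sketch of the skip-layer variation mentioned in the list above (all model choices and data below are illustrative, not a recipe from the book):

```python
# The second-level meta-learner is trained on the first-level OOF
# predictions concatenated with the original features (the "skip").
import numpy as np
from sklearn.model_selection import cross_val_predict
from sklearn.linear_model import Ridge
from sklearn.ensemble import RandomForestRegressor, GradientBoostingRegressor

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 10))
y = X[:, 0] ** 2 + X[:, 1] + rng.normal(scale=0.1, size=1000)

# First level: OOF predictions for two diverse base models.
base_models = [Ridge(), RandomForestRegressor(n_estimators=100, random_state=0)]
oof = np.column_stack([cross_val_predict(m, X, y, cv=10) for m in base_models])

# Second level: stacked predictions plus the original features.
# Mind the extra noise this brings in; validate carefully for overfitting.
X_level2 = np.hstack([oof, X])
meta = GradientBoostingRegressor(random_state=0).fit(X_level2, y)
```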
#timeseries #ensembling #hetboost #pmdarima #todo
A great example came my way where the idea of "heterogeneous boosting" works out beautifully.
On synthetic data, the lecturer compares ARIMA against a linear regression + tree ensemble.
In time series problems, the decomposition into trend, seasonality, and a residual irregular signal is obvious and necessary, but one can look at the problem more generally: model classes have their own limitations (tree-based regression models, for example, are poor at modeling linear dependencies), and training a model of one class on the residuals of a model of another class can deliver excellent results.
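A minimal sketch of this residual-fitting idea on synthetic data (a linear trend plus a non-linear component; names and parameters are illustrative):

```python
# Heterogeneous boosting: fit a linear model first, then fit a tree-based
# model on its residuals, and sum the two predictions.
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(1000, 1))
y = 3.0 * X[:, 0] + np.sin(3 * X[:, 0]) + rng.normal(scale=0.1, size=1000)

linear = LinearRegression().fit(X, y)         # stage 1: the linear trend
residuals = y - linear.predict(X)
tree = DecisionTreeRegressor(max_depth=4).fit(X, residuals)  # stage 2: the rest

X_new = np.linspace(0, 10, 50).reshape(-1, 1)
prediction = linear.predict(X_new) + tree.predict(X_new)
```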
At the same time, the most widespread ensembling methods today are stacking (where the feature space changes for the next-level models) and homogeneous boosting (for example, gradient boosting over trees in catboost/xgboost/lightgbm), while nobody seems to consider the idea of heterogeneous boosting, and there appear to be no open-source implementations.
The roots of this prejudice seem to grow from the early papers on boosting over "weak learners". The choice of specifically weak models was justified by overfitting control, the uniformity of the steps of the training process, and the focus on hard-to-predict examples (which a stronger model could simply memorize).
It seems to me that the "weakness" and uniformity of the ensemble members are not always a blessing, and in practice it makes sense, for each specific problem, to check (on CV) which ensembling approach pays off the most, from simple model averaging and ensemble selection (which we examined recently) to stacking and the two kinds of boosting, homogeneous and heterogeneous.
I am planning a comparative study for this year )
It looks like the relatively small article on getting better at DS that I prepared, after running into the inability of modern gradient boosting libraries to model the simple dependency Y=X well, will grow into a large comparison of ensembling algorithms.
I will try to cover Ensemble Selection (1, 2, 3), the ensembling options considered in #kagglebook (1, 2, 3), and Cascade Generalization/Cascading/Delegating (or Selective Routing)/Arbitrating.