#masters #featureselection #redundancy #multicollinearity
"In any application involving a large number of variables, it’s nice to be able to identify sets of variables that have significant redundancy. Of course, we may be unlucky and have a situation in which the small differences between largely redundant variables contain the useful information. However, this is the exception. In most applications,it is the redundant information that is most important; if some type of effect impacts multiple variables, it’s probably important. Because dealing with fewer variables is always better, if we can identify groups of variables that have great intra-group redundancy, we may be able to eliminate many variables from consideration, focusing on a weighted average of representatives from each group, or perhaps focusing on a single factor that is highly correlated with a redundant group. Or we might just be interested in the fact of redundancy, garnering useful insight from it."
Ernie Chan described the same approach, in which several collinear factors are replaced by a single one. Tim uses PCA to find groups of related factors.
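A minimal sketch of this grouping idea (my illustration, not code from Masters or Chan), assuming pandas and SciPy are available: cluster the columns by absolute correlation and keep one representative per cluster. The synthetic column names and the 0.1 distance cutoff (i.e. |corr| > 0.9) are arbitrary choices.

```python
# Illustrative sketch only: group near-redundant variables by correlation
# and keep one representative per group. Data, names and thresholds are made up.
import numpy as np
import pandas as pd
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import squareform

rng = np.random.default_rng(0)
base = rng.normal(size=(500, 3))
X = pd.DataFrame({
    "a1": base[:, 0],
    "a2": base[:, 0] + 0.05 * rng.normal(size=500),  # nearly redundant with a1
    "b1": base[:, 1],
    "b2": base[:, 1] + 0.05 * rng.normal(size=500),  # nearly redundant with b1
    "c":  base[:, 2],                                # independent factor
})

# Distance = 1 - |correlation|: redundant columns end up close to each other.
corr = X.corr().abs()
dist = squareform(1.0 - corr.values, checks=False)
labels = fcluster(linkage(dist, method="average"), t=0.1, criterion="distance")

groups = pd.Series(X.columns).groupby(labels).apply(list)
print(groups)                                # e.g. [a1, a2], [b1, b2], [c]
representatives = [g[0] for g in groups]     # or a weighted average / first PC per group
print(representatives)
```

Instead of keeping the first column of each group, the group's first principal component can serve as the representative, which corresponds to the PCA variant mentioned above.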
"In any application involving a large number of variables, it’s nice to be able to identify sets of variables that have significant redundancy. Of course, we may be unlucky and have a situation in which the small differences between largely redundant variables contain the useful information. However, this is the exception. In most applications,it is the redundant information that is most important; if some type of effect impacts multiple variables, it’s probably important. Because dealing with fewer variables is always better, if we can identify groups of variables that have great intra-group redundancy, we may be able to eliminate many variables from consideration, focusing on a weighted average of representatives from each group, or perhaps focusing on a single factor that is highly correlated with a redundant group. Or we might just be interested in the fact of redundancy, garnering useful insight from it."
О таком же подходе, когда несколько коллинеарных факторов заменяются одним, говорил и Эрни Чан. Тим для поиска групп связанных факторов использует PCA.
#featureselection #multicollinearity
"This difference has an impact on a corner case in feature importance analysis: the correlated features. Imagine two features perfectly correlated, feature A and feature B. For one specific tree, if the algorithm needs one of them, it will choose randomly (true in both boosting and Random Forests™).
However, in Random Forests™ this random choice will be made for each tree, because each tree is independent of the others. Therefore, approximately (depending on your parameters), 50% of the trees will choose feature A and the other 50% will choose feature B. So the importance of the information contained in A and B (which is the same, because they are perfectly correlated) is diluted between A and B, and you won’t easily see that this information is important for predicting what you want to predict! It is even worse when you have 10 correlated features…
In boosting, once a specific link between a feature and the outcome has been learned by the algorithm, it will try not to refocus on it (that is the theory; reality is not always that simple). Therefore, all the importance will be on feature A or on feature B (but not both). You will know that one feature has an important role in the link between the observations and the label. It is still up to you to search for the features correlated with the one detected as important if you need to know all of them."
https://cran.r-project.org/web/packages/xgboost/vignettes/discoverYourData.html ("Understand your dataset with XGBoost")
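A small sketch of this corner case (my illustration, not code from the vignette, which is in R), assuming scikit-learn and the xgboost Python package: two identical features A and B split the importance in a Random Forest, while boosting tends to pile it onto one of them. The names and hyperparameters here are arbitrary.

```python
# Illustrative sketch only: importance dilution with two identical features.
import numpy as np
import pandas as pd
import xgboost as xgb
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)
n = 2000
a = rng.normal(size=n)                                 # the real signal
c = rng.normal(size=n)                                 # an unrelated feature
y = (a + 0.5 * rng.normal(size=n) > 0).astype(int)     # target driven by a
X = pd.DataFrame({"A": a, "B": a.copy(), "C": c})      # A and B are identical copies

rf = RandomForestClassifier(n_estimators=300, max_features=1, random_state=0).fit(X, y)
bst = xgb.XGBClassifier(n_estimators=100, max_depth=3).fit(X, y)

# Random Forest: A and B typically share the importance roughly 50/50.
print(pd.Series(rf.feature_importances_, index=X.columns))
# Boosting: the importance tends to concentrate on one of A or B.
print(pd.Series(bst.feature_importances_, index=X.columns))
```

Setting max_features=1 makes the per-split feature choice explicitly random, mirroring the random per-tree choice the vignette describes for Random Forests™.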
"This difference has an impact on a corner case in feature importance analysis: the correlated features. Imagine two features perfectly correlated, feature A and feature B. For one specific tree, if the algorithm needs one of them, it will choose randomly (true in both boosting and Random Forests™).
However, in Random Forests™ this random choice will be done for each tree, because each tree is independent from the others. Therefore, approximatively, depending of your parameters, 50% of the trees will choose feature A and the other 50% will choose feature B. So the importance of the information contained in A and B (which is the same, because they are perfectly correlated) is diluted in A and B. So you won’t easily know this information is important to predict what you want to predict! It is even worse when you have 10 correlated features…
In boosting, when a specific link between feature and outcome have been learned by the algorithm, it will try to not refocus on it (in theory it is what happens, the reality is not always that simple). Therefore, all the importance will be on feature A or on feature B (but not both). You will know that one feature has an important role in the link between the observations and the label. It is still up to you to search for the correlated features to the one detected as important if you need to know all of them."
https://cran.r-project.org/web/packages/xgboost/vignettes/discoverYourData.html
cran.r-project.org
Understand your dataset with XGBoost