Decision Tree Ensembles, also referred to as random forests, are useful for feature selection in addition to being effective classifiers. One approach to dimensionality reduction is to generate a large and carefully constructed set of trees against a target attribute and then use each attribute’s usage statistics to find the most informative subset of features.
Specifically, we can generate a set of very shallow trees, each trained on a small fraction of the total number of attributes. Here we first trained e.g. 2000 trees, each trained independently on 2 levels and 3 attributes: these settings provide good feature selection in a reasonable amount of time for this data set. If an attribute is often selected as the best split, it is most likely an informative feature worth retaining. A rough sketch of this ensemble-generation step is shown below.
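The following is a minimal sketch of how such an ensemble could be generated with Python and scikit-learn. It assumes a pandas DataFrame X of input attributes and a target vector y (both hypothetical names), and simply draws a random subset of attributes for each tree; the tree count, depth, and attributes per tree follow the settings mentioned above.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)
n_trees = 2000        # number of shallow trees in the ensemble
n_attrs_per_tree = 3  # attributes sampled for each tree
max_depth = 2         # two levels per tree

ensemble = []
for _ in range(n_trees):
    # each tree sees only a small random subset of the attributes
    cols = list(rng.choice(X.columns, size=n_attrs_per_tree, replace=False))
    tree = DecisionTreeClassifier(max_depth=max_depth)
    tree.fit(X[cols], y)
    ensemble.append((tree, cols))
```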
Once the tree ensemble was generated, we calculated a score for each attribute by counting how many times it was selected for a split, and at which level, among all attributes available as candidates in the trees of the ensemble.
Score = #splits(level 0)/#candidates(level 0) + #splits(level 1)/#candidates(level 1)
This score tells us, relative to the other attributes, which are the most predictive. Only the input features scoring above a given threshold are retained. This technique produces a strong reduction rate while only minimally affecting the original accuracy.
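The sketch below illustrates one way the scoring and selection step could be implemented on top of the ensemble list from the previous snippet. The candidate counts follow one plausible reading of the formula above (each of a tree's attributes counts as a candidate once per split node at the given level), and the threshold value is purely illustrative.

```python
from collections import defaultdict

splits = [defaultdict(int), defaultdict(int)]      # split counts per level
candidates = [defaultdict(int), defaultdict(int)]  # candidate counts per level

for tree, cols in ensemble:
    t = tree.tree_
    # level-0 (root) and level-1 split nodes of this shallow tree
    root = [0] if t.feature[0] >= 0 else []
    children = [c for c in (t.children_left[0], t.children_right[0])
                if c != -1 and t.feature[c] >= 0]
    for level, nodes in enumerate([root, children]):
        for attr in cols:          # every attribute of the tree was a candidate
            candidates[level][attr] += len(nodes)
        for node in nodes:         # the attribute actually chosen for the split
            splits[level][cols[t.feature[node]]] += 1

# Score = #splits(level 0)/#candidates(level 0) + #splits(level 1)/#candidates(level 1)
score = {attr: sum(splits[lev][attr] / candidates[lev][attr]
                   for lev in (0, 1) if candidates[lev][attr] > 0)
         for attr in X.columns}

threshold = 0.2  # hypothetical cut-off; tune for the desired reduction rate
selected = [attr for attr, s in score.items() if s >= threshold]
```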