The Tree Ensemble Learner node builds an ensemble of decision trees as a variant of the random forest. Each decision tree is trained on a different subset of rows and/or a different subset of columns, randomly selected at each iteration. The output model is thus an ensemble of differently trained decision tree models.
The Tree Ensemble Predictor node applies all decision trees to each data row and uses a simple majority vote for the prediction.
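The underlying principle can be sketched outside KNIME in a few lines of Python. The snippet below is a minimal illustration of bagging with majority voting, assuming scikit-learn and numpy are available; the subset sizes and the iris data are arbitrary choices for the example, not KNIME's internals.

```python
# A minimal sketch of the tree ensemble idea, assuming scikit-learn and numpy:
# train each tree on a random row/column subset, then predict by majority vote.
import numpy as np
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
rng = np.random.default_rng(42)

trees, col_subsets = [], []
for _ in range(25):
    rows = rng.integers(0, len(X), size=len(X))           # bootstrap rows
    cols = rng.choice(X.shape[1], size=2, replace=False)  # random column subset
    trees.append(DecisionTreeClassifier().fit(X[rows][:, cols], y[rows]))
    col_subsets.append(cols)

# Simple majority vote across all trees, one vote per tree per data row.
votes = np.stack([t.predict(X[:, c]) for t, c in zip(trees, col_subsets)])
majority = np.apply_along_axis(lambda v: np.bincount(v).argmax(), 0, votes)
```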
The Tree Ensemble Learner node requires many settings, reflecting its complexity. Indeed, it has three settings tabs in the configuration window: one to select the attributes (that is, the data columns), one to set the decision tree training parameters (such as the entropy measure), and one to define the ensemble settings (such as the number of trees to be trained).
Attribute Selection Tab
The "Attribute Selection" tab sets the data columns information.
In particular, it selects the input data columns and the target column to train the decision trees of the ensemble. A random subset of the input data columns is then used to train each decision tree. Two variants are possible for the input data columns:
- Fingerprint attributes use the bit positions of the selected bit vector as learning attributes: for instance, a bit vector of length 1024 is expanded into 1024 binary attributes (see the sketch after this list). All bit vectors in the selected column must have the same length.
- Column attributes use nominal and numeric data columns as descriptors.
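As an aside, the fingerprint expansion can be illustrated with a short Python sketch; the "fingerprint" column name and its values below are made up for the example:

```python
# Expand a bit vector (fingerprint) column into one binary attribute per bit
# position. The "fingerprint" column and its values are invented for this sketch.
import pandas as pd

df = pd.DataFrame({"fingerprint": ["1010", "0111", "1100"]})

bits = df["fingerprint"].apply(lambda s: pd.Series([int(b) for b in s]))
bits.columns = [f"bit_{i}" for i in range(bits.shape[1])]
# bits now holds 4 binary columns (bit_0 .. bit_3), one per bit position.
```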
A few more settings are also available to ignore nominal columns without domain information (such columns probably contain too many distinct nominal values anyway) and to enable hilighting (which is computationally more expensive).
The "Attribute Selection" tab in the figure below does not use bitvector descriptors, includes all available data columns as training features, and uses the "Document Class" column as the target variable for the decision trees training.
Tree Options Tab
The second tab, named "Tree Options", sets the decision tree learning parameters, following the Random Forest™ classifier described by Leo Breiman and Adele Cutler. This tab defines:
- the decision tree entropy measure (see the sketch after this list). The Gini index is the measure used in CART; the information gain is the measure used in C4.5; the information gain ratio improves on the standard information gain by normalizing it by the split entropy, to counteract the unfair preference for nominal splits with many child nodes.
- the split point for numeric attributes. This can be either the mean value between the two partitions or the largest value of the lower partition (as in C4.5); the latter option is the default. Nominal columns, by contrast, are split by creating one child node for each nominal value.
- the maximum number of tree levels (tree depth), which limits tree growth. For instance, a value of 1 would split only the (single) root node.
- the minimum size of nodes (split nodes) and leaves (child nodes). This parameter affects the size of the decision trees: the larger the minimum node size, the smaller the tree.
- the root attribute, which forces a selected data column to be used as the root split attribute in all decision trees.
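To make the three entropy measures concrete, here is a toy Python computation for one candidate binary split; the class labels are invented for the example, and this is a didactic sketch, not KNIME's implementation:

```python
# Toy computation of the three split quality measures: Gini index (CART),
# information gain (C4.5), and information gain ratio. Labels are invented.
import numpy as np

def gini(labels):
    p = np.bincount(labels) / len(labels)
    return 1.0 - np.sum(p ** 2)

def entropy(labels):
    p = np.bincount(labels) / len(labels)
    p = p[p > 0]
    return -np.sum(p * np.log2(p))

parent = np.array([0, 0, 0, 0, 1, 1, 1, 1])
children = [np.array([0, 0, 0, 1]), np.array([0, 1, 1, 1])]  # candidate split

weights = [len(c) / len(parent) for c in children]
info_gain = entropy(parent) - sum(w * entropy(c) for w, c in zip(weights, children))

# The gain ratio divides by the split entropy, penalizing splits that
# scatter the rows over many child nodes.
split_entropy = -sum(w * np.log2(w) for w in weights)
gain_ratio = info_gain / split_entropy

print(gini(parent), info_gain, gain_ratio)  # 0.5, ~0.189, ~0.189
```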
The "Tree Options" tab in the figure below uses the Gini Index as entropy measure, the mean attribute value between two partitions as split point, and does not impose any limit on the tree size and growth nor on the starting data column.
Ensemble Configuration Tab
The last configuration tab, named "Ensemble Configuration", defines the ensemble settings. Here we set:
- the number of decision trees.
- the row sampling strategy, to feed each decision tree with a different subset of data rows.
- the column sampling strategy, to feed each decision tree with a different subset of data columns. The selected column subset can either apply to the whole decision tree (option "Use same set of attributes for entire tree") or change at each node: option "Use different set of attributes for each tree node" draws a new random column subset at every tree node and selects the best split among those columns (see the sketch after this list).
- the use of a random seed, to make the random sampling repeatable.
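For readers who want a point of comparison, these ensemble settings map roughly onto scikit-learn's RandomForestClassifier; the mapping below is an assumption about an approximately equivalent setup, not a description of KNIME's implementation:

```python
# Approximately equivalent ensemble settings in scikit-learn (an assumption,
# not KNIME's implementation): 1000 trees, bootstrap row sampling with
# replacement, sqrt(#columns) attributes drawn anew at each node, fixed seed.
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier

X, y = load_iris(return_X_y=True)

model = RandomForestClassifier(
    n_estimators=1000,    # number of decision trees
    bootstrap=True,       # row sampling with replacement
    max_features="sqrt",  # column subset size: sqrt of the column count
    random_state=1,       # seed to make the random selection repeatable
).fit(X, y)
```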
The "Ensemble Configuration" tab in the figure below defines an ensemble with 1000 decision trees, bootstrapping with replacement as row sampling strategy, the square root of the original number of columns as the size of the data column subset, the usage different column subsets for different nodes, and a random seed to repeat the whole random selection.
The decision tree construction takes place in main memory (all data and all models are kept in memory). The statistics for each split (the class distribution) are saved to disk. This means that the class distributions are visible in the node view, but also that the model size (in MB on disk) can grow considerably. Currently, this node cannot handle missing values.
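In practice, missing values must therefore be treated upstream of the learner, for instance with KNIME's Missing Value node; the Python sketch below shows the equivalent preprocessing idea, with made-up column names and values:

```python
# The learner cannot handle missing values, so impute them upstream.
# Column names and values are made up; in a KNIME workflow this step would
# typically be done with the Missing Value node before the learner.
import pandas as pd

df = pd.DataFrame({"length": [5.1, None, 4.7], "color": ["red", None, "blue"]})

df["length"] = df["length"].fillna(df["length"].mean())  # numeric: mean value
df["color"] = df["color"].fillna("missing")              # nominal: fixed value
```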