The Bagging technique
This technique was proposed by Breiman in 1994 and can be used with many classification methods. The final effect of this technique is to reduce the variance associated with the predictions, thereby improving the prediction process.
Here are the steps:
1. X bootstrap samples are drawn from the available training data.
2. A classification model is trained on each bootstrap sample.
3. All classification models are run on the test set.
4. For each test pattern, the results are combined by simple voting: the class predicted by the majority of the classifiers is chosen as the final class.
Bagging can also be applied to regression methods. In this case, though, the final result is the average of the results of all regression models.
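As a minimal sketch of these steps, here is one way they could be written in Python, assuming scikit-learn decision trees as the base classifiers (the KNIME example later in this section also uses decision trees); the function name and parameters are illustrative:

```python
import numpy as np
from collections import Counter
from sklearn.tree import DecisionTreeClassifier

def bagging_predict(X_train, y_train, X_test, n_models=10, seed=0):
    rng = np.random.default_rng(seed)
    n = len(X_train)
    predictions = []
    for _ in range(n_models):
        # 1. Draw a bootstrap sample (sampling with replacement).
        idx = rng.integers(0, n, size=n)
        # 2. Train a classification model on the bootstrap sample.
        model = DecisionTreeClassifier().fit(X_train[idx], y_train[idx])
        # 3. Run the model on the test set.
        predictions.append(model.predict(X_test))
    # 4. Majority vote: for each test pattern, pick the most frequent class.
    columns = np.stack(predictions, axis=1)  # one column per model
    return [Counter(row).most_common(1)[0][0] for row in columns]
```

For regression, step 4 would return the average instead of the majority vote, e.g. np.stack(predictions, axis=1).mean(axis=1).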
The “Bagging” meta-node
KNIME offers a meta-node to implement bagging for classification: the “Bagging” meta-node.
The “Bagging” meta-node applies only to classifiers, i.e. to mining techniques that output nominal class values. A version for numerical output values, such as those produced by regression methods, with averaging of the final results, is not yet available out of the box. However, customizing the “Bagging” meta-node for this purpose should be straightforward.
The “Bagging” meta-node has two input ports, one for the training set and one for the test set, and one output port.
The output port presents the results for all the trained models and the final result after the majority vote.
The sub-workflow of the “Bagging” meta-node
The “Bagging” meta-node consists of two steps: training X models and applying the X models to the test set.
The goal of the training step is to create X partitions of the training data and to train X models on them.
- This step starts by shuffling the data. Shuffling is necessary; otherwise, the X partitions of the training data might represent only a subset of the whole data universe.
- The “Chunk Loop Start” node then divides the training data set into X partitions and loops over them.
- At each iteration, one partition is used to train a model. In this case, the model is a decision tree, but it could be any other classification scheme.
- Finally, the “Model Loop End” node collects all models resulting from the learner node at each iteration.
The training loop then produces a list of X models, each one trained on a different partition of the training set.
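A sketch of this training loop, under the same scikit-learn assumption as above; note that, unlike classic bootstrap sampling, the “Chunk Loop Start” node produces disjoint partitions:

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def train_bagging_models(X_train, y_train, n_chunks=10, seed=0):
    rng = np.random.default_rng(seed)
    # Shuffle the data so that each chunk covers the whole data universe.
    order = rng.permutation(len(X_train))
    models = []
    # The chunk loop: one disjoint partition, and one trained model, per iteration.
    for chunk in np.array_split(order, n_chunks):
        models.append(DecisionTreeClassifier().fit(X_train[chunk], y_train[chunk]))
    return models  # the list of models collected by the "Model Loop End" node
```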
The test step
This is also a loop, this time running over the X trained models.
- The “Model Loop Start” node starts a loop on the models listed in the input data. At each iteration, one model of the list is loaded.
- The predictor node applies the current model to the test set. Notice that the predictor node is connected to the lower input port of the “Bagging” meta-node. This is because it has to operate on the test set.
- The loop is closed by a “Voting Loop End” node. This node collects the results produced by the predictor node with each model in separate columns; its output data table therefore contains as many columns as classification models. In addition, the node performs the majority vote, i.e. it outputs the final classification result for each test data row based on how many classifiers agree on a given output class.
Below, the output data table is shown for a “Bagging” meta-node using 10 decision trees.
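A sketch of this test loop, using pandas to reproduce the one-column-per-model layout of the “Voting Loop End” output (again, the names are illustrative):

```python
from collections import Counter
import pandas as pd

def apply_and_vote(models, X_test):
    # One prediction column per model, as in the "Voting Loop End" output table.
    table = pd.DataFrame({f"model_{i}": m.predict(X_test)
                          for i, m in enumerate(models)})
    # Majority vote across the model columns for each test row.
    table["majority_vote"] = table.apply(
        lambda row: Counter(row).most_common(1)[0][0], axis=1)
    return table
```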
The parameter defining the number of classifiers to use is the number of data chunks in the “Chunk Loop Start” node.
The “Bagging” node can be easily parameterized by using a Quickform “Integer Input” node. This Quickform node defines the number of models to be created and controls the number of data chunks in the “Chunk Loop Start” node (see sub-workflow below).
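Putting the two sketches above together, with n_chunks playing the role of the Quickform parameter (toy data from scikit-learn; all names are illustrative):

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=1000, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

models = train_bagging_models(X_train, y_train, n_chunks=10)  # 10 classifiers
results = apply_and_vote(models, X_test)
accuracy = (results["majority_vote"] == y_test).mean()
```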
In the following workflow, we compare the performance of our parameterized “Bagging” node using decision trees vs. a simple decision tree configured similarly to the decision trees in the “Bagging” node.
Notice that the “Bagging” node includes both training and testing phases, unlike the simple decision tree technique.
The “bagging” strategy displays an accuracy of 0.835, slightly better than the accuracy of 0.821 displayed by the simple decision tree, as was to be expected.