Boosting

01/15/2013

And after bagging, we do need to talk about boosting.
Boosting is another committee-based approach. It works with weights in both steps: learning and prediction.

Learning.
During the learning phase, the boosting procedure trains a learning algorithm a number of times, each time using a slightly different composition of the training set.
At each iteration, the boosting algorithm:
      -          Starts with the training set built in the previous iteration,
      -          Trains a new model,
      -          Evaluates the model error on the training patterns,
      -          Calculates the model weight based on such error,
      -          Finally, builds a new training set by over-sampling the incorrectly classified training patterns and under-sampling the correctly classified ones. The over-sampling/under-sampling factor derives from the model weight.

The training set for the first iteration is the training set provided for the whole learning procedure.
The algorithm stops when the maximum number of iterations has been reached or when the model error is too big (that is, the model weight is too close to 0 and the corresponding model is therefore ineffective).
The output of this learning phase is a number of models, less than or equal to the selected maximum number of iterations.
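
The learning loop described above can be sketched in a few lines of plain Python. This is only a minimal illustration under stated assumptions, not the KNIME implementation: the post gives no formula for the model weight, so the sketch assumes the multi-class AdaBoost (SAMME) weight, log((1 - error) / error) + log(K - 1) for K classes. The names train_stump and boost are hypothetical, the weak learner is a one-threshold rule on a single numeric feature, and the over-sampling/under-sampling is done by drawing patterns with replacement according to their weights.

```python
import math
import random


def train_stump(rows):
    """Hypothetical weak learner: best one-threshold rule on one numeric feature."""
    labels = sorted({label for _, label in rows})
    best = None
    for threshold in sorted({x for x, _ in rows}):
        for low in labels:
            for high in labels:
                errors = sum((low if x <= threshold else high) != label
                             for x, label in rows)
                if best is None or errors < best[0]:
                    best = (errors, threshold, low, high)
    _, threshold, low, high = best
    return lambda x: low if x <= threshold else high


def boost(rows, max_iterations, seed=0):
    """Return a list of (model_weight, model) pairs, at most max_iterations long."""
    n_classes = len({label for _, label in rows})
    if n_classes < 2:
        raise ValueError("need at least two classes")
    rng = random.Random(seed)
    pattern_weights = [1.0 / len(rows)] * len(rows)
    sample = list(rows)  # the first iteration uses the training set as provided
    models = []
    for _ in range(max_iterations):
        model = train_stump(sample)
        # Evaluate the weighted error on the original training patterns.
        wrong = [model(x) != label for x, label in rows]
        error = sum(w for w, bad in zip(pattern_weights, wrong) if bad)
        error = min(max(error, 1e-10), 1.0 - 1e-10)  # keep the logarithm defined
        # Assumed weight formula (SAMME variant of AdaBoost).
        model_weight = math.log((1.0 - error) / error) + math.log(n_classes - 1)
        if model_weight <= 0:  # error too big: the model is ineffective, stop
            break
        models.append((model_weight, model))
        # Raise the weight of the misclassified patterns, then renormalise.
        pattern_weights = [w * math.exp(model_weight) if bad else w
                           for w, bad in zip(pattern_weights, wrong)]
        total = sum(pattern_weights)
        pattern_weights = [w / total for w in pattern_weights]
        # Over-sample / under-sample by drawing with replacement.
        sample = rng.choices(rows, weights=pattern_weights, k=len(rows))
    return models


# Toy data: class "a" at both ends, class "b" in the middle.
rows = [(x, "a" if x <= 3 or x >= 8 else "b") for x in range(1, 11)]
models = boost(rows, max_iterations=5)
print(len(models), "models trained")
```

A single threshold cannot separate the toy data, which is the kind of weak classifier boosting is meant for.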

Note that boosting can be applied to any training algorithm, but it is particularly helpful with weak classifiers. With strong classifiers it pays off less, because boosting is quite sensitive to noise and outliers: the later models concentrate on the patterns that are hardest to classify, and this can lead to overfitting.

Prediction.
The prediction phase loops on all models available from the learning phase and provides a prediction based on a weighted majority vote for classifiers and on a weighted average for regression techniques, using the model weights calculated during learning.
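
As a rough illustration of how the model weights enter the prediction, here is a small Python sketch. The function names weighted_vote and weighted_average, the class labels, and the numbers are made up for the example; each input pair holds a model weight and that model's prediction for one test pattern.

```python
from collections import defaultdict


def weighted_vote(weighted_predictions):
    """Classification: return the class with the largest total model weight."""
    totals = defaultdict(float)
    for weight, predicted_class in weighted_predictions:
        totals[predicted_class] += weight
    return max(totals, key=totals.get)


def weighted_average(weighted_predictions):
    """Regression: average the predicted values, weighted by model weight."""
    total_weight = sum(weight for weight, _ in weighted_predictions)
    return sum(weight * value for weight, value in weighted_predictions) / total_weight


# One heavily weighted model outvotes two lightly weighted ones (0.9 vs 0.7).
print(weighted_vote([(0.9, "yes"), (0.3, "no"), (0.4, "no")]))  # yes
print(weighted_average([(1.0, 10.0), (3.0, 20.0)]))  # 17.5
```

Note that the class chosen by two of the three models loses here, which is the difference between a weighted vote and a plain majority vote.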

KNIME Implementation
KNIME implements AdaBoost, one of the most commonly used boosting algorithms, with two meta-nodes in the “Mining -> Ensemble Learning” category: the “Boosting Learner” and the “Boosting Predictor” meta-node.

The “Boosting Learner” meta-node (see figure below) implements the learning loop via the “Boosting Learner Loop Start” node and the “Boosting Learner Loop End” node.
The “Boosting Learner Loop End” node (see figure below) sets the maximum number of iterations, the target column, and the predicted column. The target column and the predicted column are used to:
    -          Identify the mis-classified patterns
    -          Calculate the model error
    -          Calculate the model weight
The “Boosting Learner Loop Start” node uses the model weight and the mis-classified patterns to alter the composition of the training set.
The loop body includes any supervised training algorithm node, like a “Decision Tree Learner” or a “Naïve Bayes Learner”, and its corresponding predictor node. The predictor node is necessary, even though this is a learning meta-node, because each iteration needs to identify the correctly and incorrectly classified patterns and to calculate the model error.

For each iteration the boosting loop outputs the model, its error, and its weight.

The “Boosting Predictor” meta-node receives the model list from the learner node and the test set patterns. For each test pattern, it loops on all models and weighs their prediction result.
The “Boosting Predictor Loop Start” node starts the boosting predictor loop by identifying the weight column and the model column (see settings in its configuration window).

The “Boosting Predictor Loop End” node implements the weighted majority vote on all model results and assigns the final value to the test pattern. Its configuration window requires the identification of the prediction column.

The loop body just includes the predictor of the mining model selected for the learning phase.

Below you can see our implementation of boosting in KNIME, with a decision tree on the cars-85.csv data set. The task here was to predict a car's fuel system based on its number of doors, wheel base, and width, using at most 5 models. Boosting produced 5 models (the maximum number allowed) and reached 0.714 accuracy.
A simple decision tree alone on the same task also scores 0.714 accuracy. This suggests that boosting pays off mainly for complex problems or weak classifiers. For strong classifiers and simple problems, the additional models tend to model only noise and outliers.
 
 
For two months, from May 27, 2010 to July 27, 2010, I recorded the number of visitors to this blog. I then fed this data into the "Statistics" node of the KNIME data mining tool.
Here are the results.

The minimum number of visitors per day is zero on Saturday May 29. The maximum number of visitors per day is 33 on Monday July 12, which is the date when I posted the statistics and fit measure for the Soccer World Cup 2010.

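The average number of visitors per day is 12.387. I must say it is unbelievable that 12 people on average read my blog, especially considering the irregularity of my posting deadlines. The standard deviation is 7.376. This means that on a typical day the number of visitors falls roughly between 5 and 20 (one standard deviation around the mean), although single days can fall outside this range, as the minimum of 0 and the maximum of 33 show.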
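
The band of one standard deviation around the mean can be checked with a couple of lines of Python. The variable names are only for illustration, and the band describes a typical day, not a hard limit.

```python
mean = 12.387     # average visitors per day, from the Statistics node
std_dev = 7.376   # standard deviation, from the Statistics node

low, high = mean - std_dev, mean + std_dev
print(f"{low:.3f} to {high:.3f}")  # 5.011 to 19.763
```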

The number of visitors is higher from Monday through Thursday and decreases noticeably on Friday and especially on Saturday and Sunday. This means that most of you read my blog during your business hours :-)

Thus I decided to train a Bayesian Network to differentiate between weekend and not weekend days based on the number of visitors. I mapped the weekday into a binary variable (weekend/not weekend) and removed the original weekday column from the data. The Bayesian Predictor was then applied to the training data itself. I know it is not orthodox, but I just wanted to play with the KNIME Bayesian classifier and I did not have much data at hand.

The Scorer node gave me an accuracy of 0.742. The confusion matrix tells me that 6 weekend days (out of 18) have been classified as not weekend and 10 not weekend days (out of 44) as weekend. I used the "Interactive Table" node to isolate such days and I saw that the misclassified weekend days followed a new post, so the number of visitors was higher than the average number of weekend visitors. Again, the "Interactive Table" node told me that the misclassified not weekend days were far away from posting times, like yesterday, and the number of visitors was then exceptionally low.
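
The accuracy can be re-derived from the confusion matrix counts quoted above. A minimal Python sketch, where the accuracy function and the dictionary layout are just one illustrative way to write it down:

```python
def accuracy(confusion):
    """Fraction of days whose predicted class equals the actual class."""
    correct = sum(count for (actual, predicted), count in confusion.items()
                  if actual == predicted)
    return correct / sum(confusion.values())


# Keys are (actual class, predicted class); counts come from the text.
confusion = {
    ("weekend", "weekend"): 18 - 6,
    ("weekend", "not weekend"): 6,
    ("not weekend", "weekend"): 10,
    ("not weekend", "not weekend"): 44 - 10,
}
print(round(accuracy(confusion), 3))  # 0.742
```

That is 46 correctly classified days out of 62, which matches the value reported by the Scorer node.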

The whole work was only an exercise to test KNIME for the implementation of:
- Bayesian Network (Learner and Predictor node)
- Interactive brushing ("Interactive Table" node)
- accuracy measures ("Scorer" node)

As you can see I have learned a lot of things about the average reader's weekly habits.
Thanks for following my blog. I will try to post more often to keep your interest and number of visits high.