Information in Data Columns with Low Variance
Another way of measuring how much information a data column carries is to measure its variance. In the limit case where all cells of a column assume a constant value, the variance is zero and the column is of no help in discriminating between different groups of data.
The Low Variance Filter node calculates the variance of each column and removes those columns whose variance falls below a given threshold.
Notice that the variance can only be calculated for numerical columns, i.e. this dimensionality reduction method applies only to numerical columns. Note, too, that the variance value depends on the column's numerical range. Therefore, data column ranges need to be normalized before calculating their variance, so that the variance values are independent of the column domain range.
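To illustrate why normalization matters, here is a minimal Python sketch using only the standard library. The data and the helper name `min_max_normalize` are hypothetical, and the population variance is used as an assumption; the text does not state which variance estimator the node applies. The same measurements expressed in two different units have very different raw variances, but identical variances after rescaling to [0, 1].

```python
from statistics import pvariance


def min_max_normalize(values):
    """Rescale values linearly to [0, 1]; a constant column maps to all zeros."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


# Hypothetical column: the same heights in meters and in centimeters.
height_m = [1.5, 1.6, 1.7, 1.8, 1.9]
height_cm = [v * 100 for v in height_m]

# Raw variances differ by a factor of 100^2 purely because of the unit.
var_m = pvariance(height_m)
var_cm = pvariance(height_cm)

# After min-max normalization both columns have the same variance.
norm_var_m = pvariance(min_max_normalize(height_m))
norm_var_cm = pvariance(min_max_normalize(height_cm))

print(var_m, var_cm)
print(norm_var_m, norm_var_cm)
```

Without the normalization step, a single threshold would treat the centimeter column as far more informative than the meter column, although both carry the same information.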
Low Variance Filter Sub-Workflow
First, a Normalizer node normalizes all column ranges to [0, 1]; next, a Low Variance Filter node calculates the variance of each column and filters out the columns with a variance lower than a set threshold; finally, all remaining columns are de-normalized to return to their original numerical range.
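The three steps of the sub-workflow can be sketched in plain Python as follows. This is not the KNIME implementation: the function name `low_variance_filter`, the dictionary-of-lists data layout, the example columns, and the use of the population variance are all assumptions made for illustration.

```python
from statistics import pvariance


def low_variance_filter(columns, threshold):
    """Keep columns whose variance on the [0, 1] scale is at least `threshold`.

    `columns` maps a column name to a list of numbers. The returned columns
    hold the original (de-normalized) values.
    """
    kept = {}
    for name, values in columns.items():
        lo, hi = min(values), max(values)
        span = hi - lo
        # Step 1: min-max normalize to [0, 1]; a constant column becomes all zeros.
        if span:
            normalized = [(v - lo) / span for v in values]
        else:
            normalized = [0.0] * len(values)
        # Step 2: drop the column if its normalized variance is below the threshold.
        if pvariance(normalized) >= threshold:
            # Step 3: return to the original range by keeping the original values.
            kept[name] = list(values)
    return kept


# Hypothetical input table with three numerical columns.
table = {
    "constant": [5, 5, 5, 5],            # variance 0
    "spread": [0, 10, 20, 30],           # evenly spread values
    "near_constant": [100] * 49 + [200], # a single deviating cell
}

filtered = low_variance_filter(table, threshold=0.03)
print(list(filtered))
```

In this sketch, de-normalization reduces to keeping the original values, since the normalized copy is only needed to compute a range-independent variance.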
As with the previously published method (Removing Data Columns with Too Many Missing Values), the optimal threshold can be defined through an optimization loop that maximizes the classification accuracy on a validation set for the best out of three classification algorithms: MLP, decision tree (C4.5), and Naïve Bayes.
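The structure of such an optimization loop can be sketched as below. The function `best_threshold` and the scoring callables are hypothetical: in practice, each callable would filter the columns at the given threshold, train one classifier, and return its accuracy on the validation set. The toy scorers used here are placeholders that only make the sketch runnable; they are not measurements.

```python
def best_threshold(thresholds, classifiers):
    """Return (threshold, classifier_name, accuracy) with the highest accuracy.

    `classifiers` maps a name to a callable taking a threshold and returning
    the validation accuracy obtained after filtering at that threshold.
    """
    best = None
    for t in thresholds:
        for name, evaluate in classifiers.items():
            acc = evaluate(t)
            # Keep the best (threshold, classifier) pair seen so far.
            if best is None or acc > best[2]:
                best = (t, name, acc)
    return best


# Toy stand-ins for real train-and-validate routines (not real results).
toy_scorers = {
    "A": lambda t: 0.5,
    "B": lambda t: 0.9 - abs(t - 0.2),
}

result = best_threshold([0.0, 0.1, 0.2, 0.3], toy_scorers)
print(result)
```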
Using this approach on the small data set from the KDD 2009 challenge, the best threshold value was found to be 0.03, corresponding to a classification accuracy of 82% by the MLP on the evaluation set and a dimensionality reduction of 73%.
Higher threshold values (i.e. more tolerant methods) actually produce worse accuracy values, suggesting that dimensionality reduction matters not only for execution time but also for performance improvement.