I am currently working on a whitepaper listing and describing seven classical methods for dimensionality reduction. Extracts from that whitepaper will be posted on this blog regularly. This is the first post of the series.
Information in Data Columns with too many Missing Values
The most straightforward way to reduce a dataset dimensionality runs through the count of missing values. While there are many strategies to interpolate missing values in a data column (see blog post “How to deal with Missing Values” in the KNIME blog), it is undeniable that if the majority of data values is missing, the data column itself cannot carry that much information. For example, if a data column fills only 5-10% of the possible cells, it will likely not help in the classification of most records. The goal, then, becomes to remove those data columns with too many missing values, i.e. with more missing values in percent than a given threshold.
Counting Missing Values
To count the number of missing values, we can either use a Statistics or a GroupBy node.
The Statistics node produces three data tables at the output ports: a statistics table, a nominal histogram table, and an occurrences table. The nominal histogram table contains the number of missing values for each data column. The statistics table contains, among other statistical measures, the total number of rows in the input data table. Based on these two values, a Math Formula node can calculate the ratio of missing values for each column as:
Ratio of missing values = number of missing values / total number of rows
If we use the GroupBy node, in the “Pattern Based Aggregation” tab, we have to set “.*” as the RegEx search pattern for the column names where to count the missing values. After transposing the results, the same Math Formula node calculates the ratio of missing values, as described in the formula above.
A Rule-based Row Filter node applies the rule for a too high missing value ratio (that is ratio > threshold value) and filters out all columns fitting the rule.
The threshold value selection is of course a crucial topic, since a too aggressive threshold will reduce dimensionality at the expense of performance, while a too soft threshold might not get the best reduction ratio.
To find the best threshold value, we used an optimization loop and maximized the classification algorithm accuracy on the test set. However, in some particular situations one classification algorithm might perform better than others. To be on the safe side, we ran the optimization loop on three possible classification algorithms ‒ neural network (MLP), decision tree, and Naïve Bayes – selecting the best method accuracy as the value to be maximized in the Parameter Optimization Loop End node.
Full Process and Final Remarks
The full process of eliminating data columns with too many missing values is shown in the figure below. The procedure to actually count the number of missing values and filter out the columns can be found in the “Column Selection by Missing Values” metanode.
Figure 1. Content of "Best Threshold on Missing Values" metanode: optimization loop to find best threshold on ratio of missing values with maximum accuracy
Using this approach and using the small KDD data set from the KDD 2009 challenge, the best threshold value was found at 0.4, leading to a dimensionality reduction of 71% and a corresponding accuracy of 76% by the MLP. However, by checking the accuracy results associated with each loop threshold value, we noticed that already by cutting out the data columns with more than 90% missing values, i.e. reaching a 67% reduction, we kept a 76% accuracy on the evaluation set. Which is already quite impressive, considering that we were just counting!