The "crosstab" node was introduced in KNIME 2.4 under the "Statistics" category. It performs a number of different tasks and calculates several statistically interesting variables, opening the door to using KNIME for statistical tests.
First of all, the "crosstab" node selects two columns, let's say col1 and col2, from the input data table and builds an r x c matrix of observations, that is, a matrix with r rows and c columns, where r is the number of distinct values in col1 and c is the number of distinct values in col2. Each cell then reports the number of observations for the corresponding pair of values from col1 and col2. This kind of matrix is called a "contingency table".
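To make the r x c idea concrete outside KNIME, here is a minimal sketch in Python with pandas. It is only an analogue of what the node does, not its implementation, and the toy values in col1 and col2 are made up for illustration.

```python
import pandas as pd

# Hypothetical toy data: col1 and col2 stand in for any two nominal columns.
df = pd.DataFrame({
    "col1": ["a", "a", "b", "b", "b", "c"],
    "col2": ["x", "y", "x", "x", "y", "y"],
})

# r x c matrix of counts: one row per distinct value of col1,
# one column per distinct value of col2.
table = pd.crosstab(df["col1"], df["col2"])
print(table)
```

Here col1 has three distinct values and col2 has two, so the result is a 3 x 2 contingency table whose cells add up to the number of input rows.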
The 2x2 contingency table
The simplest and best-known example of a contingency table is the 2x2 contingency table.
To check how the new "crosstab" node works, I built a 2x2 contingency table on the "sex" and "income" columns of the adult data set (UCI Machine Learning Repository).
The resulting contingency table should look something like this:
The data here is not organized as a matrix, but all necessary information required for a 2x2 contingency table is present:
- "sex" contains the list of distinct values for original column "sex" (Female, Male)
- "income" contains the list of distinct values for original column "income" (<=50K, >50K)
- "Frequency" contains the number of observations for each pair of values ("sex", "income"); that is the count of rows in the original data table for each pair of values
- "Percent" is like "Frequency" but expressed as a percentage of the total count
- "Column Percent" and "Row Percent" express the ratio of "Frequency" to the column and row total, respectively
- "Total Row Count" contains the totals by row
- "Total Column Count" contains the totals by column
- "Total Count" contains the total number of rows/observations
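The columns listed above can be approximated with a few lines of pandas. This is a rough sketch, not the node's own code: the eight rows are invented stand-ins for the adult data set, and it assumes "sex" plays the role of the row variable and "income" the column variable.

```python
import pandas as pd

# Hypothetical toy rows standing in for the adult data set.
df = pd.DataFrame({
    "sex": ["Female", "Female", "Female", "Male", "Male", "Male", "Male", "Male"],
    "income": ["<=50K", "<=50K", ">50K", "<=50K", "<=50K", "<=50K", ">50K", ">50K"],
})

# One output row per ("sex", "income") pair, with the number of matching input rows.
long = df.groupby(["sex", "income"]).size().reset_index(name="Frequency")

# Totals by row variable, by column variable, and overall
# (assumption: "sex" is the row variable, "income" the column variable).
long["Total Row Count"] = long.groupby("sex")["Frequency"].transform("sum")
long["Total Column Count"] = long.groupby("income")["Frequency"].transform("sum")
long["Total Count"] = long["Frequency"].sum()

# Percentages relative to the overall, row, and column totals.
long["Percent"] = 100 * long["Frequency"] / long["Total Count"]
long["Row Percent"] = 100 * long["Frequency"] / long["Total Row Count"]
long["Column Percent"] = 100 * long["Frequency"] / long["Total Column Count"]

print(long)
```

As in the node's output, the result is in long format: one line per pair of values rather than a matrix, with the totals repeated on every line.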
Notice that a contingency table is a normal crosstab table built on the count of rows. A similar table, with the proper matrix layout, could also have been built with a "Pivoting" node, using "sex" as the group column, "income" as the pivot column, and "count" or "percent" as the aggregation method.
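The same group / pivot / aggregate recipe can be sketched in pandas. This is an analogue of the pivoting idea rather than a description of the "Pivoting" node itself, and the rows are again invented toy data.

```python
import pandas as pd

# Hypothetical toy rows standing in for the adult data set.
df = pd.DataFrame({
    "sex": ["Female", "Female", "Female", "Male", "Male", "Male", "Male", "Male"],
    "income": ["<=50K", "<=50K", ">50K", "<=50K", "<=50K", "<=50K", ">50K", ">50K"],
})

# Group by "sex", count rows per "income" value, then spread "income" into columns.
matrix = df.groupby(["sex", "income"]).size().unstack(fill_value=0)

# The same matrix with each cell expressed as a percentage of all rows.
percent = 100 * matrix / matrix.values.sum()

print(matrix)
print(percent)
```

Using the count as the aggregation gives the matrix-shaped contingency table, and dividing by the grand total gives the percent variant.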
The cross table at the output port of the "Crosstab" node contains more than just the number of observations and their totals to build a contingency table. But we'll talk about that in the next post.