This post concludes my series on the "Crosstab" node. If we measured how powerful a node is by the number of posts needed to describe it, we would definitely conclude that the new "Crosstab" node is one of the most powerful nodes in the KNIME data analysis platform.

In one of the previous posts we showed how to test the independence of two variables by using the chi-square test. However, we also said that the chi-square statistic only approximately follows a chi-square distribution, which means that the chi-square test is only an approximation of the independence test. While this approximation is acceptable for large amounts of data, it starts falling short for small data samples. In this case, it is preferable to use the Fisher exact test.

The Fisher exact test is, as the name says, exact rather than approximate, as long as the row and column totals are fixed. The mathematics behind it, though, quickly becomes computationally prohibitive for large amounts of data. The Fisher exact test is therefore best suited to small data samples.

The null hypothesis in the Fisher test is again that the two variables are independent.
Fisher's observations were based on the contingency table. Fisher showed that the probability p of obtaining the observed set of values in the contingency table is given by a hypergeometric distribution and can be calculated accordingly (http://en.wikipedia.org/wiki/Fisher's_exact_test). This formula gives the exact probability of observing this particular arrangement of data, assuming the given marginal totals and under the null hypothesis that the two variables are independent.
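
As an illustration of the hypergeometric formula for a 2x2 contingency table, here is a minimal Python sketch. The function name table_probability and the example counts are assumptions made for this illustration only; they are not taken from the "Crosstab" node or from the adult data set.

```python
from math import comb

def table_probability(a, b, c, d):
    """Probability of the 2x2 table [[a, b], [c, d]] under independence,
    with the row and column totals held fixed (hypergeometric distribution)."""
    n = a + b + c + d
    # Ways to fill the first column from each row, divided by
    # all ways to choose the first-column total out of n observations.
    return comb(a + b, a) * comb(c + d, c) / comb(n, a + c)

# Hypothetical small table: rows are groups, columns are outcomes.
p = table_probability(3, 1, 1, 3)
print(round(p, 4))  # 16/70, roughly 0.2286
```

Only the standard library is used, so the sketch works for small counts without any extra packages.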

Fisher then showed that, to generate a significance level for the test, we need to consider only those contingency tables whose marginal totals are the same as in the observed contingency table and, among those, only the tables that are as extreme as the observed one or more extreme. Suppose there are n such tables. The probability of each of them can be calculated by means of the hypergeometric distribution, as p1, ..., pn. The sum of these probabilities, sum(p), is taken as the test statistic.
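
The set of tables with the same marginal totals can be listed explicitly for a 2x2 table, because fixing the margins leaves only the top-left cell free. The following sketch assumes a 2x2 table and uses a hypothetical helper name, tables_with_same_margins; it is meant only to show the idea, not how KNIME computes it.

```python
from math import comb

def tables_with_same_margins(a, b, c, d):
    """List (top_left_cell, probability) for every 2x2 table that shares
    the row and column totals of the observed table [[a, b], [c, d]]."""
    row1, row2, col1 = a + b, c + d, a + c
    n = row1 + row2
    lowest = max(0, col1 - row2)   # smallest feasible top-left cell
    highest = min(row1, col1)      # largest feasible top-left cell
    result = []
    for x in range(lowest, highest + 1):
        p = comb(row1, x) * comb(row2, col1 - x) / comb(n, col1)
        result.append((x, p))
    return result

# Hypothetical observed table [[3, 1], [1, 3]]
for x, p in tables_with_same_margins(3, 1, 1, 3):
    print(x, round(p, 4))
```

The probabilities of all listed tables add up to 1; the test then sums only those belonging to tables at least as extreme as the observed one.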

A low value of sum(p) indicates that the observed contingency table, together with all the more extreme ones, would be very unlikely if the two variables were independent; therefore the two variables cannot be considered independent. Setting the threshold to 0.05, for example, allows us to reject the null hypothesis of statistical independence if sum(p) < 0.05.

If, to calculate sum(p), we consider the extreme configurations of the contingency table in one direction only, then we have a one-tailed Fisher exact test. If we also consider the equally extreme configurations in the opposite direction, then we have a two-tailed Fisher exact test.
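
The difference between the two variants can be sketched in a few lines of Python for a 2x2 table. Two assumptions are made here: the one-tailed sum runs over tables with a top-left cell at least as large as the observed one, and "equally extreme in the opposite direction" is read as "no more probable than the observed table", which is one common convention. The function name fisher_exact_2x2 is hypothetical.

```python
from math import comb

def fisher_exact_2x2(a, b, c, d):
    """Return (one_tailed, two_tailed) sum(p) for the table [[a, b], [c, d]]."""
    row1, row2, col1 = a + b, c + d, a + c
    n = row1 + row2

    def prob(x):
        # Hypergeometric probability of the table with top-left cell x
        return comb(row1, x) * comb(row2, col1 - x) / comb(n, col1)

    lowest, highest = max(0, col1 - row2), min(row1, col1)
    p_observed = prob(a)

    # One tail: observed table plus tables more extreme in the same direction
    # (assumption: larger top-left cell means more extreme).
    one_tailed = sum(prob(x) for x in range(a, highest + 1))

    # Two tails: every table that is no more probable than the observed one
    # (small relative tolerance guards against floating-point ties).
    two_tailed = sum(
        prob(x)
        for x in range(lowest, highest + 1)
        if prob(x) <= p_observed * (1 + 1e-9)
    )
    return one_tailed, two_tailed

one_tailed, two_tailed = fisher_exact_2x2(3, 1, 1, 3)
print(round(one_tailed, 4), round(two_tailed, 4))  # 17/70 and 34/70
```

For this hypothetical table both values are far above 0.05, so independence would not be rejected.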

Let's now check the results of the "Crosstab" node when applied to the adult data set of the UCI Machine Learning Repository, in particular to test the independence between "sex" and "income".

The "Crosstab" node produces a "statistics table" containing mainly the chi-square values. The last cell of the statistics table contains the sum(p) value for the two-tailed Fisher exact test. In our case, sum(p) is 0, which allows us to reject the null hypothesis of statistical independence between "sex" and "income". This is the same conclusion that we reached with the chi-square test.
Just a small note to conclude. The adult data set is very large. In this case, the chi-square test could be more appropriate than the Fisher exact test. However, we used the Fisher exact test in this post just to show the information made available by the "Crosstab" node in its output "statistics table".