The output cross-table also contains the following columns: "Expected", "Deviation" and "Chi-square". These columns refer to the Chi-square test of statistical independence.

**The Chi-square test of independence**

The null Hypothesis: two variables are independent

The Chi-square test of independence assumes as null hypothesis that two categorical variables are independent.

As in the previous post, let's use the adult data set from the UCI Machine Learning Repository and test the independence of the variables "sex" and "income". The null hypothesis is that "sex" and "income" are independent; that is, the distribution of "income" is the same for every value of "sex" in the data set.

**The Expected frequency values**

Under the hypothesis of independence, the frequency of each cell (i,j) in the contingency table is expected to depend only on the totals of its row and its column; that is:

*E(i,j) = total row(i) × total column(j) / N*

where N is the total number of observations.

The E(i,j) values are the expected frequencies reported in the "Expected" column of the cross-table at the output port of the "Crosstab" node.
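As a rough illustration of the formula above, the expected frequencies can be computed from the row and column totals of a contingency table. The counts below are made up for the example and are not taken from the adult data set; the variable names are assumptions as well.

```python
# Hypothetical 2x2 contingency table of observed counts (made-up numbers,
# NOT the "sex" x "income" table from the adult data set).
observed = [
    [30, 10],
    [20, 40],
]

row_totals = [sum(row) for row in observed]          # [40, 60]
col_totals = [sum(col) for col in zip(*observed)]    # [50, 50]
n = sum(row_totals)                                  # total observations N

# E(i,j) = total row(i) * total column(j) / N
expected = [[r * c / n for c in col_totals] for r in row_totals]

print(expected)  # [[20.0, 20.0], [30.0, 30.0]]
```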

Column "Deviation" reports the deviations between the expected frequency and the observed frequency:

*O(i,j) - E(i,j)*

For each cell, a variable can be calculated as:

*(O(i,j) - E(i,j))² / E(i,j)*

The values of this variable are reported in the "Cell Chi-Square" column of the output cross-table.
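The two per-cell quantities described above can be sketched in a few lines. Again, the observed counts are hypothetical, and the expected values are the ones that the row total × column total / N formula yields for them.

```python
# Hypothetical observed counts and their expected frequencies
# (expected = row total * column total / N for this made-up table).
observed = [[30, 10], [20, 40]]
expected = [[20.0, 20.0], [30.0, 30.0]]

# "Deviation": O(i,j) - E(i,j)
deviation = [
    [o - e for o, e in zip(o_row, e_row)]
    for o_row, e_row in zip(observed, expected)
]

# Per-cell chi-square contribution: (O(i,j) - E(i,j))^2 / E(i,j)
cell_chi_square = [
    [(o - e) ** 2 / e for o, e in zip(o_row, e_row)]
    for o_row, e_row in zip(observed, expected)
]

print(deviation)  # [[10.0, -10.0], [-10.0, 10.0]]
print(cell_chi_square)
```

Note that the deviations in each row and in each column sum to zero, while the cell chi-square values are always non-negative.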

We report here the contingency table calculated in the last post for "sex" and "income".

**The Chi-square statistic variable**

Summing all the "Cell Chi-Square" values, we obtain the following statistic:

*chi-square = sum(i,j) (O(i,j) - E(i,j))² / E(i,j)*

which approximately follows a chi-square distribution with

*(r-1) × (c-1)*

degrees of freedom, where r and c are the numbers of rows and columns in the contingency table.
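Putting the pieces together, a minimal sketch of the whole statistic might look as follows. The function name `chi_square_statistic` and the input counts are hypothetical; this is not how the "Crosstab" node is implemented, only a direct transcription of the formulas above.

```python
def chi_square_statistic(observed):
    """Return (chi-square, degrees of freedom) for a table of observed counts."""
    row_totals = [sum(row) for row in observed]
    col_totals = [sum(col) for col in zip(*observed)]
    n = sum(row_totals)

    chi_square = 0.0
    for i, row in enumerate(observed):
        for j, o in enumerate(row):
            e = row_totals[i] * col_totals[j] / n  # expected frequency E(i,j)
            chi_square += (o - e) ** 2 / e         # cell chi-square

    # (r-1) x (c-1) degrees of freedom
    df = (len(row_totals) - 1) * (len(col_totals) - 1)
    return chi_square, df


# Hypothetical counts, not the adult data set.
chi_square, df = chi_square_statistic([[30, 10], [20, 40]])
print(round(chi_square, 3), df)  # 16.667 1
```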

The "Crosstab" node offers a second data table at the bottom output port. This data table, named "Statistics Table", contains a few summary statistical variables useful for our test of independence.

First of all, the "Chi Square" cell contains the chi-square value calculated with the formula above, which approximately follows the chi-square distribution with DF degrees of freedom. The degrees of freedom are reported in cell "Chi Square (DF)" of the same "Statistics Table". In our case, the chi-square value is 1518 and the distribution has 1 degree of freedom, since both "sex" and "income" take two values: (2-1) × (2-1) = 1.

**The p-value**

Notice that if the two variables are truly independent, the observed frequencies should be very close to the expected frequencies and the chi-square statistic should take a small value. The larger the chi-square value, the less the two variables seem to be independent. Cell "Chi Square (prop)" gives the probability P(x >= X), where X is the "Chi Square" value in the "Statistics Table" and x follows the chi-square distribution with DF degrees of freedom. In our case

*P(x >= 1518) = 0*

that is, the probability is practically zero.

If we fix a significance level of 5% (0.05) and P(x >= X) lies below this threshold, the null hypothesis of statistical independence between the two variables can be rejected.
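The decision rule can be sketched for the special case of 1 degree of freedom, where the upper-tail probability has a closed form: a chi-square variable with 1 degree of freedom is the square of a standard normal variable, so P(x >= X) = erfc(sqrt(X / 2)). The function name and the input value below are hypothetical, and for other degrees of freedom a statistics library (for example `scipy.stats.chi2.sf`) would be needed instead.

```python
import math


def chi_square_p_value_df1(x):
    """P(X >= x) for a chi-square variable X with 1 degree of freedom only."""
    return math.erfc(math.sqrt(x / 2.0))


ALPHA = 0.05  # significance level (5%)

chi_square = 50 / 3  # hypothetical statistic from a made-up 2x2 table
p_value = chi_square_p_value_df1(chi_square)

# Reject the null hypothesis of independence if the p-value is below ALPHA.
reject_independence = p_value < ALPHA

print(p_value, reject_independence)
```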

In our case, the probability lies well below the threshold, so the null hypothesis is rejected: "income" is unfortunately not independent from "sex".

In the figure below the "Statistics Table" is shown, as produced by the "Crosstab" node.