Suppose now that we want to cluster some data by country; that is, we want to build a separate set of clusters for each country represented in the data.
First of all, we need the list of unique values in the “country” column.
Then, we need to keep only the rows referring to the country of interest and run, for example, a k-means clustering on the filtered data.
We also want to save the clusters into a PMML file. The clustering should be repeated for all countries present in the data.
With only a few countries, we could build one meta-node per country, each containing the same procedure but processing the data of a different country. However, if the data contains many countries, building that many identical meta-nodes becomes cumbersome. We need a loop that isolates the data rows of each country in turn and passes them through the same procedure.
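Outside KNIME, the same idea can be sketched as a plain loop. The following Python snippet is only an illustration, not KNIME code: the sample rows, the "value" column, and the helper `kmeans_1d` are assumptions made up for this example, and the k-means is a deliberately minimal one-dimensional version. Writing the clusters to a PMML file is left out.

```python
# Illustrative sketch only: column names and data are made up for this example.

def kmeans_1d(values, k, iterations=10):
    """Minimal 1-D k-means (Lloyd's algorithm) with deterministic start centers."""
    centers = sorted(set(values))[:k]  # first k distinct values as start centers
    for _ in range(iterations):
        # Assignment step: each value goes to its nearest center.
        groups = [[] for _ in centers]
        for v in values:
            nearest = min(range(len(centers)), key=lambda i: abs(v - centers[i]))
            groups[nearest].append(v)
        # Update step: each center moves to the mean of its group.
        centers = [sum(g) / len(g) if g else c for g, c in zip(groups, centers)]
    return centers

rows = [
    {"country": "Italy", "value": 1.0},
    {"country": "Italy", "value": 2.0},
    {"country": "Italy", "value": 9.0},
    {"country": "Spain", "value": 4.0},
    {"country": "Spain", "value": 5.0},
    {"country": "Spain", "value": 20.0},
]

# Step 1: list of unique values in the "country" column.
countries = sorted({row["country"] for row in rows})

# Steps 2 and 3: for each country, keep only its rows and cluster them.
centers_by_country = {}
for country in countries:
    filtered = [row["value"] for row in rows if row["country"] == country]
    centers_by_country[country] = kmeans_1d(filtered, k=2)

print(centers_by_country)
```

The body of the `for` loop is identical for every country; only the value of `country` changes. This is exactly the part that a KNIME loop lets us build once instead of copying it into many meta-nodes.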
KNIME has a few nodes in the “Flow Control” -> “Loop Support” category, dedicated to loop management. There are a number of different loops available in KNIME: a loop on a list of unique values, a loop on chunks of data rows of predefined size, a generic loop, and a counting loop. All loops start with a “… Loop Start” node and end with a “Loop End” node: the “… Loop Start” nodes define the kind of loop, while the “Loop End” nodes collect the results.
For our country-based k-means analysis, we need to loop across values in a list and to feed those values, one at a time, to a “Row Filter” node to perform the required country filtering. The “Row Filter” node can use a flow variable as the pattern to match in the “country” column. The loop then has to:
- Define a new flow variable,
- Iterate over the list of unique country values,
- Assign the current country value to the flow variable in each iteration.
The “TableRow To Variable Loop Start” node does all that.
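As a rough mental model (not KNIME's actual implementation), these three steps can be pictured as a generator that hands one set of flow variables to each iteration. The function name `table_row_to_variable_loop_start` and the dictionary representation of flow variables are assumptions chosen for this illustration.

```python
def table_row_to_variable_loop_start(table):
    """Yield one dict of 'flow variables' per input row (hypothetical model)."""
    for row in table:
        # Each column name becomes a variable name; the cell becomes its value.
        yield dict(row)

# One-column table holding the unique country values.
unique_countries = [{"country": "Italy"}, {"country": "Spain"}]

seen = []
for flow_variables in table_row_to_variable_loop_start(unique_countries):
    # The loop body would read flow_variables["country"] here,
    # for example as the pattern of a row filter.
    seen.append(flow_variables["country"])

print(seen)
```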
We now need to build the body of the loop, with the “Row Filter” node and the “k-Means” node, and then close the loop.
To close the loop and collect the final results, “Loop End” nodes are available. Everything between the “… Loop Start” node and the “Loop End” node is executed once for each value supplied by the “… Loop Start” node.
A “Loop End” node also acts as a results collector: all data produced in the loop body through all required iterations are collected by the “Loop End” node.
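The collecting behaviour can be pictured as follows. This is a simplified sketch, not KNIME code: the `loop_body` function and the `iteration` key are assumptions used only to show how the outputs of the individual iterations end up in one table.

```python
def loop_body(country, rows):
    """Hypothetical loop body: keep only the rows of the current country."""
    return [row for row in rows if row["country"] == country]

rows = [
    {"country": "Italy", "value": 1.0},
    {"country": "Spain", "value": 4.0},
    {"country": "Italy", "value": 2.0},
]

collected = []  # plays the role of the "Loop End" output table
for iteration, country in enumerate(["Italy", "Spain"]):
    for row in loop_body(country, rows):
        # Tag each output row with the iteration that produced it.
        collected.append({**row, "iteration": iteration})

print(len(collected))
```

Because every row carries its iteration number, the combined table still shows which pass of the loop produced which rows.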
In our workflow (see figure below), after reading and preparing the data, we used a “GroupBy” node to extract the list of unique values in the “country” column.
We reduced the original data to only one column (“country”), whose values were then grouped together by the “GroupBy” node. The output data was then a data table with only one column containing the list of “country” unique values.
At this point we introduced a “TableRow To Variable Loop Start” node to create a new flow variable named “country”, iterate through the list of unique “country” values, and assign the current “country” value to the “country” flow variable at each loop iteration.
In addition, we wanted each loop iteration to train the k-Means algorithm only on the data for the given “country” and to write the final clusters into a PMML file. This means that, after the “TableRow To Variable Loop Start” node and before the “Loop End” node, we introduced:
- A “Row Filter” node to select only those data rows referring to the current value of “country”
- A “k-Means” node to cluster the filtered data
- A “Java Edit Variable” node to create the output path for the PMML file to contain the clusters built for that “country” value
- A “PMML Writer” node to write the clusters into a PMML file
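The path-building step amounts to simple string composition on the current value of “country”. The snippet below shows the idea in Python rather than in the Java code the node would contain; the folder name `clusters` and the file naming scheme are hypothetical.

```python
from pathlib import PurePosixPath

def pmml_path_for(country, folder="clusters"):
    """Build a per-country output file path (hypothetical naming scheme)."""
    # Replace spaces so the country name is safe to use in a file name.
    safe_name = country.strip().replace(" ", "_")
    return str(PurePosixPath(folder) / f"kmeans_{safe_name}.pmml")

print(pmml_path_for("United Kingdom"))  # clusters/kmeans_United_Kingdom.pmml
```

Deriving the file name from the loop variable gives each iteration its own file, so the clusters of one country do not overwrite those of another.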
Finally, we placed a “Loop End” node to close the loop and collect the results. Indeed, the last option in the context menu of the “Loop End” node is named “Collected Results” and shows all the original data from all countries, with iteration number and cluster number.