More and more often I am having a curious problem when implementing my workflows.
The amount of data grows with every project, so much that I sometimes have to wait a long time for the workflow to execute fully. In my last project, for example, I had to deal with 176 million rows!
Understandably, I want to use a smaller data set while building my workflow and the real data set only for the final full execution. I therefore need a switch, controlled by a flow variable, that lets the workflow run either on all the data or on just a subset, depending on the flow variable value.
The "String Radio Buttons" Quickform
Creating a two-value flow variable is easy with the "String Radio Buttons" Quickform node. In this case the node creates a flow variable named "port" with two possible values: "partial" or "full". If "partial" is selected, the workflow runs on a subset of the data; if "full" is selected, it runs on the full data set. The configuration window of the "String Radio Buttons" Quickform node is shown below.
Reading the Data
The data consists of 6 files. Reading the full data set means looping across all 6 files and collecting the data read from each. Reading a subset of the data means reading only one of the files.
The input then comes from a "List Files" node. The "List Files" node produces the list of files available in a selected location on your machine (see picture below).
The selected location is inserted in the configuration window of the "List Files" node and the node produces a data table where each row contains the file path to one of the files in the selected location.
At this point, a "TableRow To Variable Loop Start" node translates each file path into a variable and starts a loop across them, where a "File Reader" node reads the file, and the "Loop End" node collects the results. This would be one branch of the switch.
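For readers who think in code, the list-then-loop branch can be sketched in a few lines of Python. This is only an illustration of the pattern, not what KNIME executes: the function name `read_all_files`, the CSV format, and the `*.csv` pattern are assumptions made for the example.

```python
import csv
from pathlib import Path


def read_all_files(folder, pattern="*.csv"):
    """Read every matching file in `folder` and collect the rows in one list.

    Hypothetical helper: file format and pattern are assumptions.
    """
    # "List Files": one path per matching file, sorted for a stable order.
    paths = sorted(Path(folder).glob(pattern))
    collected = []
    # "TableRow To Variable Loop Start": each path becomes the loop variable.
    for path in paths:
        # "File Reader": read the file the current variable points to.
        with open(path, newline="") as handle:
            rows = list(csv.DictReader(handle))
        # "Loop End": append this iteration's rows to the collected result.
        collected.extend(rows)
    return collected
```

Calling `read_all_files("some/folder")` would return the rows of all files as one list of dictionaries, much as the "Loop End" node returns one concatenated table.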
How to connect a File Reader to a switch node?
The other branch reads just one of the available files. The problem here is: how do I force a "File Reader" node to follow the enabling/disabling of the switch node?
The switch node outputs a data table and all connected nodes will be disabled/enabled depending on the switch node status. However, the "File Reader" node does not take any input, so it cannot be connected to the switch node.
The solution to this problem is to make the connection via a Flow Variable. The data table with all paths, output by the switch node, is then converted into a flow variable with a "TableRow To Variable" node. The output flow variable, containing only the first path of the switch output data table, is passed to the "File Reader" node and used to set the file path. In this way, if the switch output port is disabled, the whole branch, including the "File Reader" node, is disabled.
Finishing the workflow
The switch block is then closed by an "END IF" node to collect the resulting data.
A "Java Edit Variable" node transforms the initial selection between "full" and "partial" into the value that controls the switch node output ports.
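The translation step itself is tiny. As a rough sketch in Python (the actual node contains Java, and the port numbering used here is an assumption, since the post does not say which branch sits on which port):

```python
def port_for_mode(mode):
    """Translate the radio-button value into a switch port index.

    Assumption for illustration: port 0 feeds the single-file branch,
    port 1 feeds the loop branch that reads every file.
    """
    ports = {"partial": 0, "full": 1}
    if mode not in ports:
        raise ValueError(f"unknown mode: {mode!r}")
    return ports[mode]
```

Rejecting unknown values explicitly keeps a typo in the selection from silently activating the wrong branch.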
The final workflow is shown in the figure below.
This little workflow implements a selective reading.
If we choose "full", all data are read by looping on all available files.
If we choose "partial", only the content of the first file is read and imported into KNIME.
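The overall behavior can be summarized as a small function that decides which file paths get read. It is a minimal sketch, assuming the list of paths is already available; the name `paths_to_read` is hypothetical.

```python
def paths_to_read(all_paths, mode):
    """Return the file paths that should be read for the given mode."""
    if mode == "full":
        # Loop branch: every available file is read.
        return list(all_paths)
    if mode == "partial":
        # Single-file branch: only the first path is used.
        return list(all_paths[:1])
    raise ValueError(f"unknown mode: {mode!r}")
```

For example, `paths_to_read(["a.csv", "b.csv"], "partial")` yields only `["a.csv"]`, while `"full"` yields both paths.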
Let’s imagine now that we want to cluster some data by country; that is, that we want to build a cluster set for each country represented in the data.
First of all, we need the list of unique values in the “country” column.
Then, we need to filter out only the rows referring to the country of interest and run, for example, a k-means clustering on the filtered data.
We also want to save the clusters into a PMML file. The clustering should be repeated for all countries present in the data.
If we have a few countries we might build a few meta-nodes containing the same procedure but processing the data sets coming from different countries. However, if the data contains too many countries it might be cumbersome to build too many identical meta-nodes. We need to build a loop that isolates different data rows and forces them through the same procedure.
KNIME has a few nodes in the “Flow Control” -> “Loop Support” category, dedicated to loop management. There are a number of different loops available in KNIME: a loop on a list of unique values, a loop on chunks of data rows of predefined size, a generic loop, and a counting loop. All loops start with a “… Loop Start” node and end with a “Loop End” node: the “… Loop Start” nodes define the kind of loop, while the “Loop End” nodes collect the results.
For our country-based k-means analysis, we need to loop across values in a list and to feed those values, one at a time, to a “Row Filter” node to perform the required country filtering. The “Row Filter” node can use a flow variable as the pattern to match in the “country” column. The loop then has to:
- Define a new flow variable
- Iterate over the list of unique country values
- Assign the current country value to the flow variable in each iteration
The “TableRow To Variable Loop Start” node does all that.
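In plain code, the three steps above amount to a `for` loop whose loop variable drives a filter. The following Python sketch only mirrors the idea; the function name `rows_per_value` and the representation of rows as dictionaries are assumptions for the example.

```python
def rows_per_value(rows, column, values):
    """Yield (value, matching_rows) for each value, one iteration at a time."""
    # Loop start: `current` plays the role of the flow variable and is
    # reassigned at every iteration.
    for current in values:
        # "Row Filter": keep only the rows whose column matches the variable.
        matching = [row for row in rows if row[column] == current]
        yield current, matching
```

Each pair produced by the generator corresponds to one loop iteration: the current value and the data rows that would pass the “Row Filter” node.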
We now need to build the body of the loop with the "Row Filter" node and the "k-means" node and close the loop.
To close the loop and collect the final results, “Loop End” nodes are available. Everything between the “… Loop Start” node and the “Loop End” node is repeated for every value supplied by the “… Loop Start” node.
A “Loop End” node also acts as a results collector: all data produced in the loop body through all required iterations are collected by the “Loop End” node.
In our workflow (see figure below), after reading and preparing the data, we used a “GroupBy” node to extract the list of unique values in the “country” column.
We reduced the original data to only one column (“country”), whose values were then grouped together by the “GroupBy” node. The output data was then a data table with only one column containing the list of “country” unique values.
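Extracting the unique values of one column is a one-liner in most languages. A hedged Python equivalent, assuming rows are dictionaries and using the hypothetical name `unique_values`:

```python
def unique_values(rows, column):
    """Return the distinct values of `column`, in first-seen order."""
    # dict.fromkeys drops duplicates while keeping insertion order.
    return list(dict.fromkeys(row[column] for row in rows))
```

The first-seen ordering is a choice made for this sketch; it is not a statement about the row order the “GroupBy” node produces.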
At this point we introduced a “TableRow To Variable Loop Start” node, to create a new workflow variable named “country”, iterate through the list of “country” unique values, and assign the current “country” value to the “country” workflow variable at each loop iteration.
In addition, we wanted each loop iteration to train the k-Means algorithm only on the data for the given “country” and to write the final clusters into a PMML file. This means that after the “TableRow To Variable Loop Start” node and before the “Loop End” node we introduced:
- A “Row Filter” node to select only those data rows referring to the current value of “country”
- A “k-Means” node to cluster the filtered data
- A “Java Edit Variable” node to create the output path for the PMML file to contain the clusters built for that “country” value
- A “PMML Writer” node to write the clusters into a PMML file
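To make the loop body concrete, here is a small, self-contained Python sketch of the same sequence: filter by country, cluster, and build a per-country output path. Everything in it is an assumption for illustration: the column names `x` and `y`, the file name pattern `clusters_<country>.pmml`, and the toy k-means (seeded with the first k points), which stands in for the “k-Means” node. It does not write PMML; it only computes the path where the file would go.

```python
import math
from pathlib import Path


def k_means(points, k, iterations=20):
    """Tiny Lloyd-style k-means; the first k points seed the centroids."""
    centroids = [tuple(p) for p in points[:k]]

    def nearest(point):
        # Index of the centroid closest to `point` (Euclidean distance).
        return min(range(len(centroids)),
                   key=lambda c: math.dist(point, centroids[c]))

    for _ in range(iterations):
        # Assignment step: label every point with its nearest centroid.
        labels = [nearest(p) for p in points]
        # Update step: move each centroid to the mean of its members.
        for c in range(len(centroids)):
            members = [p for p, label in zip(points, labels) if label == c]
            if members:
                centroids[c] = tuple(sum(axis) / len(members)
                                     for axis in zip(*members))
    # Final assignment against the final centroids.
    return centroids, [nearest(p) for p in points]


def pmml_path(output_dir, country):
    """Build the per-country output path (hypothetical naming pattern)."""
    return Path(output_dir) / f"clusters_{country}.pmml"


def cluster_by_country(rows, k, output_dir):
    """Run the loop body once per unique country and gather the results."""
    results = {}
    for country in dict.fromkeys(row["country"] for row in rows):
        # Row filter: only the points belonging to the current country.
        points = [(row["x"], row["y"]) for row in rows
                  if row["country"] == country]
        centroids, labels = k_means(points, k)
        results[country] = {
            "path": pmml_path(output_dir, country),
            "centroids": centroids,
            "labels": labels,
        }
    return results
```

The point of the sketch is the structure: one clustering run and one output path per country, all driven by the same loop variable.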
Finally, we placed a “Loop End” node to close the loop and collect the results. Indeed, the last option in the context menu of the “Loop End” node is named “Collected Results” and shows all the original data from all countries, with iteration number and cluster number.
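The collecting behavior can be pictured as concatenating the output of every iteration and tagging each row with the iteration it came from. A minimal Python sketch, assuming rows are dictionaries, the iteration counter starts at 0, and the added column is called `Iteration` (the function name `collect_results` is hypothetical):

```python
def collect_results(iterations):
    """Concatenate per-iteration row lists, tagging rows with their iteration."""
    collected = []
    for number, rows in enumerate(iterations):
        for row in rows:
            # Copy the row and add the iteration number as an extra column.
            collected.append({**row, "Iteration": number})
    return collected
```

Copying each row before adding the column leaves the per-iteration data untouched, so the input can still be inspected separately.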