Big Data and KNIME
There is a lot of talk these days about big data.
I have always been somewhat skeptical about it. All of my projects so far have worked very well using KNIME alone, even with very large amounts of data (up to 50 million rows).
176 Million Rows Processed with KNIME
However, I recently worked on a very particular project on electricity usage data, where I had to aggregate and transform 175 million data rows (give or take a million).
The project included reading the time series of electricity usage, sampled every half hour between July 15th, 2009 and January 1st, 2011, for around 6000 meter IDs (households or businesses) placed around Ireland. The data came from a pilot that the Irish electricity and gas company ran to evaluate how much information such meter IDs could provide.
The final goal of the project was to predict the electricity usage for clusters of meter IDs.
The half-hour electricity usage data was aggregated into hourly, daily, and monthly time series for each meter ID. Aggregated information (or KPIs) was extracted from each time series to describe the electricity consumption behavior of each meter ID. On the basis of these KPIs, the 6000 meter IDs were clustered into just 28 clusters. I then worked on predicting the hourly, daily, and monthly time series of the 28 cluster prototypes.
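Outside KNIME, the core of this aggregation and KPI extraction can be sketched in a few lines of Python/pandas. The column names, values, and KPI choices below are illustrative only, not the project's actual ones:

```python
import pandas as pd

# Hypothetical half-hourly readings for one meter ID over one day
# (the column name and the constant values are invented for this sketch).
idx = pd.date_range("2009-07-15", periods=48, freq="30min")
usage = pd.DataFrame({"kwh": 1.0}, index=idx)

# Aggregate the half-hour samples into hourly and daily time series.
hourly = usage["kwh"].resample("h").sum()
daily = usage["kwh"].resample("D").sum()

# A few simple KPIs describing this meter's consumption behavior;
# KPIs like these could then feed a clustering step.
kpis = {
    "mean_hourly_kwh": float(hourly.mean()),
    "peak_hourly_kwh": float(hourly.max()),
    "total_daily_kwh": float(daily.iloc[0]),
}
```

The same resampling idea extends to monthly series by changing the frequency string.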
The project is distributed over three KNIME workflows. The first workflow imports, aggregates, and transforms the raw data, as shown in the picture below.
The workflow ran on a 4-core, 64-bit, 2.20 GHz laptop with 8 GB RAM and Windows 7.
The first meta-node, named "Read all Data", loops over all files to read and concatenate their content. The output of this meta-node is a massive data table of 176 million rows containing all available raw data. Reading all of those rows from the 6 files took the loop only half an hour, which is really astonishing considering the amount of data!
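The logic of this first meta-node, reading file after file and appending the rows, can be sketched in Python/pandas. Two tiny in-memory "files" stand in for the six real ones, and the column names are invented:

```python
import io
import pandas as pd

# Two small in-memory "files" standing in for the six raw data files;
# the column names are made up for this sketch.
files = [
    io.StringIO("meter_id,timestamp_code,kwh\n1000,19503,0.14\n1000,19504,0.25\n"),
    io.StringIO("meter_id,timestamp_code,kwh\n1001,19503,0.09\n"),
]

# Loop over all files, read each one, and concatenate the contents
# into one big data table.
all_data = pd.concat((pd.read_csv(f) for f in files), ignore_index=True)
```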
The second meta-node, named "String to Datetime", converts all date/time values (days, months, years, hours, and minutes) from the original proprietary format into a KNIME DateTime object. The last node of the sub-flow contained in this meta-node is a "Sorter" node that sorts all rows in ascending order by time. All date/time transformation nodes executed relatively quickly, taking up to 2 hours altogether. The "Sorter" node, on the other hand, was the bottleneck of the whole sub-workflow, taking almost 5 hours.
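In pandas terms, the "String to Datetime" conversion and the final sort look roughly like this. The date format string is purely illustrative, not the pilot's actual proprietary encoding:

```python
import pandas as pd

# Rows with date/time values still stored as text; the format below is an
# illustrative stand-in for the pilot's proprietary encoding.
raw = pd.DataFrame({
    "meter_id": ["M2", "M1"],
    "timestamp": ["15/07/2009 00:30", "15/07/2009 00:00"],
    "kwh": [0.4, 0.2],
})

# "String to Datetime": parse the text into real datetime objects.
raw["timestamp"] = pd.to_datetime(raw["timestamp"], format="%d/%m/%Y %H:%M")

# "Sorter": order all rows ascending by time
# (this was the bottleneck at 176 million rows).
sorted_rows = raw.sort_values("timestamp").reset_index(drop=True)
```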
Finally, the aggregations. Two parallel meta-nodes aggregate the data rows into daily, monthly, weekly, and yearly time series and into an hourly time series, respectively, and calculate the average amount of energy used on weekends vs. business days, on each day of the week, and during the morning, afternoon, evening, night, early morning, and late afternoon. These two meta-nodes also calculate new time series for each meter ID on a daily, monthly, weekly, yearly, and hourly scale. Both meta-nodes are massively time consuming, due to all of the aggregations performed by "GroupBy" and "Pivoting" nodes.
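The kind of work those "GroupBy" and "Pivoting" nodes do can be sketched in pandas. All names here are illustrative, and constant values keep the sketch easy to verify:

```python
import pandas as pd

# Two weeks of hourly usage for one meter, starting on a Monday
# (2010-01-04); the constant kwh values are invented.
idx = pd.date_range("2010-01-04", periods=14 * 24, freq="h")
df = pd.DataFrame({"kwh": 1.0}, index=idx)

# "GroupBy"-style aggregation: energy used on weekends vs. business days.
df["day_type"] = ["weekend" if d >= 5 else "business" for d in df.index.dayofweek]
by_day_type = df.groupby("day_type")["kwh"].sum()

# "Pivoting"-style table: total usage per day of the week.
df["weekday"] = df.index.day_name()
by_weekday = df.pivot_table(values="kwh", index="weekday", aggfunc="sum")
```

Time-of-day bands (morning, afternoon, and so on) would follow the same pattern, grouping on a column derived from the hour of the day.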
The "GroupBy", "Sorter", and "Pivoting" nodes all share the same sorting algorithm and might subtract resources from each other if running at the same time. This effect is not noticeable for smaller data sets, even with 50 million rows, but with 176 million rows it quickly becomes a problem. In this case, you need to execute one node after the other, either manually or by setting artificial dependencies by means of flow variable connections across nodes. Still, even after inserting these dependencies, the two meta-nodes took almost 3 days in total to run.
All calculated values are then joined together and a few additional percentages are inserted. The execution of this meta-node, named "% values", adds just a few hours to the total execution time of the workflow.
The good news is that KNIME does not break. If you have enough time and patience, you can still run your workflow on 176 million rows, slowly but without major issues. In other projects using very large amounts of data, for example 50 million rows, KNIME has never crashed, and the execution times on the laptop described above have always been acceptable.
The same Workflow using "RushAccelerator for KNIME"
However, for this project I had only one week available to present some kind of result. Three days of execution time were simply not an option. So, for the first time, I decided to try the KNIME solution for big data, named "RushAccelerator for KNIME" and produced by Actian (http://bigdata.pervasive.com/Products/RushAccelerator-for-KNIME.aspx).
You can download a free trial or buy a license from their web site. Actually, I downloaded what is now an old version of the "RushAccelerator for KNIME" software, when the company was still called Pervasive. Nowadays, a new version with many more nodes is available, under the brand new company name Actian.
"RushAccelerator for KNIME" installs a new category in your KNIME "Node Repository", named "Pervasive DataRush" that contains a few specialized nodes to run on a big data platform (see picture on the left).
With those, I rebuilt my workflow using the RushAccelerator for KNIME nodes for datetime manipulation, value calculation/extraction, and, most of all, aggregations.
On the plus side, you can mix most KNIME nodes and RushAccelerator nodes without problems, and some nodes have a GUI similar to that of their corresponding KNIME nodes.
On the minus side, the configuration window of some RushAccelerator nodes is quite different from that of the corresponding KNIME nodes, like, for example, the "Rows to Columns" node, which is supposed to perform the same task as the "Pivoting" node. However, the highly time consuming tasks were only a few, repeated many times. In addition, RushAccelerator is not free, but license costs are affordable.
RushAccelerator nodes also have a new tab in the configuration window, named "Job Manager Selection". Here you can set the engine you intend to run the workflow on: either the default engine or the DataRush executor.
The new workflow built with RushAccelerator nodes is shown in the figure below. The structure remained the same as in the original workflow, and most of the nodes contained in the meta-nodes actually mirror the original KNIME nodes.
The new Execution Time
And now about the execution time!
The result was quite impressive. The execution time of the whole workflow went down from circa 3 days to one hour and 16 minutes!
That was amazing and shifted my attitude towards Big Data from full skepticism to cautious optimism.
Here I found an example where Big Data really made the difference between the success of the project and its total failure!
This does not mean that all workflows need to run on some Big Data platform. However, depending on the amount of data and the available time and machine, sometimes you might wish to speed up execution enough to get results in a reasonable time!
To continue along the lines of what is new in KNIME 2.6, this post is dedicated to a special new Quick Form node: the "Column Filter Quickform".
The "Column Filter Quickform" node presents a UI mask to select one or more data columns during a workflow execution. Practically, it is like a dynamic "Column Filter" node, where data columns can be selected on the fly at each workflow run.
In a workflow, for example, I was modelling fund returns based on the returns of 3 proxies, for one fund of choice at a time. In the following data table, I have the choice of modelling Fund 1, Fund 2, Fund 3, Fund 4, or Fund 5 based on Proxy 1, Proxy 2, and Proxy 3. In order to select a different fund each time, I need to use a "Column Filter Quickform" node.
So, after the "File Reader" node I placed a "Column Filter Quickform" node. Its configuration window allows for the selection of one or more data columns via the usual Include/Exclude frame with the Add/Remove buttons. And so far it looks like a "Column Filter" node.
In addition, it also asks for an explanatory label describing what has to be selected here, some description of what the selection is for, a weight, and a variable name. In fact, when executed, the node creates a variable with the name given in the "variable name" box and the name(s) of the selected data column(s) as its content. The "weight" setting ... well, let's keep the "weight" setting in mind for later.
After execution, this node then creates:
1. a flow variable named "kept_columns" containing the value "Fund 4"
2. an empty data table with the selected column
Since the "Column Filter Quickform" produces an empty data table, but with the right structure, I still need to fill it with the original values of column "Fund 4". To do that, I can use a "Reference Column Filter" node, using the output of the Quick Form node as a reference for the original data table. At this point, I have an output table containing only the selected data column with the original values.
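The combination of the two nodes boils down to a simple column selection. In pandas terms, with the fund/proxy names of the example above and made-up values:

```python
import pandas as pd

# Original table with funds and proxies (the values are made up).
original = pd.DataFrame({
    "Fund 1": [0.010], "Fund 4": [0.030],
    "Proxy 1": [0.020], "Proxy 2": [0.015], "Proxy 3": [0.005],
})

# The Quickform's output: an empty table whose structure carries the selection.
reference = pd.DataFrame(columns=["Fund 4"])

# "Reference Column Filter": keep only the columns present in the reference.
filtered = original[[c for c in original.columns if c in reference.columns]]
```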
If I pack this sequence of nodes ("Column Filter Quickform" + "Reference Column Filter" + maybe some additional nodes) into a meta-node, the meta-node acquires a "Configure" option in its context menu. The configuration window of the meta-node now shows the Include/Exclude frame of the Quick Form to select the data column(s), plus its label and description.
If I want to make the UI of the meta-node cleaner for the column selection, I can place a (static) "Column Filter" node before the "Column Filter Quickform" to hide the non-selectable columns from the Quick Form UI. In this case, I would only leave the "Fund *" data columns available to the Quickform node.
Similarly to the meta-node, the KNIME Server shows the "Column Filter Quickform" UI when executing the workflow. Indeed, the KNIME Server step-wise execution stops at each Quick Form node, showing the corresponding UI mask.
Remember the "weight" setting? If you pack more than one Quick Form node into a meta-node, the meta-node configuration window, as well as the UI mask in the Server step-wise execution, will show more than one selection frame. The positioning of those frames inside the window/mask depends on the "weight" setting: the bigger the "weight", the heavier, and therefore the lower positioned, the corresponding frame in the window/mask.
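The ordering rule amounts to a simple sort by weight. A plain-Python sketch, with invented labels and weight values:

```python
# Hypothetical Quick Form frames as (label, weight) pairs; the labels and
# weights are invented for illustration.
quickforms = [("Select fund", 10), ("Select date range", 1), ("Select model", 5)]

# Heavier frames sink: sort ascending by weight to get top-to-bottom order.
ordered = sorted(quickforms, key=lambda qf: qf[1])
```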
And this is the final content of the meta-node.
I just released an update of the book "KNIME Beginner's Luck" that includes the latest changes in KNIME 2.6.
Here are the most important changes:
- New options to customize the GUI in the KNIME workbench (section 1.8)
- The new Skip Lines advanced option in the File Reader (section 2.3: Advanced)
- The JFreeChart nodes
- New nodes for statistical hypothesis testing (section 4.4: Hypothesis Testing)
- The new Java Snippet strategy, including two Java Snippet nodes: one for more advanced Java programmers and one, easier, for non-Java experts (section 5.4)
- The new feature to reconfigure existing meta-nodes (section 5.6: Collapse pre-existing nodes into a meta-node)

If you have a two-year subscription and do not receive the updated book in the next few days, please let me know. Enjoy the new book!
I work a lot with meta-nodes. When a workflow becomes too crowded, I simplify it by grouping together nodes that work on the same task into a meta-node.
In general, in my KNIME Desktop version, I reuse my meta-nodes, in the sense that I copy and paste the same meta-node into other workflows. There I also reconfigure the meta-node, if necessary.
This meta-node copy and paste procedure can be avoided if you are connected to a KNIME Server.
I usually develop my workflows in my local KNIME workspace. As I said, from time to time I clean up a workflow by inserting meta-nodes to collect specific nodes. Some of these meta-nodes turn out to be useful in other contexts too. Thus, I place them on the KNIME Server as meta-node templates, making them available to other workflows as well.
In order to place a meta-node template on the KNIME Server, I right-click my local meta-node, select the option "Save as Meta-Node Template", and choose a suitable location on the KNIME Server for my meta-node template (of course, I must have write permissions on this KNIME Server workflow group).
At this point:
- the meta-node template appears in the defined location on the KNIME Server
- my local meta-node becomes just a link to the meta-node template on the KNIME Server
- to create a new instance of the meta-node, I just drag and drop the meta-node template from the list of workflows on the KNIME Server to the workflow editor
- the local option "Disconnect Meta-Node Link" copies the meta-node instance into a local meta-node.
In addition, if I want to make my meta-node template parametric, so that it can be configured differently in different contexts, I just insert a Quickform node in the meta-node template.
In KNIME 2.4 you can now:
- select a number of nodes (in Windows with the usual Shift-Click),
- right-click any of them, and from the context menu
- select "Collapse into Meta Node".
A new meta-node containing the selected nodes is automatically created and added to the workflow. Very practical!