Big Data and KNIME
There is a lot of talk these days about big data.
I have always been somewhat skeptical about it. All of my projects so far have worked very well using KNIME alone, even with very large amounts of data (up to 50 million rows).

176 Million Rows processed with KNIME
However, I recently worked on a very particular project on electricity usage data, where I had to aggregate and transform 175 million data rows (give or take a million).
The project involved reading the time series of electricity usage, sampled every half hour between July 15th 2009 and January 1st 2011, for around 6000 meter IDs (households or businesses) located around Ireland. The data came from a pilot that the Irish electricity and gas company ran to evaluate how much information such meters could provide.

The final goal of the project was to predict the electricity usage for clusters of meter IDs.
The half-hourly electricity usage data was aggregated into hourly, daily, and monthly time series for each meter ID. Aggregated information (KPIs) was extracted from each time series to describe the electricity consumption behavior of each meter ID. On the basis of these KPIs, the 6000 meter IDs were clustered into only 28 clusters. I then worked on the prediction of the hourly, daily, and monthly time series of the 28 cluster prototypes.
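To make the aggregation step concrete, here is a minimal Python sketch of the idea, not the KNIME workflow itself. The tuple layout (meter ID, timestamp, kWh) and the function names are assumptions made for illustration only.

```python
from collections import defaultdict
from datetime import datetime, timedelta

def aggregate(readings, bucket_fn):
    """Sum the usage per (meter_id, time bucket)."""
    totals = defaultdict(float)
    for meter_id, timestamp, kwh in readings:
        totals[(meter_id, bucket_fn(timestamp))] += kwh
    return dict(totals)

# Bucket functions: map a timestamp to the period it belongs to.
def hourly(ts):
    return ts.replace(minute=0, second=0, microsecond=0)

def daily(ts):
    return ts.date()

def monthly(ts):
    return (ts.year, ts.month)

# Made-up data: one hypothetical meter, two days of half-hourly readings.
start = datetime(2009, 7, 15)
readings = [(1001, start + timedelta(minutes=30 * i), 0.5) for i in range(96)]

hourly_series = aggregate(readings, hourly)
daily_series = aggregate(readings, daily)
monthly_series = aggregate(readings, monthly)
```

Each series is simply the same readings summed over a coarser time bucket, which is the kind of work a "GroupBy" step performs.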

The project is distributed over three KNIME workflows. The first workflow imports, aggregates, and transforms the raw data, as shown in the picture below.
The workflow ran on a 4-core laptop, with 8 GB RAM, 64-bit, 2.20 GHz, and Windows 7.
The first meta-node, named "Read all Data", loops over all files to read and concatenate their content. The output of that meta-node is a massive data table of 176 million rows containing all available raw data. The loop takes only half an hour to read all those rows from 6 files, which is really astonishing considering the amount of data!
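The read-and-concatenate loop can be sketched in plain Python as follows. This only illustrates the pattern; the space-delimited three-column layout and the file names are assumptions, not the actual format of the pilot data.

```python
import csv
import os
import tempfile

def read_all(paths):
    """Yield the rows of every file in turn, i.e. a streaming concatenation."""
    for path in paths:
        with open(path, newline="") as handle:
            for row in csv.reader(handle, delimiter=" "):
                yield row

# Create two tiny hypothetical input files so the sketch is runnable.
tmp_dir = tempfile.mkdtemp()
paths = []
for i in range(2):
    path = os.path.join(tmp_dir, f"file{i + 1}.txt")
    with open(path, "w", newline="") as handle:
        handle.write(f"100{i} 19501 0.25\n100{i} 19502 0.30\n")
    paths.append(path)

all_rows = list(read_all(paths))
```

Reading through a generator keeps only one row in memory at a time until the final `list(...)` call, which matters once the input grows to millions of rows.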

The second meta-node, named "String to Datetime", converts all date/time values (days, months, year, hours, and minutes) from the original proprietary format into a KNIME DateTime object. The last node of the sub-flow contained in this meta-node is a "Sorter" node, which sorts all rows in ascending order by time. All date/time transformation nodes executed relatively quickly, taking up to 2 hours altogether. The "Sorter" node, on the other hand, was the bottleneck of the whole sub-workflow, taking almost 5 hours.
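The two steps of this meta-node, parsing and then sorting by time, look roughly like this in Python. The date string format below is a stand-in chosen for illustration, since the original proprietary format is not described here.

```python
from datetime import datetime

# Hypothetical input format; the real proprietary format is different.
DATE_FORMAT = "%d.%m.%Y %H:%M"

rows = [
    ("1001", "15.07.2009 01:00", 0.4),
    ("1000", "15.07.2009 00:30", 0.2),
    ("1001", "15.07.2009 00:30", 0.3),
]

# Step 1: convert the date strings into datetime objects.
parsed = [
    (meter_id, datetime.strptime(stamp, DATE_FORMAT), kwh)
    for meter_id, stamp, kwh in rows
]

# Step 2: sort in ascending order by time. Python's sort is stable,
# so rows with equal timestamps keep their original relative order.
parsed.sort(key=lambda row: row[1])
```

The conversion is a single pass over the data, while sorting needs on the order of n log n comparisons, which is consistent with the sort being the slower step on a very large table.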

Finally, the aggregations. Two parallel meta-nodes aggregate the data rows: one into daily, weekly, monthly, and yearly time series, the other into an hourly time series. They also calculate the average amount of energy used on weekends vs. business days, on each day of the week, and during each segment of the day (early morning, morning, afternoon, late afternoon, evening, and night). These two meta-nodes also calculate new time series for each meter ID on a daily, weekly, monthly, yearly, and hourly scale. Both meta-nodes are massively time consuming, because of all the aggregations performed with "GroupBy" and "Pivoting" nodes.
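As an illustration of one of these aggregations, the sketch below computes the average usage on weekends vs. business days per meter. The data and the function names are made up for the example; the KNIME workflow does this with "GroupBy" and "Pivoting" nodes instead.

```python
from collections import defaultdict
from datetime import datetime

def day_type(timestamp):
    """Monday to Friday count as business days, Saturday and Sunday as weekend."""
    return "weekend" if timestamp.weekday() >= 5 else "business"

def average_by_day_type(readings):
    """Average usage per (meter_id, day type)."""
    sums = defaultdict(float)
    counts = defaultdict(int)
    for meter_id, timestamp, kwh in readings:
        key = (meter_id, day_type(timestamp))
        sums[key] += kwh
        counts[key] += 1
    return {key: sums[key] / counts[key] for key in sums}

# Made-up readings for one hypothetical meter.
readings = [
    (1001, datetime(2009, 7, 17, 8, 0), 1.0),   # Friday
    (1001, datetime(2009, 7, 17, 8, 30), 2.0),  # Friday
    (1001, datetime(2009, 7, 18, 8, 0), 0.5),   # Saturday
    (1001, datetime(2009, 7, 19, 8, 0), 1.5),   # Sunday
]

averages = average_by_day_type(readings)
```

The same grouping pattern extends to day-of-week or day-segment averages by swapping `day_type` for a different key function.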

The "GroupBy", "Sorter", and "Pivoting" nodes all share the same sorting algorithm and might take resources away from each other if they run at the same time. This effect is not noticeable for smaller data sets, even for 50 million rows, but with 176 million rows it quickly becomes a problem. In this case, you need to execute one node after the other, either manually or by setting artificial dependencies by means of flow variable connections across nodes. Still, even after inserting dependencies, these two meta-nodes took almost 3 days in total to run.

All calculated values are then joined together and a few additional percentages are also inserted. The execution of this meta-node, named "% values", adds just a few hours to the total execution of the workflow.

The good news is that KNIME does not break. If you have enough time and patience, you can still run your workflow on 176 million rows, slowly but without major issues. In other projects using very large amounts of data, for example 50 million rows, KNIME has never crashed and the execution times on the laptop above have always been acceptable.

The same Workflow using "RushAccelerator for KNIME"
However, for this project I had only one week available to present some kind of results, so three days of execution time were not really feasible. For the first time, I then decided to try the KNIME solution for Big Data, named RushAccelerator for KNIME and produced by Actian (http://bigdata.pervasive.com/Products/RushAccelerator-for-KNIME.aspx).
You can download a free trial or buy a license from their website. I actually downloaded what is now an old version of the "RushAccelerator for KNIME" software, when the company was still called Pervasive. Nowadays there is a new version with many more nodes available and a brand new company name, "Actian".
"RushAccelerator for KNIME" installs a new category in your KNIME "Node Repository", named "Pervasive DataRush" that contains a few specialized nodes to run on a big data platform (see picture on the left).
With those, I rebuilt my workflow using the RushAccelerator for KNIME nodes for datetime manipulation, values calculation/extraction, and most of all aggregations.

On the plus side, you can mix most KNIME nodes and RushAccelerator nodes without problems, and some nodes have a GUI similar to that of their corresponding KNIME nodes.

On the minus side, the configuration window of some RushAccelerator nodes is quite different from that of the corresponding KNIME nodes, for example the "Rows to Columns" node, which is supposed to perform the same task as the "Pivoting" node. However, only a few tasks were highly time consuming, and they were repeated many times, so I only had to learn a few new configuration windows. In addition, RushAccelerator is not free, but the license costs are affordable.

RushAccelerator nodes also have a new tab in the configuration window, named "Job Manager Selection". Here you can set the engine you intend to run the workflow on: either the default engine or the DataRush executor.

The new workflow built with RushAccelerator nodes is shown in the figure below. The structure remained the same as in the original workflow, and most of the nodes contained in the meta-nodes actually mirror the original KNIME nodes.

The new Execution Time
And now about the execution time!
The result was quite impressive. The execution time of the whole workflow went down from circa 3 days to one hour and 16 minutes!
That was amazing and moved my attitude towards Big Data from full skepticism to cautious optimism.
I did find an example here where Big Data really made the difference between the success of the project and a total failure!
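To put a number on that improvement: going from about 3 days to 1 hour and 16 minutes is a speed-up of roughly 57 times. A quick way to check the arithmetic, keeping in mind that "3 days" is itself an approximation:

```python
from datetime import timedelta

before = timedelta(days=3)              # "circa 3 days", so only approximate
after = timedelta(hours=1, minutes=16)  # 76 minutes

speedup = before / after                # dividing two timedeltas gives a float
saved = before - after

print(f"speed-up: about {speedup:.0f}x")
print(f"time saved: {saved}")
```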

This does not mean that all workflows need to run on some Big Data platform. However, depending on the amount of data, available time and machine, sometimes you might wish to speed up execution times enough to get results in a reasonable time!
 
 
A few days ago I was in the middle of a pre-sales meeting, when a young salesperson turned to me for an explanation of all the acronyms that had been flying around for the last 15 minutes.
I myself am not very good at keeping track of what exactly each acronym means. So I thought it would be interesting to write down the exact wording and a rough explanation of the most frequent acronyms in sales, marketing, and other very talkative meetings.

Here are the acronyms that kept popping up at the last meeting:

ETL means Extract, Transform, Load. It refers to the data processing in data warehousing architectures:
     1. Extract data from source files/databases
     2. Transform it to fit your goal (data cleaning, aggregation, etc.)
     3. Load it into your final target structure, usually a database
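
The three steps above can be sketched in a few lines of Python. This is only a toy illustration under assumed names: the CSV layout, the `meter_usage` table, and the in-memory SQLite database are all made up for the example.

```python
import csv
import io
import sqlite3

# Hypothetical raw source; the second row has a missing value.
RAW = "meter_id,kwh\n1001,0.5\n1001,\n1002,1.25\n1001,0.75\n"

def extract(text):
    """Extract: read the rows from the source."""
    return list(csv.DictReader(io.StringIO(text)))

def transform(rows):
    """Transform: drop rows with missing values, then sum usage per meter."""
    totals = {}
    for row in rows:
        if not row["kwh"]:
            continue
        meter_id = int(row["meter_id"])
        totals[meter_id] = totals.get(meter_id, 0.0) + float(row["kwh"])
    return sorted(totals.items())

def load(records, connection):
    """Load: write the result into the target database."""
    connection.execute(
        "CREATE TABLE meter_usage (meter_id INTEGER PRIMARY KEY, total_kwh REAL)"
    )
    connection.executemany("INSERT INTO meter_usage VALUES (?, ?)", records)
    connection.commit()

connection = sqlite3.connect(":memory:")
load(transform(extract(RAW)), connection)
result = connection.execute(
    "SELECT meter_id, total_kwh FROM meter_usage ORDER BY meter_id"
).fetchall()
```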

KPI means Key Performance Indicator. It represents a measure of performance for your problem. KPIs can be plain numbers to measure the financial status of a company, dates to measure the progress of a project, percentages to monitor evolution over time, or more complex measures developed ad hoc. KPIs are the most common content of a company's internal and external reports.

MIS stands for Management Information System. It represents the process that produces the KPIs or other important company information in a report.

DWH means Data Warehouse. This indicates the process of updating the repository where the necessary information, including KPIs, is stored for the MIS to run and produce the MIS reports.