<![CDATA[DMR - Data Mining and Reporting - Blog]]>Fri, 24 May 2013 07:23:41 -0800Weebly<![CDATA[KNIME on the Road]]>Wed, 22 May 2013 15:58:41 GMThttp://www.dataminingreporting.com/2/post/2013/05/knime-on-the-road.htmlKNIME is preparing a number of events around the world this year!
I list them here; you can also check the KNIME web site directly.

Training Courses
KNIME Beginners and Advanced Course
In Zurich on June 10-12 there will be the next Beginners and Advanced Course.
This course is an old acquaintance: it has been running roughly every two months for a couple of years now. It covers KNIME from introductory to increasingly advanced topics, so that participants become knowledgeable KNIME users.

KNIME Server Course
Still in Zurich, on June 14, there will be the KNIME Server Course.
This is a new entry in the KNIME course list!
This course is designed for KNIME users who actually use the server, to equip them with best practices for moving predictive analytics models from prototyping into use in a production environment.

Chemistry in KNIME Webinar

Online on July 3rd. This is also a new entry!
This 2-hour webinar is designed especially for chemists. The traditional user training, being for everybody, does not address specific chemistry topics. This webinar will not cover all chemistry tools available in KNIME, but will give a quick introduction to the most basic and important ones!

KNIME Beginners and Advanced Course in San Francisco
Finally, there will be the first of hopefully a long series of User Trainings in the USA.
In San Francisco on January 13-15, 2014 there will be a Beginners and Advanced Course. The course will follow the same structure as the one held in Zurich. If you live in California (or close by) and have always wanted to learn more about KNIME, you should attend this course!

KNIME Meetups and KNIME User Day
KNIME Meetup Germany
KNIME Meetup Germany takes place on June 5th in Stuttgart. This is the first public event held in German. If you live around Stuttgart and speak German, do not miss it!
Presentations from KNIME and Dymatrix will cover mainly new features in KNIME, text mining, and KNIME enterprise applications.

KNIME User Day UK
In London on June 25th there will be the first KNIME User Day UK. The workshop is free, lasts a full day, and will be packed with interesting presentations.
Hope to see you there!

KNIME User Day Boston

And finally, the KNIME User Day in Boston! After the course on the west coast, KNIME could not miss an event on the east coast! This KNIME User Day is scheduled for October 22nd in Boston. The program is not yet defined but we are trying to put together the most interesting presentations ever about KNIME!

So, KNIME is growing up and moving beyond Europe's borders!
]]>
<![CDATA[KNIME and Big Data]]>Thu, 09 May 2013 19:16:39 GMThttp://www.dataminingreporting.com/2/post/2013/05/knime-and-big-data.htmlBig Data and KNIME
There is a lot of talk these days about big data.
I have always been rather skeptical about it. All of my projects so far have worked very well using KNIME alone, even with very large amounts of data (up to 50 million rows).

176 Million Rows processed with KNIME
However, recently, I worked on a very particular project on electricity usage data, where I had to aggregate and transform 175 million data rows (give or take a million).
The project included reading the time series of electricity usage sampled every half hour between July 15th 2009 and January 1st 2011 for around 6000 meter IDs (households or businesses) located around Ireland. The data came from a pilot that the Irish electricity and gas company ran to evaluate how much information such meters could provide.

The final goal of the project was to predict the electricity usage for clusters of meter IDs.
The half-hourly electricity usage data was aggregated into hourly, daily, and monthly time series for each meter ID. Aggregated information (or KPIs) was extracted from each time series to describe the electricity consumption behavior of each meter ID. On the basis of these KPIs, the 6000 meter IDs were grouped into just 28 clusters. I then worked on the prediction of the hourly, daily, and monthly time series of the 28 cluster prototypes.
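
To make the first aggregation step concrete outside KNIME, here is a minimal plain-Java sketch of summing half-hourly readings into coarser buckets for one meter ID. The class name, the method name, and the array layout of the readings are my assumptions for illustration only; in the workflow this is done with "GroupBy" nodes, not with custom code.

```java
import java.util.Map;
import java.util.TreeMap;

/** Sketch (hypothetical names): aggregate half-hourly readings of one meter into coarser buckets. */
final class UsageAggregation {

    /**
     * readings[i] is the usage in half-hour slot i (slot 0 = 00:00-00:30 of the first day).
     * slotsPerBucket is 2 for hourly totals, 48 for daily totals.
     * Returns bucket index -> summed usage, in ascending bucket order.
     */
    static Map<Integer, Double> aggregate(double[] readings, int slotsPerBucket) {
        Map<Integer, Double> totals = new TreeMap<>();
        for (int slot = 0; slot < readings.length; slot++) {
            int bucket = slot / slotsPerBucket; // integer division groups consecutive slots
            totals.merge(bucket, readings[slot], Double::sum);
        }
        return totals;
    }

    public static void main(String[] args) {
        double[] halfHourly = {0.25, 0.5, 1.0, 1.5}; // made-up values
        System.out.println(aggregate(halfHourly, 2)); // prints {0=0.75, 1=2.5}
    }
}
```

The same method gives daily totals with 48 slots per bucket; monthly totals would need real calendar handling, which this sketch leaves out.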

The project is distributed over three KNIME workflows. The first workflow imports, aggregates, and transforms the raw data, as shown in the picture below.
The workflow ran on a 4-core laptop, with 8 GB RAM, 64-bit, 2.20 GHz, and Windows 7.
The first meta-node, named "Read all Data", loops over all files to read and concatenate their content. The output of that meta-node is a massive data table of 176 million rows containing all available raw data. Reading all those rows from 6 files takes the loop only half an hour, which is really astonishing considering the amount of data!

The second meta-node, named "String to Datetime", converts all date/time values (days, months, year, hours, and minutes) from the original proprietary format into a KNIME DateTime object. The last node of the sub-flow contained in this meta-node is a "Sorter" node that sorts all rows in ascending order by time. All date/time transformation nodes executed relatively quickly, taking up to 2 hours altogether. The "Sorter" node, on the other hand, was the bottleneck of the whole sub-workflow, taking almost 5 hours.

Finally, the aggregations. Two parallel meta-nodes aggregate the data rows into daily, weekly, monthly, and yearly time series and into an hourly time series respectively, and calculate the average amount of energy used during weekends vs. business days, during each day of the week, and during morning, afternoon, evening, night, early morning, and late afternoon. These two meta-nodes also calculate new time series for each meter ID on a daily, weekly, monthly, yearly, and hourly scale. Both meta-nodes are massively time consuming, due to all the aggregations performed with "GroupBy" and "Pivoting" nodes.

The "GroupBy", "Sorter", and "Pivoting" nodes all share the same sorting algorithm and might take resources from each other if they run at the same time. This effect is not noticeable for smaller data, even for 50 million rows. But with 176 million rows it quickly becomes a problem. In this case, you need to execute one node after the other, either manually or by setting artificial dependencies by means of flow variable connections across nodes. Still, even after inserting dependencies, these two meta-nodes took almost 3 days in total to run.

All calculated values are then joined together and a few additional percentages are also inserted. The execution of this meta-node, named "% values", adds just a few hours to the total execution of the workflow.

The good news is that KNIME does not break. If you have enough time and patience you can still run your workflow on 176 million rows, slowly but without major issues. In other projects using very large amounts of data, for example 50 million rows, KNIME has never crashed and the execution times on the laptop above have always been acceptable.

The same Workflow using "RushAccelerator for KNIME"
However, for this project I had only one week available to present some kind of results. Three days of execution time were not really an option. So, for the first time, I decided to try the KNIME solution for Big Data, named RushAccelerator for KNIME and produced by Actian (http://bigdata.pervasive.com/Products/RushAccelerator-for-KNIME.aspx).
You can download a free trial or buy a license from their web site. Actually I downloaded what is now an old version of the "RushAccelerator for KNIME" software, when the company was still called Pervasive. Nowadays there is a new version with many more nodes available and a brand new company name "Actian".
"RushAccelerator for KNIME" installs a new category in your KNIME "Node Repository", named "Pervasive DataRush" that contains a few specialized nodes to run on a big data platform (see picture on the left).
With those, I rebuilt my workflow using the RushAccelerator for KNIME nodes for datetime manipulation, values calculation/extraction, and most of all aggregations.

On the plus side, you can mix most KNIME nodes and RushAccelerator nodes without problems, and some nodes have a GUI similar to that of their corresponding KNIME nodes.

On the minus side, the configuration window of some RushAccelerator nodes is quite different from that of the corresponding KNIME nodes; one example is the "Rows to Columns" node, which is supposed to perform the same task as the "Pivoting" node. However, the highly time-consuming tasks were only a few, repeated many times, so only a few node types had to be learned anew. In addition, RushAccelerator is not free, but license costs are affordable.

RushAccelerator nodes also have a new tab in the configuration window, named "Job Manager Selection". Here you can set the engine you intend to run the workflow on: either the default engine or the DataRush executor.

The new workflow built with RushAccelerator nodes is shown in the figure below. The structure remained the same as in the original workflow, and most of the nodes contained in the meta-nodes actually mirror the original KNIME nodes.

The new Execution Time
And now about the execution time!
The result was quite impressive. The execution time of the whole workflow went down from circa 3 days to one hour and 16 minutes!
That was amazing and moved my attitude towards Big Data from full skepticism to cautious optimism.
I did find an example here where Big Data really made the difference between the success of the project and a total failure!

This does not mean that all workflows need to run on some Big Data platform. However, depending on the amount of data, available time and machine, sometimes you might wish to speed up execution times enough to get results in a reasonable time!
]]>
<![CDATA[Cumulative Sums in KNIME]]>Fri, 19 Apr 2013 10:59:37 GMThttp://www.dataminingreporting.com/2/post/2013/04/cumulative-sums-in-knime.htmlHow to do cumulative sums in KNIME?

At the moment the "GroupBy" node and similar nodes do not offer a cumulative sum aggregation method. We need to resort to a "Java Snippet" node.

Even though people do not usually like to use the "Java Snippet" nodes, the code required to perform a cumulative sum is really minimal.
  1. Let's use a "Java Snippet (simple)" node, which is simpler to configure.
  2. Now let's define a global Java variable in the upper part of the configuration window, in the frame named "Global Variable Declaration", for example:
                          double sum = 0.0;
     This creates a variable named "sum" of type double with initial value 0.0.
  3. Let's add the following code in the frame named "Method Body":
                          sum = sum + <data column>;
                          return sum;

Here, at each row, we add the current value of the selected data column to the Java variable "sum" and return the value of "sum" for the current row.

When moving to the next rows, "sum" is not reset, because it is declared in the global Java variable space. It keeps adding data column values row by row, which gives exactly the cumulative sums.

The only thing missing in the previous piece of code is <data column>.
Double-clicking a column in the "Column List" panel in the upper left corner automatically inserts the data column value into the Java code, ready to be used. So, select the placeholder "<data column>" in the code and double-click the column you would actually like to use in the "Column List" panel.
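
For readers who want to try the logic outside KNIME, here is a minimal stand-alone Java sketch of the same idea. The class and method names are hypothetical: the field plays the role of the "Global Variable Declaration" frame, and the next() method plays the role of the "Method Body" frame, called once per row.

```java
import java.util.Arrays;

/** Sketch (hypothetical names): the cumulative sum logic of the snippet, outside KNIME. */
final class CumulativeSum {

    // Counterpart of the "Global Variable Declaration" frame: kept across rows.
    private double sum = 0.0;

    // Counterpart of the "Method Body" frame: called once per row with the column value.
    double next(double value) {
        sum = sum + value;
        return sum;
    }

    /** Runs the per-row logic over a whole column, top to bottom. */
    static double[] cumulativeSums(double[] column) {
        CumulativeSum snippet = new CumulativeSum();
        double[] result = new double[column.length];
        for (int row = 0; row < column.length; row++) {
            result[row] = snippet.next(column[row]);
        }
        return result;
    }

    public static void main(String[] args) {
        double[] column = {1.0, 2.0, 3.5};
        System.out.println(Arrays.toString(cumulativeSums(column))); // prints [1.0, 3.0, 6.5]
    }
}
```

Note that the result depends on the row order, so the input table should be sorted before the snippet runs.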

Below is an example of a "Java Snippet (simple)" node implementing cumulative sums.
And here the same example but implemented with a "Java Snippet" node.

]]>
<![CDATA[The Explorer Nodes]]>Wed, 10 Apr 2013 14:27:02 GMThttp://www.dataminingreporting.com/2/post/2013/04/the-explorer-nodes.htmlIn KNIME there are a few nodes dedicated to work with and on the KNIME server.
We have already seen the Quickform nodes to design a GUI for the Web Portal of the KNIME Server. Today we will check the "Explorer" nodes.

If you type "Explorer" in the search box of the KNIME workbench, you will get these two nodes: "Explorer Browser" and "Explorer Writer". These two nodes respectively define a path for a file (existing or not) on a KNIME Server and copy a file to a location on a KNIME Server.
In today's example I used "Explorer Browser" to find and read a file on a KNIME Server and "Explorer Writer" to copy a local file to a location on the KNIME Server.

First of all, we need a KNIME Server, with Team Space installed.
Then you need Team Space installed on your local machine (just installed, not enabled with a license file).
If you have data files on the server (we will see in another post how to place data files on the server), you should see them in the KNIME Explorer when you log in.
Now, if you drag and drop any of the data files into your workflow editor, you will automatically get a "File Reader" node configured to read this file directly from the server. Notice that the URL used in the "File Reader" node does not use the "file://" protocol anymore, but the new "knime://" protocol which refers to a KNIME server. You should get something like:    
       knime://knime-server-demo/KNIME_Server/data/Sentiment_Rating.csv

However, if you want to read the file on the server in a more dynamic way, you can use the "Explorer Browser" node to browse your server and look for the directory containing your file. Below, you can see the configuration window of the "Explorer Browser" node. The "Browse" button allows you to browse the workflows and files on the server. The "Explorer Browser" node outputs a flow variable containing the defined path. This path can then be used as a setting in a "File Reader" node.

Once you have your data, at the end of your workflow execution, you might want to save them again on the server. To do that, you need:
1. to create a temporary local folder with the "Create Temp Dir" node
2. to transform the folder path into a file path by adding a filename (I used here a "Java Edit Variable" node)
3. to write the data into your just defined local filepath via a "CSV Writer" node
4. to upload the temporary file to its final destination on the server using an "Explorer Writer" node.
Below is the configuration window of the "Explorer Writer" node. It requires:
- the filepath of the file to upload in the form of a flow variable
- the target location on the server
And below you can see the full workflow that I used to read and write data from and back to the KNIME server.
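
Step 2 above, turning the temporary folder path into a file path, is only a couple of lines of Java. Here is a minimal stand-alone sketch of that kind of logic; the class name, the method name, and the file name "results.csv" are hypothetical, and it does not reproduce the exact code of my "Java Edit Variable" node.

```java
import java.nio.file.Paths;

/** Sketch (hypothetical names): append a file name to a temporary folder path. */
final class TempFilePath {

    /** Joins folder and file name using the separator of the current operating system. */
    static String toFilePath(String tempDir, String fileName) {
        return Paths.get(tempDir, fileName).toString();
    }

    public static void main(String[] args) {
        // The JVM temp folder stands in for the output of the "Create Temp Dir" node.
        String tempDir = System.getProperty("java.io.tmpdir");
        System.out.println(toFilePath(tempDir, "results.csv"));
    }
}
```

Using Paths.get rather than string concatenation avoids hard-coding "/" or "\" as the separator.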
]]>
<![CDATA[Switch to read one file or another]]>Thu, 04 Apr 2013 18:42:52 GMThttp://www.dataminingreporting.com/2/post/2013/04/switch-to-read-one-file-or-another.htmlMore and more often I am having a curious problem when implementing my workflows.
The amount of data is getting bigger with every project, so much so that sometimes I have to wait a long time for the workflow to fully execute. In my last project, for example, I had to deal with 176 million rows!

Now it is understandable that I want to use a smaller data set while building my workflow and the real data set just in the last full workflow execution. I need then to implement a switch, controlled by a flow variable, that allows the workflow to run on all data or alternatively only on a subset of data depending on the flow variable value.

The "String Radio Buttons" Quickform
Creating a two-value flow variable is easy with the "String Radio Buttons" Quickform node. In this case the "String Radio Buttons" Quickform node creates a flow variable named "port" with two possible values: "partial" or "full". If "partial" is selected, the workflow runs on a subset of the data; if "full" is selected, the workflow runs on the full set of data. Below is the configuration window of the "String Radio Buttons" Quickform node.
Reading the Data
The data consists of 6 files. Reading the full data set means to loop across all 6 files and collect the read data. Reading a subset of the data means reading only one of the files.
The input then comes from a "List Files" node. The "List Files" node produces the list of files available in a selected location on your machine (see picture below).
The selected location is inserted in the configuration window of the "List Files" node and the node produces a data table where each row contains the file path to one of the files in the selected location.
At this point, a "TableRow To Variable Loop Start" node translates each file path into a variable and starts a loop across them, where a "File Reader" node reads the file, and the "Loop End" node collects the results. This would be one branch of the switch.

How to connect a File Reader to a switch node?
The other branch just reads one of the possible files. The problem here is: how do I force a "File Reader" node to follow the enabling/disabling of the switch node?
The switch node outputs a data table and all connected nodes will be disabled/enabled depending on the switch node status. However, the "File Reader" node does not take any input, so it cannot be connected to the switch node.

The solution to this problem is to make the connection via a Flow Variable. The data table with all paths, output by the switch node, is then converted into a flow variable with a "TableRow To Variable" node. The output flow variable, containing only the first path of the switch output data table, is passed to the "File Reader" node and used to set the file path. In this way, if the switch output port is disabled, the whole branch, including the "File Reader" node, is disabled.

Finishing the workflow
The switch block is then closed by an "END IF" node to collect the resulting data.

A "Java Edit Variable" node is used to transform the initial selection between "full" and "partial" and to control the switch node output ports.
The final workflow is shown in the figure below.
This little workflow implements a selective reading.
If we choose "full", all data are read by looping on all available files.
If we choose "partial", only the content of the first file is read and imported into KNIME.
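
The decision logic of this little workflow can be sketched in a few lines of stand-alone Java. All names are hypothetical, and the port numbering (0 for the partial branch, 1 for the full branch) is an assumption made for this sketch, not the actual encoding used by the switch node.

```java
import java.util.Arrays;
import java.util.List;

/** Sketch (hypothetical names): choose which files to read from the "port" flow variable. */
final class SelectiveReading {

    /** Maps the radio button value to a branch index; the numbering is an assumption. */
    static int portIndex(String choice) {
        switch (choice) {
            case "partial":
                return 0;
            case "full":
                return 1;
            default:
                throw new IllegalArgumentException("Unknown choice: " + choice);
        }
    }

    /** "full" returns every file path, "partial" only the first one (if any). */
    static List<String> filesToRead(String choice, List<String> allFiles) {
        if (portIndex(choice) == 1) {
            return allFiles;
        }
        return allFiles.subList(0, Math.min(1, allFiles.size()));
    }

    public static void main(String[] args) {
        List<String> files = Arrays.asList("File1.txt", "File2.txt", "File3.txt"); // made-up names
        System.out.println(filesToRead("partial", files)); // prints [File1.txt]
        System.out.println(filesToRead("full", files));    // prints [File1.txt, File2.txt, File3.txt]
    }
}
```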
]]>
<![CDATA[KNIME UGM 2013]]>Tue, 26 Mar 2013 10:38:03 GMThttp://www.dataminingreporting.com/2/post/2013/03/knime-ugm-2013.htmlThat was a week at the last KNIME User Group Meeting!
There were 149 participants (I actually predicted 150 when talking to the organizers a few months before :-) ) from all over the world, ranging from the very expert users to the node developers, from the curious ones to the enterprise adopting the KNIME Server as their enterprise backbone.

The event was filled with constant chances to learn, network, and get inspired! Your brain could not afford to remain idle with all the ideas coming from the talks, the deeper learning offered by the workshops, the stimulating conversations over lunch and dinner, and the endless networking opportunities! You can have a quick peek via the KNIME Community web site: http://www.knime.org/ugm2013_news.

Talks covered many different disciplines - from banking to finance, from social media to life sciences - and many different topics - from automatic reporting to new nodes, from big data to RESTful web services, from the KNIME Server to including R in KNIME.

I really learned something from all of the talks.
If I have to choose a few, I would say I particularly liked Man-Ling Lee's talk (Genentech). She presented work in which KNIME was used as a control console to run, import, and monitor a number of other software tools. Particularly impressive was the dynamic command line node generation!
The "Social CRM at Telekom" talk showed a practical and useful combination of text mining and network analytics to explore customers' attitudes and influence.
Text Processing and Network Mining were also the main actors in Frank Dullweber's talk from Boehringer Ingelheim.
The work presented in the "Report automatization in KNIME" talk controlled the final report layout with a few Quickform variables in the workflow. The resulting, very different reports were very impressive!
And many more! I must say every talk had at least one very interesting idea to take back home!

All in all an extremely stimulating experience!
I hope to meet more of you at the next KNIME User Group Meeting in 2014!

]]>
<![CDATA[Double Numbers in KNIME Decision Trees]]>Mon, 11 Feb 2013 21:44:05 GMThttp://www.dataminingreporting.com/2/post/2013/02/double-numbers-in-knime-decision-trees.htmlHave you ever noticed that a KNIME decision tree sometimes exhibits a non integer number of patterns covered by a node? How is it possible that 24.5 patterns end up in a node of the decision tree?

This is due to missing values. Suppose, for example, we are at a two-way split on gender and the current pattern has a missing value for gender: the pattern then gets assigned to each child node in proportion to that node's probability. If the other patterns, the ones with no missing value for gender, are equally distributed between the two nodes, the pattern is assigned as 0.5 to one node and 0.5 to the other. Each branch would then cover 49.5 patterns out of 99.
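
The arithmetic can be sketched in stand-alone Java as follows. The names are hypothetical, and the proportional rule is simply the one described above, not a copy of the decision tree node's actual code.

```java
import java.util.Arrays;

/** Sketch (hypothetical names): fractional pattern counts at a two-way split. */
final class FractionalCounts {

    /**
     * knownLeft and knownRight are the patterns with a known split value in each branch;
     * missing is the number of patterns whose split value is missing.
     * Returns {leftCount, rightCount} after spreading the missing patterns proportionally.
     * Assumes at least one pattern has a known value (knownLeft + knownRight > 0).
     */
    static double[] distribute(int knownLeft, int knownRight, int missing) {
        double known = knownLeft + knownRight;
        double probabilityLeft = knownLeft / known;
        double probabilityRight = knownRight / known;
        return new double[] {
            knownLeft + missing * probabilityLeft,
            knownRight + missing * probabilityRight
        };
    }

    public static void main(String[] args) {
        // 98 patterns with known gender, split evenly, plus 1 pattern with missing gender.
        System.out.println(Arrays.toString(distribute(49, 49, 1))); // prints [49.5, 49.5]
    }
}
```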
]]>
<![CDATA[Boosting]]>Tue, 15 Jan 2013 14:25:23 GMThttp://www.dataminingreporting.com/2/post/2013/01/boosting.html And after bagging, we do need to talk about boosting.
Boosting is another committee-based approach. It works with weights in both steps: learning and prediction.

Learning.
During the learning phase, the boosting procedure trains a learning algorithm a number of times, each time using a slightly different composition of the training set.
At each iteration, the boosting algorithm:
      -          Starts with the training set built in the previous iteration,
      -          Trains a new model,
      -          Evaluates the model error on the training patterns,
      -          Calculates the model weight based on such error,
      -          Finally, builds a new training set by over-sampling /under-sampling the incorrectly/correctly classified training patterns. The over-sampling/under-sampling factor derives from the model weight.

The training set for the first iteration is the training set provided for the whole learning procedure.
The algorithm stops when the maximum number of iterations has been reached or when the model error is too big (that is, the weight is too close to 0 and therefore the corresponding model is ineffective).
The output of this learning phase is a number of models, less than or equal to the selected maximum number of iterations.
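
To make the model weight and the re-sampling step more tangible, here is a minimal Java sketch of one iteration using the textbook two-class AdaBoost formulas. This is an assumption on my side: the text above does not state the formulas, and the KNIME nodes may use a different variant. All names are hypothetical.

```java
import java.util.Arrays;

/** Sketch (hypothetical names): one boosting iteration with textbook two-class AdaBoost formulas. */
final class BoostingStep {

    /**
     * Model weight from the weighted training error, for 0 < error < 1.
     * An error of 0.5 (random guessing on two classes) gives weight 0.
     */
    static double modelWeight(double error) {
        return 0.5 * Math.log((1.0 - error) / error);
    }

    /**
     * Raises the sampling weight of misclassified patterns and lowers the others,
     * then normalizes so that the weights sum to 1.
     */
    static double[] reweight(double[] weights, boolean[] misclassified, double alpha) {
        double[] updated = new double[weights.length];
        double total = 0.0;
        for (int i = 0; i < weights.length; i++) {
            updated[i] = weights[i] * Math.exp(misclassified[i] ? alpha : -alpha);
            total += updated[i];
        }
        for (int i = 0; i < updated.length; i++) {
            updated[i] /= total;
        }
        return updated;
    }

    public static void main(String[] args) {
        double[] weights = {0.25, 0.25, 0.25, 0.25};      // uniform start
        boolean[] wrong = {true, false, false, false};    // one pattern misclassified
        double alpha = modelWeight(0.25);                 // weighted error is 0.25
        System.out.println(Arrays.toString(reweight(weights, wrong, alpha)));
    }
}
```

With these formulas the misclassified pattern ends up with half of the total weight, so the next model is pushed to concentrate on it. The updated weights can be read as sampling probabilities for building the next training set.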

Notice that boosting can be applied to any training algorithm. However, it is particularly helpful in the case of weak classifiers. In fact, boosting techniques are quite sensitive to noise and outliers, that is to overfitting.

Prediction.
The prediction phase loops on all models, available from the learning phase, and provides a prediction based on the majority vote for classifiers and on a weighted average for regression techniques.
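
Since boosting works with weights in the prediction step too, the vote for a classifier can be sketched as a weighted vote: each model adds its weight to the class it predicts, and the class with the highest total wins. The Java below is a minimal sketch with hypothetical names and made-up class labels, not the code of the KNIME nodes.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Sketch (hypothetical names): weighted vote over the predictions of several models. */
final class WeightedVote {

    /** predictions[m] is the class predicted by model m, modelWeights[m] its weight. */
    static String vote(String[] predictions, double[] modelWeights) {
        // LinkedHashMap keeps first-seen order, so ties go to the class seen first.
        Map<String, Double> scores = new LinkedHashMap<>();
        for (int m = 0; m < predictions.length; m++) {
            scores.merge(predictions[m], modelWeights[m], Double::sum);
        }
        String best = null;
        double bestScore = Double.NEGATIVE_INFINITY;
        for (Map.Entry<String, Double> entry : scores.entrySet()) {
            if (entry.getValue() > bestScore) {
                best = entry.getKey();
                bestScore = entry.getValue();
            }
        }
        return best;
    }

    public static void main(String[] args) {
        String[] predictions = {"classA", "classB", "classB"};
        double[] weights = {0.9, 0.3, 0.4};
        // classA scores 0.9, classB about 0.7: the single strong model outvotes two weak ones.
        System.out.println(vote(predictions, weights));
    }
}
```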

KNIME Implementation
KNIME implements AdaBoost, one of the most commonly used boosting algorithms, with two meta-nodes in the “Mining -> Ensemble Learning” category: the “Boosting Learner” and the “Boosting Predictor” meta-nodes.

The “Boosting Learner” meta-node (see figure below) implements the learning loop via the “Boosting Learner Loop Start” node and the “Boosting Learner Loop End” node.
The “Boosting Learner Loop End” node (see figure below) sets the maximum number of iterations, the target column, and the predicted column. The target column and the predicted column are used to:
    -          Identify the mis-classified patterns
    -          Calculate the model error
    -          Calculate the model weight
The “Boosting Learner Loop Start” node uses the model weight and the mis-classified patterns to alter the composition of the training set.
The loop body includes any supervised training algorithm node, like a “Decision Tree Learner” or a “Naïve Bayes Learner”, and its corresponding predictor node. The predictor node is necessary, even though this is a learning meta-node, because at each iteration the identification of correctly and incorrectly classified patterns and the model error calculation are needed.

For each iteration the boosting loop outputs the model, its error, and its weight.

The “Boosting Predictor” meta-node receives the model list from the learner node and the test set patterns. For each test pattern, it loops on all models and weighs their prediction result.
The “Boosting Predictor Loop Start” node starts the boosting predictor loop by identifying the weight column and the model column (see settings in its configuration window).

The “Boosting Predictor Loop End” node implements the majority vote on all model results and assigns the final value to the test pattern. Its configuration window requires the identification of the prediction column.

The loop body just includes the predictor of the mining model selected for the learning phase.

Below you can see our implementation of boosting in KNIME, with a decision tree on the cars-85.csv data set. The task here was to predict a car's fuel system based on its number of doors, wheel base, and width, using at most 5 models. Boosting produced 5 models (the maximum number allowed) and 0.714 accuracy.
Had we used a single decision tree for the same task, we would also have scored 0.714 accuracy. This suggests that boosting should be used only for complex problems or weak classifiers. For strong classifiers and simple problems, the additional models only model noise and outliers.
]]>
<![CDATA[Bagging]]>Thu, 27 Dec 2012 12:59:46 GMThttp://www.dataminingreporting.com/2/post/2012/12/bagging.html Bootstrap aggregation, or shortly said bagging, is a an ensemble meta-learning technique that includes training many classifiers on different partitions of the training data and using the majority vote on the results of all those classifiers to define the final answer for a test pattern.
This technique was proposed by Breiman in 1994 and can be used with many classification methods.  The final effect of this technique is to reduce the variance associated with prediction, and thereby improve the prediction process.

The Bagging technique
Here are the steps:
    1.   X bootstrap samples are drawn from the available training data,
    2.   A classification model is trained on each bootstrap sample,
    3.   All classification models are run on the test set
   4.   For each test pattern, the results are combined by simple voting: the class with the majority of results across all classifiers is chosen as the final class.
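
Steps 1 and 4 above are easy to sketch in stand-alone Java: drawing a bootstrap sample means sampling row indices with replacement, and simple voting means counting the predicted classes. The names below are hypothetical and the models themselves are left out; this is an illustration of the technique, not the code of any KNIME node.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Random;

/** Sketch (hypothetical names): bootstrap sampling and simple majority voting. */
final class BaggingSketch {

    /** Step 1: draws n row indices with replacement from a training set of n rows. */
    static int[] bootstrapIndices(int n, Random rng) {
        int[] indices = new int[n];
        for (int i = 0; i < n; i++) {
            indices[i] = rng.nextInt(n); // the same row may be drawn more than once
        }
        return indices;
    }

    /** Step 4: the class predicted by most models wins; ties go to the class seen first. */
    static String majorityVote(String[] predictions) {
        Map<String, Integer> counts = new LinkedHashMap<>();
        for (String prediction : predictions) {
            counts.merge(prediction, 1, Integer::sum);
        }
        String best = null;
        int bestCount = -1;
        for (Map.Entry<String, Integer> entry : counts.entrySet()) {
            if (entry.getValue() > bestCount) {
                best = entry.getKey();
                bestCount = entry.getValue();
            }
        }
        return best;
    }

    public static void main(String[] args) {
        System.out.println(Arrays.toString(bootstrapIndices(5, new Random(42))));
        System.out.println(majorityVote(new String[] {"yes", "no", "yes"})); // prints yes
    }
}
```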

Bagging can also be applied to regression methods. In this case, though, the final result is the average of the results of all regression models.

The “Bagging” meta-node
KNIME offers a meta-node to implement bagging for classification: the “Bagging” meta-node.
The “Bagging” meta-node applies only to classifiers, i.e. to mining techniques outputting nominal class values. A version for numerical output values, like for regression methods, with averaging of the final results is not yet available out of the box. However, customization of this “Bagging” meta-node should be straightforward.
The “Bagging” meta-node has two input ports and one output port.
The upper input port receives the training set data, while the lower input port receives the test data.
The output port presents the results for all the trained models and the final result after the majority vote.


The sub-workflow of the “Bagging” meta-node
The “Bagging” meta-node consists of two steps: training a number X of models and applying the X models to the test set.


The training step
The goal of the training step is to create X partitions of the training data and to train X models on them.

    -    This step starts with shuffling the data. Shuffling is necessary, otherwise the X partitions of training data might only represent a subset of the whole data universe.
    -     The “Chunk Loop Start” node then divides the training data set in a number of X partitions and loops around them.
    -     At each iteration, a partition is used to train a model. In this case, the model is a decision tree, but it could be any other classifier scheme.
    -      Finally, the “Model Loop End” node collects all models resulting from the learner node at each iteration.

The training loop then produces a list of X models, each one trained on a different partition of the training set.
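
Unlike sampling with replacement, the training step described here shuffles the rows and cuts them into X disjoint chunks. A minimal Java sketch of that shuffle-and-chunk idea is shown below; the names are hypothetical, and the way the chunk boundaries are computed is my assumption, not necessarily what the “Chunk Loop Start” node does.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.Random;

/** Sketch (hypothetical names): shuffle the training rows and split them into X chunks. */
final class ChunkPartition {

    /** Returns `chunks` disjoint partitions that together contain every row exactly once. */
    static <T> List<List<T>> shuffleAndChunk(List<T> rows, int chunks, long seed) {
        List<T> shuffled = new ArrayList<>(rows);
        Collections.shuffle(shuffled, new Random(seed)); // fixed seed keeps the run repeatable
        List<List<T>> partitions = new ArrayList<>();
        int n = shuffled.size();
        for (int c = 0; c < chunks; c++) {
            int from = (int) ((long) c * n / chunks);
            int to = (int) ((long) (c + 1) * n / chunks);
            partitions.add(new ArrayList<>(shuffled.subList(from, to)));
        }
        return partitions;
    }

    public static void main(String[] args) {
        List<Integer> rows = Arrays.asList(1, 2, 3, 4, 5, 6, 7, 8, 9, 10);
        for (List<Integer> partition : shuffleAndChunk(rows, 3, 7L)) {
            System.out.println(partition); // one model would be trained per partition
        }
    }
}
```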

The test step
This is also a loop running across the X developed models.

     -        The “Model Loop Start” node starts a loop on the models listed in the input data. At each iteration, one model of the list is loaded.
     -        The predictor node applies the current model to the test set. Notice that the predictor node is connected to the lower input port of the “Bagging” meta-node. This is because it has to operate on the test set.
     -        The loop is closed by a “Voting Loop End” node. This node collects the results of the predictor node for the different models in separate columns. Therefore its output data table shows as many prediction columns as there are classification models. In addition, the node also performs the majority vote, that is, it outputs the final classification result for each test data row on the basis of how many classifiers agree on a given output class.

The Results
Below the output data table is shown for a “Bagging” node using 10 decision trees.
The parameter defining the number of classifiers to use is the number of data chunks in the “Chunk Loop Start” node.
Parameterizing the “Bagging” meta-node

The “Bagging” node can be easily parameterized by using a Quickform “Integer Input” node. This Quickform node defines the number of models to be created and controls the number of data chunks in the “Chunk Loop Start” node (see sub-workflow below).
Below, the configuration window of the newly parameterized “Bagging” meta-node is shown.
Comparing with the simple version of the same classifier
In the following workflow, we compare the performance of our parameterized “Bagging” node using decision trees vs. a simple decision tree configured similarly to the decision trees in the “Bagging” node.
Notice that the “Bagging” node includes both training and testing phases, unlike the simple decision tree technique.
The “Bagging” strategy displays an accuracy of 0.835, slightly better than the accuracy of 0.821 displayed by the simple decision tree, as was to be expected.

]]>
<![CDATA[KNIME 2.7 is out!]]>Sat, 08 Dec 2012 07:52:36 GMThttp://www.dataminingreporting.com/2/post/2012/12/knime-27-is-out.htmlKNIME version 2.7 is now available.
Here you can find the list of changes http://tech.knime.org/whats-new-in-knime-27]]>