Once you have established that it would be beneficial to integrate some big data processing into a KNIME workflow, the problems have only just begun. First, it can be quite complex to connect to a big data platform, design the appropriate SQL query, and then retrieve the data accordingly. Second, there are so many different types of big data platforms that choosing one can become a long and tedious task.
In both cases, however, KNIME can help.
A few nodes specific to big data are available in the commercial KNIME Big Data Extension, which can be purchased at: http://www.knime.org/knime-big-data-extension
Simplified Database Connectors for Big Data Platforms
KNIME offers a few connector nodes to connect to databases in general and to big data platforms in particular. Some of these nodes have been designed specifically for particular big data platforms.
These dedicated connectors provide a very simple configuration window requiring only the basic access parameters, such as credentials. Among them are the Hive Connector to connect to Apache Hive and the Impala Connector to connect to the Impala database.
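Under the hood, such a dedicated connector essentially opens a JDBC connection from those few parameters. As a rough sketch in plain Java, a Hive connection could look like the following, assuming the standard Apache Hive JDBC driver is on the classpath; the host, port, database, and credentials are placeholders:

    import java.sql.Connection;
    import java.sql.DriverManager;

    public class HiveConnectionSketch {
        public static void main(String[] args) throws Exception {
            // Register the standard Apache Hive JDBC driver.
            Class.forName("org.apache.hive.jdbc.HiveDriver");
            // Host, port, database, and credentials are placeholders for
            // your own cluster: the same few parameters the Hive Connector
            // node asks for in its configuration window.
            Connection conn = DriverManager.getConnection(
                    "jdbc:hive2://hive-server.example.com:10000/default",
                    "username", "password");
            System.out.println("Connected: " + !conn.isClosed());
            conn.close();
        }
    }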
Apache Hive (https://hive.apache.org/) is a data warehouse infrastructure built on top of Hadoop for providing data summarization, query, and analysis.
Impala (http://www.cloudera.com/content/cloudera/en/products-and-services/cdh/impala.html) is a fully integrated, state-of-the-art analytic database architected specifically to leverage the flexibility and scalability of Hadoop. It combines the familiar SQL support and multi-user performance of a traditional analytic database with the rock-solid foundation of open-source Apache Hadoop and the production-grade security and management extensions of Cloudera Enterprise.
A connector for the Hortonworks (http://hortonworks.com/hdp/) big data platform will follow soon.
In case a dedicated connector is not available for the big data platform of your choice, you can always connect using the generic Database Connector node. The only difference from a dedicated node is that the JDBC driver for the chosen platform has to be uploaded on the Preferences page and selected in the configuration window of the Database Connector node. JDBC drivers must be obtained from the vendor's site.
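Conceptually, the generic Database Connector node does the same thing with the vendor's driver swapped in. In this hypothetical sketch, the driver class name and JDBC URL are made-up placeholders for whatever your vendor actually ships:

    import java.sql.Connection;
    import java.sql.DriverManager;

    public class GenericConnectorSketch {
        public static void main(String[] args) throws Exception {
            // Hypothetical vendor driver class and JDBC URL; substitute
            // the real ones supplied by your platform vendor. Everything
            // else is identical to the Hive sketch above.
            Class.forName("com.vendor.jdbc.Driver");
            Connection conn = DriverManager.getConnection(
                    "jdbc:vendor://bigdata-server.example.com:21050/default",
                    "username", "password");
            conn.close();
        }
    }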
SQL Helper Nodes
Once a connection to a database or to a big data platform has been established with one of the connector nodes, we can select the table to work on and build the SQL query to implement the required ETL operations.
Writing a complex SQL query is not for everybody. For users less expert in SQL, KNIME provides a few SQL-transparent nodes that implement an operation without the user ever touching the underlying SQL query. These SQL helper nodes make implementing ETL procedures on a big data platform extremely easy and fast.
They also make the transition from one big data platform to another very easy: it is enough to swap the connector node that feeds the SQL helper nodes. The SQL helper nodes themselves do not need to change. This preserves the agility of the KNIME Analytics Platform even after a big data platform has been integrated into the workflow.
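The following sketch illustrates this decoupling in plain JDBC terms. The query string stands in for whatever the SQL helper nodes would generate (the sales table and its columns are invented for the example), while the connection, the only platform-specific part, is passed in from outside:

    import java.sql.Connection;
    import java.sql.ResultSet;
    import java.sql.Statement;

    public class PortableEtlSketch {
        // Stand-in for the query built by the SQL helper nodes; the
        // "sales" table and its columns are hypothetical.
        static final String QUERY =
                "SELECT product, SUM(amount) AS total FROM sales GROUP BY product";

        // The same query runs unchanged against any connection, whether
        // it points to Hive, Impala, or another SQL engine.
        static void runEtl(Connection conn) throws Exception {
            try (Statement stmt = conn.createStatement();
                 ResultSet rs = stmt.executeQuery(QUERY)) {
                while (rs.next()) {
                    System.out.println(rs.getString("product")
                            + ": " + rs.getObject("total"));
                }
            }
        }
    }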
Step-by-Step Integration of Big Data into KNIME
In summary, the integration of a big data platform into KNIME is very straightforward.
1. Drag and drop the appropriate connector node into the workflow to connect to the big data platform of choice.
2. Configure the connector node with the parameters required to access the data on the big data platform: credentials, server URL, and other platform-specific settings.
3. Define the SQL query that performs the ETL operations with the help of the SQL helper nodes. These nodes help you build the correct SQL query even if you know nothing about SQL.
4. Finally, execute a data retrieval node (the Database Connection – Table Reader node) to retrieve the data using the previously built SQL query, as illustrated in the sketch below.
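In plain JDBC terms, this last step amounts to executing the built query and iterating over the result set. The hypothetical readTable method below reuses the connection and query from the previous steps; the KNIME node does the same and additionally materializes the rows as a KNIME data table:

    import java.sql.Connection;
    import java.sql.ResultSet;
    import java.sql.ResultSetMetaData;
    import java.sql.Statement;

    public class TableReaderSketch {
        // Executes the previously built query and prints every row,
        // reading the columns generically via the result set metadata.
        static void readTable(Connection conn, String query) throws Exception {
            try (Statement stmt = conn.createStatement();
                 ResultSet rs = stmt.executeQuery(query)) {
                ResultSetMetaData meta = rs.getMetaData();
                while (rs.next()) {
                    StringBuilder row = new StringBuilder();
                    for (int i = 1; i <= meta.getColumnCount(); i++) {
                        if (i > 1) row.append(", ");
                        row.append(meta.getColumnLabel(i))
                           .append('=').append(rs.getObject(i));
                    }
                    System.out.println(row);
                }
            }
        }
    }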
Such an easy approach opens the door to the introduction of big data platforms into KNIME without the headache of configuring every tiny platform detail. It also preserves the quick-prototyping character of a KNIME workflow: the user can change the big data platform of choice just by changing the database connector node in step 1 and reconnecting it to the subsequent SQL helper nodes.
It is really that easy to integrate big data platforms into KNIME and to considerably speed up the whole ETL part of the data science discovery journey!