There has been a lot of talk about the Internet of Things lately, for example about intelligent household systems (http://www.wired.com/2014/01/googles-3-billion-nest-buy-finally-make-internet-things-real-us/) or smarter cities (http://www.smartsantander.eu/).
The Internet of Things poses a great challenge for data analysts: on the one hand, the very large amounts of data generated over time; on the other, the algorithms needed to make the sensor-equipped object (a house, or a whole city) capable of learning and therefore smarter.
Together with Phil Winters and Aaron Hart, we worked on an Internet of Things-related problem.
The Data Set
The original data came from Capital BikeShare, a bike-sharing service for tourists and residents in the downtown Washington DC area. The site offers publicly available data describing duration, start date, end date, start station, end station, bike number, and customer type. The dimensionality of the data set is therefore not very high.
However, we believe you no longer need to own all the sensors yourself to build an Internet of Things use case. You can integrate the data you have with the wealth of complementary data sources available on the Internet.
In this case, we integrated our original bike data with topology, elevation, local weather, holiday schedules, traffic conditions, business locations, tourist attractions, and other types of information widely available on the Internet via web or REST services. The Google APIs alone provide a wealth of topological, calendar, and weather information.
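Once each external source is keyed on a common timestamp, the integration itself is a simple join. Here is a minimal Python sketch of that idea; the field names (`hour`, `temp`, `station`) are made up for illustration and are not the actual columns of the data set:

```python
def enrich_with_weather(trips, weather):
    """Left-join hourly weather records onto trip records,
    matching on a shared hourly timestamp key."""
    wx_by_hour = {w["hour"]: w for w in weather}
    enriched = []
    for trip in trips:
        wx = wx_by_hour.get(trip["hour"], {})
        # keep all trip fields, add the temperature (None if no match)
        enriched.append({**trip, "temp": wx.get("temp")})
    return enriched

trips = [{"hour": "2012-10-01 08:00", "station": "31000"}]
weather = [{"hour": "2012-10-01 08:00", "temp": 15.0}]
result = enrich_with_weather(trips, weather)
```

The same pattern extends to holidays, traffic, or any other source, as long as each one can be reduced to records sharing a join key with the bike data.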
Raw data is typically a large amount of poorly organized data with low information density. At this point, we needed to run some aggregation and transformation operations to expose a more meaningful set of data features, such as the number of bikes available at each station over time, the number of bike renters over time, and the total number of bikes on a given path.
The aggregation step is always important, both to reduce the data dimensionality and to produce a data set with higher information density.
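For instance, the number of renters per station per hour can be derived from raw trip records with a simple group-and-count. This is a hedged Python sketch of that aggregation, not the actual KNIME workflow; the record layout is assumed:

```python
from collections import Counter
from datetime import datetime

# hypothetical raw trip records: (start_station, start_time)
trips = [
    ("31000", "2012-10-01 08:12"),
    ("31000", "2012-10-01 08:47"),
    ("31005", "2012-10-01 08:55"),
    ("31000", "2012-10-01 09:02"),
]

def renters_per_station_hour(records):
    """Aggregate raw trips into hourly rental counts per station."""
    counts = Counter()
    for station, start in records:
        # truncate the start time to the hour to form the group key
        hour = datetime.strptime(start, "%Y-%m-%d %H:%M").strftime("%Y-%m-%d %H:00")
        counts[(station, hour)] += 1
    return dict(counts)

hourly = renters_per_station_hour(trips)
```

Each (station, hour) key now carries one number instead of a pile of raw rows, which is exactly the dimensionality reduction and density gain described above.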
As usual, before moving to the analytics, we experimented with some visualization techniques using different graphical tools:
- traditional scatter plots with KNIME Views nodes
- traffic flow lines with R ggplot2 in KNIME R Integration extension
- station locations on geographical maps using KNIME Open Street Map Integration
- graphs to represent traffic flow using KNIME Network extension
Time Series Prediction
We tackled two problems here:
- a restocking alert system to tell us one hour in advance when a bike restocking will be needed at a given bike station
- a prediction of the number of bike renters over the next hour
To solve the second problem, we resorted to an auto-regressive time series predictor with a seasonality correction factor.
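The general idea can be sketched as: estimate a seasonal index from the history, remove it, fit a simple auto-regressive coefficient on the residuals, and add the seasonality back to the one-step forecast. The following is a minimal illustration of that scheme under those assumptions, not the model used in the whitepaper:

```python
def fit_ar1(series):
    """Least-squares AR(1) coefficient a, so that x[t] ~ a * x[t-1]."""
    num = sum(series[t] * series[t - 1] for t in range(1, len(series)))
    den = sum(series[t - 1] ** 2 for t in range(1, len(series)))
    return num / den if den else 0.0

def predict_next_hour(series, season=6):
    """One-step forecast: seasonal index plus AR(1) on the residuals."""
    # average value of each phase within the seasonal period
    phases = [[] for _ in range(season)]
    for t, x in enumerate(series):
        phases[t % season].append(x)
    seasonal = [sum(p) / len(p) for p in phases]
    # remove the seasonal component before fitting the auto-regression
    residuals = [x - seasonal[t % season] for t, x in enumerate(series)]
    a = fit_ar1(residuals)
    return seasonal[len(series) % season] + a * residuals[-1]
```

On a perfectly periodic series the residuals vanish and the forecast reduces to the seasonal index alone; the AR term only corrects deviations from the seasonal pattern.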
The two models developed in the previous part are quite crude.
The first one uses all the information we found available. Some of it might be useless; some might even be confusing. We ran a backward feature selection algorithm to define the leanest restocking alert system possible.
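Backward feature selection amounts to a greedy loop: start from all features and repeatedly drop the one whose removal hurts the model score the least, as long as the score does not degrade. The `score` function below is a stand-in; in practice it would be, say, the cross-validated accuracy of the alert model on a given feature subset:

```python
def backward_selection(features, score):
    """Greedy backward elimination.
    `score(subset)` returns a quality measure (higher is better)."""
    current = list(features)
    best = score(current)
    improved = True
    while improved and len(current) > 1:
        improved = False
        for f in list(current):
            trial = [x for x in current if x != f]
            s = score(trial)
            if s >= best:  # dropping f does not hurt: keep the leaner set
                current, best, improved = trial, s, True
                break
    return current, best

# toy score: only "weather" and "hour" matter; a tiny penalty per feature
def toy_score(subset):
    return len(set(subset) & {"weather", "hour"}) - 0.01 * len(subset)

selected, _ = backward_selection(["weather", "hour", "noise1", "noise2"], toy_score)
```

With the toy score, the two noise features are eliminated and only the informative ones survive, which is the "leanest system possible" goal described above.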
The second model uses arbitrary values for the seasonality factor and for the number of past values used for prediction. Here we ran a brute-force optimization loop to find the best values for the seasonality index (6 hours) and the number of past samples to use (20 hours).
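The brute-force loop itself is just an exhaustive grid search over the two parameters, keeping the pair with the lowest validation error. In this sketch, `forecast_error` is a placeholder for whatever error measure the model is scored with; the toy error surface used below is made up for illustration:

```python
def grid_search(seasons, windows, forecast_error):
    """Try every (season, window) combination and return the pair
    that minimizes the supplied error function."""
    best_err, best_pair = float("inf"), None
    for season in seasons:
        for window in windows:
            err = forecast_error(season, window)
            if err < best_err:
                best_err, best_pair = err, (season, window)
    return best_pair, best_err

# toy error surface whose minimum sits at season=6, window=20
pair, err = grid_search(range(1, 13), range(5, 41),
                        lambda s, w: abs(s - 6) + abs(w - 20))
```

Brute force is viable here because the grid is small; with more parameters, a smarter search strategy would be needed.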
The results of this work, together with the whitepaper, data, and workflows, have been made available in the KNIME whitepapers pool at http://www.knime.com/white-papers#IoT.
In this kind of cutting-edge problem, where very large amounts of data are generated, it is imperative to adopt a scalable approach that can grow together with the application. A scalable approach means not only handling bigger data faster, but also reaching out to new external data sources, integrating different complementary tools to refine the analytics with the newest emerging algorithms and techniques, and collaborating within the analyst team to exploit the group's collective competence. Only an open architecture can provide such a flexible environment, allowing the tool bench to expand and adapt in unpredictable ways (http://www.knime.org/open-for-innovation).
The Internet of Things is a very good example of the data explosion that is occurring in most fields, from social media to sensor-driven processes. But how much information can more data actually convey? This is of course highly dependent on the amount of intelligence we apply to it. Pure data plumbing and systematization do not generally produce more intelligent applications. Only the injection of data analytics algorithms from statistics and machine learning can make applications capable of learning and therefore smarter.