Customer segmentation has been one of the most widely implemented applications in data analytics since the birth of customer intelligence and CRM data.
The concept is easy. Group your customers together based on some criteria, such as revenue creation, loyalty, demographics, buying behavior, or any combination of these criteria and more.
A group (or segment) can be defined in several ways, depending on the degree of expertise and domain knowledge of the data scientist.
- Grouping by rules. Somebody in the company already knows how the system works and how the customers should be grouped with respect to a given task, for example a campaign. This approach is highly interpretable, but not very portable to new analyses: in the presence of a new goal, new knowledge, or new data, the whole rule system needs to be adjusted. A Rule Engine node suffices to implement the series of experience-based rules.
- Grouping as binning. Sometimes the goal is clear and not negotiable. One of the many features describing our customers is selected as the representative one, be it revenue, loyalty, demographics, or anything else. In this case, segmenting the customers into groups reduces to a pure binning operation: the groups are the bins built along the selected attribute or attributes. This task can be implemented easily with one of the many binner nodes available in KNIME Analytics Platform.
- Grouping with zero knowledge. Most of the time, though, it is safe to assume that the data scientist does not know enough about the business at hand to build their own customer segmentation rules. In this case, if no business analyst is available, they will resort to a plain blind clustering procedure. The follow-up work of interpreting the clusters belongs to a business analyst, who is (or should be) the domain expert.
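As a rough illustration outside KNIME, the first two approaches can be sketched in a few lines of plain Python. The field names (`revenue`, `years_as_customer`), the thresholds, and the segment labels are all invented for this example; they do not come from the telco data set or from any KNIME node.

```python
from bisect import bisect_right

# Hypothetical customer records; field names and values are made up.
customers = [
    {"id": "A", "revenue": 120.0, "years_as_customer": 1},
    {"id": "B", "revenue": 950.0, "years_as_customer": 6},
    {"id": "C", "revenue": 430.0, "years_as_customer": 3},
]

def segment_by_rules(customer):
    """Grouping by rules: hand-written conditions, first match wins."""
    if customer["revenue"] >= 800 and customer["years_as_customer"] >= 5:
        return "loyal high value"
    if customer["revenue"] >= 400:
        return "growth potential"
    return "low value"

# Grouping as binning: cut one attribute at fixed boundaries.
BIN_EDGES = [200.0, 500.0]            # assumed revenue boundaries
BIN_LABELS = ["low", "medium", "high"]

def segment_by_bin(customer):
    # bisect_right counts how many boundaries the revenue has reached,
    # which is exactly the index of its bin.
    return BIN_LABELS[bisect_right(BIN_EDGES, customer["revenue"])]

for c in customers:
    print(c["id"], segment_by_rules(c), segment_by_bin(c))
```

The rule function has to be rewritten whenever the goal changes, while the binning variant only needs new boundaries, which mirrors the portability trade-off described in the list.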
Available Clustering Algorithms
There are many clustering procedures and KNIME Analytics Platform makes most of them available under the category Analytics/Mining/Clustering in the Node Repository panel.
The most commonly known and used is the k-Means algorithm. The k-Means algorithm associates patterns according to the minimum distance criterion and calculates the prototypes of the new clusters as the average of all data points included. The distance used is, in most cases, the Euclidean distance and this generates spherical clusters around their prototypes. In KNIME Analytics Platform the k-Means clustering procedure is implemented by the k-Means node.
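The two alternating steps, assignment by minimum distance and prototype update by averaging, can be sketched in plain Python as follows. This is a minimal illustration of the textbook algorithm, not the code behind the k-Means node; the random initialization from data points, the `seed` parameter, and the toy data are assumptions made for the example.

```python
import math
import random

def k_means(points, k, iterations=100, seed=0):
    """Textbook k-Means with Euclidean distance (illustrative only)."""
    rng = random.Random(seed)
    # Assumed initialization: k data points drawn at random.
    centers = [list(p) for p in rng.sample(points, k)]
    labels = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: each point joins its nearest prototype.
        labels = [
            min(range(k), key=lambda j: math.dist(p, centers[j]))
            for p in points
        ]
        # Update step: each prototype becomes the mean of its members.
        new_centers = []
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:
                new_centers.append(
                    [sum(col) / len(members) for col in zip(*members)]
                )
            else:
                new_centers.append(centers[j])  # empty cluster keeps its center
        if new_centers == centers:
            break  # prototypes stopped moving
        centers = new_centers
    return centers, labels

# Toy data: two loose groups of 2-D points.
points = [(0.0, 0.0), (0.1, 0.2), (0.2, 0.1), (5.0, 5.0), (5.1, 5.2), (5.2, 5.1)]
centers, labels = k_means(points, k=2)
print(centers)
print(labels)
```

The result depends on the initial prototypes, so different seeds can produce different cluster numbering and, on less well separated data, different clusters.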
Other nodes are available to implement other clustering procedures, such as nearest neighbors, DBSCAN, hierarchical clustering, SOTA, etc. They all use different similarity measures and aggregation algorithms, and therefore end up with differently shaped clusters on the same data.
We are using the same telco data set used for the churn prediction use case (https://www.knime.org/knime-applications/churn-prediction), which consists of two files. One file has the contract data, while the other contains the operational (cell phone usage) data for each customer. Each record is uniquely identified by the combination of cellular number and area code. Both files are read, one by a File Reader and one by an Excel Reader node, and then joined together to obtain a summary record for each customer.
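The join on the composite key can be pictured with a small plain-Python sketch. The two lists below are hypothetical stand-ins for the two files, the column names (`area_code`, `phone`, `intl_plan`, `day_minutes`) are invented, and an inner join is assumed.

```python
# Hypothetical stand-ins for the two files; names and values are invented.
contracts = [
    {"area_code": "415", "phone": "555-0101", "intl_plan": "no"},
    {"area_code": "408", "phone": "555-0102", "intl_plan": "yes"},
]
usage = [
    {"area_code": "408", "phone": "555-0102", "day_minutes": 210.5},
    {"area_code": "415", "phone": "555-0101", "day_minutes": 95.0},
]

def customer_key(row):
    # A customer is identified by area code and phone number together.
    return (row["area_code"], row["phone"])

def inner_join(left, right):
    """Merge rows that share the same composite key (assumed inner join)."""
    right_by_key = {customer_key(r): r for r in right}
    joined = []
    for left_row in left:
        match = right_by_key.get(customer_key(left_row))
        if match is not None:
            joined.append({**left_row, **match})
    return joined

customers = inner_join(contracts, usage)
for row in customers:
    print(row)
```

Using the pair as the key matters: the same phone number could in principle appear under two different area codes.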
Clustering algorithms such as k-Means work on distances, so they take only numerical features into account. To include more features in the clustering or, conversely, to exclude some of them, we need to work on their type.
Converting string values into numbers, for example turning text judgements into scores, adds numerical features to the clustering procedure. Conversely, converting a number into a String excludes the attribute from the clustering procedure.
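Both type conversions are easy to picture in plain Python. The judgement labels and their scores below are an assumed ordinal scale chosen for illustration; the actual data set may use different values.

```python
# Assumed ordinal scale; labels and scores are illustrative only.
JUDGEMENT_TO_SCORE = {"poor": 1, "fair": 2, "good": 3, "excellent": 4}

def to_score(judgement):
    """Turn a text judgement into a number so it can enter a distance."""
    return JUDGEMENT_TO_SCORE[judgement.strip().lower()]

def to_label(number):
    """Turn a number into a string so a numeric-only clusterer skips it."""
    return str(number)

ratings = ["Good", "poor", " Excellent "]
print([to_score(r) for r in ratings])   # [3, 1, 4]
print(to_label(415))                    # 415, now a string
```

An identifier such as an area code is a typical candidate for the second conversion: it is stored as a number, but the distance between two area codes means nothing.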
Again, clustering algorithms are based on distances calculated across numerical data values. The range of the columns thus plays a major role in making one data column more influential than another. For example, data column “age” with range [1,100] will outweigh data column “score” with range [0,1]. To avoid that, all numerical data columns have to be normalized to fall into the same range, usually [0,1].
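A small worked example makes the effect visible. The sketch below applies min-max normalization, (x - min) / (max - min), to two made-up columns and compares Euclidean distances before and after; the values are invented for illustration.

```python
import math

def min_max_normalize(column):
    """Rescale a list of numbers linearly to the range [0, 1]."""
    lo, hi = min(column), max(column)
    if hi == lo:
        return [0.0 for _ in column]  # constant column carries no information
    return [(v - lo) / (hi - lo) for v in column]

ages = [20, 60, 100]        # illustrative values
scores = [0.1, 0.9, 0.1]

raw = list(zip(ages, scores))
scaled = list(zip(min_max_normalize(ages), min_max_normalize(scores)))

# Customer 1 differs from customer 0 in both age and score;
# customer 2 differs from customer 0 in age only.
print("raw:   ", math.dist(raw[0], raw[1]), math.dist(raw[0], raw[2]))
print("scaled:", math.dist(scaled[0], scaled[1]), math.dist(scaled[0], scaled[2]))
```

On the raw values the distances are almost entirely determined by age, so the large difference in score between customers 0 and 1 is barely noticeable. After normalization both columns contribute on the same scale.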
After string manipulation, conversion, discretization, and normalization, we apply the k-Means algorithm to produce a predefined number of data clusters. Let’s default to 10 clusters.
The basic workflow for customer segmentation consists of only 3 steps: data reading, data pre-processing, and k-Means clustering.
This workflow generates two data outputs: the cluster centers and the original data rows with cluster labels (see last column in Figure 2).
The next step would be to involve one or more business analysts to interpret the resulting clusters.
Note that you can always change the segmentation engine by substituting the k-Means node and its associated normalization/denormalization transformation with a Rule Engine or a Binner node.