General Information

This component solves one of the most common unsupervised machine learning problems: Data Segmentation, also known as Cluster Analysis. Unsupervised learning covers the algorithms that learn patterns from unlabeled data.

The primary goal of unsupervised machine learning is to discover previously unknown patterns in data and use them for:

detecting cases where the patterns are broken (anomaly detection),
dividing the data into groups such that objects within one group are significantly more similar to each other than to objects in other groups (data segmentation).

Unlike supervised learning, which requires labeled data for model training and validation, unsupervised learning has no ground truth or gold standard against which model quality can be assessed. Still, an understanding of the business domain and the nature of the data makes it possible to choose the most appropriate approach for pattern extraction and decision making.

Cluster Analysis

By definition, segmentation is the division of something into loosely connected parts. Data Segmentation has many business applications, from medicine to retail, because data mining techniques can be applied to pattern discovery and cluster analysis in each of these fields.

The cluster analysis task can be solved with various algorithms, which differ in:

The way similarity or distance between objects is measured. The choice of measure depends on the nature of the data and its interpretation. For instance, when objects are represented as arrays of numerical features, Euclidean or cosine distance may be used; when objects are characterized by a set of categorical attributes, Jaccard similarity may be used.
The way objects are combined into clusters (the clustering criterion). For instance, centroid-based clustering finds an optimal set of centroids and assigns each object to the nearest cluster center. In density-based clustering, by contrast, clusters are defined as areas of higher density than the remainder of the data set.
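
The three measures named above can be sketched in a few lines of Python. This is an illustrative sketch using only the standard library, not the component's actual implementation; the function names are assumptions chosen for this example.

```python
import math


def euclidean_distance(a, b):
    """Straight-line distance between two numerical feature vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def cosine_distance(a, b):
    """1 - cosine similarity; compares direction and ignores magnitude.

    Assumes neither vector is all zeros (the norm would be 0).
    """
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)


def jaccard_similarity(a, b):
    """Share of common values among all values in two categorical sets."""
    a, b = set(a), set(b)
    if not a and not b:
        return 1.0  # convention chosen here: two empty sets are identical
    return len(a & b) / len(a | b)


# Hypothetical customers described by numerical features
# (for example, visits per month and average basket value).
customer_1 = [3.0, 40.0]
customer_2 = [6.0, 80.0]
print(euclidean_distance(customer_1, customer_2))
print(cosine_distance(customer_1, customer_2))

# Hypothetical customers described by categorical attributes.
print(jaccard_similarity({"sports", "books"}, {"books", "music"}))
```

In this example the two numerical vectors point in the same direction, so their cosine distance is close to zero even though their Euclidean distance is large. This is one reason the choice of measure depends on how the data should be interpreted.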

The result of cluster analysis depends strongly on the data and is regulated through the algorithm's hyperparameters.

Data Segmentation Pipeline

Undoubtedly, cluster analysis, as part of the data mining discipline, requires an experienced data scientist who can combine domain knowledge with data mining experience to obtain results that have business value. Still, for most problems in the marketing domain, the Data Segmentation pipeline can be represented as the following process:

[Figure: Data Segmentation pipeline diagram]

The Data Preprocessing stage depends not only on the data but also on the clustering algorithm. K-Means is one of the most popular clustering algorithms, so it may be considered a general-purpose solution. This algorithm belongs to the centroid-based clustering group and iteratively searches for the optimal cluster centroids once the number of clusters is specified. The K-Means algorithm has the following advantages:

Scales to large data sets
Can warm-start the positions of centroids
Easily adapts to new examples
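
The iterative centroid search and the advantages listed above can be illustrated with a minimal K-Means sketch in plain Python. It is a simplified version of the standard assign-and-update loop, not the component's implementation; the names `k_means`, `nearest_centroid`, and `init_centroids`, as well as the sample data, are assumptions made for this example.

```python
import math
import random


def nearest_centroid(point, centroids):
    """Index of the centroid closest to `point` (Euclidean distance)."""
    return min(range(len(centroids)),
               key=lambda i: math.dist(point, centroids[i]))


def k_means(points, k, init_centroids=None, max_iter=100, seed=0):
    """Minimal K-Means: alternate assignment and centroid update.

    `init_centroids` allows a warm start from known positions;
    otherwise k distinct points are sampled at random as a start.
    """
    if init_centroids is not None:
        centroids = [list(c) for c in init_centroids]
    else:
        rng = random.Random(seed)
        centroids = [list(p) for p in rng.sample(points, k)]

    labels = None
    for _ in range(max_iter):
        # Assignment step: attach every point to its nearest centroid.
        new_labels = [nearest_centroid(p, centroids) for p in points]
        if new_labels == labels:
            break  # assignments stopped changing: converged
        labels = new_labels
        # Update step: move each centroid to the mean of its members.
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:  # an empty cluster keeps its previous centroid
                centroids[j] = [sum(col) / len(members)
                                for col in zip(*members)]
    return centroids, labels


# Hypothetical two-feature data with two visible groups.
points = [(1.0, 1.0), (1.5, 2.0), (1.0, 0.5),
          (8.0, 8.0), (9.0, 9.5), (8.5, 9.0)]

# Warm start: the initial centroid positions are supplied explicitly.
centroids, labels = k_means(points, k=2,
                            init_centroids=[(0.0, 0.0), (10.0, 10.0)])

# A new example is assigned to an existing cluster without refitting.
new_point_cluster = nearest_centroid((2.0, 1.5), centroids)

# When more data arrives, refitting can start from the previous centroids.
centroids_2, labels_2 = k_means(points + [(9.0, 8.0)], k=2,
                                init_centroids=centroids)
```

Each iteration only computes distances from every point to the k centroids, which is what lets the approach scale to large data sets. The number of clusters `k` remains a hyperparameter that has to be chosen in advance.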