General information

DBSCAN (Density-Based Spatial Clustering of Applications with Noise) is a popular unsupervised machine learning algorithm that groups together points that are close to each other and marks points in low-density regions as outliers.

Its main benefits are that it does not require the number of clusters to be specified in advance and that it can find arbitrarily shaped clusters.

The DBSCAN algorithm assigns every data point to one of 3 categories: core point, border point, or outlier. It generally takes two parameters: ‘epsilon’ (the maximum distance between two samples for one to be considered in the neighborhood of the other) and ‘minPoints’ (the number of samples in a neighborhood required for a point to be considered a core point, including the point itself). A data point that has at least ‘minPoints’ samples within ‘epsilon’ distance is a core point. A point that has fewer than ‘minPoints’ samples within ‘epsilon’ distance but lies in the neighborhood of a core point is a border point. Any other data point is an outlier. To measure the distance between data points, DBSCAN usually uses Euclidean distance, although other distance metrics can also be used.
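The three categories can be illustrated with a minimal sketch in plain Python. It is not the Brick's implementation: the function name `classify_points` and the sample coordinates are hypothetical, and Euclidean distance is assumed.

```python
from math import dist  # Euclidean distance, Python 3.8+


def classify_points(points, epsilon, min_points):
    """Return 'core', 'border', or 'outlier' for each point (illustrative only)."""
    n = len(points)
    # Neighborhood of each point: all points within epsilon, the point itself included.
    neighbors = [
        [j for j in range(n) if dist(points[i], points[j]) <= epsilon]
        for i in range(n)
    ]
    is_core = [len(nb) >= min_points for nb in neighbors]

    kinds = []
    for i in range(n):
        if is_core[i]:
            kinds.append("core")
        elif any(is_core[j] for j in neighbors[i]):
            # Too few neighbors to be core, but inside a core point's neighborhood.
            kinds.append("border")
        else:
            kinds.append("outlier")
    return kinds


# Hypothetical sample: a dense square, one point beside it, one point far away.
sample = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.1), (0.1, 0.1), (0.25, 0.1), (5.0, 5.0)]
print(classify_points(sample, epsilon=0.2, min_points=4))
```

In this sample, the four corner points each have four samples in their neighborhood, so they qualify as core points; the fifth point has only three but lies next to a core point, and the last point has no neighbors at all.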

Figure: example for minPoints=4

A cluster consists of core points that are reachable from one another through a chain of neighboring core points, together with all the border points that lie in the neighborhoods of these core points.
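One way to picture how clusters form is a breadth-first expansion from each unassigned core point. The sketch below is a simplified illustration under that assumption, not the Brick's actual code; `dbscan_sketch` and the sample data are hypothetical names and values, and Euclidean distance is assumed.

```python
from collections import deque
from math import dist  # Euclidean distance, Python 3.8+


def dbscan_sketch(points, epsilon, min_points):
    """Return one label per point: 0, 1, ... for clusters, -1 for outliers."""
    n = len(points)
    # Neighborhood of each point, including the point itself.
    neighbors = [
        [j for j in range(n) if dist(points[i], points[j]) <= epsilon]
        for i in range(n)
    ]
    is_core = [len(nb) >= min_points for nb in neighbors]

    labels = [-1] * n  # every point starts out as an outlier
    cluster_id = 0
    for start in range(n):
        if not is_core[start] or labels[start] != -1:
            continue
        # Grow a new cluster outward from an unassigned core point.
        labels[start] = cluster_id
        queue = deque([start])
        while queue:
            current = queue.popleft()
            for j in neighbors[current]:
                if labels[j] == -1:
                    labels[j] = cluster_id
                    # Only core points keep expanding; border points stop here.
                    if is_core[j]:
                        queue.append(j)
        cluster_id += 1
    return labels


# Hypothetical sample: two dense squares, one border point, one isolated point.
sample = [
    (0.0, 0.0), (0.1, 0.0), (0.0, 0.1), (0.1, 0.1), (0.25, 0.1),
    (2.0, 2.0), (2.1, 2.0), (2.0, 2.1), (2.1, 2.1),
    (5.0, 5.0),
]
print(dbscan_sketch(sample, epsilon=0.2, min_points=4))
```

A border point that is close to core points of two different clusters joins whichever cluster reaches it first, so its label can depend on the order of the input.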

Description

Brick Locations

Bricks → Machine Learning → DBSCAN Clustering

Brick Parameters

Brick Inputs/Outputs

Example of usage

Let’s try to segment data from the ‘segmentation_moons.csv’ dataset using the DBSCAN algorithm. The dataset consists of 3 columns: ‘Unnamed: 0’, ‘0’, and ‘1’.

We connect this dataset directly to the DBSCAN Clustering Brick, set ‘Epsilon’ to 0.1, and filter out the column ‘Unnamed: 0’, because it only holds the record index and does not represent a feature of the sample. After configuring the settings, we can run the pipeline and see the results in the Output section on the right sidebar. The output is the same dataset with an additional column, ‘predicted_cluster’, which takes the values -1, 0, and 1. This means that the data was segmented into 2 clusters (0 and 1), and the points labeled ‘-1’ are outliers.
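For readers who want to reproduce a comparable result outside the platform, here is a rough equivalent using scikit-learn. Several things are assumptions: the Brick is not confirmed to use scikit-learn internally, ‘segmentation_moons.csv’ is replaced by synthetic data from `make_moons` plus two hand-placed far-away points, and `min_samples=5` is chosen for illustration because the example above does not state the ‘minPoints’ setting.

```python
import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.datasets import make_moons

# Stand-in for the two feature columns '0' and '1' (synthetic, noise-free moons).
features, _ = make_moons(n_samples=200, noise=None, random_state=0)

# Two hand-placed points far from both moons, so that some outliers exist.
far_points = np.array([[3.0, 3.0], [-3.0, -3.0]])
features = np.vstack([features, far_points])

# eps plays the role of 'Epsilon'; min_samples=5 is an assumed 'minPoints' value.
model = DBSCAN(eps=0.1, min_samples=5)
predicted_cluster = model.fit_predict(features)

# Cluster labels start at 0; scikit-learn marks outliers with -1.
n_clusters = len(set(predicted_cluster.tolist()) - {-1})
n_outliers = int((predicted_cluster == -1).sum())
print("clusters:", n_clusters, "outliers:", n_outliers)
```

The `predicted_cluster` array here corresponds to the ‘predicted_cluster’ column in the Brick output: one label per row, with -1 reserved for outliers.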