Density Based Clustering

TL;DR

Groups points by regions of high local point density rather than by global shape assumptions.
Effective at finding clusters with irregular or elongated boundaries.
Can identify points that do not belong to any cluster as noise or outliers.

Definition

Density-based clustering is a method of clustering data points based on their density, or the number of points within a given area or region.

Explanation

Density-based clustering identifies clusters by locating regions in the data space where points are densely packed and separating those from regions of lower point density. This approach does not require clusters to have a specific geometric shape, making it suitable for elongated or irregular boundaries. Density-based algorithms can also mark points that do not belong to any dense region as noise, helping to detect outliers. Some algorithms in this family use parameters (such as a neighborhood radius or density thresholds) and distance-based measures to determine cluster membership.

Examples

DBSCAN (Density-Based Spatial Clustering of Applications with Noise)

DBSCAN uses a set of parameters to define the density of a cluster and to identify which points are considered to be part of a cluster. The algorithm works by starting with a seed point, which is a point that is considered to be part of a cluster, and then expanding the cluster by adding all points within a specified radius of the seed point. This process continues until all points within the cluster are connected, and the cluster is considered to be complete.

OPTICS (Ordering Points To Identify the Clustering Structure)

OPTICS uses a similar approach to DBSCAN but also incorporates a measure of reachability, or the distance between two points, to identify clusters of data points. This algorithm works by first creating an ordered list of data points, where each point is assigned a reachability distance based on its distance to other points in the dataset. The algorithm then identifies clusters by examining the reachability distances of each point and identifying points with a reachability distance below a specified threshold, which indicates that the point is part of a cluster.

Use cases

Identifying clusters that do not have a clear or well-defined shape (for example, elongated or irregular clusters).
Finding clusters within large datasets where other clustering methods may be less effective.
Detecting outliers or noise as points that do not belong to any cluster.

Notes or pitfalls

Density-based algorithms require setting parameters (for example, neighborhood radius, density thresholds, or reachability thresholds); these parameters determine which points are considered part of a cluster and therefore affect the resulting clustering.

DBSCAN
OPTICS