Cluster analysis

Cluster analysis is a technique for grouping similar observations into a number of clusters based on the observed values of several variables for each observation. Group membership is not known before the algorithm is run, so clustering is a method of unsupervised learning.

Clustering algorithms identify clusters based on a distance metric that measures how similar different observations in the dataset are to each other.

Two major distance metrics are:

Euclidean distance

This is the straight-line distance between two points \((X_1, Y_1)\) and \((X_2, Y_2)\): the hypotenuse of the right triangle whose legs \(A\) and \(B\) are the differences in the points' coordinates. Euclidean distance is sensitive to the scales of the variables being used and does not account for correlation between variables.

\begin{equation*} D(i,j) = \sqrt{A^2+B^2} = \sqrt{(X_{2}-X_{1})^2+(Y_{2}-Y_{1})^2} \end{equation*}
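
As a minimal sketch, the same computation in NumPy (the point coordinates are made up for illustration):

```python
import numpy as np

# Two illustrative points (X1, Y1) and (X2, Y2)
p1 = np.array([1.0, 2.0])
p2 = np.array([4.0, 6.0])

# Euclidean distance: sqrt((X2 - X1)^2 + (Y2 - Y1)^2)
d = np.sqrt(np.sum((p2 - p1) ** 2))
# Equivalent built-in: np.linalg.norm(p2 - p1)
print(d)  # 5.0
```
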
Mahalanobis distance

This takes covariances into account, leading to elliptic decision boundaries, as opposed to the circular boundaries of the Euclidean case. As a result, the scale and correlation problems of Euclidean distance are no longer an issue.

\begin{equation*} D = \sqrt{ (x-\mu)^\intercal C^{-1} (x-\mu)} \end{equation*}

where \(D\) is the Mahalanobis distance, \(x\) is the vector of observations, \(\mu\) is the vector of means of the independent variables, \(x-\mu\) is therefore the distance of the observation vector from the mean, and \(C\) is the covariance matrix of the independent variables.
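
A minimal sketch of the same formula using NumPy and SciPy (the sample data and the observation vector are illustrative only):

```python
import numpy as np
from scipy.spatial.distance import mahalanobis

# Illustrative data: rows are observations, columns are variables
rng = np.random.default_rng(0)
data = rng.normal(size=(100, 2))

mu = data.mean(axis=0)           # mean vector of the variables
C = np.cov(data, rowvar=False)   # covariance matrix C
C_inv = np.linalg.inv(C)         # C^{-1}

x = np.array([1.5, -0.5])        # an observation vector

# D = sqrt((x - mu)^T C^{-1} (x - mu))
d_manual = np.sqrt((x - mu) @ C_inv @ (x - mu))
d_scipy = mahalanobis(x, mu, C_inv)   # same quantity via SciPy
print(d_manual, d_scipy)
```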

Many clustering algorithms are available via scikit-learn's clustering module, and several of them are explained below.

K-means clustering

This is one of the simplest and fastest clustering algorithms.

Algorithm

  1. Define a set number of clusters/groups \(k\).

  2. Randomly initialize a set of \(k\) center points for the dataset, known as group centers.

  3. Calculate the distance between each data point and each group center.

  4. Assign each data point to the group whose center is closest to it.

  5. Using the assigned data points, recompute each group center as the mean of all vectors in the group.

  6. Repeat for a set number of iterations or until the group centers barely change between iterations (see the sketch after this list).
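
A minimal NumPy sketch of these steps (the synthetic data and the choice of \(k=3\) are for illustration only, and empty clusters are not handled as a real implementation would):

```python
import numpy as np
from sklearn.datasets import make_blobs

# Illustrative data: 300 points drawn around 3 centers
X, _ = make_blobs(n_samples=300, centers=3, random_state=42)

k = 3                                               # 1. set number of clusters
rng = np.random.default_rng(42)
centers = X[rng.choice(len(X), k, replace=False)]   # 2. random initial group centers

for _ in range(100):                                # 6. repeat for a set number of iterations
    # 3. distance from every point to every group center
    dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
    # 4. assign each point to its closest group center
    labels = dists.argmin(axis=1)
    # 5. recompute each center as the mean of its assigned points
    new_centers = np.array([X[labels == j].mean(axis=0) for j in range(k)])
    if np.allclose(new_centers, centers):           # 6. ...or until centers stop moving
        break
    centers = new_centers

print(centers)
```

In practice, scikit-learn's sklearn.cluster.KMeans handles initialization (k-means++), empty clusters, and convergence checks for you.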

Advantages

  1. Fast.

  2. Linear complexity.

  3. Great for spherical clusters.

Disadvantages

  1. Need to define the number of groups in the dataset.

  2. Results may be different during each run due to random initial set of group centers.

  3. Does not work well for elongated clusters or manifolds with irregular shapes.

  4. The Euclidean distance used is not a normalized metric, and distances become inflated in high-dimensional spaces; a dimensionality reduction method can be applied first, as sketched below.
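
A minimal sketch of that workaround, reducing the data with PCA before clustering (the number of components and clusters are illustrative assumptions):

```python
from sklearn.datasets import make_blobs
from sklearn.decomposition import PCA
from sklearn.cluster import KMeans
from sklearn.pipeline import make_pipeline

# Illustrative high-dimensional data: 3 blobs in 50 dimensions
X, _ = make_blobs(n_samples=500, centers=3, n_features=50, random_state=0)

# Reduce to a few components before computing Euclidean distances
pipeline = make_pipeline(PCA(n_components=5), KMeans(n_clusters=3, random_state=0))
labels = pipeline.fit_predict(X)
print(labels[:10])
```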

Hierarchical Agglomerative clustering (HAC)

Each data point starts as a single cluster, and the closest pairs of clusters are then successively merged (i.e., agglomerated) until all points belong to a single cluster. The hierarchy of clusters can be represented as a tree (or dendrogram). The root of the tree (at the top) gathers all the samples, while the leaves (at the bottom) are clusters containing only single samples.

Algorithm

  1. Start with each data point as a single cluster.

  2. Calculate a distance metric between every pair of clusters.

  3. Merge the two clusters with the smallest distance between them.

  4. Repeat until a single cluster containing all data points remains (sketched in code below).
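
A minimal sketch using SciPy's hierarchical clustering routines (Ward linkage and 3 flat clusters are illustrative choices):

```python
import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import linkage, dendrogram, fcluster
from sklearn.datasets import make_blobs

# Illustrative data
X, _ = make_blobs(n_samples=50, centers=3, random_state=1)

# Successively merge the closest pair of clusters (Ward linkage here)
Z = linkage(X, method="ward")

# Cut the tree to obtain flat cluster labels (3 clusters chosen for illustration)
labels = fcluster(Z, t=3, criterion="maxclust")
print(labels)

dendrogram(Z)   # root at the top, single-sample leaves at the bottom
plt.show()
```

scikit-learn's sklearn.cluster.AgglomerativeClustering provides the same kind of clustering directly, without the dendrogram plot.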

Advantages

  1. Not sensitive to the choice of distance metric being used.

  2. Able to extract a hierarchical structure (if the data has one).

  3. Does not require specification of the number of clusters.

Disadvantages

  1. Less efficient than K-means.

  2. Time complexity of \(O(n^3)\).

Mean-Shift clustering

This is a sliding-window-based algorithm that seeks out dense regions of data points. It is a centroid-based algorithm: it locates the center point of each cluster by repeatedly updating candidate center points to be the mean of the points inside the sliding window. Candidate windows are then filtered in a post-processing stage to eliminate near-duplicates, forming the final set of center points and their corresponding clusters.
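
A minimal scikit-learn sketch (the bandwidth estimation quantile is an illustrative assumption):

```python
from sklearn.cluster import MeanShift, estimate_bandwidth
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=300, centers=3, random_state=7)

# Estimate the sliding-window size (bandwidth) from the data
bandwidth = estimate_bandwidth(X, quantile=0.2)

ms = MeanShift(bandwidth=bandwidth, bin_seeding=True).fit(X)
print(ms.cluster_centers_)    # final set of center points
print(len(set(ms.labels_)))   # number of clusters found
```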

DBSCAN

Density-Based Spatial Clustering of Applications with Noise is a density-based clustering algorithm similar to mean-shift, with a few advantages: (i) it does not require a pre-set number of clusters, (ii) it identifies outlier points as noise, and (iii) it finds arbitrarily sized and arbitrarily shaped clusters well.
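
A minimal scikit-learn sketch (the eps and min_samples values are illustrative and normally need tuning to the data's density):

```python
import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.datasets import make_moons

# Illustrative data with an irregular (non-spherical) shape
X, _ = make_moons(n_samples=300, noise=0.05, random_state=0)

db = DBSCAN(eps=0.3, min_samples=5).fit(X)

labels = db.labels_
n_clusters = len(set(labels)) - (1 if -1 in labels else 0)  # label -1 marks noise/outliers
n_noise = int(np.sum(labels == -1))
print(n_clusters, n_noise)
```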

EM-GMM

Expectation-Maximization clustering using Gaussian Mixture Models is more flexible than K-means: it assumes data points are Gaussian distributed, so each cluster is described by a mean and a covariance rather than only a centroid. This means that clusters can take on any elliptical shape instead of being restricted to spheres around a centroid.
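
A minimal scikit-learn sketch (3 components and full covariances are illustrative choices):

```python
from sklearn.datasets import make_blobs
from sklearn.mixture import GaussianMixture

X, _ = make_blobs(n_samples=300, centers=3, random_state=3)

# Fit a mixture of 3 Gaussians; 'full' covariances allow elliptical clusters
gmm = GaussianMixture(n_components=3, covariance_type="full", random_state=3).fit(X)

labels = gmm.predict(X)        # hard cluster assignments
probs = gmm.predict_proba(X)   # soft (probabilistic) assignments
print(gmm.means_)              # fitted cluster means
```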
