Cluster analysis
Cluster analysis is a technique for grouping similar observations into a number of clusters based on the observed values of several variables for each observation. The group membership of the sample of observations is unknown before the algorithm is run, so clustering is a method of unsupervised learning.
Clustering algorithms identify clusters based on a distance metric that measures how similar different observations in the dataset are to each other.
Two major distance metrics are:
 Euclidean distance

This is the straight-line distance between two points \(X\) and \(Y\), i.e. the hypotenuse of the right triangle that connects them:

\[ d(X, Y) = \sqrt{\sum_{i=1}^{n} (x_i - y_i)^2} \]

Euclidean distance is sensitive to the scales of the variables being used, and is blind to correlations between variables.
 Mahalanobis distance

This takes covariances into account, leading to elliptical decision boundaries, as opposed to the circular boundaries of the Euclidean case. As such, the problems of scale and correlation in Euclidean distance are no longer an issue.

\[ D = \sqrt{(x - \mu)^T C^{-1} (x - \mu)} \]

where \(D\) is the Mahalanobis distance, \(x - \mu\) is the vector of distances of the observations from the means, \(C\) is the covariance matrix of the independent variables, \(x\) is the vector of observations, and \(\mu\) is the vector of mean values of the independent variables.
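As a quick illustration of the two metrics, here is a sketch using NumPy and SciPy (the sample data and test point are made up):

```python
import numpy as np
from scipy.spatial.distance import euclidean, mahalanobis

# Made-up sample: two correlated variables on different scales
rng = np.random.default_rng(0)
data = rng.multivariate_normal([0, 0], [[4.0, 1.5], [1.5, 1.0]], size=500)

x = np.array([2.0, 1.0])     # a test observation
mu = data.mean(axis=0)       # vector of variable means

# Euclidean distance ignores scale and correlation
d_euc = euclidean(x, mu)

# Mahalanobis distance whitens by the inverse covariance matrix C^-1
C_inv = np.linalg.inv(np.cov(data, rowvar=False))
d_mah = mahalanobis(x, mu, C_inv)
```

Note that `scipy.spatial.distance.mahalanobis` expects the *inverse* covariance matrix as its third argument.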
Several clustering algorithms (and more!) are available via scikit-learn clustering, and we explain several below.
K-means clustering
This is the simplest and fastest of the clustering algorithms.
Algorithm
Define a set number of clusters/groups.
Randomly initialize a set of center points for the dataset, known as group centers.
Calculate the distance between each data point and each group center.
Assign each data point to the group center closest to it.
Using these assignments, recompute each group center as the mean of all vectors in its group.
Repeat for a set number of iterations, or until the group centers change little between iterations.
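The steps above can be sketched in plain NumPy (a minimal illustration on made-up blob data, not a production implementation):

```python
import numpy as np

def kmeans(X, k, n_iter=100, seed=0):
    rng = np.random.default_rng(seed)
    # Steps 1-2: pick k random data points as the initial group centers
    centers = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iter):
        # Steps 3-4: assign each point to its nearest center
        dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Step 5: recompute each center as the mean of its assigned points
        # (keeping the old center if a group happens to be empty)
        new_centers = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centers[j]
            for j in range(k)
        ])
        # Step 6: stop once the centers barely move
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    return labels, centers

# Made-up data: three well-separated blobs
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(c, 0.3, size=(50, 2)) for c in ([0, 0], [5, 5], [0, 5])])
labels, centers = kmeans(X, k=3)
```

In practice `sklearn.cluster.KMeans` does the same job with smarter initialization (k-means++) and multiple restarts.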
Advantages
Fast.
Linear complexity.
Great for spherical clusters.
Disadvantages
Need to define the number of groups in the dataset.
Results may be different during each run due to random initial set of group centers.
Does not work well for elongated clusters or manifolds with irregular shapes.
The distance metric used is not normalized, and Euclidean distances can become inflated in high-dimensional spaces; applying a dimensionality reduction method first can help.
Hierarchical Agglomerative clustering (HAC)
Each data point starts as a single cluster, and pairs of clusters are then successively merged (i.e., agglomerated) until all points belong to a single cluster. The hierarchy of clusters can be represented as a tree (or dendrogram). The root of the tree (at the top) gathers all the samples, while the leaves (at the bottom) are clusters containing only single samples.
Algorithm
Each data point starts as its own cluster.
Calculate a distance metric between every pair of clusters.
Merge the pair of clusters with the smallest distance between them.
Repeat until a single cluster combines all data points.
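A minimal sketch of HAC using SciPy's hierarchical clustering routines (the two-blob data here is made up; Ward linkage is one of several merge criteria):

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

# Made-up data: two well-separated blobs
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(c, 0.3, size=(20, 2)) for c in ([0, 0], [4, 4])])

# Successively merge the closest pair of clusters (Ward linkage here);
# Z encodes the full merge tree that a dendrogram would draw
Z = linkage(X, method="ward")

# Cut the tree to recover flat clusters; note that no cluster count
# was needed to build the hierarchy itself
labels = fcluster(Z, t=2, criterion="maxclust")
```

`scipy.cluster.hierarchy.dendrogram(Z)` would plot the tree described in the text above.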
Advantages
Not sensitive to the choice of distance metric being used.
Able to extract a hierarchical structure (if the data has one).
Does not require specification of the number of clusters.
Disadvantages
Lower efficiency.
Time complexity of \(O(n^3)\).
Mean-Shift clustering
This is a sliding-window based algorithm that seeks to find dense clusters of data points. It is a centroid-based algorithm, thus it locates the center points of each cluster by updating the candidates for center points to be the mean of the points in the sliding window. Candidate windows are filtered in the post-processing stage to eliminate near-duplicates, forming a final set of center points and their corresponding clusters.
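A rough sketch using scikit-learn's MeanShift (the sample data and bandwidth settings here are illustrative assumptions):

```python
import numpy as np
from sklearn.cluster import MeanShift, estimate_bandwidth

# Made-up data: two dense blobs
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(c, 0.4, size=(60, 2)) for c in ([0, 0], [5, 5])])

# bandwidth sets the sliding-window radius;
# estimate_bandwidth guesses a reasonable value from the data
bw = estimate_bandwidth(X, quantile=0.3)
ms = MeanShift(bandwidth=bw).fit(X)

# The number of clusters falls out of the procedure rather than being preset
print(len(ms.cluster_centers_))
```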
DBSCAN
Density-Based Spatial Clustering of Applications with Noise is a density-based clustering algorithm similar to mean-shift, with a few advantages: (i) it does not require a preset number of clusters; (ii) it identifies outlier points as noise; and (iii) it finds arbitrarily sized and shaped clusters well.
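A minimal DBSCAN sketch with scikit-learn (the data, `eps`, and `min_samples` values are illustrative assumptions):

```python
import numpy as np
from sklearn.cluster import DBSCAN

# Made-up data: two dense blobs plus one isolated outlier
rng = np.random.default_rng(0)
X = np.vstack([
    rng.normal([0, 0], 0.2, size=(50, 2)),
    rng.normal([4, 4], 0.2, size=(50, 2)),
    [[10.0, -10.0]],          # an isolated outlier
])

# eps is the neighbourhood radius, min_samples the density threshold;
# no cluster count is given up front
db = DBSCAN(eps=0.5, min_samples=5).fit(X)

# Outlier points receive the label -1 instead of being forced into a cluster
```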
EM-GMM
Expectation-Maximization clustering using Gaussian Mixture Models is more flexible than K-means: it assumes the data points are Gaussian distributed, so each cluster is described by a mean and a covariance rather than just a centroid. This means a cluster can take any elliptical shape, instead of being restricted to spheres around centroids.
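A short sketch with scikit-learn's GaussianMixture (the elongated sample data is made up to show that elliptical clusters are recovered):

```python
import numpy as np
from sklearn.mixture import GaussianMixture

# Made-up data: one elongated (elliptical) cluster plus one round one
rng = np.random.default_rng(0)
X = np.vstack([
    rng.multivariate_normal([0, 0], [[3.0, 2.5], [2.5, 3.0]], size=100),
    rng.normal([8, -8], 0.5, size=(100, 2)),
])

# Each component is a full Gaussian (mean + covariance),
# so the fitted clusters can be elliptical
gmm = GaussianMixture(n_components=2, covariance_type="full", random_state=0).fit(X)
labels = gmm.predict(X)
```

`gmm.means_` and `gmm.covariances_` hold the fitted parameters that define each elliptical cluster.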