Theory, Examples & Exercises
Section: Numerical classification
Cluster analysis is a hierarchical and agglomerative method of numerical classification, creating hierarchically nested groups (clusters) of samples by agglomerating individual samples into larger clusters. Two main choices need to be done: select the distance metric, which measures dissimilarity or similarity (in terms of species composition) between samples, and clustering algorithm, which searches for boundaries among groups of samples. For types of distances or similarities, consult Ecological resemblance. Overview of most often used clustering algorithms is below. Note that not all distances/similarities can be paired with all clustering algorithms. For example, Ward’s method requires distances which are metric (i.e. can be projected into Euclidean space, with which the Ward algorithm is operating), and therefore non-metric distances (like Bray-Curtis) cannot be directly used (but some can be converted into metric; e.g., Bray-Curtis distance by square-root transformation).
Clustering algorithm represents a set of rules which decide to which group (cluster) will be the sample assigned. I like to imagine this problem as an analogy to me coming to the pub full of people (samples) and making a decision to which table (group, cluster) I will sit. Some of the people are my good friends (high similarity. low distance), some I really don’t like (low similarity, high distance). Sitting at the table with people I do not like will perhaps ruin the night, so I need to choose carefully. I will call this “where to sit in the pub” analogy below.
Results or hierarchical cluster analysis is often displayed as a dendrogram, which represents the hierarchy on which samples and/or clusters are connected together. The orientation of branches around nodes is arbitrary (it does not matter whether the samples are right or left, Fig. 1).
In the single linkage algorithm, samples join the group in which is the sample the most similar to them. In the “where to sit in the pub” analogy, this is like joining the table where sits my best friend (highest similarity, lowest distance), without considering the other people sitting with him/her. The example of how this algorithm works is on Fig. 2 (from Legendre & Legendre 2012) and example of resulting dendrogram on Fig. 3 (left panel).
The sample joins the group in which the furthest sample is the most similar. In the “where to sit in the pub” analogy, I will join the table in which the person I don’t like the most is still not that bad (each of other tables has someone I don’t like even more). Example of dendrogram is on Fig. 3 (central panel).
The sample joins the group to which it has the lowest average distance. In the “where to sit in the pub” analogy, I will take table by table and average how much I like people sitting at that table; then, I join the table with the highest average (in average I like them most). The average linkage includes a range of methods in between single and complete linkage, and in ecology, these are the most useful (e.g. UPGMA, unweighted pair group method with arithmetic mean). Example dendrogram is on Fig. 3 (right panel).
Ward’s clustering method is based on minimizing increases in the total error sum of squares, which is defined as the sum of squared distances of each sample to the centroid of its group. Ward’s method searches for the solution, which merges two groups of samples (clusters) with the lowest overall increase in the error sum of squares. In the beginning, each sample is forming an independent cluster, and the error sum of squares is equal to zero. In each step, the method merges a sample with the sample, sample with cluster or cluster with the cluster, and the error sum of squares is increasing. Since the distances of samples to the centroids are calculated in Euclidean space, only metric (Euclidean) distances can be used in this method; the use of non-metric distances (e.g. Bray-Curtis) is possible, but these need to be either converted into metric ones (e.g. by square-rooting in the case of Bray-Curtis dissimilarity, Fig. 4), or one has to acknowledge that the results may not follow the underlying principles of minimizing within-group sum of squares.
The total error sum of squares, used in Ward’s and K-means clustering algorithms (Fig. 5): for each cluster k, it calculates the sum of square distances of individual samples (yij(k)) to the centroid of the cluster (mj(k); left formula), or, alternatively, as the mean squared distance among all pairs of samples (Dhi2) within the cluster (right formula; both calculations yield the same result). These within-cluster error sums of squares are summed across all K clusters (formula at the bottom).
Flexible clustering algorithm, which allows influencing the level of chaining by modifying the parameter β (β ≥ -1 and β < 1). With β = -1 the chaining is the lowest, with beta approaching +1 the chaining is highest (similar to simple linkage method, Fig. 6). Optimal representation of distances among samples is for β = -0.25 (results similar to Ward’s algorithm). The method can be used with both metric and non-metric coefficients of similarity/dissimilarity.
In most clustering algorithms based on quantitative similarity/dissimilarity coefficient, the transformation of species composition data (e.g. logarithmic) can considerably influence resulting dendrogram (like on Fig. 7 in the case of single linkage algorithm applied on Bray-Curtis distances; Danube dataset). This does not apply if the original species composition data are presence-absence (in that case, the transformation does not change the relative distribution of the values).