User Tools

Site Tools


en:hier-agglom_examples

Section: Numerical classification

Cluster analysis (hierarchical agglomerative classification)

Example 1: Comparison of hclust and agnes using Vltava river valley data

This exercise compares two main clustering functions, hclust and agnes (from library cluster), shows how they work and how they differ in use.

hclust

Requires two arguments: d - matrix of distances among samples (see Ecological resemblance), and method - name of clustering algorithm. Clustering algorithm are principally of three types: single linkage, complete linkage and average linkage - the third one is the one most often used in ecology (average linkage includes also popular Ward method or beta flexible). Single linkage method produce chained dendrograms, complete linkage dendrograms look a bit like rakes - some clusters merge together at the highest dissimilarity. Ward method is favourite, as it produces nicely compact clusters - note, however, that Ward method should not be combined with distance measures, which are not strictly metric, which is e.g. popular Bray-Curtis distance. Beta flexible method is popular, because setting beta coefficient allows one to modify the chaining of the dendrogram (beta flexible is not available in hclust, but in function agnes in the packages cluster).

vltava.spe <- read.delim ('https://raw.githubusercontent.com/zdealveindy/anadat-r/master/data/vltava-spe.txt', row.names = 1)
library (vegan)
dis <- vegdist (sqrt (vltava.spe), method = 'bray') # percentage cover data are transformed by square root
cluster.single <- hclust (d = dis, method = 'single')
cluster.complete <- hclust (dis, 'complete')
cluster.average <- hclust (dis, 'average')

plot.hclust

Draws the dendrogram.

par (mfrow = c (1,3)) # will draw all dendrogram into one figure
 
plot (cluster.single, main = 'Single linkage')
plot (cluster.complete, main = 'Complete linkage')
plot (cluster.average, main = 'Average linkage')

rect.hclust

Divides dendrogram into given number of groups (or in specified height, respectively)1).

plot (cluster.average, main = 'Average linkage')
rect.hclust (cluster.average, k = 3)
rect.hclust (cluster.average, k = 8, border = 'blue') # argument border specifies the colour of the rectangle

cutree

Returns a vector, containing assignment of samples into clusters.

clusters <- cutree (cluster.average, k = 5)
clusters
 1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 
 1  1  1  1  2  2  3  3  1  1  2  1  4  4  2  3  3  3  1  3  1  2  3 
24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 
 1  1  3  3  3  3  2  2  4  4  4  4  4  1  1  1  1  1  1  3  2  1  1 
47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 
 1  3  2  3  3  3  3  3  2  3  3  3  3  3  3  3  3  2  3  3  3  3  3 
70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 
 3  3  3  3  3  3  1  1  2  3  3  3  2  3  3  1  2  1  1  1  3  1  3 
93 94 95 96 97 
 3  5  1  1  5 

agnes (library cluster)

Includes six clustering algorithms, some not included in function hclust. Important and interesting is the method flexible, known as beta flexible, which is often used for ecological data. Setting parameter beta influences the chaining of the dendrogram (for beta ~ +1 the chaining is maximal and result is similar to method single linkage, for beta = -1 is result similar to complete linkage). In function agnes the setting beta parameter is, however, not so simple: you need to set up one to four parameters (see ?agnes for details). The simplest options is to assign to the argument par.method only one value, so called alpha, while it applies: beta = 1 - 2*alpha, the value for alpha can be thus calculated as alpha = (1-beta)/2. If, for example, I want to calculate beta flexible method with beta = -0.25 (which is said to optimally represent distances among samples), than alpha = (1-(-0.25))/2 = 1.25/2 = 0.625.

library (cluster)
cluster.flexible <- agnes (x = dis, method = 'flexible', par.method = 0.625)

Use plot function to draw the dendrogram (the function plots several figures, you need to click into the figure to move forward). If you want to use plot.hclust function, transform the agnes object into hclust object using as.hclust function.

cluster.flexible.hclust <- as.hclust (cluster.flexible)
plot (cluster.flexible.hclust)

Example 2: Ward cluster algorithm applied on Barro Colorado Island data

In this example, we will use BCI dataset (data from tropical forest permanent plot) to conduct agglomerative cluster analysis, combining Ward's cluster algorithm with Bray-Curtis distance method. We will display resulting dendrogram with five distinguished vegetation groups, project their spatial configuration and also their composition differences in ordination diagram based on the same distance measure (NMDS using Bray-Curtis distances) and different distance measure (DCA based on chi-square distances).

First, let's load the data and log transform them (original data contain numbers of individuals):

library (vegan)
data (BCI)  # example using Baro Colorado data
BCI.log <- log1p (BCI)  # first, log transform species data, which contains numbers of individuals
 

Function vegdist calculates distance matrix based on Bray-Curtis distances:

bc.dist <- vegdist (BCI.log, method = 'bray')
bc.dist
 
           1         2         3         4         5         6         7         8         9
2  0.2561624                                                                                
3  0.2955959 0.2749839                                                                      
4  0.3152203 0.2937687 0.2853559                                                            
5  0.3143658 0.3201234 0.3012306 0.2827513                                                  
6  0.3098983 0.3210060 0.3413611 0.3439200 0.3427188                                        
7  0.3002090 0.3015008 0.3206233 0.3309095 0.3647759 0.2671652                              
8  0.3188611 0.2869892 0.2763554 0.2994629 0.3063330 0.3584762 0.3432045                    
9  0.3511634 0.3116517 0.2886386 0.3294022 0.3157618 0.3580870 0.3265907 0.2696480          
10 0.3531552 0.3146364 0.2758954 0.2928692 0.3166512 0.3672627 0.3865676 0.2989141 0.2831257
...

Btw, if you type bc.dist, you actually use the function print.dist2), which has several useful arguments (?print.dist). For example, if we want to print also diagonal values (zeros in this case, since we display distances, not similarites), we may use:

print (bc.dist, diag = TRUE)
 
           1         2         3         4         5         6         7         8         9
1  0.0000000                                                                                
2  0.2561624 0.0000000                                                                      
3  0.2955959 0.2749839 0.0000000                                                            
4  0.3152203 0.2937687 0.2853559 0.0000000                                                  
5  0.3143658 0.3201234 0.3012306 0.2827513 0.0000000                                        
6  0.3098983 0.3210060 0.3413611 0.3439200 0.3427188 0.0000000                              
7  0.3002090 0.3015008 0.3206233 0.3309095 0.3647759 0.2671652 0.0000000                    
8  0.3188611 0.2869892 0.2763554 0.2994629 0.3063330 0.3584762 0.3432045 0.0000000          
9  0.3511634 0.3116517 0.2886386 0.3294022 0.3157618 0.3580870 0.3265907 0.2696480 0.0000000
...

The Ward's algorithm is available in both the base function hclust and also the function agnes from the package cluster; the implementation, however, slightly differs. Murtagh & Legendre (2014) pointed out that in literature and software applications there are two versions of Ward's cluster algorithm, with only one being the correct implementation of the method originally published by Ward. The method ward in the function agnes contains the correct implementation, while default method ward in hclust does not (newly, this method is equal to ward.D); if hclust should calculate the correct Ward, the argument method need to be changed into ward.D2. Check ?hclust to see the difference between both algorithm, and optionally search for more details in Murtagh & Legendre (2014).

Here, we will use agnes function from cluster package, with method set to ward. Since Ward algorithm requires distances to be metric (i.e. possible to display in metric Euclidean space) and Bray-Curtis is not metric, instead of raw Bray-Curtis distances I use their square-root transformation (see Ecological resemblance > Distance indices for details why).

#install.packages ('cluster') # install if necessary
library (cluster)
clust <- agnes (sqrt (bc.dist), method = 'ward') # calculate Ward's algorithm 
# on square-rooted Bray-Curtis distances
 

Resulting cluster dendrogram can be plotted using the function plot. Note that the object clust is of the class agnes, so plot.agnes will be used in case that generic plot function is applied on it; in case that it is the result of hclust, the function plot.hclust would be used. The function plot.agnes by default plots two figures and works in interactive mode (needs to hit enter in order to proceed to the next one (the second one is the dendrogram). By adding argument which.plot = 2, only the dendrogram will be plotted (for meaning of which.plot, check ?plot.agnes).

plot (clust, which.plot = 2)

However, since below we will be using functions which are designed to be used for the objects of class hclust, we will transform the object clust into the class hclust by function as.hclust, and then plot it (in this case, R will use plot.hclust function):

clust.hclust <- as.hclust (clust)
plot (clust.hclust)

(I will not include the figure of the dendrogram below the code, since it looks identical to the one above, just with different labels).

To cut the tree into few clusters, we use the function cutree (sounds like to cut tree, but with only a single t) can cut the dendrogram according to predefined criteria, either into given number of clusters (argument k), or according to given level of similarity/dissimilarity (argument h). Here, we will cut the dendrogram into five groups of samples:

groups <- cutree (clust, k = 5)
groups 
 
 [1] 1 1 1 1 1 1 1 1 1 1 2 3 4 1 1 2 3 4 1 1 2 3 4 1 1 2 3 3 3 3 2 3 3 3
[35] 3 2 2 3 3 3 2 2 2 2 2 5 5 2 2 2

The values in the vector are ordered according to the order of samples in the species composition matrix, and reflect the assignment of each sample to the cluster (1 to 5). The first value is always cluster 1.

Now, we will display these clusters onto the cluster dendrogram produced above by function plot. The issue here is that in the dendrogram, the order of clusters is always from left to right (i.e. cluster 1 is the first one on the left), but this may not fit the results of the function clustree, which always assigns the first sample in the matrix into cluster 1. That's why we need to figure out which sample on the dendrogram belongs to which cluster. For this, we need to look into the object generated by clustering function (agnes here) into the order of samples in the dendrogram. This is easiest done in the object of the class hclust, which contains complement $order:

clust.hclust$order
 [1]  1  2  6  7 14 19 24 25  3 10  8  9  4  5 15 20 11 16 21 26 31 36 37 42 41 43 44 45 48
[30] 49 50 12 17 22 29 28 27 32 33 30 35 34 38 39 40 13 18 23 46 47

As you see when comparing this order with the dendrogram above, it fits exactly the way how samples are sorted in the dendrogram from left to right (starts with sample 1 and ends with sample 47). Now, we need to know which of these samples belong to which cluster according to the function cutree (we have an object groups greater above, and will use it for that):

group.order <- groups[clust.hclust$order]
group.order
 [1] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 2 2 2 2 2 2 2 2 2 5 5 5 5 5 5 3 3 3 3 3 3 3 3 3 3 3 3 3
[45] 3 4 4 4 4 4

In this way, we get a vector of samples in the order of how they are displayed on the ordination diagram, and their assignments into the group. You can see that the order of clusters is 1, 2, 5, 3 and 4. We apply function unique to get this vector, and use it for colouring the groups in the cluster dendrogram later:

group.in.cluster <- unique (group.order)
group.in.cluster
[1] 1 2 5 3 4

To draw clusters onto the dendrogram, we use function rect.hclust. Make sure that you apply it on the object of the class hclust only (if you haven't transformed the output of agnes using as.hclust function, make sure to do that!). The function rect.hclust is a low-level graphical function, adding output onto existing dendrogram, so I draw the dendrogram again first, and then use this function. Also, I add the legend to the dendrogram to know which colour represents which cluster:

plot (clust.hclust)
rect.hclust (clust.hclust, border = group.in.cluster, k = 5) 
legend ('topleft', legend = paste ('Cluster', 1:5), pch = 22, col = 1:5, bty = 'n')

The next step is to draw the spatial distribution of plots classified into four clusters. For this, we need spatial coordinates of each plot, which are in the dataset BCI.env as coordinates UTM.NS (latitude) and UTM.EW (longitude):

BCI.env <- read.delim ('https://raw.githubusercontent.com/zdealveindy/anadat-r/master/data/bci.env.txt')
plot (UTM.NS ~ UTM.EW, data = BCI.env, pch = groups, cex = 3)  # this is the simple version, only with symbols differentiating individual groups
plot (UTM.NS ~ UTM.EW, data = BCI.env, pch = groups+20, cex = 3, bg = groups, col = 'white') # this is more "colorful" option. Note that symbols now are 21 to 25, arguments 'bg' and 'col' which 
 

Result of cluster analysis can be displayed also onto the ordination diagram of unconstrained ordination, to asses how well are individual groups compositionally separated from each other. Here we will use two contrasting unconstrained ordination analyses: one which is based on the same distance measure as is the cluster anlaysis (NMDS with Bray-Curtis distance), one which is not (DCA, whose mother method, CA, is based on chi-square distance). The point of this selection (NMDS vs DCA) is to show that to evaluate whether cluster analysis results into well defined clusters, one needs to use the ordination method which is based on the same distance metric as the cluster analysis (Bray-Curtis distance in this example).

# First, NMDS with Bray-Curtis distances
NMDS <- metaMDS (vegdist (BCI.log))  # note that I could use also "NMDS <- metaMDS (bc.dist)" here
par (mfrow = c(1,2)) # I want to plot both plots into one figure, with two panels in one row
ordiplot (NMDS, type = 'n')
points (NMDS, pch = groups, col = groups)
legend ('topright', pch = 1:5, col = 1:5, legend = 1:5, bty = 'n')
 
# Second, DCA ordination (implicitly using chi-square distance)
DCA <- decorana (BCI.log)
ordiplot (DCA, type = 'n', display = 'si')
points (DCA, pch = groups, col = groups)
legend ('topright', pch = 1:5, col = 1:5, legend = 1:5, bty = 'n')

The difference in the distribution of individual groups between ordinations is not too big in case of this dataset. But remember that the best for projecting results of cluster analysis is to use the same distance measure (here Bray-Curtis) for both cluster analysis and ordination.

1)
For alternative way, how to distinguish the clusters by color, see function ColorDendrogram in package sparcl.
2)
When applied, the script contains only print function, which is a generic function choosing the specialized function by asking the class of the object on which it is applied; in case of bc.dist, the object is of the class dist, so the generic function print will choose function print.dist to proceed. Check ?print.dist to see the helpfile for this function.
en/hier-agglom_examples.txt · Last modified: 2021/05/20 08:21 by David Zelený