en:similarity

Differences

This shows you the differences between two versions of the page.

en:similarity [2020/03/25 23:15]
David Zelený
en:similarity [2020/03/30 11:44] (current)
David Zelený [Matrix of similarities/distances]
Line 11: Line 11:
 Intuitively, one thinks about **similarity** among objects: the more similar two objects are in terms of their properties, the higher their similarity. In the case of species composition data, the similarity is calculated using similarity indices, ranging from 0 (the samples do not share any species) to 1 (samples have identical species composition). Ordination techniques are usually based on distances, because they need to localize the samples in a multidimensional space; clustering methods can usually handle both similarities and distances. **Distances** are of two types, either dissimilarities, converted from analogous similarity indices, or specific distance measures, such as Euclidean, which do not have a counterpart in any similarity index. While all similarity indices can be converted into distances, not all distances can be converted into similarities (as is true e.g. for Euclidean distance).
  
-There is a number of measures of similarities or distances ([[references|Legendre & Legendre 2012]] list around 30 of them). The first decision one has to make is whether the aim is R- or Q-mode analysis (R-mode focuses on differences among species, Q-mode on differences among samples), since some of the measures differ between both modes (e.g. Pearson'​s //r// correlation coefficient makes sense for association between species (R-mode), but not for association between samples (Q-mode); in contrast, e.g. Sørensen index can be used in both Q- and R-mode analysis, called Dice index in R-mode analysis). Further, if focusing on differences between samples (Q-mode), the most relevant measures in ecology are asymmetric indices ignoring double zeros (more about //​double-zero problem// below). Then, it also depends whether the data are qualitative (i.e. binary, presence-absence) or quantitative (species abundances). In the case of distance indices, an important criterium is whether they are metric (they can be displayed in Euclidean space) or not, since this influences the choice of the index for some ordination or clustering methods.+There is a number of measures of similarities or distances ([[references|Legendre & Legendre 2012]] list around 30 of them). The first decision one has to make is whether the aim is R- or Q-mode analysis (R-mode focuses on differences among species, Q-mode on differences among samples), since some of the measures differ between both modes (e.g. Pearson'​s //r// correlation coefficient makes sense for the association between species (R-mode), but not for the association between samples (Q-mode); in contrast, e.g. Sørensen index can be used in both Q- and R-mode analysis, ​and is called Dice index in R-mode analysis). Further, if focusing on differences between samples (Q-mode), the most relevant measures in ecology are asymmetric indices ignoring double zeros (more about //​double-zero problem// below). 
Then, it also depends whether the data are qualitative (i.e. binary, presence-absence) or quantitative (species abundances). In the case of distance indices, an important criterium is whether they are metric (they can be displayed in Euclidean space) or not, since this influences the choice of the index for some ordination or clustering methods.
  
 [[references|Legendre & Legendre (2012)]] offer a key for selecting an appropriate measure for given data and problem (check their Tables 7.4-7.6). Generally, as a rule of thumb, Bray-Curtis and Hellinger distances are better choices than Euclidean or Chi-square distances.
Line 17: Line 17:
 ===== Double-zero problem =====
  
-"​Double zero" is a situation when certain species is missing in both compared community samples for which similarity/​distance is calculated. Species missing simultaneously in two samples can mean the following: (1) samples are located outside of the species ecological niche, but one cannot say whether both samples are on the same side of the ecological gradient (i.e. they can be rather ecologically similar, samples A and B on <imgref double-zero-curve>​) or they are on the opposite sides (and hence very different, samples A and C). Alternatively,​ (2) samples are located inside species ecological niche (samples D and E), but the species in given samples does not occur since it didn’t get there (dispersal limitation),​ or it was present, but overlooked and not sampled (sampling bias). In both cases, the double zero represents missing information,​ which cannot offer an insight into the ecology ​of compared samples.+"​Double zero" is a situation when certain species is missing in both compared community samples for which similarity/​distance is calculated. Species missing simultaneously in two samples can mean the following: (1) samples are located outside of the species ecological niche, but one cannot say whether both samples are on the same side of the ecological gradient (i.e. they can be rather ecologically similar, samples A and B on <imgref double-zero-curve>​) or they are on the opposite sides (and hence very different, samples A and C). Alternatively,​ (2) samples are located inside species ecological niche (samples D and E), but the species in given samples does not occur since it didn’t get there (dispersal limitation),​ or it was present, but overlooked and not sampled (sampling bias). In both cases, the double zero represents missing information,​ which cannot offer an insight into the ecological similarity ​of compared samples.
  
 <​imgcaption double-zero-curve |Response curve of a single species along environmental gradient; A, B..., E are samples located within or outside the species niche.>​{{ :​obrazky:​double-zero-illustration.png?​direct&​400|}}</​imgcaption>​ <​imgcaption double-zero-curve |Response curve of a single species along environmental gradient; A, B..., E are samples located within or outside the species niche.>​{{ :​obrazky:​double-zero-illustration.png?​direct&​400|}}</​imgcaption>​
  
-Similarity ​indices differ in a way how they approach the double-zero problem. **Symmetrical indices** treat double zero (0-0) in the same way (symmetrically) as double presences (1-1), i.e. as a reason to consider samples similar. This is not usually meaningful for species composition data (as explained above) but could be meaningful e.g. for multivariate data containing chemical measurement (the fact that, e.g., heavy metals are missing in both samples could really indicate similarity between both samples). **Asymmetrical indices** treat double zeros and double presences asymmetrically - they ignore double zeros, and focus only on double presences when evaluating the similarity of samples; these indices are usually more meaningful for species composition data.+Both similarity and distance ​indices differ in a way how they approach the double-zero problem. **Symmetrical indices** treat double zero (0-0) in the same way (symmetrically) as double presences (1-1), i.e. as a reason to consider samples similar. This is usually ​not meaningful for species composition data (as explained above) but could be meaningful e.g. for multivariate data containing chemical measurement (for example, ​the fact that heavy metals are missing in both samples could really indicate similarity between both samples). **Asymmetrical indices** treat double zeros and double presences asymmetrically - they ignore double zeros, and focus only on double presences when evaluating the similarity of samples; these indices are usually more meaningful for species composition data.
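
The contrast between the two groups of indices can be illustrated with a short Python sketch. It uses the Jaccard index as the asymmetrical example and, as an assumption not named in the text above, the simple matching coefficient (a+d)/(a+b+c+d) as a typical symmetrical index. The function names and the toy samples are hypothetical.

```python
def count_fractions(sample1, sample2):
    """Return (a, b, c, d) for two presence-absence vectors of equal length."""
    a = b = c = d = 0
    for x, y in zip(sample1, sample2):
        if x and y:
            a += 1      # species present in both samples (double presence)
        elif x:
            b += 1      # species present only in the first sample
        elif y:
            c += 1      # species present only in the second sample
        else:
            d += 1      # species absent from both samples (double zero)
    return a, b, c, d


def simple_matching(sample1, sample2):
    """Symmetrical index (assumed example): double zeros count as agreement."""
    a, b, c, d = count_fractions(sample1, sample2)
    return (a + d) / (a + b + c + d)


def jaccard(sample1, sample2):
    """Asymmetrical index: double zeros are ignored (samples must not both be empty)."""
    a, b, c, _ = count_fractions(sample1, sample2)
    return a / (a + b + c)


# Two toy samples described by four species.
s1 = [1, 1, 0, 0]
s2 = [1, 0, 1, 0]

# The same samples after adding six species that occur in neither of them.
s1_padded = s1 + [0] * 6
s2_padded = s2 + [0] * 6

print(simple_matching(s1, s2), simple_matching(s1_padded, s2_padded))
print(jaccard(s1, s2), jaccard(s1_padded, s2_padded))
```

In this sketch the simple matching coefficient rises from 0.5 to 0.8 once the jointly absent species are added, while the Jaccard index stays at 1/3, because the added double zeros never enter its formula.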
  
 <​imgcaption double-zero-table |For details see the text.>{{ :​obrazky:​double-zero-table.jpg?​direct&​400|}}</​imgcaption>​ <​imgcaption double-zero-table |For details see the text.>{{ :​obrazky:​double-zero-table.jpg?​direct&​400|}}</​imgcaption>​
Line 29: Line 29:
  
 ===== Similarity indices =====
-Categories of similarity indices are summarized in <tabref similarity-indices>​. Symmetric indices, i.e. those which consider double zeros as relevant, are not further treated here since they are not useful for analysis of ecological data (although they may be useful e.g. for analysis of environmental variables). Here we will consider only asymmetric similarity indices, i.e. those ignoring double zeros. These split into two types according to the data which they are using: qualitative (binary) indices, applied on presence-absence data, and quantitative indices, applied on raw (or transformed) species abundances. Note that some of the indices have also multi-sample alternatives (i.e. they could be calculated on more than two samples), which could be used for calculating beta diversity.+Categories of similarity indices are summarized in <tabref similarity-indices>​. Symmetric indices, i.e. those which consider double zeros as relevant, are not further treated here since they are not useful for analysis of ecological data (although they may be useful e.g. for analysis of environmental variables). Here we will consider only asymmetric similarity indices, i.e. those ignoring double zeros. These split into two types according to the data which they are using: ​**qualitative** (binary) indices, applied on presence-absence data, and **quantitative** indices, applied on raw (or transformed) species abundances ​(note, however, ​that presence-absence (qualitative) and abundance (quantitative) species composition data carry different type of information,​ and their analysis may have a different meaning - see the section [[en:​similarity#​presence-absence_vs_quantitative_species_composition_data|Presence-absence vs quantitative species composition data]]). Some of the similarity ​indices have also multi-sample alternatives (i.e. they could be calculated on more than two samples), which could be used for calculating beta diversity.
  
 <​tabcaption similarity-indices|Similarity indices classified according to their properties.>​ <​tabcaption similarity-indices|Similarity indices classified according to their properties.>​
-<fc #ff0000><fs large>​Similarity indices</​fs></fc>                                                 |^ How they deal with //double zero// problem? ​                                                                                                                                                                                     || +^ <fs large>​Similarity indices</​fs> ​                                                |^ How they deal with //double zero// problem? ​                                                                                                                                                                                     || 
-| :::                                                                                                | :::                                           ^ symmetrical (treat double zeros as important information) ​ ^ asymmetrial (ignore double zeros) ​                                                                                                                                   ^ +| :::                                                                               ​| :::                                           ^ symmetrical (treat double zeros as important information) ​ ^ asymmetrial (ignore double zeros) ​                                                                                                                                   ^ 
-^ Which type of data indices use?                     ​^ qualitative (binary = presence absence data)  | <wrap lo>not suitable for ecological data</​wrap> ​          | Jaccard similarity, Sørensen similarity, Simpson similarity ​                                                                                                         | +^ Which type of data indices use?    ^ qualitative (binary = presence absence data)  | <wrap lo>not suitable for ecological data</​wrap> ​          | Jaccard similarity, Sørensen similarity, Simpson similarity ​                                                                                                         | 
-| :::                                                 ​^ quantitative (species abundances) ​            | <wrap lo>not suitable for ecological data</​wrap> ​          | Percentage similarity((Percentage similarity (PS) is quantitative analog of Sørensen index; 1-PS is Percentage dissimilarity,​ also known as Bray-Curtis distance.)) ​ |+| :::                                ^ quantitative (species abundances) ​            | <wrap lo>not suitable for ecological data</​wrap> ​          | Percentage similarity((Percentage similarity (PS) is quantitative analog of Sørensen index; 1-PS is Percentage dissimilarity,​ also known as Bray-Curtis distance.)) ​ |
 </​tabcaption>​ </​tabcaption>​
  
-<​tabcaption abc-schema|The meaning of fraction a, b, c and d used in qualitative indices calculating similarity among two samples. In assymetric indices, the fraction d (double zero) is ignored.>​ 
-^ The number of species which are            |^ in sample 1           || {{ :​obrazky:​similarity-venns-diagram-abc.jpg?​350&​direct }}  | 
-| :::                                        | :::      | present ​     | absent ​ | :::                                                         | 
-^ in sample 2                      | present ​ | a            | b       | :::                                                         | 
-| :::                              | absent ​  | c            | d       | :::                                                         | 
-</​tabcaption>​ 
  
 \\ \\
-\\ +
-**Qualitative (binary) asymmetrical similarity indices** use information about the number of species shared by both samples, and numbers of species which are occurring in the first or the second sample only (see the schema at <tabref abc-schema>​).+
 <WRAP right box 45%> <WRAP right box 45%>
 **Jaccard similarity**:​ <​m>​J~=~{a}/​{a+b+c}</​m>​ **Jaccard similarity**:​ <​m>​J~=~{a}/​{a+b+c}</​m>​
Line 55: Line 48:
 **Simpson similarity**:​ <​m>​Si~=~{a}/​{a+min(b,​c)}</​m>​ **Simpson similarity**:​ <​m>​Si~=~{a}/​{a+min(b,​c)}</​m>​
 </​WRAP>​ </​WRAP>​
-**Jaccard similarity index** divides the number of species shared by both samples (fraction //a//) by the sum of all species occurring in both samples (//a//+//b//+//c//, where //b// and //c// are numbers of species occurring only in the first and only in the second sample, respectively). **Sørensen similarity index** considers the number of species shared among both samples as more important, so it counts it twice. **Simpson similarity index** is useful in a case that compared samples largely differ in species richness (i.e. one sample has considerably more species than the other). If Jaccard or Sørensen are used on such data, their values are generally very low, since the fraction of species occurring only in the rich sample will make the denominator too large and the overall value of the index too low; Simpson index, which was originally introduced for comparison of fossil data, eliminates this problem by taking only the smaller from the fractions //b// and //c//. (Note that there is yet another Simpson index, namely //Simpson diversity index//; each of the indices was named after different Mr. Simpson, and while Simpson similarity index is calculating similarity between pair of compositional samples, Simpson diversity index is index calculating diversity of a single community sample; you may find details in my [[http://davidzeleny.net/blog/2017/03/18/simpsons-similarity-index-vs-simpsons-diversity-index/|blog post]]).
+**Qualitative (binary) asymmetrical similarity indices** use information about the number of species shared by both samples, and numbers of species which are occurring in the first or the second sample only (see the schema at <tabref abc-schema>).
**Jaccard similarity index** divides the number of species shared by both samples (fraction //a//) by the sum of all species occurring in both samples (//a//+//b//+//c//, where //b// and //c// are numbers of species occurring only in the first and only in the second sample, respectively). **Sørensen similarity index** considers the number of species shared among both samples as more important, so it counts it twice. **Simpson similarity index** is useful in a case that compared samples largely differ in species richness (i.e. one sample has considerably more species than the other). If Jaccard or Sørensen are used on such data, their values are generally very low, since the fraction of species occurring only in the rich sample will make the denominator too large and the overall value of the index too low; Simpson index, which was originally introduced for comparison of fossil data, eliminates this problem by taking only the smaller from the fractions //b// and //c//. (Note that there is yet another Simpson index, namely //Simpson diversity index//; each of the indices was named after a different person surnamed Simpson; while the Simpson similarity index is calculating the similarity between a pair of compositional samples, the Simpson diversity index is calculating diversity of a single community sample; you may find details in my [[http://davidzeleny.net/blog/2017/03/18/simpsons-similarity-index-vs-simpsons-diversity-index/|blog post]] on this topic).
 + 
 +<​tabcaption abc-schema|The meaning of fraction a, b, c and d used in qualitative indices calculating similarity among two samples. In assymetric indices, the fraction d (double zero) is ignored.>​ 
 +^ The number of species which are            |^ in sample 1           || {{ :​obrazky:​similarity-venns-diagram-abc.jpg?​350&​direct }}  | 
 +| :::                                        | :::      | present ​     | absent ​ | :::                                                         | 
 +^ in sample 2                      | present ​ | a            | b       | :::                                                         | 
 +| :::                              | absent ​  | c            | d       | :::                                                         | 
 +</​tabcaption>​ 
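
The three qualitative indices can be sketched in Python from the fractions //a//, //b// and //c//. This is a minimal illustration, not a reference implementation: the function names and the two species lists are hypothetical, and the Sørensen formula 2a/(2a+b+c) follows from the description that the shared fraction is counted twice.

```python
def fractions(first, second):
    """Return (a, b, c) for two collections of species names."""
    first, second = set(first), set(second)
    a = len(first & second)   # species shared by both samples
    b = len(first - second)   # species occurring only in the first sample
    c = len(second - first)   # species occurring only in the second sample
    return a, b, c


def jaccard(first, second):
    a, b, c = fractions(first, second)
    return a / (a + b + c)


def sorensen(first, second):
    a, b, c = fractions(first, second)
    return 2 * a / (2 * a + b + c)


def simpson(first, second):
    a, b, c = fractions(first, second)
    return a / (a + min(b, c))


# A species-rich sample (20 species) and a species-poor sample (5 species)
# whose species all occur in the rich sample as well.
rich = ["sp%d" % i for i in range(20)]
poor = ["sp%d" % i for i in range(5)]

print(jaccard(rich, poor))    # a=5, b=15, c=0
print(sorensen(rich, poor))
print(simpson(rich, poor))
```

With these hypothetical samples, Jaccard gives 0.25 and Sørensen 0.4, while Simpson gives 1.0, since the smaller of //b// and //c// is zero. Swapping which sample is called first or second exchanges //b// and //c// but leaves all three indices unchanged.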
 + 
 +\\
  
 <WRAP right box 65%> <WRAP right box 65%>
Line 67: Line 69:
 </​WRAP>​ </​WRAP>​
  
-** Quantitative similarity indices ** (applied on raw abundances) include **percentage similarity**, which is a quantitative version of Sørensen similarity index (which means that if calculated on presence-absence data, it gives the same results are Sørensen similarity index). Note that //percentage difference//, calculated as 1-//percentage similarity//, is called Bray-Curtis index.
+** Quantitative similarity indices ** (applied on quantitative abundance data) include **percentage similarity**, which is a quantitative version of Sørensen similarity index (if calculated on presence-absence data, it gives the same results as Sørensen similarity index). Note that //percentage difference//, calculated as 1-//percentage similarity//, is called Bray-Curtis distance index (see below).
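
A small Python sketch can make the link between percentage similarity and the Sørensen index concrete. It assumes the usual formula PS = 2·Σmin(x1, x2) / (Σx1 + Σx2), which is not spelled out in the lines shown here; the function names and abundance values are hypothetical.

```python
def percentage_similarity(x1, x2):
    """Percentage similarity of two abundance vectors (assumed standard formula)."""
    shared = sum(min(p, q) for p, q in zip(x1, x2))
    return 2 * shared / (sum(x1) + sum(x2))


def sorensen(x1, x2):
    """Sørensen similarity computed from presences and absences only."""
    a = sum(1 for p, q in zip(x1, x2) if p > 0 and q > 0)
    b = sum(1 for p, q in zip(x1, x2) if p > 0 and q == 0)
    c = sum(1 for p, q in zip(x1, x2) if p == 0 and q > 0)
    return 2 * a / (2 * a + b + c)


abund1 = [5, 0, 3, 2]
abund2 = [1, 4, 3, 0]

# Presence-absence versions of the same two samples.
pa1 = [1 if v > 0 else 0 for v in abund1]
pa2 = [1 if v > 0 else 0 for v in abund2]

ps = percentage_similarity(abund1, abund2)
print(ps, 1 - ps)                                         # similarity and percentage difference
print(percentage_similarity(pa1, pa2), sorensen(pa1, pa2))  # identical on binary data
```

For the hypothetical abundances the percentage similarity is 8/18, so the percentage difference (Bray-Curtis) is 10/18; on the binary versions both functions return 4/6.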
 ===== Distance indices =====
 While similarity indices return the highest value in the case that both compared samples are identical (maximally similar), distance indices are largest for two samples which do not share any species (are maximally dissimilar). There are two types of distance (or dissimilarity) indices((Note that the use of "distance" and "dissimilarity" is somewhat not systematic; some authors call distances only those indices which are metric (Euclidean), i.e. can be displayed in metric (Euclidean) geometric space, and the other indices are called dissimilarities; but sometimes these two terms are simply synonyms.)):
Line 75: Line 77:
 <​imgcaption triangle-inequality|Triangle inequality principle.>​{{ :​obrazky:​triangle_inequality_principle.jpg?​direct&​500|}}</​imgcaption>​ <​imgcaption triangle-inequality|Triangle inequality principle.>​{{ :​obrazky:​triangle_inequality_principle.jpg?​direct&​500|}}</​imgcaption>​
  
-An important criterium is **whether the distance index is metric or not** (i.e. it is semi-metric or non-metric). The term "metric" refers to the indices which can be displayed in the orthogonal Euclidean space, since they obey so-called "triangle inequality principle" (see explanation in <imgref triangle-inequality>). Some dissimilarity indices calculated from similarities are metric (e.g. Jaccard dissimilarity), some are not (e.g. Sørensen dissimilarity and it's a quantitative version called Bray-Curtis dissimilarity are semimetric; some other distance may be nonmetric - they can reach negative values, which is nonsensible). In the case of Sørensen and Bray-Curtis (and some others), this can be solved by calculating the dissimilarity as <m>D~=~sqrt{1-S}</m> instead of the standard <m>D~=~1-S</m> (where S is the similarity); resulting dissimilarity index is then metric. Indices which are not metric cause troubles in ordination methods relying on Euclidean space (PCoA or db-RDA) and numerical clustering algorithms which need to locate samples in the Euclidean space (such as Ward algorithm or K-means). For example, PCoA calculated using distances which are not metric creates axes with negative eigenvalues, and this e.g. in db-RDA may result in virtually higher variation explained by explanatory variables than would reflect the data.
+An important criterion is **whether the distance index is metric or not** (i.e. it is semi-metric or non-metric). The term "metric" refers to the distance indices that obey the following four metric properties: 1) minimum distance is zero, 2) distance is always positive (unless it is zero), 3) the distance between sample 1 and sample 2 is the same as distance between sample 2 and sample 1, and 4) triangle inequality (see explanation in <imgref triangle-inequality>).
Indices that obey the fourth, triangle-inequality principle, can be displayed in the orthogonal Euclidean space (and are sometimes called as having Euclidean property; note that Euclidean distance is just one of many distance indices having Euclidean property). Some distance indices calculated from similarities are metric (e.g. Jaccard dissimilarity), some are not (e.g. Sørensen dissimilarity and its quantitative version, called Bray-Curtis distance, are semimetric; some other distance may be nonmetric - they can reach negative values, which is nonsensible for ecological data). In the case of Sørensen and Bray-Curtis (and some others), this can be solved by calculating the dissimilarity as <m>D~=~sqrt{1-S}</m> instead of the standard <m>D~=~1-S</m> (where S is the similarity); resulting dissimilarity index is then metric. Indices which are not metric cause troubles in ordination methods relying on Euclidean space (PCoA or db-RDA) and numerical clustering algorithms which need to locate samples in the Euclidean space (such as Ward algorithm or K-means). For example, PCoA calculated using distances which are not metric creates axes with negative eigenvalues, and this e.g. in db-RDA may result in virtually higher variation explained by explanatory variables than would reflect the data.
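
The triangle inequality can be checked numerically on a small example. The Python sketch below uses three hypothetical two-species samples and compares the standard Bray-Curtis dissimilarity (D = 1-S) with its square-rooted form (D = sqrt(1-S)). It only tests one triple of samples, so it illustrates the problem rather than proving that the square-rooted form is metric in general; all names are assumptions made for illustration.

```python
import math


def bray_curtis(x, y):
    """Bray-Curtis dissimilarity, i.e. 1 - percentage similarity."""
    return sum(abs(p - q) for p, q in zip(x, y)) / (sum(x) + sum(y))


def sqrt_bray_curtis(x, y):
    """Square-rooted form, D = sqrt(1 - S)."""
    return math.sqrt(bray_curtis(x, y))


def obeys_triangle(dist, p, q, r):
    """Check d(p, q) <= d(p, r) + d(r, q) for one triple of samples."""
    return dist(p, q) <= dist(p, r) + dist(r, q) + 1e-12


# Hypothetical samples: A and B share no species, C contains both species.
A = [1, 0]
B = [0, 1]
C = [1, 1]

print(bray_curtis(A, B), bray_curtis(A, C) + bray_curtis(C, B))
print(obeys_triangle(bray_curtis, A, B, C))        # direct path longer than the detour
print(obeys_triangle(sqrt_bray_curtis, A, B, C))   # holds for the square-rooted form
```

Here the direct distance between A and B is 1, while the detour through C sums to only 2/3, so the standard form violates the triangle inequality for this triple. After square-rooting, the detour (about 1.155) is longer than the direct distance (1). Because the samples are binary, the same numbers apply to the Sørensen dissimilarity.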
  
-**Bray-Curtis dissimilarity** or **percentage difference**((Note that according to P. Legendre, Bray-Curtis index should not be called after Bray and Curtis, since they have not really published it, only used it.)) is one complement of //​percentage similarity//​ index described above. It is considered suitable for community composition datasince it ignores double zeros, and it has a meaningful upper value equal to one (meaning complete mismatch between species composition of two samples, i.e. if one species in one sample is present and has some abundance, the same species in the other samples is zero, and vice versa). Bray-Curtis considers absolute species abundances in the samples, not only relative species abundances. The index is not metric, but the version calculated as <​m>​sqrt{1-PS}</​m>​ (where PS is percentage similarity) is metric and can be used in PCoA.+**Bray-Curtis dissimilarity** or **percentage difference**((Note that according to P. Legendre, Bray-Curtis index should not be called after Bray and Curtis, since they have not really published it, only used it.)) is one complement of //​percentage similarity//​ index described above. It is considered suitable for community composition data since it is asymmetrical (ignores double zeros), and it has a meaningful upper value equal to one (meaning complete mismatch between species composition of two samples, i.e. if one species in one sample is present and has some abundance, the same species in the other samples is zero, and vice versa). Bray-Curtis considers absolute species abundances in the samples, not only relative species abundances. The index is not metric, but the version calculated as <​m>​sqrt{1-PS}</​m>​ (where PS is percentage similarity) is metric and can be used in PCoA.
  
 <WRAP right box 35%> <WRAP right box 35%>
Line 84: Line 86:
 </​WRAP>​ </​WRAP>​
  
-**Euclidean distance**, although not suitable for ecological data, is frequently used in a multivariate analysis (mostly because it is the implicit distance for linear ordination methods like PCA, RDA and for some clustering algorithms). Euclidean distance has no upper limit and the maximum value depends on the data. The main reason why it is not suitable for compositional data is that it is a symmetrical index, i.e. it treats double zeros in the same way as double presences. Double zeros shrink the distance between two plots. The solution is to apply Euclidean distances on pre-transformed species composition data (e.g. using Hellinger, Chord or chi-square transformation). An example of calculating Euclidean distance between samples with only two species is on <imgref eucl-dist>.
+**Euclidean distance**, although not suitable for ecological data, is frequently used in a multivariate analysis (mostly because it is the implicit distance for linear ordination methods like PCA, RDA and for some clustering algorithms). Euclidean distance has no upper limit and the maximum value depends on the data. The main reason why it is not suitable for compositional data is that it is a symmetrical index, i.e. it treats double zeros in the same way as double presences, and as a result, double zeros shrink the distance between two plots (the solution is to apply Euclidean distances on pre-transformed species composition data, e.g. using Hellinger, Chord or chi-square transformation; resulting distances are then asymmetrical). Another disadvantage of Euclidean distance is that it puts more emphasis on the absolute species abundances instead of species presences and absences in the samples; as the result, Euclidean distances between two samples not sharing any species may be smaller than between two samples sharing all species, but with the same species having large abundance differences between samples (Euclidean paradox).
An example of calculating Euclidean distance between samples with only two species is on <imgref eucl-dist>.
  
 <​imgcaption eucl-dist|Euclidean distance between two samples with only two species.>​{{ :​obrazky:​schema-calculating-eucl-distance.png?​direct&​400|}}</​imgcaption>​ <​imgcaption eucl-dist|Euclidean distance between two samples with only two species.>​{{ :​obrazky:​schema-calculating-eucl-distance.png?​direct&​400|}}</​imgcaption>​
  
-**Chord distance** is Euclidean distance calculated on normalized species data. Normalization means that species vector in multidimensional space is of unit length; to normalize the species vector, one needs to divide each species abundance in a given sample by the square-rooted sum of squared abundances of all species in that sample. Chord distance is then the Euclidean distance between samples with normalized species data. An advantage of chord distance compared to Euclidean distance is that it has the upper limit (equal to <​m>​sqrt{2}</​m>​),​ while Euclidean distance has no upper limit.+**Chord distance** is Euclidean distance calculated on normalized species data. Normalization means that species vector in multidimensional space is of unit length; to normalize the species vector, one needs to divide each species abundance in a given sample by the square-rooted sum of squared abundances of all species in that sample. Chord distance is then the Euclidean distance between samples with normalized species data. An advantage of chord distance compared to Euclidean distance is that it is asymmetrical (ignores double zeros) and has the upper limit (equal to <​m>​sqrt{2}</​m>​),​ while Euclidean distance has no upper limit.
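
The normalization described above translates directly into a few lines of Python. This is a minimal sketch with hypothetical function names and abundance values; it assumes that no sample is empty, since an all-zero vector cannot be normalized.

```python
import math


def normalize(sample):
    """Scale an abundance vector to unit length."""
    length = math.sqrt(sum(v * v for v in sample))
    return [v / length for v in sample]


def euclidean(x, y):
    return math.sqrt(sum((p - q) ** 2 for p, q in zip(x, y)))


def chord_distance(x, y):
    """Euclidean distance between the normalized abundance vectors."""
    return euclidean(normalize(x), normalize(y))


# Samples without any shared species reach the upper limit sqrt(2).
print(chord_distance([3, 0, 0], [0, 10, 5]), math.sqrt(2))

# Samples with the same proportions of species are at (numerically) zero distance,
# even though their absolute abundances differ tenfold.
print(chord_distance([1, 2], [10, 20]))
```

The second call shows the practical consequence of normalization: only the relative composition of a sample matters, not its total abundance.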
  
-**Hellinger distance** is Euclidean distance calculated on Hellinger transformed species data (and is the distance used in tb-PCA and tb-RDA if the species data are pre-transformed by Hellinger transformation). Hellinger transformation consists of first relativizing the species abundances in the sample by standardizing them to sample total (sum of all abundances in the sample); then, each standardized value is square-rooted. This puts the species abundances on the relative scale, and square-rooting lowers the importance of the dominant species. Hellinger distance has an upper limit of <​m>​sqrt{2}</​m> ​and is considered as a suitable method for ecological data with many zeros.+**Hellinger distance** is Euclidean distance calculated on Hellinger transformed species data (and is the distance used in tb-PCA and tb-RDA if the species data are pre-transformed by Hellinger transformation). Hellinger transformation consists of first relativizing the species abundances in the sample by standardizing them to sample total (sum of all abundances in the sample); then, each standardized value is square-rooted. This puts the species abundances on the relative scale, and square-rooting lowers the importance of the dominant species. Hellinger distance ​is asymmetrical (not influenced by double zeros) and has an upper limit of <​m>​sqrt{2}</​m>​, which makes it a suitable method for ecological data with many zeros.
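
The two steps of the Hellinger transformation (relativizing by the sample total, then square-rooting) can be sketched in Python as follows. Function names and abundances are hypothetical, and the sketch assumes non-empty samples.

```python
import math


def hellinger_transform(sample):
    """Divide each abundance by the sample total, then take the square root."""
    total = sum(sample)
    return [math.sqrt(v / total) for v in sample]


def euclidean(x, y):
    return math.sqrt(sum((p - q) ** 2 for p, q in zip(x, y)))


def hellinger_distance(x, y):
    """Euclidean distance between Hellinger-transformed abundance vectors."""
    return euclidean(hellinger_transform(x), hellinger_transform(y))


# Samples without any shared species reach the upper limit sqrt(2).
print(hellinger_distance([3, 0, 0], [0, 10, 5]), math.sqrt(2))

# Samples with identical relative abundances are at (numerically) zero distance.
print(hellinger_distance([1, 2], [10, 20]))
```

Each transformed vector has unit length (the squared values are the relative abundances, which sum to one), which is why the distance cannot exceed sqrt(2).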
  
-**Chi-square distance** is rarely calculated on its own, but is important since it is implicit in CA and CCA ordination.
+**Chi-square distance** is an asymmetrical distance which is rarely calculated on its own, but is important since it is implicit in CA and CCA ordination.
  
 ===== Euclidean distance: abundance paradox =====
Line 106: Line 108:
 <m>D_Eucl~(Sample 1, Sample 3)~=~sqrt{(0-0)^2+(1-4)^2+(1-8)^2}~=~7.615</m>
  
-The Euclidean distance between samples 1 and 2 is lower than between samples 1 and 3, although samples 1 and 2 have no species in common, while samples 1 and 3 share all species.
+The Euclidean distance between samples 1 and 2 is lower than between samples 1 and 3, although samples 1 and 2 have no species in common, while samples 1 and 3 share all species. Distances based on relative species abundances (i.e. those in which the abundances of species in a sample are made relative, e.g. by dividing each abundance by the sum of abundances of all species in that sample) do not have this problem. An example is the Hellinger distance, which is the Euclidean distance applied to Hellinger-standardized data; the first step of the Hellinger standardization converts absolute species abundances into relative ones.
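The paradox can be reproduced numerically with a short Python sketch. Samples 1 and 3 are taken from the formula above, (0, 1, 1) and (0, 4, 8); sample 2 is assumed here to be (1, 0, 0), which is simply one vector that shares no species with sample 1. The function names are hypothetical.

```python
import math

def euclidean(a, b):
    """Euclidean distance between two equally long vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def hellinger_transform(sample):
    """Relative abundances (divide by sample total), then square root."""
    total = sum(sample)
    return [math.sqrt(x / total) for x in sample]

def hellinger_distance(a, b):
    return euclidean(hellinger_transform(a), hellinger_transform(b))

sample1 = [0, 1, 1]
sample2 = [1, 0, 0]   # assumption: shares no species with sample 1
sample3 = [0, 4, 8]

eucl_12 = euclidean(sample1, sample2)
eucl_13 = euclidean(sample1, sample3)
hell_12 = hellinger_distance(sample1, sample2)
hell_13 = hellinger_distance(sample1, sample3)

# Euclidean distance ranks the samples "the wrong way round";
# Hellinger distance does not.
paradox_in_euclidean = eucl_12 < eucl_13
paradox_in_hellinger = hell_12 < hell_13
```

Under these assumptions `paradox_in_euclidean` is true while `paradox_in_hellinger` is false: after relativizing the abundances, the samples without shared species end up as the most distant pair.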
  
  
 ===== Matrix of similarities/​distances ===== ===== Matrix of similarities/​distances =====
-The matrix of similarities or distances is square (the same number of rows as columns), with the values on the diagonal either zeros (distances) or ones (similarities), and symmetric - the upper right triangle mirrors the values in the lower left one (<imgref dist-matrix>).
+The matrix of similarities or distances is square (the same number of rows as columns), with the values on the diagonal either zeros (distances) or ones (similarities), and is symmetric - the upper right triangle mirrors the values in the lower left one (since it does not matter whether you calculate the similarity/distance from sample A to sample B or from sample B to sample A; <imgref dist-matrix>).
  
 <imgcaption dist-matrix|Matrix of Euclidean distances calculated between all pairs of samples (a subset of 10 samples from Ellenberg's Danube meadow dataset used). Diagonal values (yellow) are zeros since the distance of two identical samples is zero.>{{ :obrazky:eucl-dist-matrix-danube-data.jpg?direct |}}</imgcaption>
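As a rough illustration of these matrix properties, the following Python sketch builds a full distance matrix and checks that it is square, symmetric and has zeros on the diagonal. The three samples are made up for the example and are not the Danube meadow data; the function names are hypothetical.

```python
import math

def euclidean(a, b):
    """Euclidean distance between two equally long vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def distance_matrix(samples, dist=euclidean):
    """Full matrix of pairwise distances between all samples."""
    n = len(samples)
    return [[dist(samples[i], samples[j]) for j in range(n)] for i in range(n)]

# Hypothetical samples (rows = samples, values = species abundances).
samples = [[0, 1, 1], [1, 0, 0], [0, 4, 8]]
D = distance_matrix(samples)
n = len(D)

is_square = all(len(row) == n for row in D)
zero_diagonal = all(D[i][i] == 0 for i in range(n))
is_symmetric = all(D[i][j] == D[j][i] for i in range(n) for j in range(n))
```

Because of the symmetry, only one triangle of the matrix (without the diagonal) needs to be stored or calculated.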
 +
 +===== Presence-absence vs quantitative species composition data =====
 +Species composition data (i.e. data about the occurrence of species in individual community samples) containing species quantities (abundances, covers, biomass, numbers of individuals) can always be transformed into species presences-absences. It is important to know that transforming abundances into presences-absences not only reduces the amount of information (quantities are lost) but also likely changes what the data are able to tell you about the natural processes behind community assembly.
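The transformation itself is simple, as the following Python sketch shows; the function name and the sample values are assumptions used only for illustration.

```python
def to_presence_absence(sample):
    """Replace every non-zero abundance by 1 and keep zeros as 0."""
    return [1 if x > 0 else 0 for x in sample]

# Hypothetical samples with the same species but different abundances.
sample_a = [0, 1, 1]
sample_b = [0, 4, 8]

pa_a = to_presence_absence(sample_a)
pa_b = to_presence_absence(sample_b)

# After the transformation the two samples can no longer be told apart.
identical_after_transformation = pa_a == pa_b
```

Here both samples become (0, 1, 1): the information about which species occur is kept, while the information about how much of each species occurs is lost.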
 +
 +Abundance data carry two types of information: 1) whether a species occurs in the community (or not), and 2) how much of the species occurs there; presence-absence data contain only the first type. At the same time, whether a species occurs in a given community is driven by different ecological processes than its abundance once it already occurs there. The occurrence of a species at a site often depends not only on environmental suitability, but also on dispersal limitation (whether the species can get there), random drift (species may simply go extinct due to stochastic processes) or the existence of biogeographical boundaries (two sites with similar environmental conditions may not share the same species simply because there is a river or mountain range between them). On the other hand, if the species already occurs in the sampled community, then its abundance is often driven by the suitability of environmental conditions (more abundant species are those for which the environment is favourable), but also by biotic interactions (competition or mutualism with other species present in the community).
 +
 +This also means that analysing both abundance and presence-absence transformed species composition data in parallel may sometimes be meaningful if we are trying to uncover alternative processes. Note, however, that if the species composition data contain many zeros (meaning that they are very heterogeneous), then most of the information stored in them relates to the first type of information (see above), even if they contain the abundances of all species present. Some studies (e.g. [[en:references|Wilson 2012]]) show that when analysing the relationship of species composition to environmental variables, transforming species composition data into presences-absences may actually improve the results (the variance explained by the environment in constrained ordination, or the fit of environmental variables to unconstrained ordination axes).
 +
 +
  
  
en/similarity.1585149348.txt.gz · Last modified: 2020/03/25 23:15 by David Zelený