User Tools

Site Tools


en:rarefaction_examples

Section: Diversity analysis

Rarefaction

Example 1: Comparing forest diversity along elevation standardized to sample area, number of individuals and sample completeness

We will use vegetation data from One-hectar plots in different forest types across Taiwan, containing the survey of woody species at seven localities sampled in different elevations in Taiwan. At each locality, a 1-ha plot has been established, and within the plot, 25 10×10-m subplots have been sampled on an even grid (since there are gaps between the subplots, in total, 25×0.01-ha = 0.25-ha area was surveyed within each locality; see data description for details). The dataset is prepared in two forms: abundance data (hp.abund) contains numbers of individuals for each species at each locality; incidence-based data (hp.incid) contains incidences of each species at each locality (incidence is the presence of species in a subplot made within each locality; each species at the locality can incidence number up to 25, i.e. occurring in up to 25 subplots).

Our aim will be to compare the diversities of forest vegetation in different elevations. For this, we need to standardize data to a common base. The original data are standardized to the area (at each locality, a total of 0.25 ha was surveyed), which is a common approach for vegetation ecologists. Another option is to standardize data to the same number of individuals (this is possible in the case of abundance-based data) or the same coverage (this is possible for both abundance- and incidence-based data). Let's see the result of all three options.

First, upload the dataset:

hp.abund <- read.delim ('https://raw.githubusercontent.com/zdealveindy/anadat-r/master/data/hp.abund.txt')
hp.incid <- read.delim ('https://raw.githubusercontent.com/zdealveindy/anadat-r/master/data/hp.incid.txt')

Both hp.abund and hp.incid are data frames, with species in rows and localities (1-ha plots) in columns; the cells are filled either by numbers of individuals of given species (hp.abund) or the number of incidences (out of 25) of each species (hp.incid). In case of hp.incid, the first row additionally contains the information about the number of subplots for which incidences were recorded (25 in case of all localities); this is necessary for the calculation of incidence-based rarefaction (see further).

For all calculations in this exercise, we will use the library iNEXT, developed by the team of prof. Anne Chao (Tsing-Hua University, Hsinchu, Taiwan). You may need to install the package from CRAN first if you haven't used it before:

# install.packages ('iNEXT')
library (iNEXT)

Before the analysis, let's oversee the data in each of the data frames first. The package iNEXT offers function DataInfo for that:

DataInfo (hp.abund, datatype = 'abundance')
  site    n S.obs     SC f1 f2 f3 f4 f5 f6 f7 f8 f9 f10
1   FT  232    30 0.9658  8  8  1  1  0  1  3  1  2   0
2  YLN 1056    59 0.9924  8 11  5  2  3  1  1  3  1   1
3   LJ  838    40 0.9905  8  4  2  1  1  1  1  0  0   3
4   WJ 1731    60 0.9919 14  3  4  0  1  0  1  0  1   3
5  YYH 1551    34 0.9916 13  1  0  0  0  0  0  1  0   1
6   PL  394    29 0.9823  7  4  3  0  1  0  1  0  0   0
7   GY  362    20 0.9890  4  4  1  2  1  0  2  0  0   0
DataInfo (hp.incid, datatype = 'incidence')
  site  T   U S.obs     SC Q1 Q2 Q3 Q4 Q5 Q6 Q7 Q8 Q9 Q10
1   FT 25 140    30 0.9396  9  7  2  2  1  2  1  1  2   1
2  YLN 25 425    59 0.9761 11 11  5  4  1  4  2  1  3   2
3   LJ 25 264    40 0.9627 10  2  2  1  5  2  2  2  3   2
4   WJ 25 564    60 0.9738 15  3  6  0  1  2  3  1  3   2
5  YYH 25 313    34 0.9587 13  1  0  0  0  1  1  2  3   0
6   PL 25 186    29 0.9533  9  4  1  0  2  1  0  1  2   3
7   GY 25 134    20 0.9490  7  2  2  2  0  2  0  0  0   0

The meaning of the variables in the output of DataInfo is the following:

  • for datatype = “abundance”: site is the abbreviation of the locality (e.g. FT is Feng-Tien, see the data description), n is the number of individuals, S.obs is the observed number of species in the locality, SC is sample coverage (estimated values of sampling completeness for each locality, between 0 and 1). The values in columns f1, f2, f3 ... f10 are numbers of species in the locality represented by only 1, 2, 3, ...10 individuals (singletons, doubletons, tripletons etc.)
  • for datatype = “incidence”: site and S.obs the same as above; T is the number of plots within each locality (25 subplots in our case), U is the overall number of species incidences within locality (i.e. the sum of incidences of individual species, where incidence = presence of the species in one subplot). Q1, Q2, Q3, ... Q10 are numbers of species occurring in only 1, 2, 3, ... 10 subplots within each locality (unique species, duplicate species, etc.).

(note that both abundance and incidence data are based on the same original dataset, which contains a number of individuals surveyed within each of 25 10×10-m subplots; for abundance data, individuals of each species have been summed across all 25 subplots within the locality, while for incidence data, only presences-absences of species within the subplots (i.e. not the number of their individuals) were considered and summed across all 25 subplots within the locality).

Let's focus on abundance-based data first (hp.abund). You can see that localities are quite remarkably different in numbers of individuals (n), with the lowest number in FT (Feng-Tien, 232 individuals, low elevation) and highest in WJ (Wu-Jie, 1731 individuals, middle elevation). The numbers of species somehow copy the number of individuals (the correlation between n and S.obs in DataInfo (hp.abund) is 0.7: cor (DataInfo (hp.abund)$S.obs, DataInfo (hp.abund)$n)). This may be suspicious; what if the middle elevation localities are diverse simply because they have a higher density of individuals per fixed sampled area?

To make sure that this is not the case, let's standardize the data to fixed number of individuals. We can first draw the rarefaction curves to see differences between individual localities:

D_abund <- iNEXT (hp.abund, datatype = 'abundance')
plot (D_abund)

Localities very much differ by the number of individuals, which is why the rarefaction curves have rather different length; additionally, by default, the function calculates and plots also extrapolated part of rarefaction curve (up to double the number of individuals, dashed line). Important are also confidence intervals (C.I., envelopes around each curve): only diversity estimates with non-overlapping C.I. can be considered as significantly different. (Note that localities in the figure legend are sorted by alphabet, while in the original dataset, they are sorted by elevation).

To rarefy the diversities of all localities to the lowest number of observed individuals per locality (232 individuals in FT), we can use:

D_abund_232 <- iNEXT (hp.abund, datatype = 'abundance', endpoint = 232)
plot (D_abund_232)

The estimated numbers are, more efficiently, calculated by the function estimateD, with the same arguments as iNEXT:

est_D_abund <- estimateD (hp.abund, datatype = 'abundance', conf = NULL)
est_D_abund
   Assemblage   m        Method Order.q        SC        qD    qD.LCL    qD.UCL
1          FT 464 Extrapolation       0 0.9953734 33.443758 25.969938 40.917577
2          FT 464 Extrapolation       1 0.9953734 14.546192 12.627943 16.464441
3          FT 464 Extrapolation       2 0.9953734  8.495710  6.874289 10.117131
4         YLN 464   Rarefaction       0 0.9724390 49.730102 46.990454 52.469750
5         YLN 464   Rarefaction       1 0.9724390 23.366161 21.826476 24.905845
6         YLN 464   Rarefaction       2 0.9724390 14.435290 13.181701 15.688879
7          LJ 464   Rarefaction       0 0.9839027 35.387881 32.537498 38.238265
8          LJ 464   Rarefaction       1 0.9839027 18.100085 16.793039 19.407132
9          LJ 464   Rarefaction       2 0.9839027 12.550161 11.408808 13.691515
10         WJ 464   Rarefaction       0 0.9808507 45.874795 43.182416 48.567174
11         WJ 464   Rarefaction       1 0.9808507 26.134792 24.874321 27.395263
12         WJ 464   Rarefaction       2 0.9808507 18.811664 17.672183 19.951146
13        YYH 464   Rarefaction       0 0.9895967 24.270681 21.516176 27.025185
14        YYH 464   Rarefaction       1 0.9895967 11.386044 10.692071 12.080018
15        YYH 464   Rarefaction       2 0.9895967  7.187384  6.571547  7.803221
16         PL 464 Extrapolation       0 0.9855435 30.123771 25.212698 35.034844
17         PL 464 Extrapolation       1 0.9855435 15.862272 14.369425 17.355119
18         PL 464 Extrapolation       2 0.9855435 12.081222 10.552471 13.609972
19         GY 464 Extrapolation       0 0.9937452 20.859236 17.344485 24.373986
20         GY 464 Extrapolation       1 0.9937452  7.572978  6.740570  8.405386
21         GY 464 Extrapolation       2 0.9937452  5.284599  4.729888  5.839310

In the output of estimateD, the columns qD contains estimated diversity by Hill number for individual localities and q orders, specified in Order.q column: q = 0 for species richness, q = 1 for Shannon diversity, and q = 2 for Simpson diversity); check Hill numbers to refresh what it means. We care about species richness in this case, so we filter out the estimates of richness standardized to the same number of individuals (232):

D_individuals <- est_D_abund$qD[est_D_abund$Order.q == 0]

Alternatively, we may standardize the same data not to the same number of individuals but to the same sample coverage (completeness of our survey when compared to the expected number of species occurring in the surveyed community). We may first check the coverage rarefaction curve for our localities, which can be plotted by the same plot function applied on the result of iNEXT, just with modified type argument (check ?plot.iNEXT for the meaning of individual arguments):

plot (D_abund, type = 3)

Since sample coverage values for all localities are quite high (lowest 0.9658 for FT, highest 0.9924 for YLN), we may zoom to the end of the x-axis, using the xlim argument of the plot function:

plot (D_abund, type = 3, xlim = c(.95, 1))

The locality with the lowest coverage is again Feng-Tien (it also had the lowest number of individuals while being a potentially highly diverse lowland forest). We can use estimateD to standardize the diversities to this lowest coverage. Function estimateD has (additionally to datatype argument) also arguments base (either size if we want to do a comparison based on sample size (number of individuals in abundance-based data or number of plots in incidence-based data), or coverage if we do comparison based on coverage) and level - the absolute value to which to standardize (either size or coverage). If the level argument is left default (NULL), the standardization is done to the lowest level (of size or coverage) among localities:

est_D_abund_coverage <- estimateD (hp.abund, datatype = 'abundance', base = 'coverage')
est_D_abund_coverage
   Assemblage         m        Method Order.q        SC        qD    qD.LCL    qD.UCL
1          FT  412.9387 Extrapolation       0 0.9928149 33.145692 23.325549 42.965835
2          FT  412.9387 Extrapolation       1 0.9928149 14.442230 12.065224 16.819236
3          FT  412.9387 Extrapolation       2 0.9928149  8.478736  6.526250 10.431223
4         YLN 1075.3365 Extrapolation       0 0.9928149 59.142677 51.647595 66.637759
5         YLN 1075.3365 Extrapolation       1 0.9928149 24.363421 22.566673 26.160170
6         YLN 1075.3365 Extrapolation       2 0.9928149 14.677421 13.266701 16.088141
7          LJ 1074.9892 Extrapolation       0 0.9928149 41.969322 32.180488 51.758155
8          LJ 1074.9892 Extrapolation       1 0.9928149 18.639115 17.242732 20.035499
9          LJ 1074.9892 Extrapolation       2 0.9928149 12.730665 11.533934 13.927397
10         WJ 2207.7809 Extrapolation       0 0.9928149 63.636669 27.547941 99.725397
11         WJ 2207.7809 Extrapolation       1 0.9928149 27.600699 25.963761 29.237636
12         WJ 2207.7809 Extrapolation       2 0.9928149 19.401169 18.134126 20.668212
13        YYH 3102.0000 Extrapolation       0 0.9928149 46.048233  1.355485 90.740981
14        YYH 3102.0000 Extrapolation       1 0.9928149 11.818359 10.940971 12.695747
15        YYH 3102.0000 Extrapolation       2 0.9928149  7.270006  6.711362  7.828650
16         PL  704.7632 Extrapolation       0 0.9928149 32.631485 16.661780 48.601191
17         PL  704.7632 Extrapolation       1 0.9928149 16.148008 14.905676 17.390341
18         PL  704.7632 Extrapolation       2 0.9928149 12.180815 10.911933 13.449698
19         GY  438.9009 Extrapolation       0 0.9928149 20.690376 13.091915 28.288836
20         GY  438.9009 Extrapolation       1 0.9928149  7.559357  6.532531  8.586183
21         GY  438.9009 Extrapolation       2 0.9928149  5.281804  4.602661  5.960946

The estimates of species richness standardized to sample coverage 0.993 need to be again filter out from the column qD:

D_coverage <- est_D_abund_coverage$qD[est_D_abund_coverage$Order.q == 0]

We need to also add the original numbers of species at each locality (i.e. diversity standardised to sample area, D_area; this can be taken, e.g. from the DataInfo output object, argument S.obs) and store all three variables into a single object, to allow for comparison (D_est):

D_area <- DataInfo (hp.abund, datatype = 'abundance')$S.obs
D_est <- cbind (D_area, D_individuals, D_coverage)
rownames (D_est) <- DataInfo (hp.abund, datatype = 'abundance')$Assemblage
D_est
    D_area D_individuals D_coverage
FT      30      33.44376   33.14569
YLN     59      49.73010   59.14268
LJ      40      35.38788   41.96932
WJ      60      45.87480   63.63667
YYH     34      24.27068   46.04823
PL      29      30.12377   32.63149
GY      20      20.85924   20.69038

To visually compare diversity, we can use a simple barplot function, with argument beside = TRUE to make sure that bars for individual sites will be displayed beside, not stacked:

barplot (D_est, beside = T, legend.text = T, col = heat.colors (7), xlab = 'Sample standardisation',
         ylab = 'Species richness', names.arg = c('sample area', '# of individuals', 'sample coverage'))

Note that it makes no sense to compare diversities of individual plots across different standardisations (e.g. the richness of FT standardised to area, individuals and coverage), but it makes sense to compare them within the same standardisation (e.g. FT standardised to a number of individuals with YYH standardised to a number of individuals). From the barplot it is clear that the rank of localities according to their richness changes after standardisation; e.g., before standardization, YYH (Yuan-Yang-Hu, the plot close to famous Yuan-Yang lake, 鴛鴦湖, perhaps the foggiest locality in Taiwan) was slightly species richer than FT (Feng-Tien, lowland subtropical forest); after standardisation to number of individuals, YYH became poorer than FT, but after standardization to sample coverage, YYH became considerably richer than FT. The reason may be a remarkably higher number of individuals surveyed in YYH (1551 ind.) than in FT (232 ind.).

Another option is to remove the absolute difference in diversity among standardisations (use function scale to standardise diversity within the same standardisation to zero mean and unit variance) and plot the pattern along elevation (the script is already a bit more advanced, you may check the ?matplot for details):

matplot (scale (D_est), type = 'b', axes = F, xlab = 'Locality', ylab = 'Standardized species richness')
axis (1, at = 1:7, labels = rownames (D_est))
axis (2)
box ()
legend ('topright', title = 'Diversity standardised by:', legend = c('sample area', '# individuals', 'sample coverage'), pch = as.character (1:3), col = 1:3, lty = 1:3)

en/rarefaction_examples.txt · Last modified: 2023/05/10 22:12 by David Zelený

Donate Powered by PHP Valid HTML5 Valid CSS Driven by DokuWiki