# Analysis of community ecology data in R

David Zelený

Section: Ordination analysis

## Explained variation in constrained ordination

As explained in RDA, tb-RDA, CCA & db-RDA (constrained ordination), constrained ordination is a set of multivariate regression analyses, and, as in ordinary least squares regression, the effect size is measured by R², the coefficient of determination. R² quantifies the variation in species composition explained by the environmental variable(s) and, if no covariables are included, can be calculated as the sum of eigenvalues of all constrained axes divided by the total variation (the sum of eigenvalues of all axes).
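The ratio described above can be sketched in a few lines of Python. This is only an illustration: the function name `r_squared` and the eigenvalues are hypothetical and do not come from any dataset used on this website, and the sketch assumes an analysis without covariables.

```python
def r_squared(constrained_eigenvalues, unconstrained_eigenvalues):
    """R2 = sum of constrained eigenvalues / sum of eigenvalues of all axes.

    Assumes no covariables, so total variation is the sum of both sets.
    """
    constrained = sum(constrained_eigenvalues)
    total = constrained + sum(unconstrained_eigenvalues)
    return constrained / total


# Hypothetical eigenvalues, for illustration only
constrained_axes = [0.30, 0.10]
unconstrained_axes = [0.80, 0.50, 0.20, 0.10]

r2 = r_squared(constrained_axes, unconstrained_axes)
print(f"R2 = {r2:.3f}")  # prints R2 = 0.200
```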

The value of R² in constrained ordination suffers from the same drawback as in ordinary regression: it decreases with the number of samples in the dataset and increases with the number of explanatory variables, which makes the values incomparable between datasets of different sizes. The solution is to use adjusted R². The absolute value of explained variation is also not very informative on its own unless it is put into context, for example by comparing it to the variation that the same number of explanatory variables could possibly explain on the same species composition data. Even if the explanatory variables are randomly generated, R² is positive (in contrast to adjusted R², which may be zero or even negative), so to decide whether the results are interpretable, it is useful to test their significance by a Monte Carlo permutation test.

R² is known to depend on the number of samples in the dataset (sites in our case) and on the number of explanatory variables: it decreases with the number of samples and increases with the number of predictors, even if these are randomly generated (Fig. 1). The relationship can be expressed numerically: p random predictors explain, on average, p/(n-1) of the variation, where n is the number of samples in the analysis.
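The p/(n-1) relationship can be checked with a small simulation. The sketch below is a simplification and not the simulation behind Fig. 1: it assumes a single random predictor (p = 1) and a single random response variable standing in for the species composition matrix, and all function names are made up for this illustration.

```python
import random


def r_squared_single_predictor(x, y):
    """R2 of a simple linear regression of y on x (squared Pearson correlation)."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy ** 2 / (sxx * syy)


def mean_r2_of_random_predictor(n, replicates, rng):
    """Average R2 over many pairs of unrelated random response and predictor."""
    total = 0.0
    for _ in range(replicates):
        y = [rng.gauss(0, 1) for _ in range(n)]
        x = [rng.gauss(0, 1) for _ in range(n)]
        total += r_squared_single_predictor(x, y)
    return total / replicates


rng = random.Random(1)  # fixed seed so the run is repeatable
results = {}
for n in (11, 21, 51):
    results[n] = mean_r2_of_random_predictor(n, replicates=2000, rng=rng)
    print(f"n = {n:2d}: mean R2 = {results[n]:.3f}, expected 1/(n-1) = {1 / (n - 1):.3f}")
```

With p = 1 the mean R² should come out close to 1/(n-1) and shrink as n grows, even though the predictor carries no information about the response.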

The solution to this problem is to calculate adjusted R². For linear ordination methods (as well as for ordinary least squares multiple regression), the adjusted R² can be calculated using Ezekiel's formula:

$$R^2_{adj} = 1 - \frac{n - 1}{n - p - 1}\,(1 - R^2)$$

where n is the number of samples and p is the number of predictors (explanatory variables). The resulting adjusted R² values are independent of the number of samples and predictors (Fig. 1).

Figure 1: Comparison of variance explained in constrained ordination expressed by R² and adjusted R². The community data with one strong gradient were simulated using the library simcom, with an increasing number of samples. The explanatory variables are randomly generated. R² decreases with the number of samples in the dataset (left figure) and increases with the number of explanatory variables (although these are just randomly generated). Adjusted R² is not influenced by these two dataset parameters.
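Ezekiel's adjustment, 1 - (n-1)/(n-p-1) × (1 - R²), is easy to try out numerically. The following minimal sketch (the function name `adjusted_r2` is chosen here for illustration) plugs in the R² that p random predictors are expected to explain, p/(n-1), and shows that the adjustment brings it back to zero for any p.

```python
def adjusted_r2(r2, n, p):
    """Ezekiel's adjustment: 1 - (n - 1) / (n - p - 1) * (1 - R2).

    n is the number of samples, p the number of explanatory variables.
    """
    if n - p - 1 <= 0:
        raise ValueError("Ezekiel's formula needs n > p + 1")
    return 1 - (n - 1) / (n - p - 1) * (1 - r2)


n = 50
for p in (1, 5, 10):
    expected_random_r2 = p / (n - 1)  # what p random predictors explain on average
    adj = adjusted_r2(expected_random_r2, n, p)
    print(f"p = {p:2d}: R2 = {expected_random_r2:.3f}, adjusted R2 = {adj:.3f}")
```

The adjusted values equal zero up to floating-point rounding, which is the algebraic reason why adjusted R² does not drift with n or p when the predictors are random.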

In the case of unimodal ordination methods, however, the values returned by Ezekiel's formula are overestimated (and the dependence of explained variation on the number of samples and/or explanatory variables is not removed), so R² needs to be adjusted using the permutational method proposed by Peres-Neto et al. (2006). This permutation adjustment uses a modified Ezekiel's formula to compare the observed variation explained by the variables (R²) with the expected (mean) variation that the same number of variables would explain if they were random (Fig. 2). Adjusted R² calculated by the permutation method will differ slightly among calculations (the differences will be small if the number of permutations is set high).

Figure 2
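The text above does not spell out the modified formula, so the following sketch rests on an assumption about its form. Ezekiel's formula can be rewritten as 1 - (1 - R²) / (1 - p/(n-1)), where p/(n-1) is the R² expected from random predictors; the sketch assumes that the permutation adjustment replaces this term with the mean R² obtained from permuted data. The function name and all numbers are hypothetical.

```python
from statistics import mean


def permutation_adjusted_r2(r2_observed, r2_permuted):
    """Assumed form: 1 - (1 - R2_observed) / (1 - mean(R2_permuted))."""
    return 1 - (1 - r2_observed) / (1 - mean(r2_permuted))


# Hypothetical values: the observed R2 and the R2 of five permuted analyses
r2_observed = 0.20
r2_permuted = [0.04, 0.06, 0.05, 0.03, 0.07]

adj = permutation_adjusted_r2(r2_observed, r2_permuted)
print(f"adjusted R2 = {adj:.3f}")  # prints adjusted R2 = 0.158
```

In a real analysis the permuted values would come from re-running the ordination many times with randomized explanatory variables; five values are used here only to keep the arithmetic visible. Because the mean depends on the random permutations drawn, repeated runs give slightly different results.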

In contrast to R², the values of adjusted R² can reach zero, which means that the explanatory variables do not explain any variation in species composition, or they can even be negative, which means that the explanatory variables explain less variation than would be explained (on average) by the same number of randomly generated ones. Negative values are usually ignored and not interpreted (this is important, e.g., when interpreting fractions in variation partitioning).

### Is the value of explained variation too low?

The variation explained by constrained ordination may often seem too low in absolute terms. For example, in Example 1 in this section, the variation in species composition of the vltava dataset explained by two explanatory variables, soil pH and soil depth, is less than 9%. Are the results of an analysis explaining less than 9% of variation worth interpreting and publishing?

To correctly interpret the value of explained variation, you need to consider that in the case of multivariate linear regression (which constrained ordination is), two explanatory variables will be unable to explain 100% of the overall variation. In fact, the amount of variation explainable by a given number of explanatory variables in a given dataset can be calculated exactly, if we assume that the best explanatory variables for that dataset are the sample scores on the ordination axes calculated by an unconstrained variant of the ordination on the same dataset. For example (Fig. 3), suppose we take one explanatory variable and want to know how much it could maximally explain in a constrained ordination of a given dataset (here soil pH used as the explanatory variable in tb-RDA on log- and Hellinger-transformed species composition data from the vltava dataset, explaining 4.8% of variation). We can do the following: 1) calculate the unconstrained variant of the given ordination method (here tb-PCA) on the same species composition dataset (with the same transformation of raw data, if applicable), and 2) check the variation represented by the first ordination axis (calculated by dividing the eigenvalue of this axis by the total variation in the dataset). This value (13.1% in our example) is the maximum variation any single predictor can explain in the given dataset. In our example, pH explained 4.8% of the total variation, while it could maximally explain 13.1%; in other words, it explained more than a third of what it could explain (4.8/13.1 = 36.6%). This is not too bad.

Figure 3

The same applies to more than one explanatory variable: if you have two variables, for example, you take the variation represented by the first two axes of the unconstrained variant of the ordination for comparison. With more than one variable, however, the comparison runs into the problem that ordination axes are by definition uncorrelated, while explanatory variables are (often) somewhat correlated. This means that the variation represented by the first k unconstrained ordination axes is the maximum variation that could be explained by k explanatory variables which are not intercorrelated.
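The comparison against the unconstrained maximum comes down to simple arithmetic, sketched below. The 4.8% and 13.1% figures are the ones quoted above for soil pH in the vltava dataset; the eigenvalues used for the two-variable case are hypothetical, and both function names are made up for this illustration.

```python
def max_explainable(unconstrained_eigenvalues, total_variation, n_variables):
    """Share of total variation represented by the first n_variables unconstrained axes."""
    leading = sorted(unconstrained_eigenvalues, reverse=True)[:n_variables]
    return sum(leading) / total_variation


def share_of_maximum(explained, maximum):
    """How much of the attainable variation was actually explained."""
    return explained / maximum


# Values quoted in the text: pH explains 4.8 %, the first tb-PCA axis represents 13.1 %
ratio = share_of_maximum(4.8, 13.1)
print(f"{ratio:.1%}")  # prints 36.6%

# Hypothetical leading eigenvalues of an unconstrained ordination (total variation = 1)
eigenvalues = [0.131, 0.080, 0.050, 0.030]
ceiling = max_explainable(eigenvalues, total_variation=1.0, n_variables=2)
print(f"two uncorrelated variables could explain at most {ceiling:.1%}")
```

Because real explanatory variables are usually intercorrelated, the value returned by `max_explainable` for two or more variables is an upper bound that they will typically not reach.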