Unconstrained ordination

Principal component analysis (PCA)

Theory

PCA is a linear method of unconstrained ordination. Although this is rarely stated explicitly, it preserves Euclidean distances between samples. For ecological data, which are often rather heterogeneous (many zeros in the matrix), it may produce the so-called “horseshoe” artefact, which brings the dissimilar ends of the gradient close to each other. In the analysis of ecological data, its main domain is the ordination of a matrix of environmental factors, to uncover how they are intercorrelated. For species data, it can be used if the data are rather homogeneous, or if a suitable transformation is applied (e.g. the Hellinger transformation; see Legendre & Gallagher 2001 and the example below) 1).
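To make the effect of the Hellinger transformation concrete, here is a minimal sketch in plain Python (not the R workflow this page otherwise refers to). Each abundance is divided by its site total and square-rooted; the abundance table and the helper names `hellinger` and `euclidean` are hypothetical and serve only as illustration.

```python
import math

def hellinger(abundances):
    """Hellinger-transform a sites x species table given as a list of rows."""
    transformed = []
    for row in abundances:
        total = sum(row)
        if total == 0:
            # A site with no individuals has no relative abundances; keep zeros.
            transformed.append([0.0 for _ in row])
        else:
            # Square root of the relative abundance of each species in the site.
            transformed.append([math.sqrt(value / total) for value in row])
    return transformed

def euclidean(a, b):
    """Euclidean distance between two rows of equal length."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

# Hypothetical data: sites 1 and 2 have the same composition (site 2 is just
# more abundant), site 3 shares no species with either of them.
sites = [
    [10, 5, 0, 0],
    [40, 20, 0, 0],
    [0, 0, 1, 1],
]
hell = hellinger(sites)

print("raw       d(1,2), d(1,3):", euclidean(sites[0], sites[1]), euclidean(sites[0], sites[2]))
print("Hellinger d(1,2), d(1,3):", euclidean(hell[0], hell[1]), euclidean(hell[0], hell[2]))
```

On the raw data, site 1 is closer to site 3 (no shared species) than to site 2 (identical composition), which is the kind of distortion that makes Euclidean distance, and hence PCA, problematic for heterogeneous species data. After the transformation, sites 1 and 2 coincide and site 3 lies at the maximum possible distance.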

PCA axes are sorted in descending order according to the amount of variance they extract, i.e. their eigenvalues. How to decide which axes are important and representative, e.g. for visualization of the data? There are two options (following Borcard et al. 2011):

  • Kaiser-Guttman criterion – calculate the mean of all eigenvalues and interpret only axes with eigenvalues larger than this mean;
  • broken stick model – randomly divides a stick of unit length into the same number of pieces as there are PCA axes and then sorts these pieces from the longest to the shortest. This procedure is repeated many times and the results are averaged over all repetitions (an analytical solution to this problem is also known). The broken stick model represents the eigenvalues that would occur at random. One may want to interpret only those PCA axes whose eigenvalues are larger than the values generated by the broken stick model.
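Both criteria can be sketched in a few lines of plain Python. This is an illustration only, not the code of evplot or PCAsignificance: the eigenvalues are hypothetical, all function names are made up for this example, and stopping at the first axis that fails the broken stick comparison is one possible convention, assumed here. Because the stick has unit length, eigenvalues are compared with the broken stick values as proportions of the total variance. The analytical solution used is b_k = (1/p) · Σ_{i=k..p} 1/i for p axes.

```python
import random

def kaiser_guttman(eigenvalues):
    """1-based indices of axes whose eigenvalue exceeds the mean eigenvalue."""
    mean = sum(eigenvalues) / len(eigenvalues)
    return [k for k, ev in enumerate(eigenvalues, start=1) if ev > mean]

def broken_stick(n_axes):
    """Analytical expected piece lengths, longest first."""
    return [sum(1.0 / i for i in range(k, n_axes + 1)) / n_axes
            for k in range(1, n_axes + 1)]

def broken_stick_simulated(n_axes, n_rep=20000, seed=1):
    """Same quantities obtained by repeatedly breaking a unit stick at random."""
    rng = random.Random(seed)
    totals = [0.0] * n_axes
    for _ in range(n_rep):
        cuts = sorted(rng.random() for _ in range(n_axes - 1))
        edges = [0.0] + cuts + [1.0]
        pieces = sorted((b - a for a, b in zip(edges, edges[1:])), reverse=True)
        for k, piece in enumerate(pieces):
            totals[k] += piece
    return [t / n_rep for t in totals]

def axes_above_broken_stick(eigenvalues):
    """Leading axes whose share of variance exceeds the broken stick share."""
    total = sum(eigenvalues)
    selected = []
    for k, (ev, expected) in enumerate(zip(eigenvalues, broken_stick(len(eigenvalues))), start=1):
        if ev / total <= expected:
            break  # assumed convention: stop at the first axis that fails
        selected.append(k)
    return selected

eigenvalues = [4.0, 2.5, 0.8, 0.4, 0.3]  # hypothetical PCA eigenvalues

print("Kaiser-Guttman axes:", kaiser_guttman(eigenvalues))
print("Broken stick axes:  ", axes_above_broken_stick(eigenvalues))
print("Analytical model:   ", broken_stick(len(eigenvalues)))
print("Simulated model:    ", broken_stick_simulated(len(eigenvalues)))
```

The simulated values should approach the analytical ones as the number of repetitions grows, which is why the analytical formula is normally used directly.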

Both methods can be visualized using the evplot function (definition here, example below), written by F. Gillet (Borcard et al. 2011). The broken stick model can be calculated by the PCAsignificance function from the BiodiversityR package.

Two types of scaling are available; scaling here refers to the way ordination results are projected in the reduced space for graphical display. There is no single way to display sites and variables (species) in the same biplot (i.e. a diagram showing two types of results, here sites and variables), which is why there are two ways of scaling the results 2):

  • Scaling 1 - distances among objects (sites) in the biplot are approximations of their Euclidean distances in multidimensional space; the angles among descriptor (species) vectors are meaningless. Choose this scaling if the main interest is to interpret relationships among objects.
  • Scaling 2 - distances among objects in the biplot are not approximations of their Euclidean distances; the angles between descriptor (species) vectors reflect their correlations. Choose this scaling if the main interest focuses on the relationships among descriptors (species).
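The difference between the two scalings can be checked numerically. The sketch below, in plain Python, makes several assumptions: it uses the common textbook definition (scaling 1: unit-length eigenvectors and site scores obtained by projecting the centred data; scaling 2: site scores divided by, and species scores multiplied by, the square root of the eigenvalue), a hypothetical table with only two variables so that the eigen-decomposition has a closed form, and made-up function names. Software such as vegan may apply additional constants, so its scores need not match these numbers. Both axes are kept here, so the relationships hold exactly; in a real biplot with fewer axes they are only approximations.

```python
import math

def pca_two_variables(data):
    """PCA of a sites x 2 table using the closed-form 2x2 eigen-decomposition."""
    n = len(data)
    means = [sum(col) / n for col in zip(*data)]
    centered = [[v - m for v, m in zip(row, means)] for row in data]
    # Covariance matrix [[a, b], [b, c]] of the two variables.
    a = sum(r[0] * r[0] for r in centered) / (n - 1)
    b = sum(r[0] * r[1] for r in centered) / (n - 1)
    c = sum(r[1] * r[1] for r in centered) / (n - 1)
    half_trace = (a + c) / 2.0
    radius = math.hypot((a - c) / 2.0, b)
    eigenvalues = [half_trace + radius, half_trace - radius]
    # Unit-length eigenvector for each eigenvalue (valid when b != 0).
    axes = []
    for lam in eigenvalues:
        norm = math.hypot(b, lam - a)
        axes.append((b / norm, (lam - a) / norm))
    return centered, (a, b, c), eigenvalues, axes

def scores(centered, eigenvalues, axes, scaling):
    """Site and species scores under scaling 1 or scaling 2."""
    site = [[sum(v * u for v, u in zip(row, axis)) for axis in axes] for row in centered]
    species = [[axis[j] for axis in axes] for j in range(2)]
    if scaling == 2:
        roots = [math.sqrt(lam) for lam in eigenvalues]
        site = [[s / r for s, r in zip(row, roots)] for row in site]
        species = [[s * r for s, r in zip(row, roots)] for row in species]
    return site, species

def distance(u, v):
    return math.hypot(u[0] - v[0], u[1] - v[1])

def cosine(u, v):
    return (u[0] * v[0] + u[1] * v[1]) / (math.hypot(*u) * math.hypot(*v))

# Hypothetical table: 5 sites x 2 positively correlated variables.
data = [[2.0, 1.0], [4.0, 3.0], [6.0, 4.0], [8.0, 8.0], [10.0, 9.0]]
centered, (a, b, c), eigenvalues, axes = pca_two_variables(data)
site1, species1 = scores(centered, eigenvalues, axes, scaling=1)
site2, species2 = scores(centered, eigenvalues, axes, scaling=2)
correlation = b / math.sqrt(a * c)

print("original distance, sites 1 and 4:", distance(data[0], data[3]))
print("scaling 1 distance:", distance(site1[0], site1[3]))
print("scaling 2 distance:", distance(site2[0], site2[3]))
print("correlation of the two variables:", correlation)
print("cosine between variable vectors, scaling 1:", cosine(species1[0], species1[1]))
print("cosine between variable vectors, scaling 2:", cosine(species2[0], species2[1]))
```

In this sketch, scaling 1 reproduces the original Euclidean distance between sites, while the variable vectors are orthogonal whatever their correlation. Scaling 2 distorts the site distances, but the cosine of the angle between the variable vectors equals their correlation.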

Note: the way scaling is implemented in the vegan package was recently reworked; read the blog post by Gavin Simpson for more details.


1) However, it seems that not everybody agrees that this transformation really removes the problem; see the ESA 2010 presentation by Peter Minchin & Lauren Rennie.
2) In CANOCO, Scaling 1 corresponds to the option Focus scaling on intersample distances, and Scaling 2 corresponds to the option Focus scaling on inter-species correlations.
en/pca.txt · Last modified: 2017/02/15 09:23 by David Zelený