Theory, R functions & Examples
Data sets used in community ecology usually consist of one, two or three of the following matrices (see also Fig. 1):
Variables (be they environmental variables, species attributes, or values in the species composition matrix) fit into one of three main categories:
Primary data are the starting point of every analysis. Whether collected in notebooks or, more recently, on electronic devices such as recorders, tablets or smartphones, they need to be transferred to a spreadsheet and archived. I find it useful to keep data in a multi-sheet Excel file, with each matrix in a separate sheet and one extra sheet containing metadata: a detailed description of what kind of data each sheet contains, what the abbreviations mean (if these are used, e.g. for environmental variables or species names), and whether any changes or corrections were made after the data were entered (there usually are some). Such a file can serve as the basis for all further analyses, keeping all primary data in one place.
Long-term storage of primary data is still a somewhat complex issue; it seems that the safest way is still to print them on acid-free paper using a laser printer. Another option is to attach the data to published papers (usually as an online appendix); some journals (such as Ecological Monographs or Journal of Ecology) require the data to be attached as a condition for accepting the paper. A further option is to store data in public, managed databases or electronic repositories (e.g. Dryad Digital Repository, www.datadryad.org), which (for free or for a small fee) offer time-unlimited storage of your data, with the advantage that the data become citable through an assigned DOI.
For analysis, the first step is always to get the data into R. In most cases, R expects community ecology data to have samples in rows and species (or other descriptors, such as environmental variables) in columns 1). For work in R, a good practice is to save each data matrix into its own file (preferably a tab-delimited *.txt file), save these files to a dedicated folder, and place a set of lines at the beginning of the script that read the data from the files into the R workspace. This makes the script reproducible: just wrap the script together with the files into a zip file, and the analysis (using the same data) can be fully reproduced. There are also other ways to import data (e.g. directly via the clipboard, from other formats such as *.csv, or directly from Excel's *.xls or *.xlsx files), and RStudio also offers a button for loading data. Still, I suggest that the *.txt format loaded via a script is the most stable solution.
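The import step described above can be sketched as follows. This is only a minimal illustration: the object names (spe, spe_file, spe_in), the plot and species labels, and the use of a temporary folder in place of a real project folder are assumptions made for the example, not part of the original workflow.

```r
# Hypothetical data: three samples (rows) by two species (columns)
spe <- data.frame(Querrobu = c(5, 0, 2),
                  Fagusylv = c(1, 3, 0),
                  row.names = c("plot1", "plot2", "plot3"))

# Save the matrix as a tab-delimited text file; tempdir() stands in
# for the folder where the data files of a project would be kept
spe_file <- file.path(tempdir(), "spe.txt")
write.table(spe, file = spe_file, sep = "\t", quote = FALSE, col.names = NA)

# At the beginning of the analysis script: read the file back into R,
# using the first column (sample names) as row names
spe_in <- read.delim(spe_file, row.names = 1)
dim(spe_in)
```

Keeping all such read.delim calls together at the top of the script makes it easy to see which files the analysis depends on.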
Before importing the file into R, make sure that all variable names are valid R names. A valid name can contain letters, numbers, dots and underscores; it must start with a letter or with a dot that is not followed by a number. R is case sensitive, which means that uppercase and lowercase letters are treated as different characters. An example of a valid name is Var1; an example of an invalid name is .1vars. There is also a list of reserved words that cannot be used as variable names because they have a fixed meaning in R, for example break.
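One way to check names before import is the base R function make.names, which returns a syntactically valid version of each name; a name is valid if it comes back unchanged. The sketch below uses the names mentioned above plus soil_pH, which is a hypothetical variable name added for illustration.

```r
# Candidate variable names; "soil_pH" is a made-up example
candidates <- c("Var1", ".1vars", "break", "soil_pH")

# make.names() repairs invalid names (and reserved words),
# so a name is valid if make.names() leaves it untouched
repaired <- make.names(candidates)
is_valid <- candidates == repaired

data.frame(name = candidates, valid = is_valid, repaired = repaired)
```

Here .1vars and break are flagged as invalid, and the repaired column shows what R would use instead.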
Species names of taxa, consisting of the genus, species and sometimes a subspecific Latin name, may need to be abbreviated. This is not because R cannot handle long variable names (it can; current versions of R allow names of up to 10,000 bytes), but because long species names may be difficult to display, e.g. in an ordination diagram. The library vegan contains a handy function, make.cepnames, which abbreviates species names to at most 8 letters; it takes the first four letters of the genus name and merges them with the first four letters of the species name.
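To make the abbreviation rule concrete, here is a minimal base R sketch of the "four plus four letters" idea. The helper abbreviate_species is hypothetical and written only for illustration; it is not the implementation used in vegan, and in practice one would simply call make.cepnames from vegan on the vector of species names.

```r
# Hypothetical helper illustrating the 4 + 4 letter rule described above;
# this is NOT vegan's make.cepnames, only a simplified stand-in
abbreviate_species <- function(x) {
  parts <- strsplit(x, " +")                  # split into genus, species, ...
  vapply(parts, function(p) {
    genus <- substr(p[1], 1, 4)               # first four letters of the genus
    if (length(p) < 2) return(genus)          # genus-only name: keep genus part
    paste0(genus, substr(p[2], 1, 4))         # add first four letters of species
  }, character(1), USE.NAMES = FALSE)
}

full_names <- c("Quercus robur", "Fagus sylvatica", "Acer campestre")
short_names <- abbreviate_species(full_names)
short_names
```

This sketch does not deal with duplicated abbreviations or subspecific names, which a real data set may well contain.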
The Examples section contains various examples of how to import data from other sources and formats into R.
1) Not always, though: the library iNEXT, for analysis of diversity (iNterpolation/EXTrapolation), expects samples in columns and species (descriptors) in rows; this needs to be kept in mind when using it!
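Switching between the two orientations is a matter of transposing the matrix with the base R function t. The sketch below uses a small made-up matrix (com) with hypothetical plot and species labels; it only shows the transposition and does not call iNEXT itself.

```r
# Hypothetical community matrix: samples in rows, species in columns
com <- matrix(c(5, 0, 2,
                1, 3, 0),
              nrow = 3,
              dimnames = list(c("plot1", "plot2", "plot3"),
                              c("Querrobu", "Fagusylv")))

# Transpose so that species are in rows and samples in columns
com_t <- t(com)
dim(com_t)
```

Row and column names are carried over by t, so it is easy to check which orientation a matrix is in before passing it to a function.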