Preparation of data for analysis

Before the main analysis, and after the data have been imported into R, it is useful to screen the data first: check for missing values or outliers, and apply a transformation or standardization if necessary.

Search for missing values

This is not as trivial as it may sound. Missing data are elements of the matrix with no value, usually represented in R by NA (not available). Note that 0 and NA are different things, and it does not always make sense to replace missing values in a dataset with zeroes 1). Samples with missing values will in most cases be removed from the analysis (often silently, without R issuing a warning), and if many missing values are scattered across different variables, the analysis may end up being based on rather few samples. One way to reduce this effect is to remove the variables with the highest proportion of missing values from the analysis. Another option is to replace the missing values by estimates (from similar plots, from neighbours, from values measured at the same time somewhere close by, or from values predicted by a model based on the real values).
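The effect of scattered missing values on the number of usable samples can be sketched on a small made-up table. The data frame env, its variable names and the 50% threshold are hypothetical and chosen only for illustration:

```r
# Hypothetical toy data: 5 samples x 3 variables, NA marks a missing measurement
env <- data.frame(
  pH       = c(6.5, NA, 7.1, 6.8, 7.0),
  moisture = c(NA, NA, 3, NA, 5),
  nitrogen = c(1.2, 0.8, 1.5, 1.1, NA),
  row.names = paste0("plot", 1:5)
)

# Proportion of missing values in each variable
na.prop <- colMeans(is.na(env))

# Number of samples without any missing value when all variables are kept
n.all <- sum(complete.cases(env))

# Drop variables with more than half of their values missing
# (the 0.5 threshold is arbitrary, not a recommendation)
env.reduced <- env[, na.prop <= 0.5, drop = FALSE]
n.reduced <- sum(complete.cases(env.reduced))

na.prop
c(all.variables = n.all, reduced = n.reduced)
```

In this toy case, removing the single variable with most gaps raises the number of complete samples from one to three. Whether dropping a variable is acceptable depends on how important it is for the analysis.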

As an example, let's use the Danube meadow dataset with Ellenberg indicator values for individual species. This is a dataset of species attributes: rows are species, and columns are Ellenberg indicator values, which estimate the ecological optimum of each species along a given environmental variable:

danube.ell <- read.delim ('http://www.davidzeleny.net/anadat-r/data-download/danube.ell.txt', row.names = 1)
 

A simple way to check for missing values, if the data are in matrix or data.frame format, is to use summary:

summary (danube.ell)
     Light            Temp            Cont           Moist            React            Nutr      
 Min.   :4.000   Min.   :4.000   Min.   :2.000   Min.   : 2.000   Min.   :3.000   Min.   :1.000  
 1st Qu.:6.250   1st Qu.:5.000   1st Qu.:3.000   1st Qu.: 4.000   1st Qu.:6.500   1st Qu.:4.000  
 Median :7.000   Median :5.000   Median :3.000   Median : 5.000   Median :7.000   Median :5.000  
 Mean   :6.889   Mean   :5.435   Mean   :3.901   Mean   : 5.524   Mean   :6.851   Mean   :4.938  
 3rd Qu.:7.000   3rd Qu.:6.000   3rd Qu.:5.000   3rd Qu.: 6.000   3rd Qu.:7.000   3rd Qu.:6.000  
 Max.   :8.000   Max.   :6.000   Max.   :7.000   Max.   :10.000   Max.   :8.000   Max.   :9.000  
 NA's   :4       NA's   :48      NA's   :23      NA's   :10       NA's   :47      NA's   :13     

The last row of the summary output shows the number of missing values in each variable. But which values are missing?

[Figure: Missing values in the dataset visualized using the image function. Yellow fields are those with missing values.]

which (is.na (danube.ell), arr.ind = TRUE)
                           row col
Chenopodium album           22   1
Festuca rubra               34   1
Melandrium diurnum          58   1
Vicia sepium                93   1
Achillea millefolium         1   2
Ajuga reptans                2   2
Alopecurus pratensis         4   2
Angelica sylvestris          5   2
Anthriscus silvestris        6   2
...

Here, the function is.na transforms the danube.ell data frame into a matrix of logical values: TRUE if the value is missing and FALSE if it is present. The function which searches for TRUE values in this matrix, and with the argument arr.ind = TRUE it returns the coordinates (row and column) of each missing value.
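The two steps can be followed on a table small enough to check by eye. The object ell and its species names are made up for this sketch:

```r
# Hypothetical toy table: 3 species x 2 indicator values
ell <- data.frame(
  Light = c(7, NA, 6),
  Moist = c(NA, 5, NA),
  row.names = c("sp_a", "sp_b", "sp_c")
)

na.mask <- is.na(ell)                      # logical matrix, TRUE = missing
na.pos  <- which(na.mask, arr.ind = TRUE)  # one row of coordinates per NA

na.mask
na.pos
colSums(na.mask)                           # number of NAs per variable
```

The coordinates in na.pos are listed column by column, which is why all missing values of the first variable come before those of the second one, as in the Danube output above.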

And do you want to see how many holes your dataset has?

image (t (as.matrix (is.na (danube.ell))))

The function image draws a “heatmap” of the values in a matrix-like object (here it has to be a matrix, which is why I converted the danube.ell data frame with as.matrix). Since I am only interested in whether a value is present or missing (NA), I also used is.na to turn the original values into logical TRUE/FALSE values. In R versions before 3.6.0, the default colour palette of image is heat.colors, in which low values are red and high values are yellow to almost white; here the high values are TRUE, meaning missing values, which are therefore yellow in the resulting figure. (Since R 3.6.0 the default is hcl.colors (12, 'YlOrRd', rev = TRUE), which runs from light yellow for low values to dark red for high values, so there the missing values are dark red.) I also transposed the matrix using the function t, otherwise the columns would be drawn horizontally. Note, however, that drawing starts from the bottom left corner, not from the top left, so the visualization is upside down compared to the values in the original matrix.
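One possible way around both the palette and the upside-down orientation is to set the colours explicitly and reverse the row order before plotting. This is only a sketch on a made-up table (ell, the species names and the two colours are assumptions, not part of the Danube example):

```r
# Hypothetical toy table: 4 species x 3 indicator values
ell <- data.frame(
  Light = c(7, NA, 6, 8),
  Temp  = c(NA, NA, 5, 6),
  Moist = c(5, 4, NA, 6),
  row.names = c("sp_a", "sp_b", "sp_c", "sp_d")
)

na.mask <- is.na(as.matrix(ell))

# Reverse the row order so that the first species ends up at the top of the plot
flipped <- na.mask[nrow(na.mask):1, ]

# Explicit colours (present = grey, missing = red), so the result does not
# depend on the default palette; "* 1" converts TRUE/FALSE to 1/0
image(x = 1:ncol(flipped), y = 1:nrow(flipped), z = t(flipped) * 1,
      col = c("grey90", "red"), axes = FALSE, xlab = "", ylab = "")
axis(3, at = 1:ncol(flipped), labels = colnames(flipped))
axis(2, at = 1:nrow(flipped), labels = rownames(flipped), las = 1)
```

For a table with many species the row labels will overlap, so the second axis call is probably best left out there.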

Data transformation

Basic R functions are sufficient for data transformation. Let's say that the variable veg.data contains a samples x species matrix, in which species abundances are expressed as percentages (range 0-100%).

Square-root transformation

veg.data.transf <- sqrt (veg.data)

Logarithmic transformation

veg.data.transf <- log1p (veg.data)

log1p is a function which calculates the natural logarithm of its argument after adding one to it (because of zeroes, whose logarithm is not defined). The same result can be obtained like this:

veg.data.transf <- log (veg.data + 1)
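Both transformations, and the equivalence of log1p (x) and log (x + 1), can be checked on a small made-up matrix. The object veg.toy is hypothetical and stands in for veg.data:

```r
# Hypothetical samples x species matrix with percentage covers
veg.toy <- matrix(c(0, 1, 25, 100,
                    4, 0,  9,  49),
                  nrow = 2, byrow = TRUE,
                  dimnames = list(c("s1", "s2"), paste0("sp", 1:4)))

veg.sqrt <- sqrt(veg.toy)    # square-root transformation
veg.log  <- log1p(veg.toy)   # natural logarithm of (x + 1)

veg.sqrt
veg.log

# The two ways of writing the log transformation should agree
all.equal(veg.log, log(veg.toy + 1))
```

Note that zero abundances stay zero after both transformations, which is the reason for adding one before taking the logarithm.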

Standardization

Standardization, unlike transformation, means that individual values of the matrix are changed relative to other values. For example, if I standardize the matrix by columns, then for each column I calculate the value of some statistic (e.g. the sum) and divide the values of the individual cells of that column by it.
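The column-wise example with the sum as the statistic could be written in base R roughly as follows. The matrix veg.toy is made up, and sweep is just one of several ways to do the division:

```r
# Hypothetical samples x species matrix
veg.toy <- matrix(c(10,  0, 30,
                    30,  5, 10,
                    60, 15,  0),
                  nrow = 3, byrow = TRUE,
                  dimnames = list(paste0("s", 1:3), paste0("sp", 1:3)))

# Statistic calculated for each column: here the column total
col.tot <- colSums(veg.toy)

# Divide every cell by the total of its own column
veg.rel <- sweep(veg.toy, MARGIN = 2, STATS = col.tot, FUN = "/")

veg.rel
colSums(veg.rel)   # every column should now sum to one
```

Each value now expresses the share of a sample in the total abundance of that species, so it depends on the other values in the same column, which a transformation such as sqrt never does.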

A whole range of standardization methods is available in the function decostand, which is part of the vegan package.

Standardization to a uniform range (ranging)

Rescales the values in each column to the range 0 to 1:

veg.data.stand <- decostand (veg.data, method = 'range')

To see what the result looks like, we can use for example:

apply (veg.data.stand, MARGIN = 2, FUN = range)

which should return a series of zeroes and ones for the individual columns.
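For readers who want to see the arithmetic behind ranging, here is a base R sketch of the usual formula (value minus column minimum, divided by column maximum minus minimum). It is an illustration of the idea, not the code of decostand; veg.toy and to.range are hypothetical names, and the sketch does not handle constant columns, where the denominator is zero:

```r
# Hypothetical samples x species matrix
veg.toy <- matrix(c(10,  0, 30,
                    30,  5, 10,
                    60, 15,  0),
                  nrow = 3, byrow = TRUE,
                  dimnames = list(paste0("s", 1:3), paste0("sp", 1:3)))

# Rescale one vector so that its minimum becomes 0 and its maximum 1
to.range <- function(v) (v - min(v)) / (max(v) - min(v))

# Apply the rescaling to each column separately
veg.ranged <- apply(veg.toy, MARGIN = 2, FUN = to.range)

veg.ranged
apply(veg.ranged, MARGIN = 2, FUN = range)   # first row zeroes, second row ones
```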

Standardization sensu stricto

Standardization in the narrow sense of the word is standardization to zero mean and unit variance:

veg.data.stand <- decostand (veg.data, method = 'standardize')
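The same kind of column-wise standardization can be sketched in base R with the function scale, which makes it easy to verify the zero mean and unit variance. The matrix veg.toy is made up for this illustration:

```r
# Hypothetical samples x species matrix
veg.toy <- matrix(c(10,  0, 30,
                    30,  5, 10,
                    60, 15,  0),
                  nrow = 3, byrow = TRUE,
                  dimnames = list(paste0("s", 1:3), paste0("sp", 1:3)))

# Centre each column on its mean and divide by its standard deviation
veg.z <- scale(veg.toy, center = TRUE, scale = TRUE)

round(colMeans(veg.z), 10)          # column means, zero up to rounding error
apply(veg.z, MARGIN = 2, FUN = sd)  # column standard deviations
```

After this step all variables are on the same scale, regardless of their original units.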

1) E.g. if I didn't measure pH in some samples because the pH meter got broken, I should not replace these values by 0.
en/data_preparation.txt · Last modified: 2017/02/17 09:29 by David Zelený