en:data_preparation

This shows you the differences between two versions of the page.

en:data_preparation [2019/02/10 19:48] David Zelený [Data transformation] |
en:data_preparation [2019/05/23 09:44] David Zelený [Missing values] |
||
---|---|---|---|

Line 8: | Line 8: | ||

==== Missing values ==== | ==== Missing values ==== | ||

- | This is not as trivial as it may sound. Missing data are elements in the matrix with no value, in R usually replaced by ''NA'' (not available). Note that there is an important difference between ''0'' and ''NA''. It makes sense to replace missing value by zero if the entity is really missing (e.g. species was not recorded and gets zero cover or abundance), but it make not sense to replace it by zero if the entity was not recorded (e.g., if I didn't measure pH in some samples because the pH-meter got broken, I should not replace these values by 0, since it does not mean that the pH of that sample is so low). Samples with missing values will be removed from the analysis (often silently without reporting any warning message), and if there are many missing values scattered across different variables, the analysis will be based on rather few samples. One way to reduce this effect is to remove those variables with the highest proportion of missing values from the analysis. Another option is to replace the missing values by estimates if these could be reasonably accurate (mostly by interpolation, e.g. from similar plots, neighbours, values measured at the same time somewhere close, or values predicted by a model). | + | This is not as trivial as it may sound. Missing data are elements in the matrix with no value, in R usually replaced by ''NA'' (not available). Note that there is an important difference between ''0'' and ''NA''. It makes sense to replace missing value by zero if the entity is really missing (e.g. species was not recorded and gets zero cover or abundance), but it makes no sense to replace it by zero if the entity was not recorded (e.g., if I didn't measure pH in some samples because the pH-meter got broken, I should not replace these values by 0, since it does not mean that the pH of that sample is so low). 
Samples with missing values will be removed from the analysis (often silently without reporting any warning message), and if there are many missing values scattered across different variables, the analysis will be based on rather few samples. One way to reduce this effect is to remove those variables with the highest proportion of missing values from the analysis. Another option is to replace the missing values by estimates if these could be reasonably accurate (mostly by interpolation, e.g. from similar plots, neighbours, values measured at the same time somewhere close, or values predicted by a model). |
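The effect of scattered missing values on the number of usable samples can be sketched in a few lines of Python. This is only an illustration: Python's `float("nan")` stands in for R's ''NA'', and the small table with the variables `pH`, `depth` and `moisture` is invented, not taken from any dataset mentioned here.

```python
import math

# Hypothetical sample-by-variable table; NaN marks a value that was not measured.
NA = float("nan")
data = {
    "pH":       [4.2, NA, 5.1, NA, 4.8],
    "depth":    [30.0, 25.0, NA, 40.0, 35.0],
    "moisture": [0.3, 0.4, 0.2, 0.5, 0.1],
}

def complete_rows(table):
    """Indices of samples (rows) with no missing value in any variable."""
    n = len(next(iter(table.values())))
    return [i for i in range(n)
            if not any(math.isnan(col[i]) for col in table.values())]

def missing_fraction(table):
    """Proportion of missing values for each variable."""
    return {name: sum(math.isnan(v) for v in col) / len(col)
            for name, col in table.items()}

# Samples that survive when every sample with a missing value is dropped.
all_vars = complete_rows(data)

# Drop the variable with the highest proportion of missing values and recount.
fractions = missing_fraction(data)
worst = max(fractions, key=fractions.get)
reduced = {name: col for name, col in data.items() if name != worst}
without_worst = complete_rows(reduced)

print(len(all_vars), "complete samples with all variables")
print(len(without_worst), "complete samples after dropping", worst)
```

In this toy table, removing the single variable with the most gaps doubles the number of samples that remain available for the analysis.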

==== Outliers ==== | ==== Outliers ==== | ||

Line 17: | Line 17: | ||

<imgcaption boxplot_definition_outliers left|Definition of outliers in the box plot. An outlier is shown by circle below the non-outlier range of values.>{{:obrazky:boxplot_definition_outliers.jpg?direct|}}</imgcaption> | <imgcaption boxplot_definition_outliers left|Definition of outliers in the box plot. An outlier is shown by circle below the non-outlier range of values.>{{:obrazky:boxplot_definition_outliers.jpg?direct|}}</imgcaption> | ||

- | Using pH values available from [[en:data:vltava|Vltava river valley dataset]] as an example, <imgref vltava_ph_boxplot_histogram> illustrates the use of box plot and histogram to identify outliers((Script to draw this figure can be found [[en:scripts:vltava_ph_boxplot_histogram_script|here]].)). Boxplot indicates that there are four outliers with pH value too high. Histogram confirms that three pH values above 6.0 are really separated by a gap from the other pH values. Closer examination reveals that three samples (namely 32, 33 and 34, highlighted by red colour) are from the same transect which was made in a limestone bedrock, which is why they have rather high values of soil pH. When I was sampling the data, I was aware that there is limestone, and I hoped to have high pH samples in my dataset; that time I did not think that it will be the only three plots between all 97 plots which will be on a limestone. These values are therefore not a mistake, but they are outliers since they describe a different phenomenon (forest on limestone bedrock) which does not have enough replicates in the dataset. I may either delete them or go back to the field and try to collect more limestone samples. The fourth value indicated as outlier by boxplot (highlighted by green color) is a sample done in different area, perhaps also on some small limestone patch; however, since I am not sure with that, I would not remove it as an outlier (according to histogram this value fits to the overall distribution, although this would change if the histogram breaks are set up more fine). | + | Using pH values available from [[en:data:vltava|Vltava river valley dataset]] as an example, <imgref vltava_ph_boxplot_histogram> illustrates the use of box plot and histogram to identify outliers((Script to draw this figure can be found [[en:scripts:vltava_ph_boxplot_histogram_script|here]].)). Boxplot indicates that there are four outliers with pH value too high. 
The histogram confirms that three pH values above 6.0 are clearly separated by a gap from the other pH values. Closer examination reveals that these three samples (namely 32, 33 and 34, highlighted by red colour) come from the same transect, which was located on limestone bedrock; this is why they have rather high values of soil pH. When I was sampling the data, I was aware of the limestone and hoped to have high-pH samples in my dataset; at that time I did not expect that these would be the only three of all 97 plots located on limestone. These values are therefore not a mistake, but they are outliers, since they describe a different phenomenon (forest on limestone bedrock) which does not have enough replicates in the dataset. I may either delete them or go back to the field and try to collect more limestone samples. The fourth value indicated as an outlier by the boxplot (highlighted by green colour) is a sample from a different area, perhaps also on some small limestone patch; however, since I am not sure about that, I would not remove it as an outlier (according to the histogram this value fits the overall distribution, although this would change if finer histogram breaks were used). |

<imgcaption vltava_ph_boxplot_histogram left|Boxplot (above) and histogram (below) of soil pH values from Vltava river valley dataset.>{{:obrazky:vltava_ph_values_outliers_boxplot_histogram.png?direct|}}</imgcaption> | <imgcaption vltava_ph_boxplot_histogram left|Boxplot (above) and histogram (below) of soil pH values from Vltava river valley dataset.>{{:obrazky:vltava_ph_values_outliers_boxplot_histogram.png?direct|}}</imgcaption> | ||
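The box plot criterion can also be applied numerically. The sketch below assumes the common convention (the default in R's `boxplot`) that values lying more than 1.5 times the interquartile range beyond the quartiles are flagged. The function name `boxplot_outliers` and the pH values are hypothetical, not the Vltava data, and the quartiles are computed by linear interpolation, which may differ slightly from the hinges R uses.

```python
import statistics

def boxplot_outliers(values, k=1.5):
    """Return values lying more than k * IQR below Q1 or above Q3."""
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if v < low or v > high]

# Invented soil pH values with two unusually high measurements.
ph = [3.6, 3.8, 3.9, 4.0, 4.1, 4.2, 4.3, 4.5, 4.7, 6.8, 7.1]
print(boxplot_outliers(ph))
```

A flagged value is only a candidate: as discussed above, whether it is removed should depend on what is known about the sample.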

- | FIXME | ||

==== Data transformation ==== | ==== Data transformation ==== | ||

Data transformation changes relative differences among individual values and consequently also their distribution. We may want to transform data either because (some) statistical analyses and tests require residuals that are approximately normally distributed and have homogeneous variance (homoscedasticity), i.e. no relationship between variance and mean, or because linear relationships may be easier to interpret than non-linear. | Data transformation changes relative differences among individual values and consequently also their distribution. We may want to transform data either because (some) statistical analyses and tests require residuals that are approximately normally distributed and have homogeneous variance (homoscedasticity), i.e. no relationship between variance and mean, or because linear relationships may be easier to interpret than non-linear. | ||

Line 37: | Line 36: | ||

Log transformation is suitable for strongly right-skewed data with log-normal distribution (with the relationship between mean and variance): | Log transformation is suitable for strongly right-skewed data with log-normal distribution (with the relationship between mean and variance): | ||

- | <m 20>y=log (y)</m> or <m 20>y'=log (ay+c)</m> | + | <m 20>y{prime}=log (y)</m> or <m 20>y{prime}=log (ay+c)</m> |

- | where constant //a// is usually 1, but if //y// is from interval <0;1>, than //a// > 1 (to maintain positive y’ values); constant //c// can be added if y contains zeroes, since log (0) is not defined (-Inf), and can be 1 or some arbitrary selected small value (e.g. 0.001). Note that constant //c// can influence results of analysis (e.g. ANOVA), and it is better to select the value which makes the transformed distribution the most symmetrical. Example on <imgref raw-log-popul> shows relationship between area of the country and it’s population; both variables are strongly right skewed, and without transformation the whole relationship is driven by few large or populous countries; after log transformation, strong correlation appears. | + | |

+ | where the constant //a// is usually 1, but if //y// is from the interval <0;1>, then //a// > 1 (to maintain positive y’ values); the constant //c// can be added if //y// contains zeroes, since log (0) is not defined (R returns -Inf), and can be 1 or some arbitrarily selected small value (e.g. 0.001). Note that the constant //c// can influence the results of the analysis (e.g. ANOVA), and it is better to select the value which makes the transformed distribution the most symmetrical. The example in <imgref raw-log-popul> shows the relationship between the area of a country and its population; both variables are strongly right-skewed, and without transformation, the whole relationship is driven by a few large or populous countries; after log transformation, a strong correlation appears. | ||

<imgcaption raw-log-popul|>{{:obrazky:raw-vs-log-population-area.png?direct|}}</imgcaption> | <imgcaption raw-log-popul|>{{:obrazky:raw-vs-log-population-area.png?direct|}}</imgcaption> | ||
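How much the constant //c// matters can be seen in a small sketch of the formula y' = log(ay + c). The function name `log_transform` and the example values are hypothetical, and the natural logarithm is assumed, since the base is not stated above.

```python
import math

def log_transform(values, a=1.0, c=0.0):
    """Apply y' = log(a*y + c) using the natural logarithm."""
    # Unlike R, which returns -Inf, math.log(0) raises ValueError in Python.
    return [math.log(a * y + c) for y in values]

# Invented right-skewed values that contain a zero.
counts = [0, 1, 9, 99, 999]

with_one = log_transform(counts, c=1)        # the zero maps to 0
with_small = log_transform(counts, c=0.001)  # the zero maps far below the rest

print([round(v, 2) for v in with_one])
print([round(v, 2) for v in with_small])
```

With a very small //c//, the zero ends up isolated far from the other transformed values, which is one way the choice of constant can change the results of a later analysis.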

===Square-root transformation=== | ===Square-root transformation=== | ||

Suitable for slightly right-skewed data: | Suitable for slightly right-skewed data: | ||

- | <m 20>y’ = sqrt (y)</m> or <m 20>y’ = sqrt (y+c)</m> | + | <m 20>y{prime} = sqrt{y}</m> or <m 20>y{prime} = sqrt{y+c}</m> |

where the constant //c// can be added if the values contain zeros, and can be e.g. 0.5 or 3/8 (0.375); a higher-root transformation is more powerful for right-skewed data (a fourth- or higher-root transformation essentially approaches the presence-absence transformation). While log transformation is suitable for strongly right-skewed data, sqrt transformation is suitable for slightly right-skewed data (<imgref log-sqrt-pig>). | where the constant //c// can be added if the values contain zeros, and can be e.g. 0.5 or 3/8 (0.375); a higher-root transformation is more powerful for right-skewed data (a fourth- or higher-root transformation essentially approaches the presence-absence transformation). While log transformation is suitable for strongly right-skewed data, sqrt transformation is suitable for slightly right-skewed data (<imgref log-sqrt-pig>). | ||

+ | |||

<imgcaption log-sqrt-pig|Difference between log and sqrt transformation. For the meaning of the pig shape, see below.>{{:obrazky:log-sqrt-pig-transformation-legleg.jpg?direct|}}</imgcaption> | <imgcaption log-sqrt-pig|Difference between log and sqrt transformation. For the meaning of the pig shape, see below.>{{:obrazky:log-sqrt-pig-transformation-legleg.jpg?direct|}}</imgcaption> | ||
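The different strength of the two transformations can be checked by comparing the skewness of the same values before and after each one. This is a rough sketch with invented data; `skewness` here is the plain moment coefficient (third standardised moment), which is only one of several definitions in use.

```python
import math
import statistics

def skewness(values):
    """Moment coefficient of skewness (third standardised moment)."""
    m = statistics.mean(values)
    n = len(values)
    m2 = sum((y - m) ** 2 for y in values) / n
    m3 = sum((y - m) ** 3 for y in values) / n
    return m3 / m2 ** 1.5

def sqrt_transform(values, c=0.0):
    """Apply y' = sqrt(y + c)."""
    return [math.sqrt(y + c) for y in values]

def log_transform(values, c=0.0):
    """Apply y' = log(y + c) using the natural logarithm."""
    return [math.log(y + c) for y in values]

# Invented, strongly right-skewed values (no zeros, so c can stay 0).
raw = [1, 2, 2, 3, 4, 6, 9, 15, 40, 120]

skew_raw = skewness(raw)
skew_sqrt = skewness(sqrt_transform(raw))
skew_log = skewness(log_transform(raw))
print(round(skew_raw, 2), round(skew_sqrt, 2), round(skew_log, 2))
```

For these values the skewness drops after the square-root transformation and drops further after the log transformation, which matches the rule of thumb above.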

===Power transformation=== | ===Power transformation=== | ||

Suitable for left-skewed data: | Suitable for left-skewed data: | ||

- | <m 20>y’ = y^p</m> [to raise //y// on the power of //p//] | ||

- | with p values lower than one becoming root transformation (p = 0.5 – square-root, p = 0.25 – fourth-root etc.) | ||

- | ==== Data standardization ==== | + | <m 20>y{prime} = y^p</m> [to raise //y// to the power of //p//] |

+ | which, with //p// values lower than one, becomes root transformation (p = 0.5 – square-root, p = 0.25 – fourth-root etc.) | ||

+ | === Arcsin transformation (angular transformation) === | ||

+ | Suitable for percentage values (and ratios in general): | ||

+ | <m 20>y{prime}=arcsin(y)</m> or <m 20>y{prime}=arcsin(sqrt{y})</m> | ||

+ | |||

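+ | where //y// values must be in the range [−1, 1] for the first formula and in the range [0, 1] for the second (so percentages need to be expressed as proportions first); the transformed values are in radians, within the range [−𝜋/2, 𝜋/2] and [0, 𝜋/2], respectively. | ||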
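A minimal sketch of the square-root variant of the formula, assuming the input is given as percentages between 0 and 100 and is first rescaled to proportions. The function name `arcsine_sqrt` is hypothetical.

```python
import math

def arcsine_sqrt(percentages):
    """Angular transformation y' = arcsin(sqrt(p)), with p = percentage / 100."""
    transformed = []
    for pct in percentages:
        p = pct / 100.0
        if not 0.0 <= p <= 1.0:
            raise ValueError(f"percentage out of range: {pct}")
        transformed.append(math.asin(math.sqrt(p)))  # result in radians
    return transformed

covers = [0, 5, 50, 95, 100]  # invented percentage values
print([round(v, 3) for v in arcsine_sqrt(covers)])
```

The transformation stretches the scale near 0% and 100% relative to the middle, so equal steps in percentage become larger steps in the transformed values towards both ends.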

+ | |||

+ | === Reciprocal transformation === | ||

+ | |||

+ | Suitable for ratios (e.g. height/weight body ratio, number of children in population per number of females): | ||

+ | |||

+ | <m 20>y{prime}=1/y</m> | ||

+ | |||

+ | === Transformation to normal pig === | ||

+ | <imgcaption image1|The pig in the middle is the "normal pig", with parameters as they should be. All other pigs can be transformed into the normal pig by the transformations given in the upper and left figure margins. So, for example, if the distribution of your data looks like the pig in the lower-right corner, you may need to apply the log transformation to both the x and y variables to obtain the normal pig.>{{:obrazky:trans.pig.png?direct|}}</imgcaption> | ||

==== Data standardization ==== | ==== Data standardization ==== | ||

+ | Standardization changes the data using a statistic calculated from the data itself, e.g. the mean, range or sum of values (it is data dependent). The most common reason to apply standardization is to remove differences in the relative weights (importance) of individual variables or samples. | ||

+ | |||

+ | === Centring === | ||

+ | The centred variable has a mean equal to zero: | ||

+ | |||

+ | <m 20>y{prime}=y-mean(y)</m> | ||

+ | |||

+ | === Standardization sensu stricto (also called “z-scores”) === | ||

+ | Standardised variable has mean equal to zero and standard deviation equal to one: | ||

+ | |||

+ | <m 20>y{prime}={y-mean(y)}/{sd(y)}</m> | ||

+ | |||

+ | Used to synchronise the variables measured in different units and using different scales. | ||
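Both formulas can be sketched with Python's `statistics` module. The function names are hypothetical, and `statistics.stdev` is the sample standard deviation (denominator n − 1), the same definition R's `sd()` uses.

```python
import statistics

def centre(values):
    """Apply y' = y - mean(y)."""
    m = statistics.mean(values)
    return [y - m for y in values]

def z_scores(values):
    """Apply y' = (y - mean(y)) / sd(y), with the sample standard deviation."""
    m = statistics.mean(values)
    s = statistics.stdev(values)
    return [(y - m) / s for y in values]

values = [2.0, 4.0, 6.0]   # invented measurements
centred = centre(values)
standardised = z_scores(values)
print(centred, standardised)
```

After standardization, variables that were recorded in different units (for example centimetres and pH units) are all expressed in units of their own standard deviation.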

+ | |||

+ | === Ranging === | ||

+ | |||

+ | Changes the range of a variable, e.g. into [0, 1]: | ||

+ | |||

+ | <m 20>y{prime}=y/y_max</m> or <m 20>y{prime}={y-y_min}/{y_max-y_min}</m> | ||

+ | |||

+ | where the first formula is for a variable on a relative scale (starts at zero, i.e. y<sub>min</sub> = 0), while the second formula is for general variables. | ||
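The two ranging formulas differ only in whether the minimum is subtracted first. A short sketch, with hypothetical function names and invented values:

```python
def range_relative(values):
    """Apply y' = y / y_max (for variables whose scale starts at zero)."""
    y_max = max(values)
    return [y / y_max for y in values]

def range_general(values):
    """Apply y' = (y - y_min) / (y_max - y_min)."""
    y_min, y_max = min(values), max(values)
    # A constant variable (y_max == y_min) cannot be ranged this way.
    return [(y - y_min) / (y_max - y_min) for y in values]

cover = [0, 5, 10]               # relative scale, starts at zero
temperature = [12.0, 15.0, 18.0]  # general variable with an arbitrary origin

print(range_relative(cover))
print(range_general(temperature))
```

For a variable that does not start at zero, the first formula would leave the smallest value above 0, so only the second one guarantees the full [0, 1] range.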

+ | |||

+ | |||

+ | ==== Special case: transformation and standardisation of species composition matrix ==== | ||

+ | While the variables in the environmental or trait matrix are often of very different types (qualitative, quantitative, ordinal) and measured in very different units, the species composition matrix is homogeneous, with all variables (species) measured in the same units (frequencies, abundances, covers, presences-absences). | ||

+ | |||

+ | It is always good to check which units and what range of values are used to quantify the occurrence of species in the samples, and **transform data** accordingly. For example, if the values are percentage estimates of plant cover (often used in vegetation studies), log or sqrt transformation may be necessary, since these covers often have a very right-skewed distribution (covers between 1-15% are far more common than covers >25%). However, if the plant cover has been estimated on the Braun-Blanquet scale (//r// = 0.01% of cover, //+// = 0.1%, //1// = 1%, //2m// = 5%, //2a// = , //2b// = , //3// = , //4// = , //5// = ) and these values are then transformed into an ordinal scale (//r// -> 1, //+// -> 2, //1// -> 3, ..., //5// -> 9), these 1-9 ordinal values already contain an implicit log transformation in comparison to percentage cover and do not need to be transformed further. In some cases, transforming data into presences-absences may be useful, e.g. if the estimates of species abundances are inaccurate or data are merged from different sources using different scales or estimation methods. | ||

+ | |||

+ | Species composition data are also often subjected to standardisation, either by species (columns) or samples (rows) (<imgref stand-row-col>). **Standardization by species** gives all species the same importance (i.e. species with overall lower abundances will be as important as species with overall higher abundances). It may not always be meaningful, e.g. if a species occurs in only one sample, standardization by species will put a high weight on this sample and it will become very different from the others. **Standardization by samples** is useful when the analysis is focused on relative proportions of species, not their absolute abundances, e.g. because recorded abundances depend on sampling effort, and this effort differs between samples (the effort is related to the time spent at the plot or the number of traps, or can be influenced by bad weather affecting the mobility of the sampled organisms). | ||

+ | |||

+ | <imgcaption stand-row-col|>{{:obrazky:stand_byrow_bycol.jpg?direct|}}</imgcaption> | ||
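The difference between the two directions can be sketched on a tiny matrix. The text above does not fix a particular statistic, so this example assumes division by the row total for samples and by the column maximum for species; both are common choices, but other statistics could be used. The function names and the abundances are hypothetical.

```python
def standardise_by_samples(matrix):
    """Divide each value by its row (sample) total, giving relative abundances."""
    # A sample with no species at all (row total 0) would raise ZeroDivisionError.
    return [[y / sum(row) for y in row] for row in matrix]

def standardise_by_species(matrix):
    """Divide each value by its column (species) maximum."""
    col_max = [max(col) for col in zip(*matrix)]
    return [[y / m for y, m in zip(row, col_max)] for row in matrix]

# Invented abundances: 3 samples (rows) x 3 species (columns).
abund = [
    [10, 0, 2],
    [40, 4, 0],
    [5, 1, 1],
]

by_samples = standardise_by_samples(abund)
by_species = standardise_by_species(abund)

for row in by_samples:
    print([round(v, 2) for v in row])
for row in by_species:
    print([round(v, 2) for v in row])
```

After standardisation by samples every row sums to one; after standardisation by species every species reaches one in the sample where it is most abundant, regardless of how abundant it is overall.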

+ | |||

+ | **Hellinger standardisation** deserves special attention here, because it is a method of pre-transforming species composition data for use in linear ordination methods, resulting in transformation-based ordination analysis (tb-PCA, tb-RDA). The formula of Hellinger standardisation is: | ||

+ | |||

+ | <m 20>y{prime}_{ij}=sqrt{{y_{ij}}/{y_{i+}}}</m> | ||

+ | |||

+ | where y<sub>ij</sub> is the abundance of species //j// in sample //i//, and y<sub>i+</sub> is the sum of abundances of all species in sample //i// (row sum). As is clear from the formula, it removes differences in absolute abundances between samples. It calculates relative species abundances per site (species profiles), and these relative values are then square-rooted, which reduces the effect of species with high abundances (<imgref stand-hell>). Euclidean distance applied to Hellinger-standardized data results in the Hellinger distance, which has suitable properties for the analysis of community data. | ||

+ | <imgcaption stand-hell|>{{:obrazky:stand_hellinger_examplel.jpg?direct|}}</imgcaption> | ||
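The Hellinger formula translates directly into code. The sketch below uses hypothetical function names and invented abundances. It shows that two samples with the same relative composition but different total abundances become identical after standardisation, so the Euclidean distance between them is zero.

```python
import math

def hellinger(matrix):
    """Apply y'_ij = sqrt(y_ij / y_i+), where y_i+ is the row (sample) total."""
    result = []
    for row in matrix:
        total = sum(row)  # y_i+; an empty sample (total 0) cannot be standardised
        result.append([math.sqrt(y / total) for y in row])
    return result

def euclidean(a, b):
    """Euclidean distance between two samples."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

# Invented abundances: samples 1 and 2 have the same proportions of species.
abund = [
    [1, 3],
    [10, 30],
    [30, 10],
]

h = hellinger(abund)
print(euclidean(abund[0], abund[1]))  # large on raw abundances
print(euclidean(h[0], h[1]))          # zero after Hellinger standardisation
print(euclidean(h[0], h[2]))          # Hellinger distance between samples 1 and 3
```

Because each standardised row has squared values summing to one, the resulting distance between any two samples cannot exceed the square root of two.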

en/data_preparation.txt · Last modified: 2019/05/23 09:54 by David Zelený