User Tools

Site Tools


recol:dplyr

Data manipulation with dplyr and tidyr packages

In this section we focus on the use of dplyr and tidyr, two of the tidyverse packages. We will see how to group and summarise data in dplyr, and how to reformat long to wide and wide to long format of the data frame (or tibble, as it is called in tidyverse) using tidyr. We will use two set of real ecological datasets for this purpose.

Piping data with %>% from magrittr

But first, let's have a reminder of the piping operator %>% from the package margrittr, which is used to pipe data into the function and also the output of one function to the other function. Here I will directly upload all tidyverse packages using library (tidyverse), but for piping, library (magrittr) would be enough.

For example, instead of writing

result <- sum (log (sqrt (abs (-10:10))))

you can write the sequence using the pipe operator as

library (tidyverse)
result <- -10:10 %>% abs %>% sqrt %>% log1p %>% sum

Piping produces cleaner code and allows you to create a logical pipeline of the functions, allowing to imagine better how the data flow through it (from left to right).

If not specified, the data get piped to the first argument of the function, so

cars %>% plot

will plot the scatterplot (try it!). I can include the rest of the arguments in the parenthesis of the function, while simply ignoring (skipping) the first one:

cars %>% plot (xlab = 'Speed [mph]', ylab = 'Distance [ft]')

Alternatively, it can be specified and replaced by a placeholder, a dot (.):

cars %>% plot (., xlab = 'Speed [mph]', ylab = 'Distance [ft]')

In this way, the data can be piped also to other arguments (here xlab):

'Speed [mph]' %>% plot (cars, xlab = ., ylab = 'Distance [ft]')

The original package from which pipes come (magrittr) contains also compound assignment operator %<>%, which pipes the result of the pipeline back to the original variable:

a <- 1:10
a %<>% sin %>% abs
a
 [1] 0.8414710 0.9092974 0.1411200 0.7568025 0.9589243 0.2794155 0.6569866 0.9893582 0.4121185
[10] 0.5440211

This piping operator is not imported to dplyr, so if you call dplyr directly (instead of tidyverse), you are not likely be able to use it.

Check magrittr reference for more details.

tibble – the analogy of data.frame, used by tidyverse

The tidyverse library introduces its own object type for storing two-dimensional (spreadsheet) data, so called tibble. It has many features shared with data.frame used by base R, but has some of the characteristics that make it different. Let’s convert the data frame iris into the tibble format, and observe some of its features.

iris_tb <- as_tibble (iris)

For example, when you print tibble on the screen, the output is tidy, includes some additional information (e.g. the number of rows and columns at the first row, the type of the variables in the columns on the third row), and responsively modifies the amount of outputted information according to the available space in the command line window (not displayed variables and rows will be summarised below the printed output).

iris_tb   # or you can also call print (iris_tb)
# A tibble: 150 x 5
   Sepal.Length Sepal.Width Petal.Length
          <dbl>       <dbl>        <dbl>
 1          5.1         3.5          1.4
 2          4.9         3            1.4
 3          4.7         3.2          1.3
 4          4.6         3.1          1.5
 5          5           3.6          1.4
 6          5.4         3.9          1.7
 7          4.6         3.4          1.4
 8          5           3.4          1.5
 9          4.4         2.9          1.4
10          4.9         3.1          1.5
# ... with 140 more rows, and 2 more variables:
#   Petal.Width <dbl>, Species <fct>

In contrast to data.frame, tibble does not support row names (and stores these usually as the first column). The variables in tibble are created dynamically, so when creating tibble, you can use the values in previous columns to create the next one.

t1 <- tibble (a = 1:10, b = 11:20, d = a+b)
t1
# A tibble: 10 x 3
       a     b     d
   <int> <int> <int>
 1     1    11    12
 2     2    12    14
 3     3    13    16
 4     4    14    18
 5     5    15    20
 6     6    16    22
 7     7    17    24
 8     8    18    26
 9     9    19    28
10    10    20    30

The tibble also supports some advanced features, such as grouping of rows into groups (using function group_by, see below).

The main verbs of dplyr library

In the grammar of dplyr library, there are five main verbs that can be used to ply the data: arrange, select, filter, mutate and summarise. Let’s check them one by one, with examples using

arrange

The verb arrange sorts the rows in the table according to values in the selected column(s). In the first example, we arrange the rows by values in Sepal.Length, sorted in ascending order:

iris_tb %>% arrange (Sepal.Length)
# A tibble: 150 x 5
   Sepal.Length Sepal.Width Petal.Length Petal.Width Species
 *        <dbl>       <dbl>        <dbl>       <dbl> <fct>  
 1          4.3         3            1.1         0.1 setosa 
 2          4.4         2.9          1.4         0.2 setosa 
 3          4.4         3            1.3         0.2 setosa 
 4          4.4         3.2          1.3         0.2 setosa 
 5          4.5         2.3          1.3         0.3 setosa 
 6          4.6         3.1          1.5         0.2 setosa 
 7          4.6         3.4          1.4         0.3 setosa 
 8          4.6         3.6          1           0.2 setosa 
 9          4.6         3.2          1.4         0.2 setosa 
10          4.7         3.2          1.3         0.2 setosa 
# ... with 140 more rows

We can also sort variables in descending (decreasing) order, by wrapping the variable name into function desc:

iris_tb %>% arrange (desc (Sepal.Length), Sepal.Width)
# A tibble: 150 x 5
   Sepal.Length Sepal.Width Petal.Length Petal.Width Species  
 *        <dbl>       <dbl>        <dbl>       <dbl> <fct>    
 1          7.9         3.8          6.4         2   virginica
 2          7.7         2.6          6.9         2.3 virginica
 3          7.7         2.8          6.7         2   virginica
 4          7.7         3            6.1         2.3 virginica
 5          7.7         3.8          6.7         2.2 virginica
 6          7.6         3            6.6         2.1 virginica
 7          7.4         2.8          6.1         1.9 virginica
 8          7.3         2.9          6.3         1.8 virginica
 9          7.2         3            5.8         1.6 virginica
10          7.2         3.2          6           1.8 virginica
# ... with 140 more rows

In this previous example, we sorted values by two of the columns - first in descending order of Sepal.Length, and then by increasing order of Sepal.Width (i.e. if some of the rows have the same value for Sepal.Length, as values on the rows two to five, these will be sorted by Sepal.Width).

select

The verb select allows selecting of the columns in the table by either their position (number), their name, or using some selection helpers (see below). ## select —-

iris_tb %>% select (Sepal.Length)  # selects variable Sepal.Length
iris_tb %>% select (Sepal.Length, Petal.Length)  # selects two variables by their name
iris_tb %>% select (Sepal.Length:Petal.Length)  # selects the sequence of variables, starting with Sepal.Length and ending by Petal.Length
iris_tb %>% select (-Petal.Width, -Species) # selects all variables except Petal.Width and Species
iris_tb %>% select (!Species) # selects all variables except Species
iris_tb %>% select (-4:-5)  # selects all variables except variable in the 4th and 5th column

Selection helpers are functions which increase the flexibility how variables can be selected:

iris_tb %>% select (starts_with ('S'))  # selects variables with 'S' at the beginning of the name
iris_tb %>% select (last_col())  # selects the last column of the table
iris_tb %>% select (where (is.numeric))  # selects variables that are numeric (and not other type, such as factor or integer)

filter

The verb filter allows to filter only some of the rows of the table, that pass through the filtering criteria (set by logical expression). For example, we can filter only rows that belong to species Iris setosa (this information is stored in the column Species of iris_tb):

iris_tb %>% filter (Species == 'setosa')

The result will be tibble with only 50 rows (but all columns). The logical expressions can be combined. For example, we can select rows belonging to either setosa or virginica:

iris_tb %>% filter (Species == 'setosa' | Species == 'virginica')
# or, the same, using %in%:
iris_tb %>% filter (Species %in% c('setosa', 'virginica'))

We can also use numerical values for filtering the rows (here rows with Sepal.Length is larger than 5, and only those rows belonging to species setosa or virginica):

iris_tb %>% filter (Sepal.Length > 5, Species %in% c('setosa', 'virginica'))

mutate

The verb mutate allows the creation of a new variable (or redefinition of the existing one). For example, let's multiply the length and width of the sepals, and call it Sepal.Area; function mutate will append it at the end of the table:

iris_tb %>% mutate (Sepal.Area = Sepal.Length*Sepal.Width) 
# A tibble: 150 x 6
   Sepal.Length Sepal.Width Petal.Length Petal.Width Species Sepal.Area
          <dbl>       <dbl>        <dbl>       <dbl> <fct>        <dbl>
 1          5.1         3.5          1.4         0.2 setosa        17.8
 2          4.9         3            1.4         0.2 setosa        14.7
 3          4.7         3.2          1.3         0.2 setosa        15.0
 4          4.6         3.1          1.5         0.2 setosa        14.3
 5          5           3.6          1.4         0.2 setosa        18  
 6          5.4         3.9          1.7         0.4 setosa        21.1
 7          4.6         3.4          1.4         0.3 setosa        15.6
 8          5           3.4          1.5         0.2 setosa        17  
 9          4.4         2.9          1.4         0.2 setosa        12.8
10          4.9         3.1          1.5         0.1 setosa        15.2
# ... with 140 more rows

I can also redefine the value of existing variable, e.g. by multiplying the values of Sepal.Length by 100; if the name I use for the variable is identical with existing one, the column of that variable will be rewritten with new values:

iris_tb %>% mutate (Sepal.Length = Sepal.Length*100)
# A tibble: 150 x 5
   Sepal.Length Sepal.Width Petal.Length Petal.Width Species
          <dbl>       <dbl>        <dbl>       <dbl> <fct>  
 1          510         3.5          1.4         0.2 setosa 
 2          490         3            1.4         0.2 setosa 
 3          470         3.2          1.3         0.2 setosa 
 4          460         3.1          1.5         0.2 setosa 
 5          500         3.6          1.4         0.2 setosa 
 6          540         3.9          1.7         0.4 setosa 
 7          460         3.4          1.4         0.3 setosa 
 8          500         3.4          1.5         0.2 setosa 
 9          440         2.9          1.4         0.2 setosa 
10          490         3.1          1.5         0.1 setosa 
# ... with 140 more rows

I can also simplify the output by creating the new variable, while removing all other variables in the table; this is done by sister function transmute:

iris_tb %>% transmute (Sepal.Area = Sepal.Length*Sepal.Width)
# A tibble: 150 x 1
   Sepal.Area
        <dbl>
 1       17.8
 2       14.7
 3       15.0
 4       14.3
 5       18  
 6       21.1
 7       15.6
 8       17  
 9       12.8
10       15.2
# ... with 140 more rows

summarise (or summarize in American English spelling)

The verb summarize extracts summaries for values in the columns, using provided function. For example, I can calculate the mean and standard deviation for the first two columns in iris_tb:

iris_tb %>% summarise (mean.SL = mean (Sepal.Length),
                       mean.SW = mean (Sepal.Width),
                       sd.SL = sd (Sepal.Length),
                       sd.SW = sd (Sepal.Width))
# A tibble: 1 x 4
  mean.SL mean.SW sd.SL sd.SW
    <dbl>   <dbl> <dbl> <dbl>
1    5.84    3.06 0.828 0.436

If I want to apply the same function (or a set of functions) across multiple columns of the table, this can be done using the extra function across, which can be used within the dplyr verbs (not only summarise). The function needs to define which columns should be used for calculation (first argument) and what functions should be applied to them (the second argument; if more than one function should be used, they need to be wrapped into a list):

iris_tb %>% summarise (across (.cols = 1:4, .fns = mean))
# A tibble: 1 x 4
  Sepal.Length Sepal.Width Petal.Length Petal.Width
         <dbl>       <dbl>        <dbl>       <dbl>
1         5.84        3.06         3.76        1.20

If two or more functions should be used on each column, creating named list is the best option:

iris_tb %>% summarise (across (.cols = 1:4, 
                               .fns = list (mean = mean, sd = sd)))
# A tibble: 1 x 8
  Sepal.Length_mean Sepal.Length_sd Sepal.Width_mean Sepal.Width_sd
              <dbl>           <dbl>            <dbl>          <dbl>
1              5.84           0.828             3.06          0.436
# ... with 4 more variables: Petal.Length_mean <dbl>,
#   Petal.Length_sd <dbl>, Petal.Width_mean <dbl>, Petal.Width_sd <dbl>

Group data (group_by)

The group_by allows grouping of observations (rows) in the table by given criteria (e.g. by factor variable, or by numeric values).

Note that iris table contains a grouping variable, Species. Let's use it to group the data in the tibble:

iris_tb %>% group_by (Species)
# A tibble: 150 x 5
# Groups:   Species [3]
   Sepal.Length Sepal.Width Petal.Length Petal.Width Species
          <dbl>       <dbl>        <dbl>       <dbl> <fct>  
 1          5.1         3.5          1.4         0.2 setosa 
 2          4.9         3            1.4         0.2 setosa 
 3          4.7         3.2          1.3         0.2 setosa 
 4          4.6         3.1          1.5         0.2 setosa 
 5          5           3.6          1.4         0.2 setosa 
 6          5.4         3.9          1.7         0.4 setosa 
 7          4.6         3.4          1.4         0.3 setosa 
 8          5           3.4          1.5         0.2 setosa 
 9          4.4         2.9          1.4         0.2 setosa 
10          4.9         3.1          1.5         0.1 setosa 
# ... with 140 more rows

Nothing much changed to the table, except that there is a newly inserted second row, informing us about tibble being grouped according to Species into three levels. The use of grouping becomes more obvious if combined with some of the dplyr verbs. For example, let's calculate mean of each flower variable, separately for each species:

iris_tb %>% group_by (Species) %>%
  summarise (across (1:4, mean))
  Species    Sepal.Length Sepal.Width Petal.Length Petal.Width
  <fct>             <dbl>       <dbl>        <dbl>       <dbl>
1 setosa             5.01        3.43         1.46       0.246
2 versicolor         5.94        2.77         4.26       1.33 
3 virginica          6.59        2.97         5.55       2.03 

Let's combine several dplyr verbs together into a somehow meaningful sequence to demonstrate their use. I will first select only a subset of variables for further use (Species, Sepal.Length and Sepal.Width). Then I will create a new one, Sepal.Area, by multiplying sepal length and width. I will the filter only those observations with Sepal.Area large or equal to 200 (individuals with larger sepals). After grouping the rows into species, I will calculate mean and standard deviation of Sepal.Length and Sepal.Width, and also count how many observations are there in each group (defined by Species), since some of the observations were filtered out (have too small sepals):

iris_tb %>% select (Species, Sepal.Length, Sepal.Width) %>%
  mutate (Sepal.Area = Sepal.Length*Sepal.Width) %>%
  filter (Sepal.Area >= 20) %>%
  group_by (Species) %>%
  summarise (across (c(Sepal.Length, Sepal.Width), list (mean = mean, sd = sd)),
             n = n())
# A tibble: 3 x 6
  Species    Sepal.Length_mean Sepal.Length_sd Sepal.Width_mean Sepal.Width_sd     n
  <fct>                  <dbl>           <dbl>            <dbl>          <dbl> <int>
1 setosa                  5.53           0.214             4.04          0.207     7
2 versicolor              6.59           0.331             3.18          0.128     8
3 virginica               7.02           0.498             3.17          0.283    26

Reshaping data - long vs wide format

The long table format describes a data where each column represents a different variable and each row has a different observation. On the other side, a wide format describes a table in which the information is spread across many columns, where columns are often representing different levels of the same factor (e.g. species). An example of a long format is a census of tree species in multiple plots, where for each species we measure some quantity (e.g. abundance); each row of such data represent a species recorded in some of the plot, one column identifies plot ID, one column species name, and one column records the abundance of given species in given plot. The same data, if shaped in a wide format, would look like this: each column is a different species name, each row is a plot ID, and each cell of this data frame is the actual value of abundance of given species in a given plot (in this case, the species abundance for species not occurring in the given plot will be missing, and can be replaced e.g. by zero).

Long to wide (pivot_wider)

As an example of long data, we use the subset of the ten most dominant species occurring in 3 different plots sampled within the cloud forest of Tamanshan (塔曼山, highest peak of Taipei basin), stored in the dataset Tamanshan (this link contains complete dataset, while data used here are simplified into fewer species).

Load data stored as gist on GitHub (we use function read_delim from package readr for it):

long_format <- readr::read_delim ('https://gist.githubusercontent.com/zdealveindy/3078ac5f2852531604e703900c3d05a2/raw/770529ff2cc746e6b02318f53089fc72e52ac1d7/long_format.txt', delim = '\t')
long_format
# A tibble: 22 x 3
   Plot_ID Species_name     BA
   <chr>   <chr>         <dbl>
 1 L2L     EuryGlab       98.2
 2 L2L     PrunPhae      630. 
 3 L2L     SycoSine     1295  
 4 L2L     NeolAcum     2321. 
 5 L2L     QuerSess     9430. 
 6 L2R     QuerSten       11.3
 7 L2R     SympMacr       11.3
 8 L2R     PrunPhae       59.4
 9 L2R     EuryGlab       77.9
10 L2R     SycoSine      170. 
# ... with 12 more rows

Use the function pivot_wider to spread the values of the column BA across multiple columns and rows, by using Species_name as a names_from argument, and BA as values_from argument. Argument values_fill = 0 makes sure that combinations of Species_name and Plot_ID, which are missing and would become NA, will be replaced by the given value, here 0. We can also make sure that the order of columns will be alphabetic by adding the argument names_sort = TRUE (without this argument, the order will reflect the order in which the Species_name occurs in the dataset):

wide <- long_format %>% pivot_wider (names_from = Species_name, values_from = BA, values_fill = 0, names_sort = TRUE)
wide
# A tibble: 3 x 11
  Plot_ID CarpRank DaphHima EuryGlab NeolAcum PrunPhae PrunTran QuerSess QuerSten
  <chr>      <dbl>    <dbl>    <dbl>    <dbl>    <dbl>    <dbl>    <dbl>    <dbl>
1 L2L           0         0     98.2    2321.    630.        0     9430.      0  
2 L2R           0         0     77.9    1700      59.4       0    30627.     11.3
3 L2W        1166.     1190    719.      350.      1.3    2382.    1090    1625. 
# ... with 2 more variables: SycoSine <dbl>, SympMacr <dbl>

Data in this format can be used for further analysis (e.g., multivariate methods in the package vegan). For this, you may want to convert the tibble format into data.frame, and move the first column, Plot_ID, into row names, using column_to_rownames function from tibble package: column_to_rownames (wide, var = 'Plot_ID')

Wide to long

This is the opposite situation - we have data spread across multiple columns and want to gather them into one key and one value column. We use the iris dataset here as an example:

iris_long <- as_tibble (iris) %>% pivot_longer (names_to = "Flower_feature", values_to = "Measurement", cols = Sepal.Length:Petal.Width)
# A tibble: 600 × 3
   Species Flower_feature Measurement
   <fct>   <chr>                <dbl>
 1 setosa  Sepal.Length           5.1
 2 setosa  Sepal.Width            3.5
 3 setosa  Petal.Length           1.4
 4 setosa  Petal.Width            0.2
 5 setosa  Sepal.Length           4.9
 6 setosa  Sepal.Width            3  
 7 setosa  Petal.Length           1.4
 8 setosa  Petal.Width            0.2
 9 setosa  Sepal.Length           4.7
10 setosa  Sepal.Width            3.2
# … with 590 more rows
# ℹ Use `print(n = ...)` to see more rows

Note that the as_tibble function transforms the iris dataset into a tibble (although this may not be necessary here - if we didn't transform, the result will be an ordinary data frame). The sequence of variable names (Sepal.Length:Petal.Width) indicates which columns should be gathered into the key-value columns (if we did not specify, the function will also use the Plot_ID and gather it as a part of these two columns. Alternatively, we may code as_tibble (iris) %>% pivot_longer (names_to = “Flower_feature”, values_to = “Measurement”, -Species) (minus sign means to exclude this column from the pivoting).

The long format of data may be required for certain analyses and also may be more suitable, e.g. for plotting (for example, try to use the formula version of boxplot function: boxplot (measurement ~ flower_feature, iris_long), or alternative solution using the ggplot package for plotting: iris_long %>% ggplot (aes (x = Species, y = Measurement)) + geom_boxplot () + facet_grid(~ Flower_feature).

recol/dplyr.txt · Last modified: 2023/11/21 07:37 by david