Introduction to tidyverse

The R tidyverse is a collection of packages for data handling, analysis and visualization. To use it, you first have to install the packages with the install.packages() function. Once they are installed, you make the tidyverse functions available in your current R session with library(). You only have to install a package once, but you have to load it every time you start a new R session. It is therefore recommended to either leave install.packages() out of your script or comment it out, as below.

# install.packages("tidyverse")
library(tidyverse)

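As you see in the output, library(tidyverse) actually loads nine different packages. It also prints a notice about conflicting functions, that is, tidyverse functions that share a name with functions from other loaded packages (for example, dplyr::filter() masks stats::filter()). Do not worry about this for now; we will get to it in time.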

Why tidyverse?

  • consistent syntax and workflows
  • makes code more readable
  • pipe operator %>% / |> can chain functions together
  • tidy data approach
    • rows are observations
    • columns are variables / features
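
As a minimal sketch of what the tidy data approach means in practice, the following reshapes a small table with pivot_longer() from tidyr (one of the tidyverse packages). The table and its column names (site, june, july) are made up for illustration and are not part of the Aasee data used later.

```r
library(tidyr)

# Hypothetical wide table: the month is a variable, but it is spread
# over two columns, so one row holds two observations (not tidy)
wide = data.frame(
    site = c("A", "B"),
    june = c(18.0, 17.5),
    july = c(21.2, 20.8)
)

# Tidy version: one row per observation (site and month),
# one column per variable (site, month, temperature)
long = pivot_longer(wide, cols = c(june, july),
                    names_to = "month", values_to = "temperature")
long
```

The long table has four rows, one for each combination of site and month, which is the shape the dplyr functions below work with most naturally.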

https://ajsmit.github.io/Intro_R_Official/figures/tidy_workflow.png

data.frames with dplyr

  • provides functions for data.frame manipulation
  • can complement or replace base R functions

Of course, you can also load single packages from the tidyverse with the library() function.

library(dplyr)
aasee = read.csv("data/2021-06_aasee.csv")

slice - selects rows by their position

aasee = slice(aasee, seq(8))

select - selects columns

select(aasee, Wassertemperatur)
  Wassertemperatur
1            17.98
2            17.66
3            18.03
4            18.08
5            18.06
6            18.01
7            18.02
8            18.06

filter - keeps the rows for which a logical condition is TRUE

filter(aasee, Wassertemperatur < 18)
             Datum Wassertemperatur pH.Wert Sauerstoffgehalt
1 2021-05-31 23:57            17.98    8.05            10.53
2 2021-06-01 00:09            17.66    8.04             9.64
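
Conditions can also be combined. The sketch below rebuilds the eight rows printed in this section by hand, so that it runs without the CSV file, and shows two ways of combining conditions; the thresholds are chosen only for illustration.

```r
library(dplyr)

# The eight rows shown in this section, typed in by hand
aasee = data.frame(
    Datum = c("2021-05-31 23:57", "2021-06-01 00:09", "2021-06-01 00:19",
              "2021-06-01 00:27", "2021-06-01 00:39", "2021-06-01 00:49",
              "2021-06-01 00:59", "2021-06-01 01:08"),
    Wassertemperatur = c(17.98, 17.66, 18.03, 18.08, 18.06, 18.01, 18.02, 18.06),
    pH.Wert = c(8.05, 8.04, 8.12, 8.14, 8.12, 8.10, 8.10, 8.10),
    Sauerstoffgehalt = c(10.53, 9.64, 11.30, 11.32, 11.06, 10.91, 10.96, 10.83)
)

# Conditions separated by a comma must all be TRUE (logical AND)
cold_and_basic = filter(aasee, Wassertemperatur < 18, pH.Wert > 8.04)

# The | operator keeps rows where at least one condition is TRUE (logical OR)
cold_or_oxygen_rich = filter(aasee, Wassertemperatur < 18 | Sauerstoffgehalt > 11.3)

nrow(cold_and_basic)
nrow(cold_or_oxygen_rich)
```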

mutate - adds new columns to the data.frame or changes existing ones

mutate(aasee, t_kelvin = Wassertemperatur + 273.15)
             Datum Wassertemperatur pH.Wert Sauerstoffgehalt t_kelvin
1 2021-05-31 23:57            17.98    8.05            10.53   291.13
2 2021-06-01 00:09            17.66    8.04             9.64   290.81
3 2021-06-01 00:19            18.03    8.12            11.30   291.18
4 2021-06-01 00:27            18.08    8.14            11.32   291.23
5 2021-06-01 00:39            18.06    8.12            11.06   291.21
6 2021-06-01 00:49            18.01    8.10            10.91   291.16
7 2021-06-01 00:59            18.02    8.10            10.96   291.17
8 2021-06-01 01:08            18.06    8.10            10.83   291.21
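
Note that mutate(), like the other dplyr functions, returns a new data.frame and leaves its input untouched. A minimal sketch, again with the eight rows from this section typed in by hand; the Fahrenheit column is an additional illustration that does not appear elsewhere in this text.

```r
library(dplyr)

# The eight rows shown in this section, typed in by hand
aasee = data.frame(
    Datum = c("2021-05-31 23:57", "2021-06-01 00:09", "2021-06-01 00:19",
              "2021-06-01 00:27", "2021-06-01 00:39", "2021-06-01 00:49",
              "2021-06-01 00:59", "2021-06-01 01:08"),
    Wassertemperatur = c(17.98, 17.66, 18.03, 18.08, 18.06, 18.01, 18.02, 18.06),
    pH.Wert = c(8.05, 8.04, 8.12, 8.14, 8.12, 8.10, 8.10, 8.10),
    Sauerstoffgehalt = c(10.53, 9.64, 11.30, 11.32, 11.06, 10.91, 10.96, 10.83)
)

# Several columns can be added in one call; the result has to be
# assigned to a name, otherwise it is only printed and then lost
aasee_converted = mutate(aasee,
    t_kelvin = Wassertemperatur + 273.15,
    t_fahrenheit = Wassertemperatur * 9 / 5 + 32
)

names(aasee)            # the original still has four columns
names(aasee_converted)  # the copy has six
```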

summarise - summarises data

summarise(aasee, minimum_t = min(Wassertemperatur))
  minimum_t
1     17.66
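
summarise() can compute several statistics at once, each of which becomes a column in a one-row result. The following sketch uses the eight rows from this section typed in by hand; the output column names (maximum_t, mean_ph, n_rows) are arbitrary choices.

```r
library(dplyr)

# The eight rows shown in this section, typed in by hand
aasee = data.frame(
    Datum = c("2021-05-31 23:57", "2021-06-01 00:09", "2021-06-01 00:19",
              "2021-06-01 00:27", "2021-06-01 00:39", "2021-06-01 00:49",
              "2021-06-01 00:59", "2021-06-01 01:08"),
    Wassertemperatur = c(17.98, 17.66, 18.03, 18.08, 18.06, 18.01, 18.02, 18.06),
    pH.Wert = c(8.05, 8.04, 8.12, 8.14, 8.12, 8.10, 8.10, 8.10),
    Sauerstoffgehalt = c(10.53, 9.64, 11.30, 11.32, 11.06, 10.91, 10.96, 10.83)
)

# Each name = expression pair produces one column of the summary;
# n() counts the rows that went into it
temperature_summary = summarise(aasee,
    minimum_t = min(Wassertemperatur),
    maximum_t = max(Wassertemperatur),
    mean_ph = mean(pH.Wert),
    n_rows = n()
)
temperature_summary
```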

All of the operations above can also be done with base R:

# the same in base R

# select (returns a plain vector; aasee["Wassertemperatur"] would keep the data.frame)
aasee$Wassertemperatur

# filter (the row condition goes before the comma)
aasee[aasee$Wassertemperatur < 18, ]

# mutate
aasee$t_kelvin = aasee$Wassertemperatur + 273.15

# summarise
min(aasee$Wassertemperatur)
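
To check that base R and dplyr really arrive at the same result, the two can be compared side by side. This is a minimal sketch using the eight rows from this section typed in by hand; the variable names (sel_base, fil_dplyr and so on) are only there for the comparison.

```r
library(dplyr)

# The eight rows shown in this section, typed in by hand
aasee = data.frame(
    Datum = c("2021-05-31 23:57", "2021-06-01 00:09", "2021-06-01 00:19",
              "2021-06-01 00:27", "2021-06-01 00:39", "2021-06-01 00:49",
              "2021-06-01 00:59", "2021-06-01 01:08"),
    Wassertemperatur = c(17.98, 17.66, 18.03, 18.08, 18.06, 18.01, 18.02, 18.06),
    pH.Wert = c(8.05, 8.04, 8.12, 8.14, 8.12, 8.10, 8.10, 8.10),
    Sauerstoffgehalt = c(10.53, 9.64, 11.30, 11.32, 11.06, 10.91, 10.96, 10.83)
)

# select: single brackets with a column name keep the data.frame structure
sel_base = aasee["Wassertemperatur"]
sel_dplyr = select(aasee, Wassertemperatur)

# filter: in base R the logical condition goes in the row position,
# before the comma; after the comma it would select columns instead
fil_base = aasee[aasee$Wassertemperatur < 18, ]
fil_dplyr = filter(aasee, Wassertemperatur < 18)

# Compare the values returned by both approaches
identical(sel_base$Wassertemperatur, sel_dplyr$Wassertemperatur)
identical(fil_base$Datum, fil_dplyr$Datum)
```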

The pipe operator

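The strength of dplyr is that its functions can be chained with a pipe operator: either %>% (from the magrittr package, made available by dplyr) or the native pipe |> that is part of R since version 4.1. The pipe passes the result on its left as the first argument to the function on its right.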

aasee |>
    filter(Wassertemperatur < 18) |>
    select(pH.Wert) |>
    max()
[1] 8.05

With base R the same steps have to be nested inside each other, so the code must be read from the inside out and quickly becomes hard to follow.

max(aasee$pH.Wert[aasee$Wassertemperatur < 18])
[1] 8.05
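
To make the mechanics of the pipe explicit: x |> f(y) is just another way of writing f(x, y). The sketch below writes the same computation once as nested calls and once as a pipeline, using the eight rows from this section typed in by hand. It assumes R 4.1 or newer for |>; the names nested, piped and max_ph are chosen for illustration.

```r
library(dplyr)

# The eight rows shown in this section, typed in by hand
aasee = data.frame(
    Datum = c("2021-05-31 23:57", "2021-06-01 00:09", "2021-06-01 00:19",
              "2021-06-01 00:27", "2021-06-01 00:39", "2021-06-01 00:49",
              "2021-06-01 00:59", "2021-06-01 01:08"),
    Wassertemperatur = c(17.98, 17.66, 18.03, 18.08, 18.06, 18.01, 18.02, 18.06),
    pH.Wert = c(8.05, 8.04, 8.12, 8.14, 8.12, 8.10, 8.10, 8.10),
    Sauerstoffgehalt = c(10.53, 9.64, 11.30, 11.32, 11.06, 10.91, 10.96, 10.83)
)

# Nested calls: read from the inside out
nested = summarise(filter(aasee, Wassertemperatur < 18), max_ph = max(pH.Wert))

# The same steps as a pipeline: read from top to bottom
piped = aasee |>
    filter(Wassertemperatur < 18) |>
    summarise(max_ph = max(pH.Wert))

# pull() extracts a single column as a plain vector, which is handy
# when the last step is a function that expects a vector
max_ph = aasee |>
    filter(Wassertemperatur < 18) |>
    pull(pH.Wert) |>
    max()
```

Both variants return the same one-row data.frame, and max_ph holds the value 8.05 as a plain number, matching the result shown above.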