Introduction to data.frame

From a data science perspective, the most important class of objects is the data frame - Chambers (2020)

  • Data ordered in rows and columns - just like a spreadsheet
  • Technical implementation in R:
    • data.frame is a list of vectors
    • each vector is one column
    • vectors are atomic - each value in a column has the same data type
    • different columns can have different data types

A data.frame is a named list of vectors:
df <- data.frame(plotID = seq(3),
                 soil_ph = c(5.5, 5.4, 6.1),
                 soil_temperature = c(10, 11, 12),
                 forest_type = c("coniferous", "coniferous", "deciduous"))

class(df)
[1] "data.frame"
df
  plotID soil_ph soil_temperature forest_type
1      1     5.5               10  coniferous
2      2     5.4               11  coniferous
3      3     6.1               12   deciduous
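
To make the "named list of vectors" idea concrete, the following minimal sketch rebuilds the small example under the name plots (a name chosen here only so the block runs on its own) and inspects it with ordinary list tools. The class of forest_type depends on the R version (character from R 4.0 on, factor by default before), so it is not asserted here.

```r
# Rebuild the small example so this block is self-contained.
# `plots` is an illustrative name, not part of the lesson material.
plots <- data.frame(plotID = seq(3),
                    soil_ph = c(5.5, 5.4, 6.1),
                    soil_temperature = c(10, 11, 12),
                    forest_type = c("coniferous", "coniferous", "deciduous"))

# A data.frame is a list, so list functions work on it
is.list(plots)   # TRUE
length(plots)    # number of list elements, i.e. number of columns
names(plots)     # the element names are the column names

# Each list element is one column vector with its own data type
col_classes <- sapply(plots, class)
col_classes

# List-style access with [[ ]] returns the column as a vector
plots[["soil_ph"]]
```

Because every column is a separate list element, columns can differ in type while each column stays internally consistent.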

When you work with real data, you rarely build a data.frame from scratch as in the example above. Instead, you usually have a file on your computer, e.g. a .csv file, that you want to read into R. The next lesson covers external data and how to get it into R.

For now, we simply want to have a larger data.frame to show and test some functions. Here, I use the temperature data from the Aasee that was also part of assignments/Ex02_second.qmd.

data <- read.csv(file = "data/2021-06_aasee.csv")


# show the first few rows of the df
head(data)
             Datum Wassertemperatur pH.Wert Sauerstoffgehalt
1 2021-05-31 23:57            17.98    8.05            10.53
2 2021-06-01 00:09            17.66    8.04             9.64
3 2021-06-01 00:19            18.03    8.12            11.30
4 2021-06-01 00:27            18.08    8.14            11.32
5 2021-06-01 00:39            18.06    8.12            11.06
6 2021-06-01 00:49            18.01    8.10            10.91
# show the last few rows of the df
tail(data)
                Datum Wassertemperatur pH.Wert Sauerstoffgehalt
4220 2021-06-30 22:57            23.73    8.78            17.80
4221 2021-06-30 23:09            23.70    8.72            17.66
4222 2021-06-30 23:18            23.68    8.73            17.72
4223 2021-06-30 23:29            23.64    8.81            18.38
4224 2021-06-30 23:39            23.62    8.76            17.93
4225 2021-06-30 23:49            23.63    8.77            17.82
# get a short summary of the structure
str(data)
'data.frame':   4225 obs. of  4 variables:
 $ Datum           : chr  "2021-05-31 23:57" "2021-06-01 00:09" "2021-06-01 00:19" "2021-06-01 00:27" ...
 $ Wassertemperatur: num  18 17.7 18 18.1 18.1 ...
 $ pH.Wert         : num  8.05 8.04 8.12 8.14 8.12 8.1 8.1 8.1 8.1 8.1 ...
 $ Sauerstoffgehalt: num  10.53 9.64 11.3 11.32 11.06 ...
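
Besides head(), tail() and str(), a few base R functions report the size and column names directly. The sketch below does not read the CSV file. Instead it uses a stand-in data.frame called aasee (a name assumed here) built from the six rows printed by head(data) above, so it can run without the file.

```r
# Stand-in for the Aasee data: the six rows shown by head(data).
# `aasee` is an illustrative name; the lesson itself uses `data`.
aasee <- data.frame(
  Datum = c("2021-05-31 23:57", "2021-06-01 00:09", "2021-06-01 00:19",
            "2021-06-01 00:27", "2021-06-01 00:39", "2021-06-01 00:49"),
  Wassertemperatur = c(17.98, 17.66, 18.03, 18.08, 18.06, 18.01),
  pH.Wert = c(8.05, 8.04, 8.12, 8.14, 8.12, 8.10),
  Sauerstoffgehalt = c(10.53, 9.64, 11.30, 11.32, 11.06, 10.91)
)

dim(aasee)     # number of rows and number of columns
nrow(aasee)    # rows only
ncol(aasee)    # columns only
names(aasee)   # column names

# head() and tail() accept n to control how many rows are shown
first_two <- head(aasee, n = 2)
first_two
```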

data.frame subsetting

You can think of a data.frame as a two-dimensional structure. Subsetting a data.frame therefore takes two indices, one for the rows and one for the columns:

# row 1, column 2
data[1,2]
[1] 17.98
# the first row, empty means "everything"
data[1,]
             Datum Wassertemperatur pH.Wert Sauerstoffgehalt
1 2021-05-31 23:57            17.98    8.05            10.53
# the first 3 rows, columns 3 and 4
data[seq(3), c(3,4)]
  pH.Wert Sauerstoffgehalt
1    8.05            10.53
2    8.04             9.64
3    8.12            11.30
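
The outputs above differ in kind: some subsets print as a plain vector, others as a data.frame. The following sketch illustrates this on a three-row stand-in called aasee (an assumed name, with values taken from the head(data) output) and shows the drop = FALSE argument of `[`, which keeps a single column as a data.frame.

```r
# Three-row stand-in built from the values printed by head(data).
# `aasee` is an illustrative name; the lesson itself uses `data`.
aasee <- data.frame(
  Datum = c("2021-05-31 23:57", "2021-06-01 00:09", "2021-06-01 00:19"),
  Wassertemperatur = c(17.98, 17.66, 18.03),
  pH.Wert = c(8.05, 8.04, 8.12),
  Sauerstoffgehalt = c(10.53, 9.64, 11.30)
)

cell    <- aasee[1, 2]                # one cell: a vector of length 1
row1    <- aasee[1, ]                 # one row: still a data.frame
col2    <- aasee[, 2]                 # one column: simplified to a vector
col2_df <- aasee[, 2, drop = FALSE]   # one column, kept as a data.frame

class(cell)
class(row1)
class(col2)
class(col2_df)
```

A single row stays a data.frame because its columns may have different types, while a single column can safely be simplified to a vector.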

To extract a single column of a data.frame, use the $ operator. The result is a plain vector with only one dimension, so subsetting it takes a single index rather than a row and a column (see example below).

temperature <- data$Wassertemperatur
summary(temperature)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
  17.18   21.65   23.07   23.15   24.52   28.96 
# subsetting a column accessed with $ only needs one value
data$pH.Wert[100]
[1] 8.38
# You can also define new columns with the $
data$location <- "Aasee"
head(data)
             Datum Wassertemperatur pH.Wert Sauerstoffgehalt location
1 2021-05-31 23:57            17.98    8.05            10.53    Aasee
2 2021-06-01 00:09            17.66    8.04             9.64    Aasee
3 2021-06-01 00:19            18.03    8.12            11.30    Aasee
4 2021-06-01 00:27            18.08    8.14            11.32    Aasee
5 2021-06-01 00:39            18.06    8.12            11.06    Aasee
6 2021-06-01 00:49            18.01    8.10            10.91    Aasee
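
As a small extension of the $ examples, the sketch below shows that `[[ ]]` with a column name returns the same vector as $, and that a new column can be computed from an existing one. The stand-in aasee and the column name temp_kelvin are assumptions made for illustration and are not part of the Aasee data.

```r
# Three-row stand-in built from the values printed by head(data).
# `aasee` and `temp_kelvin` are illustrative names.
aasee <- data.frame(
  Datum = c("2021-05-31 23:57", "2021-06-01 00:09", "2021-06-01 00:19"),
  Wassertemperatur = c(17.98, 17.66, 18.03),
  pH.Wert = c(8.05, 8.04, 8.12)
)

# $ and [[ ]] both return the column as a vector
same_column <- identical(aasee$pH.Wert, aasee[["pH.Wert"]])
same_column

# [[ ]] also works when the column name is stored in a variable
column_name <- "Wassertemperatur"
aasee[[column_name]]

# A new column computed element-wise from an existing one
aasee$temp_kelvin <- aasee$Wassertemperatur + 273.15
aasee
```

A single value assigned with $, such as "Aasee", is recycled to every row, whereas a vector of length nrow(aasee) fills the column row by row.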

Subsetting a data.frame by column names also works:

# read: from data, rows 1 to 5, and columns with the name Datum and pH.Wert
data[seq(5), c("Datum", "pH.Wert")]
             Datum pH.Wert
1 2021-05-31 23:57    8.05
2 2021-06-01 00:09    8.04
3 2021-06-01 00:19    8.12
4 2021-06-01 00:27    8.14
5 2021-06-01 00:39    8.12

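Logical conditions also work for data.frame subsetting. A comparison such as data$pH.Wert > 8.9 returns a logical vector with one value per row, and only the rows where it is TRUE are kept: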

# read: from data, only the rows where pH is larger than 8.9, and all the columns
data[data$pH.Wert > 8.9, ]
                Datum Wassertemperatur pH.Wert Sauerstoffgehalt location
4056 2021-06-29 19:20            25.37    8.92            21.21    Aasee
4185 2021-06-30 17:07            24.14    8.91            20.58    Aasee
4186 2021-06-30 17:18            24.14    8.91            20.26    Aasee
4188 2021-06-30 17:38            24.17    8.92            20.50    Aasee
4189 2021-06-30 17:49            24.15    8.93            20.56    Aasee
4190 2021-06-30 17:59            24.14    8.92            20.15    Aasee
4191 2021-06-30 18:07            24.13    8.91            19.97    Aasee
4192 2021-06-30 18:19            24.14    8.91            19.71    Aasee
4193 2021-06-30 18:29            24.12    8.91            19.70    Aasee
4194 2021-06-30 18:39            24.12    8.92            19.43    Aasee
# read: from data, only rows where temperature is smaller than 17.5, and only the column with the name Datum
data[data$Wassertemperatur < 17.5, "Datum"]
 [1] "2021-06-01 06:59" "2021-06-01 07:09" "2021-06-01 07:18" "2021-06-01 07:29"
 [5] "2021-06-01 07:39" "2021-06-01 07:50" "2021-06-01 08:09" "2021-06-01 08:17"
 [9] "2021-06-01 08:38" "2021-06-01 08:49" "2021-06-01 08:58" "2021-06-01 11:58"
[13] "2021-06-01 12:19" "2021-06-01 12:59" "2021-06-01 13:04" "2021-06-01 13:13"
[17] "2021-06-01 13:25"
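
To see what happens inside the square brackets, the following sketch stores the logical vector first and then combines two conditions with &. It runs on a six-row stand-in called aasee (an assumed name) that reuses the values printed by head(data). The thresholds are chosen only for illustration.

```r
# Six-row stand-in built from the values printed by head(data).
# `aasee` is an illustrative name; the lesson itself uses `data`.
aasee <- data.frame(
  Datum = c("2021-05-31 23:57", "2021-06-01 00:09", "2021-06-01 00:19",
            "2021-06-01 00:27", "2021-06-01 00:39", "2021-06-01 00:49"),
  Wassertemperatur = c(17.98, 17.66, 18.03, 18.08, 18.06, 18.01),
  pH.Wert = c(8.05, 8.04, 8.12, 8.14, 8.12, 8.10)
)

# The comparison yields one TRUE/FALSE per row
is_warm <- aasee$Wassertemperatur > 18
is_warm

# Using the logical vector as row index keeps only the TRUE rows
warm_rows <- aasee[is_warm, ]
warm_rows

# & requires both conditions to hold, | would require at least one
warm_and_basic <- aasee[aasee$Wassertemperatur > 18 & aasee$pH.Wert > 8.1,
                        c("Datum", "pH.Wert")]
warm_and_basic
```

Storing the condition in a variable first makes it easy to check how many rows match, for example with sum(is_warm), before subsetting.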

References

Chambers, John M. 2020. “S, R, and Data Science.” The R Journal 12 (1): 462. https://doi.org/10.32614/RJ-2020-028.