Wide vs. Long data.frames

Most data you collect yourself or can download from data repositories will come in some variation or form of a “Excel” style spreadsheet. What I mean by that is, that a row in the table contains different features of the same observation. Each column on the other hand contains the same feature of many different observations. Of course, what information we store with feature and observation heavily depends on the content and purpose of the data.

Take a look at the following data from the German weather service DWD about sunshine duration across German states. You can get this data here. I will only use a subset of the data:

library(dplyr)

sunshine = sunshine |>
    filter(Jahr > 2015 & Jahr <= 2020) |>
    select("Jahr", "Sachsen", "Hessen", "Bayern")

Wide format

Jahr Sachsen Hessen Bayern
2016 649.2 579.4 638.5
2017 682.9 608.2 720.3
2018 772.1 774.0 779.6
2019 791.2 769.3 787.0
2020 653.0 640.2 695.6

Take a conscious look on this data.frame for a moment.

  • What information does one row contain?
  • What is stored in one column?
  • What is the meaning of the numbers in the data.frame? (And maybe more importantly: How do you know?)

This typical format is called wide as we have information (i.e. the sunshine duration) stored in different columns for different states per observation year (the row). While this is a perfectly fine form of data storage (e.g. you can get a good overview of the data), there is a second organisation scheme that can also be very useful, especially in the context of data analysis and visualization with R: the long format.

Long format

The principle of the long format is, that we have one column where we store the actual measurement values (sunshine duration in this example). Along that, we have multiple columns with metadata that describe and “encode” each value. Take a look at the example data from above in the long format:

Jahr Bundesland Sonnenscheindauer
2016 Sachsen 649.2
2016 Hessen 579.4
2016 Bayern 638.5
2017 Sachsen 682.9
2017 Hessen 608.2
2017 Bayern 720.3
2018 Sachsen 772.1
2018 Hessen 774.0
2018 Bayern 779.6
2019 Sachsen 791.2
2019 Hessen 769.3
2019 Bayern 787.0
2020 Sachsen 653.0
2020 Hessen 640.2
2020 Bayern 695.6

All the measurement of sunshine duration from the different states are now in one column Sonnenscheindauer. And the two columns Jahr and Bundesland describe each value. In a way, the long format is better for the overall understanding than the wide format as it is way clearer what the content of each column actually means.

Conversion between long and wide with tidyr

library(tidyr)

sunshine_long = pivot_longer(sunshine,
                             cols = c("Sachsen", "Hessen", "Bayern"),
                             names_to = "Bundesland",
                             values_to = "Sonnenscheindauer")
Jahr Bundesland Sonnenscheindauer
2016 Sachsen 649.2
2016 Hessen 579.4
2016 Bayern 638.5
2017 Sachsen 682.9
2017 Hessen 608.2
2017 Bayern 720.3
2018 Sachsen 772.1
2018 Hessen 774.0
2018 Bayern 779.6
2019 Sachsen 791.2
2019 Hessen 769.3
2019 Bayern 787.0
2020 Sachsen 653.0
2020 Hessen 640.2
2020 Bayern 695.6
sunshine_wide = pivot_wider(sunshine_long,
                            names_from = "Bundesland",
                            values_from = "Sonnenscheindauer")
Jahr Sachsen Hessen Bayern
2016 649.2 579.4 638.5
2017 682.9 608.2 720.3
2018 772.1 774.0 779.6
2019 791.2 769.3 787.0
2020 653.0 640.2 695.6
sunshine_year = pivot_wider(sunshine_long,
                            names_from = "Jahr",
                            values_from = "Sonnenscheindauer")
Bundesland 2016 2017 2018 2019 2020
Sachsen 649.2 682.9 772.1 791.2 653.0
Hessen 579.4 608.2 774.0 769.3 640.2
Bayern 638.5 720.3 779.6 787.0 695.6

If you want to go into more details for pivot_wider and pivot_longer check out this tidyr reference article.