Pivoting: data in wide and long format

This tutorial is not ready yet. Please come back later.

Introduction

Tabular data can be stored in different formats. In tidy data, every row is an observation and every column is a variable. But depending on what you define as your observations and variables, you might need to pivot your data from wide to long format or vice versa.

This tutorial introduces the concepts of long and wide format, and shows you how to use the pivot_longer() and pivot_wider() functions from tidyverse on the population dataset (which comes pre-loaded with tidyverse so you don’t need to import it).

Let’s load the tidyverse package and have a look at the population dataset:

# load tidyverse
library(tidyverse)

# add diamonds to the environment
data(population)

Long and wide format

Compare the following two simple datasets:

Data A:

year country value
2022 countryA 10
2022 countryB 12
2023 countryA 14
2023 countryB 15

Data B:

year countryA countryB
2022 10 12
2023 14 15

Data A is in long format, and Data B is in wide format. The tables contain the same information, but sometimes one format is more convenient than the other.

You can convert between these two forms using the pivot_longer() (wide to long) and pivot_wider() (long to wide) functions. To use pivot_longer() you need to specify which columns you’d like to turn into a single column: e.g. to go from Data B to Data A, you’d use the argument cols = c(countryA, countryB) (or equivalently, cols = -year). To use pivot_wider(), you need to specify which column to use for variable names, and which column for variable values: going from Data A to Data B would use the arguments names_from = country, values_from = value.

Pivoting on the population data

The population dataset contains the population of different countries over time. Currently it is in long format: the identifying variables are country and year, and the variable of interest is population.

We can convert it to wide format where the unit of observation is the country, and we have different variables for the population in every year.

# convert the data to wide format with country as the unit of observation
pivot_wider(population, names_from = year, values_from = population)
## # A tibble: 219 × 20
##    country `1995` `1996` `1997` `1998` `1999` `2000` `2001` `2002` `2003` `2004`
##    <chr>    <dbl>  <dbl>  <dbl>  <dbl>  <dbl>  <dbl>  <dbl>  <dbl>  <dbl>  <dbl>
##  1 Afghan… 1.76e7 1.84e7 1.90e7 1.95e7 2.00e7 2.06e7 2.13e7 2.22e7 2.31e7 2.40e7
##  2 Albania 3.36e6 3.34e6 3.33e6 3.33e6 3.32e6 3.30e6 3.29e6 3.26e6 3.24e6 3.22e6
##  3 Algeria 2.93e7 2.98e7 3.03e7 3.08e7 3.13e7 3.17e7 3.22e7 3.26e7 3.30e7 3.35e7
##  4 Americ… 5.29e4 5.39e4 5.49e4 5.59e4 5.68e4 5.75e4 5.82e4 5.87e4 5.91e4 5.93e4
##  5 Andorra 6.39e4 6.43e4 6.41e4 6.38e4 6.41e4 6.54e4 6.80e4 7.16e4 7.56e4 7.91e4
##  6 Angola  1.21e7 1.25e7 1.28e7 1.31e7 1.35e7 1.39e7 1.44e7 1.49e7 1.54e7 1.60e7
##  7 Anguil… 9.81e3 1.01e4 1.03e4 1.05e4 1.08e4 1.11e4 1.14e4 1.17e4 1.20e4 1.23e4
##  8 Antigu… 6.83e4 7.02e4 7.22e4 7.42e4 7.60e4 7.76e4 7.90e4 8.00e4 8.09e4 8.17e4
##  9 Argent… 3.48e7 3.53e7 3.57e7 3.61e7 3.65e7 3.69e7 3.73e7 3.76e7 3.80e7 3.83e7
## 10 Armenia 3.22e6 3.17e6 3.14e6 3.11e6 3.09e6 3.08e6 3.06e6 3.05e6 3.04e6 3.03e6
## # ℹ 209 more rows
## # ℹ 9 more variables: `2005` <dbl>, `2006` <dbl>, `2007` <dbl>, `2008` <dbl>,
## #   `2009` <dbl>, `2010` <dbl>, `2011` <dbl>, `2012` <dbl>, `2013` <dbl>

Alternatively, we can say that the unit of observation is the year, and we have different variables for the population in every country in that year.

# convert the data to wide format with year as the unit of observation
pivot_wider(population, names_from = country, values_from = population)
## # A tibble: 19 × 220
##     year Afghanistan Albania  Algeria `American Samoa` Andorra   Angola Anguilla
##    <dbl>       <dbl>   <dbl>    <dbl>            <dbl>   <dbl>    <dbl>    <dbl>
##  1  1995    17586073 3357858 29315463            52874   63854 12104952     9807
##  2  1996    18415307 3341043 29845208            53926   64274 12451945    10063
##  3  1997    19021226 3331317 30345466            54942   64090 12791388    10305
##  4  1998    19496836 3325456 30820435            55899   63799 13137542    10545
##  5  1999    19987071 3317941 31276295            56768   64084 13510616    10797
##  6  2000    20595360 3304948 31719449            57522   65399 13924930    11071
##  7  2001    21347782 3286084 32150198            58176   68000 14385283    11371
##  8  2002    22202806 3263596 32572977            58729   71639 14886574    11693
##  9  2003    23116142 3239385 33003442            59117   75643 15421075    12023
## 10  2004    24018682 3216197 33461345            59262   79060 15976715    12342
## 11  2005    24860855 3196130 33960903            59117   81223 16544376    12637
## 12  2006    25631282 3179573 34507214            58652   81877 17122409    12903
## 13  2007    26349243 3166222 35097043            57919   81292 17712824    13145
## 14  2008    27032197 3156608 35725377            57053   79969 18314441    13365
## 15  2009    27708187 3151185 36383302            56245   78659 18926650    13571
## 16  2010    28397812 3150143 37062820            55636   77907 19549124    13768
## 17  2011    29105480 3153883 37762962            55274   77865 20180490    13956
## 18  2012    29824536 3162083 38481705            55128   78360 20820525    14132
## 19  2013    30551674 3173271 39208194            55165   79218 21471618    14300
## # ℹ 212 more variables: `Antigua and Barbuda` <dbl>, Argentina <dbl>,
## #   Armenia <dbl>, Aruba <dbl>, Australia <dbl>, Austria <dbl>,
## #   Azerbaijan <dbl>, Bahamas <dbl>, Bahrain <dbl>, Bangladesh <dbl>,
## #   Barbados <dbl>, Belarus <dbl>, Belgium <dbl>, Belize <dbl>, Benin <dbl>,
## #   Bermuda <dbl>, Bhutan <dbl>, `Bolivia (Plurinational State of)` <dbl>,
## #   `Bonaire, Saint Eustatius and Saba` <dbl>, `Bosnia and Herzegovina` <dbl>,
## #   Botswana <dbl>, Brazil <dbl>, `British Virgin Islands` <dbl>, …

These different formats can be useful for different types of analysis.

To be able to work with this new dataset, you need to save it to a new object. Let’s assign the result of the pivot_wider() function (with countries as the unit) to a new object called population_wide:

# save the result to a new object
population_wide <- pivot_wider(population, names_from = year, values_from = population)

Pivoting is reversible, so we can convert population_wide back to long format with the pivot_longer() function. We want to take the years as the variable names, and the population as the values – instead of listing all years as the columns we want to convert, we can say that we want to pivot all columns except country.

# convert the data back to long format
pivot_longer(population_wide, cols = -country, names_to = "year", values_to = "population")
## # A tibble: 4,161 × 3
##    country     year  population
##    <chr>       <chr>      <dbl>
##  1 Afghanistan 1995    17586073
##  2 Afghanistan 1996    18415307
##  3 Afghanistan 1997    19021226
##  4 Afghanistan 1998    19496836
##  5 Afghanistan 1999    19987071
##  6 Afghanistan 2000    20595360
##  7 Afghanistan 2001    21347782
##  8 Afghanistan 2002    22202806
##  9 Afghanistan 2003    23116142
## 10 Afghanistan 2004    24018682
## # ℹ 4,151 more rows

The names_to and values_to arguments are used to specify the names of the new columns that will be created; they are optional, but can be useful for clarity.

Video tutorial TBA