Introduction
Tabular data can be stored in different formats. In tidy data, every row is an observation and every column is a variable. But depending on what you define as your observations and variables, you might need to pivot your data from wide to long format or vice versa.
This tutorial introduces the concepts of long and wide format, and shows
you how to use the pivot_longer()
and pivot_wider()
functions from
tidyverse
on the population
dataset (which comes pre-loaded with
tidyverse
so you don’t need to import it).
Let’s load the tidyverse
package and have a look at the population
dataset:
# load tidyverse
library(tidyverse)
# add diamonds to the environment
data(population)
Long and wide format
Compare the following two simple datasets:
Data A:
year | country | value |
---|---|---|
2022 | countryA | 10 |
2022 | countryB | 12 |
2023 | countryA | 14 |
2023 | countryB | 15 |
Data B:
year | countryA | countryB |
---|---|---|
2022 | 10 | 12 |
2023 | 14 | 15 |
Data A is in long format, and Data B is in wide format. The tables contain the same information, but sometimes one format is more convenient than the other.
You can convert between these two forms using the pivot_longer()
(wide
to long) and pivot_wider()
(long to wide) functions. To use
pivot_longer()
you need to specify which columns you’d like to turn
into a single column: e.g. to go from Data B to Data A, you’d use the
argument cols = c(countryA, countryB)
(or equivalently,
cols = -year
). To use pivot_wider()
, you need to specify which
column to use for variable names, and which column for variable values:
going from Data A to Data B would use the arguments
names_from = country, values_from = value
.
Pivoting on the population
data
The population
dataset contains the population of different countries
over time. Currently it is in long format: the identifying variables are
country
and year
, and the variable of interest is population
.
We can convert it to wide format where the unit of observation is the country, and we have different variables for the population in every year.
# convert the data to wide format with country as the unit of observation
pivot_wider(population, names_from = year, values_from = population)
## # A tibble: 219 × 20
## country `1995` `1996` `1997` `1998` `1999` `2000` `2001` `2002` `2003` `2004` `2005` `2006` `2007` `2008` `2009` `2010` `2011`
## <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 Afghanistan 1.76e7 1.84e7 1.90e7 1.95e7 2.00e7 2.06e7 2.13e7 2.22e7 2.31e7 2.40e7 2.49e7 2.56e7 2.63e7 2.70e7 2.77e7 2.84e7 2.91e7
## 2 Albania 3.36e6 3.34e6 3.33e6 3.33e6 3.32e6 3.30e6 3.29e6 3.26e6 3.24e6 3.22e6 3.20e6 3.18e6 3.17e6 3.16e6 3.15e6 3.15e6 3.15e6
## 3 Algeria 2.93e7 2.98e7 3.03e7 3.08e7 3.13e7 3.17e7 3.22e7 3.26e7 3.30e7 3.35e7 3.40e7 3.45e7 3.51e7 3.57e7 3.64e7 3.71e7 3.78e7
## 4 American Sam… 5.29e4 5.39e4 5.49e4 5.59e4 5.68e4 5.75e4 5.82e4 5.87e4 5.91e4 5.93e4 5.91e4 5.87e4 5.79e4 5.71e4 5.62e4 5.56e4 5.53e4
## 5 Andorra 6.39e4 6.43e4 6.41e4 6.38e4 6.41e4 6.54e4 6.80e4 7.16e4 7.56e4 7.91e4 8.12e4 8.19e4 8.13e4 8.00e4 7.87e4 7.79e4 7.79e4
## 6 Angola 1.21e7 1.25e7 1.28e7 1.31e7 1.35e7 1.39e7 1.44e7 1.49e7 1.54e7 1.60e7 1.65e7 1.71e7 1.77e7 1.83e7 1.89e7 1.95e7 2.02e7
## 7 Anguilla 9.81e3 1.01e4 1.03e4 1.05e4 1.08e4 1.11e4 1.14e4 1.17e4 1.20e4 1.23e4 1.26e4 1.29e4 1.31e4 1.34e4 1.36e4 1.38e4 1.40e4
## 8 Antigua and … 6.83e4 7.02e4 7.22e4 7.42e4 7.60e4 7.76e4 7.90e4 8.00e4 8.09e4 8.17e4 8.26e4 8.35e4 8.44e4 8.53e4 8.63e4 8.72e4 8.82e4
## 9 Argentina 3.48e7 3.53e7 3.57e7 3.61e7 3.65e7 3.69e7 3.73e7 3.76e7 3.80e7 3.83e7 3.86e7 3.90e7 3.93e7 3.97e7 4.00e7 4.04e7 4.07e7
## 10 Armenia 3.22e6 3.17e6 3.14e6 3.11e6 3.09e6 3.08e6 3.06e6 3.05e6 3.04e6 3.03e6 3.01e6 3.00e6 2.99e6 2.98e6 2.97e6 2.96e6 2.96e6
## # ℹ 209 more rows
## # ℹ 2 more variables: `2012` <dbl>, `2013` <dbl>
Alternatively, we can say that the unit of observation is the year, and we have different variables for the population in every country in that year.
# convert the data to wide format with year as the unit of observation
pivot_wider(population, names_from = country, values_from = population)
## # A tibble: 19 × 220
## year Afghanistan Albania Algeria `American Samoa` Andorra Angola Anguilla `Antigua and Barbuda` Argentina Armenia Aruba Australia
## <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 1995 17586073 3357858 29315463 52874 63854 1.21e7 9807 68349 34833168 3223173 80326 18124234
## 2 1996 18415307 3341043 29845208 53926 64274 1.25e7 10063 70245 35264070 3173425 83195 18339037
## 3 1997 19021226 3331317 30345466 54942 64090 1.28e7 10305 72232 35690778 3137652 85447 18563442
## 4 1998 19496836 3325456 30820435 55899 63799 1.31e7 10545 74206 36109342 3112958 87276 18794552
## 5 1999 19987071 3317941 31276295 56768 64084 1.35e7 10797 76041 36514558 3093820 89004 19027438
## 6 2000 20595360 3304948 31719449 57522 65399 1.39e7 11071 77648 36903067 3076098 90858 19259377
## 7 2001 21347782 3286084 32150198 58176 68000 1.44e7 11371 78972 37273361 3059960 92894 19487257
## 8 2002 22202806 3263596 32572977 58729 71639 1.49e7 11693 80030 37627545 3047002 94995 19714625
## 9 2003 23116142 3239385 33003442 59117 75643 1.54e7 12023 80904 37970411 3036032 97015 19953121
## 10 2004 24018682 3216197 33461345 59262 79060 1.60e7 12342 81718 38308779 3025652 98742 20218481
## 11 2005 24860855 3196130 33960903 59117 81223 1.65e7 12637 82565 38647854 3014917 100031 20520736
## 12 2006 25631282 3179573 34507214 58652 81877 1.71e7 12903 83467 38988923 3002911 100830 20865583
## 13 2007 26349243 3166222 35097043 57919 81292 1.77e7 13145 84397 39331357 2989882 101219 21246274
## 14 2008 27032197 3156608 35725377 57053 79969 1.83e7 13365 85349 39676083 2977488 101344 21645095
## 15 2009 27708187 3151185 36383302 56245 78659 1.89e7 13571 86300 40023641 2968154 101418 22037143
## 16 2010 28397812 3150143 37062820 55636 77907 1.95e7 13768 87233 40374224 2963496 101597 22404488
## 17 2011 29105480 3153883 37762962 55274 77865 2.02e7 13956 88152 40728738 2964120 101932 22740536
## 18 2012 29824536 3162083 38481705 55128 78360 2.08e7 14132 89069 41086927 2969081 102384 23050471
## 19 2013 30551674 3173271 39208194 55165 79218 2.15e7 14300 89985 41446246 2976566 102911 23342553
## # ℹ 207 more variables: Austria <dbl>, Azerbaijan <dbl>, Bahamas <dbl>, Bahrain <dbl>, Bangladesh <dbl>, Barbados <dbl>,
## # Belarus <dbl>, Belgium <dbl>, Belize <dbl>, Benin <dbl>, Bermuda <dbl>, Bhutan <dbl>, `Bolivia (Plurinational State of)` <dbl>,
## # `Bonaire, Saint Eustatius and Saba` <dbl>, `Bosnia and Herzegovina` <dbl>, Botswana <dbl>, Brazil <dbl>,
## # `British Virgin Islands` <dbl>, `Brunei Darussalam` <dbl>, Bulgaria <dbl>, `Burkina Faso` <dbl>, Burundi <dbl>,
## # `Cabo Verde` <dbl>, Cambodia <dbl>, Cameroon <dbl>, Canada <dbl>, `Cayman Islands` <dbl>, `Central African Republic` <dbl>,
## # Chad <dbl>, Chile <dbl>, China <dbl>, `China, Hong Kong SAR` <dbl>, `China, Macao SAR` <dbl>, Colombia <dbl>, Comoros <dbl>,
## # Congo <dbl>, `Cook Islands` <dbl>, `Costa Rica` <dbl>, `Côte d'Ivoire` <dbl>, Croatia <dbl>, Cuba <dbl>, Curaçao <dbl>, …
These different formats can be useful for different types of analysis.
To be able to work with this new dataset, you need to save it to a new
object. Let’s assign the result of the pivot_wider()
function (with
countries as the unit) to a new object called population_wide
:
# save the result to a new object
population_wide <- pivot_wider(population, names_from = year, values_from = population)
Pivoting is reversible, so we can convert population_wide
back to long
format with the pivot_longer()
function. We want to take the years as
the variable names, and the population as the values – instead of
listing all years as the columns we want to convert, we can say that we
want to pivot all columns except country
.
# convert the data back to long format
pivot_longer(population_wide, cols = -country, names_to = "year", values_to = "population")
## # A tibble: 4,161 × 3
## country year population
## <chr> <chr> <dbl>
## 1 Afghanistan 1995 17586073
## 2 Afghanistan 1996 18415307
## 3 Afghanistan 1997 19021226
## 4 Afghanistan 1998 19496836
## 5 Afghanistan 1999 19987071
## 6 Afghanistan 2000 20595360
## 7 Afghanistan 2001 21347782
## 8 Afghanistan 2002 22202806
## 9 Afghanistan 2003 23116142
## 10 Afghanistan 2004 24018682
## # ℹ 4,151 more rows
The names_to
and values_to
arguments are used to specify the names
of the new columns that will be created; they are optional, but can be
useful for clarity.