Video tutorial
Please watch this video (3:34), then read and follow along with the written tutorial below. Compare your own output to what you see printed below to make sure all of your code runs as expected.
Introduction
In some cases you might need to summarize your data – that is, collapse
a lot of information into a few key statistics– to understand it better
and to compare different groups. In this tutorial, we show you how to
summarize data within groups using the group_by()
and summarize()
functions from tidyverse
functions and the diamonds
dataset (which
comes pre-loaded with tidyverse
so you don’t need to import it).
Let’s load the tidyverse
package and have a look at the diamonds
dataset:
# load tidyverse
library(tidyverse)
# add diamonds to the environment
data(diamonds)
Default summary
The summary()
function provides a quick overview of the data, printed
in the console. For numeric variables, it shows the minimum, 1st
quartile, median, mean, 3rd quartile, maximum, and the number of missing
values. Let’s use the summary()
function to get an overview of the
diamonds
dataset:
# get a summary of the diamonds dataset
summary(diamonds)
## carat cut color clarity depth
## Min. :0.2000 Fair : 1610 D: 6775 SI1 :13065 Min. :43.00
## 1st Qu.:0.4000 Good : 4906 E: 9797 VS2 :12258 1st Qu.:61.00
## Median :0.7000 Very Good:12082 F: 9542 SI2 : 9194 Median :61.80
## Mean :0.7979 Premium :13791 G:11292 VS1 : 8171 Mean :61.75
## 3rd Qu.:1.0400 Ideal :21551 H: 8304 VVS2 : 5066 3rd Qu.:62.50
## Max. :5.0100 I: 5422 VVS1 : 3655 Max. :79.00
## J: 2808 (Other): 2531
## table price x y
## Min. :43.00 Min. : 326 Min. : 0.000 Min. : 0.000
## 1st Qu.:56.00 1st Qu.: 950 1st Qu.: 4.710 1st Qu.: 4.720
## Median :57.00 Median : 2401 Median : 5.700 Median : 5.710
## Mean :57.46 Mean : 3933 Mean : 5.731 Mean : 5.735
## 3rd Qu.:59.00 3rd Qu.: 5324 3rd Qu.: 6.540 3rd Qu.: 6.540
## Max. :95.00 Max. :18823 Max. :10.740 Max. :58.900
##
## z
## Min. : 0.000
## 1st Qu.: 2.910
## Median : 3.530
## Mean : 3.539
## 3rd Qu.: 4.040
## Max. :31.800
##
Summarizing the data
The summarize()
function can produce the same summary statistics as
the summary()
function and more, and it allows you to save the results
to a new dataset. The summarize()
function uses the argument structure
summarize(data, variable = expression)
where data
is the dataset you
want to summarize, variable
is the name of the new variable, and
expression
is the calculation you want to perform. (Note that it has
the same argument structure as the mutate()
function for creating new
variables. The difference is that mutate()
calculates one value per
observation and summarize()
calculates one value for the whole
dataset.)
For example, let’s calculate the mean price of diamonds:
# calculate the mean price of diamonds
summarize(diamonds, mean_price = mean(price))
## # A tibble: 1 × 1
## mean_price
## <dbl>
## 1 3933.
You can also calculate multiple summary statistics at once by separating them with a comma (and as a good practice, a new line).
Let’s calculate the mean price and the number of observations in the dataset:
# calculate the mean price and the number of observations
summarize(diamonds,
mean_price = mean(price),
n = n())
## # A tibble: 1 × 2
## mean_price n
## <dbl> <int>
## 1 3933. 53940
To be able to work with this new dataset, you need to save it to a new
object. Let’s assign the result of the summarize()
function to a new
object called diamonds_summary
:
# save the result to a new object
diamonds_summary <- summarize(diamonds,
mean_price = mean(price),
n = n())
Summarizing data within groups
The group_by()
function is used to group the data by one or more
variables. Then summary statistics can be calculated within each group
using the summarize()
function in the same way as before. This is
useful when you want to compare different groups in your data, for
example the mean price of diamonds by cut.
Note that the following code uses the pipe operator |>
to chain the
functions together. The pipe operator is used to pass the output of one
function as the first input to the next function, making the code more
readable. To read more about the pipe operator, see the tutorial on the
tidy workflow.
# start with the diamonds tibble
diamonds |>
# group by cut
group_by(cut) |>
# calculate the mean price within each group
summarize(mean_price = mean(price))
## # A tibble: 5 × 2
## cut mean_price
## <ord> <dbl>
## 1 Fair 4359.
## 2 Good 3929.
## 3 Very Good 3982.
## 4 Premium 4584.
## 5 Ideal 3458.