Data Center Apprenticeship: R basics: Summary statistics


June 2024

To get a descriptive statistic for a single variable in a tibble, we pass that variable to the relevant function, using $ to refer to a variable within the tibble.

mean(data$age)
## [1] 19.68276
median(data$age)
## [1] 19
sd(data$grade)
## [1] 1.281705
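
If a variable contains missing values, these functions return NA unless told to drop them. The following is a minimal sketch of that behaviour; the vector `scores` is a made-up example, not a column from the data above.

```r
# hypothetical vector with one missing value (not from the course data)
scores <- c(25, 50, 50, 75, NA, 100)

# a single NA makes the result NA
mean(scores)
## [1] NA

# na.rm = TRUE removes missing values before the calculation
mean(scores, na.rm = TRUE)
## [1] 60
median(scores, na.rm = TRUE)
## [1] 50
```

The same na.rm argument is available in sd(), min(), max() and most other base R descriptive functions.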

To get the frequencies of a categorical variable, we can use the count() function; adding the sort = TRUE argument returns the values in descending order of frequency. count() is a tidy function that works well with pipe workflows and can count the joint frequencies of multiple variables.

# frequencies of a single variable
count(data, reading)
## # A tibble: 2 × 2
##   reading     n
##   <lgl>   <int>
## 1 FALSE      76
## 2 TRUE       69

# joint frequency distribution
count(data, reading, listening, notes)
## # A tibble: 8 × 4
##   reading listening notes     n
##   <lgl>   <lgl>     <lgl> <int>
## 1 FALSE   FALSE     FALSE    14
## 2 FALSE   FALSE     TRUE     23
## 3 FALSE   TRUE      FALSE    20
## 4 FALSE   TRUE      TRUE     19
## 5 TRUE    FALSE     FALSE    19
## 6 TRUE    FALSE     TRUE     14
## 7 TRUE    TRUE      FALSE    15
## 8 TRUE    TRUE      TRUE     21
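
The sort = TRUE argument mentioned above can be sketched as follows, using count() inside a pipe. The tibble `toy` and its values are invented for illustration and are not the course dataset.

```r
library(dplyr)

# hypothetical toy data (not the course dataset)
toy <- tibble(
  sex = c("Female", "Male", "Male", "Female", "Male"),
  reading = c(TRUE, FALSE, TRUE, TRUE, TRUE)
)

# frequencies of one variable, most frequent category first
sex_counts <- toy |>
  count(sex, sort = TRUE)

# joint frequencies of two variables, sorted by descending n
joint_counts <- toy |>
  count(sex, reading, sort = TRUE)
```

Only combinations that actually occur in the data appear in the result, so `joint_counts` has three rows rather than four.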

To get the correlation coefficient between two variables, we can use the cor() function in the same way we used other descriptives such as mean().

cor(data$age, data$grade)
## [1] 0.1856025
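
Unlike mean(), cor() has no na.rm argument; missing values are handled with the use argument instead. A minimal sketch with two made-up vectors (`hours` and `score` are hypothetical names):

```r
# hypothetical vectors; one value of hours is missing
hours <- c(2, 4, 6, 8, NA)
score <- c(1, 2, 3, 4, 4)

# a missing value in either vector makes the result NA
cor(hours, score)
## [1] NA

# use = "complete.obs" keeps only the pairs where both values are present
pearson_r <- cor(hours, score, use = "complete.obs")

# method = "spearman" computes a rank-based correlation instead of Pearson's r
spearman_r <- cor(hours, score, use = "complete.obs", method = "spearman")
```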

The easiest way to get summary statistics of all variables in a tibble is with the summary() function: this function shows the distribution of numeric variables, the frequencies of logical variables, and the number of missing values for each variable that has any. Character variables such as sex are reported only by their length and class.

summary(data)
##        id            age            sex             scholarship    
##  Min.   :5001   Min.   :18.00   Length:145         Min.   : 25.00  
##  1st Qu.:5037   1st Qu.:18.00   Class :character   1st Qu.: 50.00  
##  Median :5073   Median :19.00   Mode  :character   Median : 50.00  
##  Mean   :5073   Mean   :19.68                      Mean   : 64.76  
##  3rd Qu.:5109   3rd Qu.:21.00                      3rd Qu.: 75.00  
##  Max.   :5145   Max.   :26.00                      Max.   :100.00  
##                                                    NA's   :1       
##  additional_work  reading          notes         listening      
##  Mode :logical   Mode :logical   Mode :logical   Mode :logical  
##  FALSE:96        FALSE:76        FALSE:68        FALSE:70       
##  TRUE :49        TRUE :69        TRUE :77        TRUE :75       
##                                                                 
##                                                                 
##                                                                 
##                                                                 
##      grade      
##  Min.   :0.000  
##  1st Qu.:1.500  
##  Median :3.000  
##  Mean   :2.755  
##  3rd Qu.:4.000  
##  Max.   :4.000  
## 
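
In the output above, summary() reports only the length and class of the character variable. One possible workaround, sketched here on a made-up vector, is to convert the variable to a factor first, which makes summary() report counts per category.

```r
# hypothetical character vector (not the course data)
sex <- c("Female", "Male", "Male", "Female", "Male")

# as a factor, summary() returns the frequency of each category
sex_summary <- summary(factor(sex))
sex_summary
## Female   Male 
##      2      3
```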

The summary() function is useful for viewing the data in the Console, but its output does not export cleanly outside of R. Several packages generate simple summary statistics tables that report the central tendency and dispersion of the data, such as vtable and stargazer (many more are available). These packages differ in their default settings, output types, and customization options.

library(vtable)
library(stargazer)

# vtable

data |> 
  # by default creates HTML table; out = "csv" returns a dataframe
  # can change which descriptives to keep
  # can report group-level descriptives
  sumtable(out = "csv", group = "reading")
##           Variable  N Mean  SD   N Mean  SD
## 1          reading No          Yes         
## 2               id 76 5080  44  69 5066  39
## 3              age 76   20 2.3  69   20 1.7
## 4              sex 76           69         
## 5       ... Female 32  42%      26  38%    
## 6         ... Male 44  58%      43  62%    
## 7      scholarship 75   65  18  69   64  21
## 8  additional_work 76           69         
## 9           ... No 47  62%      49  71%    
## 10         ... Yes 29  38%      20  29%    
## 11           notes 76           69         
## 12          ... No 34  45%      34  49%    
## 13         ... Yes 42  55%      35  51%    
## 14       listening 76           69         
## 15          ... No 37  49%      33  48%    
## 16         ... Yes 39  51%      36  52%    
## 17           grade 76  2.5 1.3  69    3 1.2

# stargazer

data |> 
  # input needs to be a data.frame, not tibble
  as.data.frame() |> 
  # default output is LaTeX table
  # can be exported with the out argument or a following write() function
  # can change which descriptives to keep or omit
  # limited to numeric and logical variables (character variables are dropped)
  stargazer(type = "text")
## 
## ==================================================
## Statistic        N    Mean    St. Dev.  Min   Max 
## --------------------------------------------------
## id              145 5,073.000  42.002  5,001 5,145
## age             145  19.683    1.992    18    26  
## scholarship     144  64.757    19.480   25    100 
## additional_work 145   0.338    0.475     0     1  
## reading         145   0.476    0.501     0     1  
## notes           145   0.531    0.501     0     1  
## listening       145   0.517    0.501     0     1  
## grade           145   2.755    1.282   0.000 4.000
## --------------------------------------------------
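
The comments above mention that both packages let us choose which descriptives to keep. The following is a rough sketch of how that might look, assuming the `vars` and `summ` arguments of sumtable() and the `summary.stat` and `out` arguments of stargazer() behave as described in their help files. The data frame `toy` and the file name are invented for illustration.

```r
library(vtable)
library(stargazer)

# hypothetical toy data (not the course dataset)
toy <- data.frame(
  age = c(18, 19, 21, 20, 26),
  grade = c(2.5, 3.0, 4.0, 1.5, 3.5)
)

# vtable: pick the variables and statistics to report
# out = "return" gives back a data frame instead of opening an HTML table
toy_summary <- sumtable(toy,
                        vars = c("age", "grade"),
                        summ = c("notNA(x)", "mean(x)", "median(x)"),
                        out = "return")

# stargazer: pick the statistics and save a plain-text copy to a file
out_file <- file.path(tempdir(), "descriptives.txt")
stargazer(toy,
          type = "text",
          summary.stat = c("n", "mean", "median"),
          out = out_file)
```

Check ?sumtable and ?stargazer for the full list of accepted statistics before relying on these settings.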

Alternatively, we can define our own summary statistics with the dplyr functions group_by() and summarize(). This makes it easy to calculate more complex descriptive statistics, including grouped statistics based on categorical variables. The across() helper function inside summarize() applies the same calculation to multiple variables at once: its first argument is the set of variables (potentially chosen with the help of selector functions) and its second argument is the function we’d like to apply.

# tibble of mean and sd for a single variable
data |> 
  summarize(mean_grade = mean(grade),
            sd_grade = sd(grade))
## # A tibble: 1 × 2
##   mean_grade sd_grade
##        <dbl>    <dbl>
## 1       2.76     1.28
# mean and sd of age and grade variables, grouped by reading
data |> 
  group_by(reading) |> 
  # .names allows overriding default option to reuse original column names
  summarize(across(c(age, grade), mean, .names = "mean_{.col}"),
            across(c(age, grade), sd, .names = "sd_{.col}"))
## # A tibble: 2 × 5
##   reading mean_age mean_grade sd_age sd_grade
##   <lgl>      <dbl>      <dbl>  <dbl>    <dbl>
## 1 FALSE       19.7       2.54   2.27     1.34
## 2 TRUE        19.7       2.99   1.66     1.18
# mean of all numeric variables, grouped by reading
data |> 
  group_by(reading) |> 
  # where() is a helper function evaluating the contents of variables
  # specify full function call with ~ at the start and .x replacing the variable name
  summarize(across(where(is.numeric), ~mean(.x, na.rm = TRUE)))
## # A tibble: 2 × 5
##   reading    id   age scholarship grade
##   <lgl>   <dbl> <dbl>       <dbl> <dbl>
## 1 FALSE   5080.  19.7        65    2.54
## 2 TRUE    5066.  19.7        64.5  2.99
# mean of all variables with names containing the letter a
data |> 
  summarize(across(contains("a"), ~mean(.x, na.rm = TRUE)))
## # A tibble: 1 × 5
##     age scholarship additional_work reading grade
##   <dbl>       <dbl>           <dbl>   <dbl> <dbl>
## 1  19.7        64.8           0.338   0.476  2.76
# sample size of each group and correlation between age and grade per group
data |> 
  group_by(reading, listening) |> 
  summarize(age_grade_correlation = cor(age, grade),
            n = n())
## # A tibble: 4 × 4
## # Groups:   reading [2]
##   reading listening age_grade_correlation     n
##   <lgl>   <lgl>                     <dbl> <int>
## 1 FALSE   FALSE                    0.267     37
## 2 FALSE   TRUE                     0.334     39
## 3 TRUE    FALSE                    0.0720    33
## 4 TRUE    TRUE                    -0.0777    36
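
Two variations on the examples above may be useful. First, across() also accepts a named list of functions, so several statistics can be computed in one call. Second, when grouping by more than one variable the result stays grouped by the first one (see the Groups line in the last output); `.groups = "drop"` returns an ungrouped tibble instead. A minimal sketch on invented data (`toy` is hypothetical):

```r
library(dplyr)

# hypothetical toy data (not the course dataset)
toy <- tibble(
  reading = rep(c(FALSE, TRUE), each = 4),
  listening = rep(c(FALSE, TRUE), times = 4),
  age = c(18, 19, 20, 21, 19, 22, 18, 20),
  grade = c(2, 3, 1, 4, 3, 4, 2, 3)
)

grouped_stats <- toy |>
  group_by(reading, listening) |>
  # a named list applies each function to each column
  # default column names follow the pattern {.col}_{.fn}, e.g. age_mean
  summarize(across(c(age, grade), list(mean = mean, sd = sd)),
            n = n(),
            # return an ungrouped tibble
            .groups = "drop")
```

The resulting columns are reading, listening, age_mean, age_sd, grade_mean, grade_sd and n.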

The helper functions that can be used instead of listing each variable to include or exclude are documented in the help file accessible with ?dplyr_tidy_select.
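
A few of these helpers in use, as a small sketch. The tibble `toy` and its column names are made up for illustration.

```r
library(dplyr)

# hypothetical toy data (not the course dataset)
toy <- tibble(
  id = 1:3,
  score_math = c(3, 4, 2),
  score_art = c(4, 4, 3),
  passed = c(TRUE, TRUE, FALSE)
)

# columns whose names start with "score_"
score_means <- toy |>
  summarize(across(starts_with("score_"), mean))

# all logical columns: the mean of a logical is the share of TRUE values
share_passed <- toy |>
  summarize(across(where(is.logical), mean))

# every column except id, excluded with !
no_id_means <- toy |>
  summarize(across(!id, mean))
```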

To export a descriptive statistics table, we can use the relevant write...() function shown in the data importing section (e.g. write_csv() for tibbles, the general write() for HTML, plain text, LaTeX, and other formats). CSV tables copy cleanly into e.g. MS Word. If using LaTeX or RMarkdown, the kable() function from the knitr package formats the table directly, so no manual formatting is needed afterwards.

data |> 
  count(reading, listening) |> 
  write_csv("table1.csv")

data |> 
  count(reading, listening) |> 
  # knitr:: allows using function from the package without library(knitr)
  knitr::kable()
|reading |listening |  n|
|:-------|:---------|--:|
|FALSE   |FALSE     | 37|
|FALSE   |TRUE      | 39|
|TRUE    |FALSE     | 33|
|TRUE    |TRUE      | 36|
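
For a LaTeX document, one option is to ask kable() for LaTeX code and save it to a file. This is a minimal sketch assuming the `format` argument of knitr::kable(); the data frame `tab` reuses the reading counts shown earlier, and the file name is an arbitrary choice.

```r
# small example table built from the reading counts shown earlier
tab <- data.frame(
  reading = c(FALSE, TRUE),
  n = c(76, 69)
)

# format = "latex" returns LaTeX code instead of the default markdown table
latex_table <- knitr::kable(tab, format = "latex")

# write the LaTeX code to a .tex file in a temporary folder
tex_file <- file.path(tempdir(), "table1.tex")
writeLines(as.character(latex_table), tex_file)
```

The saved file contains a tabular environment that can be included in a LaTeX document.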
