Data Center Apprenticeship: Statistics in R


January 2025

Introduction

This document provides an overview of the basic statistical functions in R, including descriptive statistics and summary tables. We work with some example data of student characteristics and grades. You can import the data from a CSV file with the following code:

library(tidyverse)
data <- read_csv("https://github.com/ucrdatacenter/projects/raw/refs/heads/main/apprenticeship/2025h1/student_data.csv")

Descriptive statistics

Summaries in the R Console

To get a descriptive statistic of a single variable in a tibble, we can pass that variable to the relevant function, using $ to refer to the variable by name.

mean(data$age)
## [1] 19.68276
median(data$age)
## [1] 19
sd(data$grade)
## [1] 1.281705
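
If a variable contains missing values, these functions return NA by default; adding na.rm = TRUE drops the missing values before the calculation. For example, scholarship has one missing value:

mean(data$scholarship)
## [1] NA
mean(data$scholarship, na.rm = TRUE)
## [1] 64.75694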

To get the frequencies of a categorical variable, we can use the count() function, with the sort = TRUE argument returning the values in descending frequency. count() is a tidy function that works well with pipe workflows and can count the joint frequencies of multiple variables.

# frequencies of a single variable
count(data, reading)
## # A tibble: 2 × 2
##   reading     n
##   <lgl>   <int>
## 1 FALSE      76
## 2 TRUE       69
# joint frequency distribution
count(data, reading, listening, notes)
## # A tibble: 8 × 4
##   reading listening notes     n
##   <lgl>   <lgl>     <lgl> <int>
## 1 FALSE   FALSE     FALSE    14
## 2 FALSE   FALSE     TRUE     23
## 3 FALSE   TRUE      FALSE    20
## 4 FALSE   TRUE      TRUE     19
## 5 TRUE    FALSE     FALSE    19
## 6 TRUE    FALSE     TRUE     14
## 7 TRUE    TRUE      FALSE    15
## 8 TRUE    TRUE      TRUE     21

To get the correlation coefficient between two variables, we can use the cor() function in the same way as other descriptive functions such as mean().

cor(data$age, data$grade)
## [1] 0.1856025

The easiest way to get summary statistics of all variables in a tibble is with the summary() function: it shows the distribution of numeric variables, the counts of logical variables, and the number of missing values for each variable (character variables are only described by their length and type).

summary(data)
##        id            age            sex             scholarship    
##  Min.   :5001   Min.   :18.00   Length:145         Min.   : 25.00  
##  1st Qu.:5037   1st Qu.:18.00   Class :character   1st Qu.: 50.00  
##  Median :5073   Median :19.00   Mode  :character   Median : 50.00  
##  Mean   :5073   Mean   :19.68                      Mean   : 64.76  
##  3rd Qu.:5109   3rd Qu.:21.00                      3rd Qu.: 75.00  
##  Max.   :5145   Max.   :26.00                      Max.   :100.00  
##                                                    NA's   :1       
##  additional_work  reading          notes         listening      
##  Mode :logical   Mode :logical   Mode :logical   Mode :logical  
##  FALSE:96        FALSE:76        FALSE:68        FALSE:70       
##  TRUE :49        TRUE :69        TRUE :77        TRUE :75       
##                                                                 
##                                                                 
##                                                                 
##                                                                 
##      grade      
##  Min.   :0.000  
##  1st Qu.:1.500  
##  Median :3.000  
##  Mean   :2.755  
##  3rd Qu.:4.000  
##  Max.   :4.000  
## 

Publication-ready summaries with gtsummary

The summary() function is useful for viewing the data in the Console, but its output does not export nicely outside of R. A few packages, such as gtsummary, generate simple summary statistics tables describing the central tendency and dispersion of the data that also export well.

library(gtsummary)

# tbl_summary() creates a summary table of the data
tbl_summary(data)
Characteristic      N = 145¹
id                  5,073 (5,037, 5,109)
age
    18              65 (45%)
    19              17 (12%)
    20              15 (10%)
    21              21 (14%)
    22              17 (12%)
    23              1 (0.7%)
    24              4 (2.8%)
    25              3 (2.1%)
    26              2 (1.4%)
sex
    Female          58 (40%)
    Male            87 (60%)
scholarship
    25              3 (2.1%)
    50              76 (53%)
    75              42 (29%)
    100             23 (16%)
    Unknown         1
additional_work     49 (34%)
reading             69 (48%)
notes               77 (53%)
listening           75 (52%)
grade
    0               8 (5.5%)
    1               17 (12%)
    1.5             13 (9.0%)
    2               17 (12%)
    3               31 (21%)
    4               59 (41%)
¹ Median (Q1, Q3); n (%)

You can stratify your summary by a grouping variable using the by argument:

tbl_summary(data, by = sex)
Characteristic      Female, N = 58¹         Male, N = 87¹
id                  5,115 (5,053, 5,131)    5,060 (5,032, 5,088)
age
    18              29 (50%)                36 (41%)
    19              8 (14%)                 9 (10%)
    20              2 (3.4%)                13 (15%)
    21              10 (17%)                11 (13%)
    22              8 (14%)                 9 (10%)
    23              0 (0%)                  1 (1.1%)
    24              0 (0%)                  4 (4.6%)
    25              0 (0%)                  3 (3.4%)
    26              1 (1.7%)                1 (1.1%)
scholarship
    25              0 (0%)                  3 (3.5%)
    50              26 (45%)                50 (58%)
    75              19 (33%)                23 (27%)
    100             13 (22%)                10 (12%)
    Unknown         0                       1
additional_work     26 (45%)                23 (26%)
reading             26 (45%)                43 (49%)
notes               33 (57%)                44 (51%)
listening           31 (53%)                44 (51%)
grade
    0               8 (14%)                 0 (0%)
    1               0 (0%)                  17 (20%)
    1.5             4 (6.9%)                9 (10%)
    2               6 (10%)                 11 (13%)
    3               12 (21%)                19 (22%)
    4               28 (48%)                31 (36%)
¹ Median (Q1, Q3); n (%)
# add p-value of difference between groups
tbl_summary(data, by = sex) |> 
  add_p()
Characteristic      Female, N = 58¹         Male, N = 87¹           p-value²
id                  5,115 (5,053, 5,131)    5,060 (5,032, 5,088)    <0.001
age                                                                 0.14
    18              29 (50%)                36 (41%)
    19              8 (14%)                 9 (10%)
    20              2 (3.4%)                13 (15%)
    21              10 (17%)                11 (13%)
    22              8 (14%)                 9 (10%)
    23              0 (0%)                  1 (1.1%)
    24              0 (0%)                  4 (4.6%)
    25              0 (0%)                  3 (3.4%)
    26              1 (1.7%)                1 (1.1%)
scholarship                                                         0.11
    25              0 (0%)                  3 (3.5%)
    50              26 (45%)                50 (58%)
    75              19 (33%)                23 (27%)
    100             13 (22%)                10 (12%)
    Unknown         0                       1
additional_work     26 (45%)                23 (26%)                0.022
reading             26 (45%)                43 (49%)                0.6
notes               33 (57%)                44 (51%)                0.5
listening           31 (53%)                44 (51%)                0.7
grade                                                               <0.001
    0               8 (14%)                 0 (0%)
    1               0 (0%)                  17 (20%)
    1.5             4 (6.9%)                9 (10%)
    2               6 (10%)                 11 (13%)
    3               12 (21%)                19 (22%)
    4               28 (48%)                31 (36%)
¹ Median (Q1, Q3); n (%)
² Wilcoxon rank sum test; Fisher's exact test; Pearson's Chi-squared test
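
By default add_p() chooses a test based on the variable type (see the second table footnote); the test argument overrides this choice. A minimal sketch, assuming we prefer t-tests for the continuous variables (output omitted):

tbl_summary(data, by = sex) |> 
  add_p(test = list(all_continuous() ~ "t.test"))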

You can also customize the appearance of the table.

tbl_summary(data, by = sex) |> 
  modify_header(label ~ "Variable") |> 
  modify_caption("Summary Table by Sex")
Summary Table by Sex
Variable            Female, N = 58¹         Male, N = 87¹
id                  5,115 (5,053, 5,131)    5,060 (5,032, 5,088)
age
    18              29 (50%)                36 (41%)
    19              8 (14%)                 9 (10%)
    20              2 (3.4%)                13 (15%)
    21              10 (17%)                11 (13%)
    22              8 (14%)                 9 (10%)
    23              0 (0%)                  1 (1.1%)
    24              0 (0%)                  4 (4.6%)
    25              0 (0%)                  3 (3.4%)
    26              1 (1.7%)                1 (1.1%)
scholarship
    25              0 (0%)                  3 (3.5%)
    50              26 (45%)                50 (58%)
    75              19 (33%)                23 (27%)
    100             13 (22%)                10 (12%)
    Unknown         0                       1
additional_work     26 (45%)                23 (26%)
reading             26 (45%)                43 (49%)
notes               33 (57%)                44 (51%)
listening           31 (53%)                44 (51%)
grade
    0               8 (14%)                 0 (0%)
    1               0 (0%)                  17 (20%)
    1.5             4 (6.9%)                9 (10%)
    2               6 (10%)                 11 (13%)
    3               12 (21%)                19 (22%)
    4               28 (48%)                31 (36%)
¹ Median (Q1, Q3); n (%)
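
Other gtsummary helpers can be chained on in the same way; for example, add_overall() adds a column of ungrouped statistics and bold_labels() bolds the variable names (output omitted):

tbl_summary(data, by = sex) |> 
  add_overall() |> 
  bold_labels()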

To export the table as a Word document, use the gtsave() function. Note that we first convert the tbl_summary() output to a gt object with as_gt(), and load the gt package, since the Word export function is defined there.

library(gt)

tbl_summary(data, by = sex) |> 
  as_gt() %>%
  gtsave("summary_table.docx")

If you use LaTeX, you can export a LaTeX table in the same way, again relying on the gt package.

tbl_summary(data, by = sex) |> 
  as_gt() %>%
  gtsave("summary_table.tex")

Creating custom summary tables

Alternatively, we can define our own summary statistics with the dplyr functions group_by() and summarize(), which also makes it easy to calculate more complex descriptive statistics, including grouped statistics based on categorical variables. The across() helper function inside summarize() applies the same calculation to multiple variables at once: its first argument selects the variables (possibly with the help of selector functions), and its second argument is the function to apply.

# tibble of mean and sd for a single variable
data |> 
  summarize(mean_grade = mean(grade),
            sd_grade = sd(grade))
## # A tibble: 1 × 2
##   mean_grade sd_grade
##        <dbl>    <dbl>
## 1       2.76     1.28
# mean and sd of age and grade variables, grouped by reading
data |> 
  group_by(reading) |> 
  # .names overrides the default of keeping the original column names
  summarize(across(c(age, grade), mean, .names = "mean_{.col}"),
            across(c(age, grade), sd, .names = "sd_{.col}"))
## # A tibble: 2 × 5
##   reading mean_age mean_grade sd_age sd_grade
##   <lgl>      <dbl>      <dbl>  <dbl>    <dbl>
## 1 FALSE       19.7       2.54   2.27     1.34
## 2 TRUE        19.7       2.99   1.66     1.18
# mean of all numeric variables, grouped by reading
data |> 
  group_by(reading) |> 
  # where() is a helper function evaluating the contents of variables
  # specify full function call with ~ at the start and .x replacing the variable name
  summarize(across(where(is.numeric), ~mean(.x, na.rm = TRUE)))
## # A tibble: 2 × 5
##   reading    id   age scholarship grade
##   <lgl>   <dbl> <dbl>       <dbl> <dbl>
## 1 FALSE   5080.  19.7        65    2.54
## 2 TRUE    5066.  19.7        64.5  2.99
# mean of all variables with names containing the letter a
data |> 
  summarize(across(contains("a"), ~mean(.x, na.rm = TRUE)))
## # A tibble: 1 × 5
##     age scholarship additional_work reading grade
##   <dbl>       <dbl>           <dbl>   <dbl> <dbl>
## 1  19.7        64.8           0.338   0.476  2.76
# sample size of each group and correlation between age and grade per group
data |> 
  group_by(reading, listening) |> 
  summarize(age_grade_correlation = cor(age, grade),
            n = n())
## # A tibble: 4 × 4
## # Groups:   reading [2]
##   reading listening age_grade_correlation     n
##   <lgl>   <lgl>                     <dbl> <int>
## 1 FALSE   FALSE                    0.267     37
## 2 FALSE   TRUE                     0.334     39
## 3 TRUE    FALSE                    0.0720    33
## 4 TRUE    TRUE                    -0.0777    36

The full list of selection helpers that can be used instead of spelling out which variables to include or exclude is in the help file accessible with ?dplyr_tidy_select.
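
For instance, a small sketch using the ends_with() helper to select all variables whose names end with "e" (here age and grade; output omitted):

data |> 
  summarize(across(ends_with("e"), ~mean(.x, na.rm = TRUE)))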

To export a descriptive statistics table, we can use the relevant write function shown in the data importing section (e.g. write_csv() for tibbles, or the general write() for HTML, plain text, LaTeX, and other text-based formats). CSV tables already copy nicely into e.g. MS Word. Alternatively, the gt package can be used to export the table as a Word or LaTeX document in the same way as before.

data |> 
  count(reading, listening) |> 
  write_csv("table1.csv")
data |> 
  count(reading, listening) |> 
  gt() |> 
  gtsave("table1.docx")

Hypothesis testing

Most of the simple statistical tests come from base R, so they don't follow tidy principles, but many are compatible with tidy workflows to at least some extent. In the following we'll cover some of the key methods that show up in methods and statistics courses at UCR. In addition, the tidy() function from the broom package converts the output of most tests and models into simple tibbles, which are useful for exporting and visualizing results.

library(broom)

t-tests

We can run one- and two-sample t-tests to evaluate group means with the t.test() function. The function supports various options and model specifications: a simple one-sample t-test only requires the variable of interest, e.g. x = data$variable. For two-sample t-tests, we can use the formula syntax y ~ x together with the data argument to specify the dependent variable and the grouping variable, or supply the two samples as the x and y arguments. Additional options include the alternative hypothesis, the confidence level, the hypothesized mean mu, whether we want a paired t-test, and whether to assume equal variances (see the example after the output below). Helper functions such as tidy() convert the results to an easy-to-export tibble.

# simple t-test (H0: mean=mu)
t.test(data$scholarship, mu = 50)
## 
##  One Sample t-test
## 
## data:  data$scholarship
## t = 9.0903, df = 143, p-value = 7.493e-16
## alternative hypothesis: true mean is not equal to 50
## 95 percent confidence interval:
##  61.54805 67.96584
## sample estimates:
## mean of x 
##  64.75694
# can also use data argument instead of data$...
# grade ~ reading is formula specification: variable ~ group
t.test(grade ~ reading, alternative = "greater", data = data)
## 
##  Welch Two Sample t-test
## 
## data:  grade by reading
## t = -2.1671, df = 142.85, p-value = 0.9841
## alternative hypothesis: true difference in means between group FALSE and group TRUE is greater than 0
## 95 percent confidence interval:
##  -0.7995718        Inf
## sample estimates:
## mean in group FALSE  mean in group TRUE 
##            2.539474            2.992754
# tidy() function turns results into a tibble
t.test(grade ~ reading, alternative = "greater", data = data) |> tidy()
## # A tibble: 1 × 10
##   estimate estimate1 estimate2 statistic p.value parameter conf.low conf.high
##      <dbl>     <dbl>     <dbl>     <dbl>   <dbl>     <dbl>    <dbl>     <dbl>
## 1   -0.453      2.54      2.99     -2.17   0.984      143.   -0.800       Inf
## # ℹ 2 more variables: method <chr>, alternative <chr>
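
The other optional arguments follow the same pattern; for instance, setting var.equal = TRUE runs a Student's t-test with pooled variance instead of the default Welch test (output omitted):

# two-sample t-test assuming equal variances
t.test(grade ~ reading, var.equal = TRUE, data = data)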

Correlation test

The cor.test() function calculates the correlation between two variables. Again, the function supports various specifications: x and y arguments, formula syntax (see below for an example), and the data argument.

cor.test( ~ grade + age, data = data)
## 
##  Pearson's product-moment correlation
## 
## data:  grade and age
## t = 2.2587, df = 143, p-value = 0.02541
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.02329811 0.33837421
## sample estimates:
##       cor 
## 0.1856025
# tidy() function turns results into a tibble
cor.test( ~ grade + age, data = data) |> tidy()
## # A tibble: 1 × 8
##   estimate statistic p.value parameter conf.low conf.high method     alternative
##      <dbl>     <dbl>   <dbl>     <int>    <dbl>     <dbl> <chr>      <chr>      
## 1    0.186      2.26  0.0254       143   0.0233     0.338 Pearson's… two.sided
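
For example, the same test can also be specified with the x and y arguments, and the method argument switches to a rank-based correlation (output omitted):

# equivalent specification with x and y arguments
cor.test(data$grade, data$age)
# Spearman rank correlation instead of the default Pearson
cor.test( ~ grade + age, data = data, method = "spearman")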

Simple regression

A key building block of statistical analysis is linear regression. The lm() function fits a linear model to the data, with a wide range of specifications passed via the formula argument (the first argument if unnamed). The formula syntax is y ~ x, where y is the dependent variable and x is the independent variable. Again, optional arguments allow for a lot of customization, but the defaults are sufficient for most applications. Helper functions such as tidy() and summary() provide extensive summaries of the model fit and coefficients, and tbl_regression() from the gtsummary package creates neat tables of the results. Assigning the result of a model to an object saves computational time, as we can then work with the results without re-running the analysis every time.

# assign outcome to object
fit <- lm(grade ~ age, data = data)

# extensive result summary
fit |> summary()
## 
## Call:
## lm(formula = grade ~ age, data = data)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -2.9125 -1.0542  0.3264  1.0875  1.4458 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)  
## (Intercept)  0.40464    1.04592   0.387   0.6994  
## age          0.11942    0.05287   2.259   0.0254 *
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.264 on 143 degrees of freedom
## Multiple R-squared:  0.03445,    Adjusted R-squared:  0.0277 
## F-statistic: 5.102 on 1 and 143 DF,  p-value: 0.02541
# tidy coefficients
fit |> tidy()
## # A tibble: 2 × 5
##   term        estimate std.error statistic p.value
##   <chr>          <dbl>     <dbl>     <dbl>   <dbl>
## 1 (Intercept)    0.405    1.05       0.387  0.699 
## 2 age            0.119    0.0529     2.26   0.0254
# display-ready table with gtsummary
tbl_regression(fit)
Characteristic      Beta    95% CI¹         p-value
age                 0.12    0.01, 0.22      0.025
¹ CI = Confidence Interval
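
The glance() function from broom complements tidy() by returning model-level fit statistics such as the R-squared and AIC as a one-row tibble, which is convenient when comparing models (output omitted):

# model-level fit statistics
fit |> glance()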

Multiple regression

Multiple regression extends simple regression to multiple independent variables. The only difference is the formula specification, which now connects multiple independent variables with + signs. The formula specification also allows for interactions between variables, specified with * (which includes the main effects) or : (which includes only the interaction term; see the sketch after the regression output below). The y ~ . syntax includes all variables in the data except the dependent variable as independent variables.

lm(grade ~ age + scholarship, data = data) |> summary()
## 
## Call:
## lm(formula = grade ~ age + scholarship, data = data)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -2.8845 -1.0609  0.3735  1.1478  1.5025 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)  
## (Intercept) 0.048581   1.292374   0.038    0.970  
## age         0.129004   0.056549   2.281    0.024 *
## scholarship 0.002536   0.005788   0.438    0.662  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.271 on 141 degrees of freedom
##   (1 observation deleted due to missingness)
## Multiple R-squared:  0.03636,    Adjusted R-squared:  0.0227 
## F-statistic:  2.66 on 2 and 141 DF,  p-value: 0.07343
# all variables in data
lm(grade ~ ., data = data) |> summary()
## 
## Call:
## lm(formula = grade ~ ., data = data)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -3.5165 -0.9030  0.2786  0.9096  2.1809 
## 
## Coefficients:
##                      Estimate Std. Error t value Pr(>|t|)   
## (Intercept)         45.780346  14.038274   3.261  0.00141 **
## id                  -0.009043   0.002749  -3.290  0.00128 **
## age                  0.128874   0.056319   2.288  0.02367 * 
## sexMale             -0.517980   0.230416  -2.248  0.02620 * 
## scholarship          0.005147   0.005825   0.883  0.37855   
## additional_workTRUE -0.041568   0.231178  -0.180  0.85757   
## readingTRUE          0.361383   0.206867   1.747  0.08292 . 
## notesTRUE            0.011816   0.210672   0.056  0.95536   
## listeningTRUE        0.225988   0.209844   1.077  0.28343   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.218 on 135 degrees of freedom
##   (1 observation deleted due to missingness)
## Multiple R-squared:  0.1525, Adjusted R-squared:  0.1023 
## F-statistic: 3.037 on 8 and 135 DF,  p-value: 0.003596
# interactions
lm(grade ~ age * scholarship, data = data) |> summary()
## 
## Call:
## lm(formula = grade ~ age * scholarship, data = data)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -2.8931 -1.0857  0.2747  1.1853  1.5539 
## 
## Coefficients:
##                  Estimate Std. Error t value Pr(>|t|)
## (Intercept)     -2.554899   3.814695  -0.670    0.504
## age              0.262316   0.192274   1.364    0.175
## scholarship      0.046368   0.060690   0.764    0.446
## age:scholarship -0.002266   0.003123  -0.726    0.469
## 
## Residual standard error: 1.273 on 140 degrees of freedom
##   (1 observation deleted due to missingness)
## Multiple R-squared:  0.03997,    Adjusted R-squared:  0.0194 
## F-statistic: 1.943 on 3 and 140 DF,  p-value: 0.1254
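
For comparison, a sketch of the interaction-only specification with the : operator, which omits the main effects (output omitted):

# interaction term only, without main effects
lm(grade ~ age:scholarship, data = data) |> summary()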

ANOVA

Analysis of variance (ANOVA) is a generalization of the t-test to multiple groups. The aov() function fits an ANOVA model to the data, with the formula syntax y ~ x, where y is the dependent variable and x is the independent variable. The same helper functions as with lm() can be used to summarize the results.

Note that ANOVA is a specific case of a linear regression model, so the results are equivalent to those of a linear regression model with a categorical independent variable.

anova_fit <- aov(grade ~ reading, data = data)

summary(anova_fit)
##              Df Sum Sq Mean Sq F value Pr(>F)  
## reading       1   7.43   7.431   4.638  0.033 *
## Residuals   143 229.13   1.602                 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
tidy(anova_fit)
## # A tibble: 2 × 6
##   term         df  sumsq meansq statistic p.value
##   <chr>     <dbl>  <dbl>  <dbl>     <dbl>   <dbl>
## 1 reading       1   7.43   7.43      4.64  0.0330
## 2 Residuals   143 229.     1.60     NA    NA
# equivalent regression
lm(grade ~ reading, data = data) |> summary()
## 
## Call:
## lm(formula = grade ~ reading, data = data)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -2.9928 -1.0395  0.4605  1.0072  1.4605 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   2.5395     0.1452  17.490   <2e-16 ***
## readingTRUE   0.4533     0.2105   2.153    0.033 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.266 on 143 degrees of freedom
## Multiple R-squared:  0.03141,    Adjusted R-squared:  0.02464 
## F-statistic: 4.638 on 1 and 143 DF,  p-value: 0.03296

Chi-square test

The chi-square test is used to test the independence of two categorical variables. The chisq.test() function calculates the chi-square statistic and the p-value for the test. Unlike the previous functions, chisq.test() does not accept a data argument, which makes it harder to use in tidy workflows (though see the workaround below).

chisq.test(data$reading, data$notes)
## 
##  Pearson's Chi-squared test with Yates' continuity correction
## 
## data:  data$reading and data$notes
## X-squared = 0.14464, df = 1, p-value = 0.7037
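
One workaround is to wrap the call in with(); the result is a standard htest object, so tidy() from broom still converts it to a tibble (output omitted):

# chi-square test in a more pipe-friendly form
with(data, chisq.test(reading, notes)) |> tidy()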

Contingency tables for the two variables can be constructed in several ways:

# with table()
table(data$reading, data$notes)
##        
##         FALSE TRUE
##   FALSE    34   42
##   TRUE     34   35
# with xtabs()
xtabs(~ reading + notes, data = data)
##        notes
## reading FALSE TRUE
##   FALSE    34   42
##   TRUE     34   35
# with count() and pivot_wider()
data |> 
  count(reading, notes) |>
  pivot_wider(names_from = notes, values_from = n, values_fill = 0)
## # A tibble: 2 × 3
##   reading `FALSE` `TRUE`
##   <lgl>     <int>  <int>
## 1 FALSE        34     42
## 2 TRUE         34     35
# proportions with table() and prop.table()
table(data$reading, data$notes) |> prop.table()
##        
##             FALSE      TRUE
##   FALSE 0.2344828 0.2896552
##   TRUE  0.2344828 0.2413793
# proportions with count() and pivot_wider()
data |> 
  count(reading, notes) |>
  mutate(prop = n / sum(n)) |>
  select(-n) |> 
  pivot_wider(names_from = notes, values_from = prop, values_fill = 0)
## # A tibble: 2 × 3
##   reading `FALSE` `TRUE`
##   <lgl>     <dbl>  <dbl>
## 1 FALSE     0.234  0.290
## 2 TRUE      0.234  0.241

Logistic regression

When it comes to predicting binary outcomes, linear regression has some problems, such as predicting values outside the 0-1 range, so in those cases we often use logistic regression instead. The glm() function fits generalized linear models, including logistic regression. Other than the family argument, which specifies the distribution of the dependent variable, the function works the same way as lm(). A logistic regression uses family = "binomial".

logit_fit <- glm(reading ~ age, data = data, family = "binomial")

logit_fit |> summary()
## 
## Call:
## glm(formula = reading ~ age, family = "binomial", data = data)
## 
## Coefficients:
##              Estimate Std. Error z value Pr(>|z|)
## (Intercept) -0.219462   1.656757  -0.132    0.895
## age          0.006241   0.083743   0.075    0.941
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 200.67  on 144  degrees of freedom
## Residual deviance: 200.67  on 143  degrees of freedom
## AIC: 204.67
## 
## Number of Fisher Scoring iterations: 3
logit_fit |> tidy()
## # A tibble: 2 × 5
##   term        estimate std.error statistic p.value
##   <chr>          <dbl>     <dbl>     <dbl>   <dbl>
## 1 (Intercept) -0.219      1.66     -0.132    0.895
## 2 age          0.00624    0.0837    0.0745   0.941
# display-ready table with gtsummary
tbl_regression(logit_fit)
Characteristic      log(OR)¹    95% CI¹         p-value
age                 0.01        -0.16, 0.17     >0.9
¹ OR = Odds Ratio, CI = Confidence Interval
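
The coefficients are on the log-odds scale; predict() with type = "response" converts fitted values to predicted probabilities between 0 and 1, and tbl_regression() can report odds ratios via exponentiate = TRUE (output omitted):

# predicted probability of reading = TRUE for the first observations
predict(logit_fit, type = "response") |> head()
# odds ratios instead of log-odds in the summary table
tbl_regression(logit_fit, exponentiate = TRUE)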

Non-parametric tests

Non-parametric tests are used when the assumptions of parametric tests are violated. Running them in R follows the same structure as running the parametric alternative, apart from the function name and some differing optional arguments.

# Wilcoxon signed-rank test
wilcox.test(data$grade, mu = 2)
## 
##  Wilcoxon signed rank test with continuity correction
## 
## data:  data$grade
## V = 6767.5, p-value = 1.238e-10
## alternative hypothesis: true location is not equal to 2
# Mann-Whitney U test
wilcox.test(grade ~ reading, data = data)
## 
##  Wilcoxon rank sum test with continuity correction
## 
## data:  grade by reading
## W = 2119, p-value = 0.03795
## alternative hypothesis: true location shift is not equal to 0
# Kruskal-Wallis test (scholarship is a number in the data so convert to factor for this example)
kruskal.test(grade ~ factor(scholarship), data = data)
## 
##  Kruskal-Wallis rank sum test
## 
## data:  grade by factor(scholarship)
## Kruskal-Wallis chi-squared = 20.368, df = 3, p-value = 0.0001424

PCA and factor analysis

Principal component analysis (PCA) and factor analysis are used to reduce the dimensionality of a dataset. The prcomp() function fits a PCA model to the data, and the factanal() function fits a factor analysis model. Both functions work with the formula syntax, and the data argument can be used to specify the data frame.

# PCA
pca_fit <- prcomp(~ grade + age + scholarship, data = data)

# summary of PCA
summary(pca_fit)
## Importance of components:
##                            PC1     PC2     PC3
## Standard deviation     19.4919 1.90658 1.24411
## Proportion of Variance  0.9865 0.00944 0.00402
## Cumulative Proportion   0.9865 0.99598 1.00000
# principal component scores for each observation
tidy(pca_fit)
## # A tibble: 432 × 3
##    row      PC    value
##    <chr> <dbl>    <dbl>
##  1 1         1 -14.8   
##  2 1         2   1.05  
##  3 1         3  -1.01  
##  4 2         1 -14.8   
##  5 2         2   0.0756
##  6 2         3  -1.23  
##  7 3         1 -14.8   
##  8 3         2   1.05  
##  9 3         3  -1.01  
## 10 4         1 -14.7   
## # ℹ 422 more rows
# Factor analysis
fa_fit <- factanal(~ grade + age + scholarship, factors = 1, data = data)

# summary of factor analysis
fa_fit
## 
## Call:
## factanal(x = ~grade + age + scholarship, factors = 1, data = data)
## 
## Uniquenesses:
##       grade         age scholarship 
##       0.965       0.005       0.888 
## 
## Loadings:
##             Factor1
## grade        0.188 
## age          0.997 
## scholarship -0.334 
## 
##                Factor1
## SS loadings      1.142
## Proportion Var   0.381
## 
## The degrees of freedom for the model is 0 and the fit was 0.0014
tidy(fa_fit)
## # A tibble: 3 × 3
##   variable    uniqueness    fl1
##   <chr>            <dbl>  <dbl>
## 1 grade            0.965  0.188
## 2 age              0.005  0.997
## 3 scholarship      0.888 -0.334
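
Note that prcomp() does not standardize the variables by default, so the first component above mostly reflects scholarship, the variable with by far the largest variance. Adding scale. = TRUE (note the dot) runs the PCA on standardized variables instead (a sketch; output omitted):

# PCA on standardized variables (i.e. on the correlation matrix)
prcomp(~ grade + age + scholarship, data = data, scale. = TRUE) |> summary()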