Data Center Apprenticeship: R basics: Hypothesis testing / modelling


June 2024

Most of the simple statistical tests come from base R, so they don't rely on tidy principles, but many are compatible with tidy workflows to at least some extent. In the following we'll cover some of the key methods that show up in methods and statistics courses at UCR. In addition, the tidy() function from the broom package converts the output of most of these tests into simple tibbles, which are useful for exporting and visualizing results.

t-tests

We can run one- and two-sample t-tests to evaluate group means with the t.test() function. The function supports various options and model specifications: a simple one-sample t-test only requires specifying the variable of interest, either with the x = data$variable or the x = variable, data = data syntax. For two-sample t-tests, we can use the formula syntax y ~ x to specify the dependent and independent variables, or the x and y (and optionally data) arguments. Additional options include the alternative hypothesis, the confidence level, the value of $\mu$, whether we want a paired t-test, and whether to assume equal variances. Helper functions such as tidy() convert the test results into an easy-to-export tibble.

library(broom) # for tidy() function

# simple t-test (H0: mean=mu)
t.test(data$scholarship, mu = 50)
## 
##  One Sample t-test
## 
## data:  data$scholarship
## t = 9.0903, df = 143, p-value = 7.493e-16
## alternative hypothesis: true mean is not equal to 50
## 95 percent confidence interval:
##  61.54805 67.96584
## sample estimates:
## mean of x 
##  64.75694
# use data argument instead of data$... to work in pipe workflows
data |> 
  # grade ~ reading is formula specification: variable ~ group
  # _ is a placeholder used when the pipe input is not the first argument of the next function
  t.test(grade ~ reading, alternative = "greater", data = _)
## 
##  Welch Two Sample t-test
## 
## data:  grade by reading
## t = -2.1671, df = 142.85, p-value = 0.9841
## alternative hypothesis: true difference in means between group FALSE and group TRUE is greater than 0
## 95 percent confidence interval:
##  -0.7995718        Inf
## sample estimates:
## mean in group FALSE  mean in group TRUE 
##            2.539474            2.992754
data |> 
  t.test(grade ~ reading, alternative = "greater", data = _) |> 
  tidy()
## # A tibble: 1 × 10
##   estimate estimate1 estimate2 statistic p.value parameter conf.low conf.high
##      <dbl>     <dbl>     <dbl>     <dbl>   <dbl>     <dbl>    <dbl>     <dbl>
## 1   -0.453      2.54      2.99     -2.17   0.984      143.   -0.800       Inf
## # ℹ 2 more variables: method <chr>, alternative <chr>
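
The optional arguments mentioned above follow the same pattern. As a sketch, the first line below runs a classic Student's t-test assuming equal variances (Welch's correction is the default); the paired example assumes hypothetical before/after columns that are not in this dataset, so it is shown only as a commented-out template.

# two-sample t-test assuming equal variances (Welch is the default)
t.test(grade ~ reading, var.equal = TRUE, data = data)
# paired t-test: requires two measurements per subject, e.g. hypothetical
# grade_midterm and grade_final columns (not in the example data)
# t.test(data$grade_midterm, data$grade_final, paired = TRUE)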

Correlation test

The cor.test() function calculates the correlation between two variables. Again, the function supports various specifications: x and y arguments, formula syntax (see below for an example), and the data argument.

cor.test( ~ grade + age, data = data)
## 
##  Pearson's product-moment correlation
## 
## data:  grade and age
## t = 2.2587, df = 143, p-value = 0.02541
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.02329811 0.33837421
## sample estimates:
##       cor 
## 0.1856025
# assign the outcome to an object
cor_result <- cor.test( ~ grade + age, data = data)
# tibble of results
tidy(cor_result)
## # A tibble: 1 × 8
##   estimate statistic p.value parameter conf.low conf.high method     alternative
##      <dbl>     <dbl>   <dbl>     <int>    <dbl>     <dbl> <chr>      <chr>      
## 1    0.186      2.26  0.0254       143   0.0233     0.338 Pearson's… two.sided
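
By default cor.test() computes a Pearson correlation; the method argument switches to rank-based alternatives such as Spearman's rho. A sketch using the same data (output not shown):

# Spearman rank correlation instead of the default Pearson
cor.test( ~ grade + age, data = data, method = "spearman")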

Simple regression

A key building block of statistical analysis is linear regression. The lm() function fits a linear model to the data, with a wide range of specifications passed as the formula argument (the first argument if unnamed). The formula syntax is y ~ x, where y is the dependent variable and x is the independent variable. Again, optional function arguments allow for a lot of customization, but the default settings are sufficient for most applications. Helper functions such as tidy() and summary() provide extensive summaries of the model fit and coefficients, and stargazer() (from the stargazer package) creates neat tables of the results. Assigning the result of a model to an object saves computational time, as then we can work with the results without having to re-run the analysis every time.

# assign outcome to object
fit <- lm(grade ~ age, data = data)

# extensive result summary
fit |> summary()
## 
## Call:
## lm(formula = grade ~ age, data = data)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -2.9125 -1.0542  0.3264  1.0875  1.4458 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)  
## (Intercept)  0.40464    1.04592   0.387   0.6994  
## age          0.11942    0.05287   2.259   0.0254 *
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.264 on 143 degrees of freedom
## Multiple R-squared:  0.03445,    Adjusted R-squared:  0.0277 
## F-statistic: 5.102 on 1 and 143 DF,  p-value: 0.02541
# tidy coefficients
fit |> tidy()
## # A tibble: 2 × 5
##   term        estimate std.error statistic p.value
##   <chr>          <dbl>     <dbl>     <dbl>   <dbl>
## 1 (Intercept)    0.405    1.05       0.387  0.699 
## 2 age            0.119    0.0529     2.26   0.0254
# display-ready table
fit |> 
  stargazer(type = "text", title = "Grade - age regression results")
## 
## Grade - age regression results
## ===============================================
##                         Dependent variable:    
##                     ---------------------------
##                                grade           
## -----------------------------------------------
## age                           0.119**          
##                               (0.053)          
##                                                
## Constant                       0.405           
##                               (1.046)          
##                                                
## -----------------------------------------------
## Observations                    145            
## R2                             0.034           
## Adjusted R2                    0.028           
## Residual Std. Error      1.264 (df = 143)      
## F Statistic            5.102** (df = 1; 143)   
## ===============================================
## Note:               *p<0.1; **p<0.05; ***p<0.01
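
Besides tidy(), two other helpers are often useful: glance() from broom returns the model-level fit statistics (R-squared, F statistic, AIC, etc.) as a one-row tibble, and confint() from base R gives confidence intervals for the coefficients. A sketch (output not shown):

# model-level fit statistics as a one-row tibble
fit |> glance()
# confidence intervals for the coefficients
fit |> confint()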

Multiple regression

Multiple regression extends simple regression to multiple independent variables. The only difference is the formula specification, which now connects multiple independent variables with + signs. The formula specification also allows for interactions between variables, which can be specified with * (if the main effects should be included) or : (for only the interaction term). The DV ~ . syntax includes all variables in the data except the dependent variable as independent variables.

lm(grade ~ age + scholarship, data = data) |> summary()
## 
## Call:
## lm(formula = grade ~ age + scholarship, data = data)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -2.8845 -1.0609  0.3735  1.1478  1.5025 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)  
## (Intercept) 0.048581   1.292374   0.038    0.970  
## age         0.129004   0.056549   2.281    0.024 *
## scholarship 0.002536   0.005788   0.438    0.662  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.271 on 141 degrees of freedom
##   (1 observation deleted due to missingness)
## Multiple R-squared:  0.03636,    Adjusted R-squared:  0.0227 
## F-statistic:  2.66 on 2 and 141 DF,  p-value: 0.07343
# all variables in data
lm(grade ~ ., data = data) |> summary()
## 
## Call:
## lm(formula = grade ~ ., data = data)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -3.5165 -0.9030  0.2786  0.9096  2.1809 
## 
## Coefficients:
##                      Estimate Std. Error t value Pr(>|t|)   
## (Intercept)         45.780346  14.038274   3.261  0.00141 **
## id                  -0.009043   0.002749  -3.290  0.00128 **
## age                  0.128874   0.056319   2.288  0.02367 * 
## sexMale             -0.517980   0.230416  -2.248  0.02620 * 
## scholarship          0.005147   0.005825   0.883  0.37855   
## additional_workTRUE -0.041568   0.231178  -0.180  0.85757   
## readingTRUE          0.361383   0.206867   1.747  0.08292 . 
## notesTRUE            0.011816   0.210672   0.056  0.95536   
## listeningTRUE        0.225988   0.209844   1.077  0.28343   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.218 on 135 degrees of freedom
##   (1 observation deleted due to missingness)
## Multiple R-squared:  0.1525, Adjusted R-squared:  0.1023 
## F-statistic: 3.037 on 8 and 135 DF,  p-value: 0.003596
# interactions
lm(grade ~ age * scholarship, data = data) |> summary()
## 
## Call:
## lm(formula = grade ~ age * scholarship, data = data)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -2.8931 -1.0857  0.2747  1.1853  1.5539 
## 
## Coefficients:
##                  Estimate Std. Error t value Pr(>|t|)
## (Intercept)     -2.554899   3.814695  -0.670    0.504
## age              0.262316   0.192274   1.364    0.175
## scholarship      0.046368   0.060690   0.764    0.446
## age:scholarship -0.002266   0.003123  -0.726    0.469
## 
## Residual standard error: 1.273 on 140 degrees of freedom
##   (1 observation deleted due to missingness)
## Multiple R-squared:  0.03997,    Adjusted R-squared:  0.0194 
## F-statistic: 1.943 on 3 and 140 DF,  p-value: 0.1254
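
For comparison, the : operator keeps only the interaction term without the main effects, which is rarely what we want but occasionally useful. A sketch (output not shown):

# interaction term only, without the age and scholarship main effects
lm(grade ~ age:scholarship, data = data) |> summary()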

ANOVA

Analysis of variance (ANOVA) is a generalization of the t-test to multiple groups. The aov() function fits an ANOVA model to the data, with the formula syntax y ~ x, where y is the dependent variable and x is the independent variable. The same helper functions as with lm() can be used to summarize the results.

Note that ANOVA is a specific case of a linear regression model, so the results are equivalent to those of a linear regression model with a categorical independent variable.

anova_fit <- aov(grade ~ reading, data = data)

summary(anova_fit)
##              Df Sum Sq Mean Sq F value Pr(>F)  
## reading       1   7.43   7.431   4.638  0.033 *
## Residuals   143 229.13   1.602                 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
tidy(anova_fit)
## # A tibble: 2 × 6
##   term         df  sumsq meansq statistic p.value
##   <chr>     <dbl>  <dbl>  <dbl>     <dbl>   <dbl>
## 1 reading       1   7.43   7.43      4.64  0.0330
## 2 Residuals   143 229.     1.60     NA    NA
# equivalent regression
lm(grade ~ reading, data = data) |> summary()
## 
## Call:
## lm(formula = grade ~ reading, data = data)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -2.9928 -1.0395  0.4605  1.0072  1.4605 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   2.5395     0.1452  17.490   <2e-16 ***
## readingTRUE   0.4533     0.2105   2.153    0.033 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.266 on 143 degrees of freedom
## Multiple R-squared:  0.03141,    Adjusted R-squared:  0.02464 
## F-statistic: 4.638 on 1 and 143 DF,  p-value: 0.03296
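
With more than two groups, the ANOVA F test only tells us that some group means differ; TukeyHSD() on an aov fit gives post-hoc pairwise comparisons. A sketch using scholarship converted to a factor, as in the Kruskal-Wallis example further below (output not shown):

# post-hoc pairwise comparisons after a multi-group ANOVA
aov(grade ~ factor(scholarship), data = data) |> TukeyHSD()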

Chi-square test

The chi-square test is used to test the independence of two categorical variables. The chisq.test() function calculates the chi-square statistic and the p-value for the test. Unlike the previous functions, chisq.test() does not have a data argument, which makes it harder to use directly in tidy workflows.

chisq.test(data$reading, data$notes)
## 
##  Pearson's Chi-squared test with Yates' continuity correction
## 
## data:  data$reading and data$notes
## X-squared = 0.14464, df = 1, p-value = 0.7037

We can also add contingency tables to examine the relationship between the two variables:

# with table()
table(data$reading, data$notes)
##        
##         FALSE TRUE
##   FALSE    34   42
##   TRUE     34   35
# with xtabs()
xtabs(~ reading + notes, data = data)
##        notes
## reading FALSE TRUE
##   FALSE    34   42
##   TRUE     34   35
# with count() and pivot_wider()
data |> 
  count(reading, notes) |>
  pivot_wider(names_from = notes, values_from = n, values_fill = 0)
## # A tibble: 2 × 3
##   reading `FALSE` `TRUE`
##   <lgl>     <int>  <int>
## 1 FALSE        34     42
## 2 TRUE         34     35
# proportions with table() and prop.table()
table(data$reading, data$notes) |> prop.table()
##        
##             FALSE      TRUE
##   FALSE 0.2344828 0.2896552
##   TRUE  0.2344828 0.2413793
# proportions with count() and pivot_wider()
data |> 
  count(reading, notes) |>
  mutate(prop = n / sum(n)) |>
  select(-n) |> 
  pivot_wider(names_from = notes, values_from = prop, values_fill = 0)
## # A tibble: 2 × 3
##   reading `FALSE` `TRUE`
##   <lgl>     <dbl>  <dbl>
## 1 FALSE     0.234  0.290
## 2 TRUE      0.234  0.241
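
One workaround for the missing data argument is that chisq.test() also accepts a contingency table directly, so the test can follow table() or xtabs() in a pipe. A sketch (output not shown):

# chisq.test() accepts a contingency table, so it can be piped after xtabs()
xtabs(~ reading + notes, data = data) |> chisq.test()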

Logistic regression

When it comes to predicting binary outcomes, linear regression has some problems, such as predicting values outside the 0-1 range. In those cases, we often use logistic regression instead. The glm() function fits a logistic regression model to the data. Other than the family argument, which specifies the distribution of the dependent variable, the function works in the same way as lm(). A logistic regression uses family = "binomial".

logit_fit <- glm(reading ~ age, data = data, family = "binomial")

logit_fit |> summary()
## 
## Call:
## glm(formula = reading ~ age, family = "binomial", data = data)
## 
## Coefficients:
##              Estimate Std. Error z value Pr(>|z|)
## (Intercept) -0.219462   1.656757  -0.132    0.895
## age          0.006241   0.083743   0.075    0.941
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 200.67  on 144  degrees of freedom
## Residual deviance: 200.67  on 143  degrees of freedom
## AIC: 204.67
## 
## Number of Fisher Scoring iterations: 3
logit_fit |> tidy()
## # A tibble: 2 × 5
##   term        estimate std.error statistic p.value
##   <chr>          <dbl>     <dbl>     <dbl>   <dbl>
## 1 (Intercept) -0.219      1.66     -0.132    0.895
## 2 age          0.00624    0.0837    0.0745   0.941
# display-ready table
stargazer(logit_fit, type = "text", 
          title = "Relationship between doing the reading and age")
## 
## Relationship between doing the reading and age
## =============================================
##                       Dependent variable:    
##                   ---------------------------
##                             reading          
## ---------------------------------------------
## age                          0.006           
##                             (0.084)          
##                                              
## Constant                    -0.219           
##                             (1.657)          
##                                              
## ---------------------------------------------
## Observations                  145            
## Log Likelihood             -100.335          
## Akaike Inf. Crit.           204.669          
## =============================================
## Note:             *p<0.1; **p<0.05; ***p<0.01
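
Logistic regression coefficients are on the log-odds scale; exponentiating them gives odds ratios, which are often easier to interpret. tidy() can do this directly. A sketch (output not shown):

# coefficients as odds ratios with confidence intervals
logit_fit |> tidy(exponentiate = TRUE, conf.int = TRUE)
# equivalently, exponentiate the raw coefficients
exp(coef(logit_fit))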

Non-parametric tests

Non-parametric tests are used when the assumptions of parametric tests are violated. Running them in R follows the same structure as running the parametric alternative, apart from the function name itself and some differences in the optional arguments.

# Wilcoxon signed-rank test
wilcox.test(data$grade, mu = 2)
## 
##  Wilcoxon signed rank test with continuity correction
## 
## data:  data$grade
## V = 6767.5, p-value = 1.238e-10
## alternative hypothesis: true location is not equal to 2
# Mann-Whitney U test
wilcox.test(grade ~ reading, data = data)
## 
##  Wilcoxon rank sum test with continuity correction
## 
## data:  grade by reading
## W = 2119, p-value = 0.03795
## alternative hypothesis: true location shift is not equal to 0
# Kruskal-Wallis test (scholarship is a number in the data so convert to factor for this example)
kruskal.test(grade ~ factor(scholarship), data = data)
## 
##  Kruskal-Wallis rank sum test
## 
## data:  grade by factor(scholarship)
## Kruskal-Wallis chi-squared = 20.368, df = 3, p-value = 0.0001424
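
These functions return the same kind of test objects as t.test() and cor.test(), so tidy() works on them as well. A sketch (output not shown):

# convert a non-parametric test result to a tibble
wilcox.test(grade ~ reading, data = data) |> tidy()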

PCA and factor analysis

Principal component analysis (PCA) and factor analysis are used to reduce the dimensionality of a dataset. The prcomp() function fits a PCA model to the data, and the factanal() function fits a factor analysis model. Both functions work with the formula syntax, and the data argument can be used to specify the data frame.

# PCA
pca_fit <- prcomp(~ grade + age + scholarship, data = data)

# summary of PCA
summary(pca_fit)
## Importance of components:
##                            PC1     PC2     PC3
## Standard deviation     19.4919 1.90658 1.24411
## Proportion of Variance  0.9865 0.00944 0.00402
## Cumulative Proportion   0.9865 0.99598 1.00000
# component loadings for each observation
tidy(pca_fit)
## # A tibble: 432 × 3
##    row      PC    value
##    <chr> <dbl>    <dbl>
##  1 1         1 -14.8   
##  2 1         2   1.05  
##  3 1         3  -1.01  
##  4 2         1 -14.8   
##  5 2         2   0.0756
##  6 2         3  -1.23  
##  7 3         1 -14.8   
##  8 3         2   1.05  
##  9 3         3  -1.01  
## 10 4         1 -14.7   
## # ℹ 422 more rows
# Factor analysis
fa_fit <- factanal(~ grade + age + scholarship, factors = 1, data = data)

# summary of factor analysis
fa_fit
## 
## Call:
## factanal(x = ~grade + age + scholarship, factors = 1, data = data)
## 
## Uniquenesses:
##       grade         age scholarship 
##       0.965       0.005       0.888 
## 
## Loadings:
##             Factor1
## grade        0.188 
## age          0.997 
## scholarship -0.334 
## 
##                Factor1
## SS loadings      1.142
## Proportion Var   0.381
## 
## The degrees of freedom for the model is 0 and the fit was 0.0014
tidy(fa_fit)
## # A tibble: 3 × 3
##   variable    uniqueness    fl1
##   <chr>            <dbl>  <dbl>
## 1 grade            0.965  0.188
## 2 age              0.005  0.997
## 3 scholarship      0.888 -0.334
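
Note that prcomp() does not standardize the variables by default, so when the variables are on very different scales, the first component is dominated by the variable with the largest variance (as in the PCA summary above). Setting scale. = TRUE runs the PCA on standardized variables. A sketch (output not shown):

# PCA on standardized variables so no single variable dominates
pca_scaled <- prcomp(~ grade + age + scholarship, data = data, scale. = TRUE)
summary(pca_scaled)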

Repeated measures ANOVA

Repeated measures ANOVA is used when the same subjects are measured multiple times. The aov() function can be used to fit these models, with the formula syntax y ~ x + Error(subject), where y is the dependent variable, x is the independent variable, and subject identifies which observations belong to the same subject. The Error() term specifies the within-subject error structure, and the data argument can be used to specify the data frame.

# repeated measures ANOVA
# redefine data to have two observations per person, the second with random noise added to the grade
bind_rows(data,
          data %>% mutate(grade = grade + rnorm(n()))) |>
  aov(grade ~ reading + Error(id), data = _)
## 
## Call:
## aov(formula = grade ~ reading + Error(id), data = bind_rows(data, 
##     data %>% mutate(grade = grade + rnorm(n()))))
## 
## Grand Mean: 2.693187
## 
## Stratum 1: id
## 
## Terms:
##                  reading
## Sum of Squares  26.45933
## Deg. of Freedom        1
## 
## Estimated effects are balanced
## 
## Stratum 2: Within
## 
## Terms:
##                  reading Residuals
## Sum of Squares   14.9978  628.3622
## Deg. of Freedom        1       287
## 
## Residual standard error: 1.479667
## Estimated effects are balanced
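
As with the other models, we can assign the fit to an object and call summary() to get the usual F tests within each error stratum. A sketch using the same simulated repeated measures data (the object names below are just examples; output not shown):

# store the doubled data and the repeated measures fit, then summarize
data_long <- bind_rows(data,
                       data %>% mutate(grade = grade + rnorm(n())))
rm_fit <- aov(grade ~ reading + Error(id), data = data_long)
summary(rm_fit)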
