Introduction
This document provides an overview of the basic statistical functions in R, including descriptive statistics and summary tables.
This tutorial shows you examples of using statistical methods on the
diamonds
dataset, which comes pre-loaded with the tidyverse
package.
Let’s load the tidyverse
package and have a look at the diamonds
dataset:
library(tidyverse)
data(diamonds)
Descriptive statistics
Summaries in the R Console
To get a descriptive statistic of a single variable in a tibble, we can
use that variable as an argument to a relevant function (using $
to
refer to a variable in a tibble).
mean(diamonds$price)
## [1] 3932.8
median(diamonds$price)
## [1] 2401
sd(diamonds$price)
## [1] 3989.44
To get the frequencies of a categorical variable, we can use the
count()
function, with the sort = TRUE
argument returning the values
in descending frequency. count()
is a tidy function that works well
with pipe workflows and can count the joint frequencies of multiple
variables.
# frequencies of a single variable
count(diamonds, cut)
## # A tibble: 5 × 2
## cut n
## <ord> <int>
## 1 Fair 1610
## 2 Good 4906
## 3 Very Good 12082
## 4 Premium 13791
## 5 Ideal 21551
# joint frequency distribution
count(diamonds, cut, color)
## # A tibble: 35 × 3
## cut color n
## <ord> <ord> <int>
## 1 Fair D 163
## 2 Fair E 224
## 3 Fair F 312
## 4 Fair G 314
## 5 Fair H 303
## 6 Fair I 175
## 7 Fair J 119
## 8 Good D 662
## 9 Good E 933
## 10 Good F 909
## # ℹ 25 more rows
To get the correlation coefficient between two variables, we can use the
cor()
function in the same way we used other descriptives such as
mean()
.
cor(diamonds$price, diamonds$carat)
## [1] 0.9215913
The easiest way to get summary statistics of all variables in a tibble
is with the summary()
function: this function shows the distribution
of numeric variables, the frequencies of categorical variables, and the
number of missing values for each variable.
summary(diamonds)
## carat cut color clarity depth
## Min. :0.2000 Fair : 1610 D: 6775 SI1 :13065 Min. :43.00
## 1st Qu.:0.4000 Good : 4906 E: 9797 VS2 :12258 1st Qu.:61.00
## Median :0.7000 Very Good:12082 F: 9542 SI2 : 9194 Median :61.80
## Mean :0.7979 Premium :13791 G:11292 VS1 : 8171 Mean :61.75
## 3rd Qu.:1.0400 Ideal :21551 H: 8304 VVS2 : 5066 3rd Qu.:62.50
## Max. :5.0100 I: 5422 VVS1 : 3655 Max. :79.00
## J: 2808 (Other): 2531
## table price x y
## Min. :43.00 Min. : 326 Min. : 0.000 Min. : 0.000
## 1st Qu.:56.00 1st Qu.: 950 1st Qu.: 4.710 1st Qu.: 4.720
## Median :57.00 Median : 2401 Median : 5.700 Median : 5.710
## Mean :57.46 Mean : 3933 Mean : 5.731 Mean : 5.735
## 3rd Qu.:59.00 3rd Qu.: 5324 3rd Qu.: 6.540 3rd Qu.: 6.540
## Max. :95.00 Max. :18823 Max. :10.740 Max. :58.900
##
## z
## Min. : 0.000
## 1st Qu.: 2.910
## Median : 3.530
## Mean : 3.539
## 3rd Qu.: 4.040
## Max. :31.800
##
Publication-ready summaries with gtsummary
The summary()
function is useful for viewing the data in the Console,
but doesn’t export to outside of R nicely. There are a few packages
available for generating simple summary statistics tables that contain
information about the central tendencies and dispersion of the data that
also export nicely, such as gtsummary
.
library(gtsummary)
# tbl_summary() creates a summary table of the data
tbl_summary(diamonds)
Characteristic | N = 53,9401 |
---|---|
carat | 0.70 (0.40, 1.04) |
cut | |
Fair | 1,610 (3.0%) |
Good | 4,906 (9.1%) |
Very Good | 12,082 (22%) |
Premium | 13,791 (26%) |
Ideal | 21,551 (40%) |
color | |
D | 6,775 (13%) |
E | 9,797 (18%) |
F | 9,542 (18%) |
G | 11,292 (21%) |
H | 8,304 (15%) |
I | 5,422 (10%) |
J | 2,808 (5.2%) |
clarity | |
I1 | 741 (1.4%) |
SI2 | 9,194 (17%) |
SI1 | 13,065 (24%) |
VS2 | 12,258 (23%) |
VS1 | 8,171 (15%) |
VVS2 | 5,066 (9.4%) |
VVS1 | 3,655 (6.8%) |
IF | 1,790 (3.3%) |
depth | 61.80 (61.00, 62.50) |
table | 57.00 (56.00, 59.00) |
price | 2,401 (950, 5,325) |
x | 5.70 (4.71, 6.54) |
y | 5.71 (4.72, 6.54) |
z | 3.53 (2.91, 4.04) |
1 Median (Q1, Q3); n (%) |
You can stratify your summary by a grouping variable using the by
argument:
tbl_summary(diamonds, by = cut)
Characteristic | Fair N = 1,6101 |
Good N = 4,9061 |
Very Good N = 12,0821 |
Premium N = 13,7911 |
Ideal N = 21,5511 |
---|---|---|---|---|---|
carat | 1.00 (0.70, 1.20) | 0.82 (0.50, 1.01) | 0.71 (0.41, 1.02) | 0.86 (0.41, 1.20) | 0.54 (0.35, 1.01) |
color | |||||
D | 163 (10%) | 662 (13%) | 1,513 (13%) | 1,603 (12%) | 2,834 (13%) |
E | 224 (14%) | 933 (19%) | 2,400 (20%) | 2,337 (17%) | 3,903 (18%) |
F | 312 (19%) | 909 (19%) | 2,164 (18%) | 2,331 (17%) | 3,826 (18%) |
G | 314 (20%) | 871 (18%) | 2,299 (19%) | 2,924 (21%) | 4,884 (23%) |
H | 303 (19%) | 702 (14%) | 1,824 (15%) | 2,360 (17%) | 3,115 (14%) |
I | 175 (11%) | 522 (11%) | 1,204 (10.0%) | 1,428 (10%) | 2,093 (9.7%) |
J | 119 (7.4%) | 307 (6.3%) | 678 (5.6%) | 808 (5.9%) | 896 (4.2%) |
clarity | |||||
I1 | 210 (13%) | 96 (2.0%) | 84 (0.7%) | 205 (1.5%) | 146 (0.7%) |
SI2 | 466 (29%) | 1,081 (22%) | 2,100 (17%) | 2,949 (21%) | 2,598 (12%) |
SI1 | 408 (25%) | 1,560 (32%) | 3,240 (27%) | 3,575 (26%) | 4,282 (20%) |
VS2 | 261 (16%) | 978 (20%) | 2,591 (21%) | 3,357 (24%) | 5,071 (24%) |
VS1 | 170 (11%) | 648 (13%) | 1,775 (15%) | 1,989 (14%) | 3,589 (17%) |
VVS2 | 69 (4.3%) | 286 (5.8%) | 1,235 (10%) | 870 (6.3%) | 2,606 (12%) |
VVS1 | 17 (1.1%) | 186 (3.8%) | 789 (6.5%) | 616 (4.5%) | 2,047 (9.5%) |
IF | 9 (0.6%) | 71 (1.4%) | 268 (2.2%) | 230 (1.7%) | 1,212 (5.6%) |
depth | 65.00 (64.40, 65.90) | 63.40 (61.30, 63.80) | 62.10 (60.90, 62.90) | 61.40 (60.50, 62.20) | 61.80 (61.30, 62.20) |
table | 58.00 (56.00, 61.00) | 58.00 (56.00, 61.00) | 58.00 (56.00, 59.00) | 59.00 (58.00, 60.00) | 56.00 (55.00, 57.00) |
price | 3,282 (2,050, 5,208) | 3,051 (1,145, 5,028) | 2,648 (912, 5,373) | 3,185 (1,046, 6,296) | 1,810 (878, 4,679) |
x | 6.18 (5.63, 6.70) | 5.98 (5.02, 6.42) | 5.74 (4.75, 6.47) | 6.11 (4.80, 6.80) | 5.25 (4.54, 6.44) |
y | 6.10 (5.57, 6.64) | 5.99 (5.02, 6.44) | 5.77 (4.77, 6.51) | 6.06 (4.79, 6.76) | 5.26 (4.55, 6.45) |
z | 3.97 (3.61, 4.28) | 3.70 (3.07, 4.03) | 3.56 (2.95, 4.02) | 3.72 (2.94, 4.16) | 3.23 (2.80, 3.98) |
1 Median (Q1, Q3); n (%) |
# add p-value of difference between groups
tbl_summary(diamonds, by = cut) |>
add_p()
Characteristic | Fair N = 1,6101 |
Good N = 4,9061 |
Very Good N = 12,0821 |
Premium N = 13,7911 |
Ideal N = 21,5511 |
p-value2 |
---|---|---|---|---|---|---|
carat | 1.00 (0.70, 1.20) | 0.82 (0.50, 1.01) | 0.71 (0.41, 1.02) | 0.86 (0.41, 1.20) | 0.54 (0.35, 1.01) | <0.001 |
color | <0.001 | |||||
D | 163 (10%) | 662 (13%) | 1,513 (13%) | 1,603 (12%) | 2,834 (13%) | |
E | 224 (14%) | 933 (19%) | 2,400 (20%) | 2,337 (17%) | 3,903 (18%) | |
F | 312 (19%) | 909 (19%) | 2,164 (18%) | 2,331 (17%) | 3,826 (18%) | |
G | 314 (20%) | 871 (18%) | 2,299 (19%) | 2,924 (21%) | 4,884 (23%) | |
H | 303 (19%) | 702 (14%) | 1,824 (15%) | 2,360 (17%) | 3,115 (14%) | |
I | 175 (11%) | 522 (11%) | 1,204 (10.0%) | 1,428 (10%) | 2,093 (9.7%) | |
J | 119 (7.4%) | 307 (6.3%) | 678 (5.6%) | 808 (5.9%) | 896 (4.2%) | |
clarity | <0.001 | |||||
I1 | 210 (13%) | 96 (2.0%) | 84 (0.7%) | 205 (1.5%) | 146 (0.7%) | |
SI2 | 466 (29%) | 1,081 (22%) | 2,100 (17%) | 2,949 (21%) | 2,598 (12%) | |
SI1 | 408 (25%) | 1,560 (32%) | 3,240 (27%) | 3,575 (26%) | 4,282 (20%) | |
VS2 | 261 (16%) | 978 (20%) | 2,591 (21%) | 3,357 (24%) | 5,071 (24%) | |
VS1 | 170 (11%) | 648 (13%) | 1,775 (15%) | 1,989 (14%) | 3,589 (17%) | |
VVS2 | 69 (4.3%) | 286 (5.8%) | 1,235 (10%) | 870 (6.3%) | 2,606 (12%) | |
VVS1 | 17 (1.1%) | 186 (3.8%) | 789 (6.5%) | 616 (4.5%) | 2,047 (9.5%) | |
IF | 9 (0.6%) | 71 (1.4%) | 268 (2.2%) | 230 (1.7%) | 1,212 (5.6%) | |
depth | 65.00 (64.40, 65.90) | 63.40 (61.30, 63.80) | 62.10 (60.90, 62.90) | 61.40 (60.50, 62.20) | 61.80 (61.30, 62.20) | <0.001 |
table | 58.00 (56.00, 61.00) | 58.00 (56.00, 61.00) | 58.00 (56.00, 59.00) | 59.00 (58.00, 60.00) | 56.00 (55.00, 57.00) | <0.001 |
price | 3,282 (2,050, 5,208) | 3,051 (1,145, 5,028) | 2,648 (912, 5,373) | 3,185 (1,046, 6,296) | 1,810 (878, 4,679) | <0.001 |
x | 6.18 (5.63, 6.70) | 5.98 (5.02, 6.42) | 5.74 (4.75, 6.47) | 6.11 (4.80, 6.80) | 5.25 (4.54, 6.44) | <0.001 |
y | 6.10 (5.57, 6.64) | 5.99 (5.02, 6.44) | 5.77 (4.77, 6.51) | 6.06 (4.79, 6.76) | 5.26 (4.55, 6.45) | <0.001 |
z | 3.97 (3.61, 4.28) | 3.70 (3.07, 4.03) | 3.56 (2.95, 4.02) | 3.72 (2.94, 4.16) | 3.23 (2.80, 3.98) | <0.001 |
1 Median (Q1, Q3); n (%) | ||||||
2 Kruskal-Wallis rank sum test; Pearson’s Chi-squared test |
You can also customize the appearance of the table.
tbl_summary(diamonds, by = cut) |>
modify_header(label ~ "Variable") |>
modify_caption("Summary Table by Cut")
Variable | Fair N = 1,6101 |
Good N = 4,9061 |
Very Good N = 12,0821 |
Premium N = 13,7911 |
Ideal N = 21,5511 |
---|---|---|---|---|---|
carat | 1.00 (0.70, 1.20) | 0.82 (0.50, 1.01) | 0.71 (0.41, 1.02) | 0.86 (0.41, 1.20) | 0.54 (0.35, 1.01) |
color | |||||
D | 163 (10%) | 662 (13%) | 1,513 (13%) | 1,603 (12%) | 2,834 (13%) |
E | 224 (14%) | 933 (19%) | 2,400 (20%) | 2,337 (17%) | 3,903 (18%) |
F | 312 (19%) | 909 (19%) | 2,164 (18%) | 2,331 (17%) | 3,826 (18%) |
G | 314 (20%) | 871 (18%) | 2,299 (19%) | 2,924 (21%) | 4,884 (23%) |
H | 303 (19%) | 702 (14%) | 1,824 (15%) | 2,360 (17%) | 3,115 (14%) |
I | 175 (11%) | 522 (11%) | 1,204 (10.0%) | 1,428 (10%) | 2,093 (9.7%) |
J | 119 (7.4%) | 307 (6.3%) | 678 (5.6%) | 808 (5.9%) | 896 (4.2%) |
clarity | |||||
I1 | 210 (13%) | 96 (2.0%) | 84 (0.7%) | 205 (1.5%) | 146 (0.7%) |
SI2 | 466 (29%) | 1,081 (22%) | 2,100 (17%) | 2,949 (21%) | 2,598 (12%) |
SI1 | 408 (25%) | 1,560 (32%) | 3,240 (27%) | 3,575 (26%) | 4,282 (20%) |
VS2 | 261 (16%) | 978 (20%) | 2,591 (21%) | 3,357 (24%) | 5,071 (24%) |
VS1 | 170 (11%) | 648 (13%) | 1,775 (15%) | 1,989 (14%) | 3,589 (17%) |
VVS2 | 69 (4.3%) | 286 (5.8%) | 1,235 (10%) | 870 (6.3%) | 2,606 (12%) |
VVS1 | 17 (1.1%) | 186 (3.8%) | 789 (6.5%) | 616 (4.5%) | 2,047 (9.5%) |
IF | 9 (0.6%) | 71 (1.4%) | 268 (2.2%) | 230 (1.7%) | 1,212 (5.6%) |
depth | 65.00 (64.40, 65.90) | 63.40 (61.30, 63.80) | 62.10 (60.90, 62.90) | 61.40 (60.50, 62.20) | 61.80 (61.30, 62.20) |
table | 58.00 (56.00, 61.00) | 58.00 (56.00, 61.00) | 58.00 (56.00, 59.00) | 59.00 (58.00, 60.00) | 56.00 (55.00, 57.00) |
price | 3,282 (2,050, 5,208) | 3,051 (1,145, 5,028) | 2,648 (912, 5,373) | 3,185 (1,046, 6,296) | 1,810 (878, 4,679) |
x | 6.18 (5.63, 6.70) | 5.98 (5.02, 6.42) | 5.74 (4.75, 6.47) | 6.11 (4.80, 6.80) | 5.25 (4.54, 6.44) |
y | 6.10 (5.57, 6.64) | 5.99 (5.02, 6.44) | 5.77 (4.77, 6.51) | 6.06 (4.79, 6.76) | 5.26 (4.55, 6.45) |
z | 3.97 (3.61, 4.28) | 3.70 (3.07, 4.03) | 3.56 (2.95, 4.02) | 3.72 (2.94, 4.16) | 3.23 (2.80, 3.98) |
1 Median (Q1, Q3); n (%) |
To export the table as a Word document, use the gtsave()
function.
Note that we first use the as_gt()
function to convert the
tbl_summary()
output to a gt
object, and load the gt
package in
order to use the Word export function defined for the gt
package.
library(gt)
tbl_summary(diamonds, by = cut) |>
as_gt() |>
gtsave("summary_table.docx")
If you use LaTeX, you can also export as a LaTeX table, also relying on
the gt
package.
tbl_summary(diamonds, by = cut) |>
as_gt() |>
gtsave("summary_table.tex")
Hypothesis testing
Most of the simple statistical tests are from base R, so they don’t rely
on tidy principles, but many are compatible with tidy workflows to at
least some extent. In the following we’ll cover some of the key methods
that show up in methods and statistics courses at UCR. In addition, the
tidy()
function from the broom
package converts most text output
into simple tibbles, which are useful for exporting and visualizing
results.
library(broom)
t-tests
We can run one and two samples t-tests to evaluate group means with the
t.test()
function. The function supports various options and model
specifications: a simple one-sample t-test only requires specifying the
variable of interest, either with x = data$variable
or
x = variable, data = data
syntax. For two-sample t-tests, we can use
the formula syntax y ~ x
to specify the dependent and independent
variables or the x
and y
(and optionally data
) arguments.
Additional options include specifying the alternative hypothesis, the
confidence level, the value of $\mu$, and whether we want a paired
t-test and assume equal variances. Helper functions such as tidy()
convert the Console output to an easy-to-export tibble of results.
To demonstrate two-sample t-tests, we define a subset of the data that
contains only two possible values of cut
.
# simple t-test (H0: mean=mu)
t.test(diamonds$carat, mu = 1)
##
## One Sample t-test
##
## data: diamonds$carat
## t = -99.003, df = 53939, p-value < 2.2e-16
## alternative hypothesis: true mean is not equal to 1
## 95 percent confidence interval:
## 0.7939395 0.8019400
## sample estimates:
## mean of x
## 0.7979397
# define data subsample of fair and good diamonds to have only two groups of cut
diamonds_sub <- diamonds |>
filter(cut %in% c("Fair", "Good"))
# can also use data argument instead of data$...
# price ~ cut is formula specification: variable ~ group
# H0: fair and good diamonds have the same average price
t.test(price ~ cut, alternative = "greater", data = diamonds_sub)
##
## Welch Two Sample t-test
##
## data: price by cut
## t = 4.1684, df = 2822.3, p-value = 1.58e-05
## alternative hypothesis: true difference in means between group Fair and group Good is greater than 0
## 95 percent confidence interval:
## 260.2001 Inf
## sample estimates:
## mean in group Fair mean in group Good
## 4358.758 3928.864
# tidy() function turns results into a tibble
t.test(price ~ cut, alternative = "greater", data = diamonds_sub) |> tidy()
## # A tibble: 1 × 10
## estimate estimate1 estimate2 statistic p.value parameter conf.low conf.high
## <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 430. 4359. 3929. 4.17 0.0000158 2822. 260. Inf
## # ℹ 2 more variables: method <chr>, alternative <chr>
Correlation test
The cor.test()
function calculates the correlation between two
variables. Again, the function supports various specifications: x
and
y
arguments, formula syntax (see below for an example), and the data
argument.
cor.test( ~ price + carat, data = diamonds)
##
## Pearson's product-moment correlation
##
## data: price and carat
## t = 551.41, df = 53938, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## 0.9203098 0.9228530
## sample estimates:
## cor
## 0.9215913
# tidy() function turns results into a tibble
cor.test( ~ price + carat, data = diamonds) |> tidy()
## # A tibble: 1 × 8
## estimate statistic p.value parameter conf.low conf.high method alternative
## <dbl> <dbl> <dbl> <int> <dbl> <dbl> <chr> <chr>
## 1 0.922 551. 0 53938 0.920 0.923 Pearson's… two.sided
Simple regression
A key building block of statistical analysis is linear regression. The
lm()
function fits a linear model to the data, with a wide range of
specifications, passed as the formula argument (first argument if
unnamed). The formula syntax is y ~ x
, where y
is the dependent
variable and x
is the independent variable. Again, optional function
arguments allow for a lot of customization, but the default settings are
sufficient for most applications. Helper functions such as tidy()
and
summary()
provide extensive summaries of the model fit and
coefficients, and tbl_regression()
from the gtsummary
package
creates neat tables of the results. Assigning the result of a model to
an object saves computational time, as then we can work with the results
without having to re-run the analysis every time.
# assign outcome to object
fit <- lm(price ~ carat, data = diamonds)
# extensive result summary
fit |> summary()
##
## Call:
## lm(formula = price ~ carat, data = diamonds)
##
## Residuals:
## Min 1Q Median 3Q Max
## -18585.3 -804.8 -18.9 537.4 12731.7
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -2256.36 13.06 -172.8 <2e-16 ***
## carat 7756.43 14.07 551.4 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1549 on 53938 degrees of freedom
## Multiple R-squared: 0.8493, Adjusted R-squared: 0.8493
## F-statistic: 3.041e+05 on 1 and 53938 DF, p-value: < 2.2e-16
# tidy coefficients
fit |> tidy()
## # A tibble: 2 × 5
## term estimate std.error statistic p.value
## <chr> <dbl> <dbl> <dbl> <dbl>
## 1 (Intercept) -2256. 13.1 -173. 0
## 2 carat 7756. 14.1 551. 0
# display-ready table with gtsummary
tbl_regression(fit)
Characteristic | Beta | 95% CI1 | p-value |
---|---|---|---|
carat | 7,756 | 7,729, 7,784 | <0.001 |
1 CI = Confidence Interval |
Multiple regression
Multiple regression extends simple regression to multiple independent
variables. The only difference is the formula specification, which now
connects multiple independent variables with +
signs. The formula
specification also allows for interactions between variables, which can
be specified with *
(if the main effects should be included) or :
(for only the interaction term). The DV ~ .~
syntax includes all
variables in the data except the dependent variable as independent
variables.
lm(price ~ x + y + z + table + depth, data = diamonds) |> summary()
##
## Call:
## lm(formula = price ~ x + y + z + table + depth, data = diamonds)
##
## Residuals:
## Min 1Q Median 3Q Max
## -9994 -1256 -197 945 32470
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -8770.608 509.448 -17.216 < 2e-16 ***
## x 2918.492 43.346 67.330 < 2e-16 ***
## y 205.086 31.555 6.499 8.13e-11 ***
## z 91.814 54.802 1.675 0.0939 .
## table -84.855 3.813 -22.255 < 2e-16 ***
## depth -10.501 6.661 -1.576 0.1149
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1852 on 53934 degrees of freedom
## Multiple R-squared: 0.7846, Adjusted R-squared: 0.7846
## F-statistic: 3.929e+04 on 5 and 53934 DF, p-value: < 2.2e-16
# all variables in data
lm(price ~ ., data = diamonds) |> summary()
##
## Call:
## lm(formula = price ~ ., data = diamonds)
##
## Residuals:
## Min 1Q Median 3Q Max
## -21376.0 -592.4 -183.5 376.4 10694.2
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 5753.762 396.630 14.507 < 2e-16 ***
## carat 11256.978 48.628 231.494 < 2e-16 ***
## cut.L 584.457 22.478 26.001 < 2e-16 ***
## cut.Q -301.908 17.994 -16.778 < 2e-16 ***
## cut.C 148.035 15.483 9.561 < 2e-16 ***
## cut^4 -20.794 12.377 -1.680 0.09294 .
## color.L -1952.160 17.342 -112.570 < 2e-16 ***
## color.Q -672.054 15.777 -42.597 < 2e-16 ***
## color.C -165.283 14.725 -11.225 < 2e-16 ***
## color^4 38.195 13.527 2.824 0.00475 **
## color^5 -95.793 12.776 -7.498 6.59e-14 ***
## color^6 -48.466 11.614 -4.173 3.01e-05 ***
## clarity.L 4097.431 30.259 135.414 < 2e-16 ***
## clarity.Q -1925.004 28.227 -68.197 < 2e-16 ***
## clarity.C 982.205 24.152 40.668 < 2e-16 ***
## clarity^4 -364.918 19.285 -18.922 < 2e-16 ***
## clarity^5 233.563 15.752 14.828 < 2e-16 ***
## clarity^6 6.883 13.715 0.502 0.61575
## clarity^7 90.640 12.103 7.489 7.06e-14 ***
## depth -63.806 4.535 -14.071 < 2e-16 ***
## table -26.474 2.912 -9.092 < 2e-16 ***
## x -1008.261 32.898 -30.648 < 2e-16 ***
## y 9.609 19.333 0.497 0.61918
## z -50.119 33.486 -1.497 0.13448
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1130 on 53916 degrees of freedom
## Multiple R-squared: 0.9198, Adjusted R-squared: 0.9198
## F-statistic: 2.688e+04 on 23 and 53916 DF, p-value: < 2.2e-16
# interactions
lm(price ~ x * y, data = diamonds) |> summary()
##
## Call:
## lm(formula = price ~ x * y, data = diamonds)
##
## Residuals:
## Min 1Q Median 3Q Max
## -41574 -575 -16 286 38459
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 11572.154 161.256 71.76 <2e-16 ***
## x -507.815 33.540 -15.14 <2e-16 ***
## y -5300.961 42.598 -124.44 <2e-16 ***
## x:y 752.458 4.618 162.93 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1523 on 53936 degrees of freedom
## Multiple R-squared: 0.8542, Adjusted R-squared: 0.8542
## F-statistic: 1.053e+05 on 3 and 53936 DF, p-value: < 2.2e-16
ANOVA
Analysis of variance (ANOVA) is a generalization of the t-test to
multiple groups. The aov()
function fits an ANOVA model to the data,
with the formula syntax y ~ x
, where y
is the dependent variable and
x
is the independent variable. The same helper functions as with
lm()
can be used to summarize the results.
Note that ANOVA is a specific case of a linear regression model, so the results are equivalent to those of a linear regression model with a categorical independent variable.
anova_fit <- aov(price ~ cut, data = diamonds)
summary(anova_fit)
## Df Sum Sq Mean Sq F value Pr(>F)
## cut 4 1.104e+10 2.760e+09 175.7 <2e-16 ***
## Residuals 53935 8.474e+11 1.571e+07
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
tidy(anova_fit)
## # A tibble: 2 × 6
## term df sumsq meansq statistic p.value
## <chr> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 cut 4 11041745359. 2760436340. 176. 8.43e-150
## 2 Residuals 53935 847431390159. 15712087. NA NA
# equivalent regression
lm(price ~ cut, data = diamonds) |> summary()
##
## Call:
## lm(formula = price ~ cut, data = diamonds)
##
## Residuals:
## Min 1Q Median 3Q Max
## -4258 -2741 -1494 1360 15348
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 4062.24 25.40 159.923 < 2e-16 ***
## cut.L -362.73 68.04 -5.331 9.8e-08 ***
## cut.Q -225.58 60.65 -3.719 2e-04 ***
## cut.C -699.50 52.78 -13.253 < 2e-16 ***
## cut^4 -280.36 42.56 -6.588 4.5e-11 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 3964 on 53935 degrees of freedom
## Multiple R-squared: 0.01286, Adjusted R-squared: 0.01279
## F-statistic: 175.7 on 4 and 53935 DF, p-value: < 2.2e-16