Introduction

This tutorial shows how to use the mutate(), group_by() and summarize() functions to compute more complex measures of linguistic development than those shown in the Data Center workshop.

If you have not done so yet, load the three packages we used in the workshop, and load Amy’s transcript data in the token-based format.

library(tidyverse)
library(tidytext)
library(childesr)

tok <- get_tokens(token = "*", collection = "Eng-NA", target_child = "Amy",
                  corpus = "VanKleeck", role = "target_child")

Computing new variables

mutate()

The mutate() function creates new variables or overwrites existing ones. For instance, you can calculate the length of each word Amy used with the str_length() function and assign the result to a new variable called length. In the output below, I use the select() function to display only a subset of the variables, and head(10) to display only the first 10 observations.

tok_length <- tok %>% 
  mutate(length = str_length(gloss)) 

tok_length %>% 
  select(id, gloss, stem, length) %>% 
  head(10)
## # A tibble: 10 × 4
##         id gloss  stem    length
##      <int> <chr>  <chr>    <int>
##  1 3499800 white  "white"      5
##  2 3499804 xxx    ""           3
##  3 3499886 hi     "hi"         2
##  4 3499906 xxx    ""           3
##  5 3499911 okay   "okay"       4
##  6 3499912 bye    "bye"        3
##  7 3499938 the    "the"        3
##  8 3499939 farmer "farm"       6
##  9 3499961 okay   "okay"       4
## 10 3499976 yeah   "yeah"       4

You can also use mutate() to create indicator variables. The next example uses a logical expression that is TRUE if the word is a noun and FALSE otherwise.

tok %>% 
  mutate(noun = part_of_speech == "n") %>% 
  select(id, gloss, stem, noun) %>% 
  head(10)
## # A tibble: 10 × 4
##         id gloss  stem    noun 
##      <int> <chr>  <chr>   <lgl>
##  1 3499800 white  "white" FALSE
##  2 3499804 xxx    ""      FALSE
##  3 3499886 hi     "hi"    FALSE
##  4 3499906 xxx    ""      FALSE
##  5 3499911 okay   "okay"  FALSE
##  6 3499912 bye    "bye"   FALSE
##  7 3499938 the    "the"   FALSE
##  8 3499939 farmer "farm"  TRUE 
##  9 3499961 okay   "okay"  FALSE
## 10 3499976 yeah   "yeah"  FALSE
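Both steps can also be combined in a single mutate() call, because it accepts several new variables at once. The following is a minimal sketch on a small made-up tibble (the object names toy and toy_new and the part-of-speech values are hypothetical, not taken from Amy's transcripts), so it runs without downloading anything:

```r
library(tidyverse)

# Hypothetical toy data standing in for the token table
toy <- tibble(
  gloss = c("the", "farmer", "okay", "dog"),
  part_of_speech = c("det", "n", "co", "n")
)

toy_new <- toy %>%
  mutate(
    length = str_length(gloss),    # number of characters in each word
    noun = part_of_speech == "n"   # TRUE for nouns, FALSE otherwise
  )

toy_new
```

The original columns are kept, and the two new columns are appended on the right.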

group_by() and summarize()

While mutate() creates a new value for each row of the data, summarize() collapses the data to fewer rows by applying a function (e.g. sum() or mean()) across rows: a single row for ungrouped data, or one row per group for grouped data.

For instance, if you use mutate() to get the length of each word, you can use summarize() to find the mean word length of all observations:

tok_length %>% 
  summarize(length = mean(length))
## # A tibble: 1 × 1
##   length
##    <dbl>
## 1   3.76

On average, the words in Amy’s data are just under 4 characters long (3.76).
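Note that this average counts every token, including xxx, which the transcripts use for unintelligible speech and which always has length 3. Whether to keep such tokens is an analysis decision. The sketch below, on made-up data (toy, toy_means and word_length are hypothetical names), shows one way to compare the mean with and without them:

```r
library(tidyverse)

# Hypothetical toy tokens, including two unintelligible "xxx" placeholders
toy <- tibble(gloss = c("white", "xxx", "hi", "xxx", "okay"))

toy_means <- toy %>%
  mutate(word_length = str_length(gloss)) %>%
  summarize(
    mean_all = mean(word_length),                           # all tokens
    mean_intelligible = mean(word_length[gloss != "xxx"])   # without "xxx"
  )

toy_means
```

An alternative would be to drop these rows with filter(gloss != "xxx") before calling summarize().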

You can also do more complex operations than averages. The following example uses the indicator variable for nouns to calculate the fraction of nouns in Amy’s speech: the sum() function gives the total number of nouns (TRUE/FALSE is treated as 1/0), and dividing by n(), the total number of observations, gives the proportion.

tok %>% 
  mutate(noun = part_of_speech == "n") %>% 
  summarize(noun_prop = sum(noun)/n())
## # A tibble: 1 × 1
##   noun_prop
##       <dbl>
## 1     0.123

So approximately 12% of Amy’s words were nouns.
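Because TRUE/FALSE is treated as 1/0, taking the mean() of the indicator gives the same proportion as dividing the sum by n(). A small sketch on made-up data (toy, toy_prop and the part-of-speech values are hypothetical) shows both versions side by side, together with the counts they are based on:

```r
library(tidyverse)

# Hypothetical toy data: 2 nouns out of 5 tokens
toy <- tibble(part_of_speech = c("det", "n", "co", "n", "v"))

toy_prop <- toy %>%
  mutate(noun = part_of_speech == "n") %>%
  summarize(
    n_tokens = n(),                # total number of rows
    n_nouns = sum(noun),           # number of TRUE values
    noun_prop = sum(noun) / n(),   # proportion, written out
    noun_prop_mean = mean(noun)    # the same proportion via mean()
  )

toy_prop
```

Reporting the counts next to the proportion makes it easier to judge how much data the proportion rests on.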

Sometimes you want summaries per group rather than for the full dataset, e.g. the average word length per transcript. You can accomplish that by specifying one or more grouping variables with the group_by() function before calling summarize().

tok_length %>% 
  group_by(transcript_id) %>% 
  summarize(length = mean(length))
## # A tibble: 2 × 2
##   transcript_id length
##           <int>  <dbl>
## 1          4258   3.66
## 2          4259   3.81
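A grouped summarize() can compute several measures at once, so word length and noun proportion can be reported per transcript in one table. The following is a sketch on made-up data (toy, toy_by_transcript, the transcript ids and the part-of-speech values are all hypothetical):

```r
library(tidyverse)

# Hypothetical toy data with two transcripts
toy <- tibble(
  transcript_id = c(1, 1, 1, 2, 2),
  gloss = c("the", "farmer", "okay", "dog", "yeah"),
  part_of_speech = c("det", "n", "co", "n", "co")
)

toy_by_transcript <- toy %>%
  mutate(
    word_length = str_length(gloss),
    noun = part_of_speech == "n"
  ) %>%
  group_by(transcript_id) %>%
  summarize(
    n_tokens = n(),                   # tokens per transcript
    mean_length = mean(word_length),  # mean word length per transcript
    noun_prop = mean(noun)            # proportion of nouns per transcript
  )

toy_by_transcript
```

The result has one row per value of the grouping variable and one column per summary.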

More materials

You can find a list of helpful links for data manipulation and visualization here.