Working with text

Introduction

This tutorial introduces how to treat text as data in R using the tidytext package. It covers methods for importing text, tokenizing it, looking for (partial) matches with simple regular expressions, and analyzing word frequencies.

We start by installing and loading the tidytext package, and loading the tidyverse package.

# install.packages("tidytext")
library(tidytext)
library(tidyverse)

Tidy text format

Normally, we think of data as being a table of values, mostly numbers or categories with a limited number of levels. In tidy data, each row corresponds to an observation (e.g. a person completing a survey, or a country about which we know particular characteristics). However, text can also be a kind of data that we can work with in a systematic manner.

When thinking of the “tidy text format”, we need to think of each row as a unit of speech, and each column as a variable that describes the properties of that particular unit of speech. For example, a unit of speech can be a word, and variables could include the word itself, its stem, an indicator of its part of speech (noun, verb, etc.), and so on. For basic text analysis, all we need is the actual text, so we can convert any existing text to a tidy text format in R without needing additional information.

For example, we can start by defining our text as a single object called raw_text that contains the full text we want to analyze (notice the quotation marks around the text to show that it is a character vector).

raw_text <- "Text analysis in R provides valuable insights by uncovering patterns, trends, and relationships within textual data. It can reveal sentiment, topic distributions, keyword frequencies, and even hidden structures using techniques like natural language processing (NLP) and machine learning. Users can analyze customer reviews, social media posts, or academic papers to identify recurring themes, sentiment shifts, and linguistic trends, ultimately aiding decision-making, market research, and content optimization."

If we want to convert this object to tidy text, we first need to choose what unit we want to work with in our analysis. If our goal is to see which words appear most frequently in the text, we should conduct our analysis at the word level. In that case, each word in the text is treated as one token, and the text is made up of many tokens arranged in a meaningful order.

In order to convert a vector like raw_text to a tidy format, we first need to convert the vector to a tibble with the as_tibble() function. Then we can use the unnest_tokens() function from tidytext to split the text into tokens so that each token represents one row in our tidy data. When using unnest_tokens(), we need to specify the variable name of our original text (in this case value), the name of the new token-level variable we want to create (let’s call it word), and the token type we want to use (in this case “words”, but it could also be e.g. “sentences”, “lines”, or “ngrams”, i.e. combinations of n adjacent words). You can look at all the token types in the help file of unnest_tokens() by typing ?unnest_tokens().

Notice that the code chunk below uses the pipe operator. You can read more about how the pipe is used here.

clean_text <- raw_text |> 
  as_tibble() |> 
  unnest_tokens(output = word, input = value, token = "words")

In the clean_text tibble we now have one variable called word, which in each row contains one word from the original text, in the order they initially appeared. By default, unnest_tokens() also cleans the text by converting all letters to lowercase and removing punctuation.
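
We can check the result by printing the first rows of clean_text. Given the example text above, we should see something like the following (notice that “Text” and “R” are now lowercase):

head(clean_text)
## # A tibble: 6 × 1
##   word    
##   <chr>   
## 1 text    
## 2 analysis
## 3 in      
## 4 r       
## 5 provides
## 6 valuable

Had we chosen a different token type, the rows would look different. As a minimal sketch, tokenizing the same text into bigrams (pairs of adjacent words) would produce rows like “text analysis”, “analysis in”, and so on; with token = "ngrams", additional arguments such as n are passed on to the tokenizer.

# the same pipeline, but with bigrams (pairs of adjacent words) as tokens
raw_text |> 
  as_tibble() |> 
  unnest_tokens(output = bigram, input = value, token = "ngrams", n = 2)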

Counting word frequencies

The simplest method of getting a quick overview of a long text is to count the number of times each word appears in the text and look at what the most frequent words are. We can get these word frequencies using the count() function, specifying which variable we want to count.

clean_text |> 
  count(word)
## # A tibble: 60 × 2
##    word         n
##    <chr>    <int>
##  1 academic     1
##  2 aiding       1
##  3 analysis     1
##  4 analyze      1
##  5 and          5
##  6 by           1
##  7 can          2
##  8 content      1
##  9 customer     1
## 10 data         1
## # ℹ 50 more rows

The count() function has an argument sort, which allows us to sort the output from most frequent to least frequent words.

clean_text |> 
  count(word, sort = TRUE)
## # A tibble: 60 × 2
##    word          n
##    <chr>     <int>
##  1 and           5
##  2 can           2
##  3 sentiment     2
##  4 trends        2
##  5 academic      1
##  6 aiding        1
##  7 analysis      1
##  8 analyze       1
##  9 by            1
## 10 content       1
## # ℹ 50 more rows

Looking for exact and partial word matches

In many cases we are interested in analyzing only the parts of a text that concern our topic of interest. For example, we may want to find which parts of a text talk about “data” and in what context. In that case, we can use the filter() function to keep only observations that meet a particular criterion. You can see some more explanation and general examples of the filter() function in this tutorial.

When we are working with words, exact matches are often enough for our purposes. For example, we can look at how many rows in our clean_text tibble have “data” as the value of the word variable.

clean_text |> 
  filter(word == "data")
## # A tibble: 1 × 1
##   word 
##   <chr>
## 1 data

However, this filter tells us nothing about the context in which “data” appears. For that, it would be better to split the text into sentences, and find which sentence contains the word “data”. But if we tokenize the text into sentences, an exact match won’t find the sentence we’re looking for.

# split the text into sentences
sentences <- raw_text |> 
  as_tibble() |> 
  unnest_tokens(output = sentence, input = value, token = "sentences")

# keep only rows with an exact match to "data" (no such rows)
sentences |> 
  filter(sentence == "data")
## # A tibble: 0 × 1
## # ℹ 1 variable: sentence <chr>

If we want to find a partial string match (i.e. a sentence that, among other content, contains the word “data”), we need a special function to detect partial matches. This function is str_detect() from the stringr package (loaded as part of the tidyverse), and it takes two arguments: the variable that contains the elements you want to evaluate (string) and the pattern you are looking for (pattern). In our case, the variable is sentence and the pattern is “data”. str_detect() returns a logical vector, i.e. for each element of your variable it tells you whether it matches the pattern (TRUE) or not (FALSE).

# example of str_detect()
str_detect(string = c("A", "AB", "BB"), pattern = "A")
## [1]  TRUE  TRUE FALSE

# look for partial match to "data" (one sentence)
sentences |> 
  filter(str_detect(string = sentence, pattern = "data"))
## # A tibble: 1 × 1
##   sentence                                                                      
##   <chr>                                                                         
## 1 text analysis in r provides valuable insights by uncovering patterns, trends,…
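
Finally, the pattern argument of str_detect() is interpreted as a regular expression, so we are not limited to literal matches. As a minimal sketch building on the objects above, the regular expression "analy(sis|ze)" matches sentences containing either “analysis” or “analyze”:

# the regex "analy(sis|ze)" matches both "analysis" and "analyze"
sentences |> 
  filter(str_detect(string = sentence, pattern = "analy(sis|ze)"))

Given our example text, this should return the first and third sentences, which contain “analysis” and “analyze” respectively.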