Introduction
This tutorial is based on multiple chapters of “Text Mining with R: A Tidy Approach” by Julia Silge and David Robinson. Throughout this tutorial, we will use the tidytext package to analyze text data, in particular the contents of Alice in Wonderland and Winnie-the-Pooh. These books and many others are available via the gutenberg_download() function in the gutenbergr package, which provides access to the Project Gutenberg collection of public domain books.
# install.packages("tidytext")
# install.packages("gutenbergr")
library(tidyverse)
library(tidytext)
library(gutenbergr)
# Download books based on their Gutenberg ID
# https://gutenberg.org/ebooks/19033
# https://gutenberg.org/ebooks/67098
books <- gutenberg_download(c(19033, 67098))
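As a quick check (an addition to the tutorial, not part of the original code), we can look at what gutenberg_download() returned: a tibble with a gutenberg_id column and one row per line of text.
# number of downloaded lines per book
count(books, gutenberg_id)
# first few rows: one line of text per row
head(books)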
Representing text as data
Tidy text format
Currently in the books tibble, each row represents a line of text from one of the books. It is often useful to represent text data in a tidy format, where each row represents a word or token, as then we can apply data wrangling operations at the word level. We can use the unnest_tokens() function from the tidytext package to easily split the text into words, or into other units of analysis such as characters, sentences or paragraphs. This function also takes care of removing punctuation, converting words to lowercase, and dropping empty rows. Before tokenizing, we may want to remove the contents of the title page, as the actual book contents only start on line 38 for Alice in Wonderland and line 79 for Winnie-the-Pooh.
# remove title page and add book title variable
books_content <- books |>
# add book title
mutate(book_title = ifelse(gutenberg_id == 19033, "Alice", "Winnie")) |>
# restart counting row numbers for each book
group_by(book_title) |>
filter((book_title == "Alice" & row_number() >= 38) |
(book_title == "Winnie" & row_number() >= 79)) |>
ungroup()
words <- books_content |>
# split books into words
unnest_tokens(output = word, input = text)
You may notice that not all words are completely clean or relevant: some are surrounded by underscores and some are numbers. We can clean these up manually with regular expressions.
words <- words |>
mutate(word = str_remove_all(word, "_")) |>
filter(!str_detect(word, "^\\d+$"))
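As a sanity check (an addition, not part of the original tutorial), we can confirm that the cleanup worked and that no underscore or digit-only tokens remain:
# count tokens that still contain an underscore or consist only of digits (should be 0)
words |>
filter(str_detect(word, "_") | str_detect(word, "^\\d+$")) |>
nrow()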
Term frequency
The term frequency (tf) of a word is the number of times it appears in a document, divided by the total number of words in the document. In other words, we count how often each word occurs and normalize that count by the document’s length, so that term frequencies are comparable across documents of different lengths.
tf <- words |>
# count the number of times each word appears in each book
count(book_title, word) |>
# divide by number of words in each book
group_by(book_title) |>
mutate(tf = n / sum(n)) |>
ungroup()
Term frequency - inverse document frequency (tf-idf)
The inverse document frequency (idf) of a word is the logarithm of the total number of documents divided by the number of documents that contain the word. It is a measure of how unique or rare a word is across the entire corpus. The intuition for why idf matters is that words that appear in many documents are less informative than words that appear in only a few documents. Therefore we often combine tf and idf into a single metric called term frequency-inverse document frequency (tf-idf), which is the product of tf and idf. The bind_tf_idf() function from the tidytext package can be used to calculate tf-idf values (the function also generates tf and idf separately).
tf_idf <- words |>
count(book_title, word) |>
bind_tf_idf(word, book_title, n)
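For intuition, here is a minimal sketch (an addition to the tutorial) of computing the same quantities by hand, using idf = log(total number of documents / number of documents containing the word); the resulting tf, idf and tf_idf columns should match the bind_tf_idf() output above.
# manual tf-idf computation (illustrative)
n_documents <- n_distinct(words$book_title)
manual_tf_idf <- words |>
count(book_title, word) |>
# term frequency: word count divided by the total number of words in the book
group_by(book_title) |>
mutate(tf = n / sum(n)) |>
ungroup() |>
# inverse document frequency: log of total documents over documents containing the word
group_by(word) |>
mutate(idf = log(n_documents / n_distinct(book_title))) |>
ungroup() |>
mutate(tf_idf = tf * idf)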
Document-term matrix
Documents (in our case, books) can be represented as a document-term matrix, where each row represents a document and each column represents a word. The value in each cell is equal to the frequency of the word in the document. This matrix is sometimes called the bag-of-words representation of the text data (although sometimes that contains 0/1 values based on whether the word appears in the document), because it ignores the ordering of the words in the text. These matrices can be created using the cast_dtm() function from the tidytext package. Note that these matrices can get very large depending on the size of the vocabulary and the number of documents. Our data has fewer than 3,000 unique words and 2 documents, so it is manageable.
# DTM with pivot_wider() (generic tibble)
dtm_tibble <- words |>
count(book_title, word) |>
pivot_wider(names_from = word, values_from = n, values_fill = 0)
# DTM with cast_dtm() (DocumentTermMatrix object)
dtm <- words |>
count(book_title, word) |>
cast_dtm(document = book_title, term = word, value = n)
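To confirm the size claim above, we can inspect the dimensions of the DocumentTermMatrix and peek at a small corner of it (a quick check added here, not part of the original code).
# 2 documents (rows) by the number of unique words (columns)
dim(dtm)
# a small corner of the matrix
as.matrix(dtm)[, 1:5]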
Text analysis
Visualize word frequencies
The easiest way to represent the contents of a document is to show the most frequent words. We can use a bar chart to show the words with the highest term frequency in each book, or use a word cloud where the size of each word is proportional to its frequency (using the ggwordcloud package).
library(ggwordcloud)
# Most frequent words
tf_idf |>
group_by(book_title) |>
slice_max(tf, n = 20) |>
ggplot(aes(tf, reorder_within(word, tf, book_title))) +
geom_col() +
facet_wrap(~book_title, scales = "free") +
scale_y_reordered() +
labs(x = "Term frequency", y = NULL, title = "Most frequent words") +
theme_minimal()
# Word cloud
tf |>
group_by(book_title) |>
slice_max(tf, n = 50) |>
ggplot(aes(label = word, size = tf)) +
geom_text_wordcloud() +
facet_wrap(~book_title) +
theme_minimal()
Using simple word frequencies is often uninformative because common words like “the” or “and” will dominate the results. One way to address this problem is to display the words with the highest tf-idf values: in the case of two books, this will show the words that are unique to each book, since words that appear in both documents have an idf (and therefore a tf-idf) of 0.
tf_idf |>
group_by(book_title) |>
slice_max(tf_idf, n = 20) |>
ggplot(aes(tf_idf, reorder_within(word, tf_idf, book_title))) +
geom_col() +
facet_wrap(~book_title, scales = "free") +
scale_y_reordered() +
labs(x = "tf-idf", y = NULL, title = "Highest tf-idf words") +
theme_minimal()
An alternative method is to remove common words (called stopwords) from the analysis, using a stopword list. tidytext provides a list of stopwords with the get_stopwords() function, which can be used to filter out common words from the analysis.
stopwords <- get_stopwords() |> pull(word)
tf_idf |>
group_by(book_title) |>
# remove stopwords
filter(!word %in% stopwords) |>
slice_max(tf, n = 20) |>
ggplot(aes(tf, reorder_within(word, tf, book_title))) +
geom_col() +
facet_wrap(~book_title, scales = "free") +
scale_y_reordered() +
labs(x = "Term frequency", y = NULL, title = "Most frequent words (excl. stopwords)") +
theme_minimal()
Bigrams, n-grams
Bigrams are pairs of words that appear next to each other in a document; n-grams are sequences of n words. They can be useful to capture the context in which words appear, as the meaning of a word can depend on the words that surround it. By specifying the token argument in the unnest_tokens() function, we can split the text into bigrams or n-grams.
bigrams <- books_content |>
unnest_tokens(bigram, text, token = "ngrams", n = 2)
We can visualize the most common bigrams the same way we did for unigrams (single words).
bigrams |>
drop_na() |>
count(book_title, bigram) |>
group_by(book_title) |>
slice_max(n, n = 20) |>
ggplot(aes(n, reorder_within(bigram, n, book_title))) +
geom_col() +
facet_wrap(~book_title, scales = "free") +
scale_y_reordered() +
labs(x = "Frequency", y = NULL, title = "Most frequent bigrams") +
theme_minimal()
In addition, we can make use of the extra context information provided by bigrams, and visualize which words are most likely to appear after a given word. For that, we need to separate the bigrams into two columns, one for the first word and one for the second word. To keep the vocabulary relatively small, we will only consider bigrams where neither of the words is a stopword. We can use these frequencies to create a network visualization of the most common bigrams with the igraph and ggraph packages.
library(igraph)
library(ggraph)
# create graph object
bigram_graph <- bigrams |>
drop_na() |>
# separate bigrams into two columns
separate(bigram, c("word1", "word2"), sep = " ") |>
# remove stopwords
filter(!word1 %in% stopwords & !word2 %in% stopwords) |>
# count word frequencies
count(word1, word2) |>
# keep only bigrams that appear more than 5 times
filter(n > 5) |>
# create graph object
graph_from_data_frame()
# plot graph
ggraph(bigram_graph, layout = "fr") +
geom_edge_link(aes(edge_width = n), show.legend = FALSE) +
geom_node_point() +
geom_node_text(aes(label = name), repel = TRUE) +
scale_edge_width(range = c(0.1, 2)) +
theme_void()
Sentiment analysis
Sentiment analysis is the process of determining the sentiment of a piece of text, i.e. whether it is positive, negative, or neutral. One way to do this is to use a sentiment lexicon, which is a list of words and their associated sentiment scores. There are multiple sentiment lexicons available, such as Bing, AFINN, and NRC. These differ in their training data and the sentiment categories they use, but all of them can be retrieved with the get_sentiments() function. We can then merge the tidy words tibble with a sentiment lexicon to assign sentiment scores to each word.
# get sentiment lexicons
bing <- get_sentiments("bing")
afinn <- get_sentiments("afinn")
# plot the most common positive and negative words with the Bing lexicon
words |>
inner_join(bing, by = "word") |>
count(book_title, word, sentiment) |>
group_by(book_title, sentiment) |>
slice_max(n, n = 10) |>
ggplot(aes(n, reorder_within(word, n, book_title), fill = sentiment)) +
geom_col() +
scale_y_reordered() +
labs(x = "Frequency", y = NULL) +
facet_wrap(~book_title, scales = "free") +
theme_minimal()
# calculate sentiment scores per book with the AFINN lexicon
words |>
count(book_title, word) |>
inner_join(afinn, by = "word") |>
# calculate each word's contribution to the sentiment score
mutate(value_n = value * n) |>
group_by(book_title) |>
# calculate the sentiment score for each book (sum of sentiment scores / number of words)
summarize(score = sum(value_n) / sum(n))
## # A tibble: 2 × 2
## book_title score
## <chr> <dbl>
## 1 Alice 0.124
## 2 Winnie 0.810
Topic modelling
Topic modelling is a method to discover the topics that are present in a collection of documents. It is an unsupervised learning method, meaning that it does not require labeled data. One popular topic modelling method is latent Dirichlet allocation (LDA), which assumes that each document is a mixture of topics, and each topic is a mixture of words. The LDA() function from the topicmodels package can be used to fit an LDA model to a document-term matrix. The function requires a document-term matrix as created by cast_dtm(), so first we should create a clean version of our previous dtm object (with stopwords removed).
Before we fit a model, we need to decide how many topics to use. If we have prior expectations about what results we want to see, we can choose a specific number of topics; otherwise we can try multiple values until we find sensible results. The model also includes a random initialization step, so it is a good idea to set a seed to ensure that we get the same results every time. In this case, we fit a model with 2 topics, hoping that the model can separate the contents of the two books.
library(topicmodels)
dtm <- words |>
count(book_title, word) |>
filter(!word %in% stopwords) |>
cast_dtm(document = book_title, term = word, value = n)
# fit LDA model with 2 topics
lda <- LDA(dtm, k = 2, control = list(seed = 1))
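If we had no prior expectation about the number of topics, one option (an addition to the tutorial, and only a rough heuristic, especially with just two documents) is to fit models with several values of k and compare them with the perplexity() function from topicmodels, where lower values indicate a better fit to the data.
# compare models with 2 to 5 topics via their perplexity on the training data
sapply(2:5, function(k) {
perplexity(LDA(dtm, k = k, control = list(seed = 1)))
})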
When interpreting LDA results, we consider two sets of parameters: the document-topic matrix and the topic-word matrix. The document-topic matrix tells us how much of each topic is present in each document, while the topic-word matrix tells us which words are associated with each topic. We can use the tidy() function from the tidytext package to extract these matrices into a tidy format, specifying matrix = "beta" for the topic-word matrix and matrix = "gamma" for the document-topic matrix.
topic_word <- tidy(lda, matrix = "beta")
document_topic <- tidy(lda, matrix = "gamma")
The topic-word matrix helps us give meaning to the topics by showing which words are the most strongly associated with each topic. We can plot these word probabilities to visualize the topics.
topic_word |>
group_by(topic) |>
slice_max(beta, n = 10) |>
ggplot(aes(beta, reorder_within(term, beta, topic))) +
geom_col() +
facet_wrap(~topic, scales = "free") +
scale_y_reordered() +
labs(x = "Per-topic-per-word probability", y = NULL) +
theme_minimal()
It seems like the topics can separate the two books well, although it might not succeed with a different random seed.
The per-document-per-topic probabilities confirm that the documents are clearly split into topics, with each document having a near-1 probability for one topic and near-0 for the other.
document_topic |>
pivot_wider(names_from = topic, values_from = gamma)
## # A tibble: 2 × 3
## document `1` `2`
## <chr> <dbl> <dbl>
## 1 Alice 0.00000788 1.00
## 2 Winnie 1.00 0.00000338
Nevertheless, topic modelling can be very useful for larger collections of documents, where it can help to identify the main themes present in the corpus.
Word embeddings
Word embeddings are a way to represent words as vectors in a high-dimensional space, where words with similar meanings are close to each other. There are pre-trained models (such as GloVe or BERT) that were trained on large corpora of text data, but we can also create our own word embeddings from our own data, which will be specific to the context at hand. One popular method for creating our own word embeddings is word2vec, which is implemented in the word2vec package. word2vec is one of the simplest embedding models, but more complex, more contextualized embeddings form the basis of current large language models.
word2vec() takes a character vector containing the full text, so let’s create a 2-element vector where each element corresponds to the full text of one of the books, using the cleaned version of the text from words. We can specify many model parameters, such as the dimension of the word vectors (the length of the vector associated with a word) or the context window (the number of words around each word to consider as its context), but we can also go with the default settings. The numerical values of the embeddings aren’t informative in themselves; the information is in the similarities and differences between words.
library(word2vec)
text <- words |>
group_by(book_title) |>
# collapse words into a single string
summarize(text = paste(word, collapse = " ")) |>
# extract the text as a vector
pull(text)
# create word embeddings
embeddings <- word2vec(text)
# view word embeddings
predict(embeddings, words$word, type = "embedding")[1:6, 1:6]
## [,1] [,2] [,3] [,4] [,5] [,6]
## i 2.100090 0.6038027 -1.241109 0.6840506 -1.1386383 1.673814
## down 1.994635 0.4940750 -1.411342 0.5578233 -0.8436571 1.528763
## the 1.939298 0.5112578 -1.395995 0.5517772 -0.7790172 1.485731
## rabbit 2.059652 0.5316093 -1.333344 0.6338458 -0.9531527 1.594034
## hole 2.010487 0.5072647 -1.378551 0.5895357 -0.8947219 1.553197
## alice 2.022050 0.5381439 -1.396282 0.6215151 -0.9162369 1.557800
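If the default settings don’t give useful results, we can also set the main parameters explicitly; the values below are illustrative choices, not tuned recommendations.
# an alternative model with explicit (illustrative) settings: skip-gram architecture,
# 100-dimensional vectors, a 5-word context window, more training iterations,
# and words kept only if they appear at least 5 times
embeddings_tuned <- word2vec(text, type = "skip-gram", dim = 100, window = 5, iter = 20, min_count = 5)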
Embeddings are useful for evaluating which words are the most similar to a particular word. In this case, similarity doesn’t necessarily mean similarity in meaning, but rather that the words could replace each other in a sentence or occur near each other. We can use the generic predict() function to find the most similar words to a given word.
# most similar words to "alice" and "pooh"
predict(embeddings, c("alice", "pooh"), type = "nearest", top_n = 5) |>
bind_rows()
## term1 term2 similarity rank
## 1 alice near 0.9998999 1
## 2 alice perhaps 0.9998997 2
## 3 alice blue 0.9998993 3
## 4 alice tried 0.9998991 4
## 5 alice ran 0.9998983 5
## 6 pooh yes 0.9998373 1
## 7 pooh piglet 0.9997662 2
## 8 pooh am 0.9997537 3
## 9 pooh re 0.9997448 4
## 10 pooh is 0.9997326 5
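Besides nearest-neighbour lookups, we can also compare hand-picked word pairs directly with the word2vec_similarity() function from the word2vec package (the word choices below are just illustrative).
# cosine similarity between the embeddings of selected words
pair_embeddings <- predict(embeddings, c("alice", "rabbit", "pooh", "piglet"), type = "embedding")
word2vec_similarity(pair_embeddings["alice", , drop = FALSE], pair_embeddings[c("rabbit", "pooh", "piglet"), , drop = FALSE], type = "cosine")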
A more systematic way to evaluate the embeddings is to use them to visualize the words in a lower-dimensional space. We can use principal component analysis (PCA) to reduce the dimensionality of the embeddings to 2 dimensions, and then plot the words in this space. You don’t need to understand how PCA works beyond this: it takes the initial high-dimensional data and finds the directions in which the data varies the most, by looking for combinations of the original variables. So the first two dimensions capture as much variation in the embedding space as possible in two dimensions.
To get the visualization, we first get all the embedding vectors with the predict() function used above, then use the prcomp() function to perform PCA, and predict the first two components with the predict() function again. We limit the visualization to the 100 words with the highest tf-idf values and remove overlapping labels to keep the plot readable.
# get embedding vectors
vectors <- predict(embeddings, words$word, type = "embedding") |>
as.data.frame() |>
drop_na()
# get PCA dimensions
prcomp(vectors) |>
# predict the first two components
predict() |>
as.data.frame() |>
rownames_to_column("word") |>
# keep only the 100 words with the highest tf-idf values
filter(word %in% slice_max(tf_idf, tf_idf, n = 100)$word) |>
ggplot(aes(PC1, PC2, label = word)) +
geom_text(size = 3, check_overlap = TRUE) +
theme_void()
Apparently, the most different words in the two books are “balloon” and “forest” along one dimension, while “christopher” and “robin” are very different from all other words. It doesn’t seem like the results make much sense, which is probably because we trained the model on a small sample. The model can perform much better on larger datasets, and indeed much of the power of state-of-the-art language models comes from the large amounts of data they are trained on (together with a very large number of model parameters).