Data
In this workshop, we will work with Beazley Archive data and focus more on text analysis and fuzzy dates. This workshop is based on work done by Kalle Valkeakari.
As always, we start by loading tidyverse.
library(tidyverse)
We then load the data. We use the same dataset we had from the first workshop.
data <- read_csv("https://github.com/ucrdatacenter/projects/raw/main/AH-ANTQ103/2024h1/Beazley_Archive.csv")
We start by looking at the data. The following functions each give a quick overview.
head(data)
glimpse(data)
summary(data)
names(data)
After looking at the data, we have decided that we don't need the 9th column or the 12th through 29th columns, so we remove them using the select() function. Think about why we don't need these columns.
data_short <- data %>%
select(-9, -12:-29)
Note that we can also do this by specifying the columns we do want to keep:
data_short <- data %>%
select(1:8, 10:11)
Or by stating the column names we want to keep:
data_short <- data %>%
select(URI, Vase_Number, Fabric, Technique, Sub_Technique, Shape_Name, Provenance, Date, Attributed_To, Decoration)
We might want to check whether there are multiple entries for the same object. Luckily there is a vase number column, so we can check whether any vase number appears more than once. We can do this with the duplicated() function, and then use filter() to keep only the duplicated rows.
data_short %>%
filter(duplicated(Vase_Number))
Given that this returns an empty tibble, we can conclude that no two rows share the same vase number.
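As a sanity check, we could also count how often each vase number occurs; any value of n greater than 1 would indicate a duplicate. A minimal sketch using count():
data_short %>%
  count(Vase_Number) %>%
  filter(n > 1)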
We can also check whether any rows are missing a vase number. The is.na() function returns a logical vector, which we can then use to filter the data.
data_short %>%
filter(is.na(Vase_Number))
Given that this tibble is also empty, we can conclude that no rows are missing a vase number.
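Note that this only checks the Vase_Number column. To count the missing values in every column at once, we could use across(); a short sketch:
data_short %>%
  summarise(across(everything(), ~ sum(is.na(.))))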
Text analysis
Now that we know that every row has a unique, non-missing vase number, we can start looking at the data in more detail. We are interested in the decorations on the vases, so we will look at the Decoration column.
data_short %>%
select(Decoration)
This is text data. An interesting question is what the most common words in the Decoration column are; these correspond to the most common decorations. We can find them using the unnest_tokens() function, which takes a column of text data, splits it into individual words, and returns a tibble with the words in a column called word. The unnest_tokens() function is part of the tidytext package, which we must install and load before we can use it.
install.packages("tidytext")
library(tidytext)
data_words <- data_short %>%
unnest_tokens(word, Decoration)
Let's take a look at the most common words.
data_words %>%
count(word) %>%
arrange(desc(n))
We can already see that there are some NA values. We can remove these easily using the drop_na() function.
data_words %>%
drop_na(word) %>%
count(word) %>%
arrange(desc(n))
There are still some words that are not interesting for our analysis; these are called stop words. We can remove them using the anti_join() function, which takes two tibbles and returns the rows from the first tibble that are not in the second. The tidytext package includes a tibble of common stop words.
stop_words %>%
View()
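For instance, assuming the built-in English stop word list suits our data, we could remove those words directly:
data_words %>%
  drop_na(word) %>%
  anti_join(stop_words, by = "word") %>%
  count(word) %>%
  arrange(desc(n))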
Alternatively, we can define our own custom stop words.
custom_stop_words <- tibble(
  word = c("a", "and", "with", "an", "or", "the", "of", "to", "in", "for", "on", "at", "from", "by", "about", "as", "into", "like", "through", "after", "over", "between", "out", "against", "during", "without", "before", "under", "around", "among")
)
We can then use anti_join() to remove these stop words from the data.
word_counts <- data_words %>%
  drop_na(word) %>%
  anti_join(custom_stop_words, by = "word") %>%
  count(word) %>%
  arrange(desc(n))
Here we still have the letter “b”. We may decide to remove all words that are only one letter long, using the filter() function together with str_length(), which returns the number of characters in a string.
word_counts <- word_counts %>%
filter(str_length(word) > 1) %>%
print()
There are still a lot of words here. We decide to only look at the top 20 words, which we can do using the top_n() function.
word_counts_top_20 <- word_counts %>%
top_n(20, n) %>%
print()
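In recent versions of dplyr, top_n() has been superseded by slice_max(); assuming dplyr 1.0 or later, an equivalent call would be:
word_counts %>%
  slice_max(n, n = 20)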
We can then create a plot of these words.
ggplot(word_counts_top_20, aes(x = reorder(word, n), y = n)) +
geom_col() +
coord_flip() +
xlab("Word") +
ylab("Number of occurrences") +
ggtitle("Most common words in the Decoration column") +
theme_bw()
For the homework, the separate() function is essential. This function takes a tibble, a column name, one or more new column names, and a separator; it splits the column at the separator and returns a tibble with the new columns. Here is a simple example of how to use this function; it will be genuinely useful in the homework. If we take a look at the Decoration column, we can see that some vases have an entry that is one word, then a colon, and then more words. Imagine we want to isolate the first word and the words after the colon. We can do this with separate(), by setting the separator to ":".
data_short %>%
separate(Decoration, c("Decoration_1", "Decoration_2"), sep = ":")
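One caveat: if an entry contains more than one colon, separate() keeps only the first two pieces and drops the rest with a warning. If we wanted to keep everything after the first colon together in the second column, we could pass extra = "merge":
data_short %>%
  separate(Decoration, c("Decoration_1", "Decoration_2"),
           sep = ":", extra = "merge")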
Fuzzy dates
Dates for the origins of artefacts are usually given as ranges (e.g. 450-375 BC), and computers do not handle this kind of “fuzzy” or uncertain time well. There are two ways to deal with this type of data. The first is quite simple: take the mean of the two values. This can, however, lead to large errors if you do not understand the limitations of the approach. Let's look at the technique used for the pottery, that is, the colors of the decorations, and ask: how does the technique change over time?
We can separate the dates into two columns and then take the mean of the two columns, using the separate() function.
data_short_dates <- data_short %>%
separate(Date, c("Date_start", "Date_end"), sep = " to ")
These columns are now character columns. We can convert them to numeric columns using the as.numeric() function.
data_short_dates <- data_short_dates %>%
mutate(Date_start = as.numeric(Date_start),
Date_end = as.numeric(Date_end))
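Note that as.numeric() turns any value it cannot parse into NA (with a warning). A quick check of how many rows were affected, if any:
data_short_dates %>%
  filter(is.na(Date_start) | is.na(Date_end)) %>%
  nrow()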
We can then calculate the mean for each row.
data_short_dates <- data_short_dates %>%
mutate(Date_mean = (Date_start + Date_end) / 2)
We only keep the rows with the black-figure or red-figure technique, as this allows for easier comparison with the plot we will make later.
plotting_data <- data_short_dates %>%
filter(Technique == "BLACK-FIGURE" | Technique == "RED-FIGURE")
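Equivalently, we could write this filter more compactly with %in%:
plotting_data <- data_short_dates %>%
  filter(Technique %in% c("BLACK-FIGURE", "RED-FIGURE"))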
We can then create a plot of the technique over time. We use the fill aesthetic to color the bars by technique.
ggplot(plotting_data, aes(x = Date_mean, fill = Technique)) +
geom_histogram(binwidth = 25, position = "dodge")
Alternatively, we can keep the dates fuzzy. For this we use the datplot package.
install.packages("datplot")
library(datplot)
This package already includes a Beazley dataset, which we will use from now on. We can load it using the data() function.
data(Beazley)
For the type of plot we want to make, the data needs to be in a specific format, namely ID, Factor, date_min, date_max. The Beazley dataset from datplot is already in this format.
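We can verify this by inspecting the first few rows:
head(Beazley)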
The following code creates a copy of each object for every year it could possibly date to, given its date range. We can define the step size, which is the number of years between each copy.
result <- datsteps(Beazley, stepsize = 1)
It also calculates a weight for each copy, the inverse of the number of copies: the more copies an object has, the lower its weight.
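We can take a quick look at the result to see the DAT_step and weight columns that the plot below uses:
head(result)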
We can use this data to plot the technique over time. Note that we map the weight column to the weight aesthetic, which means the weights determine the height of the density curve. Also note that the Technique column is here called variable.
ggplot(result, aes(x = DAT_step, fill = variable, weight = weight)) +
geom_density(alpha = 0.5) +
labs(x = "Dating", y = "Density") +
ggtitle("Technique over time")
Homework assignments
Assignment 1
- Create a plot of the most common descriptions of the vases (as we did in the tutorial) in a new dataset; use the dataset linked below. Reuse the code from the lecture to create the plot, and compare the results. What do you notice?
Note that you can import the data from GitHub using this link:
"https://github.com/ucrdatacenter/projects/raw/main/AH-ANTQ103/2024h1/Beazley_Archive_2.csv"
Assignment 2
- Create a plot of the most common shapes of the pottery in a new dataset. Use the dataset from 1.1. We are looking just for the main shapes, so for example “CUP FRAGMENT” and “CUP” should be counted as the same shape, namely “CUP”. Show the top 10. You will need to split a column. Hint: the separator you want to use is either a comma or a space; you can use the regular expression [, ] to do that.
- Use the code you wrote for 1.1 to create a plot of the most common shapes of the pottery in the original dataset. Again, “CUP FRAGMENT” and “CUP” (and similar definitions) should be counted as the same shape. Adapting the code should be very straightforward.
Assignment 3
Create a plot of the technique over time using the original dataset, which includes the fuzzy dates.
Hint: the dataset you use for the datsteps() function needs to be in a specific format.
You can use this code to create a dataset you can use as a starting point.
data_short_dates <- data_short %>%
separate(Date, c("Date_start", "Date_end"), sep = " to ")
data_short_dates <- data_short_dates %>%
mutate(Date_start = as.numeric(Date_start),
Date_end = as.numeric(Date_end))
The plot should look like this:
Assignment 4
This is an open assignment: create a new, original plot that you think is interesting and report your findings. You could, for example, perform a text analysis on a new Beazley dataset and compare the results to what we found earlier, perform another time analysis on the Beazley dataset, or use a different dataset and compare your findings. If you can think of another interesting analysis using archaeological data, you can do that as well. While working on this assignment, please be aware that this is the file you will be asked to work with for the final report.
Some ideas may include, but are not limited to:
- Comparing the provenances of vases from several large museums
- Looking at the occurrences of mythological figures on vases
- Comparing the shapes of vases for shapes that may be used in everyday life