AH-ANTQ103: Workshop 2


Spring 2024

Introduction

In this workshop, we will expand on some of the basic functions, use tidyverse more and introduce some archaeology specific functions. It will follow the flow of a more typical data analysis workflow. We will again work with data about lithics from the Jerimalai rockshelter in East-Timor. This workshop has been inspired by the ‘tidyverse for archaeologists’ workshop by Professor Ben Marwick.

Some of the code will be evaluated here, but some will not so please copy and paste the code into your own script and run it there.

Load the tidyverse and rio packages.

library(tidyverse)
library(rio)

Data

Remember that the data comes in the form of an excel file from a link, and that we load it in using the import function from the rio package.

data <- import("https://bit.ly/j_data_xlsx", setclass = "tbl_df")

Let’s have a more detailed look at the data.

View(data)
summary(data)

This is a lot of information, and not that useful for us, as it is more useful when dealing with true numerical data, something that happens more rarely in archaeological data science than in other disciplines.

Let’s try out some other new functions to look at a data frame that might be more useful in this case.

names(data)
##  [1] "Site"       "Square"     "Spit"       "Group"      "Artno"     
##  [6] "Material"   "Colour"     "Weight"     "Length"     "Artclas"   
## [11] "Cortex"     "Cortype"    "Initiat"    "Breaks"     "Noseg"     
## [16] "Platwid"    "Plat"       "Focal"      "Overhang"   "NoDS"      
## [21] "NoPS"       "Rtch"       "Plat_2"     "Focal_2"    "Overhang_2"
## [26] "NoDS_2"     "NoPS_2"     "Rtch_2"     "RetOri"     "Retype"    
## [31] "Retloc"     "Retlen"     "Retdep"     "Portion"    "Heat"      
## [36] "EdgeDam"    "Weathering" "Corerot"    "RetScNo"    "Term"      
## [41] "Janus"      "Recycled"   "Redirect"   "Width"      "Thick"     
## [46] "Platthic"   "area"       "elongation"
head(data)
## # A tibble: 6 × 48
##   Site  Square  Spit Group Artno Material Colour   Weight Length Artclas Cortex
##   <chr> <chr>  <dbl> <dbl> <dbl> <chr>    <chr>     <dbl>  <dbl> <chr>    <dbl>
## 1 J     B          1     1     1 Chert    Brown      1.83  22.0  Flake        0
## 2 J     B          1     1     2 Chert    Yellow     0.61  14.6  RetF         0
## 3 J     B          1     1     3 Chert    Dk Grey    3.81  29.6  Core         0
## 4 J     B          1     1     4 Chert    Grey       0.13   5.83 RetFrag      0
## 5 J     B          1     1     5 Chert    Grey       0.05   7.07 Flake        0
## 6 J     B          1     1     6 Chert    Lt Brown   0.58  11.1  Core        NA
## # ℹ 37 more variables: Cortype <chr>, Initiat <chr>, Breaks <chr>, Noseg <dbl>,
## #   Platwid <dbl>, Plat <chr>, Focal <lgl>, Overhang <chr>, NoDS <dbl>,
## #   NoPS <dbl>, Rtch <chr>, Plat_2 <chr>, Focal_2 <lgl>, Overhang_2 <chr>,
## #   NoDS_2 <dbl>, NoPS_2 <dbl>, Rtch_2 <chr>, RetOri <chr>, Retype <chr>,
## #   Retloc <chr>, Retlen <dbl>, Retdep <dbl>, Portion <chr>, Heat <dbl>,
## #   EdgeDam <dbl>, Weathering <dbl>, Corerot <dbl>, RetScNo <dbl>, Term <chr>,
## #   Janus <lgl>, Recycled <lgl>, Redirect <lgl>, Width <dbl>, Thick <dbl>, …

We will select the Square, Spit, Material, Weight, Thick and Colour columns. Note that we can select multiple columns in one go using the comma.

data |> 
  select(Square, Spit, Material, Weight, Thick, Colour)

Plot 1:

We now want to get into plotting our data. For this we are again only interested in the Square, Spit, Material, Weight and Thick columns.

Wile we work with data, we often change our data sets quite a lot depending on what we want to investigate, so it is necessary to cache our data sets under different names. That way we always keep track of which data is stored where, which is particularly important in archaeology given the vastly different types of analysis we can conduct with our data.

data_plotting <- data |> 
  select(Square, Spit, Material, Weight, Thick)

print(data_plotting)
## # A tibble: 9,752 × 5
##    Square  Spit Material Weight Thick
##    <chr>  <dbl> <chr>     <dbl> <dbl>
##  1 B          1 Chert      1.83  3.34
##  2 B          1 Chert      0.61  2.74
##  3 B          1 Chert      3.81  7.25
##  4 B          1 Chert      0.13  2.05
##  5 B          1 Chert      0.05  1.13
##  6 B          1 Chert      0.58  4.79
##  7 B          1 Chert      0.03  1.16
##  8 B          1 Chert      0.02  1.09
##  9 B          1 Chert      0.21  3.91
## 10 B          1 Chert      0.25  3.13
## # ℹ 9,742 more rows

We can now create a histogram of the Weight column. This will show how many artefacts have a certain weight. Creating histograms of a variable is a common practice in archaeology, as it provides a quick visual overview of patterns. We can do this using the important ggplot() function. This function takes a tibble, and can then create all kinds of plots. ggplot() works with layers.

We can add these layers with a + sign. The base command is ggplot(). This creates a white canvas, and you can add information by adding layers.

ggplot(data = data_plotting)

Since we have not told ggplot how to display the data yet, we still have a blank canvas.

As we want to make a histogram, we only need to specify the x axis, which is done by the aes() function within the mapping argument.

ggplot(data = data_plotting, mapping = aes(x = Weight))

We have now filled in the two most important arguments from ggplot(), data and mapping. Since these arguments are always the first two arguments in the ggplot() function, we often do not explicitly name them to save us some time.

However, we still do not see a histogram, so now we need to add a layer that actually tells ggplot that we want the plot to be a histogram.

The layers that define the type of plot are known as geoms. There are many different geoms, but we want to create a histogram, so we will use the geom_histogram() function.

ggplot(data_plotting, aes(x = Weight)) +
  geom_histogram()

We can see our data on the plot!

However, there appear to be some weights that are very large and cause the graph to stretch to 600, whilst most are close to zero. If we change the x-axis scale to a logaritmic scale, we can see the distribution better. We edit the scale by adding a new layer to our plot, in our case in the form of the scale_x_log10() function.

A logarithm is a mathematical function that increases slowly at first, and then faster and faster. This is useful for data that has a large range (something you will be encountering more often with archaeological data), as it compresses the data and makes it easier to see the distribution. This specific logarithm is the base 10 logarithm, which means that 10 is raised to the power of the number on the x axis. For example, 10^2 = 100, so the number 2 would be at the position 100 on the x axis.

ggplot(data_plotting, aes(x = Weight)) +
  geom_histogram() +
  scale_x_log10()

Now that it finally looks like an actual histogram, we can give it a last touch by changing the theme of the plot. This is purely an optical change and there are many themes available, but we will just use the theme_bw() theme for now, which we add like any other layer to our ggplot().

ggplot(data_plotting, aes(x = Weight)) +
  geom_histogram() +
  scale_x_log10() +
  theme_bw()

Plot 2:

for our next plot, we only want to keep rows that have the “A” value for the Square column. We can do this using the filter() function. Keep in mind that “A” is a string, so we need to use quotation marks and two equal signs.

data_plotting_2 <- data |> 
  filter(Square == "A")

Now, from these rows we only want light objects, lets say with a Weight less than 10. We can do this by adding another condition to our filter() with the & (“and”) operator.

data_plotting_2 |>
  filter(Square == "A" & Weight < 10)
## # A tibble: 4,416 × 48
##    Site  Square  Spit Group Artno Material Colour Weight Length Artclas  Cortex
##    <chr> <chr>  <dbl> <dbl> <dbl> <chr>    <chr>   <dbl>  <dbl> <chr>     <dbl>
##  1 J     A          1    NA  4332 Chert    Peach    1.74  18.3  Flake         0
##  2 J     A          1    NA  4333 Chert    Grey     1.64  20.5  Flake        10
##  3 J     A          1    NA  4334 Obsidian Black    0.12   9.53 Flake         0
##  4 J     A          1    NA  4335 Chert    Grey     2.12  17.6  Flake         0
##  5 J     A          1    NA  4336 Chert    Pink     2.25  19.3  RetF          0
##  6 J     A          1    NA  4337 Chert    Orange   4.88  25.1  Non-diag      0
##  7 J     A          1    NA  4338 Chert    White    8.95  21.0  Flake         0
##  8 J     A          1    NA  4339 Chert    Brown    3.02  20.1  Flake-b       0
##  9 J     A          1    NA  4340 Chert    Grey     2.71  19.5  Flake         0
## 10 J     A          1    NA  4341 Chert    Grey     1.31   9.36 Flake         0
## # ℹ 4,406 more rows
## # ℹ 37 more variables: Cortype <chr>, Initiat <chr>, Breaks <chr>, Noseg <dbl>,
## #   Platwid <dbl>, Plat <chr>, Focal <lgl>, Overhang <chr>, NoDS <dbl>,
## #   NoPS <dbl>, Rtch <chr>, Plat_2 <chr>, Focal_2 <lgl>, Overhang_2 <chr>,
## #   NoDS_2 <dbl>, NoPS_2 <dbl>, Rtch_2 <chr>, RetOri <chr>, Retype <chr>,
## #   Retloc <chr>, Retlen <dbl>, Retdep <dbl>, Portion <chr>, Heat <dbl>,
## #   EdgeDam <dbl>, Weathering <dbl>, Corerot <dbl>, RetScNo <dbl>, …

& is a very useful logical operator. Another very common logical operator is | (Shift + /), which stands for “or”. If we wanted to include all observations that have an “A” or “a” value in the Square column for example, we could write the following code:

data_example_or <- data |>
  filter(Square == "A" | Square == "a")

For now we will continue and add to our previous filters though.

We also want a thickness between 0.2 and 20. We can either do this by adding two additional conditions, or by using the between() function.

data_plotting_2 |> 
  filter(Square == "A" & Weight < 10 & Thick > 0.2 & Thick < 20)
## # A tibble: 4,390 × 48
##    Site  Square  Spit Group Artno Material Colour Weight Length Artclas  Cortex
##    <chr> <chr>  <dbl> <dbl> <dbl> <chr>    <chr>   <dbl>  <dbl> <chr>     <dbl>
##  1 J     A          1    NA  4332 Chert    Peach    1.74  18.3  Flake         0
##  2 J     A          1    NA  4333 Chert    Grey     1.64  20.5  Flake        10
##  3 J     A          1    NA  4334 Obsidian Black    0.12   9.53 Flake         0
##  4 J     A          1    NA  4335 Chert    Grey     2.12  17.6  Flake         0
##  5 J     A          1    NA  4336 Chert    Pink     2.25  19.3  RetF          0
##  6 J     A          1    NA  4337 Chert    Orange   4.88  25.1  Non-diag      0
##  7 J     A          1    NA  4338 Chert    White    8.95  21.0  Flake         0
##  8 J     A          1    NA  4339 Chert    Brown    3.02  20.1  Flake-b       0
##  9 J     A          1    NA  4340 Chert    Grey     2.71  19.5  Flake         0
## 10 J     A          1    NA  4341 Chert    Grey     1.31   9.36 Flake         0
## # ℹ 4,380 more rows
## # ℹ 37 more variables: Cortype <chr>, Initiat <chr>, Breaks <chr>, Noseg <dbl>,
## #   Platwid <dbl>, Plat <chr>, Focal <lgl>, Overhang <chr>, NoDS <dbl>,
## #   NoPS <dbl>, Rtch <chr>, Plat_2 <chr>, Focal_2 <lgl>, Overhang_2 <chr>,
## #   NoDS_2 <dbl>, NoPS_2 <dbl>, Rtch_2 <chr>, RetOri <chr>, Retype <chr>,
## #   Retloc <chr>, Retlen <dbl>, Retdep <dbl>, Portion <chr>, Heat <dbl>,
## #   EdgeDam <dbl>, Weathering <dbl>, Corerot <dbl>, RetScNo <dbl>, …
data_plotting_2 |> 
  filter(Square == "A" & Weight < 10 & between(Thick, 0.2, 20))
## # A tibble: 4,392 × 48
##    Site  Square  Spit Group Artno Material Colour Weight Length Artclas  Cortex
##    <chr> <chr>  <dbl> <dbl> <dbl> <chr>    <chr>   <dbl>  <dbl> <chr>     <dbl>
##  1 J     A          1    NA  4332 Chert    Peach    1.74  18.3  Flake         0
##  2 J     A          1    NA  4333 Chert    Grey     1.64  20.5  Flake        10
##  3 J     A          1    NA  4334 Obsidian Black    0.12   9.53 Flake         0
##  4 J     A          1    NA  4335 Chert    Grey     2.12  17.6  Flake         0
##  5 J     A          1    NA  4336 Chert    Pink     2.25  19.3  RetF          0
##  6 J     A          1    NA  4337 Chert    Orange   4.88  25.1  Non-diag      0
##  7 J     A          1    NA  4338 Chert    White    8.95  21.0  Flake         0
##  8 J     A          1    NA  4339 Chert    Brown    3.02  20.1  Flake-b       0
##  9 J     A          1    NA  4340 Chert    Grey     2.71  19.5  Flake         0
## 10 J     A          1    NA  4341 Chert    Grey     1.31   9.36 Flake         0
## # ℹ 4,382 more rows
## # ℹ 37 more variables: Cortype <chr>, Initiat <chr>, Breaks <chr>, Noseg <dbl>,
## #   Platwid <dbl>, Plat <chr>, Focal <lgl>, Overhang <chr>, NoDS <dbl>,
## #   NoPS <dbl>, Rtch <chr>, Plat_2 <chr>, Focal_2 <lgl>, Overhang_2 <chr>,
## #   NoDS_2 <dbl>, NoPS_2 <dbl>, Rtch_2 <chr>, RetOri <chr>, Retype <chr>,
## #   Retloc <chr>, Retlen <dbl>, Retdep <dbl>, Portion <chr>, Heat <dbl>,
## #   EdgeDam <dbl>, Weathering <dbl>, Corerot <dbl>, RetScNo <dbl>, …

We also don’t want rows that have NA values in the Material column.

data_plotting_2 <- data_plotting_2|> 
  filter(Square == "A" & Weight < 10 & between(Thick, 0.2, 20) & !is.na(Material))

We now want to create a scatterplot of the Weight and Thick columns. Scatterplots can be used in archaeology to examine correlations between two variable of an object, for example the depth from which it was excavated and the time it can be dated to. In that sense it also allows us to visually find outliers that might infer an unusual activity or previously unexplained pattern.

Create the empty plot, now define both x and y.

 ggplot(data_plotting_2, aes(x = Weight, y = Thick))

Add the geom_point() layer that creates a scatterplot.

 ggplot(data_plotting_2, aes(x = Weight, y = Thick)) +
  geom_point()

It may again be useful to change the x axis to a logarithmic scale.

 ggplot(data_plotting_2, aes(x = Weight, y = Thick)) +
  geom_point() +
  scale_x_log10()

We can also do the same for the y axis this time.

 ggplot(data_plotting_2, aes(x = Weight, y = Thick)) +
  geom_point() +
  scale_x_log10() +
  scale_y_log10()

We now set the theme, as well as changing the axis labels, for which we use the labs() function that has all types of useful arguments to change the labels of our plots.

 ggplot(data_plotting_2, aes(x = Weight, y = Thick)) +
  geom_point() +
  scale_x_log10() +
  scale_y_log10() +
  theme_bw() +
  labs(x = "Weight (g)", y = "Thickness (mm)")

We can also add a colour to the points based on the material the object is made of. We do this by adding the colour argument to the aes() function and telling it to base the color on the Material column.

 ggplot(data_plotting_2, aes(x = Weight, y = Thick, colour = Material)) +
  geom_point() +
  scale_x_log10() +
  scale_y_log10() +
  theme_bw() +
  labs(x = "Weight (g)", y = "Thickness (mm)")

Using the theme() function (not to be mixed up with the different themes functions), we can change the position of the legend. We need to specify the x and y coordinates of the legend. This uses a coordinate system between 0 and 1, so 0.5 is in the middle. When you create your own graphs, finding the right position might take some try and error.

ggplot(data_plotting_2, aes(x = Weight, y = Thick, colour = Material)) +
  geom_point() +
  scale_x_log10() +
  scale_y_log10() +
  theme_bw() +
  labs(x = "Weight (g)", y = "Thickness (mm)") +
  theme(legend.position = c(0.85, 0.3))

Plot 3:

Our goal now is to make a boxplot of a plat_area column, grouped by the Material column. Boxplots are useful to use in archaeology to estimate size distributions by a category like we are doing it here. Lets get the data first. We need to create the plat_area column first.

data_plat_area <- data |> 
  mutate(plat_area = Platwid * Platthic)

Now we only keep materials that have more than 5 objects and remove NA values. To create subgroups within a data frame use the group_by() function that requires a column as input based on which groups will be created.

data_plotting_3 <- data_plat_area |> 
  group_by(Material) |>
  filter(n() > 5 & !is.na(Material))

Now create the plot using the geom_boxplot() function. We leave the code for a homework assignment.

Plot 4:

We now want to create a plot that shows the number of objects per material. First get the data.

To get the number of objects we make use of the count() function. The count function counts the number of unique values in a column and returns a summary table with the values and their counts.

data_plotting_4 <- data_plotting_3 |> 
  group_by(Material) |> 
  count()

Now create the plot using the geom_col() function.

ggplot(data = data_plotting_4) +  
  aes(x = Material,  
      y = n) +  
  geom_col() +  
  scale_y_log10() +  
  labs(x = "Raw Material",  
       y =  "Number of Objects") +  
  theme_classic(base_size = 14)

We can change the order of the columns using the reorder() function. We need to specify the column we want to reorder, and the column we want to order by. In this case, we want to order the Material column by the n column to order the columns by height.

ggplot(data = data_plotting_4) +  
  aes(x = reorder(Material, n),  
      y = n) +  
  geom_col() +  
  scale_y_log10() +  
  labs(x = "Raw Material",  
       y =  "Number of Objects") +  
  theme_classic(base_size = 14)

If we wanted to create a top 5 plot, we would first arrange the data by the n column, and then only keep the top 5 rows. Here we make use of the arrange() function which can rearrange columns. By default, arrange() orders in ascending order, so if we want to arrange in descending order, we have to add the desc() function in the arrange() function.

data_plotting_4 <- data_plotting_4 |> 
  arrange(desc(n)) |> 
  head(5)

ggplot(data = data_plotting_4, 
       aes(x = reorder(Material, n), y = n)
       ) +
  geom_col() +
  scale_y_log10() +
  labs(x = "Raw Material", y = "Number of Objects") +
  theme_classic(base_size = 14)

Plot 5:

Now lets say we want to make a plot of the distributions of length of each object. We can either manually create each plot by filtering for each material, or we can simply use the facet_wrap() function that splits a plot up by a column.

ggplot(data = data_plotting_3) +  
  aes(x = Length) +  
  geom_histogram() +  
  facet_wrap(~Material) +  
  labs(x = "Length (mm)",  
       y =  "Number of Objects") +  
  theme_classic(base_size = 14)

Due to the amount of chert items in the data, this is not very useful, but with the scales argument we can set the y axis to be free for each plot.

ggplot(data = data_plotting_3) +  
  aes(x = Length) +  
  geom_histogram() +  
  facet_wrap(~Material, scales = "free_y") +  
  labs(x = "Length (mm)",  
       y =  "Number of Objects") +  
  theme_classic(base_size = 14)

Homework assignments

For all assignments, consider how your findings allow you to infer relevant archaeological conclusions from the data.

Assignment 1

Create the plot below. This is the same plot as we made earlier in the workshop.

Assignment 2

Create the plot below of the most common colors. You will need to perform some data manipulation first.

Assignment 3

What is the most common term in the dataset? Create a tibble that contains the term and the number of times it occurs.