Introduction
In this workshop, we will expand on some of the basic functions, use tidyverse more and introduce some archaeology specific functions. It will follow the flow of a more typical data analysis workflow. We will work with data about lithics from the Jerimalai rockshelter in East-Timor. This workshop has been inspired by the ‘tidyverse for archaeologists’ workshop by Professor Ben Marwick.
Some of the code will be evaluated here, but some will not so please copy and paste the code into your own script and run it there.
Load the tidyverse
package.
library(tidyverse)
Data
This data is coming from an URL, and it is an excel file. We
unfortunately cannot use tidyverse to read excel files, so we will use
the rio
package.
Install the rio
package.
install.packages("rio")
library(rio)
Load the data using import
.
data <- import("https://bit.ly/j_data_xlsx", setclass = "tbl_df")
Let’s have a look at the data.
View(data)
summary(data)
nrow(data)
## [1] 9752
ncol(data)
## [1] 48
names(data)
## [1] "Site" "Square" "Spit" "Group" "Artno"
## [6] "Material" "Colour" "Weight" "Length" "Artclas"
## [11] "Cortex" "Cortype" "Initiat" "Breaks" "Noseg"
## [16] "Platwid" "Plat" "Focal" "Overhang" "NoDS"
## [21] "NoPS" "Rtch" "Plat_2" "Focal_2" "Overhang_2"
## [26] "NoDS_2" "NoPS_2" "Rtch_2" "RetOri" "Retype"
## [31] "Retloc" "Retlen" "Retdep" "Portion" "Heat"
## [36] "EdgeDam" "Weathering" "Corerot" "RetScNo" "Term"
## [41] "Janus" "Recycled" "Redirect" "Width" "Thick"
## [46] "Platthic" "area" "elongation"
head(data)
## # A tibble: 6 × 48
## Site Square Spit Group Artno Material Colour Weight Length Artclas Cortex
## <chr> <chr> <dbl> <dbl> <dbl> <chr> <chr> <dbl> <dbl> <chr> <dbl>
## 1 J B 1 1 1 Chert Brown 1.83 22.0 Flake 0
## 2 J B 1 1 2 Chert Yellow 0.61 14.6 RetF 0
## 3 J B 1 1 3 Chert Dk Grey 3.81 29.6 Core 0
## 4 J B 1 1 4 Chert Grey 0.13 5.83 RetFrag 0
## 5 J B 1 1 5 Chert Grey 0.05 7.07 Flake 0
## 6 J B 1 1 6 Chert Lt Brown 0.58 11.1 Core NA
## # ℹ 37 more variables: Cortype <chr>, Initiat <chr>, Breaks <chr>, Noseg <dbl>,
## # Platwid <dbl>, Plat <chr>, Focal <lgl>, Overhang <chr>, NoDS <dbl>,
## # NoPS <dbl>, Rtch <chr>, Plat_2 <chr>, Focal_2 <lgl>, Overhang_2 <chr>,
## # NoDS_2 <dbl>, NoPS_2 <dbl>, Rtch_2 <chr>, RetOri <chr>, Retype <chr>,
## # Retloc <chr>, Retlen <dbl>, Retdep <dbl>, Portion <chr>, Heat <dbl>,
## # EdgeDam <dbl>, Weathering <dbl>, Corerot <dbl>, RetScNo <dbl>, Term <chr>,
## # Janus <lgl>, Recycled <lgl>, Redirect <lgl>, Width <dbl>, Thick <dbl>, …
We will select the Square
, Spit
and Weight
columns. Note that we
can select multiple columns in one go using the comma.
data %>%
select(Square, Spit, Weight)
Plotting
We now want to get into plotting our data. For this we are only
interested in the Square
, Spit
, Material
, Weight
and Thick
columns.
plotting_data <- data %>%
select(Square, Spit, Material, Weight, Thick)
plotting_data
## # A tibble: 9,752 × 5
## Square Spit Material Weight Thick
## <chr> <dbl> <chr> <dbl> <dbl>
## 1 B 1 Chert 1.83 3.34
## 2 B 1 Chert 0.61 2.74
## 3 B 1 Chert 3.81 7.25
## 4 B 1 Chert 0.13 2.05
## 5 B 1 Chert 0.05 1.13
## 6 B 1 Chert 0.58 4.79
## 7 B 1 Chert 0.03 1.16
## 8 B 1 Chert 0.02 1.09
## 9 B 1 Chert 0.21 3.91
## 10 B 1 Chert 0.25 3.13
## # ℹ 9,742 more rows
We can now create a histogram of the Weight
column. This will show how
many artefacts have a certain weight. We can do this using the
ggplot()
function. This function takes a tibble, and creates a plot.
ggplot works with layers. We can add these layers with a + sign. The
base command is ggplot()
. This creates a white canvas, and you can add
information that is passed to the layers.
ggplot(plotting_data)
As we want to make a histogram, we only need to specify the x axis.
ggplot(plotting_data, aes(x = Weight))
Now we need to add a layer that displays the data. These layers are
known as geoms. There are many different geoms, but we want to create a
histogram, so we will use the geom_histogram()
function.
ggplot(plotting_data, aes(x = Weight)) +
geom_histogram()
There are some weights that are very large, but most are close to zero.
If we change the x axis scale to a logaritmic scale, we can see the
distribution better. We do this by adding the scale_x_log10()
layer.
A logarithm is a mathematical function that increases slowly at first, and then faster and faster. This is useful for data that has a large range, as it compresses the data and makes it easier to see the distribution. This specific logarithm is the base 10 logarithm, which means that 10 is raised to the power of the number on the x axis. For example, 10^2 = 100, so the number 2 would be at the position 100 on the x axis.
logs_tibble <- tibble(x = c(10^0, 10^1, 10^2, 10^3, 10^4), y = c("10^0", "10^1", "10^2", "10^3", "10^4"), z = log10(x))
logs_tibble
## # A tibble: 5 × 3
## x y z
## <dbl> <chr> <dbl>
## 1 1 10^0 0
## 2 10 10^1 1
## 3 100 10^2 2
## 4 1000 10^3 3
## 5 10000 10^4 4
ggplot(plotting_data, aes(x = Weight)) +
geom_histogram() +
scale_x_log10()
We can also change the theme of the plot. There are many themes
available, but we will use the theme_bw()
theme.
ggplot(plotting_data, aes(x = Weight)) +
geom_histogram() +
scale_x_log10() +
theme_bw()
We now only want to keep rows that have the “A” value for the Square
column. We can do this using the filter()
function. Keep in mind that
“A” is a string, so we need to use quotation marks and two equal signs.
plotting_data %>%
filter(Square == "A")
## # A tibble: 4,436 × 5
## Square Spit Material Weight Thick
## <chr> <dbl> <chr> <dbl> <dbl>
## 1 A 1 Chert 1.74 4.37
## 2 A 1 Chert 1.64 4.09
## 3 A 1 Obsidian 0.12 1.41
## 4 A 1 Chert 2.12 5.96
## 5 A 1 Chert 2.25 7.62
## 6 A 1 Chert 4.88 8.97
## 7 A 1 Chert 8.95 16.0
## 8 A 1 Chert 3.02 8.15
## 9 A 1 Chert 2.71 6.44
## 10 A 1 Chert 1.31 5.84
## # ℹ 4,426 more rows
Now, from these rows we only want light objects, lets say with a
Weight
less than 10. We can do this by adding another filter.
plotting_data %>%
filter(Square == "A") %>%
filter(Weight < 10)
## # A tibble: 4,416 × 5
## Square Spit Material Weight Thick
## <chr> <dbl> <chr> <dbl> <dbl>
## 1 A 1 Chert 1.74 4.37
## 2 A 1 Chert 1.64 4.09
## 3 A 1 Obsidian 0.12 1.41
## 4 A 1 Chert 2.12 5.96
## 5 A 1 Chert 2.25 7.62
## 6 A 1 Chert 4.88 8.97
## 7 A 1 Chert 8.95 16.0
## 8 A 1 Chert 3.02 8.15
## 9 A 1 Chert 2.71 6.44
## 10 A 1 Chert 1.31 5.84
## # ℹ 4,406 more rows
And a thickness between 0.2 and 20. We can either do this by adding two
filters, or by using the between()
function. Note that there is an
important difference between > and >=, and < and <=.
plotting_data %>%
filter(Square == "A") %>%
filter(Weight < 10) %>%
filter(Thick > 0.2) %>%
filter(Thick < 20)
## # A tibble: 4,390 × 5
## Square Spit Material Weight Thick
## <chr> <dbl> <chr> <dbl> <dbl>
## 1 A 1 Chert 1.74 4.37
## 2 A 1 Chert 1.64 4.09
## 3 A 1 Obsidian 0.12 1.41
## 4 A 1 Chert 2.12 5.96
## 5 A 1 Chert 2.25 7.62
## 6 A 1 Chert 4.88 8.97
## 7 A 1 Chert 8.95 16.0
## 8 A 1 Chert 3.02 8.15
## 9 A 1 Chert 2.71 6.44
## 10 A 1 Chert 1.31 5.84
## # ℹ 4,380 more rows
plotting_data %>%
filter(Square == "A") %>%
filter(Weight < 10) %>%
filter(between(Thick, 0.2, 20))
## # A tibble: 4,392 × 5
## Square Spit Material Weight Thick
## <chr> <dbl> <chr> <dbl> <dbl>
## 1 A 1 Chert 1.74 4.37
## 2 A 1 Chert 1.64 4.09
## 3 A 1 Obsidian 0.12 1.41
## 4 A 1 Chert 2.12 5.96
## 5 A 1 Chert 2.25 7.62
## 6 A 1 Chert 4.88 8.97
## 7 A 1 Chert 8.95 16.0
## 8 A 1 Chert 3.02 8.15
## 9 A 1 Chert 2.71 6.44
## 10 A 1 Chert 1.31 5.84
## # ℹ 4,382 more rows
Alternatively, you can combine them in one filter function.
plotting_data %>%
filter(Square == "A" & Weight < 10 & between(Thick, 0.2, 20))
## # A tibble: 4,392 × 5
## Square Spit Material Weight Thick
## <chr> <dbl> <chr> <dbl> <dbl>
## 1 A 1 Chert 1.74 4.37
## 2 A 1 Chert 1.64 4.09
## 3 A 1 Obsidian 0.12 1.41
## 4 A 1 Chert 2.12 5.96
## 5 A 1 Chert 2.25 7.62
## 6 A 1 Chert 4.88 8.97
## 7 A 1 Chert 8.95 16.0
## 8 A 1 Chert 3.02 8.15
## 9 A 1 Chert 2.71 6.44
## 10 A 1 Chert 1.31 5.84
## # ℹ 4,382 more rows
We also don’t want rows that have NA
values in the Material
column.
plotting_data %>%
filter(Square == "A") %>%
filter(Weight < 10) %>%
filter(between(Thick, 0.2, 20)) %>%
drop_na(Material)
## # A tibble: 4,392 × 5
## Square Spit Material Weight Thick
## <chr> <dbl> <chr> <dbl> <dbl>
## 1 A 1 Chert 1.74 4.37
## 2 A 1 Chert 1.64 4.09
## 3 A 1 Obsidian 0.12 1.41
## 4 A 1 Chert 2.12 5.96
## 5 A 1 Chert 2.25 7.62
## 6 A 1 Chert 4.88 8.97
## 7 A 1 Chert 8.95 16.0
## 8 A 1 Chert 3.02 8.15
## 9 A 1 Chert 2.71 6.44
## 10 A 1 Chert 1.31 5.84
## # ℹ 4,382 more rows
We now want to make a scatterplot of the Weight
and Thick
columns.
new_plotting_data <- plotting_data %>%
filter(Square == "A") %>%
filter(Weight < 10) %>%
filter(between(Thick, 0.2, 20)) %>%
drop_na(Material)
Create the empty plot.
ggplot(new_plotting_data, aes(x = Weight, y = Thick))
Add the points layer.
ggplot(new_plotting_data, aes(x = Weight, y = Thick)) +
geom_point()
It may again be useful to change the x axis to a logaritmic scale.
ggplot(new_plotting_data, aes(x = Weight, y = Thick)) +
geom_point() +
scale_x_log10()
We can also do the same for the y axis.
ggplot(new_plotting_data, aes(x = Weight, y = Thick)) +
geom_point() +
scale_x_log10() +
scale_y_log10()
We now set the theme, as well as changing the axis labels.
ggplot(new_plotting_data, aes(x = Weight, y = Thick)) +
geom_point() +
scale_x_log10() +
scale_y_log10() +
theme_bw() +
labs(x = "Weight (g)", y = "Thickness (mm)")
We can also add a colour to the points based on the material the object
is made of. We do this by adding the colour = Material
argument to the
aes()
function.
ggplot(new_plotting_data, aes(x = Weight, y = Thick, colour = Material)) +
geom_point() +
scale_x_log10() +
scale_y_log10() +
theme_bw() +
labs(x = "Weight (g)", y = "Thickness (mm)")
We can change the position of the legend using the theme()
function.
We need to specify the x and y coordinates of the legend. This uses a
coordinate system between 0 and 1, so 0.5 is in the middle.
ggplot(new_plotting_data, aes(x = Weight, y = Thick, colour = Material)) +
geom_point() +
scale_x_log10() +
scale_y_log10() +
theme_bw() +
labs(x = "Weight (g)", y = "Thickness (mm)") +
theme(legend.position = c(0.85, 0.3))
Our goal now is to make a boxplot of the plat_area
column, grouped by
the Material
column. Lets get the data first. We need to create the
plat_area
column first.
plat_area_data <- data %>%
mutate(plat_area = Platwid * Platthic)
Now we only keep materials that have more than 5 objects and remove NA
values.
plat_area_data <- plat_area_data %>%
group_by(Material) %>%
filter(n() > 5) %>%
drop_na(Material)
Now create the plot using the geom_boxplot()
function. We leave the
code for a homework assignment.
We now want to create a plot that shows the number of objects per material. First get the data.
material_data <- plat_area_data %>%
group_by(Material) %>%
count()
Now create the plot using the geom_col()
function.
ggplot(data = material_data) +
aes(x = Material,
y = n) +
geom_col() +
scale_y_log10() +
labs(x = "Raw Material",
y = "Number of Objects") +
theme_classic(base_size = 14)
We can change the order of the columns using the reorder()
function.
We need to specify the column we want to reorder, and the column we want
to order by. In this case, we want to order the Material
column by the
n
column.
ggplot(data = material_data) +
aes(x = reorder(Material, n),
y = n) +
geom_col() +
scale_y_log10() +
labs(x = "Raw Material",
y = "Number of Objects") +
theme_classic(base_size = 14)
If we wanted to create a top 5 plot, we would first arrange the data by
the n
column, and then only keep the top 5 rows.
material_data <- material_data %>%
arrange(desc(n)) %>% # Note we arrange the rows in descending order, so the largest value is at the top
head(5)
ggplot(data = material_data) +
aes(x = reorder(Material, n), y = n) +
geom_col() +
scale_y_log10() +
labs(x = "Raw Material", y = "Number of Objects") +
theme_classic(base_size = 14)
Now lets say we want to make a plot of the distributions of length of
each object. We can either manually create each plot by filtering for
each material, or we can use the facet_wrap()
function.
ggplot(data = plat_area_data) +
aes(x = Length) +
geom_histogram() +
facet_wrap(~Material) +
labs(x = "Length (mm)",
y = "Number of Objects") +
theme_classic(base_size = 14)
Due to the amount of chert items in the data, this is not very useful,
but we can set the y
axis to be free for each plot.
ggplot(data = plat_area_data) +
aes(x = Length) +
geom_histogram() +
facet_wrap(~Material, scales = "free_y") +
labs(x = "Length (mm)",
y = "Number of Objects") +
theme_classic(base_size = 14)
Homework assignments
Assignment 1
Create the plot below. This is the same plot as we made earlier in the workshop.
Assignment 2
Create the plot below of the most common colours. You will need to perform some data manipulation first.
Assignment 3
What is the most common term in the dataset? Create a tibble that contains the term and the number of times it occurs.