Data Center Apprenticeship: Data visualization


Spring 2024

Schedule:

  • day 1: finish coordinate systems, maybe start with facets
  • day 2: (facets), patchwork, theme, other packages

Introduction

This tutorial introduces data visualization in R, primarily using the ggplot2 package (included in tidyverse). The tutorial is based on A ggplot2 Tutorial for Beautiful Plotting in R by Cédric Scherer and Modern Data Visualization with R by Robert Kabacoff.

Visualization with ggplot2

Packages and data

You will use the following packages and dataset to practice data visualization. The data contains daily information about Chicago’s weather from 1997 to 2000.

library(tidyverse)
library(patchwork) # multiple plots
library(GGally) # correlation scatterplot matrix
library(scatterplot3d) # 3-D scatterplots
library(ggalluvial) # static alluvial diagram
library(networkD3) # interactive alluvial diagram
library(plotly) # interactive plots

# Load the data from GitHub
chic <- read_csv("https://raw.githubusercontent.com/ucrdatacenter/projects/main/apprenticeship/2024h1/3_visualization/chicago.csv")

Basic plots

The ggplot2 package builds up figures in layers, by adding elements one at a time. You always start with a base ggplot where you specify the data used by the plot and possibly the variables to place on each axis. These variables are specified within an aes() function, which stands for aesthetics.

The ggplot() function in itself only creates a blank canvas; we need to add so-called geoms to actually plot the data. You can choose from a wide range of geoms, and also use multiple geoms in one plot. You can add elements to a ggplot object with the + sign. You should think of the + sign in ggplot workflows in the same way you think of the pipe operators in data wrangling workflows.

# Create the base of a plot with date on the x-axis and temperature on the y-axis
ggplot(chic, aes(x = date, y = temp))

# Create a scatterplot
ggplot(chic, aes(x = date, y = temp)) + 
  geom_point()

# Create a line plot
ggplot(chic, aes(x = date, y = temp)) + 
  geom_line()

# Combine both points and lines in the plot
ggplot(chic, aes(x = date, y = temp)) +
  geom_point() +
  geom_line()

# Customize the appearance of points and lines
ggplot(chic, aes(x = date, y = temp)) +
  geom_point(color = "firebrick", shape = "diamond", size = 2) + 
  geom_line(color = "firebrick", linetype = "dotted", linewidth = .3) + # linewidth sets line thickness (ggplot2 >= 3.4.0)
  theme_light() # Apply a light theme

The previous example used the theme_light() function to change the design of the plot. Instead of specifying it per individual plot, you can change the default setting for all future plots with the theme_set() function.

# Set the default theme for all following plots
theme_set(theme_light())
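
One detail that may be handy: theme_set() invisibly returns the theme that was active before the call, so the earlier default can be restored later. A minimal sketch (the object name old_theme is only illustrative):

```r
library(ggplot2)

# theme_set() returns the previously active theme, so keep it if you may need it again
old_theme <- theme_set(theme_light())

# ... plots created from here on use theme_light() by default ...

# To go back to the earlier default at any point, run:
# theme_set(old_theme)
```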

There are multiple ways to make axis titles more informative, such as the xlab() and ylab() functions or the x and y arguments in the labs() function. These elements can be added to a ggplot object just like any geom, theme, or other customization option.

# Add axis labels with xlab() and ylab()
ggplot(chic, aes(x = date, y = temp)) +
  geom_point(color = "firebrick") +
  xlab("Year") +
  ylab("Temperature")

# Add axis labels with labs(), use math expressions
ggplot(chic, aes(x = date, y = temp)) +
  geom_point(color = "firebrick") +
  labs(x = "Year", y = expression(paste("Temperature (", degree ~ F, ")")))

Additional aesthetics and legends

In addition to using the x and y axes to show variable values, you can use other characteristics of geoms to vary based on variables. You can add these additional characteristics – such as color, fill, size, shape – to the aes() function.

Notice that ggplot2 treats character variables as categorical and arranges their values in alphabetical order. If the variable has another meaningful order (e.g. for seasons Spring should come after Winter), you can convert the variable to a factor with the levels defined in the correct order. The second example below also shows how you can move from a data wrangling pipe straight into a plot without saving the intermediate dataset as a separate object.

# Color the scatterplot points by season
ggplot(chic, aes(x = date, y = temp, color = season)) +
  geom_point() +
  labs(x = "Year", y = "Temperature")

chic |> 
  # Convert season to a factor with seasons in the correct order
  mutate(season = factor(season, levels = c("Winter", "Spring", "Summer", "Autumn"))) |> 
  # Determine the color and shape of the points by season
  ggplot(aes(x = date, y = temp, color = season, shape = season)) +
  geom_point() +
  labs(x = "Year", y = "Temperature")

If a step of data wrangling should apply to all plots, it is easier to save the resulting data as a new object or overwrite the original data (if it doesn’t lead to a loss of information).

# Convert season to a factor in the original data
chic <- chic |> 
  mutate(season = factor(season, levels = c("Winter", "Spring", "Summer", "Autumn")))
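
The effect of the levels argument can be checked without any plotting. The sketch below uses a small hand-written vector rather than the chic data:

```r
# Hand-written example values (not taken from chic)
seasons <- c("Winter", "Spring", "Summer", "Autumn")

# Without explicit levels, the levels are sorted alphabetically
levels(factor(seasons))
# "Autumn" "Spring" "Summer" "Winter"

# With explicit levels, the order you specify is kept
season_order <- c("Winter", "Spring", "Summer", "Autumn")
levels(factor(seasons, levels = season_order))
# "Winter" "Spring" "Summer" "Autumn"
```

Legends, axes and facets in ggplot2 follow this level order.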

Other geoms

So far you have seen point and line geoms, but there are many more. The R Graph Gallery provides a long list of common plot types, and so do Chapters 4 and 5 of Modern Data Visualization with R. Both resources group geoms by the type of variable(s) plotted.

For the frequency distribution of a continuous variable, you’d often use a histogram or density plot, while for the frequencies of a categorical variable you’d use a bar chart.

# Histogram of temperatures
ggplot(chic, aes(temp)) +
  geom_histogram(fill = "grey", color = "red")

# Density plot of temperatures
ggplot(chic, aes(temp)) +
  geom_density(fill = "grey", alpha = 0.5)

# Density plot of temperatures per season
ggplot(chic, aes(temp, fill = season)) +
  geom_density(alpha = 0.3)

# Number of observations per month
ggplot(chic, aes(month)) +
  geom_bar()

# Number of observations per month, colors by year
ggplot(chic, aes(month, fill = factor(year))) +
  geom_bar()

# Number of observations per season, with bars for each year placed side by side
ggplot(chic, aes(season, fill = factor(year))) +
  geom_bar(position = "dodge")
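
The histograms above rely on the default of 30 bins, for which ggplot2 prints a message suggesting that you pick a better value. As a rough sketch of how to control the bins yourself, the example below uses a few made-up values (not the chic data) and the binwidth and boundary arguments of geom_histogram(); layer_data() shows what ggplot2 computed for the layer.

```r
library(ggplot2)

# Made-up values for illustration only
toy <- data.frame(temp = c(1, 2, 3, 6, 7, 12))

# Bins that are 5 units wide and aligned to start at 0
p_hist <- ggplot(toy, aes(temp)) +
  geom_histogram(binwidth = 5, boundary = 0, fill = "grey", color = "red")
p_hist

# Inspect the bin edges and counts computed for the histogram layer
bins <- layer_data(p_hist)
bins[, c("xmin", "xmax", "count")]
```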

To make relationships between two continuous variables clearer, you can add smoothing curves – you can keep the curve flexible or restrict it to a straight line.

# Add a smooth curve to the scatterplot
ggplot(chic, aes(date, temp)) +
  geom_point() +
  geom_smooth() +
  labs(x = "Year", y = "Temperature")

# Use a linear fit and remove confidence interval around the line
ggplot(chic, aes(date, temp)) +
  geom_point() +
  geom_smooth(method = "lm", se = FALSE) +
  labs(x = "Year", y = "Temperature")

You can use text as geoms as well, either on top of other geoms to label them, or as the main data markers. In this case the y-coordinates of the labels correspond to the average temperature of the season.

# Plot the mean temperature per season
chic |> 
  # Calculate mean temperature per year and season
  group_by(year, season) |> 
  summarize(temp = mean(temp)) |> 
  ggplot(aes(year, temp)) +
  geom_text(aes(label = season)) +
  labs(x = NULL, y = "Mean temperature per season")

# geom_label is the same as geom_text but with a filled background
chic |> 
  group_by(year, season) |> 
  summarize(temp = mean(temp)) |> 
  ggplot(aes(year, temp)) +
  # Fill label background per season
  geom_label(aes(label = season, fill = season)) +
  labs(x = NULL, y = "Mean temperature per season")

You can use various other geoms to highlight particular points or boundaries of your plot, e.g. by adding horizontal or vertical lines at key locations. An easy way to highlight particular observations is to add a new layer of geoms, where instead of the full dataset you use a filtered version with only the highlighted observations. To do so, you can override the global data choice in ggplot() by adding a data argument within a geom. Here chic |> filter(yday %in% 358:360) keeps the Christmas period in all years, and the extra layer highlights those days in red. The dashed blue line at 32 °F (0 °C) makes it easy to determine whether a particular year had below-freezing temperatures at Christmas.

ggplot(chic, aes(date, temp)) +
  geom_point(color = "grey70", alpha = 0.5, size = 2) +
  # Highlight selected points: Christmas from each year
  geom_point(data = chic |> filter(yday %in% 358:360), color = "red", size = 2) +
  # Add a horizontal line at temp == 32
  geom_hline(yintercept = 32, color = "blue", linetype = "dashed", linewidth = 1.5) +
  labs(x = "Year", y = "Temperature")

If you’d like to plot intervals, ribbon geoms may be useful. For each x (y) value the ribbon needs a minimum and maximum point along the y (x) axis. In this case the monthly minimum and maximum temperatures form the lower and upper bounds of the ribbon. It can also be used to plot confidence intervals or standard errors around estimated values.

# Add a ribbon showing monthly range of temperatures
chic |> 
  # Calculate lowest and highest temperature per month
  group_by(year, month) |> 
  mutate(upper = max(temp), 
         lower = min(temp)) |> 
  ggplot() +
  # use those temperatures as upper and lower bound of a ribbon
  geom_ribbon(aes(x = date, ymin = lower, ymax = upper), alpha = 0.2) +
  geom_point(aes(date, temp))
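
As a sketch of the second use mentioned above, a ribbon can show an uncertainty band around group means. The data below are simulated purely for illustration, and the band of plus or minus two standard errors is an arbitrary choice made for this example.

```r
library(tidyverse)

# Simulated daily temperatures for three months (illustrative only)
set.seed(1)
toy <- tibble(
  month = rep(1:3, each = 30),
  temp = rnorm(90, mean = rep(c(30, 40, 55), each = 30), sd = 5)
)

# Mean and standard error of temperature per month
monthly <- toy |>
  group_by(month) |>
  summarize(mean_temp = mean(temp),
            se = sd(temp) / sqrt(n()))

p_ribbon <- ggplot(monthly, aes(x = month, y = mean_temp)) +
  # Band of +/- 2 standard errors around the monthly mean
  geom_ribbon(aes(ymin = mean_temp - 2 * se, ymax = mean_temp + 2 * se), alpha = 0.2) +
  geom_line() +
  labs(x = "Month", y = "Mean temperature")
p_ribbon
```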

Heatmaps have a variety of uses; they are most commonly used to show correlations between predictors, but they can have any two categorical variables on the axes, and use color to show how the value of a (usually continuous) variable differs between different combinations of the categorical variables. For example this figure shows how average temperature changes in different years and seasons.

# Heatmap of average temperature per season
chic |> 
  # Calculate average temperature per season
  group_by(year, season) |> 
  summarize(temp = mean(temp)) |> 
  ggplot() +
  # geom_tile with fill aesthetic creates heatmap
  geom_tile(aes(year, season, fill = temp)) +
  # Add text to display average temperatures
  geom_text(aes(year, season, label = round(temp, 1)), color = "white") +
  labs(x = NULL, y = NULL, fill = "Average\ntemperature")
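
The correlation use case mentioned above can be sketched in the same way. This example uses three variables from R's built-in mtcars data instead of chic, and the column names var1, var2 and correlation are chosen here for illustration.

```r
library(tidyverse)

# Correlation matrix of three mtcars variables, reshaped to long format:
# one row per pair of variables
cor_long <- cor(mtcars[, c("mpg", "hp", "wt")]) |>
  as.data.frame() |>
  rownames_to_column("var1") |>
  pivot_longer(-var1, names_to = "var2", values_to = "correlation")

ggplot(cor_long, aes(var1, var2, fill = correlation)) +
  geom_tile() +
  # Print the rounded correlation in each tile
  geom_text(aes(label = round(correlation, 2))) +
  # Diverging palette centered at 0, covering the full range of correlations
  scale_fill_gradient2(low = "blue", mid = "white", high = "red", limits = c(-1, 1)) +
  labs(x = NULL, y = NULL, fill = "Correlation")
```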

To show how relationships evolve over time, you can use geom_path, which connects observations based on their ordering in the original dataframe (so make sure that your data is properly sorted, otherwise you’ll end up with nonsense results). If you have many observations, paths can look very cluttered (which is why the following example restricts the data to a single month). In those cases it may be useful to add markers to notable points, or use arrows to specify the direction of the path. In addition to highlighting observations as you have seen before, you can add segment geoms to draw lines or arrows on the plot. While you can specify the coordinates of these segments within the ggplot workflow, it is often clearer to store these coordinates in a separate tibble.

# Scatterplot of temperature and ozone levels in Dec 2000 - evolution over time is unclear
chic |> 
  filter(year == 2000, month == "Dec") |> 
  ggplot(aes(temp, o3)) +
  geom_point() +
  labs(x = "Temperature", y = "Ozone", title = "December 2000")

# Define the coordinates of an arrow pointing to Dec 1 2000
arrow_data <- tibble(
  x = 20,
  y = 25,
  xend = chic |> filter(date == ymd(20001201)) |> pull(temp),
  yend = chic |> filter(date == ymd(20001201)) |> pull(o3)
)

# Path of temperature and ozone levels make it clear how values change over time
chic |> 
  filter(year == 2000, month == "Dec") |> 
  ggplot(aes(temp, o3)) + 
  # Plot the path of temperature and ozone
  geom_path() +
  # Highlight Dec 31
  geom_point(data = filter(chic, date == ymd(20001231))) +
  # Add the previously defined arrow with geom_segment
  geom_segment(data = arrow_data, aes(x = x, y = y, xend = xend, yend = yend),
               # Do not use the global aesthetics from ggplot(aes())
               inherit.aes = FALSE, color = "red",
               # Specify that the segment is an arrow
               arrow = arrow()) +
  # Add a single label for Dec 1 at fixed coordinates with annotate()
  # (a geom_text() layer without its own data would draw the label once per row)
  annotate("text", x = 20, y = 26, label = "Dec 1, 2000", color = "red") +
  labs(x = "Temperature", y = "Ozone", title = "December 2000")
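
To see why row order matters for geom_path, it can help to compare it with geom_line on a tiny, deliberately unsorted data frame. The values below are made up for illustration:

```r
library(ggplot2)

# Three made-up points, stored out of order along x
toy <- data.frame(x = c(3, 1, 2), y = c(1, 2, 3))

p_path <- ggplot(toy, aes(x, y)) + geom_path()
p_line <- ggplot(toy, aes(x, y)) + geom_line()

# geom_path keeps the row order of the data
path_x <- layer_data(p_path)$x
path_x
# 3 1 2

# geom_line sorts the observations along the x-axis before connecting them
line_x <- layer_data(p_line)$x
line_x
# 1 2 3
```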

Scales

For any aesthetic you specify, you can override the default behavior by adding scale_... elements. You can manually specify legend titles, color palettes, labels, etc.

ggplot(chic, aes(x = date, y = temp, color = season)) +
  geom_point() +
  labs(x = "Year", y = "Temperature") +
  # Specify the title and legend labels of the color scale
  # (labels follow the factor level order: Winter, Spring, Summer, Autumn)
  scale_color_discrete(
    name = "Seasons:",
    labels = c("Dec–Feb", "Mar–May", "Jun–Aug", "Sep–Nov")
  )

ggplot(chic, aes(x = date, y = temp, color = season)) +
  geom_point() +
  labs(x = "Year", y = "Temperature") +
  # Manually specify the colors per season
  scale_color_manual(values = c("darkblue", "green3", "pink", "gold"))

ggplot(chic, aes(x = date, y = temp, color = season)) +
  geom_point() +
  labs(x = "Year", y = "Temperature") +
  # Use a predefined color palette from RColorBrewer
  scale_colour_brewer(type = "qual", palette = 2)

ggplot(chic, aes(x = date, y = temp, color = o3, shape = season)) +
  geom_point() +
  labs(x = "Year", y = "Temperature") +
  # Use a gradient color palette specifying the endpoints
  scale_color_gradient(low = "lightblue", high = "darkblue") +
  # Manually specify the shapes per season
  scale_shape_manual(values = c(15, 16, 17, 18))
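
Unnamed vectors passed to a manual scale are matched to the factor levels by position, so a change in level order silently changes which color goes with which season. One way to guard against this, sketched here on a made-up four-row data frame, is to pass a named vector, which scale_color_manual() matches by name (the object name season_colors is only illustrative):

```r
library(ggplot2)

# One made-up observation per season
toy <- data.frame(
  x = 1:4, y = 1:4,
  season = factor(c("Winter", "Spring", "Summer", "Autumn"),
                  levels = c("Winter", "Spring", "Summer", "Autumn"))
)

# Names tie each color to a season, regardless of the order of the vector
season_colors <- c(Autumn = "gold", Spring = "green3", Summer = "pink", Winter = "darkblue")

p_named <- ggplot(toy, aes(x, y, color = season)) +
  geom_point(size = 3) +
  scale_color_manual(values = season_colors)
p_named

# Colors assigned to the four rows (Winter, Spring, Summer, Autumn)
point_colors <- layer_data(p_named)$colour
point_colors
# "darkblue" "green3" "pink" "gold"
```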

Coordinate systems

Similarly to scales, you can adjust the default behavior of the x and y axes as well (e.g. specify different axis limits, breaks, or labels), either with the scale_x_...()/scale_y_...() functions or by changing the coordinate system with coord_...().

scale_x_...()/scale_y_...() is most useful for changing the axis breaks and labels, while coord_...() can e.g. flip the axes, adjust the aspect ratio (see coord_fixed()), and determine whether observations beyond the plot boundaries should be displayed.

Both scale and coord can adjust axis limits, with a subtle difference in their behavior.

ggplot(chic, aes(x = date, y = temp)) +
  geom_point(color = "firebrick") +
  labs(x = "Year", y = "Temperature") +
  # Only plot observations with temp between 0-50 
  scale_y_continuous(limits = c(0, 50))

ggplot(chic, aes(x = date, y = temp)) +
  geom_point(color = "firebrick") +
  labs(x = "Year", y = "Temperature") +
  # Limit the y-axis between 0 and 50, but don't filter out points
  coord_cartesian(ylim = c(0, 50))

ggplot(chic, aes(x = date, y = temp)) +
  geom_point(color = "firebrick") +
  labs(x = "Year", y = "Temperature") +
  # Limit the y-axis between 0 and 50, and let points show beyond the plot panel up to the plot margins
  coord_cartesian(ylim = c(0, 50), clip = "off")

ggplot(chic, aes(x = date, y = temp)) +
  geom_point(color = "firebrick") +
  labs(x = "Year", y = "Temperature") +
  # Customize y-axis breaks to be 0, 10, 20, ... 90
  scale_y_continuous(breaks = seq(0, 90, 10), minor_breaks = NULL) +
  # Customize x-axis breaks to be every 6 months in the format of year-month
  scale_x_date(date_breaks = "6 months", date_labels = "%Y-%b", minor_breaks = NULL)

ggplot(chic, aes(temp)) +
  geom_histogram(fill = "grey", color = "red") +
  # Flip the x and y axes
  coord_flip()

ggplot(chic, aes(temp)) +
  geom_histogram(fill = "grey", color = "red") +
  # Reverse the x axis
  scale_x_reverse()
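
The difference between limits set on a scale and limits set on the coordinate system matters most when a layer computes a statistic. The sketch below uses three made-up values and stat_summary() to plot their mean: with scale limits the out-of-range value is removed before the mean is computed, while coord_cartesian() only zooms the view.

```r
library(ggplot2)

# Three made-up temperatures, one of them outside the 0-50 range
toy <- data.frame(group = "a", temp = c(10, 20, 100))

p_base <- ggplot(toy, aes(group, temp)) +
  stat_summary(fun = mean, geom = "point")

# Scale limits: 100 is dropped (with a warning) before the mean is computed
p_scale <- p_base + scale_y_continuous(limits = c(0, 50))
# Coord limits: the mean uses all three values, the plot is only zoomed
p_coord <- p_base + coord_cartesian(ylim = c(0, 50))

mean_scale <- suppressWarnings(layer_data(p_scale)$y)
mean_scale
# 15 (mean of 10 and 20)

mean_coord <- layer_data(p_coord)$y
mean_coord
# 43.33 (mean of 10, 20 and 100), which lies inside the zoomed range and is still drawn
```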

Multiple plots

Often you would like to present multiple plots side-by-side: maybe show how the relationship between variables differs depending on some groupings of observations (and putting all observations on a single plot looks too cluttered), or compare how different outcome variables react to changes in an explanatory variable. The two main ways to nicely arrange plots are

  • creating a single plot with multiple panels, known as facets;
  • creating multiple independent plots, and arranging them into one figure afterwards.

Using facets

By specifying a grouping variable along which to facet, you can create a separate plot for each value of your facet variable. You can create faceted plots by adding facet_wrap() to your ggplot object, and specify the faceting variable with a tilde (~). You can also specify additional arguments such as whether to allow the axis limits to vary between panels. Only use variables with relatively few unique values as your facet dimension, otherwise R will attempt to create far too many plots, which takes a long time and might even crash your R session.

# Change the default theme to bw
theme_set(theme_bw())

ggplot(chic, aes(x = date, y = temp)) +
  geom_point(color = "orangered", alpha = .3) +
  labs(x = "Year", y = "Temperature") +
  # Create separate plots per year
  facet_wrap(~year)

ggplot(chic, aes(x = date, y = temp)) +
  geom_point(color = "orangered", alpha = .3) +
  labs(x = "Year", y = "Temperature") +
  # Create separate plots per year and allow different x-axes per plot
  facet_wrap(~year, scales = "free_x")

ggplot(chic, aes(x = date, y = temp)) +
  geom_point(color = "orangered", alpha = .3) +
  labs(x = "Year", y = "Temperature") +
  # Create separate plots per season, arrange all plots in one row, let all scales vary
  facet_wrap(~season, nrow = 1, scales = "free")

If you would like to group your data based on two variables, you can use facet_grid(), separating your two variables with a tilde. Note that axis limit customization options are more limited with facet_grid(), so if you need your axes to vary within columns/rows as well, you can use facet_wrap() with the same two-variable argument instead. With facet_wrap(), however, your faceting variables are “stuck” together, which makes it less clear which dimension corresponds to changes in which faceting variable.

ggplot(chic, aes(x = date, y = temp)) +
  geom_point(color = "orangered", alpha = .3) +
  labs(x = "Year", y = "Temperature") +
  # Arrange plots vertically per year and horizontally per season with facet_grid
  facet_grid(year~season)

ggplot(chic, aes(x = date, y = temp)) +
  geom_point(color = "orangered", alpha = .3) +
  labs(x = "Year", y = "Temperature") +
  # Same but with facet_wrap
  facet_wrap(year~season, scales = "free")

Combining independent plots with patchwork

In order to nicely arrange separate plots and save them as a single file, you can use the patchwork package. First you need to save each of your plots into an R object (here p1, p2, p3), then use + signs to combine the plots horizontally, and / signs to combine them vertically. You can make these layouts as complex as you want by using parentheses to group rows. Alternatively, you can specify custom layout options, including additional options such as whether to repeat or collect legends, by adding a plot_layout() function to the plot objects.

# Create and save plots of temp, ozone level and dewpoint over time

p1 <- ggplot(chic, aes(x = date, y = temp, color = season)) +
  geom_point() +
  labs(x = "Year", y = "Temperature")

p2 <- ggplot(chic, aes(x = date, y = o3, color = season)) +
  geom_point() +
  labs(x = "Year", y = "Ozone")

p3 <- ggplot(chic, aes(x = date, y = dewpoint, color = season)) +
  geom_point() +
  labs(x = "Year", y = "Dewpoint")

# Combine temp and ozone horizontally
p1 + p2

# Combine temp and ozone vertically
p1 / p2

# Combine temp and ozone vertically with plot_layout, do not repeat legends
p1 + p2 + plot_layout(ncol = 1, guides = "collect")

# Arrange p1 and p2 horizontally on top, p1, p2 and p3 horizontally below
(p1 + p2) / (p1 + p2 + p3) + plot_layout(guides = "collect")

Often it is possible to achieve your desired plot layout either by faceting or with patchwork, but depending on your goal and the structure of your data, one approach may be easier than the other. A good rule to keep in mind is that faceting wants data in long format, while patchwork often wants wide format: faceting needs a grouping variable that defines for each observation which facet it should go on, while in patchwork you can change the aesthetics between plots, so you can easily switch which variables to use per plot.

The following example shows how we can create the same plots with faceting as with patchwork by converting the data to long format where the name variable specifies whether the value is the value of temperature, ozone, or dewpoint.

# Side-by-side plots of temperature, ozone and dewpoint with facet_wrap
chic |> 
  # Convert data to long format with variable names to "name" and values to "value" 
  pivot_longer(c(temp, o3, dewpoint)) |> 
  ggplot(aes(date, value, color = season)) +
  geom_point() + 
  # Facet by variable name, arrange in one column
  facet_wrap(~name, ncol = 1)
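
To make the reshaping step concrete, the sketch below applies pivot_longer() to a two-row tibble with made-up values. It uses the same column names as chic, but it is not the real data.

```r
library(tidyr)
library(tibble)

# Two made-up days in wide format: one column per measured variable
wide <- tibble(
  date = as.Date(c("2000-01-01", "2000-01-02")),
  temp = c(30, 28),
  o3 = c(12, 15),
  dewpoint = c(20, 19)
)

# Long format: one row per date-variable pair, with default columns "name" and "value"
long <- wide |>
  pivot_longer(c(temp, o3, dewpoint))
long
```

The result has six rows (two dates times three variables), and the name column is what facet_wrap(~name) splits on.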

Customizing plot elements

So far we modified the default theme (aka theme_grey()) by specifying alternative predefined themes (e.g. theme_light()). A way to customize plot elements even further is to add a theme() call to the plot and redefine particular plot elements in it.

Every design element of a plot (panel, grid, axes, text, legend keys, etc.) can be changed with an element_...() function (e.g. element_text() for text, element_line() for lines, element_blank() if the element shouldn’t be displayed). Each element type has different characteristics you can customize; a few examples are shown here. The help file of the theme() function describes the options in detail.

ggplot(chic, aes(x = date, y = temp)) +
  geom_point(color = "firebrick") +
  labs(x = "Year", y = "Temperature (F)") +
  # Customize axis title fonts
  theme(axis.title = element_text(size = 15, color = "firebrick", face = "italic"),
        # Remove y-axis ticks
        axis.ticks.y = element_blank(),
        # Change minor grid to dashed lines
        panel.grid.minor = element_line(linetype = "dashed"))

ggplot(chic, aes(x = date, y = temp, color = season)) +
  geom_point() +
  labs(x = "Year", y = "Temperature", color = "Season") +
  # Move legend to above the plot
  theme(legend.position = "top",
        # Change legend background color
        legend.background = element_rect(fill = "grey90"),
        # Remove legend title
        legend.title = element_blank())

ggplot(chic, aes(x = date, y = temp, color = season)) +
  geom_point() +
  labs(x = "Year", y = "Temperature") +
  # Move legend to coordinates within the plot
  # (ggplot2 >= 3.5.0 prefers legend.position = "inside" with legend.position.inside = c(0.85, 0.2))
  theme(legend.position = c(0.85, 0.2),
        # Add whitespace to the left side of the plot
        plot.margin = margin(l = 50))

In addition to the theme() function there are also other ways to customize legends. You can change what key shape to use by specifying the key_glyph within a geom (the available shapes are listed in the ggplot2 help for draw_key). You can also add a guides() function to your ggplot object: some options there overlap with options in theme(), but there are also additional arguments such as arranging legend items in multiple rows/columns or specifying the order of legends if there are several.

ggplot(chic, aes(x = date, y = temp, color = season)) +
  geom_point(key_glyph = "vline") +
  labs(x = "Year", y = "Temperature") +
  # Customize the legend further with guide_legend()
  guides(color = guide_legend("Season:", title.hjust = 0.5, nrow = 2))
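
If you apply the same customizations to many plots, one option is to store them in an object and add that object to each plot. A minimal sketch, using the mpg data that ships with ggplot2 and an illustrative object name my_theme:

```r
library(ggplot2)

# A predefined theme plus a few custom tweaks, stored for reuse
my_theme <- theme_light() +
  theme(legend.position = "top",
        axis.title = element_text(face = "italic"))

# Add the stored theme to any plot like any other element
p_themed <- ggplot(mpg, aes(displ, hwy, color = drv)) +
  geom_point() +
  my_theme
p_themed
```

Passing the same object to theme_set() would make it the default for all later plots.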

Saving plots

You can save ggplot objects to use outside of the R environment with the ggsave() function. The first argument is the file path of the saved plot, including the preferred file format (e.g. .png, .jpg, .pdf), and it is always required. By default ggsave() saves the last plot displayed in your Plots panel; to save a different plot, pass an existing ggplot object to the plot argument. You can adjust the plot size with the scale argument or by specifying the height and width in your preferred units (the default units are inches).

ggplot(chic, aes(x = date, y = temp)) + 
  geom_point()

# Save last plot
ggsave("figures/plot1.png", scale = 1.5)

p <- ggplot(chic, aes(x = date, y = temp)) + 
  geom_point()

# Save a plot stored in the Environment (the file name comes first, then the plot)
ggsave("figures/plot2.png", plot = p, height = 10, width = 15, units = "cm")
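
The following self-contained sketch writes to a temporary file so that it runs regardless of your folder structure. It uses the mpg data from ggplot2, and the object names are illustrative. In a real project you would replace the temporary path with something like "figures/plot.pdf", in which case the figures folder needs to exist.

```r
library(ggplot2)

p_save <- ggplot(mpg, aes(displ, hwy)) +
  geom_point()

# Temporary file path; the extension determines the file format
out_file <- tempfile(fileext = ".pdf")

# Passing the plot by name avoids mixing up the file name and the plot
ggsave(out_file, plot = p_save, width = 15, height = 10, units = "cm")

# Check that the file was written
file.exists(out_file)
```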

Other plotting packages and plot types

While ggplot is extremely flexible and therefore sufficient for most of your plotting needs, it is good to be aware of how to use a few other packages developed for more specific purposes.

Correlation scatterplot matrix with GGally

To estimate the relationships between a set of (continuous) variables in a dataset, you’d usually calculate a correlation matrix. The ggpairs() function from the GGally package shows this correlation matrix together with the distribution of each variable and a scatterplot for each variable pair. You can adjust the default output e.g. by specifying an additional grouping variable; for more options, see the examples in Modern Data Visualization with R.

chic |> 
  # Keep only 4 continuous variables
  select(temp, o3, dewpoint, pm10) |> 
  # Create correlation matrix of the selected variables
  ggpairs()

chic |> 
  # Keep 4 continuous variables and season as the grouping variable
  select(temp, o3, dewpoint, pm10, season) |> 
  # Recreate the previous plot but group observations by season
  # Use only the first 4 columns for the plots (exclude season)
  ggpairs(columns = 1:4, ggplot2::aes(color = season))

Pie chart

While data scientists do not recommend using pie charts (because humans are quite bad at comparing areas, and better at comparing lengths such as on bar charts), you can nevertheless create pie charts in ggplot. In order to do so, you need to create a bar chart, and change the coordinate system to polar coordinates. To make the bars look nice, you should specify your aesthetics and geoms similarly to the example below.

chic |> 
  # Get the number of observations per month
  count(month) |> 
  # Specify y as the counts and fill as the categorical variable
  ggplot(aes(x = "", y = n, fill = month)) +
  # Create bars with white borders
  geom_bar(stat = "identity", width = 1, color = "white") +
  # Change coordinate system to polar coordinates instead of Cartesian
  coord_polar("y", start = 0) +
  # Remove background theme elements
  theme_void()

3-D scatterplot

Similarly to pie charts, 3-D plots are also discouraged because they are hard to interpret, so use them only if absolutely necessary. In that case, the scatterplot3d package contains the scatterplot3d() function where you need to specify the variables to put on each axis. Note that this function does not have a separate data argument, so you need to specify the variables by extracting the column from the dataframe with the $ operator.

# Create a 3-D scatterplot of temperature, dewpoint and ozone levels
scatterplot3d(x = chic$temp,
              y = chic$dewpoint, 
              z = chic$o3)

Alluvial/Sankey diagrams with ggalluvial and networkD3

An interesting plot type is an alluvial diagram, also known as a Sankey diagram. It shows flows between different categories, and is frequently used e.g. to show changes over time in the share of observations belonging to particular groups (e.g. what energy sources households use in 1950 versus in 2000). For demonstration purposes, we’ll instead look at how observations per different seasons are split between high and low temperatures (e.g. expecting to see mostly low temperatures in Winter).

In order to create a static alluvial diagram, you can use the ggalluvial package and standard ggplot workflows with alluvium and stratum geoms.

chic |> 
  # Redefine temperature as a categorical variable: above mean temperature is high, below is low
  mutate(temp = ifelse(temp > mean(temp), "High temp", "Low temp")) |> 
  # Get number of observations per year, season, temp category
  count(year, season, temp) |> 
  # axis1, axis2, axis3 are the categorical grouping variables, y is the number of observations per group
  ggplot(aes(axis1 = year, axis2 = season, axis3 = temp, y = n)) +
  # Create flows, with colors per year
  geom_alluvium(aes(fill = factor(year))) +
  # Add rectangles for the categories of each variable
  geom_stratum() +
  # Label the rectangles
  geom_text(stat = "stratum", 
            aes(label = after_stat(stratum))) +
  # Remove background plot elements and legend
  theme_void() +
  theme(legend.position = "none")

For an interactive version of the previous plot, you can use the networkD3 package. Note, however, that this package requires the input data to have quite a specific format. You need a dataframe of the widths (frequencies/shares) of each link, specifying a numeric identifier for the source and target nodes. In addition, you need a dataframe of nodes that matches the numeric IDs to node names. Once you have both of these dataframes, you can use the sankeyNetwork() function. Creating the input dataframes gets significantly more complex if the diagram has more than two levels.

(Interactive plots do not display properly on the website.)

# Create a tibble of links with 3 variables: source, target, number of observations
links <- chic |> 
  # Redefine temp to high/low categorical
  mutate(temp = ifelse(temp > mean(temp), "High temp", "Low temp")) |> 
  # Number of observations per season-temperature combination
  count(season, temp)

# Create a tibble of nodes by listing the unique categories in the links tibble
# (season is a factor, so convert it to character before combining it with temp)
nodes <- tibble(name = unique(c(as.character(links$season), links$temp)))

# Add numerical identifiers of the nodes to the links tibble by using the row index in the nodes tibble 
# Subtract 1 to start the count at 0
links$IDseason <- match(links$season, nodes$name)-1
links$IDtemp <- match(links$temp, nodes$name)-1

# Create an interactive Sankey diagram from the links and nodes tibbles, the numerical category IDs, the observation counts per link and the variable name of the nodes tibble
sankeyNetwork(Links = links, Nodes = nodes, Source = "IDseason",
              Target = "IDtemp", Value = "n", NodeID = "name")
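
The zero-based node IDs are the least intuitive part of this setup, so here is a small sketch of that step on its own. The counts are made up, only two seasons are included, and the column names source and target are chosen for this example; it does not call networkD3 at all.

```r
# Made-up link counts for two seasons (not computed from chic)
links_toy <- data.frame(
  season = factor(c("Winter", "Winter", "Summer", "Summer"),
                  levels = c("Winter", "Summer")),
  temp = c("High temp", "Low temp", "High temp", "Low temp"),
  n = c(5, 85, 80, 12)
)

# One row per node; converting the factor to character keeps the season names
nodes_toy <- data.frame(
  name = unique(c(as.character(links_toy$season), links_toy$temp))
)
nodes_toy$name
# "Winter" "Summer" "High temp" "Low temp"

# match() gives the 1-based row of each node; subtract 1 for zero-based IDs
links_toy$source <- match(as.character(links_toy$season), nodes_toy$name) - 1
links_toy$target <- match(links_toy$temp, nodes_toy$name) - 1
links_toy[, c("season", "temp", "source", "target")]
```

Each link now points from a season node (0 or 1) to a temperature node (2 or 3), which is the layout that the Source and Target arguments above refer to.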

Interactive charts with plotly

If you work with interactive documents, you might want to make your plots interactive as well, e.g. by highlighting the data points that a user hovers over. It is very easy to turn ggplot objects into interactive plots with the plotly package: just save the ggplot to an object, and use that object as the argument of the ggplotly() function.

(Interactive plots do not display properly on the website.)

# Define a simple plot and assign to an object
p <- ggplot(chic, aes(x = date, y = temp, color = season)) +
  geom_point() +
  labs(x = "Year", y = "Temperature")

# Display the plot as an interactive plot 
ggplotly(p)