The tidy workflow

Video tutorial

Please watch this video (4:03), then read and follow along with the written tutorial below. Compare your own output to what you see printed below to make sure all of your code runs as expected.

Introduction

In this tutorial, we introduce you to the tidy workflow, a set of principles and tools that help you work with data in a structured and efficient way. This workflow allows you to combine different steps of the data cleaning process into a single pipeline, making your code more readable and easier to maintain.

This tutorial shows you an example of how to use the tidy workflow on the diamonds dataset, which comes pre-loaded with the tidyverse package.

Let’s load the tidyverse package and have a look at the diamonds dataset:

# load tidyverse
library(tidyverse)

# add diamonds to the environment
data(diamonds)

Establishing the data cleaning steps

Data files downloaded from online sources are not always in a convenient format for analysis in R: variable names are not always intuitive or consistent, you might need to make some additional calculations, recode variables, or remove some variables/observations. You can make a list of steps you need to take by observing the structure of your data.

You can view your data with the function View(dataname), while the output of summary(dataname) shows you all the variables you have, including their type and some details about their content. In case you need more information about what your variables mean, you can look at the documentation on website of your data source (or the help file if you use data from a package).

Our goal now is to look at only Ideal cut diamonds and compare the carat and price of these diamonds.

The tidy workflow

A useful tool to keep in mind when planning your data cleaning is the pipe operator |> (keyboard shortcut: Ctrl/Command+Shift+M). The base pipe operator |> does the same as the tidyverse pipe operator %>% – you can use either one.

Using it in a workflow means that the next function uses the previous result as an input, and helps you work in a linear fashion. E.g. “first I need to take the diamonds tibble and filter observations for which the cut is Ideal, then I need to select only the carat and price variables, and then I need to calculate the price per carat”.

To work with the result of the pipeline, you can assign it to a new object, e.g. diamonds_new. The filter(), select(), and mutate() functions are data cleaning functions from tidyverse, introduced in other tutorials (filter and select; mutate), which all take the input dataset as the first argument that we replace with the pipe here.

# start with the diamonds tibble
diamonds_new <- diamonds |>
  # filter the data for Ideal cut diamonds
  filter(cut == "Ideal") |>
  # select only the carat and price variables
  select(carat, price) |>
  # calculate the price per carat
  mutate(price_per_carat = price / carat)

This code is equivalent to the following:

# create a new object with the filtered data
diamonds_ideal <- filter(diamonds, cut == "Ideal")

# select only the carat and price variables
diamonds_selected <- select(diamonds_ideal, carat, price)

# calculate the price per carat
diamonds_new <- mutate(diamonds_selected, price_per_carat = price / carat)

The tidy workflow allows you to combine these steps into a single pipeline, so you don’t need to create intermediate objects.