Data wrangling: creating new variables

Video tutorial

Please watch this video (2:46), then read and follow along with the written tutorial below. Compare your own output to what you see printed below to make sure all of your code runs as expected.

Introduction

In some cases you might need to do additional calculations with your data. In this tutorial, we show you how to define new variables and overwrite existing ones using the mutate() function from the dplyr package, which is loaded as part of tidyverse. We use the diamonds dataset, which ships with ggplot2 (also part of tidyverse), so you don’t need to import it.

Let’s load the tidyverse package and add the diamonds dataset to the environment:

# load tidyverse
library(tidyverse)

# add diamonds to the environment
data(diamonds)
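
To have a quick look at the dataset before changing it, one option is to check its dimensions and print a compact summary of its columns. This is only a sketch of one way to inspect the data; dim() and glimpse() are not used elsewhere in this tutorial.

```r
# load tidyverse (repeated here so the snippet runs on its own)
library(tidyverse)

# number of rows and columns in the dataset
dim(diamonds)

# one line per column: name, type, and the first few values
glimpse(diamonds)
```

The dimensions should agree with the tibble headers printed further down: 53,940 rows, and 10 columns before any new variables are added.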

Creating new variables

The mutate() function is used to create new variables in a dataset. The syntax is mutate(data, variable = expression). The data argument is the dataset you want to modify, variable is the name of the new variable, and expression is the calculation you want to perform. If variable already exists in the dataset, mutate() will overwrite it.

Let’s create a new variable called price_per_carat that calculates the price per carat of each diamond:

# create a new variable price_per_carat
mutate(diamonds, price_per_carat = price / carat)
## # A tibble: 53,940 × 11
##    carat cut   color clarity depth table price     x     y     z price_per_carat
##    <dbl> <ord> <ord> <ord>   <dbl> <dbl> <int> <dbl> <dbl> <dbl>           <dbl>
##  1  0.23 Ideal E     SI2      61.5    55   326  3.95  3.98  2.43           1417.
##  2  0.21 Prem… E     SI1      59.8    61   326  3.89  3.84  2.31           1552.
##  3  0.23 Good  E     VS1      56.9    65   327  4.05  4.07  2.31           1422.
##  4  0.29 Prem… I     VS2      62.4    58   334  4.2   4.23  2.63           1152.
##  5  0.31 Good  J     SI2      63.3    58   335  4.34  4.35  2.75           1081.
##  6  0.24 Very… J     VVS2     62.8    57   336  3.94  3.96  2.48           1400 
##  7  0.24 Very… I     VVS1     62.3    57   336  3.95  3.98  2.47           1400 
##  8  0.26 Very… H     SI1      61.9    55   337  4.07  4.11  2.53           1296.
##  9  0.22 Fair  E     VS2      65.1    61   337  3.87  3.78  2.49           1532.
## 10  0.23 Very… H     VS1      59.4    61   338  4     4.05  2.39           1470.
## # ℹ 53,930 more rows
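
The overwriting behaviour described above can be sketched in the same way. As an illustration only, the example below reuses the name of an existing column, depth, on the left-hand side. The choice of depth and the division by 100 (turning the depth percentage into a proportion) are assumptions made for this example, not part of the original tutorial.

```r
# load tidyverse (repeated here so the snippet runs on its own)
library(tidyverse)

# depth already exists, so mutate() replaces it instead of adding a column
mutate(diamonds, depth = depth / 100)
```

The printed tibble should still have 10 columns, because depth was replaced rather than added.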

You can also create multiple variables at once by separating them with a comma. It is good practice to start each new variable on a new line to keep your code readable.

Let’s create a second variable: a logical (TRUE/FALSE) variable that checks if the diamond costs more than $10,000. In this case, the expression is a logical condition.

# create two new variables
mutate(diamonds,
       price_per_carat = price / carat,
       expensive = price > 10000)
## # A tibble: 53,940 × 12
##    carat cut   color clarity depth table price     x     y     z price_per_carat
##    <dbl> <ord> <ord> <ord>   <dbl> <dbl> <int> <dbl> <dbl> <dbl>           <dbl>
##  1  0.23 Ideal E     SI2      61.5    55   326  3.95  3.98  2.43           1417.
##  2  0.21 Prem… E     SI1      59.8    61   326  3.89  3.84  2.31           1552.
##  3  0.23 Good  E     VS1      56.9    65   327  4.05  4.07  2.31           1422.
##  4  0.29 Prem… I     VS2      62.4    58   334  4.2   4.23  2.63           1152.
##  5  0.31 Good  J     SI2      63.3    58   335  4.34  4.35  2.75           1081.
##  6  0.24 Very… J     VVS2     62.8    57   336  3.94  3.96  2.48           1400 
##  7  0.24 Very… I     VVS1     62.3    57   336  3.95  3.98  2.47           1400 
##  8  0.26 Very… H     SI1      61.9    55   337  4.07  4.11  2.53           1296.
##  9  0.22 Fair  E     VS2      65.1    61   337  3.87  3.78  2.49           1532.
## 10  0.23 Very… H     VS1      59.4    61   338  4     4.05  2.39           1470.
## # ℹ 53,930 more rows
## # ℹ 1 more variable: expensive <lgl>
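
Within a single mutate() call, a later expression can also refer to a variable created earlier in the same call. The sketch below illustrates this; the variable name high_rate and the threshold of 5000 are hypothetical and chosen only for the example.

```r
# load tidyverse (repeated here so the snippet runs on its own)
library(tidyverse)

# high_rate (hypothetical name) uses price_per_carat,
# which is defined one line earlier in the same call
mutate(diamonds,
       price_per_carat = price / carat,
       high_rate = price_per_carat > 5000)
```

The order matters here: high_rate has to come after price_per_carat, because the expressions are evaluated from top to bottom.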

mutate() does not change diamonds itself; it returns a modified copy. To be able to work with the new variables, you need to save the result to an object. Let’s assign the result of the mutate() function to a new object called diamonds_new:

# save the result to a new object
diamonds_new <- mutate(diamonds,
                        price_per_carat = price / carat,
                        expensive = price > 10000)
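
To confirm that the assignment worked, you could compare the original dataset with the new object. The checks below are a minimal sketch; ncol(), names(), and sum() are base R functions that this tutorial does not otherwise use.

```r
# load tidyverse (repeated here so the snippet runs on its own)
library(tidyverse)

# same assignment as above
diamonds_new <- mutate(diamonds,
                       price_per_carat = price / carat,
                       expensive = price > 10000)

# the original dataset keeps its columns; the new object has two more
ncol(diamonds)
ncol(diamonds_new)

# the two new variables are listed last
names(diamonds_new)

# TRUE counts as 1 and FALSE as 0, so sum() gives the number
# of diamonds that cost more than $10,000
sum(diamonds_new$expensive)
```

Because the result is stored in diamonds_new, the new variables can be used in later steps, while diamonds itself stays as it was.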