Data wrangling: filtering rows and selecting columns

Video tutorial

Please watch this video (4:45), then read and follow along with the written tutorial below. Compare your own output to what you see printed below to make sure all of your code runs as expected.

Introduction

Often you don’t need all the data in your dataset, but only a subset of it. Maybe you are only interested in a specific subset of observations or you only need a subset of variables. In this tutorial, we show you how to filter rows and select columns from a tibble using tidyverse functions and the diamonds dataset (which comes pre-loaded with tidyverse so you don’t need to import it).

Let’s load the tidyverse package and have a look at the diamonds dataset:

# load tidyverse
library(tidyverse)

# add diamonds to the environment
data(diamonds)

Selecting and renaming variables

Often you don’t need all variables included in your downloaded dataset. Then you can select the subset of variables you need (or the subset you would like to remove). The function for doing so is select(), and the arguments of the function are your dataset, followed by the names of the variables you would like to keep (or remove, if the variable names are preceded by -). The following examples show how to use the function with and without the pipe operator.

# keep only variables price and carat
select(diamonds, price, carat)

## # A tibble: 53,940 × 2
##    price carat
##    <int> <dbl>
##  1   326  0.23
##  2   326  0.21
##  3   327  0.23
##  4   334  0.29
##  5   335  0.31
##  6   336  0.24
##  7   336  0.24
##  8   337  0.26
##  9   337  0.22
## 10   338  0.23
## # ℹ 53,930 more rows

# remove variables price and carat
select(diamonds, -price, -carat)

## # A tibble: 53,940 × 8
##    cut       color clarity depth table     x     y     z
##    <ord>     <ord> <ord>   <dbl> <dbl> <dbl> <dbl> <dbl>
##  1 Ideal     E     SI2      61.5    55  3.95  3.98  2.43
##  2 Premium   E     SI1      59.8    61  3.89  3.84  2.31
##  3 Good      E     VS1      56.9    65  4.05  4.07  2.31
##  4 Premium   I     VS2      62.4    58  4.2   4.23  2.63
##  5 Good      J     SI2      63.3    58  4.34  4.35  2.75
##  6 Very Good J     VVS2     62.8    57  3.94  3.96  2.48
##  7 Very Good I     VVS1     62.3    57  3.95  3.98  2.47
##  8 Very Good H     SI1      61.9    55  4.07  4.11  2.53
##  9 Fair      E     VS2      65.1    61  3.87  3.78  2.49
## 10 Very Good H     VS1      59.4    61  4     4.05  2.39
## # ℹ 53,930 more rows

If you use the distinct() function instead of select(), you not only select the listed variables, but keep only unique rows based on these variables. If you use the distinct() function without specifying any variables, you remove all duplicates from the dataset, keeping all variables.

# keep only unique rows based on the variables price and carat
distinct(diamonds, price, carat)

## # A tibble: 28,988 × 2
##    price carat
##    <int> <dbl>
##  1   326  0.23
##  2   326  0.21
##  3   327  0.23
##  4   334  0.29
##  5   335  0.31
##  6   336  0.24
##  7   337  0.26
##  8   337  0.22
##  9   338  0.23
## 10   339  0.3 
## # ℹ 28,978 more rows

Especially if you want to combine datasets from different sources, you may want variable names to be consistent across datasets or convenient to work with. This problem can be easily fixed using the rename() function, which has the format rename(data, "new_name" = "old_name"). For example, let’s rename the variable price to price_USD to make it clear that the price is in US dollars:

rename(diamonds, price_USD = price)

## # A tibble: 53,940 × 10
##    carat cut       color clarity depth table price_USD     x     y     z
##    <dbl> <ord>     <ord> <ord>   <dbl> <dbl>     <int> <dbl> <dbl> <dbl>
##  1  0.23 Ideal     E     SI2      61.5    55       326  3.95  3.98  2.43
##  2  0.21 Premium   E     SI1      59.8    61       326  3.89  3.84  2.31
##  3  0.23 Good      E     VS1      56.9    65       327  4.05  4.07  2.31
##  4  0.29 Premium   I     VS2      62.4    58       334  4.2   4.23  2.63
##  5  0.31 Good      J     SI2      63.3    58       335  4.34  4.35  2.75
##  6  0.24 Very Good J     VVS2     62.8    57       336  3.94  3.96  2.48
##  7  0.24 Very Good I     VVS1     62.3    57       336  3.95  3.98  2.47
##  8  0.26 Very Good H     SI1      61.9    55       337  4.07  4.11  2.53
##  9  0.22 Fair      E     VS2      65.1    61       337  3.87  3.78  2.49
## 10  0.23 Very Good H     VS1      59.4    61       338  4     4.05  2.39
## # ℹ 53,930 more rows

Filtering observations

Often you only need a subset of your data, e.g. observations from a particular location, after a given year, or meeting some other condition. You can filter your dataset using the filter() function and logical expressions (e.g. keep if the value for the price of the diamond is greater than $10,000, or keep if the value for the variable cut is “Ideal”). The first function argument is the name of the dataset, and the second argument is the logical expression. When defining your logical expression, you need to use the logical operators: == means equal to, != means not equal to, and >=,<=,>,< define numeric comparisons.

# keep only diamonds with a price greater than $10,000
filter(diamonds, price > 10000)

## # A tibble: 5,222 × 10
##    carat cut       color clarity depth table price     x     y     z
##    <dbl> <ord>     <ord> <ord>   <dbl> <dbl> <int> <dbl> <dbl> <dbl>
##  1  1.7  Ideal     J     VS2      60.5    58 10002  7.73  7.74  4.68
##  2  1.03 Ideal     E     VVS2     60.6    59 10003  6.5   6.53  3.95
##  3  1.23 Very Good G     VVS2     60.6    55 10004  6.93  7.02  4.23
##  4  1.25 Ideal     F     VS2      61.6    55 10006  6.93  6.96  4.28
##  5  2.01 Very Good I     SI2      61.4    63 10009  8.19  7.96  4.96
##  6  1.21 Very Good F     VS1      62.3    58 10009  6.76  6.85  4.24
##  7  1.51 Premium   I     VS2      59.9    60 10010  7.42  7.36  4.43
##  8  1.01 Fair      D     SI2      64.6    58 10011  6.25  6.2   4.02
##  9  1.05 Ideal     F     VVS2     60.5    55 10011  6.67  6.58  4.01
## 10  1.6  Ideal     J     VS1      62      53 10011  7.57  7.56  4.69
## # ℹ 5,212 more rows

# keep only diamonds with cut equal to "Ideal"
filter(diamonds, cut == "Ideal")

## # A tibble: 21,551 × 10
##    carat cut   color clarity depth table price     x     y     z
##    <dbl> <ord> <ord> <ord>   <dbl> <dbl> <int> <dbl> <dbl> <dbl>
##  1  0.23 Ideal E     SI2      61.5    55   326  3.95  3.98  2.43
##  2  0.23 Ideal J     VS1      62.8    56   340  3.93  3.9   2.46
##  3  0.31 Ideal J     SI2      62.2    54   344  4.35  4.37  2.71
##  4  0.3  Ideal I     SI2      62      54   348  4.31  4.34  2.68
##  5  0.33 Ideal I     SI2      61.8    55   403  4.49  4.51  2.78
##  6  0.33 Ideal I     SI2      61.2    56   403  4.49  4.5   2.75
##  7  0.33 Ideal J     SI1      61.1    56   403  4.49  4.55  2.76
##  8  0.23 Ideal G     VS1      61.9    54   404  3.93  3.95  2.44
##  9  0.32 Ideal I     SI1      60.9    55   404  4.45  4.48  2.72
## 10  0.3  Ideal I     SI2      61      59   405  4.3   4.33  2.63
## # ℹ 21,541 more rows

You can combine multiple logical expressions using the logical operators & (AND) and | (OR) in one filter() function.

# keep only diamonds with a price greater than $10,000 and cut equal to "Ideal"
filter(diamonds, price > 10000 & cut == "Ideal")

## # A tibble: 1,770 × 10
##    carat cut   color clarity depth table price     x     y     z
##    <dbl> <ord> <ord> <ord>   <dbl> <dbl> <int> <dbl> <dbl> <dbl>
##  1  1.7  Ideal J     VS2      60.5    58 10002  7.73  7.74  4.68
##  2  1.03 Ideal E     VVS2     60.6    59 10003  6.5   6.53  3.95
##  3  1.25 Ideal F     VS2      61.6    55 10006  6.93  6.96  4.28
##  4  1.05 Ideal F     VVS2     60.5    55 10011  6.67  6.58  4.01
##  5  1.6  Ideal J     VS1      62      53 10011  7.57  7.56  4.69
##  6  1.51 Ideal H     SI1      61.3    56 10012  7.44  7.4   4.55
##  7  1.13 Ideal F     VS1      60.9    57 10016  6.73  6.76  4.11
##  8  1.04 Ideal E     VVS2     62.9    55 10019  6.47  6.51  4.08
##  9  1.22 Ideal G     VVS2     62.3    56 10038  6.81  6.84  4.25
## 10  1.3  Ideal G     VS1      62      55 10038  6.98  7.02  4.34
## # ℹ 1,760 more rows

The %in% operator is useful when you want to filter observations based on multiple values of a variable, for example, if cut should be either “Ideal” or “Premium”. Then you concatenate “Ideal” and “Premium” into a vector and use %in% to filter the dataset based on a match with any element of this vector.

# keep only diamonds with cut equal to "Ideal" or "Premium"
filter(diamonds, cut %in% c("Ideal", "Premium"))

## # A tibble: 35,342 × 10
##    carat cut     color clarity depth table price     x     y     z
##    <dbl> <ord>   <ord> <ord>   <dbl> <dbl> <int> <dbl> <dbl> <dbl>
##  1  0.23 Ideal   E     SI2      61.5    55   326  3.95  3.98  2.43
##  2  0.21 Premium E     SI1      59.8    61   326  3.89  3.84  2.31
##  3  0.29 Premium I     VS2      62.4    58   334  4.2   4.23  2.63
##  4  0.23 Ideal   J     VS1      62.8    56   340  3.93  3.9   2.46
##  5  0.22 Premium F     SI1      60.4    61   342  3.88  3.84  2.33
##  6  0.31 Ideal   J     SI2      62.2    54   344  4.35  4.37  2.71
##  7  0.2  Premium E     SI2      60.2    62   345  3.79  3.75  2.27
##  8  0.32 Premium E     I1       60.9    58   345  4.38  4.42  2.68
##  9  0.3  Ideal   I     SI2      62      54   348  4.31  4.34  2.68
## 10  0.24 Premium I     VS1      62.5    57   355  3.97  3.94  2.47
## # ℹ 35,332 more rows

To be able to work with this new dataset, you need to save it to a new object. Let’s assign the result of the mutate() function to a new object called diamonds_filtered:

# save the result to a new object
diamonds_filtered <- filter(diamonds, cut %in% c("Ideal", "Premium"))