16 Subsetting
Subsetting involves making a dataset smaller by selecting a portion of the data that we are interested in.
We can take subsets of a dataset either by columns or by rows. For example, we might have a dataset with extra columns we do not care about, or we might want to look at rows where a certain condition is true (such as age greater than or equal to 18).
16.1 Extractors
We have previously seen that we can index vectors by position, name, or
condition to extract and replace elements. Now we extend this to
data frames. In addition to the square bracket extractors, [ ]
, we
also have two “list” extractors, the dollar sign , $
, and the
doubled square brackets, [[ ]]
. Here it is relevant that a data frame
is a special kind of list.
The square brackets,
[ ]
, allow us to index both the rows and the columns of a data frame. We can index by position, name, or condition.When used with data frames (or matrices), we index by two positions, rows and columns. The square brackets should contain two objects, a vector indexing the rows and a vector indexing the columns. To select all the elements along one dimension, omit that vector but include the comma.
Try these examples:
mtcars["Valiant", "mpg"] # a single element mtcars[ , c("mpg", "wt")] # all rows, two columns mtcars[1:6, ] # rows 1-6, all columns mtcars[ , ] # everything
The next example illustrates indexing rows by position and columns by name.
mtcars[1:6, c("mpg", "wt")] # [rows, columns]
mpg wt Mazda RX4 21.0 2.620 Mazda RX4 Wag 21.0 2.875 Datsun 710 22.8 2.320 Hornet 4 Drive 21.4 3.215 Hornet Sportabout 18.7 3.440 Valiant 18.1 3.460
The dollar sign extractor,
$
, allows us to index a single named element of a list. In a data frame, that would be a column, a single variable. (We’ll use thehead
function to show just the first few values.)head( mtcars$mpg )
[1] 21.0 21.0 22.8 21.4 18.7 18.1
The double brackets,
[[ ]]
, allow us to index a single element of a list by either name or position. This can be useful for programming repeated tasks.head( mtcars[[3]] ) # by position, displacement
[1] 160 160 108 258 360 225
head( mtcars[["disp"]] ) # by name
[1] 160 160 108 258 360 225
In addition to these extractors, we also have a
subset
function that provides a very useful variation on the square bracket extractor.subset(mtcars, subset=carb >=6, # rows select=c("mpg", "wt")) # columns
mpg wt Ferrari Dino 19.7 2.77 Maserati Bora 15.0 3.57
16.2 Subsetting by Columns
Now, load the dplyr
package, which contains the functions we need to subset.
library(dplyr)
The select()
function from dplyr
gives us a number of ways to take a subset of columns. Four ways to select columns are covered below: By Name, By Pattern, By Range, and By Type. For a full list of the ways to select a subset of columns, run ?tidyselect::language
.
16.2.1 By Name
The simplest and most straightforward way to select columns is to pass their names to select()
. The output of select()
will be our original dataframe, but it will only include the columns we name, and they will be in the specified order. In the mtcars
dataset, cyl
is before hp
, but if we specify hp
before cyl
in select()
, they will appear in that order.
(Note that the code below uses head()
to only show the first six rows for space reasons.)
colnames(mtcars)
[1] "mpg" "cyl" "disp" "hp" "drat" "wt" "qsec" "vs" "am" "gear" "carb"
mtcars |>
select(mpg, hp, cyl) |>
head()
mpg hp cyl
Mazda RX4 21.0 110 6
Mazda RX4 Wag 21.0 110 6
Datsun 710 22.8 93 4
Hornet 4 Drive 21.4 110 6
Hornet Sportabout 18.7 175 8
Valiant 18.1 105 6
As it was used above, select()
only keeps the columns we name. We can also have it only drop the columns we name by using a -
in front of the column name.
mtcars |>
select(-disp, -drat, -wt, -qsec, -vs, -am, -gear, -carb) |>
head()
mpg cyl hp
Mazda RX4 21.0 6 110
Mazda RX4 Wag 21.0 6 110
Datsun 710 22.8 4 93
Hornet 4 Drive 21.4 6 110
Hornet Sportabout 18.7 8 175
Valiant 18.1 6 105
In some cases, we might want to select a certain subset of columns multiple times, and we do not want to have to retype them every time. In that case, we can pass a character vector to select()
. We can also add another column, hp
, without adding it to our list when running the command.
myCols <- c("mpg", "vs", "carb")
mtcars |>
select(myCols, hp) |>
head()
Warning: Using an external vector in selections was deprecated in tidyselect 1.1.0.
ℹ Please use `all_of()` or `any_of()` instead.
# Was:
data %>% select(myCols)
# Now:
data %>% select(all_of(myCols))
See <https://tidyselect.r-lib.org/reference/faq-external-vector.html>.
This warning is displayed once every 8 hours.
Call `lifecycle::last_lifecycle_warnings()` to see where this warning was generated.
mpg vs carb hp
Mazda RX4 21.0 0 4 110
Mazda RX4 Wag 21.0 0 4 110
Datsun 710 22.8 1 1 93
Hornet 4 Drive 21.4 1 1 110
Hornet Sportabout 18.7 0 2 175
Valiant 18.1 1 1 105
We can also simply reorder the columns by giving select()
the columns we would like first and then everything()
.
mtcars |>
select(myCols, hp, everything()) |>
head()
mpg vs carb hp cyl disp drat wt qsec am gear
Mazda RX4 21.0 0 4 110 6 160 3.90 2.620 16.46 1 4
Mazda RX4 Wag 21.0 0 4 110 6 160 3.90 2.875 17.02 1 4
Datsun 710 22.8 1 1 93 4 108 3.85 2.320 18.61 1 4
Hornet 4 Drive 21.4 1 1 110 6 258 3.08 3.215 19.44 0 3
Hornet Sportabout 18.7 0 2 175 8 360 3.15 3.440 17.02 0 3
Valiant 18.1 1 1 105 6 225 2.76 3.460 20.22 0 3
select()
includes the functionality of rename()
, so we can do in one line (second example) what normally takes two (first example). The pattern is new_name = old_name
.
mtcars |>
rename(horsepower = hp) |>
select(horsepower, mpg) |>
head()
horsepower mpg
Mazda RX4 110 21.0
Mazda RX4 Wag 110 21.0
Datsun 710 93 22.8
Hornet 4 Drive 110 21.4
Hornet Sportabout 175 18.7
Valiant 105 18.1
mtcars |>
select(horsepower = hp, mpg) |>
head()
horsepower mpg
Mazda RX4 110 21.0
Mazda RX4 Wag 110 21.0
Datsun 710 93 22.8
Hornet 4 Drive 110 21.4
Hornet Sportabout 175 18.7
Valiant 105 18.1
16.2.2 By Pattern
Columns can be selected by a pattern, such as the letters they start with or contain. To do this, we need to use other functions inside of select()
.
starts_with()
returns columns that start with a character string. Both “disp” and “drat” start with “d”.
mtcars |>
select(starts_with("d")) |>
head()
disp drat
Mazda RX4 160 3.90
Mazda RX4 Wag 160 3.90
Datsun 710 108 3.85
Hornet 4 Drive 258 3.08
Hornet Sportabout 360 3.15
Valiant 225 2.76
ends_with()
does the opposite, and looks for a character string at the end of the column names. “disp” and “hp” end with “p”.
mtcars |>
select(ends_with("p")) |>
head()
disp hp
Mazda RX4 160 110
Mazda RX4 Wag 160 110
Datsun 710 108 93
Hornet 4 Drive 258 110
Hornet Sportabout 360 175
Valiant 225 105
contains()
is more general than both starts_with()
and ends_with()
since it looks for a character string anywhere in the column name. “ar” is found in both “gear” and “carb”.
mtcars |>
select(contains("ar")) |>
head()
gear carb
Mazda RX4 4 4
Mazda RX4 Wag 4 4
Datsun 710 4 1
Hornet 4 Drive 3 1
Hornet Sportabout 3 2
Valiant 3 1
The functions above all use literal character strings. dplyr
also allows for the use of regular expressions with the matches()
function. The code below, "^[a-g].a"
, looks for a string that starts with (^
) the letters a-g ([a-g]
), then has any character (.
), and then the letter a (a
). Both “drat” and “gear” fulfill these matching criteria.
mtcars |>
select(matches("^[a-g].a"))
drat gear
Mazda RX4 3.90 4
Mazda RX4 Wag 3.90 4
Datsun 710 3.85 4
Hornet 4 Drive 3.08 3
Hornet Sportabout 3.15 3
Valiant 2.76 3
Duster 360 3.21 3
Merc 240D 3.69 4
Merc 230 3.92 4
Merc 280 3.92 4
Merc 280C 3.92 4
Merc 450SE 3.07 3
Merc 450SL 3.07 3
Merc 450SLC 3.07 3
Cadillac Fleetwood 2.93 3
Lincoln Continental 3.00 3
Chrysler Imperial 3.23 3
Fiat 128 4.08 4
Honda Civic 4.93 4
Toyota Corolla 4.22 4
Toyota Corona 3.70 3
Dodge Challenger 2.76 3
AMC Javelin 3.15 3
Camaro Z28 3.73 3
Pontiac Firebird 3.08 3
Fiat X1-9 4.08 4
Porsche 914-2 4.43 5
Lotus Europa 3.77 5
Ford Pantera L 4.22 5
Ferrari Dino 3.62 5
Maserati Bora 3.54 5
Volvo 142E 4.11 4
For more on regular expressions, see the chapter on Character Vectors or RStudio’s Regular Expressions cheatsheet.
16.2.3 By Range
The :
operator can be used to select a range of variables, either by position or by number. Be careful to check that your columns are in the order you think they are!
mtcars |>
select(2:4) |>
head()
cyl disp hp
Mazda RX4 6 160 110
Mazda RX4 Wag 6 160 110
Datsun 710 4 108 93
Hornet 4 Drive 6 258 110
Hornet Sportabout 8 360 175
Valiant 6 225 105
mtcars |>
select(cyl:hp) |>
head()
cyl disp hp
Mazda RX4 6 160 110
Mazda RX4 Wag 6 160 110
Datsun 710 4 108 93
Hornet 4 Drive 6 258 110
Hornet Sportabout 8 360 175
Valiant 6 225 105
If we have multiple columns with a common prefix and a numeric suffix, we can select a range with num_range()
. First, let’s change the column names of mtcars
to create a fictional dataset with individuals’ incomes in different years, where each row is an individual and each column is their reported income for a different year.
mtcars_income <- mtcars
colnames(mtcars_income) <- paste0("income", 1960:1970)
colnames(mtcars_income)
[1] "income1960" "income1961" "income1962" "income1963" "income1964" "income1965" "income1966" "income1967"
[9] "income1968" "income1969" "income1970"
mtcars_income |>
select(num_range("income", 1962:1964)) |>
head()
income1962 income1963 income1964
Mazda RX4 160 110 3.90
Mazda RX4 Wag 160 110 3.90
Datsun 710 108 93 3.85
Hornet 4 Drive 258 110 3.08
Hornet Sportabout 360 175 3.15
Valiant 225 105 2.76
mtcars_income |>
select(num_range("income", seq(1960, 1970, 2))) |>
head()
income1960 income1962 income1964 income1966 income1968 income1970
Mazda RX4 21.0 160 3.90 16.46 1 4
Mazda RX4 Wag 21.0 160 3.90 17.02 1 4
Datsun 710 22.8 108 3.85 18.61 1 1
Hornet 4 Drive 21.4 258 3.08 19.44 0 1
Hornet Sportabout 18.7 360 3.15 17.02 0 2
Valiant 18.1 225 2.76 20.22 0 1
To use num_range()
, give it the prefix ("income"
) and a numeric vector with the desired numeric suffixes. The first example specifies the range 1962:1964
, giving us 1962, 1963, and 1964. The second example uses seq(1960, 1970, 2)
, so we will get every two years from 1960 to 1970.
mtcars_income |>
select(num_range("income", 1962:1964)) |>
head()
income1962 income1963 income1964
Mazda RX4 160 110 3.90
Mazda RX4 Wag 160 110 3.90
Datsun 710 108 93 3.85
Hornet 4 Drive 258 110 3.08
Hornet Sportabout 360 175 3.15
Valiant 225 105 2.76
mtcars_income |>
select(num_range("income", seq(1960, 1970, 2))) |>
head()
income1960 income1962 income1964 income1966 income1968 income1970
Mazda RX4 21.0 160 3.90 16.46 1 4
Mazda RX4 Wag 21.0 160 3.90 17.02 1 4
Datsun 710 22.8 108 3.85 18.61 1 1
Hornet 4 Drive 21.4 258 3.08 19.44 0 1
Hornet Sportabout 18.7 360 3.15 17.02 0 2
Valiant 18.1 225 2.76 20.22 0 1
If the numbers are out of order in the dataframe, this is not an issue, unlike with :
. The examples below first select the columns income1962
, income1964
, and income1963
, in that order. The second example with :
fails to select income1963
because it selects columns by position, not by name.
mtcars_income |>
select(income1962, income1964, income1963) |>
select(num_range("income", 1962:1964)) |>
head()
income1962 income1963 income1964
Mazda RX4 160 110 3.90
Mazda RX4 Wag 160 110 3.90
Datsun 710 108 93 3.85
Hornet 4 Drive 258 110 3.08
Hornet Sportabout 360 175 3.15
Valiant 225 105 2.76
mtcars_income |>
select(income1962, income1964, income1963) |>
select(income1962:income1964) |>
head()
income1962 income1964
Mazda RX4 160 3.90
Mazda RX4 Wag 160 3.90
Datsun 710 108 3.85
Hornet 4 Drive 258 3.08
Hornet Sportabout 360 3.15
Valiant 225 2.76
It is also okay if some columns are not in the dataset. Of course, select()
will not return these non-existent columns, but it will also not have an error. To see this, we can drop income1963
.
mtcars_income |>
select(-income1963) |>
select(num_range("income", 1962:1964)) |>
head()
income1962 income1964
Mazda RX4 160 3.90
Mazda RX4 Wag 160 3.90
Datsun 710 108 3.85
Hornet 4 Drive 258 3.08
Hornet Sportabout 360 3.15
Valiant 225 2.76
16.2.4 By Type
where()
can help us select columns that are of a certain type, such as numeric or character. chickwts
has one numeric column and one character column.
chickwts |>
head()
weight feed feed2 feed3
1 179 horsebean horsebean horsebean
2 160 horsebean horsebean horsebean
3 136 horsebean horsebean horsebean
4 227 horsebean horsebean horsebean
5 217 horsebean horsebean horsebean
6 168 horsebean horsebean horsebean
chickwts |>
select(where(is.numeric)) |>
head()
weight
1 179
2 160
3 136
4 227
5 217
6 168
16.2.5 Multiple Criteria
For more complex tasks, multiple criteria for selecting columns can be combined with logical operators (&
, |
, !
).
mtcars |>
select((starts_with("d") & !where(is.character)) | contains("ar"), mpg) |>
head()
disp drat gear carb mpg
Mazda RX4 160 3.90 4 4 21.0
Mazda RX4 Wag 160 3.90 4 4 21.0
Datsun 710 108 3.85 4 1 22.8
Hornet 4 Drive 258 3.08 3 1 21.4
Hornet Sportabout 360 3.15 3 2 18.7
Valiant 225 2.76 3 1 18.1
16.3 Subsetting by Rows
To select a subset of rows from a dataframe, we can provide one or more conditions to the filter()
function.
If we want to see rows from mtcars
with mpg
greater than 25, we can do this as follows:
mtcars |>
filter(mpg > 25)
mpg cyl disp hp drat wt qsec vs am gear carb
Fiat 128 32.4 4 78.7 66 4.08 2.200 19.47 1 1 4 1
Honda Civic 30.4 4 75.7 52 4.93 1.615 18.52 1 1 4 2
Toyota Corolla 33.9 4 71.1 65 4.22 1.835 19.90 1 1 4 1
Fiat X1-9 27.3 4 79.0 66 4.08 1.935 18.90 1 1 4 1
Porsche 914-2 26.0 4 120.3 91 4.43 2.140 16.70 0 1 5 2
Lotus Europa 30.4 4 95.1 113 3.77 1.513 16.90 1 1 5 2
Multiple conditions can be given to filter()
, and these are separated by commas. If we give multiple conditions to filter()
, it will only return rows that meet all of our conditions. In other words, we can think of each ,
like &
, and the rows in the output are those for which all conditions are TRUE
. Note how the two blocks of code below return the same output.
mtcars |>
filter(mpg > 25,
hp < 100,
vs == 1)
mpg cyl disp hp drat wt qsec vs am gear carb
Fiat 128 32.4 4 78.7 66 4.08 2.200 19.47 1 1 4 1
Honda Civic 30.4 4 75.7 52 4.93 1.615 18.52 1 1 4 2
Toyota Corolla 33.9 4 71.1 65 4.22 1.835 19.90 1 1 4 1
Fiat X1-9 27.3 4 79.0 66 4.08 1.935 18.90 1 1 4 1
mtcars |>
filter(mpg > 25 &
hp < 100 &
vs == 1)
mpg cyl disp hp drat wt qsec vs am gear carb
Fiat 128 32.4 4 78.7 66 4.08 2.200 19.47 1 1 4 1
Honda Civic 30.4 4 75.7 52 4.93 1.615 18.52 1 1 4 2
Toyota Corolla 33.9 4 71.1 65 4.22 1.835 19.90 1 1 4 1
Fiat X1-9 27.3 4 79.0 66 4.08 1.935 18.90 1 1 4 1
Because filter()
runs logical tests on the rows of our dataset, we should make use of R’s logical operators (&
, |
, !
), relational operators (<
, >
, <=
, >=
, ==
, !=
), and other functions that return logical values (%in%
, is.na()
, etc.). If it has been a while, or if you jumped to this chapter, you will find the chapter on Logical Vectors helpful.
To illustrate the use of is.na()
, we first need some missing data. Since mtcars
has no missing data, we can use the mutate()
function to temporarily add a column with NA
values. The mutate()
code below gives us a new column called missingdata
that repeats the vector c(1, 2, 3, 4, 5, 6, 7, NA)
four times. (mtcars
has 32 rows, so we need to repeat a vector with 8 numbers 4 times.) !is.na()
will then return rows that do not (!
) have NA
in missingdata
.
mtcars |>
mutate(missingdata = rep(c(1:7, NA), 4)) |>
filter(cyl %in% c(4, 6),
mpg > 17 & mpg <= 21,
am != 0 | gear == 4,
!is.na(missingdata))
mpg cyl disp hp drat wt qsec vs am gear carb missingdata
Mazda RX4 21.0 6 160.0 110 3.90 2.620 16.46 0 1 4 4 1
Mazda RX4 Wag 21.0 6 160.0 110 3.90 2.875 17.02 0 1 4 4 2
Merc 280 19.2 6 167.6 123 3.92 3.440 18.30 1 0 4 4 2
Merc 280C 17.8 6 167.6 123 3.92 3.440 18.90 1 0 4 4 3
Ferrari Dino 19.7 6 145.0 175 3.62 2.770 15.50 0 1 5 6 6
16.4 Exercises
Use the state.x77
dataset for 1 and 2. You will need to first convert it into a dataframe.
Drop the
Population
column.Drop columns with spaces in their names.
Using the
iris
dataset, keep columns that begin with “S”.Using the
airquality
dataset, find days where the value ofOzone
is in the range 30-50.
16.5 Advanced Exercises
Using the
iris
dataset, keep rows whereSepal.Length
is divisible by 1.0. (Hint: see%%
in?Arithmetic
.)Using the
mtcars
dataset, create a 3x3 scatterplot matrix (pass a dataframe toplot()
) of the mpg, weight, and horsepower of cars that have four or six cylinders, at least four forward gears, and manual transmission. Seehelp(mtcars)
to figure out which column is which.