library(tidyverse)
12 Subsetting Dataframes
12.1 Warm-Up
Imagine you have a dataset with these column names:
id race edu a11 b11 c11 a12 b12 c12 a13 b13 c13 a14 b14 c14 income2022 income2024
You realize you only need some of these for your research questions. To systematically select these columns, we need to think of “rules” that would return the column names we want.
For example, if we wanted a11 a12 a13 a14
, our rule would be “starts with ‘a’.”
What rules would return these sets of columns?
a11 b11 c11 a12 b12 c12
id race edu
a12 b12 c12 a14 b14 c14 income2022 income2024
12.2 Outcomes
Objective: To systematically subset datasets by selecting columns and rows.
Why it matters: When working with datasets that have hundreds of variables and millions of observations, this skill is critical for creating a manageable dataset suited to your research question. Subsetting is also useful in the analysis of any dataset in order to exclude certain cases or to perform subgroup analyses.
Learning outcomes:
Fundamental Skills | Extended Skills |
|
|
Key functions and operators:
select()
starts_with()
ends_with()
contains()
filter()
12.3 Subsetting Columns
Subsetting involves making a dataset smaller by selecting a portion of the data that we are interested in. Selecting a portion of the columns in a dataset is primarily for convenience. It is easier to work with a dataset with fewer columns.
The functions in this section can also be used within other functions like pivot_longer()
and across()
that support tidyselect
syntax.
Load the tidyverse
, which has functions for subsetting datasets by columns and rows.
12.3.1 By Name
The simplest way to select columns is to pass their names to select()
. The output of select()
will be our original dataframe, but it will only include the columns we name, and they will be in the specified order. In the mtcars
dataset, cyl
is before hp
, but if we specify hp
before cyl
in select()
, they will appear in that order.
colnames(mtcars)
[1] "mpg" "cyl" "disp" "hp" "drat" "wt" "qsec" "vs" "am" "gear"
[11] "carb"
|>
mtcars select(mpg, hp, cyl) |>
head()
mpg hp cyl
Mazda RX4 21.0 110 6
Mazda RX4 Wag 21.0 110 6
Datsun 710 22.8 93 4
Hornet 4 Drive 21.4 110 6
Hornet Sportabout 18.7 175 8
Valiant 18.1 105 6
The column subsetting code on this page uses head()
to just display the first six rows to save space.
The above example only keeps the rows we name. If we prefix column names with -
, it will only drop those we name:
|>
mtcars select(-mpg, -hp, -cyl) |>
head()
disp drat wt qsec vs am gear carb
Mazda RX4 160 3.90 2.620 16.46 0 1 4 4
Mazda RX4 Wag 160 3.90 2.875 17.02 0 1 4 4
Datsun 710 108 3.85 2.320 18.61 1 1 4 1
Hornet 4 Drive 258 3.08 3.215 19.44 1 0 3 1
Hornet Sportabout 360 3.15 3.440 17.02 0 0 3 2
Valiant 225 2.76 3.460 20.22 1 0 3 1
12.3.2 By Pattern
Columns can be selected by a pattern, such as the letters they start with or contain. To do this, we need to use other functions inside of select()
.
starts_with()
returns columns that start with a character string. Both disp
and drat
start with “d”.
|>
mtcars select(starts_with("d")) |>
head()
disp drat
Mazda RX4 160 3.90
Mazda RX4 Wag 160 3.90
Datsun 710 108 3.85
Hornet 4 Drive 258 3.08
Hornet Sportabout 360 3.15
Valiant 225 2.76
ends_with()
does the opposite, and looks for a character string at the end of the column names. disp
and hp
both end with “p”.
|>
mtcars select(ends_with("p")) |>
head()
disp hp
Mazda RX4 160 110
Mazda RX4 Wag 160 110
Datsun 710 108 93
Hornet 4 Drive 258 110
Hornet Sportabout 360 175
Valiant 225 105
contains()
is more general than both starts_with()
and ends_with()
since it looks for a character string anywhere in the column name. “ar” is found in both gear
and carb
.
|>
mtcars select(contains("ar")) |>
head()
gear carb
Mazda RX4 4 4
Mazda RX4 Wag 4 4
Datsun 710 4 1
Hornet 4 Drive 3 1
Hornet Sportabout 3 2
Valiant 3 1
matches()
allows use of regex (see the chapter on character vectors). The regex ^[a-g].a
finds a string that starts with (^
) the letters a-g ([a-g]
), then has any character (.
), and then the letter a (a
). Both drat
and gear
fulfill these criteria.
|>
mtcars select(matches("^[a-g].a")) |>
head()
drat gear
Mazda RX4 3.90 4
Mazda RX4 Wag 3.90 4
Datsun 710 3.85 4
Hornet 4 Drive 3.08 3
Hornet Sportabout 3.15 3
Valiant 2.76 3
12.3.3 By Type
where()
can help us select columns that are of a certain type, such as numeric or character. chickwts
has one numeric column and one factor column:
glimpse(chickwts)
Rows: 71
Columns: 2
$ weight <dbl> 179, 160, 136, 227, 217, 168, 108, 124, 143, 140, 309, 229, 181…
$ feed <fct> horsebean, horsebean, horsebean, horsebean, horsebean, horsebea…
Use the appropriate type-checking function to get either the numeric (is.numeric
) or factor column (is.factor
):
|>
chickwts select(where(is.factor)) |>
head()
feed
1 horsebean
2 horsebean
3 horsebean
4 horsebean
5 horsebean
6 horsebean
12.3.4 By Range
If we have multiple columns with a common prefix and a numeric infix or suffix, we can select a range with num_range()
.
First, make up some data:
<-
dat matrix(sample(1:20),
ncol = 10) |>
as.data.frame()
dat
V1 V2 V3 V4 V5 V6 V7 V8 V9 V10
1 10 6 17 15 11 8 16 18 7 5
2 19 20 13 1 4 12 9 14 3 2
Rename the columns to be gene names, BCR1-5 and CTL6-10:
colnames(dat) <-
c("BCR1", "BCR2", "BCR3", "BCR4", "BCR5",
"CTL6", "CTL7", "CTL8", "CTL9", "CTL10")
num_range()
takes a prefix and a numeric range. If we want BCR2-5, we would specify a prefix
of "BCR"
and a range
of 2:5
:
|>
dat select(num_range(prefix = "BCR",
range = 2:5))
BCR2 BCR3 BCR4 BCR5
1 6 17 15 11
2 20 13 1 4
The range
argument can take any numeric vector, such as a sequence:
|>
dat select(num_range(prefix = "BCR",
range = seq(1, 5, 2)))
BCR1 BCR3 BCR5
1 10 17 11
2 19 13 4
Or some arbitrary numbers we combine with c()
:
|>
dat select(num_range(prefix = "BCR",
range = c(1, 2, 4)))
BCR1 BCR2 BCR4
1 10 6 15
2 19 20 1
If we want multiple prefixes to be returned, we just need to specify more than one num_range()
within select()
. To also return CTL6-8, specify a prefix
of "CTL"
and a range
of 6:8
:
|>
dat select(num_range(prefix = "BCR",
range = 2:5),
num_range(prefix = "CTL",
range = 6:8))
BCR2 BCR3 BCR4 BCR5 CTL6 CTL7 CTL8
1 6 17 15 11 8 16 18
2 20 13 1 4 12 9 14
In this case, select()
only requires that a column fulfill at least one of the criteria. See the next section on combining multiple criteria.
Here, the column names all end with a number, but we may sometimes have an additional suffix.
Rename the columns to reflect fictional income data, with suffixes denoting whether it is gross or net income:
colnames(dat) <-
paste0("income_",
rep(2000:2004, each = 2),
rep(c("_gross", "_net"), times = 2))
dat
income_2000_gross income_2000_net income_2001_gross income_2001_net
1 10 6 17 15
2 19 20 13 1
income_2002_gross income_2002_net income_2003_gross income_2003_net
1 11 8 16 18
2 4 12 9 14
income_2004_gross income_2004_net
1 7 5
2 3 2
num_range()
has a suffix
argument that is set to ""
by default. Since it is an argument, we can change it.
To get the gross income for years 2002-2003, specify a prefix
of "income_"
, a range
of 2002:2003
, and a suffix of "_gross"
:
|>
dat select(num_range(prefix = "income_",
range = 2002:2003,
suffix = "_gross"))
income_2002_gross income_2003_gross
1 11 16
2 4 9
12.3.5 Multiple Criteria
Multiple criteria can be combined within select()
. Commas function as the logical OR
, so to find columns that start with “d” OR
end with “p”, we would specify:
|>
mtcars select(starts_with("d"),
ends_with("p")) |>
head()
disp drat hp
Mazda RX4 160 3.90 110
Mazda RX4 Wag 160 3.90 110
Datsun 710 108 3.85 93
Hornet 4 Drive 258 3.08 110
Hornet Sportabout 360 3.15 175
Valiant 225 2.76 105
Substituting the comma for an OR
operator (|
) yields the same result:
|>
mtcars select(starts_with("d") |
ends_with("p")) |>
head()
disp drat hp
Mazda RX4 160 3.90 110
Mazda RX4 Wag 160 3.90 110
Datsun 710 108 3.85 93
Hornet 4 Drive 258 3.08 110
Hornet Sportabout 360 3.15 175
Valiant 225 2.76 105
If we wanted columns whose names fulfill both criteria, use the logical AND
(&
):
|>
mtcars select(starts_with("d") & ends_with("p")) |>
head()
disp
Mazda RX4 160
Mazda RX4 Wag 160
Datsun 710 108
Hornet 4 Drive 258
Hornet Sportabout 360
Valiant 225
Combine more criteria together with more logical operators, using names, pattern-functions, ranges, and/or type-functions:
|>
mtcars select(starts_with("d") & ends_with("p") & !where(is.character),
contains("ar") & !starts_with("c"),
|>
mpg) head()
disp gear mpg
Mazda RX4 160 4 21.0
Mazda RX4 Wag 160 4 21.0
Datsun 710 108 4 22.8
Hornet 4 Drive 258 3 21.4
Hornet Sportabout 360 3 18.7
Valiant 225 3 18.1
12.4 Subsetting Rows
While selecting columns from a dataset is primarily for convenience, filtering the rows in a dataset has consequences for subsequent analyses and visualizations because it alters the sample.
12.4.1 By Condition
The filter()
function takes one or more logical expressions. Only rows where all of the conditions evaluate to TRUE
are returned.
The built-in airquality
dataset has daily measurements for four metrics of air quality:
head(airquality)
Ozone Solar.R Wind Temp Month Day
1 41 190 7.4 67 5 1
2 36 118 8.0 72 5 2
3 12 149 12.6 74 5 3
4 18 313 11.5 62 5 4
5 NA NA 14.3 56 5 5
6 28 NA 14.9 66 5 6
Keep rows for only the first day of the month, where Day == 1
:
|>
airquality filter(Day == 1)
Ozone Solar.R Wind Temp Month Day
1 41 190 7.4 67 5 1
2 NA 286 8.6 78 6 1
3 135 269 4.1 84 7 1
4 39 83 6.9 81 8 1
5 96 167 6.9 91 9 1
Keep rows where Solar.R
is over 315:
|>
airquality filter(Solar.R > 315)
Ozone Solar.R Wind Temp Month Day
1 14 334 11.5 64 5 16
2 30 322 11.5 68 5 19
3 11 320 16.6 73 5 22
4 39 323 11.5 87 6 10
5 NA 332 13.8 80 6 14
6 NA 322 11.5 79 6 15
Keep rows where Solar.R
is over 315 and Temp
is at least 70:
|>
airquality filter(Solar.R > 315,
>= 70) Temp
Ozone Solar.R Wind Temp Month Day
1 11 320 16.6 73 5 22
2 39 323 11.5 87 6 10
3 NA 332 13.8 80 6 14
4 NA 322 11.5 79 6 15
The commas within filter()
function as the logical AND
(&
). Replacing them with this operator yields the same results:
|>
airquality filter(Solar.R > 315 &
>= 70) Temp
Ozone Solar.R Wind Temp Month Day
1 11 320 16.6 73 5 22
2 39 323 11.5 87 6 10
3 NA 332 13.8 80 6 14
4 NA 322 11.5 79 6 15
Further subset the data to drop rows where Ozone
is missing. We want to drop missing values, but filter()
wants a condition that evaluates to TRUE
, so we need to specify what we want to keep. We want to keep observations where Ozone
is not (!
) missing (is.na()
): !is.na(Ozone)
|>
airquality filter(Solar.R > 315 &
>= 70,
Temp !is.na(Ozone))
Ozone Solar.R Wind Temp Month Day
1 11 320 16.6 73 5 22
2 39 323 11.5 87 6 10
12.4.2 By Values
The slice_min()
and slice_max()
functions are used to return rows with the lowest or highest values for some variable.
To see two rows from airquality
with the highest value of Temp
, specify the name of the variable and the number of rows to return (n
):
|>
airquality slice_max(Temp, n = 2)
Ozone Solar.R Wind Temp Month Day
1 76 203 9.7 97 8 28
2 84 237 6.3 96 8 30
In the case of a tie, we will get more rows than n
. Asking for three rows returns four:
|>
airquality slice_max(Temp, n = 3)
Ozone Solar.R Wind Temp Month Day
1 76 203 9.7 97 8 28
2 84 237 6.3 96 8 30
3 118 225 2.3 94 8 29
4 85 188 6.3 94 8 31
slice_min()
works just like slice_max()
but, as its name suggests, returns the rows with the lowest values rather than the greatest values. Find rows with the five lowest (or more if there is a tie) values of Temp
:
|>
airquality slice_min(Temp, n = 5)
Ozone Solar.R Wind Temp Month Day
1 NA NA 14.3 56 5 5
2 6 78 18.4 57 5 18
3 NA 66 16.6 57 5 25
4 NA NA 8.0 57 5 27
5 18 65 13.2 58 5 15
6 NA 266 14.9 58 5 26
12.5 Exercises
12.5.1 Fundamental
Use the penguins
dataset for these exercises. After subsetting, the dataset will have three columns and eight rows.
Drop penguins from Dream Island.
Keep rows where penguins’
body_mass
is in the range 3000-3200.Keep columns containing the letters “b” or “p”.
Drop columns without an underscore (
_
) in their name.
12.5.2 Extended
Drop character and date columns from
penguins_raw
.Use this dataset:
<- read.csv("https://sscc.wisc.edu/sscc/pubs/dwr/data/subset_extended.csv") dat
Its column names are yeast genes and locus tags. Use one or more
num_range()
to get back the following sets of columns:ALG1 ALG2 ALG13 ALG14
SEC10 SEC12 SEC13 SEC14 SEC15 SEC20 SEC21 SEC22 SEC23 SEC24
GPI16 GPI17 GPI18 DLK_1033 DLK_1034 DLK_1035
ERG_9090 ERG_9092 ERG_9094 ERG_9096 ERG_9098 ERG_9100 ERG_9102 ERG_9104 ERG_9106