12 Subsetting Dataframes

12.1 Warm-Up

Imagine you have a dataset with these column names:

id race edu a11 b11 c11 a12 b12 c12 a13 b13 c13 a14 b14 c14 income2022 income2024

You realize you only need some of these for your research questions. To systematically select these columns, we need to think of “rules” that would return the column names we want.

For example, if we wanted a11 a12 a13 a14, our rule would be “starts with ‘a’.”

What rules would return these sets of columns?

a11 b11 c11 a12 b12 c12
id race edu
a12 b12 c12 a14 b14 c14 income2022 income2024

12.2 Outcomes

Objective: To systematically subset datasets by selecting columns and rows.

Why it matters: When working with datasets that have hundreds of variables and millions of observations, this skill is critical for creating a manageable dataset suited to your research question. Subsetting is also useful in the analysis of any dataset in order to exclude certain cases or to perform subgroup analyses.

Learning outcomes:

Fundamental Skills	Extended Skills
Select columns by name or pattern. Select rows by condition.	Select columns by type or range. Select rows with minimum or maximum values.

Key functions and operators:

select()
starts_with()
ends_with()
contains()
filter()

12.3 Subsetting Columns

Subsetting involves making a dataset smaller by selecting a portion of the data that we are interested in. Selecting a portion of the columns in a dataset is primarily for convenience. It is easier to work with a dataset with fewer columns.

The functions in this section can also be used within other functions like pivot_longer() and across() that support tidyselect syntax.

Load the tidyverse, which has functions for subsetting datasets by columns and rows.

library(tidyverse)

12.3.1 By Name

The simplest way to select columns is to pass their names to select(). The output of select() will be our original dataframe, but it will only include the columns we name, and they will be in the specified order. In the mtcars dataset, cyl is before hp, but if we specify hp before cyl in select(), they will appear in that order.

colnames(mtcars)

 [1] "mpg"  "cyl"  "disp" "hp"   "drat" "wt"   "qsec" "vs"   "am"   "gear"
[11] "carb"

mtcars |> 
  select(mpg, hp, cyl) |> 
  head()

                   mpg  hp cyl
Mazda RX4         21.0 110   6
Mazda RX4 Wag     21.0 110   6
Datsun 710        22.8  93   4
Hornet 4 Drive    21.4 110   6
Hornet Sportabout 18.7 175   8
Valiant           18.1 105   6

Note

The column subsetting code on this page uses head() to just display the first six rows to save space.

The above example only keeps the rows we name. If we prefix column names with -, it will only drop those we name:

mtcars |> 
  select(-mpg, -hp, -cyl) |> 
  head()

                  disp drat    wt  qsec vs am gear carb
Mazda RX4          160 3.90 2.620 16.46  0  1    4    4
Mazda RX4 Wag      160 3.90 2.875 17.02  0  1    4    4
Datsun 710         108 3.85 2.320 18.61  1  1    4    1
Hornet 4 Drive     258 3.08 3.215 19.44  1  0    3    1
Hornet Sportabout  360 3.15 3.440 17.02  0  0    3    2
Valiant            225 2.76 3.460 20.22  1  0    3    1

12.3.2 By Pattern

Columns can be selected by a pattern, such as the letters they start with or contain. To do this, we need to use other functions inside of select().

starts_with() returns columns that start with a character string. Both disp and drat start with “d”.

mtcars |> 
  select(starts_with("d")) |> 
  head()

                  disp drat
Mazda RX4          160 3.90
Mazda RX4 Wag      160 3.90
Datsun 710         108 3.85
Hornet 4 Drive     258 3.08
Hornet Sportabout  360 3.15
Valiant            225 2.76

ends_with() does the opposite, and looks for a character string at the end of the column names. disp and hp both end with “p”.

mtcars |> 
  select(ends_with("p")) |> 
  head()

                  disp  hp
Mazda RX4          160 110
Mazda RX4 Wag      160 110
Datsun 710         108  93
Hornet 4 Drive     258 110
Hornet Sportabout  360 175
Valiant            225 105

contains() is more general than both starts_with() and ends_with() since it looks for a character string anywhere in the column name. “ar” is found in both gear and carb.

mtcars |> 
  select(contains("ar")) |> 
  head()

                  gear carb
Mazda RX4            4    4
Mazda RX4 Wag        4    4
Datsun 710           4    1
Hornet 4 Drive       3    1
Hornet Sportabout    3    2
Valiant              3    1

matches() allows use of regex (see the chapter on character vectors). The regex ^[a-g].a finds a string that starts with (^) the letters a-g ([a-g]), then has any character (.), and then the letter a (a). Both drat and gear fulfill these criteria.

mtcars |> 
  select(matches("^[a-g].a")) |> 
  head()

                  drat gear
Mazda RX4         3.90    4
Mazda RX4 Wag     3.90    4
Datsun 710        3.85    4
Hornet 4 Drive    3.08    3
Hornet Sportabout 3.15    3
Valiant           2.76    3

12.3.3 By Type

where() can help us select columns that are of a certain type, such as numeric or character. chickwts has one numeric column and one factor column:

glimpse(chickwts)

Rows: 71
Columns: 2
$ weight <dbl> 179, 160, 136, 227, 217, 168, 108, 124, 143, 140, 309, 229, 181…
$ feed   <fct> horsebean, horsebean, horsebean, horsebean, horsebean, horsebea…

Use the appropriate type-checking function to get either the numeric (is.numeric) or factor column (is.factor):

chickwts |> 
  select(where(is.factor)) |> 
  head()

       feed
1 horsebean
2 horsebean
3 horsebean
4 horsebean
5 horsebean
6 horsebean

12.3.4 By Range

If we have multiple columns with a common prefix and a numeric infix or suffix, we can select a range with num_range().

First, make up some data:

dat <-
  matrix(sample(1:20),
         ncol = 10) |> 
  as.data.frame()

dat

  V1 V2 V3 V4 V5 V6 V7 V8 V9 V10
1 10  6 17 15 11  8 16 18  7   5
2 19 20 13  1  4 12  9 14  3   2

Rename the columns to be gene names, BCR1-5 and CTL6-10:

colnames(dat) <- 
  c("BCR1", "BCR2", "BCR3", "BCR4", "BCR5", 
    "CTL6", "CTL7", "CTL8", "CTL9", "CTL10")

num_range() takes a prefix and a numeric range. If we want BCR2-5, we would specify a prefix of "BCR" and a range of 2:5:

dat |> 
  select(num_range(prefix = "BCR", 
                   range = 2:5))

  BCR2 BCR3 BCR4 BCR5
1    6   17   15   11
2   20   13    1    4

The range argument can take any numeric vector, such as a sequence:

dat |> 
  select(num_range(prefix = "BCR", 
                   range = seq(1, 5, 2)))

  BCR1 BCR3 BCR5
1   10   17   11
2   19   13    4

Or some arbitrary numbers we combine with c():

dat |> 
  select(num_range(prefix = "BCR", 
                   range = c(1, 2, 4)))

  BCR1 BCR2 BCR4
1   10    6   15
2   19   20    1

If we want multiple prefixes to be returned, we just need to specify more than one num_range() within select(). To also return CTL6-8, specify a prefix of "CTL" and a range of 6:8:

dat |> 
  select(num_range(prefix = "BCR", 
                   range = 2:5),
         num_range(prefix = "CTL", 
                   range = 6:8))

  BCR2 BCR3 BCR4 BCR5 CTL6 CTL7 CTL8
1    6   17   15   11    8   16   18
2   20   13    1    4   12    9   14

In this case, select() only requires that a column fulfill at least one of the criteria. See the next section on combining multiple criteria.

Here, the column names all end with a number, but we may sometimes have an additional suffix.

Rename the columns to reflect fictional income data, with suffixes denoting whether it is gross or net income:

colnames(dat) <- 
  paste0("income_",
         rep(2000:2004, each = 2),
         rep(c("_gross", "_net"), times = 2))

dat

  income_2000_gross income_2000_net income_2001_gross income_2001_net
1                10               6                17              15
2                19              20                13               1
  income_2002_gross income_2002_net income_2003_gross income_2003_net
1                11               8                16              18
2                 4              12                 9              14
  income_2004_gross income_2004_net
1                 7               5
2                 3               2

num_range() has a suffix argument that is set to "" by default. Since it is an argument, we can change it.

To get the gross income for years 2002-2003, specify a prefix of "income_", a range of 2002:2003, and a suffix of "_gross":

dat |> 
  select(num_range(prefix = "income_", 
                   range = 2002:2003, 
                   suffix = "_gross"))

  income_2002_gross income_2003_gross
1                11                16
2                 4                 9

12.3.5 Multiple Criteria

Multiple criteria can be combined within select(). Commas function as the logical OR, so to find columns that start with “d” OR end with “p”, we would specify:

mtcars |> 
  select(starts_with("d"), 
         ends_with("p")) |> 
  head()

                  disp drat  hp
Mazda RX4          160 3.90 110
Mazda RX4 Wag      160 3.90 110
Datsun 710         108 3.85  93
Hornet 4 Drive     258 3.08 110
Hornet Sportabout  360 3.15 175
Valiant            225 2.76 105

Substituting the comma for an OR operator (|) yields the same result:

mtcars |> 
  select(starts_with("d") | 
         ends_with("p")) |> 
  head()

                  disp drat  hp
Mazda RX4          160 3.90 110
Mazda RX4 Wag      160 3.90 110
Datsun 710         108 3.85  93
Hornet 4 Drive     258 3.08 110
Hornet Sportabout  360 3.15 175
Valiant            225 2.76 105

If we wanted columns whose names fulfill both criteria, use the logical AND (&):

mtcars |> 
  select(starts_with("d") & ends_with("p")) |> 
  head()

                  disp
Mazda RX4          160
Mazda RX4 Wag      160
Datsun 710         108
Hornet 4 Drive     258
Hornet Sportabout  360
Valiant            225

Combine more criteria together with more logical operators, using names, pattern-functions, ranges, and/or type-functions:

mtcars |> 
  select(starts_with("d") & ends_with("p") & !where(is.character),
         contains("ar") & !starts_with("c"),
         mpg) |> 
  head()

                  disp gear  mpg
Mazda RX4          160    4 21.0
Mazda RX4 Wag      160    4 21.0
Datsun 710         108    4 22.8
Hornet 4 Drive     258    3 21.4
Hornet Sportabout  360    3 18.7
Valiant            225    3 18.1

12.4 Subsetting Rows

While selecting columns from a dataset is primarily for convenience, filtering the rows in a dataset has consequences for subsequent analyses and visualizations because it alters the sample.

12.4.1 By Condition

The filter() function takes one or more logical expressions. Only rows where all of the conditions evaluate to TRUE are returned.

The built-in airquality dataset has daily measurements for four metrics of air quality:

head(airquality)

  Ozone Solar.R Wind Temp Month Day
1    41     190  7.4   67     5   1
2    36     118  8.0   72     5   2
3    12     149 12.6   74     5   3
4    18     313 11.5   62     5   4
5    NA      NA 14.3   56     5   5
6    28      NA 14.9   66     5   6

Keep rows for only the first day of the month, where Day == 1:

airquality |> 
  filter(Day == 1)

  Ozone Solar.R Wind Temp Month Day
1    41     190  7.4   67     5   1
2    NA     286  8.6   78     6   1
3   135     269  4.1   84     7   1
4    39      83  6.9   81     8   1
5    96     167  6.9   91     9   1

Keep rows where Solar.R is over 315:

airquality |> 
  filter(Solar.R > 315)

  Ozone Solar.R Wind Temp Month Day
1    14     334 11.5   64     5  16
2    30     322 11.5   68     5  19
3    11     320 16.6   73     5  22
4    39     323 11.5   87     6  10
5    NA     332 13.8   80     6  14
6    NA     322 11.5   79     6  15

Keep rows where Solar.R is over 315 and Temp is at least 70:

airquality |> 
  filter(Solar.R > 315,
         Temp >= 70)

  Ozone Solar.R Wind Temp Month Day
1    11     320 16.6   73     5  22
2    39     323 11.5   87     6  10
3    NA     332 13.8   80     6  14
4    NA     322 11.5   79     6  15

The commas within filter() function as the logical AND (&). Replacing them with this operator yields the same results:

airquality |> 
  filter(Solar.R > 315 &
         Temp >= 70)

  Ozone Solar.R Wind Temp Month Day
1    11     320 16.6   73     5  22
2    39     323 11.5   87     6  10
3    NA     332 13.8   80     6  14
4    NA     322 11.5   79     6  15

Further subset the data to drop rows where Ozone is missing. We want to drop missing values, but filter() wants a condition that evaluates to TRUE, so we need to specify what we want to keep. We want to keep observations where Ozone is not (!) missing (is.na()): !is.na(Ozone)

airquality |> 
  filter(Solar.R > 315 &
         Temp >= 70,
         !is.na(Ozone))

  Ozone Solar.R Wind Temp Month Day
1    11     320 16.6   73     5  22
2    39     323 11.5   87     6  10

12.4.2 By Values

The slice_min() and slice_max() functions are used to return rows with the lowest or highest values for some variable.

To see two rows from airquality with the highest value of Temp, specify the name of the variable and the number of rows to return (n):

airquality |> 
  slice_max(Temp, n = 2)

  Ozone Solar.R Wind Temp Month Day
1    76     203  9.7   97     8  28
2    84     237  6.3   96     8  30

In the case of a tie, we will get more rows than n. Asking for three rows returns four:

airquality |> 
  slice_max(Temp, n = 3)

  Ozone Solar.R Wind Temp Month Day
1    76     203  9.7   97     8  28
2    84     237  6.3   96     8  30
3   118     225  2.3   94     8  29
4    85     188  6.3   94     8  31

slice_min() works just like slice_max() but, as its name suggests, returns the rows with the lowest values rather than the greatest values. Find rows with the five lowest (or more if there is a tie) values of Temp:

airquality |> 
  slice_min(Temp, n = 5)

  Ozone Solar.R Wind Temp Month Day
1    NA      NA 14.3   56     5   5
2     6      78 18.4   57     5  18
3    NA      66 16.6   57     5  25
4    NA      NA  8.0   57     5  27
5    18      65 13.2   58     5  15
6    NA     266 14.9   58     5  26

12.5 Exercises

12.5.1 Fundamental

Use the penguins dataset for these exercises. After subsetting, the dataset will have three columns and eight rows.

Drop penguins from Dream Island.
Keep rows where penguins’ body_mass is in the range 3000-3200.
Keep columns containing the letters “b” or “p”.
Drop columns without an underscore (_) in their name.

12.5.2 Extended

Drop character and date columns from penguins_raw.
Use this dataset:
```
dat <- read.csv("https://sscc.wisc.edu/sscc/pubs/dwr/data/subset_extended.csv")
```
Its column names are yeast genes and locus tags. Use one or more num_range() to get back the following sets of columns:
- ALG1 ALG2 ALG13 ALG14
- SEC10 SEC12 SEC13 SEC14 SEC15 SEC20 SEC21 SEC22 SEC23 SEC24
- GPI16 GPI17 GPI18 DLK_1033 DLK_1034 DLK_1035
- ERG_9090 ERG_9092 ERG_9094 ERG_9096 ERG_9098 ERG_9100 ERG_9102 ERG_9104 ERG_9106