16 Subsetting

Subsetting involves making a dataset smaller by selecting a portion of the data that we are interested in.

We can take subsets of a dataset either by columns or by rows. For example, we might have a dataset with extra columns we do not care about, or we might want to look at rows where a certain condition is true (such as age greater than or equal to 18).

16.1 Extractors

We have previously seen that we can index vectors by position, name, or condition to extract and replace elements. Now we extend this to data frames. In addition to the square bracket extractors, [ ], we also have two “list” extractors, the dollar sign , $, and the doubled square brackets, [[ ]]. Here it is relevant that a data frame is a special kind of list.

The square brackets, [ ], allow us to index both the rows and the columns of a data frame. We can index by position, name, or condition.

When used with data frames (or matrices), we index by two positions, rows and columns. The square brackets should contain two objects, a vector indexing the rows and a vector indexing the columns. To select all the elements along one dimension, omit that vector but include the comma.

Try these examples:
```
mtcars["Valiant", "mpg"]  # a single element
mtcars[ , c("mpg", "wt")] # all rows, two columns
mtcars[1:6, ]             # rows 1-6, all columns
mtcars[ , ]               # everything
```
The next example illustrates indexing rows by position and columns by name.
```
mtcars[1:6, c("mpg", "wt")] # [rows, columns]
```
```
                   mpg    wt
Mazda RX4         21.0 2.620
Mazda RX4 Wag     21.0 2.875
Datsun 710        22.8 2.320
Hornet 4 Drive    21.4 3.215
Hornet Sportabout 18.7 3.440
Valiant           18.1 3.460
```
The dollar sign extractor, $, allows us to index a single named element of a list. In a data frame, that would be a column, a single variable. (We’ll use the head function to show just the first few values.)
```
head( mtcars$mpg )
```
```
[1] 21.0 21.0 22.8 21.4 18.7 18.1
```
The double brackets, [[ ]], allow us to index a single element of a list by either name or position. This can be useful for programming repeated tasks.
```
head( mtcars[[3]] )      # by position, displacement
```
```
[1] 160 160 108 258 360 225
```
```
head( mtcars[["disp"]] ) # by name
```
```
[1] 160 160 108 258 360 225
```

In addition to these extractors, we also have a subset function that provides a very useful variation on the square bracket extractor.

subset(mtcars, 
       subset=carb >=6,       # rows
       select=c("mpg", "wt")) # columns

               mpg   wt
Ferrari Dino  19.7 2.77
Maserati Bora 15.0 3.57

16.2 Subsetting by Columns

Now, load the dplyr package, which contains the functions we need to subset.

library(dplyr)

The select() function from dplyr gives us a number of ways to take a subset of columns. Four ways to select columns are covered below: By Name, By Pattern, By Range, and By Type. For a full list of the ways to select a subset of columns, run ?tidyselect::language.

16.2.1 By Name

The simplest and most straightforward way to select columns is to pass their names to select(). The output of select() will be our original dataframe, but it will only include the columns we name, and they will be in the specified order. In the mtcars dataset, cyl is before hp, but if we specify hp before cyl in select(), they will appear in that order.

(Note that the code below uses head() to only show the first six rows for space reasons.)

colnames(mtcars)

 [1] "mpg"  "cyl"  "disp" "hp"   "drat" "wt"   "qsec" "vs"   "am"   "gear" "carb"

mtcars |> 
  select(mpg, hp, cyl) |> 
  head()

                   mpg  hp cyl
Mazda RX4         21.0 110   6
Mazda RX4 Wag     21.0 110   6
Datsun 710        22.8  93   4
Hornet 4 Drive    21.4 110   6
Hornet Sportabout 18.7 175   8
Valiant           18.1 105   6

As it was used above, select() only keeps the columns we name. We can also have it only drop the columns we name by using a - in front of the column name.

mtcars |> 
  select(-disp, -drat, -wt, -qsec, -vs, -am, -gear, -carb) |> 
  head()

                   mpg cyl  hp
Mazda RX4         21.0   6 110
Mazda RX4 Wag     21.0   6 110
Datsun 710        22.8   4  93
Hornet 4 Drive    21.4   6 110
Hornet Sportabout 18.7   8 175
Valiant           18.1   6 105

In some cases, we might want to select a certain subset of columns multiple times, and we do not want to have to retype them every time. In that case, we can pass a character vector to select(). We can also add another column, hp, without adding it to our list when running the command.

myCols <- c("mpg", "vs", "carb")

mtcars |> 
  select(myCols, hp) |> 
  head()

Warning: Using an external vector in selections was deprecated in tidyselect 1.1.0.
ℹ Please use `all_of()` or `any_of()` instead.
  # Was:
  data %>% select(myCols)

  # Now:
  data %>% select(all_of(myCols))

See <https://tidyselect.r-lib.org/reference/faq-external-vector.html>.
This warning is displayed once every 8 hours.
Call `lifecycle::last_lifecycle_warnings()` to see where this warning was generated.

                   mpg vs carb  hp
Mazda RX4         21.0  0    4 110
Mazda RX4 Wag     21.0  0    4 110
Datsun 710        22.8  1    1  93
Hornet 4 Drive    21.4  1    1 110
Hornet Sportabout 18.7  0    2 175
Valiant           18.1  1    1 105

We can also simply reorder the columns by giving select() the columns we would like first and then everything().

mtcars |> 
  select(myCols, hp, everything()) |> 
  head()

                   mpg vs carb  hp cyl disp drat    wt  qsec am gear
Mazda RX4         21.0  0    4 110   6  160 3.90 2.620 16.46  1    4
Mazda RX4 Wag     21.0  0    4 110   6  160 3.90 2.875 17.02  1    4
Datsun 710        22.8  1    1  93   4  108 3.85 2.320 18.61  1    4
Hornet 4 Drive    21.4  1    1 110   6  258 3.08 3.215 19.44  0    3
Hornet Sportabout 18.7  0    2 175   8  360 3.15 3.440 17.02  0    3
Valiant           18.1  1    1 105   6  225 2.76 3.460 20.22  0    3

select() includes the functionality of rename(), so we can do in one line (second example) what normally takes two (first example). The pattern is new_name = old_name.

mtcars |> 
  rename(horsepower = hp) |> 
  select(horsepower, mpg) |> 
  head()

                  horsepower  mpg
Mazda RX4                110 21.0
Mazda RX4 Wag            110 21.0
Datsun 710                93 22.8
Hornet 4 Drive           110 21.4
Hornet Sportabout        175 18.7
Valiant                  105 18.1

mtcars |> 
  select(horsepower = hp, mpg) |> 
  head()

                  horsepower  mpg
Mazda RX4                110 21.0
Mazda RX4 Wag            110 21.0
Datsun 710                93 22.8
Hornet 4 Drive           110 21.4
Hornet Sportabout        175 18.7
Valiant                  105 18.1

16.2.2 By Pattern

Columns can be selected by a pattern, such as the letters they start with or contain. To do this, we need to use other functions inside of select().

starts_with() returns columns that start with a character string. Both “disp” and “drat” start with “d”.

mtcars |> 
  select(starts_with("d")) |> 
  head()

                  disp drat
Mazda RX4          160 3.90
Mazda RX4 Wag      160 3.90
Datsun 710         108 3.85
Hornet 4 Drive     258 3.08
Hornet Sportabout  360 3.15
Valiant            225 2.76

ends_with() does the opposite, and looks for a character string at the end of the column names. “disp” and “hp” end with “p”.

mtcars |> 
  select(ends_with("p")) |> 
  head()

                  disp  hp
Mazda RX4          160 110
Mazda RX4 Wag      160 110
Datsun 710         108  93
Hornet 4 Drive     258 110
Hornet Sportabout  360 175
Valiant            225 105

contains() is more general than both starts_with() and ends_with() since it looks for a character string anywhere in the column name. “ar” is found in both “gear” and “carb”.

mtcars |> 
  select(contains("ar")) |> 
  head()

                  gear carb
Mazda RX4            4    4
Mazda RX4 Wag        4    4
Datsun 710           4    1
Hornet 4 Drive       3    1
Hornet Sportabout    3    2
Valiant              3    1

The functions above all use literal character strings. dplyr also allows for the use of regular expressions with the matches() function. The code below, "^[a-g].a", looks for a string that starts with (^) the letters a-g ([a-g]), then has any character (.), and then the letter a (a). Both “drat” and “gear” fulfill these matching criteria.

mtcars |> 
  select(matches("^[a-g].a"))

                    drat gear
Mazda RX4           3.90    4
Mazda RX4 Wag       3.90    4
Datsun 710          3.85    4
Hornet 4 Drive      3.08    3
Hornet Sportabout   3.15    3
Valiant             2.76    3
Duster 360          3.21    3
Merc 240D           3.69    4
Merc 230            3.92    4
Merc 280            3.92    4
Merc 280C           3.92    4
Merc 450SE          3.07    3
Merc 450SL          3.07    3
Merc 450SLC         3.07    3
Cadillac Fleetwood  2.93    3
Lincoln Continental 3.00    3
Chrysler Imperial   3.23    3
Fiat 128            4.08    4
Honda Civic         4.93    4
Toyota Corolla      4.22    4
Toyota Corona       3.70    3
Dodge Challenger    2.76    3
AMC Javelin         3.15    3
Camaro Z28          3.73    3
Pontiac Firebird    3.08    3
Fiat X1-9           4.08    4
Porsche 914-2       4.43    5
Lotus Europa        3.77    5
Ford Pantera L      4.22    5
Ferrari Dino        3.62    5
Maserati Bora       3.54    5
Volvo 142E          4.11    4

For more on regular expressions, see the chapter on Character Vectors or RStudio’s Regular Expressions cheatsheet.

16.2.3 By Range

The : operator can be used to select a range of variables, either by position or by number. Be careful to check that your columns are in the order you think they are!

mtcars |> 
  select(2:4) |> 
  head()

                  cyl disp  hp
Mazda RX4           6  160 110
Mazda RX4 Wag       6  160 110
Datsun 710          4  108  93
Hornet 4 Drive      6  258 110
Hornet Sportabout   8  360 175
Valiant             6  225 105

mtcars |> 
  select(cyl:hp) |> 
  head()

                  cyl disp  hp
Mazda RX4           6  160 110
Mazda RX4 Wag       6  160 110
Datsun 710          4  108  93
Hornet 4 Drive      6  258 110
Hornet Sportabout   8  360 175
Valiant             6  225 105

If we have multiple columns with a common prefix and a numeric suffix, we can select a range with num_range(). First, let’s change the column names of mtcars to create a fictional dataset with individuals’ incomes in different years, where each row is an individual and each column is their reported income for a different year.

mtcars_income <- mtcars
colnames(mtcars_income) <- paste0("income", 1960:1970)
colnames(mtcars_income)

 [1] "income1960" "income1961" "income1962" "income1963" "income1964" "income1965" "income1966" "income1967"
 [9] "income1968" "income1969" "income1970"

mtcars_income |> 
  select(num_range("income", 1962:1964)) |> 
  head()

                  income1962 income1963 income1964
Mazda RX4                160        110       3.90
Mazda RX4 Wag            160        110       3.90
Datsun 710               108         93       3.85
Hornet 4 Drive           258        110       3.08
Hornet Sportabout        360        175       3.15
Valiant                  225        105       2.76

mtcars_income |> 
  select(num_range("income", seq(1960, 1970, 2))) |> 
  head()

                  income1960 income1962 income1964 income1966 income1968 income1970
Mazda RX4               21.0        160       3.90      16.46          1          4
Mazda RX4 Wag           21.0        160       3.90      17.02          1          4
Datsun 710              22.8        108       3.85      18.61          1          1
Hornet 4 Drive          21.4        258       3.08      19.44          0          1
Hornet Sportabout       18.7        360       3.15      17.02          0          2
Valiant                 18.1        225       2.76      20.22          0          1

To use num_range(), give it the prefix ("income") and a numeric vector with the desired numeric suffixes. The first example specifies the range 1962:1964, giving us 1962, 1963, and 1964. The second example uses seq(1960, 1970, 2), so we will get every two years from 1960 to 1970.

mtcars_income |> 
  select(num_range("income", 1962:1964)) |> 
  head()

                  income1962 income1963 income1964
Mazda RX4                160        110       3.90
Mazda RX4 Wag            160        110       3.90
Datsun 710               108         93       3.85
Hornet 4 Drive           258        110       3.08
Hornet Sportabout        360        175       3.15
Valiant                  225        105       2.76

mtcars_income |> 
  select(num_range("income", seq(1960, 1970, 2))) |> 
  head()

                  income1960 income1962 income1964 income1966 income1968 income1970
Mazda RX4               21.0        160       3.90      16.46          1          4
Mazda RX4 Wag           21.0        160       3.90      17.02          1          4
Datsun 710              22.8        108       3.85      18.61          1          1
Hornet 4 Drive          21.4        258       3.08      19.44          0          1
Hornet Sportabout       18.7        360       3.15      17.02          0          2
Valiant                 18.1        225       2.76      20.22          0          1

If the numbers are out of order in the dataframe, this is not an issue, unlike with :. The examples below first select the columns income1962, income1964, and income1963, in that order. The second example with : fails to select income1963 because it selects columns by position, not by name.

mtcars_income |> 
  select(income1962, income1964, income1963) |> 
  select(num_range("income", 1962:1964)) |> 
  head()

                  income1962 income1963 income1964
Mazda RX4                160        110       3.90
Mazda RX4 Wag            160        110       3.90
Datsun 710               108         93       3.85
Hornet 4 Drive           258        110       3.08
Hornet Sportabout        360        175       3.15
Valiant                  225        105       2.76

mtcars_income |> 
  select(income1962, income1964, income1963) |> 
  select(income1962:income1964) |> 
  head()

                  income1962 income1964
Mazda RX4                160       3.90
Mazda RX4 Wag            160       3.90
Datsun 710               108       3.85
Hornet 4 Drive           258       3.08
Hornet Sportabout        360       3.15
Valiant                  225       2.76

It is also okay if some columns are not in the dataset. Of course, select() will not return these non-existent columns, but it will also not have an error. To see this, we can drop income1963.

mtcars_income |> 
  select(-income1963) |> 
  select(num_range("income", 1962:1964)) |> 
  head()

                  income1962 income1964
Mazda RX4                160       3.90
Mazda RX4 Wag            160       3.90
Datsun 710               108       3.85
Hornet 4 Drive           258       3.08
Hornet Sportabout        360       3.15
Valiant                  225       2.76

16.2.4 By Type

where() can help us select columns that are of a certain type, such as numeric or character. chickwts has one numeric column and one character column.

chickwts |> 
  head()

  weight      feed     feed2     feed3
1    179 horsebean horsebean horsebean
2    160 horsebean horsebean horsebean
3    136 horsebean horsebean horsebean
4    227 horsebean horsebean horsebean
5    217 horsebean horsebean horsebean
6    168 horsebean horsebean horsebean

chickwts |> 
  select(where(is.numeric)) |> 
  head()

16.2.5 Multiple Criteria

For more complex tasks, multiple criteria for selecting columns can be combined with logical operators (&, |, !).

mtcars |> 
  select((starts_with("d") & !where(is.character)) | contains("ar"), mpg) |> 
  head()

                  disp drat gear carb  mpg
Mazda RX4          160 3.90    4    4 21.0
Mazda RX4 Wag      160 3.90    4    4 21.0
Datsun 710         108 3.85    4    1 22.8
Hornet 4 Drive     258 3.08    3    1 21.4
Hornet Sportabout  360 3.15    3    2 18.7
Valiant            225 2.76    3    1 18.1

16.3 Subsetting by Rows

To select a subset of rows from a dataframe, we can provide one or more conditions to the filter() function.

If we want to see rows from mtcars with mpg greater than 25, we can do this as follows:

mtcars |> 
  filter(mpg > 25)

                mpg cyl  disp  hp drat    wt  qsec vs am gear carb
Fiat 128       32.4   4  78.7  66 4.08 2.200 19.47  1  1    4    1
Honda Civic    30.4   4  75.7  52 4.93 1.615 18.52  1  1    4    2
Toyota Corolla 33.9   4  71.1  65 4.22 1.835 19.90  1  1    4    1
Fiat X1-9      27.3   4  79.0  66 4.08 1.935 18.90  1  1    4    1
Porsche 914-2  26.0   4 120.3  91 4.43 2.140 16.70  0  1    5    2
Lotus Europa   30.4   4  95.1 113 3.77 1.513 16.90  1  1    5    2

Multiple conditions can be given to filter(), and these are separated by commas. If we give multiple conditions to filter(), it will only return rows that meet all of our conditions. In other words, we can think of each , like &, and the rows in the output are those for which all conditions are TRUE. Note how the two blocks of code below return the same output.

mtcars |> 
  filter(mpg > 25,
         hp < 100,
         vs == 1)

                mpg cyl disp hp drat    wt  qsec vs am gear carb
Fiat 128       32.4   4 78.7 66 4.08 2.200 19.47  1  1    4    1
Honda Civic    30.4   4 75.7 52 4.93 1.615 18.52  1  1    4    2
Toyota Corolla 33.9   4 71.1 65 4.22 1.835 19.90  1  1    4    1
Fiat X1-9      27.3   4 79.0 66 4.08 1.935 18.90  1  1    4    1

mtcars |> 
  filter(mpg > 25 &
         hp < 100 &
         vs == 1)

                mpg cyl disp hp drat    wt  qsec vs am gear carb
Fiat 128       32.4   4 78.7 66 4.08 2.200 19.47  1  1    4    1
Honda Civic    30.4   4 75.7 52 4.93 1.615 18.52  1  1    4    2
Toyota Corolla 33.9   4 71.1 65 4.22 1.835 19.90  1  1    4    1
Fiat X1-9      27.3   4 79.0 66 4.08 1.935 18.90  1  1    4    1

Because filter() runs logical tests on the rows of our dataset, we should make use of R’s logical operators (&, |, !), relational operators (<, >, <=, >=, ==, !=), and other functions that return logical values (%in%, is.na(), etc.). If it has been a while, or if you jumped to this chapter, you will find the chapter on Logical Vectors helpful.

To illustrate the use of is.na(), we first need some missing data. Since mtcars has no missing data, we can use the mutate() function to temporarily add a column with NA values. The mutate() code below gives us a new column called missingdata that repeats the vector c(1, 2, 3, 4, 5, 6, 7, NA) four times. (mtcars has 32 rows, so we need to repeat a vector with 8 numbers 4 times.) !is.na() will then return rows that do not (!) have NA in missingdata.

mtcars |> 
  mutate(missingdata = rep(c(1:7, NA), 4)) |> 
  filter(cyl %in% c(4, 6),
         mpg > 17 & mpg <= 21,
         am != 0 | gear == 4,
         !is.na(missingdata))

               mpg cyl  disp  hp drat    wt  qsec vs am gear carb missingdata
Mazda RX4     21.0   6 160.0 110 3.90 2.620 16.46  0  1    4    4           1
Mazda RX4 Wag 21.0   6 160.0 110 3.90 2.875 17.02  0  1    4    4           2
Merc 280      19.2   6 167.6 123 3.92 3.440 18.30  1  0    4    4           2
Merc 280C     17.8   6 167.6 123 3.92 3.440 18.90  1  0    4    4           3
Ferrari Dino  19.7   6 145.0 175 3.62 2.770 15.50  0  1    5    6           6

16.4 Exercises

Use the state.x77 dataset for 1 and 2. You will need to first convert it into a dataframe.

Drop the Population column.
Drop columns with spaces in their names.
Using the iris dataset, keep columns that begin with “S”.
Using the airquality dataset, find days where the value of Ozone is in the range 30-50.

16.5 Advanced Exercises

Using the iris dataset, keep rows where Sepal.Length is divisible by 1.0. (Hint: see %% in ?Arithmetic.)
Using the mtcars dataset, create a 3x3 scatterplot matrix (pass a dataframe to plot()) of the mpg, weight, and horsepower of cars that have four or six cylinders, at least four forward gears, and manual transmission. See help(mtcars) to figure out which column is which.