12 Subsetting

When working with the data in a data frame, you will routinely find that you only want to work with a portion of the data. Very often it will be convenient to create a new data frame that is a subset of your original data frame, especially where a function uses a data frame as it’s primary input argument, or where you will be using the same subset repeatedly in several subsequent statements in your script.

We can think of working with a subset of the columns of a data frame, or a subset of the rows. Two common examples are:

  • Subset by selecting variables to create a scatterplot matrix.
  • Subset by selecting observations for a data frame with no missing values.

12.1 Extractors

We have previously seen that we can index vectors by position, name, or condition to extract and replace elements. Now we extend this to data frames. In addition to the square bracket extractors, [ ], we also have two “list” extractors, the dollar sign , $, and the doubled square brackets, [[ ]]. Here it is relevant that a data frame is a special kind of list.

  • The square brackets, [ ], allow us to index both the rows and the columns of a data frame. We can index by position, name, or condition.

    When used with data frames (or matrices), we index by two positions, rows and columns. The square brackets should contain two objects, a vector indexing the rows and a vector indexing the columns. To select all the elements along one dimension, omit that vector but include the comma.

    Try these examples:

    mtcars["Valiant", "mpg"]  # a single element
    mtcars[ , c("mpg", "wt")] # all rows, two columns
    mtcars[1:6, ]             # rows 1-6, all columns
    mtcars[ , ]               # everything

    The next example illustrates indexing rows by position and columns by name.

    mtcars[1:6, c("mpg", "wt")] # [rows, columns]
                       mpg    wt
    Mazda RX4         21.0 2.620
    Mazda RX4 Wag     21.0 2.875
    Datsun 710        22.8 2.320
    Hornet 4 Drive    21.4 3.215
    Hornet Sportabout 18.7 3.440
    Valiant           18.1 3.460
  • The dollar sign extractor, $, allows us to index a single named element of a list. In a data frame, that would be a column, a single variable. (We’ll use the head function to show just the first few values.)

    head( mtcars$mpg )
    [1] 21.0 21.0 22.8 21.4 18.7 18.1
  • The double brackets, [[ ]], allow us to index a single element of a list by either name or position. This can be useful for programming repeated tasks.

    head( mtcars[[3]] )      # by position, displacement
    [1] 160 160 108 258 360 225
    head( mtcars[["disp"]] ) # by name
    [1] 160 160 108 258 360 225
  • In addition to these extractors, we also have a subset function that provides a very useful variation on the square bracket extractor.

    subset(mtcars, 
           subset=carb >=6,       # rows
           select=c("mpg", "wt")) # columns
                   mpg   wt
    Ferrari Dino  19.7 2.77
    Maserati Bora 15.0 3.57

Data for further examples:

cars <- mtcars
# create three character variables
cars$name <- row.names(cars)
cars$make <- sub(" .*$", "", cars$name)
cars$model <- sub("^.*? ", "", cars$name)

12.2 Selecting Columns

Perhaps the most common subset you will use will be pulling a single column from a data frame by name using the dollar sign extractor. This is so common and innocuous you might not even think of it as “subsetting!”

mtcars$mpg

12.2.1 Single name

This can be done with any extractor. The array and list extractors return a vector, while the subset function returns a data frame (we’ll pipe to the head function to show just a few rows of the data frame).

library(magrittr)

# a solitary vector
cars$mpg      %>%  
    head
[1] 21.0 21.0 22.8 21.4 18.7 18.1
cars[, "mpg"] %>%
    head
[1] 21.0 21.0 22.8 21.4 18.7 18.1
cars[["mpg"]] %>%
    head
[1] 21.0 21.0 22.8 21.4 18.7 18.1
# subset() returns a data.frame
subset(cars, select="mpg") %>% 
    head
                   mpg
Mazda RX4         21.0
Mazda RX4 Wag     21.0
Datsun 710        22.8
Hornet 4 Drive    21.4
Hornet Sportabout 18.7
Valiant           18.1

12.2.2 Arbitrary names

Very commonly you will want to put all of the variables you are using for a project into a single data frame, selecting a subset of columns using an arbitrary vector of names.

Note that the column names are specified as a character vector.

head( cars[, c("mpg", "disp")] )
                   mpg disp
Mazda RX4         21.0  160
Mazda RX4 Wag     21.0  160
Datsun 710        22.8  108
Hornet 4 Drive    21.4  258
Hornet Sportabout 18.7  360
Valiant           18.1  225
head( subset(cars, select=c("mpg", "disp")) )
                   mpg disp
Mazda RX4         21.0  160
Mazda RX4 Wag     21.0  160
Datsun 710        22.8  108
Hornet 4 Drive    21.4  258
Hornet Sportabout 18.7  360
Valiant           18.1  225

12.2.3 Name prefix, other name wildcards

It is fairly common for groups of related variables to have similar names with a common prefix or a common suffix, and often we want to extract these as a group. Perhaps the data have repeated observations of household income across multiple years, recorded as hhinc2005, hhinc2010, hhinc2015. In base R we use regular expressions to find multiple matching names, and convert those to column positions.

# starts with "m"
head( cars[ , grep("^m"   , names(cars))] ) 
                   mpg    make      model
Mazda RX4         21.0   Mazda        RX4
Mazda RX4 Wag     21.0   Mazda    RX4 Wag
Datsun 710        22.8  Datsun        710
Hornet 4 Drive    21.4  Hornet    4 Drive
Hornet Sportabout 18.7  Hornet Sportabout
Valiant           18.1 Valiant    Valiant
# starts with "m" or "c"
head( cars[ , grep("^[mc]", names(cars))] ) 
                   mpg cyl carb    make      model
Mazda RX4         21.0   6    4   Mazda        RX4
Mazda RX4 Wag     21.0   6    4   Mazda    RX4 Wag
Datsun 710        22.8   4    1  Datsun        710
Hornet 4 Drive    21.4   6    1  Hornet    4 Drive
Hornet Sportabout 18.7   8    2  Hornet Sportabout
Valiant           18.1   6    1 Valiant    Valiant

12.2.4 Position range

Less commonly you may extract columns from a data frame by position. This is easy to use when there are not too many columns in the original data frame, harder to use where there are a lot of columns, or where the data have been restructured as with a merge or reshape.

head(cars[, 1:5])
                   mpg cyl disp  hp drat
Mazda RX4         21.0   6  160 110 3.90
Mazda RX4 Wag     21.0   6  160 110 3.90
Datsun 710        22.8   4  108  93 3.85
Hornet 4 Drive    21.4   6  258 110 3.08
Hornet Sportabout 18.7   8  360 175 3.15
Valiant           18.1   6  225 105 2.76

12.2.5 Column Type

Occasionally you will want to work with all of the variables of a certain type - perhaps to create a scatterplot matrix of all numeric variables, to create a table of means and standard deviations of continuous variables, or to create frequency tables for all factor variables.

To accomplish this task we specify the variables to use by condition. In order to check the type of each variable, we use the sapply function (one of several apply functions used for repeated tasks).

# a variation of lapply
sapply(cars, is.character)
  mpg   cyl  disp    hp  drat    wt  qsec    vs    am  gear  carb  name  make model 
FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE  TRUE  TRUE  TRUE 
cars[,sapply(cars, is.character)]  %>%
    head
                               name    make      model
Mazda RX4                 Mazda RX4   Mazda        RX4
Mazda RX4 Wag         Mazda RX4 Wag   Mazda    RX4 Wag
Datsun 710               Datsun 710  Datsun        710
Hornet 4 Drive       Hornet 4 Drive  Hornet    4 Drive
Hornet Sportabout Hornet Sportabout  Hornet Sportabout
Valiant                     Valiant Valiant    Valiant
subset(cars, 
       select=sapply(cars, is.character))  %>%
    head
                               name    make      model
Mazda RX4                 Mazda RX4   Mazda        RX4
Mazda RX4 Wag         Mazda RX4 Wag   Mazda    RX4 Wag
Datsun 710               Datsun 710  Datsun        710
Hornet 4 Drive       Hornet 4 Drive  Hornet    4 Drive
Hornet Sportabout Hornet Sportabout  Hornet Sportabout
Valiant                     Valiant Valiant    Valiant

12.2.6 Name range

The idea here is that we want all of the variables that are positioned together in the data frame, for example from mpg to disp inclusive. To accomplish this in base R, we need to figure out what position mpg and disp occupy in the data frame. Given these positions, we convert a name range into a position range.

varnames <- names(cars)
# an "exact" or "fixed" regular expression
grep("mpg", varnames)  
[1] 1
grep("disp", varnames)
[1] 3
# as one statement
cars[, grep("mpg", names(cars)):grep("disp", names(cars))]
                     mpg cyl  disp
Mazda RX4           21.0   6 160.0
Mazda RX4 Wag       21.0   6 160.0
Datsun 710          22.8   4 108.0
Hornet 4 Drive      21.4   6 258.0
Hornet Sportabout   18.7   8 360.0
Valiant             18.1   6 225.0
Duster 360          14.3   8 360.0
Merc 240D           24.4   4 146.7
Merc 230            22.8   4 140.8
Merc 280            19.2   6 167.6
Merc 280C           17.8   6 167.6
Merc 450SE          16.4   8 275.8
Merc 450SL          17.3   8 275.8
Merc 450SLC         15.2   8 275.8
Cadillac Fleetwood  10.4   8 472.0
Lincoln Continental 10.4   8 460.0
Chrysler Imperial   14.7   8 440.0
Fiat 128            32.4   4  78.7
Honda Civic         30.4   4  75.7
Toyota Corolla      33.9   4  71.1
Toyota Corona       21.5   4 120.1
Dodge Challenger    15.5   8 318.0
AMC Javelin         15.2   8 304.0
Camaro Z28          13.3   8 350.0
Pontiac Firebird    19.2   8 400.0
Fiat X1-9           27.3   4  79.0
Porsche 914-2       26.0   4 120.3
Lotus Europa        30.4   4  95.1
Ford Pantera L      15.8   8 351.0
Ferrari Dino        19.7   6 145.0
Maserati Bora       15.0   8 301.0
Volvo 142E          21.4   4 121.0

12.3 Subsetting Rows

12.3.1 most often by condition

# note the repeated "cars$... prefix"
cars[cars$am==0 & cars$mpg > 20, ]
                mpg cyl  disp  hp drat    wt  qsec vs am gear carb           name   make   model
Hornet 4 Drive 21.4   6 258.0 110 3.08 3.215 19.44  1  0    3    1 Hornet 4 Drive Hornet 4 Drive
Merc 240D      24.4   4 146.7  62 3.69 3.190 20.00  1  0    4    2      Merc 240D   Merc    240D
Merc 230       22.8   4 140.8  95 3.92 3.150 22.90  1  0    4    2       Merc 230   Merc     230
Toyota Corona  21.5   4 120.1  97 3.70 2.465 20.01  1  0    3    1  Toyota Corona Toyota  Corona
# with subset() you don't need to use "cars$..."
subset(cars, am==0 & mpg > 20)
                mpg cyl  disp  hp drat    wt  qsec vs am gear carb           name   make   model
Hornet 4 Drive 21.4   6 258.0 110 3.08 3.215 19.44  1  0    3    1 Hornet 4 Drive Hornet 4 Drive
Merc 240D      24.4   4 146.7  62 3.69 3.190 20.00  1  0    4    2      Merc 240D   Merc    240D
Merc 230       22.8   4 140.8  95 3.92 3.150 22.90  1  0    4    2       Merc 230   Merc     230
Toyota Corona  21.5   4 120.1  97 3.70 2.465 20.01  1  0    3    1  Toyota Corona Toyota  Corona
# this is how the subset option alongside a formula works, as well.

12.3.2 by name

Sometimes people use ID values as row names.

cars["Merc 230",]
          mpg cyl  disp hp drat   wt qsec vs am gear carb     name make model
Merc 230 22.8   4 140.8 95 3.92 3.15 22.9  1  0    4    2 Merc 230 Merc   230

12.3.3 by position

This is occasionally useful, especially to create data sets for testing and debugging your code.

cars[1:5,]
                   mpg cyl disp  hp drat    wt  qsec vs am gear carb              name   make      model
Mazda RX4         21.0   6  160 110 3.90 2.620 16.46  0  1    4    4         Mazda RX4  Mazda        RX4
Mazda RX4 Wag     21.0   6  160 110 3.90 2.875 17.02  0  1    4    4     Mazda RX4 Wag  Mazda    RX4 Wag
Datsun 710        22.8   4  108  93 3.85 2.320 18.61  1  1    4    1        Datsun 710 Datsun        710
Hornet 4 Drive    21.4   6  258 110 3.08 3.215 19.44  1  0    3    1    Hornet 4 Drive Hornet    4 Drive
Hornet Sportabout 18.7   8  360 175 3.15 3.440 17.02  0  0    3    2 Hornet Sportabout Hornet Sportabout

12.3.4 random observations

# approximately 25% of our data
cars[runif(nrow(cars)) < 0.25,] # approximately 8 observations
                   mpg cyl  disp  hp drat    wt  qsec vs am gear carb              name     make    model
Mazda RX4 Wag     21.0   6 160.0 110 3.90 2.875 17.02  0  1    4    4     Mazda RX4 Wag    Mazda  RX4 Wag
Datsun 710        22.8   4 108.0  93 3.85 2.320 18.61  1  1    4    1        Datsun 710   Datsun      710
Hornet 4 Drive    21.4   6 258.0 110 3.08 3.215 19.44  1  0    3    1    Hornet 4 Drive   Hornet  4 Drive
Chrysler Imperial 14.7   8 440.0 230 3.23 5.345 17.42  0  0    3    4 Chrysler Imperial Chrysler Imperial
Toyota Corona     21.5   4 120.1  97 3.70 2.465 20.01  1  0    3    1     Toyota Corona   Toyota   Corona
Camaro Z28        13.3   8 350.0 245 3.73 3.840 15.41  0  0    3    4        Camaro Z28   Camaro      Z28
Porsche 914-2     26.0   4 120.3  91 4.43 2.140 16.70  0  1    5    2     Porsche 914-2  Porsche    914-2
Volvo 142E        21.4   4 121.0 109 4.11 2.780 18.60  1  1    4    2        Volvo 142E    Volvo     142E
# exactly 8 observations
cars[sample(1:nrow(cars), 8), ]
                   mpg cyl  disp  hp drat   wt  qsec vs am gear carb              name     make      model
Maserati Bora     15.0   8 301.0 335 3.54 3.57 14.60  0  1    5    8     Maserati Bora Maserati       Bora
Dodge Challenger  15.5   8 318.0 150 2.76 3.52 16.87  0  0    3    2  Dodge Challenger    Dodge Challenger
Volvo 142E        21.4   4 121.0 109 4.11 2.78 18.60  1  1    4    2        Volvo 142E    Volvo       142E
Merc 450SE        16.4   8 275.8 180 3.07 4.07 17.40  0  0    3    3        Merc 450SE     Merc      450SE
Merc 280C         17.8   6 167.6 123 3.92 3.44 18.90  1  0    4    4         Merc 280C     Merc       280C
Ford Pantera L    15.8   8 351.0 264 4.22 3.17 14.50  0  1    5    4    Ford Pantera L     Ford  Pantera L
Hornet Sportabout 18.7   8 360.0 175 3.15 3.44 17.02  0  0    3    2 Hornet Sportabout   Hornet Sportabout
Merc 280          19.2   6 167.6 123 3.92 3.44 18.30  1  0    4    4          Merc 280     Merc        280

12.4 Subsetting Exercises

Use the state.x77 dataset for 1 and 2.

  1. Keep all columns except Population.

  2. Drop columns with spaces in their names.

Use the iris dataset for 3 and 4.

  1. Keep only columns that are numeric and start with “S”.

  2. Keep rows where Sepal.Length is divisible by 1 with no remainder. (Hint: see %% in ?Arithmetic.)

Use the airquality dataset for 5 and 6.

  1. Make a subset of the days where the value of Ozone is in the range 30-50.

Bonus:

  1. Find days where the value of Ozone is more than two standard deviations away from the mean, with mean and standard deviation calculated by month. (Hint: If you got zero rows in the output, did you check if Ozone is fully observed?)

  2. Use the mtcars dataset. Create a 3x3 scatterplot matrix (pass a dataframe to plot()) of the mpg, weight, and horsepower of cars that have four or six cylinders, at least four forward gears, and manual transmission. See help(mtcars) to figure out which column is which.