# 12 Subsetting

When working with the data in a data frame, you will routinely find that you only want to work with a portion of the data. Very often it will be convenient to create a new data frame that is a subset of your original data frame, especially where a function uses a data frame as it’s primary input argument, or where you will be using the same subset repeatedly in several subsequent statements in your script.

We can think of working with a subset of the columns of a data frame, or a subset of the rows. Two common examples are:

• Subset by selecting variables to create a scatterplot matrix.
• Subset by selecting observations for a data frame with no missing values.

## 12.1 Extractors

We have previously seen that we can index vectors by position, name, or condition to extract and replace elements. Now we extend this to data frames. In addition to the square bracket extractors, [ ], we also have two “list” extractors, the dollar sign , $, and the doubled square brackets, [[ ]]. Here it is relevant that a data frame is a special kind of list. • The square brackets, [ ], allow us to index both the rows and the columns of a data frame. We can index by position, name, or condition. When used with data frames (or matrices), we index by two positions, rows and columns. The square brackets should contain two objects, a vector indexing the rows and a vector indexing the columns. To select all the elements along one dimension, omit that vector but include the comma. Try these examples: mtcars["Valiant", "mpg"] # a single element mtcars[ , c("mpg", "wt")] # all rows, two columns mtcars[1:6, ] # rows 1-6, all columns mtcars[ , ] # everything The next example illustrates indexing rows by position and columns by name. mtcars[1:6, c("mpg", "wt")] # [rows, columns] mpg wt Mazda RX4 21.0 2.620 Mazda RX4 Wag 21.0 2.875 Datsun 710 22.8 2.320 Hornet 4 Drive 21.4 3.215 Hornet Sportabout 18.7 3.440 Valiant 18.1 3.460 • The dollar sign extractor,$, allows us to index a single named element of a list. In a data frame, that would be a column, a single variable. (We’ll use the head function to show just the first few values.)

head( mtcars$mpg ) [1] 21.0 21.0 22.8 21.4 18.7 18.1 • The double brackets, [[ ]], allow us to index a single element of a list by either name or position. This can be useful for programming repeated tasks. head( mtcars[[3]] ) # by position, displacement [1] 160 160 108 258 360 225 head( mtcars[["disp"]] ) # by name [1] 160 160 108 258 360 225 • In addition to these extractors, we also have a subset function that provides a very useful variation on the square bracket extractor. subset(mtcars, subset=carb >=6, # rows select=c("mpg", "wt")) # columns mpg wt Ferrari Dino 19.7 2.77 Maserati Bora 15.0 3.57 Data for further examples: cars <- mtcars # create three character variables cars$name <- row.names(cars)
cars$make <- sub(" .*$", "", cars$name) cars$model <- sub("^.*? ", "", cars$name) ## 12.2 Selecting Columns Perhaps the most common subset you will use will be pulling a single column from a data frame by name using the dollar sign extractor. This is so common and innocuous you might not even think of it as “subsetting!” mtcars$mpg

### 12.2.1 Single name

This can be done with any extractor. The array and list extractors return a vector, while the subset function returns a data frame (we’ll pipe to the head function to show just a few rows of the data frame).

library(magrittr)

# a solitary vector
cars$mpg %>% head [1] 21.0 21.0 22.8 21.4 18.7 18.1 cars[, "mpg"] %>% head [1] 21.0 21.0 22.8 21.4 18.7 18.1 cars[["mpg"]] %>% head [1] 21.0 21.0 22.8 21.4 18.7 18.1 # subset() returns a data.frame subset(cars, select="mpg") %>% head mpg Mazda RX4 21.0 Mazda RX4 Wag 21.0 Datsun 710 22.8 Hornet 4 Drive 21.4 Hornet Sportabout 18.7 Valiant 18.1 ### 12.2.2 Arbitrary names Very commonly you will want to put all of the variables you are using for a project into a single data frame, selecting a subset of columns using an arbitrary vector of names. Note that the column names are specified as a character vector. head( cars[, c("mpg", "disp")] ) mpg disp Mazda RX4 21.0 160 Mazda RX4 Wag 21.0 160 Datsun 710 22.8 108 Hornet 4 Drive 21.4 258 Hornet Sportabout 18.7 360 Valiant 18.1 225 head( subset(cars, select=c("mpg", "disp")) ) mpg disp Mazda RX4 21.0 160 Mazda RX4 Wag 21.0 160 Datsun 710 22.8 108 Hornet 4 Drive 21.4 258 Hornet Sportabout 18.7 360 Valiant 18.1 225 ### 12.2.3 Name prefix, other name wildcards It is fairly common for groups of related variables to have similar names with a common prefix or a common suffix, and often we want to extract these as a group. Perhaps the data have repeated observations of household income across multiple years, recorded as hhinc2005, hhinc2010, hhinc2015. In base R we use regular expressions to find multiple matching names, and convert those to column positions. # starts with "m" head( cars[ , grep("^m" , names(cars))] ) mpg make model Mazda RX4 21.0 Mazda RX4 Mazda RX4 Wag 21.0 Mazda RX4 Wag Datsun 710 22.8 Datsun 710 Hornet 4 Drive 21.4 Hornet 4 Drive Hornet Sportabout 18.7 Hornet Sportabout Valiant 18.1 Valiant Valiant # starts with "m" or "c" head( cars[ , grep("^[mc]", names(cars))] ) mpg cyl carb make model Mazda RX4 21.0 6 4 Mazda RX4 Mazda RX4 Wag 21.0 6 4 Mazda RX4 Wag Datsun 710 22.8 4 1 Datsun 710 Hornet 4 Drive 21.4 6 1 Hornet 4 Drive Hornet Sportabout 18.7 8 2 Hornet Sportabout Valiant 18.1 6 1 Valiant Valiant ### 12.2.4 Position range Less commonly you may extract columns from a data frame by position. This is easy to use when there are not too many columns in the original data frame, harder to use where there are a lot of columns, or where the data have been restructured as with a merge or reshape. head(cars[, 1:5]) mpg cyl disp hp drat Mazda RX4 21.0 6 160 110 3.90 Mazda RX4 Wag 21.0 6 160 110 3.90 Datsun 710 22.8 4 108 93 3.85 Hornet 4 Drive 21.4 6 258 110 3.08 Hornet Sportabout 18.7 8 360 175 3.15 Valiant 18.1 6 225 105 2.76 ### 12.2.5 Column Type Occasionally you will want to work with all of the variables of a certain type - perhaps to create a scatterplot matrix of all numeric variables, to create a table of means and standard deviations of continuous variables, or to create frequency tables for all factor variables. To accomplish this task we specify the variables to use by condition. In order to check the type of each variable, we use the sapply function (one of several apply functions used for repeated tasks). # a variation of lapply sapply(cars, is.character) mpg cyl disp hp drat wt qsec vs am gear carb name make model FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE TRUE TRUE TRUE cars[,sapply(cars, is.character)] %>% head name make model Mazda RX4 Mazda RX4 Mazda RX4 Mazda RX4 Wag Mazda RX4 Wag Mazda RX4 Wag Datsun 710 Datsun 710 Datsun 710 Hornet 4 Drive Hornet 4 Drive Hornet 4 Drive Hornet Sportabout Hornet Sportabout Hornet Sportabout Valiant Valiant Valiant Valiant subset(cars, select=sapply(cars, is.character)) %>% head name make model Mazda RX4 Mazda RX4 Mazda RX4 Mazda RX4 Wag Mazda RX4 Wag Mazda RX4 Wag Datsun 710 Datsun 710 Datsun 710 Hornet 4 Drive Hornet 4 Drive Hornet 4 Drive Hornet Sportabout Hornet Sportabout Hornet Sportabout Valiant Valiant Valiant Valiant ### 12.2.6 Name range The idea here is that we want all of the variables that are positioned together in the data frame, for example from mpg to disp inclusive. To accomplish this in base R, we need to figure out what position mpg and disp occupy in the data frame. Given these positions, we convert a name range into a position range. varnames <- names(cars) # an "exact" or "fixed" regular expression grep("mpg", varnames) [1] 1 grep("disp", varnames) [1] 3 # as one statement cars[, grep("mpg", names(cars)):grep("disp", names(cars))] mpg cyl disp Mazda RX4 21.0 6 160.0 Mazda RX4 Wag 21.0 6 160.0 Datsun 710 22.8 4 108.0 Hornet 4 Drive 21.4 6 258.0 Hornet Sportabout 18.7 8 360.0 Valiant 18.1 6 225.0 Duster 360 14.3 8 360.0 Merc 240D 24.4 4 146.7 Merc 230 22.8 4 140.8 Merc 280 19.2 6 167.6 Merc 280C 17.8 6 167.6 Merc 450SE 16.4 8 275.8 Merc 450SL 17.3 8 275.8 Merc 450SLC 15.2 8 275.8 Cadillac Fleetwood 10.4 8 472.0 Lincoln Continental 10.4 8 460.0 Chrysler Imperial 14.7 8 440.0 Fiat 128 32.4 4 78.7 Honda Civic 30.4 4 75.7 Toyota Corolla 33.9 4 71.1 Toyota Corona 21.5 4 120.1 Dodge Challenger 15.5 8 318.0 AMC Javelin 15.2 8 304.0 Camaro Z28 13.3 8 350.0 Pontiac Firebird 19.2 8 400.0 Fiat X1-9 27.3 4 79.0 Porsche 914-2 26.0 4 120.3 Lotus Europa 30.4 4 95.1 Ford Pantera L 15.8 8 351.0 Ferrari Dino 19.7 6 145.0 Maserati Bora 15.0 8 301.0 Volvo 142E 21.4 4 121.0 ## 12.3 Subsetting Rows ### 12.3.1 most often by condition # note the repeated "cars$... prefix"
cars[cars$am==0 & cars$mpg > 20, ]
mpg cyl  disp  hp drat    wt  qsec vs am gear carb           name   make   model
Hornet 4 Drive 21.4   6 258.0 110 3.08 3.215 19.44  1  0    3    1 Hornet 4 Drive Hornet 4 Drive
Merc 240D      24.4   4 146.7  62 3.69 3.190 20.00  1  0    4    2      Merc 240D   Merc    240D
Merc 230       22.8   4 140.8  95 3.92 3.150 22.90  1  0    4    2       Merc 230   Merc     230
Toyota Corona  21.5   4 120.1  97 3.70 2.465 20.01  1  0    3    1  Toyota Corona Toyota  Corona
# with subset() you don't need to use "cars\$..."
subset(cars, am==0 & mpg > 20)
mpg cyl  disp  hp drat    wt  qsec vs am gear carb           name   make   model
Hornet 4 Drive 21.4   6 258.0 110 3.08 3.215 19.44  1  0    3    1 Hornet 4 Drive Hornet 4 Drive
Merc 240D      24.4   4 146.7  62 3.69 3.190 20.00  1  0    4    2      Merc 240D   Merc    240D
Merc 230       22.8   4 140.8  95 3.92 3.150 22.90  1  0    4    2       Merc 230   Merc     230
Toyota Corona  21.5   4 120.1  97 3.70 2.465 20.01  1  0    3    1  Toyota Corona Toyota  Corona
# this is how the subset option alongside a formula works, as well.

### 12.3.2 by name

Sometimes people use ID values as row names.

cars["Merc 230",]
mpg cyl  disp hp drat   wt qsec vs am gear carb     name make model
Merc 230 22.8   4 140.8 95 3.92 3.15 22.9  1  0    4    2 Merc 230 Merc   230

### 12.3.3 by position

This is occasionally useful, especially to create data sets for testing and debugging your code.

cars[1:5,]
mpg cyl disp  hp drat    wt  qsec vs am gear carb              name   make      model
Mazda RX4         21.0   6  160 110 3.90 2.620 16.46  0  1    4    4         Mazda RX4  Mazda        RX4
Mazda RX4 Wag     21.0   6  160 110 3.90 2.875 17.02  0  1    4    4     Mazda RX4 Wag  Mazda    RX4 Wag
Datsun 710        22.8   4  108  93 3.85 2.320 18.61  1  1    4    1        Datsun 710 Datsun        710
Hornet 4 Drive    21.4   6  258 110 3.08 3.215 19.44  1  0    3    1    Hornet 4 Drive Hornet    4 Drive
Hornet Sportabout 18.7   8  360 175 3.15 3.440 17.02  0  0    3    2 Hornet Sportabout Hornet Sportabout

### 12.3.4 random observations

# approximately 25% of our data
cars[runif(nrow(cars)) < 0.25,] # approximately 8 observations
mpg cyl  disp  hp drat    wt  qsec vs am gear carb              name     make    model
Mazda RX4 Wag     21.0   6 160.0 110 3.90 2.875 17.02  0  1    4    4     Mazda RX4 Wag    Mazda  RX4 Wag
Datsun 710        22.8   4 108.0  93 3.85 2.320 18.61  1  1    4    1        Datsun 710   Datsun      710
Hornet 4 Drive    21.4   6 258.0 110 3.08 3.215 19.44  1  0    3    1    Hornet 4 Drive   Hornet  4 Drive
Chrysler Imperial 14.7   8 440.0 230 3.23 5.345 17.42  0  0    3    4 Chrysler Imperial Chrysler Imperial
Toyota Corona     21.5   4 120.1  97 3.70 2.465 20.01  1  0    3    1     Toyota Corona   Toyota   Corona
Camaro Z28        13.3   8 350.0 245 3.73 3.840 15.41  0  0    3    4        Camaro Z28   Camaro      Z28
Porsche 914-2     26.0   4 120.3  91 4.43 2.140 16.70  0  1    5    2     Porsche 914-2  Porsche    914-2
Volvo 142E        21.4   4 121.0 109 4.11 2.780 18.60  1  1    4    2        Volvo 142E    Volvo     142E
# exactly 8 observations
cars[sample(1:nrow(cars), 8), ]
mpg cyl  disp  hp drat   wt  qsec vs am gear carb              name     make      model
Maserati Bora     15.0   8 301.0 335 3.54 3.57 14.60  0  1    5    8     Maserati Bora Maserati       Bora
Dodge Challenger  15.5   8 318.0 150 2.76 3.52 16.87  0  0    3    2  Dodge Challenger    Dodge Challenger
Volvo 142E        21.4   4 121.0 109 4.11 2.78 18.60  1  1    4    2        Volvo 142E    Volvo       142E
Merc 450SE        16.4   8 275.8 180 3.07 4.07 17.40  0  0    3    3        Merc 450SE     Merc      450SE
Merc 280C         17.8   6 167.6 123 3.92 3.44 18.90  1  0    4    4         Merc 280C     Merc       280C
Ford Pantera L    15.8   8 351.0 264 4.22 3.17 14.50  0  1    5    4    Ford Pantera L     Ford  Pantera L
Hornet Sportabout 18.7   8 360.0 175 3.15 3.44 17.02  0  0    3    2 Hornet Sportabout   Hornet Sportabout
Merc 280          19.2   6 167.6 123 3.92 3.44 18.30  1  0    4    4          Merc 280     Merc        280

## 12.4 Subsetting Exercises

Use the state.x77 dataset for 1 and 2.

1. Keep all columns except Population.

2. Drop columns with spaces in their names.

Use the iris dataset for 3 and 4.

1. Keep only columns that are numeric and start with “S”.

2. Keep rows where Sepal.Length is divisible by 1 with no remainder. (Hint: see %% in ?Arithmetic.)

Use the airquality dataset for 5 and 6.

1. Make a subset of the days where the value of Ozone is in the range 30-50.

Bonus:

1. Find days where the value of Ozone is more than two standard deviations away from the mean, with mean and standard deviation calculated by month. (Hint: If you got zero rows in the output, did you check if Ozone is fully observed?)

2. Use the mtcars dataset. Create a 3x3 scatterplot matrix (pass a dataframe to plot()) of the mpg, weight, and horsepower of cars that have four or six cylinders, at least four forward gears, and manual transmission. See help(mtcars) to figure out which column is which.