12 Subsetting
When working with the data in a data frame, you will routinely find that you only want to work with a portion of the data. Very often it will be convenient to create a new data frame that is a subset of your original data frame, especially where a function uses a data frame as it’s primary input argument, or where you will be using the same subset repeatedly in several subsequent statements in your script.
We can think of working with a subset of the columns of a data frame, or a subset of the rows. Two common examples are:
- Subset by selecting variables to create a scatterplot matrix.
- Subset by selecting observations for a data frame with no missing values.
12.1 Extractors
We have previously seen that we can index vectors by position, name, or
condition to extract and replace elements. Now we extend this to
data frames. In addition to the square bracket extractors, [ ]
, we
also have two “list” extractors, the dollar sign , $
, and the
doubled square brackets, [[ ]]
. Here it is relevant that a data frame
is a special kind of list.
The square brackets,
[ ]
, allow us to index both the rows and the columns of a data frame. We can index by position, name, or condition.When used with data frames (or matrices), we index by two positions, rows and columns. The square brackets should contain two objects, a vector indexing the rows and a vector indexing the columns. To select all the elements along one dimension, omit that vector but include the comma.
Try these examples:
mtcars["Valiant", "mpg"] # a single element mtcars[ , c("mpg", "wt")] # all rows, two columns mtcars[1:6, ] # rows 1-6, all columns mtcars[ , ] # everything
The next example illustrates indexing rows by position and columns by name.
mtcars[1:6, c("mpg", "wt")] # [rows, columns]
mpg wt Mazda RX4 21.0 2.620 Mazda RX4 Wag 21.0 2.875 Datsun 710 22.8 2.320 Hornet 4 Drive 21.4 3.215 Hornet Sportabout 18.7 3.440 Valiant 18.1 3.460
The dollar sign extractor,
$
, allows us to index a single named element of a list. In a data frame, that would be a column, a single variable. (We’ll use thehead
function to show just the first few values.)head( mtcars$mpg )
[1] 21.0 21.0 22.8 21.4 18.7 18.1
The double brackets,
[[ ]]
, allow us to index a single element of a list by either name or position. This can be useful for programming repeated tasks.head( mtcars[[3]] ) # by position, displacement
[1] 160 160 108 258 360 225
head( mtcars[["disp"]] ) # by name
[1] 160 160 108 258 360 225
In addition to these extractors, we also have a
subset
function that provides a very useful variation on the square bracket extractor.subset(mtcars, subset=carb >=6, # rows select=c("mpg", "wt")) # columns
mpg wt Ferrari Dino 19.7 2.77 Maserati Bora 15.0 3.57
Data for further examples:
cars <- mtcars
# create three character variables
cars$name <- row.names(cars)
cars$make <- sub(" .*$", "", cars$name)
cars$model <- sub("^.*? ", "", cars$name)
12.2 Selecting Columns
Perhaps the most common subset you will use will be pulling a single column from a data frame by name using the dollar sign extractor. This is so common and innocuous you might not even think of it as “subsetting!”
mtcars$mpg
12.2.1 Single name
This can be done with any extractor. The array and list
extractors return a vector, while the subset
function
returns a data frame (we’ll pipe to the head
function to
show just a few rows of the data frame).
library(magrittr)
Attaching package: 'magrittr'
The following object is masked from 'package:purrr':
set_names
The following object is masked from 'package:tidyr':
extract
The following objects are masked from 'package:testthat':
equals, is_less_than, not
# a solitary vector
cars$mpg %>%
head
[1] 21.0 21.0 22.8 21.4 18.7 18.1
cars[, "mpg"] %>%
head
[1] 21.0 21.0 22.8 21.4 18.7 18.1
cars[["mpg"]] %>%
head
[1] 21.0 21.0 22.8 21.4 18.7 18.1
# subset() returns a data.frame
subset(cars, select="mpg") %>%
head
mpg
Mazda RX4 21.0
Mazda RX4 Wag 21.0
Datsun 710 22.8
Hornet 4 Drive 21.4
Hornet Sportabout 18.7
Valiant 18.1
12.2.2 Arbitrary names
Very commonly you will want to put all of the variables you are using for a project into a single data frame, selecting a subset of columns using an arbitrary vector of names.
Note that the column names are specified as a character vector.
head( cars[, c("mpg", "disp")] )
mpg disp
Mazda RX4 21.0 160
Mazda RX4 Wag 21.0 160
Datsun 710 22.8 108
Hornet 4 Drive 21.4 258
Hornet Sportabout 18.7 360
Valiant 18.1 225
head( subset(cars, select=c("mpg", "disp")) )
mpg disp
Mazda RX4 21.0 160
Mazda RX4 Wag 21.0 160
Datsun 710 22.8 108
Hornet 4 Drive 21.4 258
Hornet Sportabout 18.7 360
Valiant 18.1 225
12.2.3 Name prefix, other name wildcards
It is fairly common for groups of related variables to have
similar names with a common prefix or a common suffix, and
often we want to extract these as a group. Perhaps the
data have repeated observations of household income across
multiple years, recorded as hhinc2005
, hhinc2010
,
hhinc2015
. In base R we use regular expressions
to find multiple matching names, and convert those to
column positions.
# starts with "m"
head( cars[ , grep("^m" , names(cars))] )
mpg make model
Mazda RX4 21.0 Mazda RX4
Mazda RX4 Wag 21.0 Mazda RX4 Wag
Datsun 710 22.8 Datsun 710
Hornet 4 Drive 21.4 Hornet 4 Drive
Hornet Sportabout 18.7 Hornet Sportabout
Valiant 18.1 Valiant Valiant
# starts with "m" or "c"
head( cars[ , grep("^[mc]", names(cars))] )
mpg cyl carb make model
Mazda RX4 21.0 6 4 Mazda RX4
Mazda RX4 Wag 21.0 6 4 Mazda RX4 Wag
Datsun 710 22.8 4 1 Datsun 710
Hornet 4 Drive 21.4 6 1 Hornet 4 Drive
Hornet Sportabout 18.7 8 2 Hornet Sportabout
Valiant 18.1 6 1 Valiant Valiant
12.2.4 Position range
Less commonly you may extract columns from a data frame by position. This is easy to use when there are not too many columns in the original data frame, harder to use where there are a lot of columns, or where the data have been restructured as with a merge or reshape.
head(cars[, 1:5])
mpg cyl disp hp drat
Mazda RX4 21.0 6 160 110 3.90
Mazda RX4 Wag 21.0 6 160 110 3.90
Datsun 710 22.8 4 108 93 3.85
Hornet 4 Drive 21.4 6 258 110 3.08
Hornet Sportabout 18.7 8 360 175 3.15
Valiant 18.1 6 225 105 2.76
12.2.5 Column Type
Occasionally you will want to work with all of the variables of a certain type - perhaps to create a scatterplot matrix of all numeric variables, to create a table of means and standard deviations of continuous variables, or to create frequency tables for all factor variables.
To accomplish this task we specify the variables to use by
condition. In order to check the type of each variable, we
use the sapply
function (one of several apply
functions
used for repeated tasks).
# a variation of lapply
sapply(cars, is.character)
mpg cyl disp hp drat wt qsec vs am gear carb name make model
FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE TRUE TRUE TRUE
cars[,sapply(cars, is.character)] %>%
head
name make model
Mazda RX4 Mazda RX4 Mazda RX4
Mazda RX4 Wag Mazda RX4 Wag Mazda RX4 Wag
Datsun 710 Datsun 710 Datsun 710
Hornet 4 Drive Hornet 4 Drive Hornet 4 Drive
Hornet Sportabout Hornet Sportabout Hornet Sportabout
Valiant Valiant Valiant Valiant
subset(cars,
select=sapply(cars, is.character)) %>%
head
name make model
Mazda RX4 Mazda RX4 Mazda RX4
Mazda RX4 Wag Mazda RX4 Wag Mazda RX4 Wag
Datsun 710 Datsun 710 Datsun 710
Hornet 4 Drive Hornet 4 Drive Hornet 4 Drive
Hornet Sportabout Hornet Sportabout Hornet Sportabout
Valiant Valiant Valiant Valiant
12.2.6 Name range
The idea here is that we want all of the variables that are
positioned together in the data frame, for example
from mpg
to disp
inclusive.
To accomplish this in base R, we need to figure out what position
mpg
and disp
occupy in the data frame. Given these
positions, we convert a name range into a position range.
varnames <- names(cars)
# an "exact" or "fixed" regular expression
grep("mpg", varnames)
[1] 1
grep("disp", varnames)
[1] 3
# as one statement
cars[, grep("mpg", names(cars)):grep("disp", names(cars))]
mpg cyl disp
Mazda RX4 21.0 6 160.0
Mazda RX4 Wag 21.0 6 160.0
Datsun 710 22.8 4 108.0
Hornet 4 Drive 21.4 6 258.0
Hornet Sportabout 18.7 8 360.0
Valiant 18.1 6 225.0
Duster 360 14.3 8 360.0
Merc 240D 24.4 4 146.7
Merc 230 22.8 4 140.8
Merc 280 19.2 6 167.6
Merc 280C 17.8 6 167.6
Merc 450SE 16.4 8 275.8
Merc 450SL 17.3 8 275.8
Merc 450SLC 15.2 8 275.8
Cadillac Fleetwood 10.4 8 472.0
Lincoln Continental 10.4 8 460.0
Chrysler Imperial 14.7 8 440.0
Fiat 128 32.4 4 78.7
Honda Civic 30.4 4 75.7
Toyota Corolla 33.9 4 71.1
Toyota Corona 21.5 4 120.1
Dodge Challenger 15.5 8 318.0
AMC Javelin 15.2 8 304.0
Camaro Z28 13.3 8 350.0
Pontiac Firebird 19.2 8 400.0
Fiat X1-9 27.3 4 79.0
Porsche 914-2 26.0 4 120.3
Lotus Europa 30.4 4 95.1
Ford Pantera L 15.8 8 351.0
Ferrari Dino 19.7 6 145.0
Maserati Bora 15.0 8 301.0
Volvo 142E 21.4 4 121.0
12.3 Subsetting Rows
12.3.1 most often by condition
# note the repeated "cars$... prefix"
cars[cars$am==0 & cars$mpg > 20, ]
mpg cyl disp hp drat wt qsec vs am gear carb name make model
Hornet 4 Drive 21.4 6 258.0 110 3.08 3.215 19.44 1 0 3 1 Hornet 4 Drive Hornet 4 Drive
Merc 240D 24.4 4 146.7 62 3.69 3.190 20.00 1 0 4 2 Merc 240D Merc 240D
Merc 230 22.8 4 140.8 95 3.92 3.150 22.90 1 0 4 2 Merc 230 Merc 230
Toyota Corona 21.5 4 120.1 97 3.70 2.465 20.01 1 0 3 1 Toyota Corona Toyota Corona
# with subset() you don't need to use "cars$..."
subset(cars, am==0 & mpg > 20)
mpg cyl disp hp drat wt qsec vs am gear carb name make model
Hornet 4 Drive 21.4 6 258.0 110 3.08 3.215 19.44 1 0 3 1 Hornet 4 Drive Hornet 4 Drive
Merc 240D 24.4 4 146.7 62 3.69 3.190 20.00 1 0 4 2 Merc 240D Merc 240D
Merc 230 22.8 4 140.8 95 3.92 3.150 22.90 1 0 4 2 Merc 230 Merc 230
Toyota Corona 21.5 4 120.1 97 3.70 2.465 20.01 1 0 3 1 Toyota Corona Toyota Corona
# this is how the subset option alongside a formula works, as well.
12.3.2 by name
Sometimes people use ID values as row names.
cars["Merc 230",]
mpg cyl disp hp drat wt qsec vs am gear carb name make model
Merc 230 22.8 4 140.8 95 3.92 3.15 22.9 1 0 4 2 Merc 230 Merc 230
12.3.3 by position
This is occasionally useful, especially to create data sets for testing and debugging your code.
cars[1:5,]
mpg cyl disp hp drat wt qsec vs am gear carb name make model
Mazda RX4 21.0 6 160 110 3.90 2.620 16.46 0 1 4 4 Mazda RX4 Mazda RX4
Mazda RX4 Wag 21.0 6 160 110 3.90 2.875 17.02 0 1 4 4 Mazda RX4 Wag Mazda RX4 Wag
Datsun 710 22.8 4 108 93 3.85 2.320 18.61 1 1 4 1 Datsun 710 Datsun 710
Hornet 4 Drive 21.4 6 258 110 3.08 3.215 19.44 1 0 3 1 Hornet 4 Drive Hornet 4 Drive
Hornet Sportabout 18.7 8 360 175 3.15 3.440 17.02 0 0 3 2 Hornet Sportabout Hornet Sportabout
12.3.4 random observations
# approximately 25% of our data
cars[runif(nrow(cars)) < 0.25,] # approximately 8 observations
mpg cyl disp hp drat wt qsec vs am gear carb name make model
Mazda RX4 21.0 6 160.0 110 3.90 2.620 16.46 0 1 4 4 Mazda RX4 Mazda RX4
Duster 360 14.3 8 360.0 245 3.21 3.570 15.84 0 0 3 4 Duster 360 Duster 360
Merc 280 19.2 6 167.6 123 3.92 3.440 18.30 1 0 4 4 Merc 280 Merc 280
Merc 450SE 16.4 8 275.8 180 3.07 4.070 17.40 0 0 3 3 Merc 450SE Merc 450SE
Merc 450SL 17.3 8 275.8 180 3.07 3.730 17.60 0 0 3 3 Merc 450SL Merc 450SL
Fiat 128 32.4 4 78.7 66 4.08 2.200 19.47 1 1 4 1 Fiat 128 Fiat 128
Honda Civic 30.4 4 75.7 52 4.93 1.615 18.52 1 1 4 2 Honda Civic Honda Civic
Porsche 914-2 26.0 4 120.3 91 4.43 2.140 16.70 0 1 5 2 Porsche 914-2 Porsche 914-2
Ferrari Dino 19.7 6 145.0 175 3.62 2.770 15.50 0 1 5 6 Ferrari Dino Ferrari Dino
# exactly 8 observations
cars[sample(1:nrow(cars), 8), ]
mpg cyl disp hp drat wt qsec vs am gear carb name make model
Chrysler Imperial 14.7 8 440.0 230 3.23 5.345 17.42 0 0 3 4 Chrysler Imperial Chrysler Imperial
AMC Javelin 15.2 8 304.0 150 3.15 3.435 17.30 0 0 3 2 AMC Javelin AMC Javelin
Merc 280C 17.8 6 167.6 123 3.92 3.440 18.90 1 0 4 4 Merc 280C Merc 280C
Merc 450SL 17.3 8 275.8 180 3.07 3.730 17.60 0 0 3 3 Merc 450SL Merc 450SL
Fiat X1-9 27.3 4 79.0 66 4.08 1.935 18.90 1 1 4 1 Fiat X1-9 Fiat X1-9
Camaro Z28 13.3 8 350.0 245 3.73 3.840 15.41 0 0 3 4 Camaro Z28 Camaro Z28
Dodge Challenger 15.5 8 318.0 150 2.76 3.520 16.87 0 0 3 2 Dodge Challenger Dodge Challenger
Honda Civic 30.4 4 75.7 52 4.93 1.615 18.52 1 1 4 2 Honda Civic Honda Civic
12.4 Subsetting Exercises
Use the state.x77
dataset for 1 and 2.
Keep all columns except
Population
.Drop columns with spaces in their names.
Use the iris
dataset for 3 and 4.
Keep only columns that are numeric and start with “S”.
Keep rows where
Sepal.Length
is divisible by 1 with no remainder. (Hint: see%%
in?Arithmetic
.)
Use the airquality
dataset for 5 and 6.
- Make a subset of the days where the value of
Ozone
is in the range 30-50.
Bonus:
Find days where the value of
Ozone
is more than two standard deviations away from the mean, with mean and standard deviation calculated by month. (Hint: If you got zero rows in the output, did you check if Ozone is fully observed?)Use the
mtcars
dataset. Create a 3x3 scatterplot matrix (pass a dataframe toplot()
) of the mpg, weight, and horsepower of cars that have four or six cylinders, at least four forward gears, and manual transmission. Seehelp(mtcars)
to figure out which column is which.