Supporting Statistical Analysis for Research

## 3.8 Selecting rows

Rows are typically selected based on some condition in the data, as opposed to by name, as columns typically are. The filter() function takes a Boolean (logical) condition, keeps the rows for which it is TRUE, and removes the rows (across all columns) for which it is FALSE.
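
As a minimal sketch of this behavior, the following uses a small made-up tibble (the names toy, id, and value are hypothetical and not part of the forbes data). It also illustrates that rows where the condition evaluates to NA are dropped along with the FALSE rows.

```r
library(dplyr)

# Hypothetical toy data; not the forbes data set used in this chapter.
toy <- tibble(
  id    = 1:5,
  value = c(10, NA, 3, 8, 1)
)

# The condition evaluates to one logical value per row:
# TRUE, NA, FALSE, TRUE, FALSE
toy$value > 5

# filter() keeps the rows where the condition is TRUE.
# Rows where it is FALSE or NA are dropped.
kept <- toy %>%
  filter(value > 5)
```

Here kept contains only the rows with id 1 and 4.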

Examples

1. Dropping observations (rows).

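This example uses filter() to create a subset data frame containing the companies ranked 1000 or lower, that is, with rank >= 1000. (Keeping the 1000 largest companies would instead use rank <= 1000.)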

forbes_1000 <-
  forbes %>%
  filter(
    rank >= 1000
  )

forbes_1000 %>%
  head()
# A tibble: 6 x 14
  name  market_value country  rank category sales profits assets     pe
  <chr>        <dbl> <fct>   <dbl> <chr>    <dbl>   <dbl>  <dbl>  <dbl>
1 Nort~         2.48 United~  1000 Utiliti~  6.73   0.13   11.0   19.1
2 Kore~         1.76 South ~  1001 Utiliti~  6.17   0.25    7.87   7.04
3 MOL           3.15 Hungary  1002 Oil & g~  5.16   0.290   4.2   10.9
4 Firs~         2.36 United~  1003 Insuran~  5.98   0.44    4.28   5.36
5 Sumi~         4.42 Japan    1004 Diversi~  4.52   0.04   16.9  110.
6 Hibe~         3.58 United~  1005 Banking   1.33   0.25   17.6   14.3
# ... with 5 more variables: nafta <lgl>, profit_lev <fct>,
#   industry <chr>, profits_std <dbl>, outlier <lgl>
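
The direction of the comparison decides which end of the ranking is kept. The sketch below uses a hypothetical five-row ranking (toy_ranks is made up for illustration, with rank 1 as the largest company) to contrast the two directions.

```r
library(dplyr)

# Hypothetical toy ranking; rank 1 is the largest company.
toy_ranks <- tibble(
  name = c("A", "B", "C", "D", "E"),
  rank = 1:5
)

# rank <= 3 keeps the three largest companies (ranks 1 to 3).
top_3 <- toy_ranks %>%
  filter(rank <= 3)

# rank >= 3 keeps the company ranked 3rd and every company below it.
from_3 <- toy_ranks %>%
  filter(rank >= 3)
```

In this sketch top_3 holds companies A, B, and C, while from_3 holds C, D, and E.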
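
2. Conditional examination of the data.

Because nafta is already a logical (TRUE/FALSE) column, it can be passed to filter() directly. This example lists the companies based in a NAFTA country.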

forbes %>%
  select(name, country, rank, market_value, nafta) %>%
  filter(nafta)
# A tibble: 824 x 5
   name                country        rank market_value nafta
   <chr>               <fct>         <dbl>        <dbl> <lgl>
 1 Citigroup           United States     1        255.  TRUE
 2 General Electric    United States     2        329.  TRUE
 3 American Intl Group United States     3        195.  TRUE
 4 ExxonMobil          United States     4        277.  TRUE
 5 Bank of America     United States     6        118.  TRUE
 6 Fannie Mae          United States     9         76.8 TRUE
 7 Wal-Mart Stores     United States    10        244.  TRUE
 8 Berkshire Hathaway  United States    14        141.  TRUE
 9 JP Morgan Chase     United States    15         81.9 TRUE
10 IBM                 United States    16        172.  TRUE
# ... with 814 more rows
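
A logical column can serve as the filter condition on its own, as the following sketch on made-up data suggests (the tibble toy and its values are hypothetical stand-ins for the forbes columns). It also shows the more verbose equivalent and the negated form.

```r
library(dplyr)

# Hypothetical toy data with a logical column, mimicking nafta.
toy <- tibble(
  name    = c("A", "B", "C", "D"),
  country = c("United States", "Japan", "Canada", "Hungary"),
  nafta   = c(TRUE, FALSE, TRUE, FALSE)
)

# A logical column can be the condition on its own.
in_nafta <- toy %>%
  select(name, country, nafta) %>%
  filter(nafta)

# Equivalent, but more verbose.
in_nafta_verbose <- toy %>%
  select(name, country, nafta) %>%
  filter(nafta == TRUE)

# Negate with ! to keep the other rows instead.
not_nafta <- toy %>%
  filter(!nafta)
```

Note that when select() comes before filter(), the filter condition can only use columns that select() kept, which is why nafta is included in the selection.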

3. Conditional proportion.

The filter() function can be used to calculate a proportion conditional on some state of the data.

Here we will recalculate the proportion of outlier profits conditional on being based in a NAFTA country.

forbes %>%
  filter(
    nafta
  ) %>%
  summarise(
    outlier_proportion = mean(outlier, na.rm = TRUE)
  )
# A tibble: 1 x 1
  outlier_proportion
               <dbl>
1              0.229
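
The calculation works because mean() treats TRUE as 1 and FALSE as 0, so the mean of a logical column is the proportion of TRUE values. A small sketch on made-up data (the tibble toy is hypothetical, with values chosen so the result can be checked by hand) compares the conditional and unconditional proportions.

```r
library(dplyr)

# Hypothetical toy data; values chosen so the results are easy to verify.
toy <- tibble(
  nafta   = c(TRUE, TRUE, TRUE, TRUE, FALSE, FALSE),
  outlier = c(TRUE, FALSE, FALSE, NA, TRUE, TRUE)
)

# Among the NAFTA rows, outlier is TRUE, FALSE, FALSE, NA.
# na.rm = TRUE drops the NA, so the conditional proportion is 1 / 3.
nafta_prop <- toy %>%
  filter(nafta) %>%
  summarise(outlier_proportion = mean(outlier, na.rm = TRUE))

# Without filter(), all non-missing rows count: 3 TRUE out of 5, or 0.6.
overall_prop <- toy %>%
  summarise(outlier_proportion = mean(outlier, na.rm = TRUE))
```

Filtering first changes the denominator: only the rows that satisfy the condition, and have a non-missing outlier value, enter the mean.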