SSCC - Social Science Computing Cooperative Supporting Statistical Analysis for Research

3.8 Selecting rows

Rows are typically selected based on some condition in the data, as opposed to by name, as columns typically are. The filter() function takes a logical (Boolean) condition, evaluated for each row, and keeps only the rows for which it is TRUE. Rows for which it is FALSE or NA are removed (from all columns).
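
As a minimal sketch of this behavior, the toy tibble below is filtered on a condition that evaluates to TRUE, FALSE, and NA. The data and the names toy, id, value, and kept are made up for illustration and are not part of the forbes data set. The sketch assumes the dplyr package is installed.

```r
library(dplyr)

# Hypothetical toy data; not part of the forbes data set.
toy <- tibble(
  id    = 1:4,
  value = c(5, 12, NA, 20)
)

# The condition value > 10 evaluates to FALSE, TRUE, NA, TRUE.
# filter() keeps only the rows where it is TRUE, so the NA row is dropped too.
kept <-
  toy %>%
  filter(
    value > 10
    )

kept
```

Only the rows with id 2 and 4 remain. If rows with a missing value should be kept as well, the condition could be written as is.na(value) | value > 10.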

Examples

  1. Dropping observations (rows).

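    This example uses filter() to create a subset data frame containing the companies ranked 1000 or lower. Because rank 1 is the top of the list, the condition rank >= 1000 keeps the lower-ranked companies; selecting the 1000 largest companies would instead use rank <= 1000.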

    forbes_1000 <-
      forbes %>%
      filter(
        rank >= 1000
        )
    
    forbes_1000 %>%
      head() %>%
      print(n = 10)
    # A tibble: 6 x 14
      name  market_value country  rank category sales profits assets     pe
      <chr>        <dbl> <fct>   <dbl> <chr>    <dbl>   <dbl>  <dbl>  <dbl>
    1 Nort~         2.48 United~  1000 Utiliti~  6.73   0.13   11.0   19.1 
    2 Kore~         1.76 South ~  1001 Utiliti~  6.17   0.25    7.87   7.04
    3 MOL           3.15 Hungary  1002 Oil & g~  5.16   0.290   4.2   10.9 
    4 Firs~         2.36 United~  1003 Insuran~  5.98   0.44    4.28   5.36
    5 Sumi~         4.42 Japan    1004 Diversi~  4.52   0.04   16.9  110.  
    6 Hibe~         3.58 United~  1005 Banking   1.33   0.25   17.6   14.3 
    # ... with 5 more variables: nafta <lgl>, profit_lev <fct>,
    #   industry <chr>, profits_std <dbl>, outlier <lgl>
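  2. Conditional examination of the data.

    Because nafta is already a logical column, it can be passed to filter() directly; no comparison is needed.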

    forbes %>%
      select(name, country, rank, market_value, nafta) %>%
      filter(nafta)
    # A tibble: 824 x 5
       name                country        rank market_value nafta
       <chr>               <fct>         <dbl>        <dbl> <lgl>
     1 Citigroup           United States     1        255.  TRUE 
     2 General Electric    United States     2        329.  TRUE 
     3 American Intl Group United States     3        195.  TRUE 
     4 ExxonMobil          United States     4        277.  TRUE 
     5 Bank of America     United States     6        118.  TRUE 
     6 Fannie Mae          United States     9         76.8 TRUE 
     7 Wal-Mart Stores     United States    10        244.  TRUE 
     8 Berkshire Hathaway  United States    14        141.  TRUE 
     9 JP Morgan Chase     United States    15         81.9 TRUE 
    10 IBM                 United States    16        172.  TRUE 
    # ... with 814 more rows
  3. Conditional proportion

    The filter() function can be used to calculate a proportion conditional on some state of the data.

    Here we will recalculate the proportion of outlier profits conditional on being based in a NAFTA country.

    forbes %>%
      filter(
        nafta
        ) %>%
      summarise(
        outlier_proportion = mean(outlier, na.rm = TRUE)
      )
    # A tibble: 1 x 1
      outlier_proportion
                   <dbl>
    1              0.229
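
The filter-then-summarise pattern from the last example can be sketched on a small made-up tibble. The columns region, size, and flagged are hypothetical stand-ins (region and flagged play the roles of nafta and outlier), and the sketch assumes the dplyr package is installed. It also shows that conditions separated by commas in filter() are combined with a logical AND.

```r
library(dplyr)

# Hypothetical stand-in data, for illustration only.
toy <- tibble(
  region  = c(TRUE, TRUE, TRUE, FALSE, FALSE),
  size    = c(10, 20, 30, 40, 50),
  flagged = c(TRUE, FALSE, FALSE, TRUE, TRUE)
)

# Proportion of flagged rows among the rows where region is TRUE.
# The mean of a logical vector is the proportion of TRUE values.
cond_prop <-
  toy %>%
  filter(
    region
    ) %>%
  summarise(
    flagged_proportion = mean(flagged, na.rm = TRUE)
  )

# Conditions separated by commas must all be TRUE for a row to be kept.
both <-
  toy %>%
  filter(
    region,
    size >= 20
    )
```

Here one of the three rows with region equal to TRUE is flagged, so flagged_proportion is one third, and both keeps the two rows that satisfy both conditions.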