Supporting Statistical Analysis for Research
3.8 Selecting rows
Rows are typically selected based on some condition in the data, as opposed to by name, as columns typically are.
The filter() function takes a Boolean (logical) condition and removes the rows (across all columns) for which the condition is FALSE, keeping only the rows for which it is TRUE.
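As a minimal sketch of this behavior, the following uses a small made-up tibble (the names toy, id, and value are hypothetical and are not part of the forbes data). It also shows that rows where the condition evaluates to NA are dropped along with the FALSE rows.

```r
library(dplyr)

# Hypothetical toy data; not the forbes data set used in this chapter.
toy <- tibble(
  id    = 1:5,
  value = c(10, 25, NA, 40, 5)
)

# The condition is evaluated once per row and yields TRUE, FALSE, or NA.
keep <- toy$value > 8
keep

# filter() keeps the rows where the condition is TRUE;
# rows where it is FALSE or NA are dropped.
kept <- toy %>% filter(value > 8)
kept
```

If rows with a missing value should be kept as well, the condition has to say so explicitly, for example filter(value > 8 | is.na(value)).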
Examples
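Dropping observations (rows).
This example uses filter() to create a subset data frame of the forbes companies. The condition rank >= 1000 keeps the companies ranked 1000th and beyond, as the output shows; selecting the 1000 largest companies would instead use rank <= 1000.
forbes_1000 <- forbes %>%
  filter(rank >= 1000)
forbes_1000 %>%
  head() %>%
  print(n = 10)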
# A tibble: 6 x 14
  name  market_value country  rank category sales profits assets    pe
  <chr>        <dbl> <fct>   <dbl> <chr>    <dbl>   <dbl>  <dbl> <dbl>
1 Nort~         2.48 United~  1000 Utiliti~  6.73   0.13   11.0  19.1
2 Kore~         1.76 South ~  1001 Utiliti~  6.17   0.25    7.87  7.04
3 MOL           3.15 Hungary  1002 Oil & g~  5.16   0.290   4.2  10.9
4 Firs~         2.36 United~  1003 Insuran~  5.98   0.44    4.28  5.36
5 Sumi~         4.42 Japan    1004 Diversi~  4.52   0.04   16.9 110.
6 Hibe~         3.58 United~  1005 Banking   1.33   0.25   17.6  14.3
# ... with 5 more variables: nafta <lgl>, profit_lev <fct>,
#   industry <chr>, profits_std <dbl>, outlier <lgl>
Conditional examination of the data.
forbes %>% select(name, country, rank, market_value, nafta) %>% filter(nafta)
# A tibble: 824 x 5
   name                country        rank market_value nafta
   <chr>               <fct>         <dbl>        <dbl> <lgl>
 1 Citigroup           United States     1        255.  TRUE
 2 General Electric    United States     2        329.  TRUE
 3 American Intl Group United States     3        195.  TRUE
 4 ExxonMobil          United States     4        277.  TRUE
 5 Bank of America     United States     6        118.  TRUE
 6 Fannie Mae          United States     9         76.8 TRUE
 7 Wal-Mart Stores     United States    10        244.  TRUE
 8 Berkshire Hathaway  United States    14        141.  TRUE
 9 JP Morgan Chase     United States    15         81.9 TRUE
10 IBM                 United States    16        172.  TRUE
# ... with 814 more rows
Conditional proportion
The filter() function can be used to calculate a proportion conditional on some state of the data.
Here we will recalculate the proportion of outlier profits conditional on being based in a NAFTA country.
forbes %>% filter( nafta ) %>% summarise( outlier_proportion = mean(outlier, na.rm = TRUE) )
# A tibble: 1 x 1
  outlier_proportion
               <dbl>
1              0.229
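The calculation above works because the mean of a logical vector is the share of TRUE values among the non-missing entries. A small sketch on made-up data (the tibble toy and the column by_hand are hypothetical, and the values are not taken from the forbes data) compares mean() with the same proportion computed by hand:

```r
library(dplyr)

# Hypothetical toy data standing in for the nafta and outlier columns.
toy <- tibble(
  nafta   = c(TRUE, TRUE, TRUE, TRUE, FALSE, FALSE),
  outlier = c(TRUE, FALSE, FALSE, NA, TRUE, TRUE)
)

result <- toy %>%
  filter(nafta) %>%                 # condition on the NAFTA rows only
  summarise(
    # mean() of a logical treats TRUE as 1 and FALSE as 0
    outlier_proportion = mean(outlier, na.rm = TRUE),
    # the same proportion: count of TRUE over count of non-missing values
    by_hand = sum(outlier, na.rm = TRUE) / sum(!is.na(outlier))
  )
result
```

Note that na.rm = TRUE removes the missing values from the denominator as well as the numerator, so the proportion is taken over the rows with a known outlier status.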