Supporting Statistical Analysis for Research

## 3.9 Working with groups

The group_by() function conceptually divides a data frame into a set of data frames. Function applied to a grouped data frame will operate separately on each group of the data. The ungroup() function return a data from from a grouped data frame. The group_by() and ungroup() functions are typically used in pairs. That is, when the work that relies on grouping is complete there should be an ungroup(). An exception to this is when a summarizing operation is done on the data frame, there typically is no need to ungroup().

Examples

1. Calculate proportions for each group

Here we will recalculate the proportion of outlier profits conditional on each level of the nafta variable.

forbes %>%
group_by(
nafta
) %>%
summarise(
outlier_proportion = mean(outlier, na.rm = TRUE)
)
# A tibble: 2 x 2
nafta outlier_proportion
<lgl>              <dbl>
1 FALSE              0.195
2 TRUE               0.229
2. Summarising with multiple grouping variables

nafta_industries <-
forbes %>%
group_by(nafta, industry) %>%
summarise(mean = mean(market_value))

nafta_industries
# A tibble: 8 x 3
# Groups:   nafta [2]
nafta industry  mean
<lgl> <chr>    <dbl>
1 FALSE finance   8.23
2 FALSE oil_gas  23.4
3 FALSE other     8.80
4 FALSE tech     19.8
5 TRUE  finance  14.1
6 TRUE  oil_gas  17.1
7 TRUE  other    11.7
8 TRUE  tech     30.9 

The returned data frame in this example has two columns that identify the sub-group that has been summarized.

3. Find largest values within groups

Here we the largest profits in each industry.

This example makes use of the rank() function. This function returns an order number for each observation based on the assending value of the variable being ranked.

This example also makes use of the arrange() function to sort the data frame.

forbes <-
forbes %>%
group_by(
industry
) %>%
mutate(
profit_rank = rank(desc(profits))
) %>%
ungroup

forbes %>%
filter(
profit_rank <= 5
) %>%
arrange(industry, desc(profits)) %>%
select(name, country, industry, profits, market_value)
# A tibble: 20 x 5
name                country                industry profits market_value
<chr>               <fct>                  <chr>      <dbl>        <dbl>
1 Citigroup           United States          finance    17.8         255.
2 Bank of America     United States          finance    10.8         118.
3 Freddie Mac         United States          finance    10.1          44.2
4 Berkshire Hathaway  United States          finance     6.95        141.
5 HSBC Group          United Kingdom         finance     6.66        178.
6 ExxonMobil          United States          oil_gas    21.0         277.
7 BP                  United Kingdom         oil_gas    10.3         174.
8 Total               France                 oil_gas     8.84        117.
9 Royal Dutch/Shell ~ Netherlands/ United K~ oil_gas     8.4         163.
10 ChevronTexaco       United States          oil_gas     7.43         92.5
11 General Electric    United States          other      15.6         329.
12 Altria Group        United States          other       9.2         111.
13 Wal-Mart Stores     United States          other       9.05        244.
14 Toyota Motor        Japan                  other       7.99        115.
15 SBC Communications  United States          other       5.97         82.9
16 Microsoft           United States          tech        8.88        287.
17 IBM                 United States          tech        7.58        172.
18 Merck & Co          United States          tech        7.33        109.
19 Johnson & Johnson   United States          tech        6.74        161.
20 GlaxoSmithKline     United Kingdom         tech        6.34        125. 

The above code creates a rank variable. Then created the display based on this rank variable. If there is no need for a rank variable, the following code could be used. This code uses the top_n() function to filter on the largest values of a variable.

forbes %>%
group_by(
industry
) %>%
top_n(
n = 5,
wt = profits
) %>%
arrange(industry, desc(profits)) %>%
select(name, country, industry, profits, market_value)
# A tibble: 20 x 5
# Groups:   industry [4]
name                country                industry profits market_value
<chr>               <fct>                  <chr>      <dbl>        <dbl>
1 Citigroup           United States          finance    17.8         255.
2 Bank of America     United States          finance    10.8         118.
3 Freddie Mac         United States          finance    10.1          44.2
4 Berkshire Hathaway  United States          finance     6.95        141.
5 HSBC Group          United Kingdom         finance     6.66        178.
6 ExxonMobil          United States          oil_gas    21.0         277.
7 BP                  United Kingdom         oil_gas    10.3         174.
8 Total               France                 oil_gas     8.84        117.
9 Royal Dutch/Shell ~ Netherlands/ United K~ oil_gas     8.4         163.
10 ChevronTexaco       United States          oil_gas     7.43         92.5
11 General Electric    United States          other      15.6         329.
12 Altria Group        United States          other       9.2         111.
13 Wal-Mart Stores     United States          other       9.05        244.
14 Toyota Motor        Japan                  other       7.99        115.
15 SBC Communications  United States          other       5.97         82.9
16 Microsoft           United States          tech        8.88        287.
17 IBM                 United States          tech        7.58        172.
18 Merck & Co          United States          tech        7.33        109.
19 Johnson & Johnson   United States          tech        6.74        161.
20 GlaxoSmithKline     United Kingdom         tech        6.34        125.