18 Aggregating

Aggregation is the process of turning many datapoints into fewer datapoints, typically in the form of summary statistics. Examples include calculating the total income by family or the mean test score by state.

Load the airquality dataset.

air <- airquality

The tidyverse includes the dplyr package, which has functions for aggregating dataframes. Load the dplyr package now.

(Although most of what we will discuss is easier with dplyr, any of this can be accomplished with base R functions as well. These usually involve several steps, and intermediate data objects that may not be data frames. See the chapter on Aggregating with base R if you are interested in the details.)

library(dplyr)

dplyr’s summarize() function applies a function to the variables in a dataset. A simple use of summarize() is calculating the mean of a single column.

air %>% 
  summarize(TempAvg = mean(Temp))
   TempAvg
1 77.88235

summarize() can also be used in conjunction with across() to apply a function to multiple variables. across() accepts select() syntax, so we can use the functions reviewed in the section on Subsetting by Columns. For now, we can just use everything() to apply the function to all of the columns.

air %>% 
  summarize(across(everything(), 
                   mean, 
                   na.rm = T))
     Ozone  Solar.R     Wind     Temp    Month      Day
1 42.12931 185.9315 9.957516 77.88235 6.993464 15.80392

Our usage of summarize() so far mirrors that of apply(). The summary() function, while less flexible, also accomplishes the task of calculating means alongside several other summary statistics, and it is more concise. In other words, the examples above are intended only to illustrate summarize()’s functionality rather than provide examples of efficient coding.

apply(air, 2, mean, na.rm = T)
     Ozone    Solar.R       Wind       Temp      Month        Day 
 42.129310 185.931507   9.957516  77.882353   6.993464  15.803922 
summary(air)
     Ozone           Solar.R           Wind             Temp      
 Min.   :  1.00   Min.   :  7.0   Min.   : 1.700   Min.   :56.00  
 1st Qu.: 18.00   1st Qu.:115.8   1st Qu.: 7.400   1st Qu.:72.00  
 Median : 31.50   Median :205.0   Median : 9.700   Median :79.00  
 Mean   : 42.13   Mean   :185.9   Mean   : 9.958   Mean   :77.88  
 3rd Qu.: 63.25   3rd Qu.:258.8   3rd Qu.:11.500   3rd Qu.:85.00  
 Max.   :168.00   Max.   :334.0   Max.   :20.700   Max.   :97.00  
 NA's   :37       NA's   :7                                       
     Month            Day      
 Min.   :5.000   Min.   : 1.0  
 1st Qu.:6.000   1st Qu.: 8.0  
 Median :7.000   Median :16.0  
 Mean   :6.993   Mean   :15.8  
 3rd Qu.:8.000   3rd Qu.:23.0  
 Max.   :9.000   Max.   :31.0  
                               

18.1 Summarizing by Grouping Variables

The real power of summarize() exists when it is used to aggregate across grouping variables. We should first add some variables to aggregate along. We can create variables for the day of the week of each observation in airquality, and we can do this in two steps.

We covered Dates in a previous chapter, but it is never too late for a little more practice. The airquality dataset has the month and the day, but not the year. When in doubt, read the documentation. Reading the page for help(airquality), we see that the data is all from 1973. Standardize the date format with the ISO 8601 standard (YYYY-MM-DD).

air <- 
  air %>% 
  mutate(Date = as.Date(paste("1973", Month, Day, sep = "-")))

After each step, it is good to browse the dataset to confirm the function accomplished what you expected. It is important to catch problems early on, especially when you have a lot of wrangling to do. Click on air in your global environment, or run the command View(air) to open the viewer.

It is now possible to calculate which day of the week each datapoint comes from with the weekdays() function. Let’s also add an indicator variable for whether the date is a weekend, 1 for yes and 0 for no, with the help of the ifelse() function.

air <-
  air %>% 
  mutate(DayOfWeek = weekdays(Date),
         Weekend = ifelse(DayOfWeek %in% c("Saturday", "Sunday"), 1, 0))

Now, we can aggregate the dataframe by any number of grouping variables. Let’s calculate some descriptive statistics by the Month and Weekend variables.

group_by() takes column names as arguments. After your dataset is grouped, use summarize() to perform calculations along the grouping variables.

In the example below, the data is grouped by Month and Weekend. TempAvg will be the mean of Temp by each value of Weekend for each value of Month. So, we should get an average temperature for the weekdays in May, the weekends in May, the weekdays in June, the weekends in June, and so on, for every combination of the grouping variables. n() counts the number of observations in each group.

Since the dataframe will be much smaller after summarizing, we can print it to the console.

air_summary <- 
  air %>% 
  group_by(Month, Weekend) %>%
  summarize(TempAvg = mean(Temp),
            TempSD = sd(Temp),
            TempMax = max(Temp),
            TempMin = min(Temp),
            N_days = n())
`summarise()` has grouped output by 'Month'. You can override using the `.groups` argument.
air
    Ozone Solar.R Wind Temp Month Day       Date DayOfWeek Weekend
1      41     190  7.4   67     5   1 1973-05-01   Tuesday       0
2      36     118  8.0   72     5   2 1973-05-02 Wednesday       0
3      12     149 12.6   74     5   3 1973-05-03  Thursday       0
4      18     313 11.5   62     5   4 1973-05-04    Friday       0
5      NA      NA 14.3   56     5   5 1973-05-05  Saturday       1
6      28      NA 14.9   66     5   6 1973-05-06    Sunday       1
7      23     299  8.6   65     5   7 1973-05-07    Monday       0
8      19      99 13.8   59     5   8 1973-05-08   Tuesday       0
9       8      19 20.1   61     5   9 1973-05-09 Wednesday       0
10     NA     194  8.6   69     5  10 1973-05-10  Thursday       0
11      7      NA  6.9   74     5  11 1973-05-11    Friday       0
12     16     256  9.7   69     5  12 1973-05-12  Saturday       1
13     11     290  9.2   66     5  13 1973-05-13    Sunday       1
14     14     274 10.9   68     5  14 1973-05-14    Monday       0
15     18      65 13.2   58     5  15 1973-05-15   Tuesday       0
16     14     334 11.5   64     5  16 1973-05-16 Wednesday       0
17     34     307 12.0   66     5  17 1973-05-17  Thursday       0
18      6      78 18.4   57     5  18 1973-05-18    Friday       0
19     30     322 11.5   68     5  19 1973-05-19  Saturday       1
20     11      44  9.7   62     5  20 1973-05-20    Sunday       1
21      1       8  9.7   59     5  21 1973-05-21    Monday       0
22     11     320 16.6   73     5  22 1973-05-22   Tuesday       0
23      4      25  9.7   61     5  23 1973-05-23 Wednesday       0
24     32      92 12.0   61     5  24 1973-05-24  Thursday       0
25     NA      66 16.6   57     5  25 1973-05-25    Friday       0
26     NA     266 14.9   58     5  26 1973-05-26  Saturday       1
27     NA      NA  8.0   57     5  27 1973-05-27    Sunday       1
28     23      13 12.0   67     5  28 1973-05-28    Monday       0
29     45     252 14.9   81     5  29 1973-05-29   Tuesday       0
30    115     223  5.7   79     5  30 1973-05-30 Wednesday       0
31     37     279  7.4   76     5  31 1973-05-31  Thursday       0
32     NA     286  8.6   78     6   1 1973-06-01    Friday       0
33     NA     287  9.7   74     6   2 1973-06-02  Saturday       1
34     NA     242 16.1   67     6   3 1973-06-03    Sunday       1
35     NA     186  9.2   84     6   4 1973-06-04    Monday       0
36     NA     220  8.6   85     6   5 1973-06-05   Tuesday       0
37     NA     264 14.3   79     6   6 1973-06-06 Wednesday       0
38     29     127  9.7   82     6   7 1973-06-07  Thursday       0
39     NA     273  6.9   87     6   8 1973-06-08    Friday       0
40     71     291 13.8   90     6   9 1973-06-09  Saturday       1
41     39     323 11.5   87     6  10 1973-06-10    Sunday       1
42     NA     259 10.9   93     6  11 1973-06-11    Monday       0
43     NA     250  9.2   92     6  12 1973-06-12   Tuesday       0
44     23     148  8.0   82     6  13 1973-06-13 Wednesday       0
45     NA     332 13.8   80     6  14 1973-06-14  Thursday       0
46     NA     322 11.5   79     6  15 1973-06-15    Friday       0
47     21     191 14.9   77     6  16 1973-06-16  Saturday       1
48     37     284 20.7   72     6  17 1973-06-17    Sunday       1
49     20      37  9.2   65     6  18 1973-06-18    Monday       0
50     12     120 11.5   73     6  19 1973-06-19   Tuesday       0
51     13     137 10.3   76     6  20 1973-06-20 Wednesday       0
52     NA     150  6.3   77     6  21 1973-06-21  Thursday       0
53     NA      59  1.7   76     6  22 1973-06-22    Friday       0
54     NA      91  4.6   76     6  23 1973-06-23  Saturday       1
55     NA     250  6.3   76     6  24 1973-06-24    Sunday       1
56     NA     135  8.0   75     6  25 1973-06-25    Monday       0
57     NA     127  8.0   78     6  26 1973-06-26   Tuesday       0
58     NA      47 10.3   73     6  27 1973-06-27 Wednesday       0
59     NA      98 11.5   80     6  28 1973-06-28  Thursday       0
60     NA      31 14.9   77     6  29 1973-06-29    Friday       0
61     NA     138  8.0   83     6  30 1973-06-30  Saturday       1
62    135     269  4.1   84     7   1 1973-07-01    Sunday       1
63     49     248  9.2   85     7   2 1973-07-02    Monday       0
64     32     236  9.2   81     7   3 1973-07-03   Tuesday       0
65     NA     101 10.9   84     7   4 1973-07-04 Wednesday       0
66     64     175  4.6   83     7   5 1973-07-05  Thursday       0
67     40     314 10.9   83     7   6 1973-07-06    Friday       0
68     77     276  5.1   88     7   7 1973-07-07  Saturday       1
69     97     267  6.3   92     7   8 1973-07-08    Sunday       1
70     97     272  5.7   92     7   9 1973-07-09    Monday       0
71     85     175  7.4   89     7  10 1973-07-10   Tuesday       0
72     NA     139  8.6   82     7  11 1973-07-11 Wednesday       0
73     10     264 14.3   73     7  12 1973-07-12  Thursday       0
74     27     175 14.9   81     7  13 1973-07-13    Friday       0
75     NA     291 14.9   91     7  14 1973-07-14  Saturday       1
76      7      48 14.3   80     7  15 1973-07-15    Sunday       1
77     48     260  6.9   81     7  16 1973-07-16    Monday       0
78     35     274 10.3   82     7  17 1973-07-17   Tuesday       0
79     61     285  6.3   84     7  18 1973-07-18 Wednesday       0
80     79     187  5.1   87     7  19 1973-07-19  Thursday       0
81     63     220 11.5   85     7  20 1973-07-20    Friday       0
82     16       7  6.9   74     7  21 1973-07-21  Saturday       1
83     NA     258  9.7   81     7  22 1973-07-22    Sunday       1
84     NA     295 11.5   82     7  23 1973-07-23    Monday       0
85     80     294  8.6   86     7  24 1973-07-24   Tuesday       0
86    108     223  8.0   85     7  25 1973-07-25 Wednesday       0
87     20      81  8.6   82     7  26 1973-07-26  Thursday       0
88     52      82 12.0   86     7  27 1973-07-27    Friday       0
89     82     213  7.4   88     7  28 1973-07-28  Saturday       1
90     50     275  7.4   86     7  29 1973-07-29    Sunday       1
91     64     253  7.4   83     7  30 1973-07-30    Monday       0
92     59     254  9.2   81     7  31 1973-07-31   Tuesday       0
93     39      83  6.9   81     8   1 1973-08-01 Wednesday       0
94      9      24 13.8   81     8   2 1973-08-02  Thursday       0
95     16      77  7.4   82     8   3 1973-08-03    Friday       0
96     78      NA  6.9   86     8   4 1973-08-04  Saturday       1
97     35      NA  7.4   85     8   5 1973-08-05    Sunday       1
98     66      NA  4.6   87     8   6 1973-08-06    Monday       0
99    122     255  4.0   89     8   7 1973-08-07   Tuesday       0
100    89     229 10.3   90     8   8 1973-08-08 Wednesday       0
101   110     207  8.0   90     8   9 1973-08-09  Thursday       0
102    NA     222  8.6   92     8  10 1973-08-10    Friday       0
103    NA     137 11.5   86     8  11 1973-08-11  Saturday       1
104    44     192 11.5   86     8  12 1973-08-12    Sunday       1
105    28     273 11.5   82     8  13 1973-08-13    Monday       0
106    65     157  9.7   80     8  14 1973-08-14   Tuesday       0
107    NA      64 11.5   79     8  15 1973-08-15 Wednesday       0
108    22      71 10.3   77     8  16 1973-08-16  Thursday       0
109    59      51  6.3   79     8  17 1973-08-17    Friday       0
110    23     115  7.4   76     8  18 1973-08-18  Saturday       1
111    31     244 10.9   78     8  19 1973-08-19    Sunday       1
112    44     190 10.3   78     8  20 1973-08-20    Monday       0
113    21     259 15.5   77     8  21 1973-08-21   Tuesday       0
114     9      36 14.3   72     8  22 1973-08-22 Wednesday       0
115    NA     255 12.6   75     8  23 1973-08-23  Thursday       0
116    45     212  9.7   79     8  24 1973-08-24    Friday       0
117   168     238  3.4   81     8  25 1973-08-25  Saturday       1
118    73     215  8.0   86     8  26 1973-08-26    Sunday       1
119    NA     153  5.7   88     8  27 1973-08-27    Monday       0
120    76     203  9.7   97     8  28 1973-08-28   Tuesday       0
121   118     225  2.3   94     8  29 1973-08-29 Wednesday       0
122    84     237  6.3   96     8  30 1973-08-30  Thursday       0
123    85     188  6.3   94     8  31 1973-08-31    Friday       0
124    96     167  6.9   91     9   1 1973-09-01  Saturday       1
125    78     197  5.1   92     9   2 1973-09-02    Sunday       1
126    73     183  2.8   93     9   3 1973-09-03    Monday       0
127    91     189  4.6   93     9   4 1973-09-04   Tuesday       0
128    47      95  7.4   87     9   5 1973-09-05 Wednesday       0
129    32      92 15.5   84     9   6 1973-09-06  Thursday       0
130    20     252 10.9   80     9   7 1973-09-07    Friday       0
131    23     220 10.3   78     9   8 1973-09-08  Saturday       1
132    21     230 10.9   75     9   9 1973-09-09    Sunday       1
133    24     259  9.7   73     9  10 1973-09-10    Monday       0
134    44     236 14.9   81     9  11 1973-09-11   Tuesday       0
135    21     259 15.5   76     9  12 1973-09-12 Wednesday       0
136    28     238  6.3   77     9  13 1973-09-13  Thursday       0
137     9      24 10.9   71     9  14 1973-09-14    Friday       0
138    13     112 11.5   71     9  15 1973-09-15  Saturday       1
139    46     237  6.9   78     9  16 1973-09-16    Sunday       1
140    18     224 13.8   67     9  17 1973-09-17    Monday       0
141    13      27 10.3   76     9  18 1973-09-18   Tuesday       0
142    24     238 10.3   68     9  19 1973-09-19 Wednesday       0
143    16     201  8.0   82     9  20 1973-09-20  Thursday       0
144    13     238 12.6   64     9  21 1973-09-21    Friday       0
145    23      14  9.2   71     9  22 1973-09-22  Saturday       1
146    36     139 10.3   81     9  23 1973-09-23    Sunday       1
147     7      49 10.3   69     9  24 1973-09-24    Monday       0
148    14      20 16.6   63     9  25 1973-09-25   Tuesday       0
149    30     193  6.9   70     9  26 1973-09-26 Wednesday       0
150    NA     145 13.2   77     9  27 1973-09-27  Thursday       0
151    14     191 14.3   75     9  28 1973-09-28    Friday       0
152    18     131  8.0   76     9  29 1973-09-29  Saturday       1
153    20     223 11.5   68     9  30 1973-09-30    Sunday       1

Of course, there is no limit on how many functions you can pipe, so all of the above steps could be combined together as follows:

air_summary <- 
  airquality %>% 
  mutate(Date = as.Date(paste("1973", Month, Day, sep = "-")),
         DayOfWeek = weekdays(Date),
         Weekend = ifelse(DayOfWeek %in% c("Saturday", "Sunday"), 1, 0)) %>% 
  group_by(Month, Weekend) %>% 
  summarize(TempAvg = mean(Temp),
            TempSD = sd(Temp),
            TempMax = max(Temp),
            TempMin = min(Temp),
            N_days = n())
`summarise()` has grouped output by 'Month'. You can override using the `.groups` argument.

This is a good place to stop for a moment and be thankful for piping. The code below is the same as the code above, but instead of using pipes, it is in a nested format and without any line breaks. Of course, we could make it a little prettier by adding in some line breaks or by taking the output of each function and saving it as a new object to be used in the following function.

Looking at the code below, ask yourself a couple questions:

  • Is it easy to read from inside to outside?
  • How long did it take you to find the name of the dataset we are using?
air_summary <- summarize(group_by(mutate(airquality, Date = as.Date(paste("1973", Month, Day, sep = "-")), DayOfWeek = weekdays(Date), Weekend = ifelse(DayOfWeek %in% c("Saturday", "Sunday"), 1, 0)), Month, Weekend), TempAvg = mean(Temp), TempSD = sd(Temp), TempMax = max(Temp), TempMin = min(Temp), N_days = n())
`summarise()` has grouped output by 'Month'. You can override using the `.groups` argument.

18.2 Adding Group-Level Information without Removing Rows

The code in the above section reduced the number of rows in the dataframe when calculating group-level variables. This is useful in situations where you want to create summary statistics for succinct tables, such as mean income by level of education.

However, if you want to keep your whole dataset but add information along a grouping variable, you can do that too. Just change summarize() to mutate() in the previous example:

air_summary2 <-
  airquality %>% 
  mutate(Date = as.Date(paste("1973", Month, Day, sep = "-")),
         DayOfWeek = weekdays(Date),
         Weekend = ifelse(DayOfWeek %in% c("Saturday", "Sunday"), 1, 0)) %>% 
  group_by(Month, Weekend) %>% 
  mutate(TempAvg = mean(Temp),
         TempSD = sd(Temp),
         TempMax = max(Temp),
         TempMin = min(Temp),
         N_days = n())

air_summary2 %>% select(-c(1:3, 7)) # drop a few columns to fit the others on the screen
# A tibble: 153 x 10
# Groups:   Month, Weekend [10]
    Temp Month   Day DayOfWeek Weekend TempAvg TempSD TempMax TempMin N_days
   <int> <int> <int> <chr>       <dbl>   <dbl>  <dbl>   <int>   <int>  <int>
 1    67     5     1 Tuesday         0    66.5   7.19      81      57     23
 2    72     5     2 Wednesday       0    66.5   7.19      81      57     23
 3    74     5     3 Thursday        0    66.5   7.19      81      57     23
 4    62     5     4 Friday          0    66.5   7.19      81      57     23
 5    56     5     5 Saturday        1    62.8   5.20      69      56      8
 6    66     5     6 Sunday          1    62.8   5.20      69      56      8
 7    65     5     7 Monday          0    66.5   7.19      81      57     23
 8    59     5     8 Tuesday         0    66.5   7.19      81      57     23
 9    61     5     9 Wednesday       0    66.5   7.19      81      57     23
10    69     5    10 Thursday        0    66.5   7.19      81      57     23
# ... with 143 more rows

Note that, unlike in Stata where we need to sort our data with bysort in order to perform group-wise calculations, R does not require our dataframe to be sorted.

The group-level summary statistics are repeated alongside the individual cases.

We can check the grouping variables in a dataframe with group_keys().

air_summary2 %>% group_vars()
[1] "Month"   "Weekend"

It is good practice to ungroup() your dataframe after carrying out operations. If you do not, you may forget and unknowingly perform group-wise calculations when you meant to perform global calculations.

air_summary2 <- 
  air_summary2 %>% 
  ungroup()

The empty vector output from group_vars() confirms that our dataframe is no longer grouped.

air_summary2 %>% group_vars()
character(0)

18.3 Aggregating Exercises

Use the chickwts dataset.

  1. Is this a balanced experiment (same number of individuals/chickens in each condition/feed)?

  2. Which feed was associated with the largest variation (standard deviation) in weight?

  3. Without reducing the number of rows, add a column with the range (maximum - minimum) of weight for each feed.