14 Aggregating

For some analysis and graphing tasks you will need to work with summary data in a data frame. Occasionally this might be the summary statistics of an entire data set, but more often it will be measures that summarize groups of observations. The task of producing such a data frame may be called summarizing, aggregating, or collapsing data.

For some tasks, the summary data set itself may be a suitable endpoint of your data wrangling, but for other problems you will want to merge the summary data back onto the original data set. In base R, this is often a three step process:

  • produce summary statistics
  • ensure the result is a data frame
  • merge

While it is useful to understand how to set this up in base R, functions in the tidyverse package dplyr can greatly simplify all this, especially where you want to merge summary statistics back onto your original data set. (See Chapter 18.)

What follows here is base R.

14.1 Summary Statistics

Let’s begin with a review of how to produce various summary statistics, but with the added emphasis of turning the results into data.frames.

Let’s work with a subset of the mtcars data.

cars <- mtcars[,1:5]

14.1.1 Grand Summaries

The usual summary statistics applied to vectors - mean, minimum, maximum, length (i.e. counts) - produce results that are essentially scalars. The only hard part here is remembering to consider missing values!

Applied to multiple columns in a data.frame using the apply or sapply functions, our summary statistics are returned as vectors. Here, it may not be obvious how to turn this into a data.frame. The key is to realize that vectors are considered column vectors by default. The matrix transpose function, t(), fixes this for us.

means <- sapply(cars, mean)
as.data.frame(t(means))
       mpg    cyl     disp       hp     drat
1 20.09062 6.1875 230.7219 146.6875 3.596563

14.1.2 Group Summaries

More often than grand summaries we will be interested in group summaries. To produce these summaries with a vector and a grouping variable, you might typically use the tapply function, returning a vector of results.

mpgmin <- tapply(cars$mpg, cars$cyl, min)
mpgmin
   4    6    8 
21.4 17.8 10.4 

And we can turn this into a data frame, with the grouping data values as row names.

as.data.frame(mpgmin)
  mpgmin
4   21.4
6   17.8
8   10.4

Another useful function, which produces a data frame directly, is aggregate. For summaries of a single variable, the formula method of specification is perhaps easiest to write and labels the result most nicely.

aggregate(mpg ~ cyl, cars, min)
  cyl  mpg
1   4 21.4
2   6 17.8
3   8 10.4

The aggregate function is also useful where you have multiple variables to summarize by the same groups, which tapply will not handle.

aggregate(cars, list(cars$cyl), min)
  Group.1  mpg cyl  disp  hp drat
1       4 21.4   4  71.1  52 3.69
2       6 17.8   6 145.0 105 2.76
3       8 10.4   8 275.8 150 2.76

14.2 Merging

Often we want to include summary data alongside the original data. This entails merging the summary data with the original data frame.

14.2.1 Merging a Named Vector

Where we have a grouped summary statistic in a named vector, we can merge this with the original data directly, skipping the data framing step. We simply use the “by” variable from the original data as a vector of category names (after coercing to character) to construct a new vector in the data frame.

mpgmin <- tapply(cars$mpg, cars$cyl, min)
cars$mpgmin <- mpgmin[as.character(cars$cyl)]
head(cars)
                   mpg cyl disp  hp drat mpgmin
Mazda RX4         21.0   6  160 110 3.90   17.8
Mazda RX4 Wag     21.0   6  160 110 3.90   17.8
Datsun 710        22.8   4  108  93 3.85   21.4
Hornet 4 Drive    21.4   6  258 110 3.08   17.8
Hornet Sportabout 18.7   8  360 175 3.15   10.4
Valiant           18.1   6  225 105 2.76   17.8

14.2.2 Merging a Data Frame

See Chapter 13 for more on the basics of merging data frames. In this case, we especially need to think about how our summary statistic columns are names in order to avoid confusion in the merged data.

cars <- mtcars[,1:3] # simple example

maxcars <- aggregate(cars, list(cars$cyl), max) # summarize

# cleanup prior to merging
names(maxcars) <- c("cyl", paste(names(cars), "max", sep="_"))
maxcars$cyl_max <- NULL # drop

carsplus <- merge(cars, maxcars, all=TRUE)
carsplus[11:20,]
   cyl  mpg  disp mpg_max disp_max
11   4 27.3  79.0    33.9    146.7
12   6 21.0 160.0    21.4    258.0
13   6 21.0 160.0    21.4    258.0
14   6 17.8 167.6    21.4    258.0
15   6 21.4 258.0    21.4    258.0
16   6 18.1 225.0    21.4    258.0
17   6 19.2 167.6    21.4    258.0
18   6 19.7 145.0    21.4    258.0
19   8 18.7 360.0    19.2    472.0
20   8 17.3 275.8    19.2    472.0

14.3 Aggregating Exercises

Use the chickwts dataset.

  1. Is this a balanced experiment (are the same number of chickens given each feed)?

  2. Which feed was associated with the largest variation (standard deviation) in weight?

  3. Without reducing the number of rows, add a column with the range (maximum - minimum) of weight for each feed.