14  Aggregating Dataframes

14.1 Warm-Up

Four datasets are presented below. Dataset 1 was used to create Datasets 2-4. Discuss the relationships between the four datasets:

  1. How do the four differ in terms of the numbers of their rows and columns?
  2. How were the columns mean1 and mean2 calculated?
  3. Which columns have repeated values?

Dataset 1:

id wave value
a 1 5
a 2 4
a 3 9
b 1 3
b 2 10
b 3 8

Dataset 2:

id wave value mean1 mean2
a 1 5 6 4.0
a 2 4 6 7.0
a 3 9 6 8.5
b 1 3 7 4.0
b 2 10 7 7.0
b 3 8 7 8.5

Dataset 3:

id mean1
a 6
b 7

Dataset 4:

wave mean2
1 4.0
2 7.0
3 8.5

14.2 Outcomes

Objective: To produce summary statistics along one or more grouping variables.

Why it matters: Many datasets are inherently multilevel: individuals in families, counties in states, observations within participants, and so on. We need to create variables like the number of individuals in a family, or the mean vote share by county for some political party.

Learning outcomes:

Fundamental Skills:

  • Summarize variables of interest across one or more grouping variables.

  • Add group-level information to a dataframe without removing rows.

  • Find and remove a dataframe’s grouping variables.

Extended Skills:

  • Check quality of panel data.

Key functions and operators:

group_by()
summarize()
ungroup()

14.3 Aggregating Without Grouping Variables

Aggregation is the process of turning many data points into fewer data points, typically in the form of summary statistics.

Load the tidyverse.

library(tidyverse)

The summarize() function follows a similar structure to mutate(). Its arguments are one or more lines of column_name = how_to_create_it. The function on the right side must return only a single value, such as mean(), sd(), min(), or max(). Functions that return more than one value include unique(), quantile(), and range(); do not use these by themselves.

Calculate the mean and standard deviation of the bill_len column in penguins, ignoring missing data with na.rm = T:

penguins |> 
  summarize(bill_len_avg = mean(bill_len, na.rm = T),
            bill_len_sd = sd(bill_len, na.rm = T))
  bill_len_avg bill_len_sd
1     43.92193    5.459584

summarize() only returns the columns we requested; the others are omitted.
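If you do want one of the multi-value functions, wrap it so that only a single value reaches summarize(). As a sketch (bill_len_spread is a name chosen for this example), diff() collapses the two values returned by range() into a single max-minus-min spread:

```r
# range() returns two values (min, max); diff() collapses them into one
penguins |> 
  summarize(bill_len_spread = diff(range(bill_len, na.rm = T)))
```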

14.4 Aggregating by Grouping Variables

The real power of summarize() emerges when it is used to aggregate across grouping variables.

To do this, use the group_by() function with one or more groups as arguments.

To calculate the mean and standard deviation of bill_len by species, run the same code as above but with an additional line of group_by(species):

penguins |> 
  group_by(species) |> 
  summarize(bill_len_avg = mean(bill_len, na.rm = T),
            bill_len_sd = sd(bill_len, na.rm = T))
# A tibble: 3 × 3
  species   bill_len_avg bill_len_sd
  <fct>            <dbl>       <dbl>
1 Adelie            38.8        2.66
2 Chinstrap         48.8        3.34
3 Gentoo            47.5        3.08

Now the requested statistics are calculated separately for each level of the grouping variable: Adelie has its own mean and standard deviation, as does Chinstrap, as does Gentoo.

In addition to the columns we named in summarize(), the grouping variables are also returned.

More than one grouping variable can be used. First, check for the combinations of species and island in penguins:

penguins |> 
  distinct(species, island)
    species    island
1    Adelie Torgersen
2    Adelie    Biscoe
3    Adelie     Dream
4    Gentoo    Biscoe
5 Chinstrap     Dream

Adelie penguins appear on three different islands, but Gentoo and Chinstrap appear on only one island each. With more than one grouping variable, group_by() finds all combinations that exist in the dataset. That is, it will not try to create all possible pairwise combinations of these two variables, so it will not return nonexistent combinations (e.g., no Gentoo-Dream or Chinstrap-Torgersen).

Run the same code as above, but now with species and island as the groups:

penguins |> 
  group_by(species, island) |> 
  summarize(bill_len_avg = mean(bill_len, na.rm = T),
            bill_len_sd = sd(bill_len, na.rm = T))
`summarise()` has grouped output by 'species'. You can override using the
`.groups` argument.
# A tibble: 5 × 4
# Groups:   species [3]
  species   island    bill_len_avg bill_len_sd
  <fct>     <fct>            <dbl>       <dbl>
1 Adelie    Biscoe            39.0        2.48
2 Adelie    Dream             38.5        2.47
3 Adelie    Torgersen         39.0        3.03
4 Chinstrap Dream             48.8        3.34
5 Gentoo    Biscoe            47.5        3.08

With a few more functions and some creativity, we can return even more useful information with summarize():

  • The number of observations/rows/penguins by group: n()
  • The number of missing values in bill_len by group: sum(is.na(bill_len))
  • The proportion of missing values in bill_len by group: mean(is.na(bill_len))
  • The number of different values of year by group: length(unique(year))
penguins |> 
  group_by(species, island) |> 
  summarize(n = n(),
            miss_n = sum(is.na(bill_len)),
            miss_prop = mean(is.na(bill_len)),
            year_n = length(unique(year)))
`summarise()` has grouped output by 'species'. You can override using the
`.groups` argument.
# A tibble: 5 × 6
# Groups:   species [3]
  species   island        n miss_n miss_prop year_n
  <fct>     <fct>     <int>  <int>     <dbl>  <int>
1 Adelie    Biscoe       44      0   0            3
2 Adelie    Dream        56      0   0            3
3 Adelie    Torgersen    52      1   0.0192       3
4 Chinstrap Dream        68      0   0            3
5 Gentoo    Biscoe      124      1   0.00806      3

14.5 Adding Group-Level Information Without Removing Rows

The code in the above section reduced the number of rows in the dataframe when calculating group-level variables. This is useful in situations where you want to create summary statistics for succinct tables, such as mean income by level of education.

However, if you want to keep your whole dataset but add information along a grouping variable, you can do that too. Just change summarize() to mutate() in the first example:

penguins |> 
  group_by(species) |> 
  mutate(bill_len_avg = mean(bill_len, na.rm = T),
         bill_len_sd = sd(bill_len, na.rm = T)) |> 
  select(-bill_dep, -flipper_len) # drop some columns to see the new columns
# A tibble: 344 × 8
# Groups:   species [3]
   species island    bill_len body_mass sex     year bill_len_avg bill_len_sd
   <fct>   <fct>        <dbl>     <int> <fct>  <int>        <dbl>       <dbl>
 1 Adelie  Torgersen     39.1      3750 male    2007         38.8        2.66
 2 Adelie  Torgersen     39.5      3800 female  2007         38.8        2.66
 3 Adelie  Torgersen     40.3      3250 female  2007         38.8        2.66
 4 Adelie  Torgersen     NA          NA <NA>    2007         38.8        2.66
 5 Adelie  Torgersen     36.7      3450 female  2007         38.8        2.66
 6 Adelie  Torgersen     39.3      3650 male    2007         38.8        2.66
 7 Adelie  Torgersen     38.9      3625 female  2007         38.8        2.66
 8 Adelie  Torgersen     39.2      4675 male    2007         38.8        2.66
 9 Adelie  Torgersen     34.1      3475 <NA>    2007         38.8        2.66
10 Adelie  Torgersen     42        4250 <NA>    2007         38.8        2.66
# ℹ 334 more rows

Now, the group-level information is included and repeated alongside individual cases.

This is useful when preparing data for multilevel modeling, so that aggregated level-one variables can be used as level-two predictors.
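As a sketch of that use case, the group mean added by mutate() can be used to group-mean center a level-one variable (bill_len_centered is a name chosen for this example):

```r
# Group-mean centering: subtract each species' mean from each penguin's value
penguins |> 
  group_by(species) |> 
  mutate(bill_len_avg = mean(bill_len, na.rm = T),
         bill_len_centered = bill_len - bill_len_avg) |> 
  select(species, bill_len, bill_len_avg, bill_len_centered)
```

Within each species, the centered values now have a mean of zero.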

14.6 Removing Groupings

When we first learned how to use mutate() in a previous chapter, we had yet to learn about groups. Now we see that adding groups to a dataframe changes how this function works.

Groups can continue to persist after some operations. Notice how the top of the output in the previous example states Groups: species [3]. Confirm this by rerunning the code above, assigning the resulting dataframe to dat, and checking the groups with group_vars():

dat <- 
  penguins |> 
  group_by(species) |> 
  mutate(bill_len_avg = mean(bill_len, na.rm = T),
         bill_len_sd = sd(bill_len, na.rm = T)) |> 
  select(-bill_dep, -flipper_len)

group_vars(dat)
[1] "species"

group_vars() returns a character vector of the grouping variables.

The consequence of persisting groups is that a later mutate() or summarize() will continue to work with these groups, even if we do not explicitly declare groups with group_by() in that later operation!
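To see the pitfall concretely, here is a sketch: because dat is still grouped by species, a mutate() that looks like it computes a dataset-wide mean instead computes one mean per species:

```r
# dat as created above: still grouped by species
dat <- 
  penguins |> 
  group_by(species) |> 
  mutate(bill_len_avg = mean(bill_len, na.rm = T))

# This looks like an overall mean, but it runs within each group
dat |> 
  mutate(overall_avg = mean(bill_len, na.rm = T)) |> 
  distinct(species, overall_avg)
```

The result has three rows of distinct "overall" means, one per species, rather than the single value we might have expected.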

This can have unintended effects, so it is best practice to remove groups after each operation that uses those groups. To do this, just follow each operation with a pipe into ungroup():

dat <- 
  penguins |> 
  group_by(species) |> 
  mutate(bill_len_avg = mean(bill_len, na.rm = T),
         bill_len_sd = sd(bill_len, na.rm = T)) |> 
  select(-bill_dep, -flipper_len) |> 
  ungroup()

group_vars(dat)
character(0)

character(0) is a character vector with length zero, meaning that there are no groups associated with the data.

14.7 Checking Quality of Panel Data

When working with panel, hierarchical, or longitudinal data, we need to check its quality to catch possible data entry errors and abnormal values.

These checks can be done with the skills taught above.

Import the waves sample dataset:

waves <- read.csv("https://sscc.wisc.edu/sscc/pubs/dwr/data/aggregate_waves.csv")
waves
   id wave birth_year grade
1   1    1       2001     3
2   1    2       2001     4
3   1    3       2001     5
4   2    1       1995     9
5   2    2       1995    10
6   2    3       1995    11
7   3    1       1998     6
8   3    2       1998     7
9   3    3       1998     8
10  4    1       1994    10
11  4    2       1994    11
12  4    3       1994    12
13  5    1       2002     2
14  5    2       2003     3
15  5    3       2002     4
16  6    1       2000     4
17  6    2       2000     4
18  6    3       2000     6

This dataset has longitudinal data for students over three years, with one wave per year. At each wave, birth_year and grade were collected. birth_year is an individual-level variable that should not change over time, so we need to check whether any individuals have multiple values for this variable. grade is a wave-level variable that normally increases by one each year (but not necessarily), so we should check whether this variable follows such a pattern within each individual.

First, check whether each id has three observations by returning any groups that do not have exactly three:

waves |> 
  group_by(id) |> 
  mutate(n = n()) |> 
  filter(n != 3)
# A tibble: 0 × 5
# Groups:   id [0]
# ℹ 5 variables: id <int>, wave <int>, birth_year <int>, grade <int>, n <int>

An empty dataframe means the check is passed.

Now that we know each individual has exactly three observations, we need to check whether each of those observations happened in a different wave. In other words, each combination of id and wave should only exist once:

waves |> 
  group_by(id, wave) |> 
  mutate(n = n()) |> 
  filter(n != 1)
# A tibble: 0 × 5
# Groups:   id, wave [0]
# ℹ 5 variables: id <int>, wave <int>, birth_year <int>, grade <int>, n <int>

This test also passed.

Then, check whether birth_year varies within individuals. To do this, use id as a grouping variable and count the number of unique() values of birth_year with length(). This is just for data checking, so the variable name does not need to be meaningful (x is fine).

Now filter() the data to find individuals who have more than one value of birth_year:

waves |> 
  group_by(id) |> 
  mutate(x = length(unique(birth_year))) |> 
  filter(x > 1)
# A tibble: 3 × 5
# Groups:   id [1]
     id  wave birth_year grade     x
  <int> <int>      <int> <int> <int>
1     5     1       2002     2     2
2     5     2       2003     3     2
3     5     3       2002     4     2

This individual has two values of birth_year: 2002 and 2003.

Next, check whether an individual had three different values of grade, to see whether they were in a different grade each year:

waves |> 
  group_by(id) |> 
  mutate(x = length(unique(grade))) |> 
  filter(x != 3)
# A tibble: 3 × 5
# Groups:   id [1]
     id  wave birth_year grade     x
  <int> <int>      <int> <int> <int>
1     6     1       2000     4     2
2     6     2       2000     4     2
3     6     3       2000     6     2

This individual seems to have repeated grade four and then skipped to grade six.

Finally, check whether grade increased by one each year. For this, use the lag() function, which gets the value from the previous row. Because this function is based on row position, be sure to sort the data with arrange() first so that records are in chronological order.

Subtracting the previous value of grade (lag(grade)) from the current value of grade answers the question, “How many grades did this individual advance between years?” The expected answer is one, but other answers are possible for repeated and skipped grades.

waves |> 
  group_by(id) |> 
  arrange(wave) |> 
  mutate(x = grade - lag(grade)) |> 
  filter(x != 1)
# A tibble: 2 × 5
# Groups:   id [1]
     id  wave birth_year grade     x
  <int> <int>      <int> <int> <int>
1     6     2       2000     4     0
2     6     3       2000     6     2
Note

Only two rows are returned here because lag() returns NA for the first observation for each id since there is no previous value:

lag(1:3)
[1] NA  1  2

Any logical comparison with NA returns NA, and filter() only returns rows where the expression evaluates to TRUE.
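A quick base R illustration of this NA behavior:

```r
# A comparison with NA yields NA, not TRUE or FALSE,
# so filter() drops that row
NA != 1    # returns NA
```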

This individual was also identified in the previous block because their grade progression of 4-4-6 only has two unique values. They advanced zero grades from wave one to wave two, and they advanced two grades from wave two to wave three.

We have identified two problems with the dataset: ID 5’s multiple birth years, and ID 6’s grade progression. Now, we need to handle them.

First, fix the multiple birth year problem. We could fix this in several ways:

  1. Drop this individual from the dataset:

    waves |> 
      filter(id != 5)
  2. Assume their first value is most likely the correct value, and use that one:

    waves |> 
      group_by(id) |> 
      mutate(birth_year = first(birth_year)) |> 
      ungroup()
  3. Assume their last value is most likely the correct value, and use that one:

    waves |> 
      group_by(id) |> 
      mutate(birth_year = last(birth_year)) |> 
      ungroup()
  4. Assume their most common value is most likely the correct value, and use that one. R has no built-in “mode” function for finding the most common value (mode() is something else entirely), so a workaround is needed:

    waves |> 
    
      group_by(id, birth_year) |> 
      # get counts of birth_year by id
      mutate(n = n()) |> 
      ungroup() |> 
    
      group_by(id) |> 
      # use the value of birth year associated with the highest count
      mutate(birth_year = birth_year[n == max(n)][1]) |> 
      ungroup()

The [1] in birth_year[n == max(n)][1] returns the first value in case of a tie, which here would happen if an id had three different values of birth_year, each with a count of one. Another approach could be to randomly pick one of the modes with sample(birth_year[n == max(n)], 1). This would require setting a seed with set.seed() beforehand to get the same random value each time.
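The same workaround can be packaged as a small reusable helper. This is a sketch, and stat_mode is a hypothetical name, not a built-in function:

```r
# Hypothetical helper (not built-in): most common value,
# with the first value winning ties
stat_mode <- function(x) {
  ux <- unique(x)
  ux[which.max(tabulate(match(x, ux)))]
}

stat_mode(c(2002, 2003, 2002))    # returns 2002
```

Inside a grouped mutate(), this would read mutate(birth_year = stat_mode(birth_year)).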

For any of these approaches, transparency in reporting is a must. Clearly explain and justify any decisions made in data modification. You should also run a sensitivity analysis with multiple approaches (first/last/mode) to check whether inferences differ as a result of the data preparation process.

Now, for ID 6 with a nonstandard progression in the values of grade, we have a few choices:

  1. Leave it as-is. Assume this is accurate, or contact the data provider to confirm its accuracy.

  2. Drop this individual from the dataset:

    waves |> 
      filter(id != 6)
  3. Force the grades to progress by one each year. This can be done manually with just this particular observation:

    waves |>
      mutate(grade = ifelse(id == 6 & wave == 2, 5, grade))

Applying a rule to the entire dataframe should be done with care, and there is no one-size-fits-all approach. The check of grade progression would also flag individuals who have

  • Missing observations (e.g., grade 4 in wave 1, no observation for wave 2, and grade 6 in wave 3)
  • The same grade every year (e.g., 4-4-4)
    • With which wave should that value of grade be associated? Should their progression be 4-5-6, 3-4-5, or 2-3-4?
  • In a dataset with more observations, multiple series of consecutive grades (e.g., 2-3-4-7-8-9)
    • Should the second set be moved down to the first (2-3-4-5-6-7), the first set moved up to the second (4-5-6-7-8-9), or something else?
  • Aberrant patterns (e.g., 1-1-7 or 6-2-9)

14.8 Exercises

14.8.1 Fundamental

Use the chickwts dataset.

  1. Is this a balanced experiment (same number of individuals/chickens in each condition/feed)?

  2. Which feed was associated with the largest variation (standard deviation) in weight?

  3. Without reducing the number of rows, add a column with the range (maximum - minimum) of weight for each feed.

14.8.2 Extended

Use this dataset:

dat <- readRDS(url("https://sscc.wisc.edu/sscc/pubs/dwr/data/aggregate_extended.rds"))

Check the quality of the data. Each id should have one observation per year for ten years. year and age_on_jan1 should increase by one each year. birthday and birth_state should not vary within individual.