14 Aggregating Dataframes
14.1 Warm-Up
Four datasets are presented below. Dataset 1 was used to create Datasets 2-4. Discuss the relationships between the four datasets:
- How do the four differ in terms of the numbers of their rows and columns?
- How were the columns mean1 and mean2 calculated?
- Which columns have repeated values?
Dataset 1:
id | wave | value |
---|---|---|
a | 1 | 5 |
a | 2 | 4 |
a | 3 | 9 |
b | 1 | 3 |
b | 2 | 10 |
b | 3 | 8 |
Dataset 2:
id | wave | value | mean1 | mean2 |
---|---|---|---|---|
a | 1 | 5 | 6 | 4.0 |
a | 2 | 4 | 6 | 7.0 |
a | 3 | 9 | 6 | 8.5 |
b | 1 | 3 | 7 | 4.0 |
b | 2 | 10 | 7 | 7.0 |
b | 3 | 8 | 7 | 8.5 |
Dataset 3:
id | mean1 |
---|---|
a | 6 |
b | 7 |
Dataset 4:
wave | mean2 |
---|---|
1 | 4.0 |
2 | 7.0 |
3 | 8.5 |
14.2 Outcomes
Objective: To produce summary statistics along one or more grouping variables.
Why it matters: Many datasets are inherently multilevel: individuals in families, counties in states, observations within participants, and so on. We need to create variables like the number of individuals in a family, or the mean vote share by county for some political party.
Learning outcomes:
Fundamental Skills | Extended Skills
Key functions and operators: group_by(), summarize(), ungroup()
14.3 Aggregating Without Grouping Variables
Aggregation is the process of turning many data points into fewer data points, typically in the form of summary statistics.
Load the tidyverse.
library(tidyverse)
The summarize() function follows a similar structure to mutate(). Its arguments are one or more lines of column_name = how_to_create_it. The function on the right side must return only a single value, such as mean(), sd(), min(), or max(). Functions that return more than one value include unique(), quantile(), and range(); do not use these by themselves.
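For example, range() returns two values (the minimum and the maximum), so it cannot serve directly as a summary. If the goal is the spread of a variable, build it from single-value functions instead. A small sketch (the column name bill_len_range is just illustrative):

penguins |>
  summarize(bill_len_range = max(bill_len, na.rm = T) - min(bill_len, na.rm = T))

Each argument to summarize() now produces exactly one value, which is what the function expects.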
Calculate the mean and standard deviation of the bill_len column in penguins, ignoring missing data with na.rm = T:
penguins |>
  summarize(bill_len_avg = mean(bill_len, na.rm = T),
            bill_len_sd = sd(bill_len, na.rm = T))
bill_len_avg bill_len_sd
1 43.92193 5.459584
summarize() only returns the columns we requested; the others are omitted.
14.4 Aggregating by Grouping Variables
The real power of summarize() emerges when it is used to aggregate across grouping variables. To do this, use the group_by() function with one or more groups as arguments.
To calculate the mean and standard deviation of bill_len by species, run the same code as above but with an additional line of group_by(species):
penguins |>
  group_by(species) |>
  summarize(bill_len_avg = mean(bill_len, na.rm = T),
            bill_len_sd = sd(bill_len, na.rm = T))
# A tibble: 3 × 3
species bill_len_avg bill_len_sd
<fct> <dbl> <dbl>
1 Adelie 38.8 2.66
2 Chinstrap 48.8 3.34
3 Gentoo 47.5 3.08
Now the requested statistics are calculated separately for each level of the grouping variable: Adelie has its own mean and standard deviation, as does Chinstrap, as does Gentoo.
In addition to the columns we named in summarize(), the grouping variable(s) are also returned.
More than one grouping variable can be used. First, check for the combinations of species and island in penguins:
penguins |>
  distinct(species, island)
species island
1 Adelie Torgersen
2 Adelie Biscoe
3 Adelie Dream
4 Gentoo Biscoe
5 Chinstrap Dream
Adelie penguins appear on three different islands, but Gentoo and Chinstrap appear on only one island each. With more than one grouping variable, group_by() finds all combinations that exist in the dataset. That is, it will not try to create all possible pairwise combinations of these two variables, so it will not return nonexistent combinations (e.g., no Gentoo-Dream or Chinstrap-Torgersen).
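As an aside, the dplyr function count(), not introduced above, offers a closely related check: it returns the same existing combinations as distinct(), plus a column n with the number of rows in each:

penguins |>
  count(species, island)

This is often a quick way to see both which combinations exist and how common each one is.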
Run the same code as above, but now with species and island as the groups:
penguins |>
  group_by(species, island) |>
  summarize(bill_len_avg = mean(bill_len, na.rm = T),
            bill_len_sd = sd(bill_len, na.rm = T))
`summarise()` has grouped output by 'species'. You can override using the
`.groups` argument.
# A tibble: 5 × 4
# Groups: species [3]
species island bill_len_avg bill_len_sd
<fct> <fct> <dbl> <dbl>
1 Adelie Biscoe 39.0 2.48
2 Adelie Dream 38.5 2.47
3 Adelie Torgersen 39.0 3.03
4 Chinstrap Dream 48.8 3.34
5 Gentoo Biscoe 47.5 3.08
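The message above about grouped output appears because, with two grouping variables, summarize() drops only the last one (island) and leaves the result grouped by species (note the Groups: species [3] line). If you prefer a fully ungrouped result, recent versions of dplyr let you say so explicitly with the .groups argument:

penguins |>
  group_by(species, island) |>
  summarize(bill_len_avg = mean(bill_len, na.rm = T),
            .groups = "drop")

With .groups = "drop", the result is an ordinary ungrouped tibble and the message is suppressed.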
With a few more functions and some creativity, we can return even more useful information with summarize():
- The number of observations/rows/penguins by group: n()
- The number of missing values in bill_len by group: sum(is.na(bill_len))
- The proportion of missing values in bill_len by group: mean(is.na(bill_len))
- The number of different values of year by group: length(unique(year))
penguins |>
  group_by(species, island) |>
  summarize(n = n(),
            miss_n = sum(is.na(bill_len)),
            miss_prop = mean(is.na(bill_len)),
            year_n = length(unique(year)))
`summarise()` has grouped output by 'species'. You can override using the
`.groups` argument.
# A tibble: 5 × 6
# Groups: species [3]
species island n miss_n miss_prop year_n
<fct> <fct> <int> <int> <dbl> <int>
1 Adelie Biscoe 44 0 0 3
2 Adelie Dream 56 0 0 3
3 Adelie Torgersen 52 1 0.0192 3
4 Chinstrap Dream 68 0 0 3
5 Gentoo Biscoe 124 1 0.00806 3
14.5 Adding Group-Level Information Without Removing Rows
The code in the above section reduced the number of rows in the dataframe when calculating group-level variables. This is useful in situations where you want to create summary statistics for succinct tables, such as mean income by level of education.
However, if you want to keep your whole dataset but add information along a grouping variable, you can do that too. Just change summarize() to mutate() in the first example:
penguins |>
  group_by(species) |>
  mutate(bill_len_avg = mean(bill_len, na.rm = T),
         bill_len_sd = sd(bill_len, na.rm = T)) |>
  select(-bill_dep, -flipper_len) # drop some columns to see the new columns
# A tibble: 344 × 8
# Groups: species [3]
species island bill_len body_mass sex year bill_len_avg bill_len_sd
<fct> <fct> <dbl> <int> <fct> <int> <dbl> <dbl>
1 Adelie Torgersen 39.1 3750 male 2007 38.8 2.66
2 Adelie Torgersen 39.5 3800 female 2007 38.8 2.66
3 Adelie Torgersen 40.3 3250 female 2007 38.8 2.66
4 Adelie Torgersen NA NA <NA> 2007 38.8 2.66
5 Adelie Torgersen 36.7 3450 female 2007 38.8 2.66
6 Adelie Torgersen 39.3 3650 male 2007 38.8 2.66
7 Adelie Torgersen 38.9 3625 female 2007 38.8 2.66
8 Adelie Torgersen 39.2 4675 male 2007 38.8 2.66
9 Adelie Torgersen 34.1 3475 <NA> 2007 38.8 2.66
10 Adelie Torgersen 42 4250 <NA> 2007 38.8 2.66
# ℹ 334 more rows
Now, the group-level information is included and repeated alongside individual cases.
This is useful when preparing data for multilevel modeling, so that aggregated level-one variables can be used as level-two predictors.
14.6 Removing Groupings
When we first learned how to use mutate() in a previous chapter, we had yet to learn about groups. Now we see that adding groups to a dataframe changes how this function works.
Groups can persist after some operations. Notice how the top of the output in the previous example states Groups: species [3]. Confirm this by rerunning the code above, assigning the resulting dataframe to dat, and checking the groups with group_vars():
dat <- penguins |>
  group_by(species) |>
  mutate(bill_len_avg = mean(bill_len, na.rm = T),
         bill_len_sd = sd(bill_len, na.rm = T)) |>
  select(-bill_dep, -flipper_len)

group_vars(dat)
[1] "species"
group_vars() returns a character vector of the grouping variables.
The consequence of persistent groups is that a later mutate() or summarize() will continue to work with these groups, even if we do not explicitly declare groups with group_by() in that later operation!
This can have unintended effects, so it is best practice to remove groups after each operation that uses them. To do this, just follow each operation with a pipe into ungroup():
dat <- penguins |>
  group_by(species) |>
  mutate(bill_len_avg = mean(bill_len, na.rm = T),
         bill_len_sd = sd(bill_len, na.rm = T)) |>
  select(-bill_dep, -flipper_len) |>
  ungroup()

group_vars(dat)
character(0)
character(0) is a character vector with length zero, meaning that there are no groups associated with the data.
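As an alternative to the group_by() ... ungroup() pattern, dplyr 1.1.0 and later support a per-operation .by argument, which applies a grouping to a single mutate() or summarize() and returns an ungrouped result. A sketch, assuming a recent dplyr version:

penguins |>
  mutate(bill_len_avg = mean(bill_len, na.rm = T),
         .by = species) |>
  group_vars()

Because the grouping never persists, no ungroup() is needed. With older versions of dplyr, stick with the ungroup() approach shown above.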
14.7 Checking Quality of Panel Data
When working with panel, hierarchical, or longitudinal data, we need to check data quality to catch possible data entry errors and abnormal values.
These checks can be done with the skills taught above.
Import the waves sample dataset:
waves <- read.csv("https://sscc.wisc.edu/sscc/pubs/dwr/data/aggregate_waves.csv")
waves
id wave birth_year grade
1 1 1 2001 3
2 1 2 2001 4
3 1 3 2001 5
4 2 1 1995 9
5 2 2 1995 10
6 2 3 1995 11
7 3 1 1998 6
8 3 2 1998 7
9 3 3 1998 8
10 4 1 1994 10
11 4 2 1994 11
12 4 3 1994 12
13 5 1 2002 2
14 5 2 2003 3
15 5 3 2002 4
16 6 1 2000 4
17 6 2 2000 4
18 6 3 2000 6
This dataset has longitudinal data for students over three years, with one wave per year. At each wave, birth_year and grade were collected. birth_year is an individual-level variable that should not change over time, so we need to check whether any individuals have multiple values for this variable. grade is a wave-level variable that normally increases by one each year (but not necessarily), so we should check whether this variable follows such a pattern within each individual.
First, check whether each id has three observations by returning any groups that do not have exactly three observations:
waves |>
  group_by(id) |>
  mutate(n = n()) |>
  filter(n != 3)
# A tibble: 0 × 5
# Groups: id [0]
# ℹ 5 variables: id <int>, wave <int>, birth_year <int>, grade <int>, n <int>
An empty dataframe means the check is passed.
Now that we know each individual has exactly three observations, we need to check whether each of those observations happened in a different wave. In other words, each combination of id and wave should only exist once:
waves |>
  group_by(id, wave) |>
  mutate(n = n()) |>
  filter(n != 1)
# A tibble: 0 × 5
# Groups: id, wave [0]
# ℹ 5 variables: id <int>, wave <int>, birth_year <int>, grade <int>, n <int>
This test also passed.
Then, check whether birth_year varies within individuals. To do this, use id as a grouping variable, and get the number of (length()) unique() values of birth_year. This is just for data checking, so the variable name does not need to be meaningful (x is fine).
Then, filter() the data to find individuals who have more than one value of birth_year:
waves |>
  group_by(id) |>
  mutate(x = length(unique(birth_year))) |>
  filter(x > 1)
# A tibble: 3 × 5
# Groups: id [1]
id wave birth_year grade x
<int> <int> <int> <int> <int>
1 5 1 2002 2 2
2 5 2 2003 3 2
3 5 3 2002 4 2
This individual has two values of birth_year: 2002 and 2003.
Next, check whether each individual had three different values of grade, as a way of checking whether they were in a different grade each year:
waves |>
  group_by(id) |>
  mutate(x = length(unique(grade))) |>
  filter(x != 3)
# A tibble: 3 × 5
# Groups: id [1]
id wave birth_year grade x
<int> <int> <int> <int> <int>
1 6 1 2000 4 2
2 6 2 2000 4 2
3 6 3 2000 6 2
This individual seems to have repeated grade four and then skipped to grade six.
Finally, check whether grade increased by one each year. For this, use the lag() function, which gets the value from the previous row. Because this function is based on position, be sure to sort the data with arrange() first in case any records are not in chronological order.
Subtracting the previous value of grade (lag(grade)) from the current value of grade answers the question, “How many grades did this individual advance between years?” The expected answer is one, but other answers are possible for repeated and skipped grades.
waves |>
  group_by(id) |>
  arrange(wave) |>
  mutate(x = grade - lag(grade)) |>
  filter(x != 1)
# A tibble: 2 × 5
# Groups: id [1]
id wave birth_year grade x
<int> <int> <int> <int> <int>
1 6 2 2000 4 0
2 6 3 2000 6 2
Only two rows are returned here because lag() returns NA for the first observation for each id, since there is no previous value:
lag(1:3)
[1] NA 1 2
Any logical comparison with NA returns NA, and filter() only returns rows where the expression evaluates to TRUE.
This individual was also identified in the previous block because their grade progression of 4-4-6 only has two unique values. They advanced zero grades from wave one to wave two, and they advanced two grades from wave two to wave three.
We have identified two problems with the dataset: ID 5’s multiple birth years, and ID 6’s grade progression. Now, we need to handle them.
First, fix the multiple birth year problem. We could fix this in several ways:
Drop this individual from the dataset:
waves |>
  filter(id != 5)
Assume their first value is most likely the correct value, and use that one:
waves |>
  group_by(id) |>
  mutate(birth_year = first(birth_year)) |>
  ungroup()
Assume their last value is most likely the correct value, and use that one:
waves |>
  group_by(id) |>
  mutate(birth_year = last(birth_year)) |>
  ungroup()
Assume their most common value is most likely the correct value, and use that one. R has no built-in “mode” function for finding the most common value (mode() is something else entirely), so a workaround is needed:
waves |>
  # get counts of birth_year by id
  group_by(id, birth_year) |>
  mutate(n = n()) |>
  ungroup() |>
  # use the value of birth_year associated with the highest count
  group_by(id) |>
  mutate(x = birth_year[n == max(n)][1]) |>
  ungroup()
The [1] in birth_year[n == max(n)][1] returns the first value in case of a tie, which here would happen with three different values of birth_year, a count of one each. Another approach could be to randomly pick one of the modes with sample(birth_year[n == max(n)], 1). This would require setting a seed with set.seed() beforehand to get the same random value each time.
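Another workaround, not used above, is a small helper that tabulates the values and returns the most frequent one. The name get_mode is just illustrative:

# most frequent value; ties go to the value that appears first
get_mode <- function(x) {
  ux <- unique(x)
  ux[which.max(tabulate(match(x, ux)))]
}

waves |>
  group_by(id) |>
  mutate(birth_year = get_mode(birth_year)) |>
  ungroup()

Here match(x, ux) maps each value to its position in unique(x), tabulate() counts those positions, and which.max() picks the first maximum.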
For any of these approaches, transparency in reporting is a must. Clearly explain the decisions made in any data modification and justify them. You should also run a sensitivity analysis by using multiple approaches (use first/last/mode) to check whether inferences differ as a result of the data preparation process.
Now, for ID 6 with a nonstandard progression in the values of grade, we have a few choices:
Leave it as-is. Assume this is accurate, or contact the data provider to confirm its accuracy.
Drop this individual from the dataset:
waves |>
  filter(id != 6)
Force the grades to progress by one each year. This can be done manually for just this particular observation:
waves |>
  mutate(grade = ifelse(id == 6 & wave == 2, 5, grade))
Applying a rule to the entire dataframe should be done with care, and there is no one-size-fits-all approach. The check of grade progression would also flag individuals who have
- Missing observations (e.g., grade 4 in wave 1, no observation for wave 2, and grade 6 in wave 3)
- The same grade every year (e.g., 4-4-4)
  - With which wave should that value of grade be associated? Should their progression be 4-5-6, 3-4-5, or 2-3-4?
- In a dataset with more observations, multiple series of consecutive grades (e.g., 2-3-4-7-8-9)
  - Should the second set be moved down to the first (2-3-4-5-6-7), the first set moved up to the second (4-5-6-7-8-9), or something else?
- Aberrant patterns (e.g., 1-1-7 or 6-2-9)
14.8 Exercises
14.8.1 Fundamental
Use the chickwts dataset.
Is this a balanced experiment (same number of individuals/chickens in each condition/feed)?
Which feed was associated with the largest variation (standard deviation) in weight?
Without reducing the number of rows, add a column with the range (maximum - minimum) of weight for each feed.
14.8.2 Extended
Use this dataset:
dat <- readRDS(url("https://sscc.wisc.edu/sscc/pubs/dwr/data/aggregate_extended.rds"))
Check the quality of the data. Each id should have one observation per year for ten years. year and age_on_jan1 should increase by one each year. birthday and birth_state should not vary within individual.