<- read.csv("https://www.sscc.wisc.edu/sscc/pubs/dwr/data/reshape_exercise_long.csv")
long <- read.csv("https://www.sscc.wisc.edu/sscc/pubs/dwr/data/reshape_exercise_wide.csv") wide
15 Reshaping Dataframes
15.1 Warm-Up
Load these two datasets into R:
long
and wide
have the same data but are organized differently.
Work with one or two partners. At least one of you should attempt #1 with both long
and wide
, and at least one of you should attempt #2 with both long
and wide
. If you get stuck, try the task with the other dataset. Then discuss with your partner(s): was your task easier with one of the two datasets? Which one?
- Calculate each ID’s average across years, resulting in this output:
id avg
1 13.75
2 6.25
3 7.25
4 9.50
- Add a variable with each ID’s sum of values from 2000 and 2001.
15.2 Outcomes
Objective: To change the shape of data, from long to wide or from wide to long.
Why it matters: Different data wrangling operations and analyses are easier done in either long or wide format, so reshaping data is important in preparing and modeling data.
Learning outcomes:
Fundamental Skills | Extended Skills |
Key functions:
pivot_wider()
pivot_longer()
15.3 Data Shapes
Dataframes can be organized in different ways for different purposes. Dataframes often come in less-than-ideal formats, especially when you are using secondary data. It is important to know how to rearrange the data to prepare it for tables or for plotting with ggplot2
. Data comes in two primary shapes: wide and long.
Data is wide when a row has more than one observation, and the units of observation (e.g., individuals, countries, households) are on one row each.
You might run into this format if you work with survey or assessment data, or if you have ever downloaded data from Qualtrics. In the example below, each row corresponds to a single person, and each column is a different observation for that person.
ID | Income2000 | Income2001 | Income2002 | … |
---|---|---|---|---|
001 | 50000 | 52000 | 56000 | |
002 | 0 | 30000 | 31000 | |
003 | 6800 | 6400 | 6850 | |
… |
Data is long when a row has only one observation, but the units of observation are repeated down a column.
Longitudinal data is often in the long format. You might have a column where ID numbers are repeated, a column marking when each data point was observed, and another column with observed values.
Here is the same dataset as above, now arranged into a long format:
ID | Year | Income |
---|---|---|
001 | 2000 | 50000 |
001 | 2001 | 52000 |
001 | 2002 | 56000 |
002 | 2000 | 0 |
002 | 2001 | 30000 |
002 | 2002 | 31000 |
003 | 2000 | 6800 |
003 | 2001 | 6400 |
003 | 2002 | 6850 |
… |
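If you would like to experiment with these two shapes yourself before moving on, here is a minimal sketch that builds toy versions of the tables above with base R (the object names income_wide and income_long are just illustrative; this is not the warm-up data):
# Toy wide version: one row per person, one column per year
income_wide <- data.frame(
  ID = c("001", "002", "003"),
  Income2000 = c(50000, 0, 6800),
  Income2001 = c(52000, 30000, 6400),
  Income2002 = c(56000, 31000, 6850)
)

# Toy long version: one row per person-year
income_long <- data.frame(
  ID = rep(c("001", "002", "003"), each = 3),
  Year = rep(2000:2002, times = 3),
  Income = c(50000, 52000, 56000, 0, 30000, 31000, 6800, 6400, 6850)
)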
15.4 Long to Wide
Load the tidyverse
.
library(tidyverse)
For a first example, we will reshape the built-in population
dataframe to wide.
The first few rows of population
currently looks like this:
population |>
  head()
# A tibble: 6 × 3
country year population
<chr> <dbl> <dbl>
1 Afghanistan 1995 17586073
2 Afghanistan 1996 18415307
3 Afghanistan 1997 19021226
4 Afghanistan 1998 19496836
5 Afghanistan 1999 19987071
6 Afghanistan 2000 20595360
The units of observation, country
, are repeated, and there is one row for each value of year
.
Reshaping a dataset to wide format creates more columns. We need to decide two things:
- names_from: where those new columns should get their names from
- values_from: where the values of those new columns should come from
Here, we can make one column for each year, named 1995
, 1996
, and so on. The values in those columns should come from the current population
column. The result will be a single row for each country, with one column for each year with its population value.
Use the pivot_wider()
function and specify the source of the new column names (names_from
) and the source of the values for those columns (values_from
):
population |>
  pivot_wider(names_from = year,
              values_from = population)
# A tibble: 219 × 20
country `1995` `1996` `1997` `1998` `1999` `2000` `2001` `2002` `2003` `2004`
<chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 Afghan… 1.76e7 1.84e7 1.90e7 1.95e7 2.00e7 2.06e7 2.13e7 2.22e7 2.31e7 2.40e7
2 Albania 3.36e6 3.34e6 3.33e6 3.33e6 3.32e6 3.30e6 3.29e6 3.26e6 3.24e6 3.22e6
3 Algeria 2.93e7 2.98e7 3.03e7 3.08e7 3.13e7 3.17e7 3.22e7 3.26e7 3.30e7 3.35e7
4 Americ… 5.29e4 5.39e4 5.49e4 5.59e4 5.68e4 5.75e4 5.82e4 5.87e4 5.91e4 5.93e4
5 Andorra 6.39e4 6.43e4 6.41e4 6.38e4 6.41e4 6.54e4 6.80e4 7.16e4 7.56e4 7.91e4
6 Angola 1.21e7 1.25e7 1.28e7 1.31e7 1.35e7 1.39e7 1.44e7 1.49e7 1.54e7 1.60e7
7 Anguil… 9.81e3 1.01e4 1.03e4 1.05e4 1.08e4 1.11e4 1.14e4 1.17e4 1.20e4 1.23e4
8 Antigu… 6.83e4 7.02e4 7.22e4 7.42e4 7.60e4 7.76e4 7.90e4 8.00e4 8.09e4 8.17e4
9 Argent… 3.48e7 3.53e7 3.57e7 3.61e7 3.65e7 3.69e7 3.73e7 3.76e7 3.80e7 3.83e7
10 Armenia 3.22e6 3.17e6 3.14e6 3.11e6 3.09e6 3.08e6 3.06e6 3.05e6 3.04e6 3.03e6
# ℹ 209 more rows
# ℹ 9 more variables: `2005` <dbl>, `2006` <dbl>, `2007` <dbl>, `2008` <dbl>,
# `2009` <dbl>, `2010` <dbl>, `2011` <dbl>, `2012` <dbl>, `2013` <dbl>
The reshape worked just fine, but one issue we will have later is referring to the new column names. They start with a number, so we would need to reference them with backticks: `1995`
. To make column names easier to use, we can prefix them with some string, and this is done during the reshaping with the names_prefix
argument. Add “pop” to the beginning of the column names:
population |>
  pivot_wider(names_from = year,
              names_prefix = "pop",
              values_from = population)
# A tibble: 219 × 20
country pop1995 pop1996 pop1997 pop1998 pop1999 pop2000 pop2001 pop2002
<chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 Afghanistan 1.76e7 1.84e7 1.90e7 1.95e7 2.00e7 2.06e7 2.13e7 2.22e7
2 Albania 3.36e6 3.34e6 3.33e6 3.33e6 3.32e6 3.30e6 3.29e6 3.26e6
3 Algeria 2.93e7 2.98e7 3.03e7 3.08e7 3.13e7 3.17e7 3.22e7 3.26e7
4 American Sam… 5.29e4 5.39e4 5.49e4 5.59e4 5.68e4 5.75e4 5.82e4 5.87e4
5 Andorra 6.39e4 6.43e4 6.41e4 6.38e4 6.41e4 6.54e4 6.80e4 7.16e4
6 Angola 1.21e7 1.25e7 1.28e7 1.31e7 1.35e7 1.39e7 1.44e7 1.49e7
7 Anguilla 9.81e3 1.01e4 1.03e4 1.05e4 1.08e4 1.11e4 1.14e4 1.17e4
8 Antigua and … 6.83e4 7.02e4 7.22e4 7.42e4 7.60e4 7.76e4 7.90e4 8.00e4
9 Argentina 3.48e7 3.53e7 3.57e7 3.61e7 3.65e7 3.69e7 3.73e7 3.76e7
10 Armenia 3.22e6 3.17e6 3.14e6 3.11e6 3.09e6 3.08e6 3.06e6 3.05e6
# ℹ 209 more rows
# ℹ 11 more variables: pop2003 <dbl>, pop2004 <dbl>, pop2005 <dbl>,
# pop2006 <dbl>, pop2007 <dbl>, pop2008 <dbl>, pop2009 <dbl>, pop2010 <dbl>,
# pop2011 <dbl>, pop2012 <dbl>, pop2013 <dbl>
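One payoff of the prefix: the prefixed names can be used directly in later code, while the unprefixed names need backticks every time. A small sketch, assuming we save each wide reshape in a hypothetical object (pop_wide and pop_wide_prefixed are not defined elsewhere in this chapter):
# Save both wide versions for comparison (hypothetical object names)
pop_wide <- population |>
  pivot_wider(names_from = year,
              values_from = population)

pop_wide_prefixed <- population |>
  pivot_wider(names_from = year,
              names_prefix = "pop",
              values_from = population)

# Without the prefix, the numeric-looking names must be wrapped in backticks
pop_wide |>
  mutate(change = `1996` - `1995`)

# With the prefix, the columns behave like any other name
pop_wide_prefixed |>
  mutate(change = pop1996 - pop1995)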
15.4.1 Missing Data
When reshaping a dataframe to wide, new missing data may appear.
The original population
has no missing data:
population |>
  is.na() |>
  sum()
[1] 0
But our wide population
has missing data:
population |>
  pivot_wider(names_from = year,
              names_prefix = "pop",
              values_from = population) |>
  is.na() |>
  sum()
[1] 101
Where did the missing data come from?
In long format, missing data is represented by NA
or by missing rows.
Imagine two individuals each take two tests, and we record their scores. Individual 02 does not take test b, so their score is missing. That could be represented in one of two ways.
They could have an observation for test b with a score of NA
:
id | test | score |
---|---|---|
01 | a | 43 |
01 | b | 52 |
02 | a | 46 |
02 | b | NA |
Or the row could be omitted:
id | test | score |
---|---|---|
01 | a | 43 |
01 | b | 52 |
02 | a | 46 |
However, when we reshape this dataset to wide, individual 02 must have NA
for test b because the test_b
column cannot be shorter than the other columns:
id | test_a | test_b |
---|---|---|
01 | 43 | 52 |
02 | 46 | NA |
When we reshape from wide to long, we will have an option to turn the dataframe into one that uses NA
s or that omits rows with missing data.
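To see this in code, here is a small sketch that builds the toy test data with individual 02's test b row omitted (the object name scores is just illustrative) and reshapes it to wide. The NA appears in the result even though the long data contained no NA values:
# Long data with individual 02's test b row omitted entirely
scores <- tribble(
  ~id,  ~test, ~score,
  "01", "a",   43,
  "01", "b",   52,
  "02", "a",   46
)

# Reshaping to wide forces an NA for the missing id/test combination
scores |>
  pivot_wider(names_from = test,
              names_prefix = "test_",
              values_from = score)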
For another example, look at the built-in fish_encounters
dataframe, which is in a long format:
fish_encounters |>
  head()
# A tibble: 6 × 3
fish station seen
<fct> <fct> <int>
1 4842 Release 1
2 4842 I80_1 1
3 4842 Lisbon 1
4 4842 Rstr 1
5 4842 Base_TD 1
6 4842 BCE 1
Reshape this data to wide with pivot_wider()
, making the new column names_from
the station
column, and getting values_from
the seen
column.
fish_encounters |>
  pivot_wider(names_from = station,
              values_from = seen)
# A tibble: 19 × 12
fish Release I80_1 Lisbon Rstr Base_TD BCE BCW BCE2 BCW2 MAE MAW
<fct> <int> <int> <int> <int> <int> <int> <int> <int> <int> <int> <int>
1 4842 1 1 1 1 1 1 1 1 1 1 1
2 4843 1 1 1 1 1 1 1 1 1 1 1
3 4844 1 1 1 1 1 1 1 1 1 1 1
4 4845 1 1 1 1 1 NA NA NA NA NA NA
5 4847 1 1 1 NA NA NA NA NA NA NA NA
6 4848 1 1 1 1 NA NA NA NA NA NA NA
7 4849 1 1 NA NA NA NA NA NA NA NA NA
8 4850 1 1 NA 1 1 1 1 NA NA NA NA
9 4851 1 1 NA NA NA NA NA NA NA NA NA
10 4854 1 1 NA NA NA NA NA NA NA NA NA
11 4855 1 1 1 1 1 NA NA NA NA NA NA
12 4857 1 1 1 1 1 1 1 1 1 NA NA
13 4858 1 1 1 1 1 1 1 1 1 1 1
14 4859 1 1 1 1 1 NA NA NA NA NA NA
15 4861 1 1 1 1 1 1 1 1 1 1 1
16 4862 1 1 1 1 1 1 1 1 1 NA NA
17 4863 1 1 NA NA NA NA NA NA NA NA NA
18 4864 1 1 NA NA NA NA NA NA NA NA NA
19 4865 1 1 1 NA NA NA NA NA NA NA NA
Now each row is a fish, and the columns indicate whether that fish was seen at a station. Here, “seen” is a 1
when a fish was observed, and NA
when the fish was not observed. However, we typically code indicators as 1 and 0. We have control over how the missing values are filled in when making data wide. To use a different value, such as 0, specify values_fill = 0
:
fish_encounters |>
  pivot_wider(names_from = station,
              values_from = seen,
              values_fill = 0)
# A tibble: 19 × 12
fish Release I80_1 Lisbon Rstr Base_TD BCE BCW BCE2 BCW2 MAE MAW
<fct> <int> <int> <int> <int> <int> <int> <int> <int> <int> <int> <int>
1 4842 1 1 1 1 1 1 1 1 1 1 1
2 4843 1 1 1 1 1 1 1 1 1 1 1
3 4844 1 1 1 1 1 1 1 1 1 1 1
4 4845 1 1 1 1 1 0 0 0 0 0 0
5 4847 1 1 1 0 0 0 0 0 0 0 0
6 4848 1 1 1 1 0 0 0 0 0 0 0
7 4849 1 1 0 0 0 0 0 0 0 0 0
8 4850 1 1 0 1 1 1 1 0 0 0 0
9 4851 1 1 0 0 0 0 0 0 0 0 0
10 4854 1 1 0 0 0 0 0 0 0 0 0
11 4855 1 1 1 1 1 0 0 0 0 0 0
12 4857 1 1 1 1 1 1 1 1 1 0 0
13 4858 1 1 1 1 1 1 1 1 1 1 1
14 4859 1 1 1 1 1 0 0 0 0 0 0
15 4861 1 1 1 1 1 1 1 1 1 1 1
16 4862 1 1 1 1 1 1 1 1 1 0 0
17 4863 1 1 0 0 0 0 0 0 0 0 0
18 4864 1 1 0 0 0 0 0 0 0 0 0
19 4865 1 1 1 0 0 0 0 0 0 0 0
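As a quick check that values_fill took care of the new missing values, we could repeat the is.na() count from the section above (a sketch; it should return 0):
# Count missing values in the filled-in wide version
fish_encounters |>
  pivot_wider(names_from = station,
              values_from = seen,
              values_fill = 0) |>
  is.na() |>
  sum()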
15.4.2 Multiple Columns of Values
population
and fish_encounters
only have a single value associated with each of the names.
Some datasets have multiple values for each name. The built-in us_rent_income has one row for each combination of state (NAME) and measure, either income or rent (recorded in variable). On each row, it has two variables, estimate and moe.
us_rent_income
# A tibble: 104 × 5
GEOID NAME variable estimate moe
<chr> <chr> <chr> <dbl> <dbl>
1 01 Alabama income 24476 136
2 01 Alabama rent 747 3
3 02 Alaska income 32940 508
4 02 Alaska rent 1200 13
5 04 Arizona income 27517 148
6 04 Arizona rent 972 4
7 05 Arkansas income 23789 165
8 05 Arkansas rent 709 5
9 06 California income 29454 109
10 06 California rent 1358 3
# ℹ 94 more rows
If we only had one of these, reshaping to wide would follow the same process as above. Drop the moe
column and get names_from
the variable
column and values_from
the estimate
column:
us_rent_income |>
  select(-moe) |>
  pivot_wider(names_from = variable,
              values_from = estimate)
# A tibble: 52 × 4
GEOID NAME income rent
<chr> <chr> <dbl> <dbl>
1 01 Alabama 24476 747
2 02 Alaska 32940 1200
3 04 Arizona 27517 972
4 05 Arkansas 23789 709
5 06 California 29454 1358
6 08 Colorado 32401 1125
7 09 Connecticut 35326 1123
8 10 Delaware 31560 1076
9 11 District of Columbia 43198 1424
10 12 Florida 25952 1077
# ℹ 42 more rows
Here, we got values from a single column, estimate
.
Since we have two columns with data, estimate
and moe
, we can provide more than one column to the values_from
argument. In that case, pivot_wider()
will prefix the new column names (from the names_from
column) with the current names of the values_from
columns. We will then get four columns, one for each combination of income/rent and estimate/moe:
us_rent_income |>
  pivot_wider(names_from = variable,
              values_from = c(estimate, moe))
# A tibble: 52 × 6
GEOID NAME estimate_income estimate_rent moe_income moe_rent
<chr> <chr> <dbl> <dbl> <dbl> <dbl>
1 01 Alabama 24476 747 136 3
2 02 Alaska 32940 1200 508 13
3 04 Arizona 27517 972 148 4
4 05 Arkansas 23789 709 165 5
5 06 California 29454 1358 109 3
6 08 Colorado 32401 1125 109 5
7 09 Connecticut 35326 1123 195 5
8 10 Delaware 31560 1076 247 10
9 11 District of Columbia 43198 1424 681 17
10 12 Florida 25952 1077 70 3
# ℹ 42 more rows
15.5 Wide to Long
We can reshape wide data to long with the pivot_longer()
function.
The built-in relig_income
is in a wide format:
relig_income |>
  head()
# A tibble: 6 × 11
religion `<$10k` `$10-20k` `$20-30k` `$30-40k` `$40-50k` `$50-75k` `$75-100k`
<chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 Agnostic 27 34 60 81 76 137 122
2 Atheist 12 27 37 52 35 70 73
3 Buddhist 27 21 30 34 33 58 62
4 Catholic 418 617 732 670 638 1116 949
5 Don’t kn… 15 14 15 11 10 35 21
6 Evangeli… 575 869 1064 982 881 1486 949
# ℹ 3 more variables: `$100-150k` <dbl>, `>150k` <dbl>,
# `Don't know/refused` <dbl>
When reshaping to long, we need to specify three things:
- cols: which columns should be made long
- names_to: the name of the new column that the current column names will be gathered into
- values_to: the name of the new column that the values will be gathered into
The columns that are made long are generally everything except the identifier column. Here, we only have one identifier column, religion. The cols argument supports tidyselect syntax, so the column selection methods we learned in the chapter on subsetting can be used here. To select everything except one column, specify -religion. Note that this does not drop religion from the dataset; rather, it specifies which columns are to be made long: everything but religion.
Gather the current column names, which describe income levels, into a column called "income" with names_to = "income", and put their values, which are the number of respondents, into a column called "n" with values_to = "n":
relig_income |>
  pivot_longer(cols = -religion,
               names_to = "income",
               values_to = "n")
# A tibble: 180 × 3
religion income n
<chr> <chr> <dbl>
1 Agnostic <$10k 27
2 Agnostic $10-20k 34
3 Agnostic $20-30k 60
4 Agnostic $30-40k 81
5 Agnostic $40-50k 76
6 Agnostic $50-75k 137
7 Agnostic $75-100k 122
8 Agnostic $100-150k 109
9 Agnostic >150k 84
10 Agnostic Don't know/refused 96
# ℹ 170 more rows
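One quick illustration of why the long shape is useful: grouped summaries now take a single step. This sketch counts the total number of respondents in each religion (just one of many summaries we could compute):
# Summarize the long data by religion
relig_income |>
  pivot_longer(cols = -religion,
               names_to = "income",
               values_to = "n") |>
  group_by(religion) |>
  summarize(total = sum(n))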
15.5.1 Cleaning Up
The built-in billboard
data is in a wide format with one row per song and columns with the song’s rank each week.
billboard
# A tibble: 317 × 79
artist track date.entered wk1 wk2 wk3 wk4 wk5 wk6 wk7 wk8
<chr> <chr> <date> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 2 Pac Baby… 2000-02-26 87 82 72 77 87 94 99 NA
2 2Ge+her The … 2000-09-02 91 87 92 NA NA NA NA NA
3 3 Doors D… Kryp… 2000-04-08 81 70 68 67 66 57 54 53
4 3 Doors D… Loser 2000-10-21 76 76 72 69 67 65 55 59
5 504 Boyz Wobb… 2000-04-15 57 34 25 17 17 31 36 49
6 98^0 Give… 2000-08-19 51 39 34 26 26 19 2 2
7 A*Teens Danc… 2000-07-08 97 97 96 95 100 NA NA NA
8 Aaliyah I Do… 2000-01-29 84 62 51 41 38 35 35 38
9 Aaliyah Try … 2000-03-18 59 53 38 28 21 18 16 14
10 Adams, Yo… Open… 2000-08-26 76 76 74 69 68 67 61 58
# ℹ 307 more rows
# ℹ 68 more variables: wk9 <dbl>, wk10 <dbl>, wk11 <dbl>, wk12 <dbl>,
# wk13 <dbl>, wk14 <dbl>, wk15 <dbl>, wk16 <dbl>, wk17 <dbl>, wk18 <dbl>,
# wk19 <dbl>, wk20 <dbl>, wk21 <dbl>, wk22 <dbl>, wk23 <dbl>, wk24 <dbl>,
# wk25 <dbl>, wk26 <dbl>, wk27 <dbl>, wk28 <dbl>, wk29 <dbl>, wk30 <dbl>,
# wk31 <dbl>, wk32 <dbl>, wk33 <dbl>, wk34 <dbl>, wk35 <dbl>, wk36 <dbl>,
# wk37 <dbl>, wk38 <dbl>, wk39 <dbl>, wk40 <dbl>, wk41 <dbl>, wk42 <dbl>, …
Reshape the data so that it has one row per week. The columns to make long are those that start with the letters “wk”, so specify cols = starts_with("wk")
. Gather those into a column named “week” with names_to = "week"
and put the values from those columns into a column named “rank” with values_to = "rank"
:
billboard |>
  pivot_longer(cols = starts_with("wk"),
               names_to = "week",
               values_to = "rank")
# A tibble: 24,092 × 5
artist track date.entered week rank
<chr> <chr> <date> <chr> <dbl>
1 2 Pac Baby Don't Cry (Keep... 2000-02-26 wk1 87
2 2 Pac Baby Don't Cry (Keep... 2000-02-26 wk2 82
3 2 Pac Baby Don't Cry (Keep... 2000-02-26 wk3 72
4 2 Pac Baby Don't Cry (Keep... 2000-02-26 wk4 77
5 2 Pac Baby Don't Cry (Keep... 2000-02-26 wk5 87
6 2 Pac Baby Don't Cry (Keep... 2000-02-26 wk6 94
7 2 Pac Baby Don't Cry (Keep... 2000-02-26 wk7 99
8 2 Pac Baby Don't Cry (Keep... 2000-02-26 wk8 NA
9 2 Pac Baby Don't Cry (Keep... 2000-02-26 wk9 NA
10 2 Pac Baby Don't Cry (Keep... 2000-02-26 wk10 NA
# ℹ 24,082 more rows
The reshape was successful, but the dataset could be cleaned up in a few ways:
- Remove the prefix "wk" from the values in the week column
  - The character prefix made the columns syntactic in wide format, but it prevents us from treating week as an ordered numeric variable in long format
  - Remove it with names_prefix = "wk"
- Coerce week to numeric after removing the prefix
  - The names_to column is character by default
  - Change its type by applying a function with names_transform = as.numeric
  - This could also be done later with mutate(week = as.numeric(week))
- Optionally, drop observations with missing data
  - Missing values are required in wide format when there is no existing combination of the identifier variable (track) and a level of the names variable, but we can drop these rows in long format, as discussed in the section on missing data above
  - Drop observations with values_drop_na = T
billboard |>
  pivot_longer(cols = starts_with("wk"),
               names_prefix = "wk",
               names_to = "week",
               names_transform = as.numeric,
               values_to = "rank",
               values_drop_na = T)
# A tibble: 5,307 × 5
artist track date.entered week rank
<chr> <chr> <date> <dbl> <dbl>
1 2 Pac Baby Don't Cry (Keep... 2000-02-26 1 87
2 2 Pac Baby Don't Cry (Keep... 2000-02-26 2 82
3 2 Pac Baby Don't Cry (Keep... 2000-02-26 3 72
4 2 Pac Baby Don't Cry (Keep... 2000-02-26 4 77
5 2 Pac Baby Don't Cry (Keep... 2000-02-26 5 87
6 2 Pac Baby Don't Cry (Keep... 2000-02-26 6 94
7 2 Pac Baby Don't Cry (Keep... 2000-02-26 7 99
8 2Ge+her The Hardest Part Of ... 2000-09-02 1 91
9 2Ge+her The Hardest Part Of ... 2000-09-02 2 87
10 2Ge+her The Hardest Part Of ... 2000-09-02 3 92
# ℹ 5,297 more rows
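With one row per song-week, per-song summaries are straightforward. A sketch, saving the cleaned result in a hypothetical object billboard_long and then finding each song's best chart position and number of weeks on the chart:
# Hypothetical object: the cleaned long version from the code above
billboard_long <- billboard |>
  pivot_longer(cols = starts_with("wk"),
               names_prefix = "wk",
               names_to = "week",
               names_transform = as.numeric,
               values_to = "rank",
               values_drop_na = T)

# Best (lowest) chart position and number of weeks charted per song
billboard_long |>
  group_by(artist, track) |>
  summarize(best_rank = min(rank),
            weeks_on_chart = n(),
            .groups = "drop")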
15.5.2 Multiple Datasets in One
The built-in construction
dataframe has counts of completed units in each month:
construction
# A tibble: 9 × 9
Year Month `1 unit` `2 to 4 units` `5 units or more` Northeast Midwest South
<dbl> <chr> <dbl> <lgl> <dbl> <dbl> <dbl> <dbl>
1 2018 Janua… 859 NA 348 114 169 596
2 2018 Febru… 882 NA 400 138 160 655
3 2018 March 862 NA 356 150 154 595
4 2018 April 797 NA 447 144 196 613
5 2018 May 875 NA 364 90 169 673
6 2018 June 867 NA 342 76 170 610
7 2018 July 829 NA 360 108 183 594
8 2018 August 939 NA 286 90 205 649
9 2018 Septe… 835 NA 304 117 175 560
# ℹ 1 more variable: West <dbl>
This is actually two datasets in one. In each month, some number of units were completed, and then this number is broken down by size (1, 2-4, 5+ units) and by region (Northeast, Midwest, South, West). Size is unassociated with region, so each unit is counted twice in the current arrangement. If we had variables like "1 unit - Northeast", "2 to 4 units - Northeast", and "1 unit - Midwest", then units would not be counted twice, and we could simply reshape it to long.
You may encounter this format in government or administrative records, where some total is broken down by various subcategories (age group, student status, level of education, race and ethnicity, etc.), but not by the intersection of those categories, perhaps to reduce the risk of identifiability.
For this dataset, it would be best to separate it into two when reshaping it to long. Drop either the region or size variables before reshaping to get either dataset:
construction |>
  select(-c(Northeast, Midwest, South, West)) |>
  pivot_longer(cols = -c(Year, Month),
               names_to = "Size",
               values_to = "n")
# A tibble: 27 × 4
Year Month Size n
<dbl> <chr> <chr> <dbl>
1 2018 January 1 unit 859
2 2018 January 2 to 4 units NA
3 2018 January 5 units or more 348
4 2018 February 1 unit 882
5 2018 February 2 to 4 units NA
6 2018 February 5 units or more 400
7 2018 March 1 unit 862
8 2018 March 2 to 4 units NA
9 2018 March 5 units or more 356
10 2018 April 1 unit 797
# ℹ 17 more rows
construction |>
  select(-c(`1 unit`, `2 to 4 units`, `5 units or more`)) |>
  pivot_longer(cols = -c(Year, Month),
               names_to = "Region",
               values_to = "n")
# A tibble: 36 × 4
Year Month Region n
<dbl> <chr> <chr> <dbl>
1 2018 January Northeast 114
2 2018 January Midwest 169
3 2018 January South 596
4 2018 January West 339
5 2018 February Northeast 138
6 2018 February Midwest 160
7 2018 February South 655
8 2018 February West 336
9 2018 March Northeast 150
10 2018 March Midwest 154
# ℹ 26 more rows
15.5.3 Multiple Variables in Names
If the columns in construction
did have names like “1 unit - Northeast” and “2 to 4 units - Northeast”, this would be a situation where the names contained multiple variables, following the pattern “size - region”.
A similar issue is present in the built-in household
dataframe:
household
# A tibble: 5 × 5
family dob_child1 dob_child2 name_child1 name_child2
<int> <date> <date> <chr> <chr>
1 1 1998-11-26 2000-01-29 Susan Jose
2 2 1996-06-22 NA Mark <NA>
3 3 2002-07-11 2004-04-05 Sam Seth
4 4 2004-10-10 2009-08-27 Craig Khai
5 5 2000-12-05 2005-02-28 Parker Gracie
The columns here follow a pattern of "variable_childnumber". The string "_child" is a reliable separator between the two pieces: the variable name ("dob" or "name") and the child number (1 or 2).
In pivot_longer()
, specify that separator in the names with names_sep = "_child"
. A separator in the names means that more than one name will be returned, equal to the number of separators plus one. A single occurrence of “_child” in the names returns two pieces, one on the left (“dob”/“name”) and one on the right (1/2). Because of this, names_to
will require two names. The second one is easy; 1/2 is the child number, which we can just call “child”.
For the piece on the left, we want to create two separate columns since “dob” and “name” are distinct variables. (We are also unable to smoothly combine them into a single column since one is a date and the other a character.) To dynamically create separate columns for each unique piece extracted after separating, specify “.value” as the first name. Since “dob” and “name” will be the names of the columns, we do not need a values_to
argument.
names_to = c(".value", "child")
in conjunction with names_sep = "_child"
will split the current column names by “_child”:
- dob_child1 becomes dob and 1
- dob_child2 becomes dob and 2
- name_child1 becomes name and 1
- name_child2 becomes name and 2
With “.value” in the first position of names_to
, new columns will be made from the first piece in each column name, one for each unique value (“dob” and “name”). With “child” in the second position of names_to
, all of the values will be gathered into a single column called “child”.
household |>
  pivot_longer(cols = -family,
               names_to = c(".value", "child"),
               names_sep = "_child")
# A tibble: 10 × 4
family child dob name
<int> <chr> <date> <chr>
1 1 1 1998-11-26 Susan
2 1 2 2000-01-29 Jose
3 2 1 1996-06-22 Mark
4 2 2 NA <NA>
5 3 1 2002-07-11 Sam
6 3 2 2004-04-05 Seth
7 4 1 2004-10-10 Craig
8 4 2 2009-08-27 Khai
9 5 1 2000-12-05 Parker
10 5 2 2005-02-28 Gracie
Some additional cleanup could be done to coerce the new child
variable to numeric and drop rows with missing data. Here, because multiple names are used and we only want to modify child
, use a named list following the pattern list(column = function)
: names_transform = list(child = as.numeric)
household |>
  pivot_longer(cols = -family,
               names_to = c(".value", "child"),
               names_sep = "_child",
               names_transform = list(child = as.numeric),
               values_drop_na = T)
# A tibble: 9 × 4
family child dob name
<int> <dbl> <date> <chr>
1 1 1 1998-11-26 Susan
2 1 2 2000-01-29 Jose
3 2 1 1996-06-22 Mark
4 3 1 2002-07-11 Sam
5 3 2 2004-04-05 Seth
6 4 1 2004-10-10 Craig
7 4 2 2009-08-27 Khai
8 5 1 2000-12-05 Parker
9 5 2 2005-02-28 Gracie
For another example with multiple variables within column names, import this dataset:
<- read.csv("https://sscc.wisc.edu/sscc/pubs/dwr/data/reshape_pattern.csv") dat
After id
, the column names follow the pattern “subject_qnumber_time”:
colnames(dat)
[1] "id" "math_q1_pre" "math_q1_post" "reading_q1_pre"
[5] "reading_q1_post" "science_q1_pre" "science_q1_post" "math_q2_pre"
[9] "math_q2_post" "reading_q2_pre" "reading_q2_post" "science_q2_pre"
[13] "science_q2_post" "math_q3_pre" "math_q3_post" "reading_q3_pre"
[17] "reading_q3_post" "science_q3_pre" "science_q3_post" "math_q4_pre"
[21] "math_q4_post" "reading_q4_pre" "reading_q4_post" "science_q4_pre"
[25] "science_q4_post" "math_q5_pre" "math_q5_post" "reading_q5_pre"
[29] "reading_q5_post" "science_q5_pre" "science_q5_post"
The easiest way to split these columns is by the underscore delimiter. Two underscores in each name means we need to provide three values to names_to. In the example with household, we created separate columns from some of the extracted pieces since date of birth and name are distinct variables. In this example, each of the pieces we get back can be gathered into columns called "subject" (math, reading, science), "question" (q1-q5), and "time" (pre/post). Because we do not use ".value" within names_to, we need to provide a values_to argument. The values in this dataframe are all indicators of whether a question was answered correctly, so call this new column "correct":
dat |>
  pivot_longer(cols = -id,
               names_to = c("subject", "question", "time"),
               names_sep = "_",
               values_to = "correct")
# A tibble: 300 × 5
id subject question time correct
<int> <chr> <chr> <chr> <int>
1 1 math q1 pre 1
2 1 math q1 post 0
3 1 reading q1 pre 0
4 1 reading q1 post 1
5 1 science q1 pre 1
6 1 science q1 post 1
7 1 math q2 pre 0
8 1 math q2 post 0
9 1 reading q2 pre 0
10 1 reading q2 post 0
# ℹ 290 more rows
A more flexible alternative to separating column names into multiple variables is to use a regex with names_pattern
(learn about regex in the chapter on character vectors). The advantage of names_pattern
over names_sep
is that it allows for multiple separators, like an underscore and then a period, or no separators at all (in the next example).
Here, we can specify a group (()
) with any character (.
) zero or more times (*
) with this regex: (.*)
. Following this is an underscore and the letter “q”, and then a group of characters, another underscore, and another group of characters. The complete regex for our column names is (.*)_q(.*)_(.*)
. Putting the “q” outside the group treats it as a separator that is not included in the resulting question
column (i.e., we get 1-5 instead of q1-q5). Reshape the dataframe with this pattern:
dat |>
  pivot_longer(cols = -id,
               names_to = c("subject", "question", "time"),
               names_pattern = "(.*)_q(.*)_(.*)",
               values_to = "correct")
# A tibble: 300 × 5
id subject question time correct
<int> <chr> <chr> <chr> <int>
1 1 math 1 pre 1
2 1 math 1 post 0
3 1 reading 1 pre 0
4 1 reading 1 post 1
5 1 science 1 pre 1
6 1 science 1 post 1
7 1 math 2 pre 0
8 1 math 2 post 0
9 1 reading 2 pre 0
10 1 reading 2 post 0
# ℹ 290 more rows
Now, extend the use of names_pattern
to a situation with a clear pattern but no reliable separator. Rename the columns in dat
to remove the underscores:
dat <- dat |>
  rename_with(\(x) str_remove_all(x, "_"))
Now our column names look like this:
colnames(dat)
[1] "id" "mathq1pre" "mathq1post" "readingq1pre"
[5] "readingq1post" "scienceq1pre" "scienceq1post" "mathq2pre"
[9] "mathq2post" "readingq2pre" "readingq2post" "scienceq2pre"
[13] "scienceq2post" "mathq3pre" "mathq3post" "readingq3pre"
[17] "readingq3post" "scienceq3pre" "scienceq3post" "mathq4pre"
[21] "mathq4post" "readingq4pre" "readingq4post" "scienceq4pre"
[25] "scienceq4post" "mathq5pre" "mathq5post" "readingq5pre"
[29] "readingq5post" "scienceq5pre" "scienceq5post"
The pattern now is “subjectqnumbertime”. The letter “q” separates the subject from the number, but number and time are next to each other, so we have to differentiate them in a different way.
Before renaming the columns, the column names followed this pattern: (.*)_q(.*)_(.*)
Now without underscores, they follow this pattern: (.*)q(.*)(.*)
We can see there will be a problem differentiating the second two groups since they are both any character zero or more times. In our case, the second group is always numbers, so we can specify in the regex that it must be digits with the special group [[:digit:]]
. (See ?base::regex
for a list of such groups.)
Now our names_pattern
can be specified as (.*)q([[:digit:]]*)(.*)
:
dat |>
  pivot_longer(cols = -id,
               names_to = c("subject", "q", "time"),
               names_pattern = "(.*)q([[:digit:]]*)(.*)",
               values_to = "correct")
# A tibble: 300 × 5
id subject q time correct
<int> <chr> <chr> <chr> <int>
1 1 math 1 pre 1
2 1 math 1 post 0
3 1 reading 1 pre 0
4 1 reading 1 post 1
5 1 science 1 pre 1
6 1 science 1 post 1
7 1 math 2 pre 0
8 1 math 2 post 0
9 1 reading 2 pre 0
10 1 reading 2 post 0
# ℹ 290 more rows
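The two pivots are inverses of each other, so we could also reshape this long result back to wide. pivot_wider() accepts more than one column in names_from, and names_sep controls how the pieces are combined. A sketch of the round trip (the rebuilt names follow the pattern subject_number_time, such as math_1_pre, rather than the original names):
# Reshape long, then back to wide with one column per subject/question/time
dat |>
  pivot_longer(cols = -id,
               names_to = c("subject", "q", "time"),
               names_pattern = "(.*)q([[:digit:]]*)(.*)",
               values_to = "correct") |>
  pivot_wider(names_from = c(subject, q, time),
              values_from = correct,
              names_sep = "_")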
15.6 Exercises
15.6.1 Fundamental
1. Reshape the built-in table1 to long so that it looks like table2.

2. Reshape the built-in table2 to wide so that it looks like table1.

3. Reshape the built-in economics to long so that its first four rows look like this:

   # A tibble: 4 × 3
     date       variable   value
     <date>     <chr>      <dbl>
   1 1967-07-01 pce         507.
   2 1967-07-01 pop       198712
   3 1967-07-01 psavert     12.6
   4 1967-07-01 uempmed      4.5

4. Reshape the built-in ChickWeight to wide so that its first four rows look like this:

   # A tibble: 4 × 14
     Chick Diet     t0    t2    t4    t6    t8   t10   t12   t14   t16   t18   t20
     <ord> <fct> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
   1 1     1        42    51    59    64    76    93   106   125   149   171   199
   2 2     1        40    49    58    72    84   103   122   138   162   187   209
   3 3     1        43    39    55    67    84    99   115   138   163   187   198
   4 4     1        42    49    56    67    74    87   102   108   136   154   160
   # ℹ 1 more variable: t21 <dbl>

5. Reshape the built-in sleep to wide so that its first four rows look like this:

   # A tibble: 4 × 3
     ID    group1 group2
     <fct>  <dbl>  <dbl>
   1 1        0.7    1.9
   2 2       -1.6    0.8
   3 3       -0.2    1.1
   4 4       -1.2    0.1
15.6.2 Extended
Use the following dataset, which is modified from some summer days listed on https://www.library.wisc.edu/locations/:
<- read.csv("https://www.sscc.wisc.edu/sscc/pubs/dwr/data/reshape_extended1.csv") libraries
Reshape it to wide so that its first two rows look like this:
# A tibble: 2 × 5
  library   open_sunday open_monday close_sunday close_monday
  <chr>     <chr>       <chr>       <chr>        <chr>
1 arboretum 12:30pm     9:30am      4pm          4pm
2 art       <NA>        11am        <NA>         4pm
Use the following dataset:
<- read.csv("https://www.sscc.wisc.edu/sscc/pubs/dwr/data/reshape_extended2.csv") attendance
Reshape it to long so that its first four rows look like this:
# A tibble: 4 × 6
       id status language semester year  attended
    <int> <chr>  <chr>    <chr>    <chr>    <int>
1 4223798 fac    r        fall     2021         0
2 4223798 fac    r        spring   2021         0
3 4223798 fac    stata    fall     2021         0
4 4223798 fac    stata    spring   2021         1