15  Reshaping Dataframes

15.1 Warm-Up

Load these two datasets into R:

long <- read.csv("https://www.sscc.wisc.edu/sscc/pubs/dwr/data/reshape_exercise_long.csv")
wide <- read.csv("https://www.sscc.wisc.edu/sscc/pubs/dwr/data/reshape_exercise_wide.csv")

long and wide have the same data but are organized differently.

Work with one or two partners. At least one of you should attempt #1 with both long and wide, and at least one of you should attempt #2 with both long and wide. If you get stuck, try the task with the other dataset. Then discuss with your partner(s): was your task easier with one of the two datasets? Which one?

  1. Calculate each ID’s average across years, resulting in this output:
 id   avg
  1 13.75
  2  6.25
  3  7.25
  4  9.50
  2. Add a variable with each ID’s sum of values from 2000 and 2001.

15.2 Outcomes

Objective: To change the shape of data, from long to wide or from wide to long.

Why it matters: Some data wrangling operations and analyses are easier with long data and others with wide data, so reshaping is an important step in preparing and modeling data.

Learning outcomes:

Fundamental Skills:

  • Reshape a dataframe from long to wide.

  • Reshape a dataframe from wide to long.

Extended Skills:

  • Widen data with multiple columns of values.

  • Lengthen data where column names contain multiple variables.

Key functions:

pivot_wider()
pivot_longer()

15.3 Data Shapes

Dataframes can be organized in different ways for different purposes, and they often arrive in less-than-ideal formats, especially when you are using secondary data. It is important to know how to rearrange the data to prepare it for tables or for plotting with ggplot2. Data comes in two primary shapes: wide and long.

Data is wide when a row has more than one observation, and the units of observation (e.g., individuals, countries, households) are on one row each.

You might run into this format if you work with survey or assessment data, or if you have ever downloaded data from Qualtrics. In the example below, each row corresponds to a single person, and each column is a different observation for that person.

ID Income2000 Income2001 Income2002
001 50000 52000 56000
002 0 30000 31000
003 6800 6400 6850

Data is long when a row has only one observation, but the units of observation are repeated down a column.

Longitudinal data is often in the long format. You might have a column where ID numbers are repeated, a column marking when each data point was observed, and another column with observed values.

Here is the same dataset as above, now arranged into a long format:

ID Year Income
001 2000 50000
001 2001 52000
001 2002 56000
002 2000 0
002 2001 30000
002 2002 31000
003 2000 6800
003 2001 6400
003 2002 6850
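If you want to experiment with the two shapes yourself, here is a minimal sketch that builds the income example above as a tibble and converts it between formats with the two pivot functions covered in the rest of this chapter (the object names are only illustrative; IDs are stored as character to keep the leading zeros):

library(tidyverse)

income_long <- tribble(
  ~ID,   ~Year, ~Income,
  "001", 2000,  50000,
  "001", 2001,  52000,
  "001", 2002,  56000,
  "002", 2000,  0,
  "002", 2001,  30000,
  "002", 2002,  31000,
  "003", 2000,  6800,
  "003", 2001,  6400,
  "003", 2002,  6850
)

# Long to wide: one row per ID, one column per year
income_wide <- income_long |> 
  pivot_wider(names_from = Year,
              names_prefix = "Income",
              values_from = Income)

# And back from wide to long
# (Year comes back as character; names_transform = as.numeric would restore it)
income_wide |> 
  pivot_longer(cols = -ID,
               names_to = "Year",
               names_prefix = "Income",
               values_to = "Income")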

15.4 Long to Wide

Load the tidyverse.

library(tidyverse)

For a first example, we will reshape the built-in population dataframe to wide.

The first few rows of population currently look like this:

population |> 
  head()
# A tibble: 6 × 3
  country      year population
  <chr>       <dbl>      <dbl>
1 Afghanistan  1995   17586073
2 Afghanistan  1996   18415307
3 Afghanistan  1997   19021226
4 Afghanistan  1998   19496836
5 Afghanistan  1999   19987071
6 Afghanistan  2000   20595360

The unit of observation, country, is repeated down the rows, and there is one row for each value of year.

Reshaping a dataset to wide format creates more columns. We need to decide two things:

  • names_from: where those new columns should get their names from
  • values_from: where the values of those new columns should come from

Here, we can make one column for each year, named 1995, 1996, and so on. The values in those columns should come from the current population column. The result will be a single row for each country, with one column for each year with its population value.

Use the pivot_wider() function and specify the source of the new column names (names_from) and the source of the values for those columns (values_from):

population |> 
  pivot_wider(names_from = year,
              values_from = population)
# A tibble: 219 × 20
   country `1995` `1996` `1997` `1998` `1999` `2000` `2001` `2002` `2003` `2004`
   <chr>    <dbl>  <dbl>  <dbl>  <dbl>  <dbl>  <dbl>  <dbl>  <dbl>  <dbl>  <dbl>
 1 Afghan… 1.76e7 1.84e7 1.90e7 1.95e7 2.00e7 2.06e7 2.13e7 2.22e7 2.31e7 2.40e7
 2 Albania 3.36e6 3.34e6 3.33e6 3.33e6 3.32e6 3.30e6 3.29e6 3.26e6 3.24e6 3.22e6
 3 Algeria 2.93e7 2.98e7 3.03e7 3.08e7 3.13e7 3.17e7 3.22e7 3.26e7 3.30e7 3.35e7
 4 Americ… 5.29e4 5.39e4 5.49e4 5.59e4 5.68e4 5.75e4 5.82e4 5.87e4 5.91e4 5.93e4
 5 Andorra 6.39e4 6.43e4 6.41e4 6.38e4 6.41e4 6.54e4 6.80e4 7.16e4 7.56e4 7.91e4
 6 Angola  1.21e7 1.25e7 1.28e7 1.31e7 1.35e7 1.39e7 1.44e7 1.49e7 1.54e7 1.60e7
 7 Anguil… 9.81e3 1.01e4 1.03e4 1.05e4 1.08e4 1.11e4 1.14e4 1.17e4 1.20e4 1.23e4
 8 Antigu… 6.83e4 7.02e4 7.22e4 7.42e4 7.60e4 7.76e4 7.90e4 8.00e4 8.09e4 8.17e4
 9 Argent… 3.48e7 3.53e7 3.57e7 3.61e7 3.65e7 3.69e7 3.73e7 3.76e7 3.80e7 3.83e7
10 Armenia 3.22e6 3.17e6 3.14e6 3.11e6 3.09e6 3.08e6 3.06e6 3.05e6 3.04e6 3.03e6
# ℹ 209 more rows
# ℹ 9 more variables: `2005` <dbl>, `2006` <dbl>, `2007` <dbl>, `2008` <dbl>,
#   `2009` <dbl>, `2010` <dbl>, `2011` <dbl>, `2012` <dbl>, `2013` <dbl>

The reshape worked just fine, but one issue we will have later is referring to the new column names. They start with a number, so we would need to reference them with backticks: `1995`. To make the column names easier to use, we can prefix them with some string, and this is done during the reshaping with the names_prefix argument. Add “pop” to the beginning of the column names:

population |> 
  pivot_wider(names_from = year,
              names_prefix = "pop",
              values_from = population)
# A tibble: 219 × 20
   country       pop1995 pop1996 pop1997 pop1998 pop1999 pop2000 pop2001 pop2002
   <chr>           <dbl>   <dbl>   <dbl>   <dbl>   <dbl>   <dbl>   <dbl>   <dbl>
 1 Afghanistan    1.76e7  1.84e7  1.90e7  1.95e7  2.00e7  2.06e7  2.13e7  2.22e7
 2 Albania        3.36e6  3.34e6  3.33e6  3.33e6  3.32e6  3.30e6  3.29e6  3.26e6
 3 Algeria        2.93e7  2.98e7  3.03e7  3.08e7  3.13e7  3.17e7  3.22e7  3.26e7
 4 American Sam…  5.29e4  5.39e4  5.49e4  5.59e4  5.68e4  5.75e4  5.82e4  5.87e4
 5 Andorra        6.39e4  6.43e4  6.41e4  6.38e4  6.41e4  6.54e4  6.80e4  7.16e4
 6 Angola         1.21e7  1.25e7  1.28e7  1.31e7  1.35e7  1.39e7  1.44e7  1.49e7
 7 Anguilla       9.81e3  1.01e4  1.03e4  1.05e4  1.08e4  1.11e4  1.14e4  1.17e4
 8 Antigua and …  6.83e4  7.02e4  7.22e4  7.42e4  7.60e4  7.76e4  7.90e4  8.00e4
 9 Argentina      3.48e7  3.53e7  3.57e7  3.61e7  3.65e7  3.69e7  3.73e7  3.76e7
10 Armenia        3.22e6  3.17e6  3.14e6  3.11e6  3.09e6  3.08e6  3.06e6  3.05e6
# ℹ 209 more rows
# ℹ 11 more variables: pop2003 <dbl>, pop2004 <dbl>, pop2005 <dbl>,
#   pop2006 <dbl>, pop2007 <dbl>, pop2008 <dbl>, pop2009 <dbl>, pop2010 <dbl>,
#   pop2011 <dbl>, pop2012 <dbl>, pop2013 <dbl>

15.4.1 Missing Data

When reshaping a dataframe to wide, new missing data may appear.

The original population has no missing data:

population |> 
  is.na() |> 
  sum()
[1] 0

But our wide population has missing data:

population |> 
  pivot_wider(names_from = year,
              names_prefix = "pop",
              values_from = population) |> 
  is.na() |> 
  sum()
[1] 101

Where did the missing data come from?

In long format, missing data is represented by NA or by missing rows.

Imagine two individuals each take two tests, and we record their scores. Individual 02 does not take test b, so their score is missing. That could be represented in one of two ways.

They could have an observation for test b with a score of NA:

id test score
01 a 43
01 b 52
02 a 46
02 b NA

Or the row could be omitted:

id test score
01 a 43
01 b 52
02 a 46

However, when we reshape this dataset to wide, individual 02 must have NA for test b because the test_b column cannot be shorter than the other columns:

id test_a test_b
01 43 52
02 46 NA

When we reshape from wide to long, we will have the option of keeping these NAs or dropping the rows with missing data.
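Here is a minimal sketch of that test-score example (the object names are illustrative), showing the NA that appears when widening and the values_drop_na option we will use later when lengthening:

scores_long <- tribble(
  ~id,  ~test, ~score,
  "01", "a",   43,
  "01", "b",   52,
  "02", "a",   46   # no row for individual 02, test b
)

# Widening forces an NA for the missing id-test combination
scores_wide <- scores_long |> 
  pivot_wider(names_from = test,
              names_prefix = "test_",
              values_from = score)

# Lengthening again can either keep that NA row or drop it
scores_wide |> 
  pivot_longer(cols = -id,
               names_to = "test",
               names_prefix = "test_",
               values_to = "score",
               values_drop_na = TRUE)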

For another example, look at the built-in fish_encounters dataframe, which is in a long format:

fish_encounters |> 
  head()
# A tibble: 6 × 3
  fish  station  seen
  <fct> <fct>   <int>
1 4842  Release     1
2 4842  I80_1       1
3 4842  Lisbon      1
4 4842  Rstr        1
5 4842  Base_TD     1
6 4842  BCE         1

Reshape this data to wide with pivot_wider(), making the new column names_from the station column, and getting values_from the seen column.

fish_encounters |> 
  pivot_wider(names_from = station,
              values_from = seen)
# A tibble: 19 × 12
   fish  Release I80_1 Lisbon  Rstr Base_TD   BCE   BCW  BCE2  BCW2   MAE   MAW
   <fct>   <int> <int>  <int> <int>   <int> <int> <int> <int> <int> <int> <int>
 1 4842        1     1      1     1       1     1     1     1     1     1     1
 2 4843        1     1      1     1       1     1     1     1     1     1     1
 3 4844        1     1      1     1       1     1     1     1     1     1     1
 4 4845        1     1      1     1       1    NA    NA    NA    NA    NA    NA
 5 4847        1     1      1    NA      NA    NA    NA    NA    NA    NA    NA
 6 4848        1     1      1     1      NA    NA    NA    NA    NA    NA    NA
 7 4849        1     1     NA    NA      NA    NA    NA    NA    NA    NA    NA
 8 4850        1     1     NA     1       1     1     1    NA    NA    NA    NA
 9 4851        1     1     NA    NA      NA    NA    NA    NA    NA    NA    NA
10 4854        1     1     NA    NA      NA    NA    NA    NA    NA    NA    NA
11 4855        1     1      1     1       1    NA    NA    NA    NA    NA    NA
12 4857        1     1      1     1       1     1     1     1     1    NA    NA
13 4858        1     1      1     1       1     1     1     1     1     1     1
14 4859        1     1      1     1       1    NA    NA    NA    NA    NA    NA
15 4861        1     1      1     1       1     1     1     1     1     1     1
16 4862        1     1      1     1       1     1     1     1     1    NA    NA
17 4863        1     1     NA    NA      NA    NA    NA    NA    NA    NA    NA
18 4864        1     1     NA    NA      NA    NA    NA    NA    NA    NA    NA
19 4865        1     1      1    NA      NA    NA    NA    NA    NA    NA    NA

Now each row is a fish, and the columns indicate whether that fish was seen at a station. Here, “seen” is a 1 when a fish was observed, and NA when the fish was not observed. However, we typically code indicators as 1 and 0. We have control over how the missing values are filled in when making data wide. To use a different value, such as 0, specify values_fill = 0:

fish_encounters |> 
  pivot_wider(names_from = station,
              values_from = seen,
              values_fill = 0)
# A tibble: 19 × 12
   fish  Release I80_1 Lisbon  Rstr Base_TD   BCE   BCW  BCE2  BCW2   MAE   MAW
   <fct>   <int> <int>  <int> <int>   <int> <int> <int> <int> <int> <int> <int>
 1 4842        1     1      1     1       1     1     1     1     1     1     1
 2 4843        1     1      1     1       1     1     1     1     1     1     1
 3 4844        1     1      1     1       1     1     1     1     1     1     1
 4 4845        1     1      1     1       1     0     0     0     0     0     0
 5 4847        1     1      1     0       0     0     0     0     0     0     0
 6 4848        1     1      1     1       0     0     0     0     0     0     0
 7 4849        1     1      0     0       0     0     0     0     0     0     0
 8 4850        1     1      0     1       1     1     1     0     0     0     0
 9 4851        1     1      0     0       0     0     0     0     0     0     0
10 4854        1     1      0     0       0     0     0     0     0     0     0
11 4855        1     1      1     1       1     0     0     0     0     0     0
12 4857        1     1      1     1       1     1     1     1     1     0     0
13 4858        1     1      1     1       1     1     1     1     1     1     1
14 4859        1     1      1     1       1     0     0     0     0     0     0
15 4861        1     1      1     1       1     1     1     1     1     1     1
16 4862        1     1      1     1       1     1     1     1     1     0     0
17 4863        1     1      0     0       0     0     0     0     0     0     0
18 4864        1     1      0     0       0     0     0     0     0     0     0
19 4865        1     1      1     0       0     0     0     0     0     0     0

15.4.2 Multiple Columns of Values

population and fish_encounters each had only a single column of values to spread across the new columns.

Some datasets have multiple value columns for each name. The built-in us_rent_income has one row for each combination of state (NAME) and measure, where the measure (income or rent) is stored in the variable column. Each row then carries two values, estimate and moe (the margin of error).

us_rent_income
# A tibble: 104 × 5
   GEOID NAME       variable estimate   moe
   <chr> <chr>      <chr>       <dbl> <dbl>
 1 01    Alabama    income      24476   136
 2 01    Alabama    rent          747     3
 3 02    Alaska     income      32940   508
 4 02    Alaska     rent         1200    13
 5 04    Arizona    income      27517   148
 6 04    Arizona    rent          972     4
 7 05    Arkansas   income      23789   165
 8 05    Arkansas   rent          709     5
 9 06    California income      29454   109
10 06    California rent         1358     3
# ℹ 94 more rows

If we only had one of these, reshaping to wide would follow the same process as above. Drop the moe column and get names_from the variable column and values_from the estimate column:

us_rent_income |> 
  select(-moe) |> 
  pivot_wider(names_from = variable,
              values_from = estimate)
# A tibble: 52 × 4
   GEOID NAME                 income  rent
   <chr> <chr>                 <dbl> <dbl>
 1 01    Alabama               24476   747
 2 02    Alaska                32940  1200
 3 04    Arizona               27517   972
 4 05    Arkansas              23789   709
 5 06    California            29454  1358
 6 08    Colorado              32401  1125
 7 09    Connecticut           35326  1123
 8 10    Delaware              31560  1076
 9 11    District of Columbia  43198  1424
10 12    Florida               25952  1077
# ℹ 42 more rows

Here, we got values from a single column, estimate.

Since we have two columns with data, estimate and moe, we can provide more than one column to the values_from argument. In that case, pivot_wider() will prefix the new column names (from the names_from column) with the current names of the values_from columns. We will then get four columns, one for each combination of income/rent and estimate/moe:

us_rent_income |> 
  pivot_wider(names_from = variable,
              values_from = c(estimate, moe))
# A tibble: 52 × 6
   GEOID NAME                 estimate_income estimate_rent moe_income moe_rent
   <chr> <chr>                          <dbl>         <dbl>      <dbl>    <dbl>
 1 01    Alabama                        24476           747        136        3
 2 02    Alaska                         32940          1200        508       13
 3 04    Arizona                        27517           972        148        4
 4 05    Arkansas                       23789           709        165        5
 5 06    California                     29454          1358        109        3
 6 08    Colorado                       32401          1125        109        5
 7 09    Connecticut                    35326          1123        195        5
 8 10    Delaware                       31560          1076        247       10
 9 11    District of Columbia           43198          1424        681       17
10 12    Florida                        25952          1077         70        3
# ℹ 42 more rows
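If you would rather have the measure come first in the combined names (income_estimate instead of estimate_income), pivot_wider() also accepts a names_glue argument: a glue template built from the names_from columns plus the special .value placeholder, which stands in for the values_from column names. A minimal sketch:

us_rent_income |> 
  pivot_wider(names_from = variable,
              names_glue = "{variable}_{.value}",
              values_from = c(estimate, moe))
# produces columns like income_estimate, income_moe, rent_estimate, rent_moe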

15.5 Wide to Long

We can reshape wide data to long with the pivot_longer() function.

The built-in relig_income is in a wide format:

relig_income |> 
  head()
# A tibble: 6 × 11
  religion  `<$10k` `$10-20k` `$20-30k` `$30-40k` `$40-50k` `$50-75k` `$75-100k`
  <chr>       <dbl>     <dbl>     <dbl>     <dbl>     <dbl>     <dbl>      <dbl>
1 Agnostic       27        34        60        81        76       137        122
2 Atheist        12        27        37        52        35        70         73
3 Buddhist       27        21        30        34        33        58         62
4 Catholic      418       617       732       670       638      1116        949
5 Don’t kn…      15        14        15        11        10        35         21
6 Evangeli…     575       869      1064       982       881      1486        949
# ℹ 3 more variables: `$100-150k` <dbl>, `>150k` <dbl>,
#   `Don't know/refused` <dbl>

When reshaping to long, we need to specify three things:

  • cols: which columns should be made long
  • names_to: the name of the new column that the current column names will be gathered into
  • values_to: the name of the new column that the values will be gathered into

The columns to be made long are generally everything except the identifier column. Here, we have only one identifier column, religion. The cols argument supports tidyselect syntax, so the column-selection techniques from the chapter on subsetting can be used here. To select everything except one column, specify -religion. Note that this does not drop religion from the dataset; it only specifies which columns are to be made long: everything but religion.

Gather the current column names, which describe income levels, into a column called “income” with names_to = "income", and put their values, which are counts of respondents, into a column called “n” with values_to = "n":

relig_income |> 
  pivot_longer(cols = -religion,
               names_to = "income",
               values_to = "n")
# A tibble: 180 × 3
   religion income                 n
   <chr>    <chr>              <dbl>
 1 Agnostic <$10k                 27
 2 Agnostic $10-20k               34
 3 Agnostic $20-30k               60
 4 Agnostic $30-40k               81
 5 Agnostic $40-50k               76
 6 Agnostic $50-75k              137
 7 Agnostic $75-100k             122
 8 Agnostic $100-150k            109
 9 Agnostic >150k                 84
10 Agnostic Don't know/refused    96
# ℹ 170 more rows
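One payoff of the long shape (echoing the warm-up at the start of the chapter) is that per-group summaries become simple. A minimal sketch, assuming we want the total number of respondents in each religion:

relig_income |> 
  pivot_longer(cols = -religion,
               names_to = "income",
               values_to = "n") |> 
  group_by(religion) |> 
  summarize(total_respondents = sum(n))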

15.5.1 Cleaning Up

The built-in billboard data is in a wide format with one row per song and columns with the song’s rank each week.

billboard
# A tibble: 317 × 79
   artist     track date.entered   wk1   wk2   wk3   wk4   wk5   wk6   wk7   wk8
   <chr>      <chr> <date>       <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
 1 2 Pac      Baby… 2000-02-26      87    82    72    77    87    94    99    NA
 2 2Ge+her    The … 2000-09-02      91    87    92    NA    NA    NA    NA    NA
 3 3 Doors D… Kryp… 2000-04-08      81    70    68    67    66    57    54    53
 4 3 Doors D… Loser 2000-10-21      76    76    72    69    67    65    55    59
 5 504 Boyz   Wobb… 2000-04-15      57    34    25    17    17    31    36    49
 6 98^0       Give… 2000-08-19      51    39    34    26    26    19     2     2
 7 A*Teens    Danc… 2000-07-08      97    97    96    95   100    NA    NA    NA
 8 Aaliyah    I Do… 2000-01-29      84    62    51    41    38    35    35    38
 9 Aaliyah    Try … 2000-03-18      59    53    38    28    21    18    16    14
10 Adams, Yo… Open… 2000-08-26      76    76    74    69    68    67    61    58
# ℹ 307 more rows
# ℹ 68 more variables: wk9 <dbl>, wk10 <dbl>, wk11 <dbl>, wk12 <dbl>,
#   wk13 <dbl>, wk14 <dbl>, wk15 <dbl>, wk16 <dbl>, wk17 <dbl>, wk18 <dbl>,
#   wk19 <dbl>, wk20 <dbl>, wk21 <dbl>, wk22 <dbl>, wk23 <dbl>, wk24 <dbl>,
#   wk25 <dbl>, wk26 <dbl>, wk27 <dbl>, wk28 <dbl>, wk29 <dbl>, wk30 <dbl>,
#   wk31 <dbl>, wk32 <dbl>, wk33 <dbl>, wk34 <dbl>, wk35 <dbl>, wk36 <dbl>,
#   wk37 <dbl>, wk38 <dbl>, wk39 <dbl>, wk40 <dbl>, wk41 <dbl>, wk42 <dbl>, …

Reshape the data so that it has one row per week. The columns to make long are those that start with the letters “wk”, so specify cols = starts_with("wk"). Gather those into a column named “week” with names_to = "week" and put the values from those columns into a column named “rank” with values_to = "rank":

billboard |> 
  pivot_longer(cols = starts_with("wk"),
               names_to = "week",
               values_to = "rank")
# A tibble: 24,092 × 5
   artist track                   date.entered week   rank
   <chr>  <chr>                   <date>       <chr> <dbl>
 1 2 Pac  Baby Don't Cry (Keep... 2000-02-26   wk1      87
 2 2 Pac  Baby Don't Cry (Keep... 2000-02-26   wk2      82
 3 2 Pac  Baby Don't Cry (Keep... 2000-02-26   wk3      72
 4 2 Pac  Baby Don't Cry (Keep... 2000-02-26   wk4      77
 5 2 Pac  Baby Don't Cry (Keep... 2000-02-26   wk5      87
 6 2 Pac  Baby Don't Cry (Keep... 2000-02-26   wk6      94
 7 2 Pac  Baby Don't Cry (Keep... 2000-02-26   wk7      99
 8 2 Pac  Baby Don't Cry (Keep... 2000-02-26   wk8      NA
 9 2 Pac  Baby Don't Cry (Keep... 2000-02-26   wk9      NA
10 2 Pac  Baby Don't Cry (Keep... 2000-02-26   wk10     NA
# ℹ 24,082 more rows

The reshape was successful, but the dataset could be cleaned up in a few ways:

  • Remove the prefix “wk” from the values in the week column
    • The character prefix made the column names syntactic in wide format, but it prevents us from treating week as an ordered numeric variable in long format
    • Remove it with names_prefix = "wk"
  • Coerce week to numeric after removing the prefix
    • The names_to column is character by default
    • Change its type by applying a function with names_transform = as.numeric
    • This could also be done later with mutate(week = as.numeric(week))
  • Optionally, drop observations with missing data
    • Missing values are required in wide format when there is no existing combination of identifier variable (track) and a level of the names variable, but we can drop these rows in long format, as discussed in the section on missing data above
    • Drop observations with values_drop_na = T
billboard |> 
  pivot_longer(cols = starts_with("wk"),
               names_prefix = "wk",
               names_to = "week",
               names_transform = as.numeric, 
               values_to = "rank",
               values_drop_na = T)
# A tibble: 5,307 × 5
   artist  track                   date.entered  week  rank
   <chr>   <chr>                   <date>       <dbl> <dbl>
 1 2 Pac   Baby Don't Cry (Keep... 2000-02-26       1    87
 2 2 Pac   Baby Don't Cry (Keep... 2000-02-26       2    82
 3 2 Pac   Baby Don't Cry (Keep... 2000-02-26       3    72
 4 2 Pac   Baby Don't Cry (Keep... 2000-02-26       4    77
 5 2 Pac   Baby Don't Cry (Keep... 2000-02-26       5    87
 6 2 Pac   Baby Don't Cry (Keep... 2000-02-26       6    94
 7 2 Pac   Baby Don't Cry (Keep... 2000-02-26       7    99
 8 2Ge+her The Hardest Part Of ... 2000-09-02       1    91
 9 2Ge+her The Hardest Part Of ... 2000-09-02       2    87
10 2Ge+her The Hardest Part Of ... 2000-09-02       3    92
# ℹ 5,297 more rows

15.5.2 Multiple Datasets in One

The built-in construction dataframe has counts of completed units in each month:

construction
# A tibble: 9 × 9
   Year Month  `1 unit` `2 to 4 units` `5 units or more` Northeast Midwest South
  <dbl> <chr>     <dbl> <lgl>                      <dbl>     <dbl>   <dbl> <dbl>
1  2018 Janua…      859 NA                           348       114     169   596
2  2018 Febru…      882 NA                           400       138     160   655
3  2018 March       862 NA                           356       150     154   595
4  2018 April       797 NA                           447       144     196   613
5  2018 May         875 NA                           364        90     169   673
6  2018 June        867 NA                           342        76     170   610
7  2018 July        829 NA                           360       108     183   594
8  2018 August      939 NA                           286        90     205   649
9  2018 Septe…      835 NA                           304       117     175   560
# ℹ 1 more variable: West <dbl>

This is actually two datasets in one. In each month, some number of units were completed, and this number is then broken down by size (1, 2-4, 5+ units) and by region (Northeast, Midwest, South, West). The size breakdown is not crossed with the region breakdown, so each unit is counted twice in the current arrangement. If we had variables like “1 unit - Northeast”, “2 to 4 units - Northeast”, and “1 unit - Midwest”, then units would not be counted twice, and we could simply reshape the data to long.

You may encounter this format in government or administrative records, where some total is broken down by various subcategories (age group, student status, level of education, race and ethnicity, etc.), but not by the intersection of those categories, perhaps to reduce the risk of identifiability.

For this dataset, it would be best to separate it into two when reshaping it to long. Drop either the region or size variables before reshaping to get either dataset:

construction |> 
  select(-c(Northeast, Midwest, South, West)) |> 
  pivot_longer(cols = -c(Year, Month),
               names_to = "Size",
               values_to = "n") 
# A tibble: 27 × 4
    Year Month    Size                n
   <dbl> <chr>    <chr>           <dbl>
 1  2018 January  1 unit            859
 2  2018 January  2 to 4 units       NA
 3  2018 January  5 units or more   348
 4  2018 February 1 unit            882
 5  2018 February 2 to 4 units       NA
 6  2018 February 5 units or more   400
 7  2018 March    1 unit            862
 8  2018 March    2 to 4 units       NA
 9  2018 March    5 units or more   356
10  2018 April    1 unit            797
# ℹ 17 more rows
construction |> 
  select(-c(`1 unit`, `2 to 4 units`, `5 units or more`)) |> 
  pivot_longer(cols = -c(Year, Month),
               names_to = "Region",
               values_to = "n")
# A tibble: 36 × 4
    Year Month    Region        n
   <dbl> <chr>    <chr>     <dbl>
 1  2018 January  Northeast   114
 2  2018 January  Midwest     169
 3  2018 January  South       596
 4  2018 January  West        339
 5  2018 February Northeast   138
 6  2018 February Midwest     160
 7  2018 February South       655
 8  2018 February West        336
 9  2018 March    Northeast   150
10  2018 March    Midwest     154
# ℹ 26 more rows

15.5.3 Multiple Variables in Names

If the columns in construction did have names like “1 unit - Northeast” and “2 to 4 units - Northeast”, this would be a situation where the names contained multiple variables, following the pattern “size - region”.

A similar issue is present in the built-in household dataframe:

household
# A tibble: 5 × 5
  family dob_child1 dob_child2 name_child1 name_child2
   <int> <date>     <date>     <chr>       <chr>      
1      1 1998-11-26 2000-01-29 Susan       Jose       
2      2 1996-06-22 NA         Mark        <NA>       
3      3 2002-07-11 2004-04-05 Sam         Seth       
4      4 2004-10-10 2009-08-27 Craig       Khai       
5      5 2000-12-05 2005-02-28 Parker      Gracie     

The columns here follow a pattern of “variable_childnumber”. The string “_child” is a reliable separator between the two pieces of information: the variable name (“dob” or “name”) and the child number (1 or 2).

In pivot_longer(), specify that separator in the names with names_sep = "_child". A separator in the names means that more than one name will be returned, equal to the number of separators plus one. A single occurrence of “_child” in the names returns two pieces, one on the left (“dob”/“name”) and one on the right (1/2). Because of this, names_to will require two names. The second one is easy; 1/2 is the child number, which we can just call “child”.

For the piece on the left, we want to create two separate columns since “dob” and “name” are distinct variables. (We are also unable to smoothly combine them into a single column since one is a date and the other a character.) To dynamically create separate columns for each unique piece extracted after separating, specify “.value” as the first name. Since “dob” and “name” will be the names of the columns, we do not need a values_to argument.

names_to = c(".value", "child") in conjunction with names_sep = "_child" will split the current column names by “_child”:

  • dob_child1 becomes dob and 1
  • dob_child2 becomes dob and 2
  • name_child1 becomes name and 1
  • name_child2 becomes name and 2

With “.value” in the first position of names_to, new columns will be made from the first piece in each column name, one for each unique value (“dob” and “name”). With “child” in the second position of names_to, all of the values will be gathered into a single column called “child”.

household |> 
  pivot_longer(cols = -family,
               names_to = c(".value", "child"),
               names_sep = "_child")
# A tibble: 10 × 4
   family child dob        name  
    <int> <chr> <date>     <chr> 
 1      1 1     1998-11-26 Susan 
 2      1 2     2000-01-29 Jose  
 3      2 1     1996-06-22 Mark  
 4      2 2     NA         <NA>  
 5      3 1     2002-07-11 Sam   
 6      3 2     2004-04-05 Seth  
 7      4 1     2004-10-10 Craig 
 8      4 2     2009-08-27 Khai  
 9      5 1     2000-12-05 Parker
10      5 2     2005-02-28 Gracie

Some additional cleanup could be done to coerce the new child variable to numeric and drop rows with missing data. Here, because multiple names are used and we only want to modify child, use a named list following the pattern list(column = function): names_transform = list(child = as.numeric)

household |> 
  pivot_longer(cols = -family,
               names_to = c(".value", "child"),
               names_sep = "_child",
               names_transform = list(child = as.numeric),
               values_drop_na = T)
# A tibble: 9 × 4
  family child dob        name  
   <int> <dbl> <date>     <chr> 
1      1     1 1998-11-26 Susan 
2      1     2 2000-01-29 Jose  
3      2     1 1996-06-22 Mark  
4      3     1 2002-07-11 Sam   
5      3     2 2004-04-05 Seth  
6      4     1 2004-10-10 Craig 
7      4     2 2009-08-27 Khai  
8      5     1 2000-12-05 Parker
9      5     2 2005-02-28 Gracie

For another example with multiple variables within column names, import this dataset:

dat <- read.csv("https://sscc.wisc.edu/sscc/pubs/dwr/data/reshape_pattern.csv")

After id, the column names follow the pattern “subject_qnumber_time”:

colnames(dat)
 [1] "id"              "math_q1_pre"     "math_q1_post"    "reading_q1_pre" 
 [5] "reading_q1_post" "science_q1_pre"  "science_q1_post" "math_q2_pre"    
 [9] "math_q2_post"    "reading_q2_pre"  "reading_q2_post" "science_q2_pre" 
[13] "science_q2_post" "math_q3_pre"     "math_q3_post"    "reading_q3_pre" 
[17] "reading_q3_post" "science_q3_pre"  "science_q3_post" "math_q4_pre"    
[21] "math_q4_post"    "reading_q4_pre"  "reading_q4_post" "science_q4_pre" 
[25] "science_q4_post" "math_q5_pre"     "math_q5_post"    "reading_q5_pre" 
[29] "reading_q5_post" "science_q5_pre"  "science_q5_post"

The easiest way to split these columns is by the underscore delimiter. Two underscores in each name mean we need to provide three values to names_to. In the household example, we wanted separate columns for some of the extracted pieces because date of birth and name are distinct variables. In this example, each of the pieces we get back can be gathered into columns called “subject” (math, reading, science), “question” (q1-q5), and “time” (pre/post). Because we do not use “.value” within names_to, we need to provide a values_to argument. The values in this dataframe are all indicators of whether a question was answered correctly, so call this new column “correct”:

dat |> 
  pivot_longer(cols = -id,
               names_to = c("subject", "question", "time"),
               names_sep = "_",
               values_to = "correct")
# A tibble: 300 × 5
      id subject question time  correct
   <int> <chr>   <chr>    <chr>   <int>
 1     1 math    q1       pre         1
 2     1 math    q1       post        0
 3     1 reading q1       pre         0
 4     1 reading q1       post        1
 5     1 science q1       pre         1
 6     1 science q1       post        1
 7     1 math    q2       pre         0
 8     1 math    q2       post        0
 9     1 reading q2       pre         0
10     1 reading q2       post        0
# ℹ 290 more rows

A more flexible alternative for separating column names into multiple variables is to use a regex with names_pattern (learn about regex in the chapter on character vectors). The advantage of names_pattern over names_sep is that it allows for different separators in the same name, like an underscore and then a period, or no separator at all (as in the next example).

Here, we can specify a group (()) with any character (.) zero or more times (*) with this regex: (.*). Following this is an underscore and the letter “q”, and then a group of characters, another underscore, and another group of characters. The complete regex for our column names is (.*)_q(.*)_(.*). Putting the “q” outside the group treats it as a separator that is not included in the resulting question column (i.e., we get 1-5 instead of q1-q5). Reshape the dataframe with this pattern:

dat |> 
  pivot_longer(cols = -id,
               names_to = c("subject", "question", "time"),
               names_pattern = "(.*)_q(.*)_(.*)",
               values_to = "correct")
# A tibble: 300 × 5
      id subject question time  correct
   <int> <chr>   <chr>    <chr>   <int>
 1     1 math    1        pre         1
 2     1 math    1        post        0
 3     1 reading 1        pre         0
 4     1 reading 1        post        1
 5     1 science 1        pre         1
 6     1 science 1        post        1
 7     1 math    2        pre         0
 8     1 math    2        post        0
 9     1 reading 2        pre         0
10     1 reading 2        post        0
# ℹ 290 more rows

Now, extend the use of names_pattern to a situation with a clear pattern but no reliable separator. Rename the columns in dat to remove the underscores:

dat <-
  dat |> 
  rename_with(\(x) str_remove_all(x, "_"))

Now our column names look like this:

colnames(dat)
 [1] "id"            "mathq1pre"     "mathq1post"    "readingq1pre" 
 [5] "readingq1post" "scienceq1pre"  "scienceq1post" "mathq2pre"    
 [9] "mathq2post"    "readingq2pre"  "readingq2post" "scienceq2pre" 
[13] "scienceq2post" "mathq3pre"     "mathq3post"    "readingq3pre" 
[17] "readingq3post" "scienceq3pre"  "scienceq3post" "mathq4pre"    
[21] "mathq4post"    "readingq4pre"  "readingq4post" "scienceq4pre" 
[25] "scienceq4post" "mathq5pre"     "mathq5post"    "readingq5pre" 
[29] "readingq5post" "scienceq5pre"  "scienceq5post"

The pattern now is “subjectqnumbertime”. The letter “q” separates the subject from the number, but the number and time are right next to each other, so we have to distinguish them another way.

Before renaming the columns, the column names followed this pattern: (.*)_q(.*)_(.*)

Now without underscores, they follow this pattern: (.*)q(.*)(.*)

We can see there will be a problem differentiating the second two groups since they are both any character zero or more times. In our case, the second group is always numbers, so we can specify in the regex that it must contain only digits with the character class [[:digit:]]. (See ?base::regex for a list of such classes.)

Now our names_pattern can be specified as (.*)q([[:digit:]]*)(.*):

dat |> 
  pivot_longer(cols = -id,
               names_to = c("subject", "q", "time"),
               names_pattern = "(.*)q([[:digit:]]*)(.*)",
               values_to = "correct")
# A tibble: 300 × 5
      id subject q     time  correct
   <int> <chr>   <chr> <chr>   <int>
 1     1 math    1     pre         1
 2     1 math    1     post        0
 3     1 reading 1     pre         0
 4     1 reading 1     post        1
 5     1 science 1     pre         1
 6     1 science 1     post        1
 7     1 math    2     pre         0
 8     1 math    2     post        0
 9     1 reading 2     pre         0
10     1 reading 2     post        0
# ℹ 290 more rows

15.6 Exercises

15.6.1 Fundamental

  1. Reshape the built-in table1 to long so that it looks like table2.

  2. Reshape the built-in table2 to wide so that it looks like table1.

  3. Reshape the built-in economics to long so that its first four rows look like this:

    # A tibble: 4 × 3
      date       variable    value
      <date>     <chr>       <dbl>
    1 1967-07-01 pce         507. 
    2 1967-07-01 pop      198712  
    3 1967-07-01 psavert      12.6
    4 1967-07-01 uempmed       4.5
  4. Reshape the built-in ChickWeight to wide so that its first four rows look like this:

    # A tibble: 4 × 14
      Chick Diet     t0    t2    t4    t6    t8   t10   t12   t14   t16   t18   t20
      <ord> <fct> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
    1 1     1        42    51    59    64    76    93   106   125   149   171   199
    2 2     1        40    49    58    72    84   103   122   138   162   187   209
    3 3     1        43    39    55    67    84    99   115   138   163   187   198
    4 4     1        42    49    56    67    74    87   102   108   136   154   160
    # ℹ 1 more variable: t21 <dbl>
  5. Reshape the built-in sleep to wide so that its first four rows look like this:

    # A tibble: 4 × 3
      ID    group1 group2
      <fct>  <dbl>  <dbl>
    1 1        0.7    1.9
    2 2       -1.6    0.8
    3 3       -0.2    1.1
    4 4       -1.2    0.1

15.6.2 Extended

  1. Use the following dataset, which is modified from some summer days listed on https://www.library.wisc.edu/locations/:

    libraries <- read.csv("https://www.sscc.wisc.edu/sscc/pubs/dwr/data/reshape_extended1.csv")

    Reshape it to wide so that its first two rows look like this:

    # A tibble: 2 × 5
      library   open_sunday open_monday close_sunday close_monday
      <chr>     <chr>       <chr>       <chr>        <chr>       
    1 arboretum 12:30pm     9:30am      4pm          4pm         
    2 art       <NA>        11am        <NA>         4pm         
  2. Use the following dataset:

    attendance <- read.csv("https://www.sscc.wisc.edu/sscc/pubs/dwr/data/reshape_extended2.csv")

    Reshape it to long so that its first four rows look like this:

    # A tibble: 4 × 6
           id status language semester year  attended
        <int> <chr>  <chr>    <chr>    <chr>    <int>
    1 4223798 fac    r        fall     2021         0
    2 4223798 fac    r        spring   2021         0
    3 4223798 fac    stata    fall     2021         0
    4 4223798 fac    stata    spring   2021         1