15 Reshaping

In a data frame, each row represents a unit of observation and each column represents a measurement. For example, each row might represent a respondent in a survey, and each column might represent their answers to a specific survey question.

Now suppose you re-surveyed the same respondents a year later, asking the same questions. You might consider your unit of observation to be the person, and record their new answers as a new set of columns in your data frame.

ID income2020 income2021
01 33000 34000
02 25000 29000

But you might also consider the unit of observation to be the person-time combination, that is, person 01 at time one is organized as a distinct observation from person 01 at time two.

ID year income
01 2020 33000
02 2020 25000
01 2021 34000
02 2021 29000

This sort of data is often described as repeated measures, nested, or hierarchical data, and occurs when one conceptual unit of observation includes multiple other units of observation. Often the larger unit (“higher level”) is termed an “observation”, and we discuss measures that vary “within” these units and measure that vary “between” these units.

In data wrangling terms, the first example above is often described as wide form data, while the second example is long form. Whether you need your data in wide or long form often depends on what you want to do next with your data - some statistical and graphing functions expect data in wide form, while others expect data in long form.

Converting a data frame from long to wide, or vice versa, is called reshaping.

15.1 Reshape Wider

To reshape data from long form to wide form we need to identify three or four groups of variables:

  • ID variable(s) - identify the units which are repeated
  • Repetition - a variable that identifies the repetition within ID
  • Varying - variables that differ within an ID
  • Repeated - variables that are constant within an ID, but vary between IDs.

15.1.1 Simple Case

A simple example of long form data is the sleep data.

sleep
   extra group ID
1    0.7     1  1
2   -1.6     1  2
3   -0.2     1  3
4   -1.2     1  4
5   -0.1     1  5
6    3.4     1  6
7    3.7     1  7
8    0.8     1  8
9    0.0     1  9
10   2.0     1 10
11   1.9     2  1
12   0.8     2  2
13   1.1     2  3
14   0.1     2  4
15  -0.1     2  5
16   4.4     2  6
17   5.5     2  7
18   1.6     2  8
19   4.6     2  9
20   3.4     2 10

Individuals in the sleep study are given an ID, and each appears twice. The repetitions are given by group, the two drugs being studied. The variable extra is varying within ID. (There are no “between” variables here.)

The reshape looks like this:

wide0 <-
  reshape(
    sleep,
    direction = "wide",
    idvar     = "ID",     # obs unit,   "i"
    timevar   = "group",  # repetition, "j"
    v.names   = "extra"
  )
wide0
   ID extra.1 extra.2
1   1     0.7     1.9
2   2    -1.6     0.8
3   3    -0.2     1.1
4   4    -1.2     0.1
5   5    -0.1    -0.1
6   6     3.4     4.4
7   7     3.7     5.5
8   8     0.8     1.6
9   9     0.0     4.6
10 10     2.0     3.4

Notice that the two variables, extra.1 and extra.2 are a combination of the varying variable name, extra, and the values of the repetition variable, group. This is an important naming pattern when reshaping from wide to long.

Be aware, too, that once reshaped, it is easy to reshape the data in the other direction. Try:

reshape(wide0) # reverse a reshape

15.1.2 Unbalanced Data

Now consider what happens if the repetitions are not balanced, that is, not every repeated unit has the same number of repetitions.

sleep1 <- sleep[-2,] # missing a rep in ID==2

wide1 <-
  reshape(
    sleep1,
    direction = "wide",
    idvar     = "ID",     # obs unit,   "i"
    timevar   = "group",  # repetition, "j"
    v.names   = "extra"
  )
wide1
   ID extra.1 extra.2
1   1     0.7     1.9
3   3    -0.2     1.1
4   4    -1.2     0.1
5   5    -0.1    -0.1
6   6     3.4     4.4
7   7     3.7     5.5
8   8     0.8     1.6
9   9     0.0     4.6
10 10     2.0     3.4
12  2      NA     0.8

Missing rows in long form become missing values in wide form.

15.1.3 Multiple “Within” Variables

Often, more than one variable varies within a repeated unit. Specify these with a vector of v.names.

sleep2 <- sleep
sleep2$other <- sample(20) 

reshape(
  sleep2, # more than one "within" var
  direction = "wide",
  idvar     = "ID",     # obs unit,   "i"
  timevar   = "group",  # repetition, "j"
  v.names   = c("extra", "other")
)
   ID extra.1 other.1 extra.2 other.2
1   1     0.7      14     1.9      10
2   2    -1.6      19     0.8       7
3   3    -0.2      11     1.1       5
4   4    -1.2      12     0.1       2
5   5    -0.1      20    -0.1       8
6   6     3.4      17     4.4      16
7   7     3.7      15     5.5      18
8   8     0.8       9     1.6      13
9   9     0.0       1     4.6       3
10 10     2.0       6     3.4       4

15.1.4 Additional Repeated Variables

You may also have additional variables that are repeated at the ID level, but do not vary within ID. You do not need to specify this final group of “between” variables - anything not already named is assumed to vary “between”.

sleep3 <- sleep2
sleep3$site <- rep(sample(5),4)

wide3 <-
  reshape(
    sleep3,
    direction = "wide",
    idvar     = "ID",     # obs unit,   "i"
    timevar   = "group",  # repetition, "j"
    v.names   = c("extra", "other")
  )
wide3
   ID site extra.1 other.1 extra.2 other.2
1   1    2     0.7      14     1.9      10
2   2    3    -1.6      19     0.8       7
3   3    1    -0.2      11     1.1       5
4   4    5    -1.2      12     0.1       2
5   5    4    -0.1      20    -0.1       8
6   6    2     3.4      17     4.4      16
7   7    3     3.7      15     5.5      18
8   8    1     0.8       9     1.6      13
9   9    5     0.0       1     4.6       3
10 10    4     2.0       6     3.4       4

However, you do have to be careful to declare all of your “within” varying variables.

# WARNING, "time"-varying not declared
reshape(  # more than one "within" var
  sleep3, # missing obs
  direction = "wide",
  idvar     = "ID",     # obs unit,   "i"
  timevar   = "group",  # repetition, "j"
  v.names   = c("extra")
)
Warning in reshapeWide(data, idvar = idvar, timevar = timevar, varying = varying, : some constant variables (other) are
really varying
   ID other site extra.1 extra.2
1   1    14    2     0.7     1.9
2   2    19    3    -1.6     0.8
3   3    11    1    -0.2     1.1
4   4    12    5    -1.2     0.1
5   5    20    4    -0.1    -0.1
6   6    17    2     3.4     4.4
7   7    15    3     3.7     5.5
8   8     9    1     0.8     1.6
9   9     1    5     0.0     4.6
10 10     6    4     2.0     3.4

It might be nice if this produced a clear ERROR rather than a WARNING and a resulting data frame!

15.2 Reshape Longer

To go from wide to long we need to invert all of these concepts.

  • ID variable - identify the units which are repeated
  • Repetition - a new variable name that will identify repetitions within ID
  • Varying - variables that differ within an ID (groups of columns)
  • Repeated - variables that are constant within an ID, but vary between IDs (single columns).

Here again, we will only actually specify the first three types of variables, one of which will be created. An important prerequisite is that varying columns have names which can be split into a shared part and a repetition value.

reshape(wide0,
        direction = "long",
        idvar = "ID",       # i
        timevar = "group",  # j
        varying = c("extra.1", "extra.2")
        )
     ID group extra
1.1   1     1   0.7
2.1   2     1  -1.6
3.1   3     1  -0.2
4.1   4     1  -1.2
5.1   5     1  -0.1
6.1   6     1   3.4
7.1   7     1   3.7
8.1   8     1   0.8
9.1   9     1   0.0
10.1 10     1   2.0
1.2   1     2   1.9
2.2   2     2   0.8
3.2   3     2   1.1
4.2   4     2   0.1
5.2   5     2  -0.1
6.2   6     2   4.4
7.2   7     2   5.5
8.2   8     2   1.6
9.2   9     2   4.6
10.2 10     2   3.4
reshape(wide3,
        direction = "long",
        idvar = "ID",       # i
        timevar = "group",  # j
        varying = c("extra.1", "extra.2", 
                    "other.1", "other.2")
)
     ID site group extra other
1.1   1    2     1   0.7    14
2.1   2    3     1  -1.6    19
3.1   3    1     1  -0.2    11
4.1   4    5     1  -1.2    12
5.1   5    4     1  -0.1    20
6.1   6    2     1   3.4    17
7.1   7    3     1   3.7    15
8.1   8    1     1   0.8     9
9.1   9    5     1   0.0     1
10.1 10    4     1   2.0     6
1.2   1    2     2   1.9    10
2.2   2    3     2   0.8     7
3.2   3    1     2   1.1     5
4.2   4    5     2   0.1     2
5.2   5    4     2  -0.1     8
6.2   6    2     2   4.4    16
7.2   7    3     2   5.5    18
8.2   8    1     2   1.6    13
9.2   9    5     2   4.6     3
10.2 10    4     2   3.4     4

15.3 Reshaping Exercises

  1. Reshape the WorldPhones dataset into a long format. Be sure to name the new columns appropriately.

  2. Reshape ChickWeight into a wide format with columns created from Time.

15.4 Data Wrangling Exercises

Putting everything together now,

  1. Reshape us_rent_income (from the tidyr package) so that it has one line per state, and two new columns named estimate_income and estimate_rent that contain values from estimate.

  2. Merge this with state.x77, and keep all rows. Then, drop rows where any values are missing. You can do this in one or two steps.

  3. Add a column containing state.division.

  4. Add a column with the proportion of income spent on rent (rent / income).

  5. Drop rows where Area is not greater than ten times Frost.

  6. Replace all spaces in all column names with dashes (e.g., HS Grad to HS-Grad).

  7. Without removing any rows, add a column with the population-weighted mean rent by geographic division.

  • Which division has the highest mean rent?
  1. Save the resulting dataframe as a CSV file, a tab-delimited text file, and an .RData file.