SSCC - Social Science Computing Cooperative Supporting Statistical Analysis for Research

6.2 Tidy data

6.2.1 Data concepts

A data frame is said to be tidy (Hadley's paper) if the following conditions are met.

  • A single column (variable) contains all measures of the same data feature of the observations.

  • A row contains measures of the same observation.

The following are a few common conditions when a data frame may not be tidy.

  • A data feature may be collected multiple times with an order. When the order is an important feature of the data it is often called panel data. The order is often time. For example it might be a start and end value from an experiment done over a fixed period of time. If the start and end value are separate variables, this would not be tidy data. When data features from each panels are in separate variables the data is said to be in long form.

  • A data feature may be collected multiple times with no particular order. When the order is not an important feature of the data it is often called repeated measures data. An example of this is a farmer is interested in the nitrogen level of a field. He might take a set of random samples. His real interest might be the average nitrogen level. When each of the repeated measures are stored in their own variable, the data is not tidy.

  • The variable names can contain values. For example, a data set could contain a set of indicator variables to identify the income range of subjects. This would not be tidy data.

    Note, when certain analysis are done, such as regression, categorical variables are converted to sets of indicator variables. While this needed for proper analysis, this is not the best form to manage the data in.

  • Multiple data features may be stored in a single variable. An example, is city and state may be stored in a single variable.

When data is stored in multiple rows, it can contain structural missing data. This occurs when it is not possible for a value to exist for the element. An example of this is a data set that contains a measure taken each day of the month. If the data is organized with a column for each day of the month, there would be 31 columns for the measures. The months with less than 31 days would have missing data for the days that do not exist in the month. The structural missing data can be deleted when the data is represented as a single column.

Reshaping the data from wide to long or long to wide can be visualized In the following graphic. To go from long to wide, a variable is identified as containing the information what the name of the repeated variable is, here that variable is key. Each unique value of the key variable will result in a variable of that name in wide form. Another variable is identified as containing the values that will be used to construct the new variables, here that variable is val. The key variable has 3 unique values. This results in key and val variables being transformed into three new variables, A, B, and C.

Conceptual representation of reshaping a data frame

Figure 6.1: Conceptual representation of reshaping a data frame

To go from wide to long form the set of wide variables needs to be identified, here that would be, A, B, and C. Two new variables will be created from these wide variables. One of these two new variables will contain the names of the of the wide variables, here this is the key variable. The other new variable will contain the values from the wide variables, here this is the val variable. The values of the key variable tell us which wide variable the val values are from. For example, in the first data row of the long form the A in key tells us that 9 is from the A variable.

6.2.2 Examples - R

These examples use the cps1.csv data set.

  1. We begin by loading the tidyverse and import the csv file.

    library(tidyverse)
    cps1_path <- file.path("..", "datasets", "cps1.csv")
    cps1_in <- read_csv(cps1_path, col_types = cols())
    Warning: Missing column names filled in: 'X1' [1]
    cps <- 
      cps1_in %>%
      select(-X1) %>%
      rename(
        no_deg = nodeg,
        real_earn_74 = re74,
        real_earn_75 = re75,
        real_earn_78 = re78
        )
    
    glimpse(cps)
    Observations: 15,992
    Variables: 10
    $ trt          <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0...
    $ age          <dbl> 45, 21, 38, 48, 18, 22, 48, 18, 48, 45, 34, 16, 5...
    $ educ         <dbl> 11, 14, 12, 6, 8, 11, 10, 11, 9, 12, 14, 10, 10, ...
    $ black        <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0...
    $ hisp         <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0...
    $ marr         <dbl> 1, 0, 1, 1, 1, 1, 1, 0, 1, 1, 1, 0, 1, 0, 1, 1, 1...
    $ no_deg       <dbl> 1, 0, 0, 1, 1, 1, 1, 1, 1, 0, 0, 1, 1, 0, 0, 1, 0...
    $ real_earn_74 <dbl> 21516.6700, 3175.9710, 23039.0200, 24994.3700, 16...
    $ real_earn_75 <dbl> 25243.550, 5852.565, 25130.760, 25243.550, 10727....
    $ real_earn_78 <dbl> 25564.670, 13496.080, 25564.670, 25564.670, 9860....
    cps %>%
      select(real_earn_75, real_earn_78) %>%
      head()
    # A tibble: 6 x 2
      real_earn_75 real_earn_78
             <dbl>        <dbl>
    1       25244.       25565.
    2        5853.       13496.
    3       25131.       25565.
    4       25244.       25565.
    5       10728.        9861.
    6       18449.       25565.
  2. There are three variables for earnings in different years. An observation in this data set seems to be a person and year. We will tidy this data frame.

    The data does not have an observation identifier. We will add this identifier to connect the different years from the same person.

    The gather() function is used to transform a data frame from wide to tidy (long form.) The key parameter names the variable that will identify which of the wide variables the new row is for. The value parameter names the variable that will hold the values from the variables that are being stacked into a single variable. The remaining (un-named) parameters are the variables that will be stacked into a single row.

    cps <- 
      cps %>%
      mutate(
        id = row_number()
      ) %>%
      gather(
        key = year, 
        value = real_earn,
        real_earn_74,
        real_earn_75,
        real_earn_78
        ) %>%
      separate(year, into = c("X1", "X2", "year"), sep = "_") %>%
      select(-X1, -X2) %>%
      arrange(id, year)
    
    cps %>%
      select(id, year,  age, educ, marr, real_earn) %>%
      head()
    # A tibble: 6 x 6
         id year    age  educ  marr real_earn
      <int> <chr> <dbl> <dbl> <dbl>     <dbl>
    1     1 74       45    11     1    21517.
    2     1 75       45    11     1    25244.
    3     1 78       45    11     1    25565.
    4     2 74       21    14     0     3176.
    5     2 75       21    14     0     5853.
    6     2 78       21    14     0    13496.

    We also remove the real earning part of the year values. This results in each year value being the year identifier.

  3. There are two identifiers for ethnicity (black and hisp.) We will check to see if these are ethnicity are unique (no observations with both black and hisp set to 1.)

    We will do this by gathing on the ethnicity variables. This creates a single variable containing all the indicators for ethnicity. We then group by observations (id and year.) These two transformations provide a vector of ethnicity indicators to operate on.

    Summing the ethnicity indicator (within the groups) provides the number of ethnicities identified with. The common test is the presence of an ethnicity identifier (a value of 1 or TRUE.) The test of the sum is for any indication of an ethnicity (greater than 1.) Filtering can be done when the there is more than one ethnicity (a sum greater than 1.)

    cps %>%
      gather(
        key = minority_name, 
        value = minority_indc,
        black,
        hisp
        ) %>%
      group_by(id, year) %>%
      filter(sum(minority_indc) > 1) %>%
      ungroup()
    # A tibble: 0 x 10
    # ... with 10 variables: trt <dbl>, age <dbl>, educ <dbl>, marr <dbl>,
    #   no_deg <dbl>, id <int>, year <chr>, real_earn <dbl>,
    #   minority_name <chr>, minority_indc <dbl>

    There are no observations with more than one ethnicity.

  4. We will create an identifier to indicate if the person identifies with any ethnicity.

    We begin by gathing on the ethnicity variables and grouping by observations (id and year) as in the prior example. Instead of filtering on the sum of ethnicties, here a new variable is created based the test of any identification with any ethnicity.

    The data frame is then speard() on the same two variables created by the gather. This return the data frame to its prior form (an observation for each id and year.)

    Here the gather() and spread() functions were used temporarily create a variable from row. This is another technic for working on related rows. The map() functions from section ???????????? is another method for working with related rows.

    cps <-
      cps %>%
      gather(
        key = minority_name, 
        value = minority_indc,
        black,
        hisp
        ) %>%
      group_by(id, year) %>%
      mutate(
        minority = sum(minority_indc) > 0
        ) %>%
      ungroup() %>%
      spread(
        key = minority_name, 
        value = minority_indc
        ) %>%
      arrange(id, year)
    
    cps %>%
      select(id, year,  age, educ, minority, black, hisp) %>%
      head()
    # A tibble: 6 x 7
         id year    age  educ minority black  hisp
      <int> <chr> <dbl> <dbl> <lgl>    <dbl> <dbl>
    1     1 74       45    11 FALSE        0     0
    2     1 75       45    11 FALSE        0     0
    3     1 78       45    11 FALSE        0     0
    4     2 74       21    14 FALSE        0     0
    5     2 75       21    14 FALSE        0     0
    6     2 78       21    14 FALSE        0     0
  5. We will create a single ethnicity variable that will contain a level identifier for the person's identified ethnicity.

    We begin by creating an idicator variable for the ethnicity of white non-hispanic.

    As in the prior examples, gather() is used to create two ethnicity variables. Here we want the resulting value to be data from the variable formed from key, ethnicity name. Instead of using spread() to return to tidy form, filter is used to exclude the ethnicities that are not relevent. The result is one row for each observations

    cps <-
      cps %>%
      mutate(
        white_non_hisp = !minority
      ) %>%
      gather(
        key = ethnicity, 
        value = ethnicity_id,
        black,
        hisp,
        white_non_hisp
        ) %>%
      filter(ethnicity_id == 1) %>%
      select(-ethnicity_id) %>%
      arrange(id, year)
    
    cps %>%
      select(id, year,  age, educ, minority, ethnicity) %>%
      head()
    # A tibble: 6 x 6
         id year    age  educ minority ethnicity     
      <int> <chr> <dbl> <dbl> <lgl>    <chr>         
    1     1 74       45    11 FALSE    white_non_hisp
    2     1 75       45    11 FALSE    white_non_hisp
    3     1 78       45    11 FALSE    white_non_hisp
    4     2 74       21    14 FALSE    white_non_hisp
    5     2 75       21    14 FALSE    white_non_hisp
    6     2 78       21    14 FALSE    white_non_hisp

    Sets of related indicator variables can be difficult to wrangle. This are typically converted to an factor variable when the indicators are mutually exclusive and this would be the prefered format for the data.

    Can you see a problem with the code from this example?

    It ignores the possibility of missing values for ethnicity and white_non_hisp is assumed.

6.2.3 Examples - Python

These examples use the cps1.csv data set.

  1. We begin by loading the packages, importing the csv file, and naming the variables.

    import os
    from pathlib import Path
    import pandas as pd
    import numpy as np
    cps1_path = Path('..') / 'datasets' / 'cps1.csv'
    cps1_in = pd.read_csv(cps1_path)
    cps1_in = (
        cps1_in
            .rename(columns={
                'nodeg': 'no_deg',
                're74': 'real_earn_74',
                're75': 'real_earn_75',
                're78': 'real_earn_78'}))
    
    cps =  (
        cps1_in
            .copy(deep=True)
            .drop(columns='Unnamed: 0'))
    
    print(cps.dtypes)
    trt               int64
    age               int64
    educ              int64
    black             int64
    hisp              int64
    marr              int64
    no_deg            int64
    real_earn_74    float64
    real_earn_75    float64
    real_earn_78    float64
    dtype: object
    (cps
       .drop(['real_earn_75', 'real_earn_78'], axis='columns')
       .head()
       .pipe(print))
       trt  age  educ  black  hisp  marr  no_deg  real_earn_74
    0    0   45    11      0     0     1       1     21516.670
    1    0   21    14      0     0     0       0      3175.971
    2    0   38    12      0     0     1       0     23039.020
    3    0   48     6      0     0     1       1     24994.370
    4    0   18     8      0     0     1       1      1669.295

    The age, marr, educ, and no_deg variables may have changed over the three years of this study. We have no information on when these variables were recored.

  2. There are three variables for earnings in different years. An observation in this data set seems to be a person and year. We will tidy this data frame.

    The data does not have an observation identifier. We will add this identifier to connect the different years for the same person.

    The melt() method is used to transform a data frame from wide to tidy (long form.) The var_name parameter names the variable that will identify which of the wide variables the new row is for. The value_name parameter names the variable that will hold the values from the variables that are being stacked into a single variable. The value_vars parameter is the variables that will be stacked into a single row. The id_vars is the variables that uniquely identify the observation. Any variable that is not listed in either the values_var or id_vars is dropped from the new data frame.

    cps =  cps.assign(id = lambda df: df.index)
    gather_vars = ['real_earn_74', 'real_earn_75', 'real_earn_78']
    non_gather_vars = list(set(list(cps.columns)) - set(gather_vars))
    cps =  (
        cps
            .melt(
                id_vars=non_gather_vars,
                value_vars=gather_vars,
                var_name='year',
                value_name='real_earn')
            .assign(year=lambda df: 
                df['year'].str.extract('_(\d+)', expand=True))
            .sort_values(['id', 'year'], ascending=[False, False]))
    
    (cps
        .loc[:, ['id', 'year',  'age', 'educ', 'marr', 'real_earn']]
        .head()
        .pipe(print))
              id year  age  educ  marr  real_earn
    47975  15991   78   40    10     0   7979.724
    31983  15991   75   40    10     0  13144.550
    15991  15991   74   40    10     0  13628.660
    47974  15990   78   47     9     1  13671.930
    31982  15990   75   47     9     1  11384.660

    We also remove the real earning part of the year values. This results in each year value being the year identifier.

  3. There are two identifiers for ethnicity (black and hisp.) We will check to see if these are ethnicity are unique (no observations with both black and hisp set to 1.)

    We will do this by melting on the ethnicity variables. This creates a single variable containing all the indicators for ethnicity. We then group by observations (id and year.) These two transformations provide a vector of ethnicity indicators to operate on.

    Summing the ethnicity indicator (within the groups) provides the number of ethnicities identified with. The common test is the presence of an ethnicity identifier (a value of 1 or TRUE.) The test of the sum is for any indication of an ethnicity (greater than 1.) Filtering can be done when the there is more than one ethnicity (a sum greater than 1.) The filter() method is similar to the query() method. It filters at the group level and not the observation level.

    gather_vars = ['black', 'hisp']
    non_gather_vars = list(set(list(cps.columns)) - set(gather_vars))
    (cps
        .melt(
            id_vars=non_gather_vars,
            value_vars=gather_vars,
            var_name='minority_name',
            value_name='minority_indc')
        .groupby(['id', 'year'])
        .filter(lambda x: x['minority_indc'].sum() > 1)
        .loc[:, ['id', 'year',  'age', 'educ',
                 'marr', 'minority_name', 'real_earn']]
        .head()
        .pipe(print))
    Empty DataFrame
    Columns: [id, year, age, educ, marr, minority_name, real_earn]
    Index: []

    There are no observations with more than one ethnicity.

  4. We will create an identifier to indicate if the person identifies with any ethnicity.

    We begin by melting on the ethnicity variables and grouping by observations (id and year) as in the prior example. Instead of filtering on the sum of ethnicties, here a new variable is created based the test of any identification with any ethnicity.

    The data frame is then pivoted on the same two variables created by the gather. This return the data frame to its prior form (an observation for each id and year.)

    Here the melt() and pivot_table() functions were used temporarily create a variable from row. This is another technique for working on related rows. The related rows and relationships between columns sections show other method for working with related rows.

    gather_vars = ['black', 'hisp']
    non_gather_vars = list(set(list(cps.columns)) - set(gather_vars))
    cps = (
        cps
            .melt(
                id_vars=non_gather_vars,
                value_vars=gather_vars,
                var_name='minority_name',
                value_name='minority_indc')
            .assign(minority = lambda df: df
                .groupby(['id', 'year'])
                ['minority_indc']
                .transform('sum')
                > 0)
            .sort_values(['id', 'year'])
            .pivot_table(
                index=['id', 'year', 'trt', 'age', 'educ', 'marr',
                       'no_deg', 'real_earn', 'minority'],
                columns='minority_name',
                values='minority_indc',
                aggfunc='first')
            .reset_index()
            .rename_axis(None, axis='index')
            .sort_values(['id', 'year']))
    
    (cps
        .head()
        .pipe(print))
    minority_name  id year  trt  age  ...  real_earn  minority  black  hisp
    0               0   74    0   45  ...  21516.670     False      0     0
    1               0   75    0   45  ...  25243.550     False      0     0
    2               0   78    0   45  ...  25564.670     False      0     0
    3               1   74    0   21  ...   3175.971     False      0     0
    4               1   75    0   21  ...   5852.565     False      0     0
    
    [5 rows x 11 columns]
  5. We will create a single ethnicity variable that will contain a level identifier for the person's identified ethnicity.

    We begin by creating an idicator variable for the ethnicity of white non-hispanic.

    As in the prior examples, melt() is used to create two ethnicity variables. Here we want the resulting value to be data from the variable formed from ethnicity name. Instead of using pivot() to return to tidy form, query() is used to exclude the ethnicities that are not relevent. The result is one row for each observations

    gather_vars = ['black', 'hisp', 'white_non_hisp']
    non_gather_vars = list(set(list(cps.columns)) - set(gather_vars))
    cps = (
        cps
            .assign(white_non_hisp = lambda df: ~df['minority'])
            .melt(
                id_vars=non_gather_vars,
                value_vars=gather_vars,
                var_name='ethnicity',
                value_name='ethnicity_id')
            .query('ethnicity_id == 1')
            .drop(['ethnicity_id', 'minority'], axis='columns')
            .sort_values(['id', 'year'], ascending=[True, True]))
    
    (cps
        .loc[:, ['id', 'year', 'trt',  'age', 'educ',
                 'marr', 'real_earn', 'ethnicity']]
        .head()
        .pipe(print))
           id year  trt  age  educ  marr  real_earn       ethnicity
    95952   0   74    0   45    11     1  21516.670  white_non_hisp
    95953   0   75    0   45    11     1  25243.550  white_non_hisp
    95954   0   78    0   45    11     1  25564.670  white_non_hisp
    95955   1   74    0   21    14     0   3175.971  white_non_hisp
    95956   1   75    0   21    14     0   5852.565  white_non_hisp

    Note the age variable looks incorrect in the data in this format. This is due to the age only being recorded once for each subject. This is a limitation of the data and not a problem with the data wrangling.

    Sets of related indicator variables can be difficult to wrangle. This are typically converted to an factor variable when the indicators are mutually exclusive and this would be the prefered format for the data.

    Can you see a problem with the code from this example?

    It ignores the possibility of missing values for ethnicity and white_non_hisp is assumed.

6.2.4 Exercises

These examples use the Car.csv data set.

  1. Load the Car.csv data set.

  2. Create a new data frame that uses only the variables for vehicle chosen, demographics, and body type options. Tidy this new data frame.

  3. Using the data set with all the variables on the vehicle options, Tidy the data set.

    To gather multiple variables from multiple sets of columns, one gathers on all columns from all the sets. Then modify the variable created as the key to contain the names of the new variables. Finish by spreading using the column that contains the variable names. This is a common set of steps used when shapping a data frame. That is make the data frame as tall enough to contain everything that will be spread. Then spread to the desired columns.