6.2 Tidy data
6.2.1 Data concepts
A data frame is said to be tidy (Hadley's paper) if the following conditions are met.
A single column (variable) contains all measures of the same data feature of the observations.
A row contains measures of the same observation.
The following are a few common conditions when a data frame may not be tidy.
A data feature may be collected multiple times with an order. When the order is an important feature of the data it is often called panel data. The order is often time. For example it might be a start and end value from an experiment done over a fixed period of time. If the start and end value are separate variables, this would not be tidy data. When data features from each panels are in separate variables the data is said to be in long form.
A data feature may be collected multiple times with no particular order. When the order is not an important feature of the data it is often called repeated measures data. An example of this is a farmer is interested in the nitrogen level of a field. He might take a set of random samples. His real interest might be the average nitrogen level. When each of the repeated measures are stored in their own variable, the data is not tidy.
The variable names can contain values. For example, a data set could contain a set of indicator variables to identify the income range of subjects. This would not be tidy data.
Note, when certain analysis are done, such as regression, categorical variables are converted to sets of indicator variables. While this needed for proper analysis, this is not the best form to manage the data in.
Multiple data features may be stored in a single variable. An example, is city and state may be stored in a single variable.
When data is stored in multiple rows, it can contain structural missing data. This occurs when it is not possible for a value to exist for the element. An example of this is a data set that contains a measure taken each day of the month. If the data is organized with a column for each day of the month, there would be 31 columns for the measures. The months with less than 31 days would have missing data for the days that do not exist in the month. The structural missing data can be deleted when the data is represented as a single column.
Reshaping the data from wide to long or long to wide can be visualized
In the following graphic.
To go from long to wide, a variable is identified as containing the
information what the name of the repeated variable is,
here that variable is key
.
Each unique value of the key
variable will result in a variable of that
name in wide form.
Another variable is identified as containing the values that will be used
to construct the new variables,
here that variable is val
.
The key
variable has 3 unique values.
This results in key
and val
variables being transformed into three
new variables, A
, B
, and C
.
To go from wide to long form the set of wide variables needs to be
identified,
here that would be, A
, B
, and C
.
Two new variables will be created from these wide variables.
One of these two new variables will contain the names of the
of the wide variables,
here this is the key
variable.
The other new variable will contain the values from the wide
variables,
here this is the val
variable.
The values of the key
variable tell us which wide variable the
val
values are from.
For example,
in the first data row of the long form the A
in key
tells us that
9
is from the A
variable.
6.2.2 Examples - R
These examples use the cps1.csv
data set.
We begin by loading the tidyverse and import the csv file.
library(tidyverse)
cps1_path <- file.path("..", "datasets", "cps1.csv") cps1_in <- read_csv(cps1_path, col_types = cols())
Warning: Missing column names filled in: 'X1' [1]
cps <- cps1_in %>% select(-X1) %>% rename( no_deg = nodeg, real_earn_74 = re74, real_earn_75 = re75, real_earn_78 = re78 ) glimpse(cps)
Observations: 15,992 Variables: 10 $ trt <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0... $ age <dbl> 45, 21, 38, 48, 18, 22, 48, 18, 48, 45, 34, 16, 5... $ educ <dbl> 11, 14, 12, 6, 8, 11, 10, 11, 9, 12, 14, 10, 10, ... $ black <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0... $ hisp <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0... $ marr <dbl> 1, 0, 1, 1, 1, 1, 1, 0, 1, 1, 1, 0, 1, 0, 1, 1, 1... $ no_deg <dbl> 1, 0, 0, 1, 1, 1, 1, 1, 1, 0, 0, 1, 1, 0, 0, 1, 0... $ real_earn_74 <dbl> 21516.6700, 3175.9710, 23039.0200, 24994.3700, 16... $ real_earn_75 <dbl> 25243.550, 5852.565, 25130.760, 25243.550, 10727.... $ real_earn_78 <dbl> 25564.670, 13496.080, 25564.670, 25564.670, 9860....
cps %>% select(real_earn_75, real_earn_78) %>% head()
# A tibble: 6 x 2 real_earn_75 real_earn_78 <dbl> <dbl> 1 25244. 25565. 2 5853. 13496. 3 25131. 25565. 4 25244. 25565. 5 10728. 9861. 6 18449. 25565.
There are three variables for earnings in different years. An observation in this data set seems to be a person and year. We will tidy this data frame.
The data does not have an observation identifier. We will add this identifier to connect the different years from the same person.
The
gather()
function is used to transform a data frame from wide to tidy (long form.) The key parameter names the variable that will identify which of the wide variables the new row is for. The value parameter names the variable that will hold the values from the variables that are being stacked into a single variable. The remaining (un-named) parameters are the variables that will be stacked into a single row.cps <- cps %>% mutate( id = row_number() ) %>% gather( key = year, value = real_earn, real_earn_74, real_earn_75, real_earn_78 ) %>% separate(year, into = c("X1", "X2", "year"), sep = "_") %>% select(-X1, -X2) %>% arrange(id, year) cps %>% select(id, year, age, educ, marr, real_earn) %>% head()
# A tibble: 6 x 6 id year age educ marr real_earn <int> <chr> <dbl> <dbl> <dbl> <dbl> 1 1 74 45 11 1 21517. 2 1 75 45 11 1 25244. 3 1 78 45 11 1 25565. 4 2 74 21 14 0 3176. 5 2 75 21 14 0 5853. 6 2 78 21 14 0 13496.
We also remove the real earning part of the year values. This results in each year value being the year identifier.
There are two identifiers for ethnicity (
black
andhisp
.) We will check to see if these are ethnicity are unique (no observations with bothblack
andhisp
set to 1.)We will do this by gathing on the ethnicity variables. This creates a single variable containing all the indicators for ethnicity. We then group by observations (
id
andyear
.) These two transformations provide a vector of ethnicity indicators to operate on.Summing the ethnicity indicator (within the groups) provides the number of ethnicities identified with. The common test is the presence of an ethnicity identifier (a value of 1 or
TRUE
.) The test of the sum is for any indication of an ethnicity (greater than 1.) Filtering can be done when the there is more than one ethnicity (a sum greater than 1.)cps %>% gather( key = minority_name, value = minority_indc, black, hisp ) %>% group_by(id, year) %>% filter(sum(minority_indc) > 1) %>% ungroup()
# A tibble: 0 x 10 # ... with 10 variables: trt <dbl>, age <dbl>, educ <dbl>, marr <dbl>, # no_deg <dbl>, id <int>, year <chr>, real_earn <dbl>, # minority_name <chr>, minority_indc <dbl>
There are no observations with more than one ethnicity.
We will create an identifier to indicate if the person identifies with any ethnicity.
We begin by gathing on the ethnicity variables and grouping by observations (
id
andyear
) as in the prior example. Instead of filtering on the sum of ethnicties, here a new variable is created based the test of any identification with any ethnicity.The data frame is then
speard()
on the same two variables created by the gather. This return the data frame to its prior form (an observation for eachid
andyear
.)Here the
gather()
andspread()
functions were used temporarily create a variable from row. This is another technic for working on related rows. Themap()
functions from section ???????????? is another method for working with related rows.cps <- cps %>% gather( key = minority_name, value = minority_indc, black, hisp ) %>% group_by(id, year) %>% mutate( minority = sum(minority_indc) > 0 ) %>% ungroup() %>% spread( key = minority_name, value = minority_indc ) %>% arrange(id, year) cps %>% select(id, year, age, educ, minority, black, hisp) %>% head()
# A tibble: 6 x 7 id year age educ minority black hisp <int> <chr> <dbl> <dbl> <lgl> <dbl> <dbl> 1 1 74 45 11 FALSE 0 0 2 1 75 45 11 FALSE 0 0 3 1 78 45 11 FALSE 0 0 4 2 74 21 14 FALSE 0 0 5 2 75 21 14 FALSE 0 0 6 2 78 21 14 FALSE 0 0
We will create a single ethnicity variable that will contain a level identifier for the person's identified ethnicity.
We begin by creating an idicator variable for the ethnicity of white non-hispanic.
As in the prior examples,
gather()
is used to create two ethnicity variables. Here we want the resulting value to be data from the variable formed fromkey
, ethnicity name. Instead of usingspread()
to return to tidy form, filter is used to exclude the ethnicities that are not relevent. The result is one row for each observationscps <- cps %>% mutate( white_non_hisp = !minority ) %>% gather( key = ethnicity, value = ethnicity_id, black, hisp, white_non_hisp ) %>% filter(ethnicity_id == 1) %>% select(-ethnicity_id) %>% arrange(id, year) cps %>% select(id, year, age, educ, minority, ethnicity) %>% head()
# A tibble: 6 x 6 id year age educ minority ethnicity <int> <chr> <dbl> <dbl> <lgl> <chr> 1 1 74 45 11 FALSE white_non_hisp 2 1 75 45 11 FALSE white_non_hisp 3 1 78 45 11 FALSE white_non_hisp 4 2 74 21 14 FALSE white_non_hisp 5 2 75 21 14 FALSE white_non_hisp 6 2 78 21 14 FALSE white_non_hisp
Sets of related indicator variables can be difficult to wrangle. This are typically converted to an factor variable when the indicators are mutually exclusive and this would be the prefered format for the data.
Can you see a problem with the code from this example?
It ignores the possibility of missing values for ethnicity and
white_non_hisp
is assumed.
6.2.3 Examples - Python
These examples use the cps1.csv
data set.
We begin by loading the packages, importing the csv file, and naming the variables.
import os from pathlib import Path import pandas as pd import numpy as np
cps1_path = Path('..') / 'datasets' / 'cps1.csv' cps1_in = pd.read_csv(cps1_path) cps1_in = ( cps1_in .rename(columns={ 'nodeg': 'no_deg', 're74': 'real_earn_74', 're75': 'real_earn_75', 're78': 'real_earn_78'})) cps = ( cps1_in .copy(deep=True) .drop(columns='Unnamed: 0')) print(cps.dtypes)
trt int64 age int64 educ int64 black int64 hisp int64 marr int64 no_deg int64 real_earn_74 float64 real_earn_75 float64 real_earn_78 float64 dtype: object
(cps .drop(['real_earn_75', 'real_earn_78'], axis='columns') .head() .pipe(print))
trt age educ black hisp marr no_deg real_earn_74 0 0 45 11 0 0 1 1 21516.670 1 0 21 14 0 0 0 0 3175.971 2 0 38 12 0 0 1 0 23039.020 3 0 48 6 0 0 1 1 24994.370 4 0 18 8 0 0 1 1 1669.295
The
age
,marr
,educ
, andno_deg
variables may have changed over the three years of this study. We have no information on when these variables were recored.There are three variables for earnings in different years. An observation in this data set seems to be a person and year. We will tidy this data frame.
The data does not have an observation identifier. We will add this identifier to connect the different years for the same person.
The
melt()
method is used to transform a data frame from wide to tidy (long form.) Thevar_name
parameter names the variable that will identify which of the wide variables the new row is for. Thevalue_name
parameter names the variable that will hold the values from the variables that are being stacked into a single variable. Thevalue_vars
parameter is the variables that will be stacked into a single row. Theid_vars
is the variables that uniquely identify the observation. Any variable that is not listed in either thevalues_var
orid_vars
is dropped from the new data frame.cps = cps.assign(id = lambda df: df.index)
gather_vars = ['real_earn_74', 'real_earn_75', 'real_earn_78'] non_gather_vars = list(set(list(cps.columns)) - set(gather_vars)) cps = ( cps .melt( id_vars=non_gather_vars, value_vars=gather_vars, var_name='year', value_name='real_earn') .assign(year=lambda df: df['year'].str.extract('_(\d+)', expand=True)) .sort_values(['id', 'year'], ascending=[False, False])) (cps .loc[:, ['id', 'year', 'age', 'educ', 'marr', 'real_earn']] .head() .pipe(print))
id year age educ marr real_earn 47975 15991 78 40 10 0 7979.724 31983 15991 75 40 10 0 13144.550 15991 15991 74 40 10 0 13628.660 47974 15990 78 47 9 1 13671.930 31982 15990 75 47 9 1 11384.660
We also remove the real earning part of the year values. This results in each year value being the year identifier.
There are two identifiers for ethnicity (
black
andhisp
.) We will check to see if these are ethnicity are unique (no observations with bothblack
andhisp
set to 1.)We will do this by melting on the ethnicity variables. This creates a single variable containing all the indicators for ethnicity. We then group by observations (
id
andyear
.) These two transformations provide a vector of ethnicity indicators to operate on.Summing the ethnicity indicator (within the groups) provides the number of ethnicities identified with. The common test is the presence of an ethnicity identifier (a value of 1 or
TRUE
.) The test of the sum is for any indication of an ethnicity (greater than 1.) Filtering can be done when the there is more than one ethnicity (a sum greater than 1.) Thefilter()
method is similar to thequery()
method. It filters at the group level and not the observation level.gather_vars = ['black', 'hisp'] non_gather_vars = list(set(list(cps.columns)) - set(gather_vars)) (cps .melt( id_vars=non_gather_vars, value_vars=gather_vars, var_name='minority_name', value_name='minority_indc') .groupby(['id', 'year']) .filter(lambda x: x['minority_indc'].sum() > 1) .loc[:, ['id', 'year', 'age', 'educ', 'marr', 'minority_name', 'real_earn']] .head() .pipe(print))
Empty DataFrame Columns: [id, year, age, educ, marr, minority_name, real_earn] Index: []
There are no observations with more than one ethnicity.
We will create an identifier to indicate if the person identifies with any ethnicity.
We begin by melting on the ethnicity variables and grouping by observations (
id
andyear
) as in the prior example. Instead of filtering on the sum of ethnicties, here a new variable is created based the test of any identification with any ethnicity.The data frame is then pivoted on the same two variables created by the gather. This return the data frame to its prior form (an observation for each
id
andyear
.)Here the
melt()
andpivot_table()
functions were used temporarily create a variable from row. This is another technique for working on related rows. The related rows and relationships between columns sections show other method for working with related rows.gather_vars = ['black', 'hisp'] non_gather_vars = list(set(list(cps.columns)) - set(gather_vars)) cps = ( cps .melt( id_vars=non_gather_vars, value_vars=gather_vars, var_name='minority_name', value_name='minority_indc') .assign(minority = lambda df: df .groupby(['id', 'year']) ['minority_indc'] .transform('sum') > 0) .sort_values(['id', 'year']) .pivot_table( index=['id', 'year', 'trt', 'age', 'educ', 'marr', 'no_deg', 'real_earn', 'minority'], columns='minority_name', values='minority_indc', aggfunc='first') .reset_index() .rename_axis(None, axis='index') .sort_values(['id', 'year'])) (cps .head() .pipe(print))
minority_name id year trt age ... real_earn minority black hisp 0 0 74 0 45 ... 21516.670 False 0 0 1 0 75 0 45 ... 25243.550 False 0 0 2 0 78 0 45 ... 25564.670 False 0 0 3 1 74 0 21 ... 3175.971 False 0 0 4 1 75 0 21 ... 5852.565 False 0 0 [5 rows x 11 columns]
We will create a single ethnicity variable that will contain a level identifier for the person's identified ethnicity.
We begin by creating an idicator variable for the ethnicity of white non-hispanic.
As in the prior examples,
melt()
is used to create two ethnicity variables. Here we want the resulting value to be data from the variable formed from ethnicity name. Instead of usingpivot()
to return to tidy form,query()
is used to exclude the ethnicities that are not relevent. The result is one row for each observationsgather_vars = ['black', 'hisp', 'white_non_hisp'] non_gather_vars = list(set(list(cps.columns)) - set(gather_vars)) cps = ( cps .assign(white_non_hisp = lambda df: ~df['minority']) .melt( id_vars=non_gather_vars, value_vars=gather_vars, var_name='ethnicity', value_name='ethnicity_id') .query('ethnicity_id == 1') .drop(['ethnicity_id', 'minority'], axis='columns') .sort_values(['id', 'year'], ascending=[True, True])) (cps .loc[:, ['id', 'year', 'trt', 'age', 'educ', 'marr', 'real_earn', 'ethnicity']] .head() .pipe(print))
id year trt age educ marr real_earn ethnicity 95952 0 74 0 45 11 1 21516.670 white_non_hisp 95953 0 75 0 45 11 1 25243.550 white_non_hisp 95954 0 78 0 45 11 1 25564.670 white_non_hisp 95955 1 74 0 21 14 0 3175.971 white_non_hisp 95956 1 75 0 21 14 0 5852.565 white_non_hisp
Note the
age
variable looks incorrect in the data in this format. This is due to the age only being recorded once for each subject. This is a limitation of the data and not a problem with the data wrangling.Sets of related indicator variables can be difficult to wrangle. This are typically converted to an factor variable when the indicators are mutually exclusive and this would be the prefered format for the data.
Can you see a problem with the code from this example?
It ignores the possibility of missing values for ethnicity and
white_non_hisp
is assumed.
6.2.4 Exercises
These examples use the Car.csv
data set.
Load the
Car.csv
data set.Create a new data frame that uses only the variables for vehicle chosen, demographics, and body type options. Tidy this new data frame.
Using the data set with all the variables on the vehicle options, Tidy the data set.
To gather multiple variables from multiple sets of columns, one gathers on all columns from all the sets. Then modify the variable created as the
key
to contain the names of the new variables. Finish by spreading using the column that contains the variable names. This is a common set of steps used when shapping a data frame. That is make the data frame as tall enough to contain everything that will be spread. Then spread to the desired columns.