Supporting Statistical Analysis for Research
6.2 Tidy data
These examples use the Car.csv
data set.
Load the
Car.csv
data set.from pathlib import Path import os import numpy as np import pandas as pd
car_path = Path('..') / 'datasets' / 'Car.csv' car_in = pd.read_csv(car_path) car_in = ( car_in .rename(columns={ 'Unnamed: 0': 'id'})) car = car_in.copy(deep=True) print(car.dtypes)
id int64 choice object college int64 hsg2 int64 coml5 int64 type1 object type2 object type3 object type4 object type5 object type6 object fuel1 object fuel2 object fuel3 object fuel4 object fuel5 object fuel6 object price1 float64 price2 float64 price3 float64 price4 float64 price5 float64 price6 float64 range1 int64 range2 int64 range3 int64 range4 int64 range5 int64 range6 int64 acc1 float64 ... pollution1 float64 pollution2 float64 pollution3 float64 pollution4 float64 pollution5 float64 pollution6 float64 size1 int64 size2 int64 size3 int64 size4 int64 size5 int64 size6 int64 space1 float64 space2 float64 space3 float64 space4 float64 space5 int64 space6 int64 cost1 int64 cost2 int64 cost3 int64 cost4 int64 cost5 int64 cost6 int64 station1 float64 station2 float64 station3 float64 station4 float64 station5 float64 station6 float64 Length: 71, dtype: object
(car .head() .pipe(print))
id choice college hsg2 ... station3 station4 station5 station6 0 1 choice1 0 0 ... 0.3 0.3 1.0 1.0 1 2 choice2 1 1 ... 0.1 0.1 1.0 1.0 2 3 choice5 0 1 ... 0.3 0.3 1.0 1.0 3 4 choice5 0 0 ... 0.7 0.7 0.1 0.1 4 5 choice5 0 1 ... 0.7 0.7 1.0 1.0 [5 rows x 71 columns]
Create a new data frame that uses only the variables for vehicle chosen, demographics, and body type options. Tidy this new data frame.
gather_vars = ['type1', 'type2', 'type3', 'type4', 'type5', 'type6'] non_gather_vars = ['id', 'choice', 'college', 'hsg2', 'coml5'] car_type = ( car .loc[:, 'id':'type6'] .melt( id_vars=non_gather_vars, value_vars=gather_vars, var_name='type_order', value_name='type') .assign( order=lambda df: df['type_order'].str.extract('type([\\d]+)', expand=True)) .drop(columns='type_order') .sort_values(by=['id', 'order']) ) print(car_type.dtypes)
id int64 choice object college int64 hsg2 int64 coml5 int64 type object order object dtype: object
(car_type .loc[:, ['id', 'choice', 'order', 'college', 'hsg2', 'coml5', 'type']] .head(9) .pipe(print))
id choice order college hsg2 coml5 type 0 1 choice1 1 0 0 0 van 4654 1 choice1 2 0 0 0 regcar 9308 1 choice1 3 0 0 0 van 13962 1 choice1 4 0 0 0 stwagon 18616 1 choice1 5 0 0 0 van 23270 1 choice1 6 0 0 0 truck 1 2 choice2 1 1 1 1 regcar 4655 2 choice2 2 1 1 1 van 9309 2 choice2 3 1 1 1 regcar
Using the data set with all the variables on the vehicle options, Tidy the data set.
To gather multiple variables from multiple sets of columns, one gathers on all columns from all the sets. Then modify the variable created as the
key
to contain the names of the new variables. Finish by spreading using the column that contains the variable names. This is a common set of steps used when shapping a data frame. That is make the data frame as tall enough to contain everything that will be spread. Then spread to the desired columns.non_gather_vars = ['id', 'choice', 'college', 'hsg2', 'coml5'] non_pivot_vars = non_gather_vars.copy() non_pivot_vars.append('order') car_tidy = ( car .melt( id_vars=non_gather_vars, var_name='variable_order', value_name='value') .assign( order=lambda df: df['variable_order'].str.extract('([\\d]+)', expand=True), variable=lambda df: df['variable_order'].str.extract('([^\\d]+)', expand=True)) .drop(columns='variable_order') .pivot_table( index=non_pivot_vars, columns='variable', values='value', aggfunc='first') .reset_index() .rename_axis(None, axis='index') .set_index('order') .reset_index() .sort_values(by=['id', 'order']) ) print(car_tidy.shape)
(27924, 17)
(car_tidy .head(9) .pipe(print))
variable order id choice college hsg2 ... size space speed station type 0 1 1 choice1 0 0 ... 3 0.7 95 0.1 van 1 2 1 choice1 0 0 ... 3 0.7 95 0.1 regcar 2 3 1 choice1 0 0 ... 2 1 110 0.3 van 3 4 1 choice1 0 0 ... 2 1 110 0.3 stwagon 4 5 1 choice1 0 0 ... 3 1 140 1 van 5 6 1 choice1 0 0 ... 3 1 140 1 truck 6 1 2 choice2 1 1 ... 3 0.7 85 0 regcar 7 2 2 choice2 1 1 ... 3 0.7 85 0 van 8 3 2 choice2 1 1 ... 3 1 140 0.1 regcar [9 rows x 17 columns]