Supporting Statistical Analysis for Research
5.4 Factors and Indicators
These exercises use the mtcars.csv data set.
Import the mtcars.csv data set.

    from pathlib import Path
    import pandas as pd
    import numpy as np

    mtcars_path = Path('..') / 'datasets' / 'mtcars.csv'
    mtcars_in = pd.read_csv(mtcars_path)
    mtcars_in = mtcars_in.rename(columns={'Unnamed: 0': 'make_model'})
    mtcars = mtcars_in.copy(deep=True)

    print(mtcars.dtypes)
    make_model     object
    mpg           float64
    cyl             int64
    disp          float64
    hp              int64
    drat          float64
    wt            float64
    qsec          float64
    vs              int64
    am              int64
    gear            int64
    carb            int64
    dtype: object
Factor the cyl, gear, and carb variables.

    mtcars = (
        mtcars
            .apply(
                func=lambda x:
                    x.astype('category')
                    if x.name in ['cyl', 'gear', 'carb']
                    else x))

    print(mtcars.dtypes)
    make_model      object
    mpg            float64
    cyl           category
    disp           float64
    hp               int64
    drat           float64
    wt             float64
    qsec           float64
    vs               int64
    am               int64
    gear          category
    carb          category
    dtype: object
or
    mtcars = mtcars_in.copy(deep=True)

    # Sorted unique values of each variable become the category levels
    cyl_lev = pd.Series(mtcars['cyl'].unique()).sort_values()
    gear_lev = pd.Series(mtcars['gear'].unique()).sort_values()
    carb_lev = pd.Series(mtcars['carb'].unique()).sort_values()

    mtcars = (
        mtcars
            .assign(
                cyl=lambda df: pd.Categorical(df['cyl'], categories=cyl_lev),
                gear=lambda df: pd.Categorical(df['gear'], categories=gear_lev),
                carb=lambda df: pd.Categorical(df['carb'], categories=carb_lev)))

    print(mtcars.dtypes)
    make_model      object
    mpg            float64
    cyl           category
    disp           float64
    hp               int64
    drat           float64
    wt             float64
    qsec           float64
    vs               int64
    am               int64
    gear          category
    carb          category
    dtype: object
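A third option, shown here only as a sketch, is to pass a column-to-dtype mapping to astype. The toy data frame below is a hypothetical stand-in for mtcars (its values are made up for illustration), so that the example runs without the csv file.

```python
import pandas as pd

# Hypothetical miniature stand-in for mtcars; values are for illustration only.
toy = pd.DataFrame({
    'mpg': [21.0, 22.8, 18.7, 14.3],
    'cyl': [6, 4, 8, 8],
    'gear': [4, 4, 3, 3],
    'carb': [4, 1, 2, 4],
})

# astype accepts a mapping of column name to dtype;
# columns not listed in the mapping keep their current dtype.
toy = toy.astype({'cyl': 'category', 'gear': 'category', 'carb': 'category'})

print(toy.dtypes)

# The category levels are the sorted unique values of the column.
print(toy['cyl'].cat.categories.tolist())
```

With this approach the levels are inferred from the data. The pd.Categorical form above is still the one to use when the levels or their order need to be set explicitly.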
Create a variable that identifies the observations that are in the top 25 percent of miles per gallon. Display a few of these vehicles.
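Hint: you will need to find a function that returns the percentage points (quantiles) of a variable.
Note that the quantile function returns a Series when it is given a list of quantiles, so the value has to be extracted from that Series before it is used in the comparison.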
    mtcars = (
        mtcars
            .assign(
                efficient=lambda df:
                    np.where(
                        # .at[0.75] selects the value by its quantile label
                        df['mpg'] >= df['mpg'].quantile([0.75]).at[0.75],
                        True,
                        False)))

    (mtcars
        .loc[:, ['make_model', 'mpg', 'efficient']]
        .head()
        .pipe(print))
              make_model   mpg  efficient
    0          Mazda RX4  21.0      False
    1      Mazda RX4 Wag  21.0      False
    2         Datsun 710  22.8       True
    3     Hornet 4 Drive  21.4      False
    4  Hornet Sportabout  18.7      False
or
    mtcars = (
        mtcars
            .assign(
                efficient=lambda df:
                    np.where(
                        # .iloc[0] selects the value by its position
                        df['mpg'] >= df['mpg'].quantile([0.75]).iloc[0],
                        True,
                        False)))

    (mtcars
        .loc[:, ['make_model', 'mpg', 'efficient']]
        .head()
        .pipe(print))
              make_model   mpg  efficient
    0          Mazda RX4  21.0      False
    1      Mazda RX4 Wag  21.0      False
    2         Datsun 710  22.8       True
    3     Hornet 4 Drive  21.4      False
    4  Hornet Sportabout  18.7      False
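Both solutions pass a list to quantile and then pull the single value out of the resulting Series. As a minimal sketch on made-up data (the mpg values below are hypothetical, not taken from mtcars), passing a single number instead returns a scalar, and the comparison on its own already produces a boolean Series.

```python
import pandas as pd

# Hypothetical mpg values, chosen so the quantile is easy to check by hand.
mpg = pd.Series([10.0, 20.0, 30.0, 40.0, 50.0])

# A single number returns a scalar.
as_scalar = mpg.quantile(0.75)

# A list returns a Series indexed by the requested quantiles.
as_series = mpg.quantile([0.75])

# The comparison yields a boolean Series directly,
# so wrapping it in np.where(..., True, False) is optional.
efficient = mpg >= as_scalar

print(as_scalar)
print(as_series)
print(efficient.tolist())
```

Which form to use is a matter of preference. The list form is handy when several quantiles are needed at once.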
Create a variable that bins the values of hp using the following cut points: 100, 170, 240, and 300 hp.

    mtcars = (
        mtcars
            .assign(
                power=lambda df:
                    pd.cut(
                        df['hp'],
                        # Open-ended outer bins capture values below 100 and above 300
                        bins=[-np.inf, 100, 170, 240, 300, np.inf],
                        labels=['gocart', 'slow', 'typical', 'fast', 'beast'])))

    (mtcars
        .loc[:, ['make_model', 'mpg', 'efficient', 'power']]
        .head()
        .pipe(print))
              make_model   mpg  efficient    power
    0          Mazda RX4  21.0      False     slow
    1      Mazda RX4 Wag  21.0      False     slow
    2         Datsun 710  22.8       True   gocart
    3     Hornet 4 Drive  21.4      False     slow
    4  Hornet Sportabout  18.7      False  typical
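One detail of the binning above is worth checking: by default pd.cut closes each bin on the right, so a vehicle with exactly 100 hp lands in the first bin. The sketch below uses hypothetical hp values (not taken from mtcars) to show where values on a cut point go, and how right=False moves them to the next bin.

```python
import numpy as np
import pandas as pd

# Hypothetical hp values, including two that sit exactly on a cut point.
hp = pd.Series([93, 100, 101, 170, 245, 335])

bins = [-np.inf, 100, 170, 240, 300, np.inf]
labels = ['gocart', 'slow', 'typical', 'fast', 'beast']

# Default: bins are closed on the right, e.g. (-inf, 100], (100, 170], ...
power = pd.cut(hp, bins=bins, labels=labels)

# right=False: bins are closed on the left, e.g. [-inf, 100), [100, 170), ...
power_left = pd.cut(hp, bins=bins, labels=labels, right=False)

print(pd.DataFrame({'hp': hp, 'power': power, 'power_left': power_left}))
```

Only the rows with 100 and 170 hp change category between the two versions. Which convention is appropriate depends on how the cut points in the exercise are meant to be read.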