Supporting Statistical Analysis for Research
5.2 Character variables
These exercises use the mtcars.csv data set.
Import the
mtcars.csvdata set.from pathlib import Path import pandas as pd import numpy as npmtcars_path = Path('..') / 'datasets' / 'mtcars.csv' mtcars_in = pd.read_csv(mtcars_path) mtcars_in = mtcars_in.rename(columns={'Unnamed: 0': 'make_model'}) mtcars = mtcars_in.copy(deep=True) print(mtcars.dtypes)make_model object mpg float64 cyl int64 disp float64 hp int64 drat float64 wt float64 qsec float64 vs int64 am int64 gear int64 carb int64 dtype: objectDivide the column that has the car name into columns that contain the make and model of the car.
mtcars = ( mtcars .assign( make=lambda df: df['make_model'].str.extract('(\\S+)', expand=True), model=lambda df: df['make_model'].str.extract('\\s+(.+)', expand=True))) (mtcars .loc[:, ['make_model', 'make', 'model']] .head(10) .pipe(print))make_model make model 0 Mazda RX4 Mazda RX4 1 Mazda RX4 Wag Mazda RX4 Wag 2 Datsun 710 Datsun 710 3 Hornet 4 Drive Hornet 4 Drive 4 Hornet Sportabout Hornet Sportabout 5 Valiant Valiant NaN 6 Duster 360 Duster 360 7 Merc 240D Merc 240D 8 Merc 230 Merc 230 9 Merc 280 Merc 280Do all observations have a make and model value? If there are missing values, can you fix them? (Hint, use Google to help you.)
(mtcars .query('make != make | model != model') .head() .pipe(print))make_model mpg cyl disp hp drat ... vs am gear carb make model 5 Valiant 18.1 6 225.0 105 2.76 ... 1 0 3 1 Valiant NaN [1 rows x 14 columns]There is a missing
modelname for themakeofValiantGoogling shows that the
Valiantwas produced byPlymouth. We can correct theValiantobservation(s) to includePlymouth.mtcars = mtcars_in mtcars = ( mtcars .replace({'make_model': {'Valiant': 'Plymouth Valiant'}}) .assign( make=lambda df: df['make_model'].str.extract(r'(\S+)', expand=True), model=lambda df: df['make_model'].str.extract(r'\s+(.+)', expand=True))) (mtcars .loc[:, ['make_model', 'make', 'model']] .head(10) .pipe(print))make_model make model 0 Mazda RX4 Mazda RX4 1 Mazda RX4 Wag Mazda RX4 Wag 2 Datsun 710 Datsun 710 3 Hornet 4 Drive Hornet 4 Drive 4 Hornet Sportabout Hornet Sportabout 5 Plymouth Valiant Plymouth Valiant 6 Duster 360 Duster 360 7 Merc 240D Merc 240D 8 Merc 230 Merc 230 9 Merc 280 Merc 280Note, raw strings are used in the
str.extract()methods to avoid the double backslash.Some car companies have more than one make. In this data
Chrysler,Plymouth, andDodgewere all made byChrysler. LikewiseCadillacandPontiacare made byGMandLincolnandFordare both made byFord. Create a company variable based on the data in themakevariablemtcars = ( mtcars .assign( company=lambda df: df['make']) .replace({ 'company': { 'Plymouth': 'Chrysler', 'Dodge': 'Chrysler', 'Lincoln': 'Ford', 'Cadillac': 'GM', 'Pontiac': 'GM'}})) (mtcars .loc[:, ['company', 'make', 'model']] .head(15) .pipe(print))company make model 0 Mazda Mazda RX4 1 Mazda Mazda RX4 Wag 2 Datsun Datsun 710 3 Hornet Hornet 4 Drive 4 Hornet Hornet Sportabout 5 Chrysler Plymouth Valiant 6 Duster Duster 360 7 Merc Merc 240D 8 Merc Merc 230 9 Merc Merc 280 10 Merc Merc 280C 11 Merc Merc 450SE 12 Merc Merc 450SL 13 Merc Merc 450SLC 14 GM Cadillac FleetwoodPutting together all the code to create the company, make, and model would provide the following.
mtcars = mtcars_in mtcars = ( mtcars .replace({ 'make_model': {'Valiant': 'Plymouth Valiant'}}) .assign( make=lambda df: df['make_model'].str.extract('(\\S+)', expand=False), model=lambda df: df['make_model'].str.extract('\\s+(.+)', expand=True), company=lambda df: df.make) .replace({ 'company': { 'Plymouth': 'Chrysler', 'Dodge': 'Chrysler', 'Lincoln': 'Ford', 'Cadillac': 'GM', 'Pontiac': 'GM'}})) (mtcars .loc[:, ['company', 'make', 'model']] .head(10) .pipe(print))company make model 0 Mazda Mazda RX4 1 Mazda Mazda RX4 Wag 2 Datsun Datsun 710 3 Hornet Hornet 4 Drive 4 Hornet Hornet Sportabout 5 Chrysler Plymouth Valiant 6 Duster Duster 360 7 Merc Merc 240D 8 Merc Merc 230 9 Merc Merc 280Create a name for use in displaying results that is a character string composed of
make, a space character, if the company name is not the same as the make then the company in parentheses(), andmodel.mtcars = ( mtcars .assign( comp_parn=lambda df: np.where(df['company'] != df['make'], ' (' + df['company'] + ') ', ''), name=lambda df: df['make'] + df['comp_parn'] + df['model'])) (mtcars .loc[:, ['name', 'make', 'comp_parn', 'model']] .head(15) .pipe(print))name make comp_parn model 0 MazdaRX4 Mazda RX4 1 MazdaRX4 Wag Mazda RX4 Wag 2 Datsun710 Datsun 710 3 Hornet4 Drive Hornet 4 Drive 4 HornetSportabout Hornet Sportabout 5 Plymouth (Chrysler) Valiant Plymouth (Chrysler) Valiant 6 Duster360 Duster 360 7 Merc240D Merc 240D 8 Merc230 Merc 230 9 Merc280 Merc 280 10 Merc280C Merc 280C 11 Merc450SE Merc 450SE 12 Merc450SL Merc 450SL 13 Merc450SLC Merc 450SLC 14 Cadillac (GM) Fleetwood Cadillac (GM) Fleetwoodor
mtcars = ( mtcars .assign( comp_parn=lambda df: [' (' + comp + ') ' if comp != make else '' for comp, make in zip(df['company'], df['make'])], name=lambda df: df['make'] + df['comp_parn'] + df['model'])) (mtcars .loc[:, ['name', 'make', 'comp_parn', 'model']] .head(15) .pipe(print))name make comp_parn model 0 MazdaRX4 Mazda RX4 1 MazdaRX4 Wag Mazda RX4 Wag 2 Datsun710 Datsun 710 3 Hornet4 Drive Hornet 4 Drive 4 HornetSportabout Hornet Sportabout 5 Plymouth (Chrysler) Valiant Plymouth (Chrysler) Valiant 6 Duster360 Duster 360 7 Merc240D Merc 240D 8 Merc230 Merc 230 9 Merc280 Merc 280 10 Merc280C Merc 280C 11 Merc450SE Merc 450SE 12 Merc450SL Merc 450SL 13 Merc450SLC Merc 450SLC 14 Cadillac (GM) Fleetwood Cadillac (GM) Fleetwood