5.2 Character variables
These exercises use the mtcars.csv
data set.
Import the
mtcars.csv
data set.from pathlib import Path import pandas as pd import numpy as np
mtcars_path = Path('..') / 'datasets' / 'mtcars.csv' mtcars_in = pd.read_csv(mtcars_path) mtcars_in = mtcars_in.rename(columns={'Unnamed: 0': 'make_model'}) mtcars = mtcars_in.copy(deep=True) print(mtcars.dtypes)
make_model object mpg float64 cyl int64 disp float64 hp int64 drat float64 wt float64 qsec float64 vs int64 am int64 gear int64 carb int64 dtype: object
Divide the column that has the car name into columns that contain the make and model of the car.
mtcars = ( mtcars .assign( make=lambda df: df['make_model'].str.extract('(\\S+)', expand=True), model=lambda df: df['make_model'].str.extract('\\s+(.+)', expand=True))) (mtcars .loc[:, ['make_model', 'make', 'model']] .head(10) .pipe(print))
make_model make model 0 Mazda RX4 Mazda RX4 1 Mazda RX4 Wag Mazda RX4 Wag 2 Datsun 710 Datsun 710 3 Hornet 4 Drive Hornet 4 Drive 4 Hornet Sportabout Hornet Sportabout 5 Valiant Valiant NaN 6 Duster 360 Duster 360 7 Merc 240D Merc 240D 8 Merc 230 Merc 230 9 Merc 280 Merc 280
Do all observations have a make and model value? If there are missing values, can you fix them? (Hint, use Google to help you.)
(mtcars .query('make != make | model != model') .head() .pipe(print))
make_model mpg cyl disp hp drat ... vs am gear carb make model 5 Valiant 18.1 6 225.0 105 2.76 ... 1 0 3 1 Valiant NaN [1 rows x 14 columns]
There is a missing
model
name for themake
ofValiant
Googling shows that the
Valiant
was produced byPlymouth
. We can correct theValiant
observation(s) to includePlymouth
.mtcars = mtcars_in mtcars = ( mtcars .replace({'make_model': {'Valiant': 'Plymouth Valiant'}}) .assign( make=lambda df: df['make_model'].str.extract(r'(\S+)', expand=True), model=lambda df: df['make_model'].str.extract(r'\s+(.+)', expand=True))) (mtcars .loc[:, ['make_model', 'make', 'model']] .head(10) .pipe(print))
make_model make model 0 Mazda RX4 Mazda RX4 1 Mazda RX4 Wag Mazda RX4 Wag 2 Datsun 710 Datsun 710 3 Hornet 4 Drive Hornet 4 Drive 4 Hornet Sportabout Hornet Sportabout 5 Plymouth Valiant Plymouth Valiant 6 Duster 360 Duster 360 7 Merc 240D Merc 240D 8 Merc 230 Merc 230 9 Merc 280 Merc 280
Note, raw strings are used in the
str.extract()
methods to avoid the double backslash.Some car companies have more than one make. In this data
Chrysler
,Plymouth
, andDodge
were all made byChrysler
. LikewiseCadillac
andPontiac
are made byGM
andLincoln
andFord
are both made byFord
. Create a company variable based on the data in themake
variablemtcars = ( mtcars .assign( company=lambda df: df['make']) .replace({ 'company': { 'Plymouth': 'Chrysler', 'Dodge': 'Chrysler', 'Lincoln': 'Ford', 'Cadillac': 'GM', 'Pontiac': 'GM'}})) (mtcars .loc[:, ['company', 'make', 'model']] .head(15) .pipe(print))
company make model 0 Mazda Mazda RX4 1 Mazda Mazda RX4 Wag 2 Datsun Datsun 710 3 Hornet Hornet 4 Drive 4 Hornet Hornet Sportabout 5 Chrysler Plymouth Valiant 6 Duster Duster 360 7 Merc Merc 240D 8 Merc Merc 230 9 Merc Merc 280 10 Merc Merc 280C 11 Merc Merc 450SE 12 Merc Merc 450SL 13 Merc Merc 450SLC 14 GM Cadillac Fleetwood
Putting together all the code to create the company, make, and model would provide the following.
mtcars = mtcars_in mtcars = ( mtcars .replace({ 'make_model': {'Valiant': 'Plymouth Valiant'}}) .assign( make=lambda df: df['make_model'].str.extract('(\\S+)', expand=False), model=lambda df: df['make_model'].str.extract('\\s+(.+)', expand=True), company=lambda df: df.make) .replace({ 'company': { 'Plymouth': 'Chrysler', 'Dodge': 'Chrysler', 'Lincoln': 'Ford', 'Cadillac': 'GM', 'Pontiac': 'GM'}})) (mtcars .loc[:, ['company', 'make', 'model']] .head(10) .pipe(print))
company make model 0 Mazda Mazda RX4 1 Mazda Mazda RX4 Wag 2 Datsun Datsun 710 3 Hornet Hornet 4 Drive 4 Hornet Hornet Sportabout 5 Chrysler Plymouth Valiant 6 Duster Duster 360 7 Merc Merc 240D 8 Merc Merc 230 9 Merc Merc 280
Create a name for use in displaying results that is a character string composed of
make
, a space character, if the company name is not the same as the make then the company in parentheses()
, andmodel
.mtcars = ( mtcars .assign( comp_parn=lambda df: np.where(df['company'] != df['make'], ' (' + df['company'] + ') ', ''), name=lambda df: df['make'] + df['comp_parn'] + df['model'])) (mtcars .loc[:, ['name', 'make', 'comp_parn', 'model']] .head(15) .pipe(print))
name make comp_parn model 0 MazdaRX4 Mazda RX4 1 MazdaRX4 Wag Mazda RX4 Wag 2 Datsun710 Datsun 710 3 Hornet4 Drive Hornet 4 Drive 4 HornetSportabout Hornet Sportabout 5 Plymouth (Chrysler) Valiant Plymouth (Chrysler) Valiant 6 Duster360 Duster 360 7 Merc240D Merc 240D 8 Merc230 Merc 230 9 Merc280 Merc 280 10 Merc280C Merc 280C 11 Merc450SE Merc 450SE 12 Merc450SL Merc 450SL 13 Merc450SLC Merc 450SLC 14 Cadillac (GM) Fleetwood Cadillac (GM) Fleetwood
or
mtcars = ( mtcars .assign( comp_parn=lambda df: [' (' + comp + ') ' if comp != make else '' for comp, make in zip(df['company'], df['make'])], name=lambda df: df['make'] + df['comp_parn'] + df['model'])) (mtcars .loc[:, ['name', 'make', 'comp_parn', 'model']] .head(15) .pipe(print))
name make comp_parn model 0 MazdaRX4 Mazda RX4 1 MazdaRX4 Wag Mazda RX4 Wag 2 Datsun710 Datsun 710 3 Hornet4 Drive Hornet 4 Drive 4 HornetSportabout Hornet Sportabout 5 Plymouth (Chrysler) Valiant Plymouth (Chrysler) Valiant 6 Duster360 Duster 360 7 Merc240D Merc 240D 8 Merc230 Merc 230 9 Merc280 Merc 280 10 Merc280C Merc 280C 11 Merc450SE Merc 450SE 12 Merc450SL Merc 450SL 13 Merc450SLC Merc 450SLC 14 Cadillac (GM) Fleetwood Cadillac (GM) Fleetwood