4.8 Coding missing values - part 2
These exercises use the PSID.csv
data set
that was imported in the prior section.
Import the
PSID.csv
data set.from pathlib import Path import pandas as pd import numpy as np
psid_path = Path('..') / 'datasets' / 'PSID.csv' psid_in = pd.read_csv(psid_path) psid_in = ( psid_in .rename( columns={ 'Unnamed: 0': 'obs_num', 'intnum': 'intvw_num', 'persnum': 'person_id', 'married': 'marital_status'})) psid = psid_in.copy(deep=True) psid = psid.drop(columns='obs_num') print(psid.dtypes)
intvw_num int64 person_id int64 age int64 educatn float64 earnings int64 hours int64 kids int64 marital_status object dtype: object
Code
NA
for theNA/DF
andno histories
values for the marriage status variable.This first solution uses repeated applications of
mask()
to change the multiple missing indicators tonp.NaN
psid = ( psid .assign( marital_status=(lambda df: df .marital_status .mask(df['marital_status'] == 'NA/DF', np.NaN) .mask(df['marital_status'] == 'no histories', np.NaN)))) (psid .query('marital_status != marital_status') .loc[:, ['intvw_num', 'person_id', 'age', 'educatn', 'kids', 'marital_status']] .sort_values(by=['person_id', 'age']) .head(n=15) .pipe(print))
intvw_num person_id age educatn kids marital_status 1831 2704 2 39 0.0 99 NaN 2797 5806 2 45 0.0 99 NaN 3563 6583 2 46 12.0 0 NaN 4467 8444 2 48 8.0 99 NaN 1076 1709 2 50 0.0 99 NaN 1747 2614 3 37 0.0 99 NaN 1843 2714 3 38 12.0 1 NaN 1665 2508 3 45 17.0 0 NaN 3643 6655 4 30 11.0 2 NaN 2840 5834 4 46 14.0 0 NaN 3405 6408 4 46 12.0 99 NaN 357 633 5 34 99.0 99 NaN 2971 5942 9 31 14.0 2 NaN 749 1184 21 49 0.0 99 NaN 1215 1906 170 41 17.0 99 NaN
Or, this soultion which uses the
.isin()
method to identify the observations that need to be set tonp.NaN
.psid = ( psid .assign( marital_status=(lambda df: df .marital_status .mask(df['marital_status'].isin(['NA/DF']), np.NaN))))
Or, using
replace()
. Thereplace()
method is a good option when the conditions being tested are equalities.psid = psid.replace( {'marital_status': {'NA/DF': np.NaN, 'no histories': np.NaN}})
Note, this last approach can not be chained without the
inplace
parameter being set to false.Change the units on the
earnings
andhours
variables to be thousands of dollars or hours. Use a method that operates on multiple columns.Hint, to do the unit change on a variable
x
, one would dox / 1000
.psid = ( psid .apply( func=lambda x: x / 1000 if x.name in ['earnings', 'hours'] else x)) (psid .loc[:, ['intvw_num', 'person_id', 'age', 'educatn', 'kids', 'marital_status']] .head(n=15) .pipe(print))
intvw_num person_id age educatn kids marital_status 0 4 4 39 12.0 2 married 1 4 6 35 12.0 2 divorced 2 4 7 33 12.0 1 married 3 4 173 39 10.0 2 married 4 5 2 47 9.0 5 married 5 6 4 44 12.0 2 married 6 6 172 38 16.0 3 married 7 7 4 38 9.0 4 divorced 8 7 170 39 12.0 3 married 9 7 171 37 11.0 5 married 10 10 3 48 13.0 98 divorced 11 10 171 47 12.0 3 married 12 10 178 40 12.0 0 separated 13 11 171 38 16.0 0 married 14 13 3 41 12.0 2 married