Supporting Statistical Analysis for Research
4.7 Coding missing values
These exercises use the PSID.csv data set
that was imported in the prior section.
Import the PSID.csv data set.

    from pathlib import Path
    import pandas as pd
    import numpy as np

    # Read the raw file and give the columns descriptive names.
    psid_path = Path('..') / 'datasets' / 'PSID.csv'
    psid_in = pd.read_csv(psid_path)
    psid_in = (
        psid_in
        .rename(
            columns={
                'Unnamed: 0': 'obs_num',
                'intnum': 'intvw_num',
                'persnum': 'person_id',
                'married': 'marital_status'}))

    # Work on a copy so the imported data stays unchanged.
    psid = psid_in.copy(deep=True)
    psid = psid.drop(columns='obs_num')

    print(psid.dtypes)

    intvw_num           int64
    person_id           int64
    age                 int64
    educatn           float64
    earnings            int64
    hours               int64
    kids                int64
    marital_status     object
    dtype: object

Code NAs for the kids variable.

In the preparatory exercises it was seen that the kids variable contains some very large values, larger than 90. Change these to NA.

    # Replace any value above 90 with NaN; keep all other values.
    psid = (
        psid
        .assign(
            kids=[np.nan if x > 90 else x for x in psid['kids']]))

or
    # Name the column in .loc; without it, every column in the
    # matching rows would be set to NaN, not only kids.
    psid.loc[psid['kids'] > 90, 'kids'] = np.nan

Display observations that contain missing values in the kids variable.

    # NaN is never equal to itself, so 'kids != kids' keeps only the missing rows.
    (psid
        .query('kids != kids')
        .loc[:, ['intvw_num', 'person_id', 'age', 'educatn', 'kids', 'marital_status']]
        .sort_values(by=['person_id', 'age'])
        .head(n=15)
        .pipe(print))

          intvw_num  person_id   age  educatn  kids  marital_status
    4679     8937.0        1.0  36.0     12.0   NaN   never married
    4853     9302.0        1.0  37.0      8.0   NaN        divorced
    4146     7660.0        1.0  50.0      8.0   NaN        divorced
    1831     2704.0        2.0  39.0      0.0   NaN    no histories
    2797     5806.0        2.0  45.0      0.0   NaN    no histories
    544       878.0        2.0  47.0     12.0   NaN        divorced
    4467     8444.0        2.0  48.0      8.0   NaN    no histories
    4554     8652.0        2.0  48.0      1.0   NaN         married
    2429     5474.0        2.0  49.0      5.0   NaN         married
    3963     7269.0        2.0  49.0     11.0   NaN       separated
    4711     9004.0        2.0  49.0     99.0   NaN         married
    1076     1709.0        2.0  50.0      0.0   NaN    no histories
    2371     5413.0        2.0  50.0     17.0   NaN         married
    3933     7207.0        2.0  50.0     99.0   NaN         married
    4687     8955.0        2.0  50.0     11.0   NaN         married
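
The recode-and-inspect steps in this exercise can also be tried on a small made-up data frame. The sketch below is an illustration, not part of the exercise solution: the `toy` data frame and its values are hypothetical and do not come from PSID.csv. It assumes that `Series.mask()` is an acceptable alternative to the list comprehension, and it uses `isna()` as a more explicit alternative to the `kids != kids` query.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data; 98 and 99 stand in for missing-value codes.
toy = pd.DataFrame({
    'person_id': [1, 2, 3, 4],
    'age': [36, 45, 50, 41],
    'kids': [2, 98, 0, 99]})

# Series.mask() replaces values with NaN wherever the condition is True.
# Only the kids column is changed; the other columns are left as they are.
toy = toy.assign(kids=toy['kids'].mask(toy['kids'] > 90))

# NaN never compares equal to itself, so this query keeps the missing rows.
via_query = toy.query('kids != kids')

# isna() selects the same rows and states the intent directly.
via_isna = toy.loc[toy['kids'].isna()]

print(via_isna)
print(toy['kids'].isna().sum())
```

Both selections should return the same rows. Counting the missing values with `isna().sum()` is a quick check that the recode touched only the intended observations.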