Supporting Statistical Analysis for Research
4.7 Coding missing values
These exercises use the PSID.csv data set that was imported in the prior section.
Import the PSID.csv data set.

from pathlib import Path
import pandas as pd
import numpy as np

psid_path = Path('..') / 'datasets' / 'PSID.csv'
psid_in = pd.read_csv(psid_path)
psid_in = (
    psid_in
    .rename(
        columns={
            'Unnamed: 0': 'obs_num',
            'intnum': 'intvw_num',
            'persnum': 'person_id',
            'married': 'marital_status'}))
psid = psid_in.copy(deep=True)
psid = psid.drop(columns='obs_num')

print(psid.dtypes)

intvw_num           int64
person_id           int64
age                 int64
educatn           float64
earnings            int64
hours               int64
kids                int64
marital_status     object
dtype: object
Code NAs for the kids variable.

In the preparatory exercises it was seen that the kids variable contains some very large values, larger than 90. Change these to NA.

psid = (
    psid
    .assign(
        kids=[np.nan if x > 90 else x for x in psid['kids']]))
or

# Name the kids column so that only it is changed, not every column in the matching rows
psid.loc[psid['kids'] > 90, 'kids'] = np.nan
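
The recoding step can also be written with Series.mask, and the effect of the column label in a .loc assignment is easy to check on a small example. The following is a minimal sketch on a hypothetical toy data frame (the values are made up and are not taken from PSID.csv); the names toy, masked, by_loc, and whole_rows are used only for this illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data standing in for PSID.csv; the values are made up.
toy = pd.DataFrame({
    'person_id': [1, 2, 3, 4],
    'age': [36, 39, 45, 50],
    'kids': [2, 98, 0, 99]})

# Series.mask replaces values where the condition is True (with NaN by default).
masked = toy.assign(kids=toy['kids'].mask(toy['kids'] > 90))

# .loc with a row condition and a column label changes only that column.
by_loc = toy.copy(deep=True)
by_loc['kids'] = by_loc['kids'].astype('float64')  # float column can hold NaN
by_loc.loc[by_loc['kids'] > 90, 'kids'] = np.nan

# Without a column label, every column in the matching rows is overwritten.
whole_rows = toy.astype('float64')
whole_rows.loc[whole_rows['kids'] > 90] = np.nan

print(masked)
print(by_loc['kids'].equals(masked['kids']))
print(whole_rows.isna().sum())
```

In this sketch the first two approaches leave person_id and age untouched, while the last one also blanks them out in the affected rows, which is usually not what is wanted when recoding a single variable.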
Display observations that contain missing values in the kids variable.

(psid
    .query('kids != kids')
    .loc[:, ['intvw_num', 'person_id', 'age',
             'educatn', 'kids', 'marital_status']]
    .sort_values(by=['person_id', 'age'])
    .head(n=15)
    .pipe(print))

      intvw_num  person_id   age  educatn  kids  marital_status
4679     8937.0        1.0  36.0     12.0   NaN   never married
4853     9302.0        1.0  37.0      8.0   NaN        divorced
4146     7660.0        1.0  50.0      8.0   NaN        divorced
1831     2704.0        2.0  39.0      0.0   NaN    no histories
2797     5806.0        2.0  45.0      0.0   NaN    no histories
544       878.0        2.0  47.0     12.0   NaN        divorced
4467     8444.0        2.0  48.0      8.0   NaN    no histories
4554     8652.0        2.0  48.0      1.0   NaN         married
2429     5474.0        2.0  49.0      5.0   NaN         married
3963     7269.0        2.0  49.0     11.0   NaN       separated
4711     9004.0        2.0  49.0     99.0   NaN         married
1076     1709.0        2.0  50.0      0.0   NaN    no histories
2371     5413.0        2.0  50.0     17.0   NaN         married
3933     7207.0        2.0  50.0     99.0   NaN         married
4687     8955.0        2.0  50.0     11.0   NaN         married
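
The query string kids != kids selects missing values because NaN never compares equal to itself, so the comparison is True only where kids is missing. As a minimal sketch on a hypothetical toy data frame (made-up values, not from PSID.csv), the same rows can be selected with isna(), which states the intent more directly; the names toy, via_query, and via_isna are assumptions of this example.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data; the values are made up for illustration.
toy = pd.DataFrame({
    'person_id': [1, 2, 3, 4],
    'age': [36, 39, 45, 50],
    'kids': [2.0, np.nan, 0.0, np.nan]})

# NaN compares unequal to itself, so this keeps only rows with missing kids.
via_query = toy.query('kids != kids')

# isna() expresses the same test directly.
via_isna = toy.loc[toy['kids'].isna()]

print(via_isna)
print(via_query.equals(via_isna))

# Number of missing values in each column.
print(toy.isna().sum())
```

Either form can be dropped into a method chain; isna().sum() is a quick way to confirm how many values were recoded.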