SSCC - Social Science Computing Cooperative Supporting Statistical Analysis for Research

4.7 Coding missing values

These exercises use the PSID.csv data set that was imported in the prior section.

  1. Import the PSID.csv data set.

    from pathlib import Path
    import pandas as pd
    import numpy as np
    psid_path = Path('..') / 'datasets' / 'PSID.csv'
    psid_in = pd.read_csv(psid_path)
    psid_in = (
        psid_in
            .rename(columns={
                'Unnamed: 0': 'obs_num',
                'intnum': 'intvw_num',
                'persnum': 'person_id',
                'married': 'marital_status'}))
    psid = psid_in.copy(deep=True)
    psid = psid.drop(columns='obs_num')
    
    print(psid.dtypes)
    intvw_num           int64
    person_id           int64
    age                 int64
    educatn           float64
    earnings            int64
    hours               int64
    kids                int64
    marital_status     object
    dtype: object
  2. Code NAs for the kids variable.

    The preparatory exercises showed that the kids variable contains some very large values, larger than 90. Change these to NA.

    psid = (
        psid
            .assign(
               kids=[np.nan if x > 90 else x
                  for x in psid['kids']]))

    or

    # name the column so only kids is set to NaN, not every column of the matching rows
    psid.loc[psid['kids'] > 90, 'kids'] = np.nan
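
The recoding in this step can also be written with `Series.mask`, which returns a new column rather than assigning NaN into an existing integer column. The following is a minimal sketch on toy data, not the PSID file; the `toy` and `recoded` names and the values 98 and 99 are assumptions made for illustration.

```python
import pandas as pd

# Toy data standing in for PSID; 98 and 99 play the role of missing-value codes.
toy = pd.DataFrame({
    'person_id': [1, 2, 3, 4],
    'kids': [0, 2, 98, 99],
})

# Series.mask replaces values where the condition is True (with NaN by default).
# Only the kids column is changed; person_id is left as it was.
recoded = toy.assign(kids=toy['kids'].mask(toy['kids'] > 90))

print(recoded)
```

Because `assign` returns a new data frame, `toy` itself is unchanged, which fits the method-chaining style used in these exercises.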
  3. Display observations that contain missing values in the kids variable.

    (psid
        .query('kids != kids')
        .loc[:, ['intvw_num', 'person_id', 'age', 'educatn', 'kids', 'marital_status']]
        .sort_values(by=['person_id', 'age'])
        .head(n=15)
        .pipe(print))
          intvw_num  person_id   age  educatn  kids marital_status
    4679     8937.0        1.0  36.0     12.0   NaN  never married
    4853     9302.0        1.0  37.0      8.0   NaN       divorced
    4146     7660.0        1.0  50.0      8.0   NaN       divorced
    1831     2704.0        2.0  39.0      0.0   NaN   no histories
    2797     5806.0        2.0  45.0      0.0   NaN   no histories
    544       878.0        2.0  47.0     12.0   NaN       divorced
    4467     8444.0        2.0  48.0      8.0   NaN   no histories
    4554     8652.0        2.0  48.0      1.0   NaN        married
    2429     5474.0        2.0  49.0      5.0   NaN        married
    3963     7269.0        2.0  49.0     11.0   NaN      separated
    4711     9004.0        2.0  49.0     99.0   NaN        married
    1076     1709.0        2.0  50.0      0.0   NaN   no histories
    2371     5413.0        2.0  50.0     17.0   NaN        married
    3933     7207.0        2.0  50.0     99.0   NaN        married
    4687     8955.0        2.0  50.0     11.0   NaN        married
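
The `query('kids != kids')` expression above relies on NaN being the only value that is not equal to itself. A more explicit way to select the same rows is `isna()`. The sketch below compares the two on toy data; the `toy`, `via_query`, and `via_isna` names are assumptions made for illustration and the values are not taken from PSID.

```python
import numpy as np
import pandas as pd

# Toy data with two missing values in kids.
toy = pd.DataFrame({
    'person_id': [1, 2, 3],
    'kids': [1.0, np.nan, np.nan],
})

# NaN != NaN evaluates to True, so this keeps only the rows where kids is missing.
via_query = toy.query('kids != kids')

# The same selection with an explicit missing-value test.
via_isna = toy[toy['kids'].isna()]

# Check that both approaches select identical rows.
print(via_query.equals(via_isna))
```

Either form can be used in a method chain; `isna()` tends to be easier to read for someone who does not know the NaN comparison rule.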