SSCC - Social Science Computing Cooperative Supporting Statistical Analysis for Research

4.5 Dropping unneeded observations

These exercises use the PSID.csv data set that was imported in the prior section.

  1. Import the PSID.csv data set.

    from pathlib import Path
    import pandas as pd
    psid_path = Path('..') / 'datasets' / 'PSID.csv'
    psid_in = pd.read_csv(psid_path)
    psid_in = (
        psid_in
            .rename(columns={
                'Unnamed: 0': 'obs_num',
                'intnum': 'intvw_num',
                'persnum': 'person_id',
                'married': 'marital_status'}))
    psid = psid_in.copy(deep=True)
    psid = psid.drop(columns='obs_num')
    
    print(psid.dtypes)
    intvw_num           int64
    person_id           int64
    age                 int64
    educatn           float64
    earnings            int64
    hours               int64
    kids                int64
    marital_status     object
    dtype: object
  2. Display some of the observations where there are more than 90 kids in the household. Choose several of the pertinent variables to display.

    (psid
        .query(' kids > 90')
        .loc[:, ['person_id', 'age', 'educatn', 'kids', 'marital_status']]
        .head(n=15)
        .pipe(print))
          person_id  age  educatn  kids marital_status
    10            3   48     13.0    98       divorced
    150         186   41     12.0    98        married
    323         178   49     12.0    98        married
    357           5   34     99.0    99   no histories
    447           3   34     12.0    98       divorced
    544           2   47     12.0    98       divorced
    590         182   49     12.0    99   no histories
    739           3   48      3.0    98  never married
    749          21   49      0.0    99   no histories
    857         177   40      0.0    98        married
    1027          3   45     12.0    98        married
    1076          2   50      0.0    99   no histories
    1167        171   49      0.0    98       divorced
    1174        173   40      9.0    98       divorced
    1187        175   37      0.0    98       divorced
  3. Create a copy of the data frame that removes the observations where marital_status (married in the original data) is no histories or NA/DF. You may have combined these categories into a missing category in the preparatory exercises.

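    # Keep only the rows whose marital status is neither "no histories" nor "NA/DF".
    psid_copy = (
        psid.query(
            'marital_status != "no histories" & marital_status != "NA/DF"'))
    
    # Show the NA/DF rows of psid; these are among the rows dropped from psid_copy.
    (psid
        .query('marital_status == "NA/DF"')
        .loc[:, ['person_id', 'age', 'educatn', 'kids', 'marital_status']]
        .head(n=15)
        .pipe(print))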
          person_id  age  educatn  kids marital_status
    1665          3   45     17.0     0          NA/DF
    1843          3   38     12.0     1          NA/DF
    2240        174   36     17.0     0          NA/DF
    2244        177   32     14.0     1          NA/DF
    2840          4   46     14.0     0          NA/DF
    2971          9   31     14.0     2          NA/DF
    3563          2   46     12.0     0          NA/DF
    3643          4   30     11.0     2          NA/DF
    3818        174   41     99.0     0          NA/DF
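
Dropping rows by category can be written in more than one way. The sketch below is a minimal illustration, not the course's own solution: it uses a small made-up data frame (`toy`, with invented values) in place of PSID.csv and compares a `query()` filter with an equivalent boolean mask built from `isin()`. The names `toy`, `unwanted`, `kept_query`, and `kept_mask` are hypothetical.

```python
import pandas as pd

# Toy stand-in for the PSID data; these values are made up for illustration.
toy = pd.DataFrame({
    'person_id': [1, 2, 3, 4, 5],
    'kids': [0, 2, 98, 1, 99],
    'marital_status': [
        'married', 'NA/DF', 'divorced', 'never married', 'no histories'],
})

# Categories treated as unneeded in this sketch.
unwanted = ['no histories', 'NA/DF']

# Option 1: query() with "not in" keeps rows whose status is not in the list.
kept_query = toy.query('marital_status not in ["no histories", "NA/DF"]')

# Option 2: the same filter written as a boolean mask with isin().
kept_mask = toy.loc[~toy['marital_status'].isin(unwanted)]

# Both approaches should keep the same rows.
print(kept_query.equals(kept_mask))
print(kept_query['marital_status'].unique())
print(len(toy) - len(kept_query), 'rows dropped')
```

Comparing the row count before and after the filter is a quick check that the condition removed rows rather than kept them, and that the category labels in the condition match the spelling used in the data.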