SSCC - Social Science Computing Cooperative Supporting Statistical Analysis for Research

4.9 Duplicate observations

These exercises use the PSID.csv data set that was imported in the prior section.

  1. Import the PSID.csv data set.

    from pathlib import Path
    import pandas as pd
    import numpy as np
    psid_path = Path('..') / 'datasets' / 'PSID.csv'
    psid_in = pd.read_csv(psid_path)
    psid_in = (
        psid_in
            .rename( columns={
                'Unnamed: 0': 'obs_num',
                'intnum': 'intvw_num', 
                'persnum': 'person_id',
                'married': 'marital_status'}))
    psid = psid_in.copy(deep=True)
    psid = psid.drop(columns='obs_num')
    
    print(psid.dtypes)
    intvw_num           int64
    person_id           int64
    age                 int64
    educatn           float64
    earnings            int64
    hours               int64
    kids                int64
    marital_status     object
    dtype: object

  2. What variables define an observation in this data set?

    Three variables together define an observation: the interview number (intvw_num), the number identifying a person (person_id), and that person's age (age).

  3. Are there any duplicate observations?

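    # Flag every row whose (intvw_num, person_id, age) key appears more than once;
    # keep=False marks all members of a duplicated group, not just the repeats.
    dup_person = (
        psid.duplicated(subset=['intvw_num', 'person_id', 'age'], keep=False))
    # Keep only the key columns of the flagged rows.
    dup_person_age = (
        psid.loc[dup_person.values, ['intvw_num', 'person_id', 'age']])
    
    # Show up to 15 of the flagged rows.
    (dup_person_age
        .head(n=15)
        .pipe(print))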
    Empty DataFrame
    Columns: [intvw_num, person_id, age]
    Index: []

    There are no duplicate observations.
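
    Because the PSID check returns an empty data frame, it does not show what a positive result looks like. The sketch below runs the same kind of check on a small made-up data frame (the name toy and all of its values are hypothetical, not taken from PSID.csv) that contains one repeated key. It also contrasts keep=False with the default keep='first'.

```python
import pandas as pd

# Hypothetical toy data, NOT taken from PSID.csv; rows 0 and 1 share a key.
toy = pd.DataFrame({
    'intvw_num': [4, 4, 4, 5],
    'person_id': [1, 1, 2, 1],
    'age': [39, 39, 41, 39],
    'earnings': [100, 250, 0, 900],
})
key_cols = ['intvw_num', 'person_id', 'age']

# keep=False flags every row that belongs to a duplicated key group.
dup_all = toy.duplicated(subset=key_cols, keep=False)

# The default, keep='first', flags only the second and later occurrences.
dup_repeats = toy.duplicated(subset=key_cols)

n_dup_rows = int(dup_all.sum())      # rows involved in any duplication
n_repeats = int(dup_repeats.sum())   # rows beyond the first of each key
key_is_unique = not dup_all.any()    # True only if the key identifies rows

# Print the key columns of the flagged rows for inspection.
print(toy.loc[dup_all, key_cols])
print(n_dup_rows, n_repeats, key_is_unique)
```

    Using keep=False is convenient when inspecting duplicates, since all conflicting rows are shown side by side. The default is more useful when counting how many rows would be dropped.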