4.9 Duplicate observations

SSCC - Social Science Computing Cooperative

These exercises use the PSID.csv data set that was imported in the prior section.

Import the PSID.csv data set.

from pathlib import Path
import pandas as pd
import numpy as np

psid_path = Path('..') / 'datasets' / 'PSID.csv'
psid_in = pd.read_csv(psid_path)
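# Give the columns descriptive names.
psid_in = (
    psid_in
        .rename(columns={
            'Unnamed: 0': 'obs_num',
            'intnum': 'intvw_num',
            'persnum': 'person_id',
            'married': 'marital_status'}))
# Work on a deep copy so the imported data stays unchanged,
# then drop the row counter, which is not a PSID variable.
psid = psid_in.copy(deep=True)
psid = psid.drop(columns='obs_num')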

print(psid.dtypes)

intvw_num           int64
person_id           int64
age                 int64
educatn           float64
earnings            int64
hours               int64
kids                int64
marital_status     object
dtype: object

What variables define an observation in this data set?

Three variables together define an observation: the interviewer number (intvw_num), the number identifying a person (person_id), and the age of that person (age).

Are there any duplicate observations?

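# Flag every row whose identifying variables also appear in another row.
# keep=False marks all members of a duplicated group, not only the repeats.
dup_person = (
    psid.duplicated(subset=['intvw_num', 'person_id', 'age'], keep=False))
# Keep only the flagged rows and the identifying columns.
dup_person_age = (
    psid.loc[dup_person.values, ['intvw_num', 'person_id', 'age']])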

(dup_person_age
    .head(n=15)
    .pipe(print))

Empty DataFrame
Columns: [intvw_num, person_id, age]
Index: []

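There are no duplicate occurrences. The result is an empty DataFrame, so every combination of intvw_num, person_id, and age appears exactly once.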
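
Because the PSID check returns nothing, it does not show what a positive result looks like. The sketch below runs the same duplicated() call on a small made-up data frame. The name toy and all of its values are hypothetical and are not taken from PSID.csv. It also contrasts keep=False with the default keep='first'.

```python
import pandas as pd

# Hypothetical toy data for illustration only; not taken from PSID.csv.
toy = pd.DataFrame({
    'intvw_num': [4, 4, 4, 6, 6],
    'person_id': [1, 1, 1, 2, 2],
    'age':       [30, 31, 31, 45, 46],
    'earnings':  [100, 120, 125, 300, 310],
})
key_cols = ['intvw_num', 'person_id', 'age']

# keep=False flags every row that belongs to a duplicated group.
dup_all = toy.duplicated(subset=key_cols, keep=False)

# The default, keep='first', leaves the first row of each group unflagged.
dup_repeats = toy.duplicated(subset=key_cols)

# Show the rows that share their identifying values with another row.
print(toy.loc[dup_all, key_cols])

# Number of rows involved in duplication, and number of surplus rows.
print(dup_all.sum(), dup_repeats.sum())
```

Here the two rows with intvw_num 4, person_id 1, and age 31 are both flagged by keep=False, while the default flags only the second of them. Using keep=False is convenient when inspecting duplicates, since it shows every conflicting row side by side.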