Supporting Statistical Analysis for Research
4.9 Duplicate observations
These exercises use the PSID.csv data set that was imported in the prior section.
Import the PSID.csv data set.

    from pathlib import Path
    import pandas as pd
    import numpy as np

    psid_path = Path('..') / 'datasets' / 'PSID.csv'
    psid_in = pd.read_csv(psid_path)
    psid_in = (
        psid_in
        .rename(
            columns={
                'Unnamed: 0': 'obs_num',
                'intnum': 'intvw_num',
                'persnum': 'person_id',
                'married': 'marital_status'}))
    psid = psid_in.copy(deep=True)
    psid = psid.drop(columns='obs_num')

    print(psid.dtypes)

    intvw_num           int64
    person_id           int64
    age                 int64
    educatn           float64
    earnings            int64
    hours               int64
    kids                int64
    marital_status     object
    dtype: object
What variables define an observation in this data set?
An observation is defined by three variables: the interview number (intvw_num), the number identifying a person (person_id), and the age of that person (age).
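One way to test whether a candidate set of variables defines an observation is to count the rows that share each combination of values. The sketch below does this on a small made-up data frame; the `toy` data and the `max_rows_per_key` helper are hypothetical and are not part of the PSID exercise.

```python
import pandas as pd

# Hypothetical toy data that mimics the structure of the PSID columns.
toy = pd.DataFrame({
    'intvw_num': [4, 4, 4, 7],
    'person_id': [1, 1, 2, 1],
    'age': [39, 40, 39, 52],
    'earnings': [100, 120, 80, 60],
})


def max_rows_per_key(df, cols):
    """Return the largest number of rows that share one combination of cols."""
    return int(df.groupby(cols).size().max())


# Two columns are not enough here: person 1 of interview 4 appears twice.
two_col_max = max_rows_per_key(toy, ['intvw_num', 'person_id'])

# Adding age separates those two rows, so every combination occurs once.
three_col_max = max_rows_per_key(toy, ['intvw_num', 'person_id', 'age'])

print(two_col_max, three_col_max)
```

A maximum of 1 means the columns identify each row uniquely. Note that `groupby` drops rows with missing key values by default, so this check assumes the key columns have no missing values.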
Are there any duplicate observations?
    dup_person = (
        psid.duplicated(subset=['intvw_num', 'person_id', 'age'], keep=False))
    dup_person_age = (
        psid.loc[dup_person.values, ['intvw_num', 'person_id', 'age']])

    (dup_person_age
        .head(n=15)
        .pipe(print))

    Empty DataFrame
    Columns: [intvw_num, person_id, age]
    Index: []
There are no duplicate occurrences.
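Because the check on the PSID data returns an empty data frame, it can help to see what the same call returns when duplicates are present. The following is a minimal sketch on a made-up data frame (the `toy` values are hypothetical) that compares `keep=False` with the default `keep='first'`.

```python
import pandas as pd

# Hypothetical toy data: the first two rows share all three key values.
toy = pd.DataFrame({
    'intvw_num': [4, 4, 4, 7],
    'person_id': [1, 1, 2, 1],
    'age': [39, 39, 41, 52],
})
key_cols = ['intvw_num', 'person_id', 'age']

# keep=False flags every row that belongs to a duplicated group.
flag_all = toy.duplicated(subset=key_cols, keep=False)

# The default, keep='first', leaves the first row of each group unflagged.
flag_later = toy.duplicated(subset=key_cols)

n_all = int(flag_all.sum())
n_later = int(flag_later.sum())

# Show the complete duplicated group, as in the exercise above.
print(toy.loc[flag_all, key_cols])
```

Using `keep=False` is convenient for inspection because all members of a duplicated group are displayed together, while the default is better suited to deciding which rows to drop.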