4.9 Duplicate observations
4.9.1 Data skills
Duplicate observations occur when two or more rows have the same, or nearly the same, values. Duplicate observations may be legitimate and cause no problem for further analysis. For example, the data set may come from a repeated-measures experiment in which a subject had the same measure taken more than once. Other times duplicates are an error in the data and could influence later analyses. For this reason, it is good practice to examine a data set for duplicate, and near-duplicate, observations.
Near-duplicate rows are the most difficult to address. One method of identifying near-duplicate observations is to search for duplicates on a subset of the columns. This identifies rows that match on those columns even when the remaining columns differ.
One of two actions is typically taken when there are duplicates. The first is to drop all but one of the observations. The other is to leave the observations in as repeated measures. Determining which of these actions to apply is beyond the scope of this book.
The search for near-duplicate observations often uncovers inconsistencies in the data. These inconsistencies need to be corrected when the observations are not being removed. The tools to correct these nearly identical observations are covered in the remaining chapters.
4.9.2 Examples - R
These examples use the airAccs.csv data set.
We begin by using the same code as in the prior sections of this chapter to load the packages, import the csv file, and rename the variables.
library(tidyverse)
library(lubridate)
airAccs_path <- file.path("..", "datasets", "airAccs.csv")
air_accidents_in <- read_csv(airAccs_path, col_types = cols())
Warning: Missing column names filled in: 'X1' [1]
air_accidents_in <- rename(
  air_accidents_in,
  obs_num = 1,
  date = Date,
  plane_type = planeType,
  dead = Dead,
  aboard = Aboard,
  ground = Ground
  )
air_accidents <- air_accidents_in
glimpse(air_accidents)
Observations: 5,666
Variables: 8
$ obs_num    <dbl> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, ...
$ date       <date> 1908-09-17, 1912-07-12, 1913-08-06, 1913-09-09, 19...
$ location   <chr> "Fort Myer, Virginia", "Atlantic City, New Jersey",...
$ operator   <chr> "Military - U.S. Army", "Military - U.S. Navy", "Pr...
$ plane_type <chr> "Wright Flyer III", "Dirigible", "Curtiss seaplane"...
$ dead       <dbl> 1, 5, 1, 14, 30, 21, 19, 20, 22, 19, 27, 20, 20, 23...
$ aboard     <dbl> 2, 5, 1, 20, 30, 41, 19, 20, 22, 19, 28, 20, 20, 23...
$ ground     <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...
Find duplicates.
The tidyverse function distinct() will remove duplicates. This is typically not done until some investigation of the duplicates has been carried out, and there currently is no method within the tidyverse to identify duplicates. We will instead use the duplicated() function from base R. This function identifies all rows that are a repeat of a prior row; as such, the first row of each set of duplicates is not included. To see all rows that are duplicates we use duplicated() twice, the second time with the order reversed, which finds what would otherwise have been the first duplicate. The results of the two searches are combined with the logical or operator |. The final set of duplicates thus identifies every row that is a duplicate of another row.

Only the date column is used for this search of duplicate rows. This will allow us to look for near duplicates on any date on which more than one air accident occurred.

dup_dates <- duplicated(pull(air_accidents, date)) |
  duplicated(pull(air_accidents, date), fromLast = TRUE)
dup_dates_accidents <- air_accidents[dup_dates, c('date', 'location', 'operator')]
head(dup_dates_accidents, n = 10)
# A tibble: 10 x 3
   date       location                           operator
   <date>     <chr>                              <chr>
 1 1920-08-16 College Park, Maryland             US Aerial Mail Service
 2 1920-08-16 Bedford, England                   By Air
 3 1927-04-22 Floh, Germany                      Deutsche Lufthansa
 4 1927-04-22 Goshen, Indiana                    US Aerial Mail Service
 5 1927-09-17 Hadley, New Jersey                 Reynolds Airways
 6 1927-09-17 Near Dunellen, New Jersey          Reynolds Airways
 7 1927-09-23 Schleiz, Germany                   Deutsche Lufthansa
 8 1927-09-23 Near schleiz, Thuringia,, Germany  Deutsche Lufthansa
 9 1927-11-16 Strasburg, France                  CIDNA
10 1927-11-16 Over the Gulf of Finland           Aero O-Y
Note, only the first 10 rows and three of the variables are displayed.
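Before examining the near duplicates, it can also be worth confirming whether any rows are exact copies of another row. A minimal sketch, with obs_num excluded since it is unique by construction:

# Count rows that are exact copies of an earlier row across all columns
# other than obs_num. A count of zero means any duplicates in the data
# are at most partial matches.
sum(duplicated(select(air_accidents, -obs_num)))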
The first duplicates are on August 16th, 1920. These accidents occurred in different countries, so they do not appear to be duplicates, and no action is needed on these two observations. The same is true of the next pair. The third pair occurred on September 17th, 1927. Both accidents occurred in New Jersey and have the same operator. A check of a map on Google shows that the two New Jersey towns are right next to each other. This is likely a duplicate, although further investigation would be needed to know for sure. The rest of the duplicates would be examined similarly, and a decision to retain or delete would be made for each duplicated row.
Note, duplicates should not be removed from the original .csv file. Instead, code should be written to identify the duplicated rows and to remove the identified duplicates.
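As an illustration of doing this in code, the sketch below shows both approaches: distinct() for exact duplicates, and dropping a specific row for a confirmed near duplicate. It assumes, hypothetically, that further investigation confirmed the September 17th, 1927 pair to be a single accident; the variable names are illustrative only.

# Remove exact duplicates (keeps the first occurrence of each fully
# identical row; none here, since obs_num is unique).
air_accidents_unique <- distinct(air_accidents)

# For a confirmed near duplicate, drop the specific row instead. We assume,
# for illustration only, that the second 1927-09-17 Reynolds Airways row
# repeats the first.
air_accidents_deduped <- filter(
  air_accidents,
  !(date == as.Date("1927-09-17") &
      location == "Near Dunellen, New Jersey")
  )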
4.9.3 Examples - Python
These examples use the airAccs.csv data set.
We begin by using the same code as in the prior sections of this chapter to load the packages, import the csv file, and rename the variables.
from pathlib import Path
import pandas as pd
airAccs_path = Path('..') / 'datasets' / 'airAccs.csv'
air_accidents_in = pd.read_csv(airAccs_path)
air_accidents_in = (
    air_accidents_in
    .rename(
        columns={
            air_accidents_in.columns[0]: 'obs_num',
            'Date': 'date',
            'planeType': 'plane_type',
            'Dead': 'dead',
            'Aboard': 'aboard',
            'Ground': 'ground'}))
air_accidents = air_accidents_in.copy(deep=True)
print(air_accidents.dtypes)
obs_num         int64
date           object
location       object
operator       object
plane_type     object
dead          float64
aboard        float64
ground        float64
dtype: object
Find duplicates.
The pandas duplicated() method will be used to identify the duplicate observations. The subset parameter is used to search on only the date column. This will allow us to look for near duplicates on any date on which more than one air accident occurred. The keep parameter set to False includes all of the duplicate rows that were found. This parameter can also accept 'first' to identify all but the first occurrence of each duplicate, which is useful when you want to remove all but one of the duplicated rows.

dup_dates = air_accidents.duplicated(subset=['date'], keep=False)
dup_dates_accidents = (
    air_accidents
    .loc[dup_dates.values, ['date', 'location', 'operator']])
dup_dates_accidents['operator'] = dup_dates_accidents['operator'].str[:15]
dup_dates_accidents['location'] = dup_dates_accidents['location'].str[:25]
print(dup_dates_accidents.head(10))
           date                   location         operator
35   1920-08-16     College Park, Maryland   US Aerial Mail
36   1920-08-16           Bedford, England           By Air
119  1927-04-22              Floh, Germany  Deutsche Luftha
120  1927-04-22            Goshen, Indiana   US Aerial Mail
125  1927-09-17         Hadley, New Jersey  Reynolds Airway
126  1927-09-17  Near Dunellen, New Jersey  Reynolds Airway
127  1927-09-23           Schleiz, Germany  Deutsche Luftha
128  1927-09-23  Near schleiz, Thuringia,,  Deutsche Luftha
131  1927-11-16          Strasburg, France            CIDNA
132  1927-11-16   Over the Gulf of Finland         Aero O-Y
Note, only the first 10 rows and three of the variables are displayed.
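As in the R examples, it can also be worth confirming whether any rows are exact copies of another row. A minimal sketch, with obs_num excluded since it is unique by construction:

# Count rows that are exact copies of an earlier row across all columns
# other than obs_num. A count of zero means any duplicates in the data
# are at most partial matches.
print(air_accidents.drop(columns=['obs_num']).duplicated().sum())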
The first duplicates are on August 16th, 1920. These accidents occurred in different countries, so they do not appear to be duplicates, and no action is needed on these two observations. The same is true of the next pair. The third pair occurred on September 17th, 1927. Both accidents occurred in New Jersey and have the same operator. A check of a map on Google shows that the two New Jersey towns are right next to each other. This is likely a duplicate, although further investigation would be needed to know for sure. The rest of the duplicates would be examined similarly, and a decision to retain or delete would be made for each duplicated row.
Note, the duplicates should not be removed from the original .csv file. Instead, code should be written to identify the duplicated rows and to remove the identified duplicates.
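As an illustration of doing this in code, the sketch below shows both approaches: drop_duplicates() for exact duplicates, and dropping a specific row for a confirmed near duplicate. It assumes, hypothetically, that further investigation confirmed the September 17th, 1927 pair to be a single accident; the variable names are illustrative only.

# Remove exact duplicates (keep='first' retains the first occurrence of
# each fully identical row; none here, since obs_num is unique).
air_accidents_unique = air_accidents.drop_duplicates(keep='first')

# For a confirmed near duplicate, drop the specific row by its index. We
# assume, for illustration only, that row 126 repeats row 125.
air_accidents_deduped = air_accidents.drop(index=126)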
4.9.4 Exercises
These exercises use the PSID.csv data set that was imported in the prior section.
1. Import the PSID.csv data set.
2. What variables define an observation in this data set?
3. Are there any duplicate observations?