
4.9 Duplicate observations

4.9.1 Data skills

Duplicate observations occur when two or more rows have the same, or nearly the same, values. Duplicate observations may be legitimate and cause no problems for further analysis. For example, the data set may come from a repeated measures experiment in which a subject had the same measure taken more than once. Other times duplicates are errors in the data and could influence later analyses. For this reason, it is good practice to examine a data set for duplicate, and near-duplicate, observations.

Near-duplicate rows are the most difficult to address. One method of identifying near-duplicate observations is to search for duplicates on a subset of the columns. This identifies rows that match on the selected columns even when the remaining columns differ.

One of two actions is typically taken when there are duplicates. The first is to drop all but one of the observations. The other is to leave the observations in as repeated measures. Determining which of these actions to apply is beyond the scope of this book.
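
For instance, the first action can be done with the distinct() function from dplyr, which is used in the R examples below. The following is a minimal sketch, assuming a data frame df with a hypothetical identifier column id:

    library(dplyr)

    # Drop rows that repeat an earlier row in every column.
    df_unique <- distinct(df)

    # Drop rows that repeat an earlier row in the id column only,
    # keeping all other columns; this treats rows that match on id
    # as duplicates even if the other columns differ.
    df_one_per_id <- distinct(df, id, .keep_all = TRUE)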

The search for near-duplicate observations often uncovers inconsistencies in the data. Correcting these inconsistencies is needed when the observations are not being removed. The tools to correct these nearly identical observations are covered in the remaining chapters.

4.9.2 Examples - R

These examples use the airAccs.csv data set.

  1. We begin by using the same code as in the prior sections of this chapter to load the packages, import the csv file, and rename the variables.

    library(tidyverse)
    library(lubridate)
    airAccs_path <- file.path("..", "datasets", "airAccs.csv")
    air_accidents_in <- read_csv(airAccs_path, col_types = cols())
    Warning: Missing column names filled in: 'X1' [1]
    air_accidents_in <-
      rename(
        air_accidents_in,
        obs_num = 1,
        date = Date,
        plane_type = planeType,
        dead = Dead,
        aboard = Aboard,
        ground = Ground
        )
    
    air_accidents <-  air_accidents_in
    
    glimpse(air_accidents)
    Observations: 5,666
    Variables: 8
    $ obs_num    <dbl> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, ...
    $ date       <date> 1908-09-17, 1912-07-12, 1913-08-06, 1913-09-09, 19...
    $ location   <chr> "Fort Myer, Virginia", "Atlantic City, New Jersey",...
    $ operator   <chr> "Military - U.S. Army", "Military - U.S. Navy", "Pr...
    $ plane_type <chr> "Wright Flyer III", "Dirigible", "Curtiss seaplane"...
    $ dead       <dbl> 1, 5, 1, 14, 30, 21, 19, 20, 22, 19, 27, 20, 20, 23...
    $ aboard     <dbl> 2, 5, 1, 20, 30, 41, 19, 20, 22, 19, 28, 20, 20, 23...
    $ ground     <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...
  2. Find duplicates.

    The tidyverse function distinct() removes duplicates. This is typically not done until the duplicates have been investigated, and there is currently no tidyverse function that identifies duplicates without removing them. We will instead use the duplicated() function from base R. This function identifies rows that repeat a prior row, so the first occurrence of each duplicate is not flagged. To see all rows that are duplicated, we use duplicated() twice, the second time with the row order reversed (fromLast = TRUE), which flags what would otherwise be the first occurrence. The results of the two searches are combined with the logical or operator, |. The final set thus identifies every row that is a duplicate of another row.

    Only the date column is used for this search of duplicate rows. This allows us to look for near duplicates among accidents that occurred on the same date.

    dup_dates <-
      duplicated(pull(air_accidents,date)) |
        duplicated(pull(air_accidents,date), fromLast=TRUE)
    dup_dates_accidents <-
      air_accidents[dup_dates, c('date', 'location', 'operator')]
    
    head(dup_dates_accidents, n = 10)
    # A tibble: 10 x 3
       date       location                          operator              
       <date>     <chr>                             <chr>                 
     1 1920-08-16 College Park, Maryland            US Aerial Mail Service
     2 1920-08-16 Bedford, England                  By Air                
     3 1927-04-22 Floh, Germany                     Deutsche Lufthansa    
     4 1927-04-22 Goshen, Indiana                   US Aerial Mail Service
     5 1927-09-17 Hadley, New Jersey                Reynolds Airways      
     6 1927-09-17 Near Dunellen, New Jersey         Reynolds Airways      
     7 1927-09-23 Schleiz, Germany                  Deutsche Lufthansa    
     8 1927-09-23 Near schleiz, Thuringia,, Germany Deutsche Lufthansa    
     9 1927-11-16 Strasburg, France                 CIDNA                 
    10 1927-11-16 Over the Gulf of Finland          Aero O-Y              

    Note, only the first 10 rows are displayed, and only three of the variables.

    The first duplicates are on August 16th, 1920. These accidents occurred in different countries, so they do not appear to be duplicates and no action is needed on these two observations. The same is true of the next pair. The third pair occurs on September 17th, 1927. These both occurred in New Jersey and have the same operator. A check of a map on Google shows that the two New Jersey towns are right next to each other. This is likely a duplicate, although further investigation would be needed to know for sure. The rest of the duplicates would be examined similarly, and a decision to retain or delete would be made for each duplicated row.

    Note, duplicates should not be removed from the original .csv file. Instead, code should be written to identify the duplicated rows and remove them, as sketched below.
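
    For example, if investigation confirmed some rows as true duplicates, they could be removed by their obs_num values. This is a minimal sketch; the obs_num value used here is a hypothetical placeholder, not a confirmed duplicate:

    # Drop rows whose obs_num is in the confirmed-duplicate list
    # (127 is a hypothetical placeholder).
    confirmed_dups <- c(127)
    air_accidents <- filter(air_accidents, !(obs_num %in% confirmed_dups))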

4.9.3 Examples - Python

These examples use the airAccs.csv data set.

  1. We begin by using the same code as in the prior sections of this chapter to load the packages, import the csv file, and rename the variables.

    from pathlib import Path
    import pandas as pd
    airAccs_path = Path('..') / 'datasets' / 'airAccs.csv'
    air_accidents_in = pd.read_csv(airAccs_path)
    air_accidents_in = (
        air_accidents_in
            .rename(
                columns={
                    air_accidents_in.columns[0]: 'obs_num',
                    'Date': 'date', 
                    'planeType': 'plane_type',
                    'Dead': 'dead',
                    'Aboard': 'aboard',
                    'Ground': 'ground'}))
    
    air_accidents = air_accidents_in.copy(deep=True)
    
    print(air_accidents.dtypes)
    obs_num         int64
    date           object
    location       object
    operator       object
    plane_type     object
    dead          float64
    aboard        float64
    ground        float64
    dtype: object
  2. Find duplicates.

    The pandas duplicated() method will be used to identify the duplicate observations. The subset parameter restricts the search to the date column, which allows us to look for near duplicates among accidents that occurred on the same date. Setting the keep parameter to False includes all of the duplicated rows that were found. This parameter can also be set to 'first', which flags all but the first occurrence of each duplicate. This is useful when you want to remove all but one of the duplicated rows.

    dup_dates = air_accidents.duplicated(subset=['date'], keep=False)
    dup_dates_accidents = (
        air_accidents
            .loc[dup_dates.values, ['date', 'location', 'operator']])
    
    dup_dates_accidents['operator'] = dup_dates_accidents['operator'].str[:15]
    dup_dates_accidents['location'] = dup_dates_accidents['location'].str[:25]
    
    print(dup_dates_accidents.head(10))
               date                   location         operator
    35   1920-08-16     College Park, Maryland  US Aerial Mail 
    36   1920-08-16           Bedford, England           By Air
    119  1927-04-22              Floh, Germany  Deutsche Luftha
    120  1927-04-22            Goshen, Indiana  US Aerial Mail 
    125  1927-09-17         Hadley, New Jersey  Reynolds Airway
    126  1927-09-17  Near Dunellen, New Jersey  Reynolds Airway
    127  1927-09-23           Schleiz, Germany  Deutsche Luftha
    128  1927-09-23  Near schleiz, Thuringia,,  Deutsche Luftha
    131  1927-11-16          Strasburg, France            CIDNA
    132  1927-11-16   Over the Gulf of Finland         Aero O-Y

    Note, only the first 10 rows are displayed, and only three of the variables.

    The first duplicates are on August 16th, 1920. These accidents occurred in different countries, so they do not appear to be duplicates and no action is needed on these two observations. The same is true of the next pair. The third pair occurs on September 17th, 1927. These both occurred in New Jersey and have the same operator. A check of a map on Google shows that the two New Jersey towns are right next to each other. This is likely a duplicate, although further investigation would be needed to know for sure. The rest of the duplicates would be examined similarly, and a decision to retain or delete would be made for each duplicated row.

    Note, the duplicates should not be removed from the original .csv file. Instead, code should be written to identify the duplicated rows and remove them, as sketched below.
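
    For example, if investigation confirmed some rows as true duplicates, they could be removed by their obs_num values. This is a minimal sketch; the obs_num value used here is a hypothetical placeholder, not a confirmed duplicate:

    # Drop rows whose obs_num is in the confirmed-duplicate list
    # (127 is a hypothetical placeholder).
    confirmed_dups = [127]
    air_accidents = air_accidents.loc[~air_accidents['obs_num'].isin(confirmed_dups)]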

4.9.4 Exercises

These exercises use the PSID.csv data set that was imported in the prior section.

  1. Import the PSID.csv data set.

  2. What variables define an observation in this data set?

  3. Are there any duplicate observations?