SSCC - Social Science Computing Cooperative Supporting Statistical Analysis for Research

3.1 Preparatory exercises

The skills in these exercises are used in the exercises at the end of each discourse in this chapter. Take a moment to complete them and confirm that you are prepared for this chapter. If these exercises are difficult, review the prior chapter.

  1. Import the MplsStops.csv data set.

    Hint: It may take a large number of rows to determine the correct type for some variables.

    The following is used at the RStudio prompt to enter Python mode.

    library(reticulate)
    repl_python()

    The remainder is Python code.

    from pathlib import Path
    import pandas as pd
    mpls_stops_path = Path('..') / 'datasets' / 'MplsStops.csv'
    mpls_stops = pd.read_csv(mpls_stops_path)
    
    print(mpls_stops.dtypes)
    Unnamed: 0          int64
    idNum              object
    date               object
    problem            object
    MDC                object
    citationIssued     object
    personSearch       object
    vehicleSearch      object
    preRace            object
    race               object
    gender             object
    lat               float64
    long              float64
    policePrecinct      int64
    neighborhood       object
    dtype: object
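
The hint about type detection can be illustrated with a minimal sketch. The miniature CSV below is hypothetical (it is not the real MplsStops file, and the value `2A` is invented for illustration). It assumes that a single non-numeric value late in a column is enough to keep pandas from inferring a numeric type, and it shows two `read_csv` options that may help: `low_memory=False`, which pandas documents as a way to avoid mixed-type inference, and `dtype=`, which states the intended type up front.

```python
import io

import pandas as pd
from pandas.api.types import is_numeric_dtype, is_string_dtype

# Hypothetical miniature file: the last row breaks the integer pattern.
csv_text = (
    "idNum,policePrecinct\n"
    "17-000003,1\n"
    "17-000007,5\n"
    "17-000073,2A\n"
)

# Inference considers every value in the column, so the stray "2A"
# prevents policePrecinct from being read as a numeric column.
inferred = pd.read_csv(io.StringIO(csv_text), low_memory=False)
print(inferred.dtypes)

# Alternatively, declare the intended type instead of relying on inference.
declared = pd.read_csv(
    io.StringIO(csv_text), dtype={'policePrecinct': 'string'}
)
print(declared.dtypes)
```

In the real data set the inferred types shown above look reasonable, so neither option is required there.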

  2. Are there any rows that need to be ignored in the MplsStops data set?

    print(mpls_stops.head())
       Unnamed: 0      idNum  ... policePrecinct     neighborhood
    0        6823  17-000003  ...              1  Cedar Riverside
    1        6824  17-000007  ...              1    Downtown West
    2        6825  17-000073  ...              5         Whittier
    3        6826  17-000092  ...              5         Whittier
    4        6827  17-000098  ...              1    Downtown West
    
    [5 rows x 15 columns]
    print(mpls_stops.tail())
           Unnamed: 0      idNum  ... policePrecinct      neighborhood
    51915       60834  17-491442  ...              2      Marcy Holmes
    51916       60835  17-491445  ...              5          Whittier
    51917       60836  17-491462  ...              2  St. Anthony East
    51918       60837  17-491480  ...              2      Marcy Holmes
    51919       60838  17-491482  ...              5   Lowry Hill East
    
    [5 rows x 15 columns]

    The rows at the start and end of the data frame look like observations. There is no indication of non-data rows.
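
MplsStops needs no rows ignored, but for comparison, here is a minimal sketch of what ignoring rows could look like. The file contents, including the title and footer lines, are hypothetical. The sketch uses the `skiprows` and `skipfooter` parameters of `read_csv`; `skipfooter` is only supported by the Python parsing engine, so `engine='python'` is set explicitly.

```python
import io

import pandas as pd

# Hypothetical file with a title line above the header
# and a source note below the data.
csv_text = (
    "Minneapolis stops (hypothetical title line)\n"
    "idNum,policePrecinct\n"
    "17-000003,1\n"
    "17-000007,5\n"
    "Source: hypothetical footer note\n"
)

# Skip one line at the top and one at the bottom of the file.
stops = pd.read_csv(
    io.StringIO(csv_text),
    skiprows=1,
    skipfooter=1,
    engine='python',
)
print(stops)
```

Checking `head()` and `tail()` as above is what tells you whether values like these are needed at all.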

  3. Are there any special symbols that need to be set to missing in the MplsStops data set? If so, change the special symbols to the missing indicator.

    There are several ways to look at more of the data frame. The data frame can be opened with the viewer in RStudio. The complete head() of the data frame can be viewed in a markdown file by using the to_string() method; the scroll bar at the bottom of the HTML display is used to see all the columns. The to_string() method is often less useful when viewing the results in the console window.

    print(mpls_stops.head().to_string())
       Unnamed: 0      idNum                 date     problem  MDC citationIssued personSearch vehicleSearch  preRace          race   gender        lat       long  policePrecinct     neighborhood
    0        6823  17-000003  2017-01-01 00:00:42  suspicious  MDC            NaN           NO            NO  Unknown       Unknown  Unknown  44.966617 -93.246458               1  Cedar Riverside
    1        6824  17-000007  2017-01-01 00:03:07  suspicious  MDC            NaN           NO            NO  Unknown       Unknown     Male  44.980450 -93.271340               1    Downtown West
    2        6825  17-000073  2017-01-01 00:23:15     traffic  MDC            NaN           NO            NO  Unknown         White   Female  44.948350 -93.275380               5         Whittier
    3        6826  17-000092  2017-01-01 00:33:48  suspicious  MDC            NaN           NO            NO  Unknown  East African     Male  44.948360 -93.281350               5         Whittier
    4        6827  17-000098  2017-01-01 00:37:58     traffic  MDC            NaN           NO            NO  Unknown         White   Female  44.979078 -93.262076               1    Downtown West

    Several columns use Unknown to identify missing data. The Unknown value can be added to the na_values parameter list.

    mpls_stops = pd.read_csv(mpls_stops_path, na_values=['', 'Unknown'])
    
    print(mpls_stops.dtypes)
    Unnamed: 0          int64
    idNum              object
    date               object
    problem            object
    MDC                object
    citationIssued     object
    personSearch       object
    vehicleSearch      object
    preRace            object
    race               object
    gender             object
    lat               float64
    long              float64
    policePrecinct      int64
    neighborhood       object
    dtype: object
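
The dtypes do not change after the re-import, so they cannot confirm that the Unknown values were converted. One way to check is to count the symbol before the change and the missing values after it. The sketch below is a possible approach, not part of the original solution. It uses a small in-memory CSV built from four columns of the five rows displayed above rather than the full file.

```python
import io

import pandas as pd

# Four columns from the five rows shown by head().to_string() above.
csv_text = (
    "idNum,preRace,race,gender\n"
    "17-000003,Unknown,Unknown,Unknown\n"
    "17-000007,Unknown,Unknown,Male\n"
    "17-000073,Unknown,White,Female\n"
    "17-000092,Unknown,East African,Male\n"
    "17-000098,Unknown,White,Female\n"
)

# Count how often the special symbol appears in each column.
raw = pd.read_csv(io.StringIO(csv_text))
unknown_counts = (raw == 'Unknown').sum()
print(unknown_counts)

# Re-import with the symbol treated as missing, then count missing values.
clean = pd.read_csv(io.StringIO(csv_text), na_values=['', 'Unknown'])
missing_counts = clean.isna().sum()
print(missing_counts)
```

If the two sets of counts agree column by column, the symbol was converted everywhere it occurred.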

  4. Sort the data frame by policePrecinct. Hint: This requires searching outside of the material that has been covered.

    mpls_stops = mpls_stops.sort_values(by='policePrecinct')
    
    print(mpls_stops.head())
           Unnamed: 0      idNum  ... policePrecinct     neighborhood
    0            6823  17-000003  ...              1  Cedar Riverside
    19486       27077  17-162198  ...              1    Downtown East
    46443       55167  17-436199  ...              1       North Loop
    19508       27099  17-162356  ...              1    Downtown West
    19514       27105  17-162423  ...              1    Downtown West
    
    [5 rows x 15 columns]
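
In the sorted output above, the row labels (0, 19486, 46443, ...) are no longer in order within precinct 1, and the original labels are carried along with the rows. The sketch below shows two optional refinements, offered as a possible variation rather than a required part of the solution: `kind='stable'` keeps tied rows in their original order, and `reset_index(drop=True)` renumbers the rows. The small data frame uses the idNum and policePrecinct values from the five rows displayed earlier.

```python
import pandas as pd

# idNum and policePrecinct from the five rows shown by head() earlier.
stops = pd.DataFrame({
    'idNum': ['17-000003', '17-000007', '17-000073',
              '17-000092', '17-000098'],
    'policePrecinct': [1, 1, 5, 5, 1],
})

# A stable sort keeps rows with equal policePrecinct in their original order.
by_precinct = stops.sort_values(by='policePrecinct', kind='stable')
print(by_precinct)

# The original row labels travel with the rows; renumber them if desired.
renumbered = by_precinct.reset_index(drop=True)
print(renumbered)
```

Whether to renumber depends on whether the original row labels are needed later, for example to trace a row back to the source file.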