3.1 Preparatory exercises

SSCC - Social Science Computing Cooperative

Supporting Statistical Analysis for Research

The skills practiced in these exercises are needed for the exercises at the end of each discourse in this chapter. Take a moment to complete them to confirm that you are prepared for this chapter. If these exercises are difficult, review the prior chapter.

Import the MplsStops.csv data set.

Hint: It may take a large number of rows to determine the correct type for some variables.

The following is used at the RStudio prompt to enter Python mode.

library(reticulate)
repl_python()

The remainder is Python code.

from pathlib import Path
import pandas as pd

mpls_stops_path = Path('..') / 'datasets' / 'MplsStops.csv'
mpls_stops = pd.read_csv(mpls_stops_path)

print(mpls_stops.dtypes)

Unnamed: 0          int64
idNum              object
date               object
problem            object
MDC                object
citationIssued     object
personSearch       object
vehicleSearch      object
preRace            object
race               object
gender             object
lat               float64
long              float64
policePrecinct      int64
neighborhood       object
dtype: object
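The hint about column types can be illustrated with a small sketch. The data below is made up for illustration and is not taken from MplsStops.csv. It assumes that a single text value late in a column is enough to stop that column from being read as integers, and that the dtype parameter of read_csv can be used to set a column's type explicitly.

```python
from io import StringIO

import pandas as pd

# Made-up toy data, not rows from MplsStops.csv.
toy_csv = StringIO(
    "id,precinct\n"
    "1,1\n"
    "2,5\n"
    "3,unknown\n"
)

# The type of each column is inferred from the values that are read.
# The text value in the last row keeps precinct from being an integer column.
toy = pd.read_csv(toy_csv)
print(toy.dtypes)

# Rewind the in-memory file and read it again with explicit types.
toy_csv.seek(0)
toy_typed = pd.read_csv(toy_csv, dtype={"id": "int64", "precinct": "string"})
print(toy_typed.dtypes)
```

Setting dtype is one option when the inferred type is not the one wanted. Fixing the values that caused the unexpected type is often the better first step.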

Are there any rows that need to be ignored in the MplsStops data set?

print(mpls_stops.head())

   Unnamed: 0      idNum  ... policePrecinct     neighborhood
0        6823  17-000003  ...              1  Cedar Riverside
1        6824  17-000007  ...              1    Downtown West
2        6825  17-000073  ...              5         Whittier
3        6826  17-000092  ...              5         Whittier
4        6827  17-000098  ...              1    Downtown West

[5 rows x 15 columns]

print(mpls_stops.tail())

       Unnamed: 0      idNum  ... policePrecinct      neighborhood
51915       60834  17-491442  ...              2      Marcy Holmes
51916       60835  17-491445  ...              5          Whittier
51917       60836  17-491462  ...              2  St. Anthony East
51918       60837  17-491480  ...              2      Marcy Holmes
51919       60838  17-491482  ...              5   Lowry Hill East

[5 rows x 15 columns]

The rows at the start and end of the data frame look like observations. There is no indication of non-data rows.
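MplsStops.csv does not need this, but a file that did have non-data rows could be handled along the following lines. This is a minimal sketch on a made-up file with one note line above the header and one footer line below the data. It uses the skiprows and skipfooter parameters of read_csv; skipfooter requires the Python parsing engine.

```python
from io import StringIO

import pandas as pd

# Made-up toy file: a note line, the header, two data rows, and a footer line.
toy_csv = StringIO(
    "Exported by a hypothetical tool\n"
    "id,precinct\n"
    "1,1\n"
    "2,5\n"
    "End of file"
)

# Skip the first line and the last line so only the header and data are parsed.
toy = pd.read_csv(toy_csv, skiprows=1, skipfooter=1, engine="python")
print(toy)
```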

Are there any special symbols that need to be set to missing in the MplsStops data set? If so, change the special symbols to the missing indicator.

There are several ways to look at more of the data frame. The data frame can be opened with the viewer in RStudio. The complete head() of the data frame can be viewed in a markdown file by using the to_string() method. The scroll bar at the bottom of the HTML display is used to see all the columns. The to_string() method is often less useful when viewing the results in the console window.

print(mpls_stops.head().to_string())

   Unnamed: 0      idNum                 date     problem  MDC citationIssued personSearch vehicleSearch  preRace          race   gender        lat       long  policePrecinct     neighborhood
0        6823  17-000003  2017-01-01 00:00:42  suspicious  MDC            NaN           NO            NO  Unknown       Unknown  Unknown  44.966617 -93.246458               1  Cedar Riverside
1        6824  17-000007  2017-01-01 00:03:07  suspicious  MDC            NaN           NO            NO  Unknown       Unknown     Male  44.980450 -93.271340               1    Downtown West
2        6825  17-000073  2017-01-01 00:23:15     traffic  MDC            NaN           NO            NO  Unknown         White   Female  44.948350 -93.275380               5         Whittier
3        6826  17-000092  2017-01-01 00:33:48  suspicious  MDC            NaN           NO            NO  Unknown  East African     Male  44.948360 -93.281350               5         Whittier
4        6827  17-000098  2017-01-01 00:37:58     traffic  MDC            NaN           NO            NO  Unknown         White   Female  44.979078 -93.262076               1    Downtown West
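Another possible way to see every column is to change the pandas display options for a single print. The sketch below uses a made-up 30-column data frame as a stand-in for a wide data set. It assumes pd.option_context with display.max_columns set to None, which removes the column limit only inside the with block.

```python
import pandas as pd

# Made-up wide data frame with 30 columns, used only for illustration.
toy = pd.DataFrame({f"col{i}": [i, i + 1] for i in range(30)})

# Inside the with block no columns are replaced by "...".
# The previous display setting is restored when the block ends.
with pd.option_context("display.max_columns", None):
    shown = repr(toy)

print(shown)
```

The columns that do not fit the display width are wrapped onto additional lines rather than dropped.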

Several columns use Unknown to identify missing data. The Unknown value can be added to the na_values parameter list.

mpls_stops = pd.read_csv(mpls_stops_path, na_values=['', 'Unknown'])

print(mpls_stops.dtypes)

Unnamed: 0          int64
idNum              object
date               object
problem            object
MDC                object
citationIssued     object
personSearch       object
vehicleSearch      object
preRace            object
race               object
gender             object
lat               float64
long              float64
policePrecinct      int64
neighborhood       object
dtype: object
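The dtypes are the same as before, so they do not show whether Unknown was converted to missing. Counting the missing values in each column is one way to check. The sketch below uses made-up rows that borrow three column names from the data set; the values are illustrative only.

```python
from io import StringIO

import pandas as pd

# Made-up toy data using a few of the MplsStops column names.
toy_csv = StringIO(
    "race,gender,citationIssued\n"
    "Unknown,Male,\n"
    "White,Unknown,NO\n"
    "Unknown,Female,YES\n"
)

# Unknown is read as missing, as are empty fields.
toy = pd.read_csv(toy_csv, na_values=["Unknown"])

# isna() marks the missing cells and sum() counts them by column.
missing_counts = toy.isna().sum()
print(missing_counts)
```

The same check on the full data frame would be mpls_stops.isna().sum().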

Sort the data frame by policePrecinct. Hint: this requires searching outside of the material that has been covered.

mpls_stops = mpls_stops.sort_values(by='policePrecinct')

print(mpls_stops.head())

       Unnamed: 0      idNum  ... policePrecinct     neighborhood
0            6823  17-000003  ...              1  Cedar Riverside
19486       27077  17-162198  ...              1    Downtown East
46443       55167  17-436199  ...              1       North Loop
19508       27099  17-162356  ...              1    Downtown West
19514       27105  17-162423  ...              1    Downtown West

[5 rows x 15 columns]
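The row labels in the sorted output keep their original values, and rows within the same precinct are not in their original order. If either matters, a stable sort followed by resetting the index is one option. This is a minimal sketch on a made-up four-row data frame; kind="stable" and reset_index(drop=True) are choices made for this illustration and are not part of the exercise solution.

```python
import pandas as pd

# Made-up toy data frame, used only for illustration.
toy = pd.DataFrame(
    {
        "policePrecinct": [5, 1, 2, 1],
        "neighborhood": ["Whittier", "Downtown West", "Marcy Holmes", "North Loop"],
    }
)

# A stable sort keeps tied rows in their original relative order.
# reset_index(drop=True) replaces the old row labels with 0, 1, 2, ...
toy_sorted = toy.sort_values(by="policePrecinct", kind="stable").reset_index(drop=True)
print(toy_sorted)
```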