3.1 Preparatory exercises
The skills in these exercises are used in the exercises at the end of the discourses of this chapter. Take a moment and complete these to confirm that you are prepared for this chapter. If these exercises are difficult, review the prior chapter.
Import the MplsStops.csv data set. Hint: It may take a large number of rows to determine the correct type for some variables.
The following is used at the RStudio prompt to enter Python mode.
library(reticulate)
repl_python()
The remainder is Python code.
from pathlib import Path
import pandas as pd

mpls_stops_path = Path('..') / 'datasets' / 'MplsStops.csv'
mpls_stops = pd.read_csv(mpls_stops_path)
print(mpls_stops.dtypes)
Unnamed: 0          int64
idNum              object
date               object
problem            object
MDC                object
citationIssued     object
personSearch       object
vehicleSearch      object
preRace            object
race               object
gender             object
lat               float64
long              float64
policePrecinct      int64
neighborhood       object
dtype: object
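The output above shows date stored as object, that is, as text. The following is a minimal sketch of one way to request a datetime type while reading. It assumes a small inline sample stands in for the file: the sample_csv name and the four-column subset are illustrative, and the values are copied from the first rows of the data set. The parse_dates parameter of read_csv asks pandas to convert the named columns.

```python
import io

import pandas as pd

# Illustrative stand-in for MplsStops.csv: three rows, four of the columns.
sample_csv = io.StringIO(
    "idNum,date,lat,policePrecinct\n"
    "17-000003,2017-01-01 00:00:42,44.966617,1\n"
    "17-000007,2017-01-01 00:03:07,44.980450,1\n"
    "17-000073,2017-01-01 00:23:15,44.948350,5\n"
)

# parse_dates converts the listed columns to a datetime type during the read.
sample = pd.read_csv(sample_csv, parse_dates=['date'])
print(sample.dtypes)
```

The same parameter could be added to the read_csv call for the full file if a datetime column is wanted.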
Are there any rows that need to be ignored in the MplsStops data set?
print(mpls_stops.head())
   Unnamed: 0      idNum  ... policePrecinct     neighborhood
0        6823  17-000003  ...              1  Cedar Riverside
1        6824  17-000007  ...              1    Downtown West
2        6825  17-000073  ...              5         Whittier
3        6826  17-000092  ...              5         Whittier
4        6827  17-000098  ...              1    Downtown West

[5 rows x 15 columns]
print(mpls_stops.tail())
       Unnamed: 0      idNum  ... policePrecinct      neighborhood
51915       60834  17-491442  ...              2      Marcy Holmes
51916       60835  17-491445  ...              5          Whittier
51917       60836  17-491462  ...              2  St. Anthony East
51918       60837  17-491480  ...              2      Marcy Holmes
51919       60838  17-491482  ...              5   Lowry Hill East

[5 rows x 15 columns]
The rows at the start and end of the data frame look like observations. There is no indication of non-data rows, so no rows need to be ignored.
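Had the file contained non-data rows, read_csv has parameters to skip them. The following is a minimal sketch, assuming a hypothetical file with one title line above the header and one footer line below the data. MplsStops.csv has neither, and the title and footer text are invented for illustration.

```python
import io

import pandas as pd

# Hypothetical file layout: a title line, the header, two observations, a footer line.
csv_text = (
    "Example title line\n"
    "idNum,policePrecinct\n"
    "17-000003,1\n"
    "17-000007,1\n"
    "Example footer line\n"
)

# skiprows drops lines from the top of the file; skipfooter drops lines from the bottom.
# skipfooter is only supported by the slower Python parsing engine.
trimmed = pd.read_csv(
    io.StringIO(csv_text), skiprows=1, skipfooter=1, engine='python'
)
print(trimmed)
```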
Are there any special symbols that need to be set to missing in the MplsStops data set? If so, change the special symbols to the missing indicator.
There are several ways to look at more of the data frame. The data frame can be opened with the viewer in RStudio. The complete head() of the data frame can be viewed in a markdown file by using the to_string() method. The scroll bar at the bottom of the HTML display is used to see all the columns. The to_string() method is often not as useful when viewing the results in the console window.
print(mpls_stops.head().to_string())
   Unnamed: 0      idNum                 date     problem  MDC  citationIssued  personSearch  vehicleSearch  preRace          race   gender        lat        long  policePrecinct     neighborhood
0        6823  17-000003  2017-01-01 00:00:42  suspicious  MDC             NaN            NO             NO  Unknown       Unknown  Unknown  44.966617  -93.246458               1  Cedar Riverside
1        6824  17-000007  2017-01-01 00:03:07  suspicious  MDC             NaN            NO             NO  Unknown       Unknown     Male  44.980450  -93.271340               1    Downtown West
2        6825  17-000073  2017-01-01 00:23:15     traffic  MDC             NaN            NO             NO  Unknown         White   Female  44.948350  -93.275380               5         Whittier
3        6826  17-000092  2017-01-01 00:33:48  suspicious  MDC             NaN            NO             NO  Unknown  East African     Male  44.948360  -93.281350               5         Whittier
4        6827  17-000098  2017-01-01 00:37:58     traffic  MDC             NaN            NO             NO  Unknown         White   Female  44.979078  -93.262076               1    Downtown West
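An alternative to to_string() is to change the pandas display options temporarily. The following is a minimal sketch, assuming a hypothetical one-row, 30-column frame (wide) stands in for mpls_stops. The option_context() context manager applies the display settings only inside the with block.

```python
import pandas as pd

# Hypothetical wide frame standing in for mpls_stops: 1 row, 30 columns.
wide = pd.DataFrame({f'col{i}': [i] for i in range(30)})

# With the default options, the middle columns may be replaced by '...'.
print(wide)

# max_columns=None removes the column limit; a large width keeps the row on one line.
# Both settings revert to their previous values when the block ends.
with pd.option_context('display.max_columns', None, 'display.width', 1000):
    full_text = repr(wide)
    print(full_text)
```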
Several columns use Unknown to identify missing data. The Unknown value can be added to the na_values parameter list.
mpls_stops = pd.read_csv(mpls_stops_path, na_values=['', 'Unknown'])
print(mpls_stops.dtypes)
Unnamed: 0          int64
idNum              object
date               object
problem            object
MDC                object
citationIssued     object
personSearch       object
vehicleSearch      object
preRace            object
race               object
gender             object
lat               float64
long              float64
policePrecinct      int64
neighborhood       object
dtype: object
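The column types are the same as before, so dtypes does not show whether the na_values setting had any effect. Counting the missing values per column is one way to check. The following is a minimal sketch, assuming a three-row inline sample stands in for the file: the csv_text, raw, and cleaned names are illustrative, and the values are copied from the first rows of the data set.

```python
import io

import pandas as pd

# Illustrative stand-in for MplsStops.csv: three rows, three of the columns.
csv_text = (
    "idNum,race,gender\n"
    "17-000003,Unknown,Unknown\n"
    "17-000007,Unknown,Male\n"
    "17-000073,White,Female\n"
)

# Without na_values, 'Unknown' is read as an ordinary string.
raw = pd.read_csv(io.StringIO(csv_text))

# With na_values, 'Unknown' is read as the missing indicator (NaN).
cleaned = pd.read_csv(io.StringIO(csv_text), na_values=['Unknown'])

print(raw.isna().sum())
print(cleaned.isna().sum())
```

Applied to the full data frame, mpls_stops.isna().sum() gives the same per-column count.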
Sort the data frame by policePrecinct. Hint: This requires searching outside of the material that has been covered.
mpls_stops = mpls_stops.sort_values(by='policePrecinct')
print(mpls_stops.head())
       Unnamed: 0      idNum  ... policePrecinct     neighborhood
0            6823  17-000003  ...              1  Cedar Riverside
19486       27077  17-162198  ...              1    Downtown East
46443       55167  17-436199  ...              1       North Loop
19508       27099  17-162356  ...              1    Downtown West
19514       27105  17-162423  ...              1    Downtown West

[5 rows x 15 columns]
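The sorted output keeps the original index labels, and rows within a precinct are no longer in their original order. This is because the default sort algorithm in sort_values() is quicksort, which is not guaranteed to be stable. The following is a minimal sketch of a stable sort with a fresh index, assuming a five-row frame (stops) built from the first rows of the data set stands in for mpls_stops.

```python
import pandas as pd

# Illustrative stand-in for mpls_stops: the first five stops and their precincts.
stops = pd.DataFrame({
    'idNum': ['17-000003', '17-000007', '17-000073', '17-000092', '17-000098'],
    'policePrecinct': [1, 1, 5, 5, 1],
})

# kind='stable' keeps rows with equal policePrecinct in their original order.
# reset_index(drop=True) replaces the old index labels with 0, 1, 2, ...
ordered = (
    stops.sort_values(by='policePrecinct', kind='stable')
    .reset_index(drop=True)
)
print(ordered)
```

Whether to reset the index is a design choice: keeping the original labels preserves a link back to the row positions in the file.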