2.3 More challenging csv and deliminated files
Import the
amis.csv
data set.Set RStudio to python mode
library(reticulate) repl_python()
from pathlib import Path import pandas as pd
amis_path = Path('..') / 'datasets' / 'amis.csv'
amis = pd.read_csv(amis_path)
The above
read_csv()
will cause an error. We will need to look at part of the data frame to see what we need to change to import the data set.amis = ( pd.read_csv( amis_path, sep='^', nrows=25)) print(amis.head())
speed 0 Speeds of cars (in miles per hour). 1 period 2 A numeric column indicating the time that th... 3 warning 4 A numeric column indicating whether the loca...
Are there any rows that need to be ignored in the amis data set? If so, modify your import to account for them.
print(amis.head(15))
speed 0 Speeds of cars (in miles per hour). 1 period 2 A numeric column indicating the time that th... 3 warning 4 A numeric column indicating whether the loca... 5 pair 6 A numeric column giving the pair number at w... 7 Source 8 The data were kindly made available by Mr. G... 9 ,"speed","period","warning","pair" 10 1,26,1,1,1 11 2,26,1,1,1 12 3,26,1,1,1 13 4,26,1,1,1 14 5,27,1,1,1
There are 10 rows that do not contain either data or column names in the file. We will import the csv file skipping these rows.
amis = pd.read_csv(amis_path, skiprows=10)
Are there any special symbols that need to be set to missing in the
amis
data set? If so, modify your import to account for them.print(amis.head(15))
Unnamed: 0 speed period warning pair 0 1 26 1 1 1 1 2 26 1 1 1 2 3 26 1 1 1 3 4 26 1 1 1 4 5 27 1 1 1 5 6 28 1 1 1 6 7 28 1 1 1 7 8 28 1 1 1 8 9 28 1 1 1 9 10 29 1 1 1 10 11 29 1 1 1 11 12 29 1 1 1 12 13 29 1 1 1 13 14 29 1 1 1 14 15 29 1 1 1
There does not appear to be any missing identifiers in the data set.
Import the
mifem.csv
data set.mifem_path = Path('..') / 'datasets' / 'mifem.csv' mifem = pd.read_csv(mifem_path)
Is there any meta data at the top or bottom of the
mifem
data set? You will need to determine how to view the bottom of a data set. If so, modify your import to account for them.print(mifem.head())
Unnamed: 0 outcome age yronset premi ... diabetes highbp hichol angina stroke 0 1 live 63 85 n ... n y y n n 1 6 live 55 85 n ... n y y n n 2 8 live 68 85 y ... nk y nk y n 3 10 live 64 85 n ... n y n y n 4 11 dead 67 85 n ... nk nk nk nk nk [5 rows x 11 columns]
print(mifem.tail())
Unnamed: 0 outcome age yronset ... highbp hichol angina stroke 1290 6347 live 69 93 ... y nk n y 1291 6359 live 54 93 ... y y n n 1292 6360 live 64 93 ... y y n n 1293 6361 live 36 93 ... n y n n 1294 6366 live 65 93 ... n nk n n [5 rows x 11 columns]
There does not appear to be any meta data at the top or bottom of the data frame.
Are there any special symbols that need to be set to missing in the
mifem
data set? If so, modify your import to account for them.print(mifem.head().to_string())
Unnamed: 0 outcome age yronset premi smstat diabetes highbp hichol angina stroke 0 1 live 63 85 n x n y y n n 1 6 live 55 85 n c n y y n n 2 8 live 68 85 y nk nk y nk y n 3 10 live 64 85 n x n y n y n 4 11 dead 67 85 n nk nk nk nk nk nk
There are values of
nk
in several of the variables. The data descriptions do not identify any other indicator for missing.mifem = pd.read_csv(mifem_path, na_values=['', 'nk']) print(mifem.head().to_string())
Unnamed: 0 outcome age yronset premi smstat diabetes highbp hichol angina stroke 0 1 live 63 85 n x n y y n n 1 6 live 55 85 n c n y y n n 2 8 live 68 85 y NaN NaN y NaN y n 3 10 live 64 85 n x n y n y n 4 11 dead 67 85 n NaN NaN NaN NaN NaN NaN