Supporting Statistical Analysis for Research

2.3 More challenging csv and delimited files

Not all data files are as easy to import as the Chile.csv data. For example a data file may have extra rows that do not contain data or column names and there may be some missing values. This section examines methods to address such data file issues.

2.3.1 Data concepts

2.3.1.1 Meta data in data files

Data files may have rows at the beginning, and sometimes the end, that are not data or column names. This text is often metadata, information about the data in file. While this information maybe of interest to a data analyst, it is not part of the row and column information of the data table contained in the file. As such, this text needs to be excluded from the data frame.

A csv file with meta data may look like the following made up data file.

    Wisconsin data set on something or other
Collected by the unkown agency on some date

A,B,C,D
-3,*,8,Sun Prairie
9,13,,Verona
* indicates an error occured when collecting the data

This made up file has three lines (two with text on them and one empty) at the start of the file and one line at the end of the file that describe information about the data. These are the lines that need to be excluded from the data frame.

2.3.1.2 Missing data

Real world data is often incomplete: some observations may not have a value for all columns. These values that are not provided are called missing data. Some examples of why this can happen are, someone skipping a question on a survey, an unreadable number on a lab report, and test equipment failures.

Data files can use two methods to indicate that a value is missing. One is to not provide a value in the data set. The other is to use a special string of characters or numbers. Almost any string of characters or numbers can be used to indicate a value is missing. Special indicators for missing may be documented for you either in the data file with meta data, as was done in the made up file, or in a separate file that documents the data set. Sometimes there is no documentation of the missing values and they have to be discovered by inspecting the data.

The made up data set above uses both a special indicator and not providing a value to indicate missing. The row with "Verona" has no value for the third column and the "*" character is used in column 2 of the row with "Sun Prairie".

2.3.2 Cleaning

Most data processing languages have their own special object to indicate a missing data value. (Some languages have multiple special objects to identify different types of missing data.) These missing data objects let the functions of a programming language know not to try to use the missing data in a calculation. For example, a function that calculates the mean of a column could exclude the missing values from its calculation.

The missing values indicators in a data file need to be changed to the missing value object used by the program. Some of this work is done automatically in R and Python's read functions. An example of this automatic change to the missing object is when no value is provided in the file. (Some functions will use "", the empty character, instead of the missing object when the column is being read as a character.)

When a data file uses a special indicator for missing the read function can assign the missing object in place of the special indicator, assuming the special indicator is not a valid value in any of the other columns. This is the case in the made up file above using "*", since "*" is not a valid number or name of a town in Wisconsin. An example where this is not the case is a data set that uses 3 to indicate there is no answer to a yes or no question, and there are also numeric columns in the data set that can take on the value of 3. In cases such as this where the special missing value can be a valid value in another column, it is better to correct for missing values after the file has been imported.

2.3.3 Examples - R

The attendance data set used in this example has a number of features that occur with real world data. We will use this data file to demonstrate addressing several format issues that can occur. The remaining issues of the data set will be addressed in the cleaning chapter.

2.3.3.1 Acquisition - Importing a csv file with meta data

1. We begin by reading the data set in.

attnd_path <- file.path("..", "datasets", "attendance.csv")
attnd <- read_csv(attnd_path, col_types = cols())
Warning: Missing column names filled in: 'X2' [2], 'X3' [3], 'X4' [4],
'X5' [5], 'X6' [6], 'X7' [7], 'X8' [8], 'X9' [9], 'X10' [10], 'X11' [11],
'X12' [12], 'X13' [13], 'X14' [14], 'X15' [15], 'X16' [16], 'X17' [17]

The read_csv() function produces a warning that there are missing variable names. It looks like only first column has a varible name. This is an indication that there may be text prior to the data.

2. We next look at the beginning of the data to see what is in the first few rows. The purpose of this is to identify where the data starts in the file. There may be a row that has column names in it. Note, the data file could also be opened in a text editor, like notepad, to examine the structure of the start and end of the file. This method can be easier and less confusing if there is a lot of non-data in the file. One common workflow is to start by importing the data set. If it works, great. If not and it is clear what needs to be fixed in the import, the file is then imported with the correction. If it is not clear what corrections need to be done, the file is then viewed with an editor of some kind (Notepad.)

head(attnd)
# A tibble: 6 x 17
Table 43. Aver~ X2    X3    X4    X5    X6    X7    X8    X9    X10
<chr>            <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr>
1 <NA>             Tota~ <NA>  <NA>  <NA>  <NA>  <NA>  <NA>  <NA>  Elem~
2 <NA>             ADA ~ <NA>  Aver~ <NA>  Aver~ <NA>  Aver~ <NA>  ADA ~
3 1                2     <NA>  3     <NA>  4     <NA>  5     <NA>  6
4 United States .~ 93.1  (0.2~ 6.6   (0.0~ 180   (0.1) 1,193 (3.1) 94.0
5 Alabama .......~ 93.8  (1.2~ 7.0   (0.0~ 180   (0.8) 1,267 (12.~ 93.8
6 Alaska ........~ 89.9  (1.2~ 6.5   (0.0~ 180   (3.4) 1,163 (22.~ 91.3
# ... with 7 more variables: X11 <chr>, X12 <chr>, X13 <chr>, X14 <chr>,
#   X15 <chr>, X16 <chr>, X17 <chr>
glimpse(attnd)
Observations: 64
Variables: 17
$Table 43. Average daily attendance (ADA) as a percentage of total enrollment, school day length, and school year length in public schools, by school level and state: 2007-08 <chr> ...$ X2                                                                                                                                                                              <chr> ...
$X3 <chr> ...$ X4                                                                                                                                                                              <chr> ...
$X5 <chr> ...$ X6                                                                                                                                                                              <chr> ...
$X7 <chr> ...$ X8                                                                                                                                                                              <chr> ...
$X9 <chr> ...$ X10                                                                                                                                                                             <chr> ...
$X11 <chr> ...$ X12                                                                                                                                                                             <chr> ...
$X13 <chr> ...$ X14                                                                                                                                                                             <chr> ...
$X15 <chr> ...$ X16                                                                                                                                                                             <chr> ...
\$ X17                                                                                                                                                                             <chr> ...

The third row of the file (second row of the data frame) is the first row that has character data in numerous columns. This might be column names. The fourth row has a numbers in the same columns. This might be data. This appears to be the next best guess at importing the data set.

3. We will import the data set using the third row as column names and exclude the first two rows from the data set. This is done with the skip parameter of the read_csv() function.

attnd <- read_csv(attnd_path, skip = 2, col_types = cols())
Warning: Missing column names filled in: 'X1' [1], 'X3' [3], 'X5' [5],
'X7' [7], 'X9' [9], 'X11' [11], 'X13' [13], 'X15' [15], 'X17' [17]
Warning: Duplicated column names deduplicated: 'ADA as percent of
enrollment' => 'ADA as percent of enrollment_1' [10], 'Average hours in
school day' => 'Average hours in school day_1' [12], 'ADA as percent of
enrollment' => 'ADA as percent of enrollment_2' [14], 'Average hours in
school day' => 'Average hours in school day_2' [16]

There are now fewer columns without names in the read_csv() warning. There is a new warning about duplicate variable names. Even though there are now two warnings the data set is closer to being clean.

4. We again look at the beginning of the data to see if anything else needs to be addressed.

head(attnd)
# A tibble: 6 x 17
X1    ADA as percent~ X3    Average hours ~ X5    Average days i~
<chr>            <dbl> <chr>            <dbl> <chr>            <dbl>
1 1                  2   <NA>               3   <NA>                 4
2 Unit~             93.1 (0.2~              6.6 (0.0~              180
3 Alab~             93.8 (1.2~              7   (0.0~              180
4 Alas~             89.9 (1.2~              6.5 (0.0~              180
5 Ariz~             89   (2.9~              6.4 (0.0~              181
6 Arka~             91.8 (1.3~              6.9 (0.0~              179
# ... with 11 more variables: X7 <chr>, Average hours in school
#   year <dbl>, X9 <chr>, ADA as percent of enrollment_1 <dbl>,
#   X11 <chr>, Average hours in school day_1 <dbl>, X13 <chr>, ADA as
#   percent of enrollment_2 <chr>, X15 <chr>, Average hours in school
#   day_2 <chr>, X17 <chr>

The first row does not appear to be data. Fixing this issue will be addressed in the cleaning chapter. The remaining displayed rows look like they have data in them. The type for all the columns is character but some of the columns appear to be numeric. This is another issue that will be addressed in the cleaning chapter.

2.3.3.2 Cleaning - Coding missing data on import of a csv file

1. There is another issue that is not seen in the head data. The following code displays a small part of the attendance data further into the data. (The code that displays this part of the data will be explained in a later chapter.)

attnd[10:13, 12:16]
# A tibble: 4 x 5
Average hours in s~ X13    ADA as percent of~ X15   Average hours in ~
<dbl> <chr>  <chr>               <chr> <chr>
1                  6.5 (0.11) 93.7                (0.6~ 6.5
2                  6.8 (0.06) ?                   (?)   6.5
3                  6.9 (0.10) ?                   (?)   ?
4                  6.5 (0.05) 89.9                (1.4~ 6.3                

From the above display we see that "?" and "(?)" used to identify missing data. These do not seem to be valid values for any other columns.

2. We will import the data set using the na parameter of the read_csv() function. The default value for na is c("", "NA"). This means that any value that has a value of the empty character, "", or "NA" will be marked with NA, the missing object in R. The following reads the data again with "?" and "(?)" added to the na parameter. Then the same cells as above are displayed to see the results from using na.

attnd <-
attnd_path,
skip = 2,
na = c("", "NA", "?", "(?)"),
col_types = cols()
)
Warning: Missing column names filled in: 'X1' [1], 'X3' [3], 'X5' [5],
'X7' [7], 'X9' [9], 'X11' [11], 'X13' [13], 'X15' [15], 'X17' [17]
Warning: Duplicated column names deduplicated: 'ADA as percent of
enrollment' => 'ADA as percent of enrollment_1' [10], 'Average hours in
school day' => 'Average hours in school day_1' [12], 'ADA as percent of
enrollment' => 'ADA as percent of enrollment_2' [14], 'Average hours in
school day' => 'Average hours in school day_2' [16]
attnd[10:13, 12:16]
# A tibble: 4 x 5
Average hours in s~ X13    ADA as percent of~ X15   Average hours in ~
<dbl> <chr>                <dbl> <chr>               <dbl>
1                  6.5 (0.11)                93.7 (0.6~                 6.5
2                  6.8 (0.06)                NA   <NA>                  6.5
3                  6.9 (0.10)                NA   <NA>                 NA
4                  6.5 (0.05)                89.9 (1.4~                 6.3

From this display we see that several columns that previously were character columns are now both numeric columns.

This is all that will be corrected using read_csv(). The remaining issues will be address in the cleaning chapter.

2.3.4 Examples - Python

python
from pathlib import Path
import pandas as pd


The attendance data set used in this example has a number of features that occur with real world data. We will use this data file to demonstrate addressing several format issues that can occur. The remaining issues in the data set will be addressed in the cleaning chapter.

2.3.4.1 Acquisition - Importing a csv file with meta data

1. We begin by reading the data set in.

attnd_path = Path('..') / 'datasets' / 'attendance.csv'
attnd = pd.read_csv(attnd_path)
2. We next look at the beginning of the data to see what is in the first few rows.

print(attnd.head())
  Table 43. Average daily attendance (ADA) as a percentage of total enrollment, school day length, and school year length in public schools, by school level and state: 2007-08  ... Unnamed: 16
0                                                NaN                                                                                                                             ...         NaN
1                                                NaN                                                                                                                             ...         NaN
2                                                  1                                                                                                                             ...         NaN
3                             United States ........                                                                                                                             ...      (0.04)
4                          Alabama .................                                                                                                                             ...      (0.17)

[5 rows x 17 columns]

The name of the first variable looks more like a description of the data set. This is an indication that there is text prior to the data, and from the first variable it looks like meta data.

The column of ellipses, three dots, in this display indicates that the displayed data has been truncated to fit the width of the display. Here the truncation is the columns between 0 and 16 have been omitted due not enough display width. This truncation can also occur when rows are truncated. The dots after the state and country names in the first row are part of the data in the attendence data file.

Note, the data file could also be opened in a text editor, like Notepad, to examine the structure of the start of the file instead of reading the file and examining it in Python. This method can be easier and less confussing if there is a lot of non-data in the file. One common workflow is to start by importing the data set. If it works, great. If not and it is clear what needs to be fixed in the import, the file is then imported with the correction. If it is not clear what corrections need to be done, the file is then viewed with an editor of some kind (Notepad.)

It is not clear if there are column names in the file from this display. It look like data might start in row 2. (Remember that Python numbers items starting with 0.)

3. We will try starting at row 1 with column names and see if this results in a better formatted data frame. This is done with the skiprows parameter of the read_csv() function.

attnd = pd.read_csv(attnd_path, skiprows=2, header=0)
print(attnd.head())
                  Unnamed: 0  ...  Unnamed: 16
0                          1  ...          NaN
1     United States ........  ...       (0.04)
2  Alabama .................  ...       (0.17)
4  Arizona .................  ...       (0.25)

[5 rows x 17 columns]

This looks a little better. It is still hard to tell if there are column names. Working with column names will be covered in the cleaning chapter.

There are several ways to look at more of the data frame. The data frame can be opened with the viewer in RStudio. The complete head() of the data frame came be viewed in a markdown file by using the to_string() method. The scroll bar at the bottom of the html display is used to see all the columns. The to_string() method is often not as useful when viewing the results in the console window.

print(attnd.head().to_string())
                  Unnamed: 0  ADA as percent of enrollment Unnamed: 2  Average hours in school day Unnamed: 4  Average days in school year Unnamed: 6 Average hours in school year Unnamed: 8  ADA as percent of enrollment.1 Unnamed: 10  Average hours in school day.1 Unnamed: 12 ADA as percent of enrollment.2 Unnamed: 14 Average hours in school day.2 Unnamed: 16
0                          1                           2.0        NaN                          3.0        NaN                          4.0        NaN                            5        NaN                             6.0         NaN                            7.0         NaN                              8         NaN                             9         NaN
1     United States ........                          93.1     (0.22)                          6.6     (0.02)                        180.0      (0.1)                        1,193      (3.1)                            94.0      (0.27)                            6.7      (0.02)                           91.1      (0.43)                           6.6      (0.04)
2  Alabama .................                          93.8     (1.24)                          7.0     (0.07)                        180.0      (0.8)                        1,267     (12.3)                            93.8      (1.84)                            7.0      (0.08)                           94.6      (0.38)                           7.1      (0.17)
3  Alaska ..................                          89.9     (1.22)                          6.5     (0.05)                        180.0      (3.4)                        1,163     (22.9)                            91.3      (1.56)                            6.5      (0.05)                           93.2      (1.57)                           6.2      (0.15)
4  Arizona .................                          89.0     (2.95)                          6.4     (0.09)                        181.0      (1.7)                        1,159     (14.4)                            88.9      (3.91)                            6.4      (0.10)                           89.0      (3.22)                           6.4      (0.25)

2.3.4.2 Cleaning - Coding missing data on import of a csv file

1. There is another issue that is not seen in the head data. The following code displays a small part of the attendance data further into the data. (The code that displays this part of the data will be explained in a later chapter.)

print(attnd.iloc[9:13, 13:15])
   ADA as percent of enrollment.2 Unnamed: 14
9                            93.7      (0.68)
10                              ?         (?)
11                              ?         (?)
12                           89.9      (1.43)

From the above display we see that "?" and "(?)" are used to identify missing data. These do not seem to be valid values for any other columns.

2. The na_values parameter of the read_csv() function is used to correct the ? and (?). The default value for na_values is [""]. This means that any value that has a value of the empty character, "" will be marked as the missing object in Python. The following reads the data again with "?" and "(?)" added to the na_values parameter. Then the same cells as above are displayed to see the results from using na.

attnd = pd.read_csv(attnd_path,
skiprows=2,
keep_default_na=True,
na_values=['?', '(?)'])
print(attnd.iloc[9:13, 13:15])
    ADA as percent of enrollment.2 Unnamed: 14
9                             93.7      (0.68)
10                             NaN         NaN
11                             NaN         NaN
12                            89.9      (1.43)

The "?" and "(?)" have been changed to NaN, the Python object for not a number.

3. We will now see what type was used for each of the columns of the data frame.

print(attnd.dtypes)
Unnamed: 0                         object
ADA as percent of enrollment      float64
Unnamed: 2                         object
Average hours in school day       float64
Unnamed: 4                         object
Average days in school year       float64
Unnamed: 6                         object
Average hours in school year       object
Unnamed: 8                         object
ADA as percent of enrollment.1    float64
Unnamed: 10                        object
Average hours in school day.1     float64
Unnamed: 12                        object
ADA as percent of enrollment.2    float64
Unnamed: 14                        object
Average hours in school day.2     float64
Unnamed: 16                        object
dtype: object

From this display we see that there are both numeric columns and columns that contain text (type of object). We can also see that some of the variable have names and others names are missing.

This is all that will be corrected using read_csv(). The remaining will be address in the cleaning chapter.

4. Files that produce errors.

Some csv files will produce errors from read_csv(). When this happens you do not get to see what pandas was trying to read. There are two approaches to using panadas to get a look at these kinds of files.

You can set the number of columns you would like it to use for the data. This is done using the names parameter. You would typically pick a few columns and given them simple names, as is done below. The read_csv() function will only import the number of columns given in the names parameter.

attnd = (
attnd_path,
names=["a", "b", "c", "d"],
nrows=25))
print(attnd.head())
                                                                                                                                                                                                                                                                                                                        a  ...       d
Table 43. Average daily attendance (ADA) as a p... NaN                                                NaN    NaN                         NaN    NaN                         NaN   NaN                          NaN   NaN                          NaN    NaN                         NaN                              NaN  ...     NaN
NaN                                                Total elementary, secondary, and combined eleme... NaN    NaN                         NaN    NaN                         NaN   NaN                          NaN   Elementary schools           NaN    NaN                         NaN                Secondary schools  ...     NaN
ADA as percent of enrollment                       NaN    Average hours in school day NaN    Average days in school year NaN   Average hours in school year NaN   ADA as percent of enrollment NaN    Average hours in school day NaN     ADA as percent of enrollment  ...     NaN
1                                                  2                                                  NaN    3                           NaN    4                           NaN   5                            NaN   6                            NaN    7                           NaN                                8  ...     NaN
United States ........                          93.1                                               (0.22) 6.6                         (0.02) 180                         (0.1) 1,193                        (3.1) 94.0                         (0.27) 6.7                         (0.02)                          91.1  ...  (0.04)

[5 rows x 4 columns]

Note, the nrows parameter is used here to only read the first 25 lines of the file here.

Another approach is to read each row in as one column. This is done by setting the sep parameter to ^.

attnd = (
attnd_path,
sep='^',
nrows=15))
print(attnd.head())
  Table 43. Average daily attendance (ADA) as a percentage of total enrollment, school day length, and school year length in public schools, by school level and state: 2007-08,,,,,,,,,,,,,,,,
0  ,"Total elementary, secondary, and combined el...
1  ,ADA as percent of enrollment,,Average hours i...
2                          1,2,,3,,4,,5,,6,,7,,8,,9,
3     United States ........,93.1,(0.22),6.6,(0.0...
4  Alabama .................,93.8,(1.24),7.0,(0.0...                                                                                                                                           

2.3.5 Exercises

1. Import the "amis.csv" data set.

2. Are there any rows that need to be ignored in the amis data set? If so, modify your import to account for them.

3. Are there any special symbols that need to be set to missing in the amis data set? If so, modify your import to account for them.

4. Import the "mifem.csv" data set.

5. Is there any meta data at the top or bottom of the mifem data set? You will need to determine how to view the bottom of a data set. If so, modify your import to account for them.

6. Are there any special symbols that need to be set to missing in the mifem data set? If so, modify your import to account for them.