2.4 Importing csv files and parsers

SSCC - Social Science Computing Cooperative

Supporting Statistical Analysis for Research

The tidyverse function to read a csv file is read_csv(). The following are a few important parameters of read_csv().

file, the path to the file to be imported.
col_names, setting this to FALSE indicates the first row does not contain variable names.
col_types, setting this to cols() uses guessed types for the columns. Alternatively, the parameters of cols() can be used to define the type of each column.
na, a character vector of strings that indicate missing data.
guess_max, specifies the number of rows to consider before making a guess of what type the columns are. The default value of 1000 works well on most csv files.
skip, number of lines at the front of the file to be ignored. This is used when a csv file contains metadata at the beginning of the file.
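
The parameters above can be sketched together in one call. This is only an illustration: the file contents, the column names (id, score, group), and the "missing" marker are made up for the example and are not part of the course datasets.

```r
library(readr)

# Write a small hypothetical csv file with one metadata line
# in front of the header row.
tmp <- tempfile(fileext = ".csv")
writeLines(
  c("source: made-up example data",
    "id,score,group",
    "1,4.5,a",
    "2,missing,b",
    "3,7.25,a"),
  tmp
)

scores <- read_csv(
  tmp,
  skip = 1,                     # ignore the metadata line
  col_names = TRUE,             # the first remaining row holds the names
  col_types = cols(             # define each column type explicitly
    id = col_integer(),
    score = col_double(),
    group = col_character()
  ),
  na = c("", "NA", "missing")   # strings treated as missing data
)

scores
```

Because every column type is given in cols(), no guessing is needed here and guess_max is left at its default.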

The read_*() functions of the tidyverse use a common set of parsers. These parsers convert text into typed values such as numbers, factors, dates, and times. The parsers can also be called directly to parse a column. The parse_factor() function will be demonstrated in the Modifying variables section below.
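
As a minimal sketch of calling parsers directly, the following applies three readr parsers to character vectors. The input values are invented for illustration.

```r
library(readr)

# Each parser takes a character vector and returns a typed vector.
ages   <- parse_integer(c("45", "21", "NA"))       # "NA" becomes a missing value
income <- parse_double(c("21516.67", "3175.971"))
dates  <- parse_date(c("2020-03-15", "2021-01-02"), format = "%Y-%m-%d")

class(ages)
class(income)
class(dates)
```

Values that a parser cannot convert are returned as NA with a warning, which makes parsing problems visible rather than silent.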

2.4.1 Examples

Importing a csv file

cps_in <- read_csv(file.path("..", "datasets", "cps1.csv"), col_types = cols())

Warning: Missing column names filled in: 'X1' [1]

Note that one of the columns did not have a name, so read_csv() gave it the name X1.

The head() function returns the first few values of an object. Here it returns the first three rows.

head(cps_in, 3)

# A tibble: 3 x 11
     X1   trt   age  educ black  hisp  marr nodeg   re74   re75   re78
  <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>  <dbl>  <dbl>  <dbl>
1     1     0    45    11     0     0     1     1 21517. 25244. 25565.
2     2     0    21    14     0     0     0     0  3176.  5853. 13496.
3     3     0    38    12     0     0     1     0 23039. 25131. 25565.

The glimpse() function displays the column types and the first few values of each column of a data frame.

glimpse(cps_in)

Observations: 15,992
Variables: 11
$ X1    <dbl> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 1...
$ trt   <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,...
$ age   <dbl> 45, 21, 38, 48, 18, 22, 48, 18, 48, 45, 34, 16, 53, 19, ...
$ educ  <dbl> 11, 14, 12, 6, 8, 11, 10, 11, 9, 12, 14, 10, 10, 12, 12,...
$ black <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,...
$ hisp  <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,...
$ marr  <dbl> 1, 0, 1, 1, 1, 1, 1, 0, 1, 1, 1, 0, 1, 0, 1, 1, 1, 1, 0,...
$ nodeg <dbl> 1, 0, 0, 1, 1, 1, 1, 1, 1, 0, 0, 1, 1, 0, 0, 1, 0, 1, 0,...
$ re74  <dbl> 21516.6700, 3175.9710, 23039.0200, 24994.3700, 1669.2950...
$ re75  <dbl> 25243.550, 5852.565, 25130.760, 25243.550, 10727.610, 18...
$ re78  <dbl> 25564.670, 13496.080, 25564.670, 25564.670, 9860.869, 25...