SSCC - Social Science Computing Cooperative Supporting Statistical Analysis for Research

2.4 Importing csv files and parsers

The tidyverse function to read a csv file is read_csv(). The following are a few important parameters of read_csv().

  • file, the path to the file to be imported.

  • col_names, setting this to FALSE indicates the first row does not contain variable names.

  • col_types, setting this to cols() uses guessed types for the columns. Alternatively, the parameters of cols() can be used to define the type of each column.

  • na, a character vector of strings that indicate missing data.

  • guess_max, specifies the maximum number of rows to consider before guessing the type of each column. The default value of 1000 works well on most csv files.

  • skip, number of lines at the front of the file to be ignored. This is used when a csv file contains metadata at the beginning of the file.
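The parameters above can be combined in a single call. The following is a minimal sketch, not taken from the course datasets: the temporary file, its contents, and the column names id, age, and re74 are made up for illustration, and -99 is assumed to be the code for missing data.

```r
library(readr)

# Hypothetical file: two metadata lines, no header row,
# and both "" and "-99" standing for missing values.
tmp <- tempfile(fileext = ".csv")
writeLines(
  c("# source: example",
    "# created for illustration",
    "1,45,-99",
    "2,21,3176.0",
    "3,,23039.0"),
  tmp
)

example_in <- read_csv(
  file = tmp,
  skip = 2,                               # ignore the two metadata lines
  col_names = c("id", "age", "re74"),     # the file has no header row
  col_types = cols(
    id = col_integer(),
    age = col_double(),
    re74 = col_double()
  ),
  na = c("", "-99"),                      # strings treated as missing
  guess_max = 1000                        # default; unused here since all types are given
)
```

Because every column type is given in cols(), no guessing is needed and guess_max has no effect in this call. It is shown only to indicate where the parameter goes.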

The read_*() functions of the tidyverse use a common set of parsers. These parsers convert text into data types such as numeric, factor, date, and time. The parsers can also be called directly to parse a column. The parse_factor() function will be demonstrated in the Modifying variables section below.
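As a brief sketch of calling parsers directly, the following applies three readr parsers to character vectors. The input values are made up for illustration and are not from the course datasets.

```r
library(readr)

# Parse character strings as integers.
ages <- parse_integer(c("45", "21", "38"))

# parse_number() drops non-numeric characters such as "$"
# and the "," grouping mark before converting.
earnings <- parse_number(c("$21,517", "3,176", "23039"))

# Parse character strings as dates using an explicit format.
dates <- parse_date(c("1978-01-15", "1978-12-31"), format = "%Y-%m-%d")
```

Each parser returns a vector of the target type, so the result can be assigned back to a column of a data frame.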

2.4.1 Examples

  1. Importing a csv file

    cps_in <- read_csv(file.path("..", "datasets", "cps1.csv"), col_types = cols())
    Warning: Missing column names filled in: 'X1' [1]

    Note, one of the columns did not have a name and the read_csv() function gave it a name.

  2. The head() function returns the first few elements of an object, here the first three rows.

    head(cps_in, 3)
    # A tibble: 3 x 11
         X1   trt   age  educ black  hisp  marr nodeg   re74   re75   re78
      <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>  <dbl>  <dbl>  <dbl>
    1     1     0    45    11     0     0     1     1 21517. 25244. 25565.
    2     2     0    21    14     0     0     0     0  3176.  5853. 13496.
    3     3     0    38    12     0     0     1     0 23039. 25131. 25565.

  3. The glimpse() function displays the column types and the first few values of each column of a data frame.

    glimpse(cps_in)
    Observations: 15,992
    Variables: 11
    $ X1    <dbl> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 1...
    $ trt   <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,...
    $ age   <dbl> 45, 21, 38, 48, 18, 22, 48, 18, 48, 45, 34, 16, 53, 19, ...
    $ educ  <dbl> 11, 14, 12, 6, 8, 11, 10, 11, 9, 12, 14, 10, 10, 12, 12,...
    $ black <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,...
    $ hisp  <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,...
    $ marr  <dbl> 1, 0, 1, 1, 1, 1, 1, 0, 1, 1, 1, 0, 1, 0, 1, 1, 1, 1, 0,...
    $ nodeg <dbl> 1, 0, 0, 1, 1, 1, 1, 1, 1, 0, 0, 1, 1, 0, 0, 1, 0, 1, 0,...
    $ re74  <dbl> 21516.6700, 3175.9710, 23039.0200, 24994.3700, 1669.2950...
    $ re75  <dbl> 25243.550, 5852.565, 25130.760, 25243.550, 10727.610, 18...
    $ re78  <dbl> 25564.670, 13496.080, 25564.670, 25564.670, 9860.869, 25...
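
The unnamed first column noted in example 1 can be given a name at import time. The following is one possible approach, sketched on a small made-up file rather than on cps1.csv. The column name id is an assumption chosen for illustration.

```r
library(readr)

# Hypothetical file whose header row has no name for the first column.
tmp <- tempfile(fileext = ".csv")
writeLines(
  c(",trt,age",
    "1,0,45",
    "2,0,21"),
  tmp
)

# Skip the incomplete header row and supply all column names directly.
small_in <- read_csv(
  tmp,
  skip = 1,
  col_names = c("id", "trt", "age"),
  col_types = cols()
)
```

Supplying col_names this way avoids relying on the automatically generated name, which can differ between versions of readr.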